In today’s digitally driven world, enterprises need to find ways to extract valuable insights from huge amounts of data. That data is both generated within the organization and captured from external sources, ranging from applications and social media to the networks of sensors that constitute the Internet of Things (IoT).
To make sense of these data assets, enterprises need automated tools that continually analyze data and generate information and insights that business leaders can use to keep the organization competitive. That’s the idea behind machine learning.
Machine learning is a branch of artificial intelligence (AI) that involves building systems that learn iteratively from data, identify patterns, and predict future results—all with minimal human intervention. Machine learning is often used in data science, alongside tools such as statistical modeling, data mining, information retrieval, and natural language processing.
Machine learning can do things that would be virtually impossible to do with manual methods. A machine learning classifier can sort through millions of pieces of information, or “features,” to tell you how to identify an event or object. For example, is a series of actions on a computer network from a legitimate user, or is it a cyberattack? A machine learning clustering algorithm can tell you how to organize things into meaningful groups. Machine learning algorithms can also find unusual patterns in data and predict the next data point in a time series.
Today, you will often see the word “scalable” used before “machine learning.” Scalable machine learning, also known as distributed machine learning, refers to algorithms and infrastructure that scale out to capture insights from huge amounts of data.
Why is this important? Let’s take an example from the healthcare industry. A health system’s electronic health record (EHR) contains detailed data on patients; this data can be mined to predict which patients are at risk for undesirable health outcomes. In years past, I was part of a team that worked with healthcare organizations to do just this.
In one example, the goal was to analyze the data for heart failure patients, who are some of the most fragile and difficult-to-manage patients in the health system. The first step involved identifying a pool of patients to analyze, so we selected patients with heart failure as a diagnosis in their EHR problem list. To our dismay, when we investigated these patients further, we found that 30% of them did not have heart failure. For these patients, the heart failure diagnosis was a false positive.
How do you identify real heart failure patients if nearly one-third of their diagnosis codes are false positives? Build a machine learning classifier! We used text data from millions of doctors’ notes to train a machine learning classifier to pick out heart failure patients based on everything in their charts, not just their diagnosis codes. An interesting fact about this classifier: it did not diagnose heart failure. Instead, it identified the patients whose charts had the characteristics of a heart failure patient.
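To make the idea of classifying charts from their text more concrete, here is a minimal sketch of a bag-of-words naive Bayes classifier in plain Python. It is not the model our team built. The technique, the four toy notes, the labels `hf` and `not_hf`, and every function name are assumptions chosen purely for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the note and keep only alphabetic word tokens."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()

def train_naive_bayes(notes, labels):
    """Count word occurrences per class and collect the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for text, label in zip(notes, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts, "class_counts": class_counts, "vocab": vocab}

def predict(model, text):
    """Return the class with the highest log posterior (add-one smoothing)."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for tok in tokenize(text):
            if tok in model["vocab"]:  # words never seen in training are skipped
                score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, invented notes (hypothetical data, not real patient records).
notes = [
    "patient reports dyspnea and leg edema, reduced ejection fraction, started furosemide",
    "orthopnea and elevated BNP, ejection fraction 30 percent, edema noted",
    "routine visit for seasonal allergies, no edema, normal heart sounds",
    "ankle sprain after fall, heart failure ruled out, normal ejection fraction",
]
labels = ["hf", "hf", "not_hf", "not_hf"]
model = train_naive_bayes(notes, labels)

label = predict(model, "worsening dyspnea and edema, started furosemide, elevated BNP")
```

The sketch scores a note on all of its words rather than on a single code, which is the spirit of classifying on everything in the chart. A production system would need far more data, richer features, and careful validation.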
It turned out that diagnosis codes in EHRs are fraught with accuracy problems, so we developed machine learning tools to automate the detection of both false positives and missing codes in patient charts. Then we created a workflow that presented medical coders with the conclusions of the algorithms, along with the evidence that these conclusions were based on, so they could validate each finding. Whenever a coder disagreed, we could use this new information to retrain the models and improve their accuracy.
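The review-and-retrain loop described above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not the workflow we deployed. The `Finding` structure, the `review_findings` function, the label names, and the `retrain` callback are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """One model conclusion presented to a medical coder (hypothetical shape)."""
    chart_id: str
    note_text: str
    model_label: str  # what the algorithm concluded, e.g. "hf" or "not_hf"
    evidence: list    # phrases from the chart shown to the coder as support

def review_findings(findings, coder_labels, training_examples, retrain):
    """Fold coder disagreements back into the training data, then retrain.

    coder_labels maps chart_id to the label the coder decided on.
    training_examples is a list of (note_text, label) pairs, extended in place.
    retrain is any callable that rebuilds the model from training_examples.
    Returns the number of disagreements found.
    """
    corrections = [
        (f.note_text, coder_labels[f.chart_id])
        for f in findings
        if f.chart_id in coder_labels and coder_labels[f.chart_id] != f.model_label
    ]
    if corrections:
        training_examples.extend(corrections)
        retrain(training_examples)
    return len(corrections)

# Toy usage with invented charts; the lambda stands in for real model training.
findings = [
    Finding("c1", "edema, low ejection fraction", "hf", ["edema"]),
    Finding("c2", "heart failure ruled out", "hf", ["heart failure"]),
]
examples = []
retrain_calls = []
n_disagreements = review_findings(
    findings,
    {"c1": "hf", "c2": "not_hf"},
    examples,
    lambda data: retrain_calls.append(len(data)),
)
```

Only the chart where the coder overrode the model is added as a new training example, so each disagreement becomes a labeled correction the next version of the model can learn from.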
Here’s where this story gets really exciting to me. These algorithms, each of which was refined in the course of examining hundreds or thousands of records, were then applied to millions of patient records to improve outcomes with better data. The resulting information could literally save lives, and the hospital could never have gotten there by trying to manually examine the records. In this case, machine learning allowed data scientists to leverage a relatively small number of examples to gain insights into millions of patient records.
For a closer look at the life-saving power of big data analytics in healthcare, you can read a recent Intel case study that covers the use of machine learning at Penn Medicine. Among other insights, smart algorithms helped Penn Medicine determine that 20% to 30% of heart failure patients had not been properly identified in the EHR as heart failure patients. Identifying these vulnerable patients before they were discharged from the hospital made it easier to schedule additional follow-up actions.
This, of course, is just one of countless examples of the potential insights big data analytics and machine learning can provide. In a subsequent post I will take up an associated technical challenge: optimizing machine learning algorithms to run faster and more efficiently in a distributed computing environment.
Bob Rogers is the Chief Data Scientist for Big Data Solutions at Intel.
©2016 Dell Inc. All rights reserved. Dell, the DELL logo, the DELL badge and PowerEdge are trademarks of Dell Inc. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary interest in the marks and names of others.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.