In an earlier post I explored the value of using scalable machine learning to extract value from huge amounts of data. In this post, I will dive down into the technical side of things, particularly the challenges and benefits that come with making algorithms scalable on large clusters of computers.
Machine learning algorithms are written to run on single-node systems, or on specialized supercomputer hardware, which I’ll refer to as HPC boxes. They grew up in a world where they didn’t have to scale across multiple nodes. It’s relatively easy to get high performance when running algorithms on a single computer. With distributed computing, things get a great deal harder for some algorithms due to the communications latencies among what could be thousands of server nodes.
So why not run your machine learning algorithms on specialized hardware? One reason: HPC boxes have their own challenges, one of which is that they run out of capacity. However, with a scalable architecture for big data analytics, you can add nodes to scale out as your capacity needs grow. Furthermore, the total cost of ownership for dedicated, custom hardware can be much higher than scale-out architectures built with commodity servers in private or public clouds.
Today, the trend is to run machine learning algorithms in scale-out clusters like those used for big data applications in cloud data centers. This shift to scalable infrastructure brings the added advantage of integrating your machine learning into the same environment as your other analytics workloads.
With scalable machine learning, algorithms might run in a cluster of thousands of servers, with each node potentially analyzing a different dataset. The distributed nature of the environment requires models to synchronize between nodes as they go, so each knows what the other is doing. Big data systems usually divide huge datasets into chunks, one per computer in a cluster. To build a model for all the data, each computer calculates an answer for the data it has, and then all the results are combined to create a model that is informed by the data from all of the nodes in the cluster. Some machine learning algorithms are relatively easy to scale this way, because there is a well-defined arithmetic for combining the individual results into a global result. For example, the average of an entire data set is the weighted average of the averages from each of the computers. Simple.
In machine learning, the algorithms are more complex, so they can’t be aggregated in such a simple way; rather, a clever algorithmic approach is needed. For example, to compute the Cox Proportional Hazards model at scale requires ordering the data in a very specific way, and then computing and keeping track of a number of subparts of the computation from each node, so that they can be efficiently combined to create an accurate aggregate result. This algorithm is important for many quality assurance and research applications, so making it scalable has definite utility to industry.
Algorithms written for an earlier era are unaware of the communications overhead brought by constant server-to-server communications. As a result, they tend to communicate much more information than they need to communicate as they work to keep models in sync. That adds latency throughout the cluster. To get high performance with scalable machine learning you have to optimize algorithms to run models more efficiently in a distributed computing environment. Algorithms have to become much more cluster aware, so they communicate only essential information between nodes. They have to be parsimonious about the information they transfer over the network, to eliminate any unnecessary communications between the layers of the network.
The scalability of algorithms becomes even more challenging when you move to deep learning, which is a subset of machine learning that can automatically identify important attributes of huge amounts of complex data and then use these attributes to perform many valuable tasks, such as image recognition and speech-to-text conversion.
Deep learning systems have a hierarchy of compute levels or layers, each of which builds on the other. For example, to recognize specific objects in images, deep learning systems can be trained to identify complex patterns and important differences in many millions of images. Some layers in the system recognize these features and others combine them in successively complex ways, culminating in very sophisticated pattern recognition capabilities. Because successive layers build upon the outputs of earlier layers, there are many points of communication within a deep learning network.
To reduce latency, the algorithms running in scalable deep learning systems need to be optimized to avoid inefficiencies in communications between the layers of the system. This optimization becomes all the more important when you consider that we appear to be on the cusp of an explosion in the use of scalable deep learning. Indeed, we are seeing rapid progress in the evolution of deep learning algorithms in scalable computing environments as experts discover and implement new deep learning network topologies and new methods for efficiently spreading the computational burden of training deep networks.
To wrap things up, let’s look at the big picture. Data analytics, including machine learning, is becoming mission critical to nearly all organizations, from retail to manufacturing to healthcare. The availability, convenience, and cost-effectiveness of scale-out computing resources has allowed the consumers of data analytics to ask increasingly complex questions from larger and larger data sets. To answer many of these questions, users require machine learning.
The ubiquitous use of machine learning at scale is driving new advances in scalable machine learning algorithms that make them much more practical to run on enormous data sets. We have seen significant advances in the efficiencies of traditional machine learning algorithms on scalable systems.
The last frontier has been the optimization of deep learning algorithms for scalable computing environments, and we are just now seeing significant advances in this area. To make the most of the opportunity, we must focus on the optimization of the algorithms, to ensure that they don’t cause unnecessary drag on performance and can co-exist with all of our data analytics infrastructures.
Bob Rogers is the Chief Data Scientist for Big Data Solutions at Intel.
©2016 Dell Inc. All rights reserved. Dell, the DELL logo, the DELL badge and PowerEdge are trademarks of Dell Inc. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary interest in the marks and names of others.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
 Thanks to Anahita Bhiwandiwalla, Ph.D., for providing this example.