Practical Infrastructure Convergence: Building a System that Does It All

Recent changes in system architecture are enabling organizations to seamlessly use a single system for all their needs.

In a previous blog, I presented seven trends that are the drivers for the convergence of high performance computing (HPC) and artificial intelligence (AI) workloads. In this post, I would like to present a more practical aspect of this convergence and demonstrate how changes in system architecture, including the software, are enabling the next generation of users to seamlessly use a single system for all their needs.

To this end, Intel, the maker of Intel® Xeon® processors, has been actively working on management software that makes it considerably easier to build and use a system that does it all: Intel Magpie Select Solution for HPC & AI workloads. More on this in a bit. First, let’s understand why there is a need for something like Magpie in the first place.

Differences in HPC, data analytics and AI systems architectures

Figure 1: A simplified HPC cluster. (Image: Dell EMC)

A simplified HPC cluster is shown in Figure 1. We often joke that no two HPC clusters are alike. (The needs of no two HPC users are ever alike, either.) However, they often share several components.

The main portion of the cluster consists of the compute nodes. These nodes are often identical, although some may be notably different in that they have more CPU sockets and memory (fat nodes) or carry FPGAs or GPUs (accelerator nodes). Regardless of these differences, however, all compute nodes are connected to one another (and to storage) using a high-speed network, or interconnect, such as Intel Omni-Path Architecture (Intel OPA).

Many HPC workloads are essentially large jobs split into smaller chunks that map onto individual compute nodes. Because of this, there is a great deal of cross-communication between the compute nodes, and that traffic is why a dedicated high-speed interconnect is needed.
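To make this concrete, here is a minimal sketch of such a split-and-combine job using MPI through the mpi4py Python bindings. The bindings, the array size and the reduction are illustrative choices on my part, not something prescribed by any of the products discussed here; the point is simply that each rank computes on its own chunk and the ranks then talk to each other over the interconnect.

```python
# Minimal sketch: a "large job" split into per-rank chunks that must communicate.
# mpi4py and the toy problem are illustrative assumptions, not part of any product.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID within the job
size = comm.Get_size()   # total number of processes across all compute nodes

# Each rank takes a strided slice of a (hypothetical) 100-million-element problem.
n_total = 100_000_000
chunk = np.arange(rank, n_total, size, dtype=np.float64)
local_sum = chunk.sum()

# Cross-node communication: every rank contributes to a global reduction.
# On a real cluster this traffic rides the high-speed interconnect (e.g. Intel OPA).
global_sum = comm.allreduce(local_sum, op=MPI.SUM)

if rank == 0:
    print(f"global sum computed by {size} ranks: {global_sum:.3e}")
```

A job like this would typically be launched across many nodes with mpirun or srun, and the allreduce traffic is exactly the kind of node-to-node chatter the dedicated interconnect is there to absorb.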

A separate, slower management network is also present, which is used by special management nodes to assess the health and resources of each compute node. Cluster management software like Bright Cluster Manager or OpenHPC uses this management network, via the dedicated management nodes, to provision the hardware, among other functions.

Another defining feature of an HPC cluster is high-speed storage. HPC workloads are such that compute nodes generate a lot of transient files (think scratch), and these files may be read by one or all of the nodes simultaneously. A conventional filesystem is simply not designed to handle this kind of data volume and can become a bottleneck.

Most HPC systems employ a parallel filesystem, with Lustre and GPFS being the two most popular. These parallel filesystems share the same high-speed interconnect with the compute nodes and thus form a very efficient way of sharing data between nodes. Usually, a much larger but slower persistent storage tier based on NFS (or another filesystem, such as Isilon OneFS) sits behind the parallel filesystem.
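As a small, hedged illustration of why this matters, the sketch below has every MPI rank write its own transient file to a parallel-filesystem mount at the same time. The /scratch path and the use of mpi4py are assumptions made purely for the example.

```python
# Sketch: many ranks writing transient scratch files concurrently.
# "/scratch/my_job" stands in for a hypothetical Lustre/GPFS mount point.
import os
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

scratch_dir = "/scratch/my_job"
if rank == 0:
    os.makedirs(scratch_dir, exist_ok=True)   # one rank creates the directory
comm.Barrier()                                # everyone waits until it exists

# Each rank dumps its own transient checkpoint; any node can read it back later
# because the parallel filesystem is visible cluster-wide.
data = np.random.rand(1_000_000)
np.save(os.path.join(scratch_dir, f"checkpoint_rank{rank:04d}.npy"), data)

comm.Barrier()
if rank == 0:
    print("all ranks finished writing their scratch files")
```

With hundreds or thousands of ranks doing this simultaneously, an ordinary NFS share would struggle; a Lustre or GPFS scratch tier is built to absorb exactly this pattern.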

In any HPC system, there are two major software components: cluster management software (mentioned above) and job scheduling software. Most HPC workloads are run as batch jobs (although some, such as visualization, are run interactively). The function of the scheduler is to match compute resources on the cluster to the requirements of each job and to keep cluster utilization at a maximum.

The simplest example would be when the job requires all the compute nodes on a cluster, and this does happen quite frequently. However, more often, you have smaller jobs that only require a fraction of the compute resources, and this is where the job scheduler comes in. It mixes and matches jobs of various sizes and tries to run as many jobs simultaneously as possible, so compute resources aren’t wasted. Modern schedulers are smart enough to also match an AI job to a node that has an accelerator.

In smaller systems, the scheduler can run on the management node. However, most larger systems have separate login nodes where users can log in and push their jobs onto the scheduler queue.
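To ground this, here is a hedged sketch of how a user (or a workflow tool) might build and submit a batch job to SLURM. The partition name, node counts and solver binary are hypothetical, but the #SBATCH resource requests are the kind of information the scheduler uses to mix and match jobs.

```python
# Sketch: programmatically building and submitting a SLURM batch job.
# Partition name, resource sizes and the solver binary are illustrative only.
import subprocess
from pathlib import Path

job_script = """#!/bin/bash
#SBATCH --job-name=cfd_run
#SBATCH --nodes=8                # a fraction of the cluster; the scheduler packs jobs
#SBATCH --ntasks-per-node=48     # MPI ranks per compute node
#SBATCH --time=04:00:00
#SBATCH --partition=standard     # hypothetical partition name

srun ./solver --input case.dat   # hypothetical application binary
"""

Path("job.sbatch").write_text(job_script)

# sbatch places the job in the scheduler queue; it runs when resources free up.
result = subprocess.run(["sbatch", "job.sbatch"],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())     # e.g. "Submitted batch job 123456"
```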

Over the years, HPC job schedulers have evolved alongside hardware architectures, which is why the components described above have come to be expected in an HPC system. The Simple Linux Utility for Resource Management, or SLURM, is a popular open-source tool with which most HPC system administrators are familiar, although several other open- and closed-source resource managers are equally popular in the HPC world.

Now, let’s see what a typical data analytics (DA) cluster looks like.

Figure 2: A typical Hadoop cluster. (Image: Dell EMC)

Figure 2 shows a typical Hadoop cluster, which is popular for all kinds of machine learning (ML) workloads. The basis for the difference in hardware architectures between HPC clusters and data analytics clusters is that the former is compute-centric, while the latter is data-centric. This simple difference has given rise to a disparity in how the respective architectures and types of software have evolved over time.

There are some superficial similarities with an HPC cluster. For example, there is a management node with a dedicated management network, and there are multiple worker nodes, perhaps even connected by a high-speed interconnect. However, this is where the similarity ends. There is no parallel filesystem, nor any need for one, since these workloads are inherently different from HPC workloads.

The worker nodes in an analytics cluster are known as "data nodes." Usually, data nodes are connected using an Ethernet network, although some of the more powerful data analytics systems do use a high-speed interconnect such as Mellanox InfiniBand. Data is stored locally on each node, and all incoming data is replicated and distributed across the various data nodes in the cluster.

A separate, dedicated "name node" keeps track of which piece of data is located on which data node. Jobs here, too, are run in batch. Data analytics also utilizes a scheduler/resource manager that serves a similar function to its HPC counterpart, i.e., keeping system utilization at its highest. It should come as no surprise that, given the differences in architecture, and more importantly the difference in provenance of data analytics systems compared to their HPC brethren, a different type of scheduler is used. Yet Another Resource Negotiator, or YARN, is by far the most popular choice in such clusters and is fundamentally different from, say, SLURM.
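For contrast with the HPC examples above, here is a hedged sketch of a small data-parallel job as it might run on the Hadoop cluster in Figure 2, using PySpark on YARN. The HDFS path is hypothetical.

```python
# Sketch: a data-parallel word count running under YARN rather than SLURM.
# The HDFS path is a placeholder; PySpark is one of many possible front ends.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("word-count-sketch")
    .master("yarn")              # YARN, not SLURM, schedules the executors
    .getOrCreate()
)

# The input is already replicated across data nodes; executors are scheduled
# close to the blocks they read, which is why no parallel filesystem is needed.
lines = spark.sparkContext.textFile("hdfs:///data/logs/*.txt")

counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, n)

spark.stop()
```

Here YARN places the executors and HDFS provides data locality; nothing in the workflow ever touches a parallel filesystem or an MPI launcher.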

Now, if you ever wondered why mixing HPC and data analytics/ML workloads has been such a pain in practice, look no further: it's due to the difference in workload managers (among other things, of course)! It's essentially like asking a native English speaker to work fluently in, say, Japanese, or vice versa. YARN and SLURM simply don't get along. Well, that is, until now. But before we get to the real solution, let's look at a few of the options folks have been using.

Approaches to a Converged System

We live in a world of limited resources, where IT budgets are shrinking. So, imagine you’re an IT manager faced with demands from both HPC and data analytics camps, with architectures, software infrastructure and workflows that have as much similarity as chalk and cheese. When all this first started out, IT managers had no choice but to have dedicated infrastructure for each, as shown in Figure 3.

Figure 3: Dedicated but underutilized clusters. (Image: Dell EMC)

From an IT perspective, this met the needs of users, but it was an inefficient approach at best, especially when you consider the total cost of ownership of both systems. Not only is each system running at less than full capacity, but the system administrators are also managing two separate systems.

The better approach would be a single system that could run both types of workloads efficiently, such that each user is presented with the interface and experience they are used to. Another benefit of a unified system is that utilization can be significantly improved: an intelligent job scheduler can mix and match various types of HPC and DA jobs and keep system idle time to a minimum.

There are other benefits to a unified approach as well. For example, a complex workflow with both HPC and ML stages no longer needs to transfer massive quantities of data between separate systems, which saves time. In addition, if IT did have the budget to build two separate dedicated systems, it could now build a single system that is twice the size instead. Users could run bigger jobs and have more capacity, and IT admins would only have to deal with one system. It's a win-win no matter how you look at it. Now, if only it were simple to do this in reality!

Enter Intel’s Magpie

The challenges of creating a unified system were not lost on the user community. Over the past few years, several attempts have been made at a simple-to-manage system. However, a lack of adherence to standards, or a failure to incorporate best practices from both the HPC and DA/ML camps, created tools that left a lot to be desired.

Intel's Magpie takes a fresh approach and applies those lessons to a reference design that not only includes all the details at the hardware level, but also includes software frameworks that work at a practical level. One additional advantage is that folks embarking on deep learning can put off purchasing accelerators, because the current generation of Xeon processors is more than capable of running most deep neural network training applications, especially on a scale-out system.

The Intel Magpie frameworks provide everything needed to serve users from the most basic to the most advanced. Intel's solution brief supplies details on the recommended hardware, from processors to interconnects and fast SSD storage, as well as recommendations on management networks. The recommended Intel Xeon Gold 6126 and 6226 processors include optimizations that benefit both HPC workloads (Intel AVX-512) and machine learning workloads (CNNs and DNNs benefit from Intel DL Boost). However, as impressive as the hardware is, the integrated software is really where this solution stands out.

This software includes Magpie running on top of the SLURM batch scheduler. As open-source software, Magpie is less intrusive to the production software stack than its closed-source counterparts, and it supports multiple resource managers. Additional software includes the Linux operating system, Intel Cluster Checker, OpenHPC, Intel OPA software, Intel Parallel Studio XE 2019 Cluster Edition, Apache Spark, TensorFlow and Horovod. And all this is just scratching the surface of what's possible.
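To give a feel for the AI side of that stack, below is a hedged sketch of data-parallel neural network training with TensorFlow and Horovod, two of the packages listed above. The model and data are placeholders, and on a Magpie-managed cluster a job like this would be submitted through the same SLURM queue as any HPC job.

```python
# Sketch: CPU-based, data-parallel DNN training with TensorFlow + Horovod.
# The model and the random data are placeholders, not a recommended recipe.
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                   # one Horovod process per rank

# Toy stand-in data; a real job would shard its dataset by hvd.rank().
x = np.random.rand(1024, 32).astype("float32")
y = np.random.randint(0, 10, size=(1024,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate with the number of workers, then let Horovod
# average gradients across ranks over the interconnect.
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Keep all workers' weights in sync from the start; only rank 0 prints logs.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x, y, batch_size=64, epochs=2,
          callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```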

In short, the Intel Magpie solution provides everything that has been missing to create the much-sought-after holy grail of converged HPC/DA/ML/AI infrastructures. Notably, the one additional necessary piece is the server platform upon which all this technology is installed. For that, I recommend one of the several Dell PowerEdge server platforms that are optimized for both HPC and AI.

Conclusion

On a final note, I want to mention that the International Supercomputing Conference (ISC'19) will be held in Frankfurt, Germany, the week of June 16, 2019. Several of my HPC/AI expert colleagues and I will be in attendance, and it would be a great opportunity to have productive discussions on what this converged HPC/data analytics/AI future looks like. So, let me know if you're going to be around, or stop by the Dell Technologies booth (#C-1210).

