Answering Key Questions About the Virtualization of HPC and AI Workloads

Dell Technologies

Back in 1999, VMware broke new ground with the launch of its first virtualization product, which allowed multiple operating systems to run on a single computer. In the years since, virtualization of IT resources has swept through data centers around the world — and for good reason.

With server virtualization, you can turn a single physical host into multiple virtual systems, each running its own OS and applications. This ability to break a big computing resource into lots of smaller virtual machines helps IT shops increase asset utilization, accelerate provisioning, gain agility and more.

But what about high performance computing (HPC)? Until recently, HPC systems haven’t seen much virtualization, in part because of concerns about potential performance penalties stemming from the addition of a layer of software — a hypervisor — between the operating system and the hardware. But today this is changing, as organizations increasingly virtualize their HPC systems.

So what’s behind this emerging trend? We took up this topic in an interview with Chuck Gilbert, an HPC technologist in the Dell Technologies Office of the CTO. What follows are a few excerpts from our conversation with Chuck.

Q: Why would anyone look closely at virtualizing HPC or AI workloads?

CG: With the growing complexity and sheer number of tools available to researchers, IT staff, administrators and scientists, there is an increasing need to rapidly build composable, flexible environments that support a very diverse and ever-evolving set of research needs.

A traditional HPC infrastructure is going to have a certain set of common tools installed, a queuing system and a traditional module system for loading software. Many of the emerging tools we are seeing in this space require the ability to rapidly spin up and spin down unique, disparate HPC environments. These include AI and machine learning frameworks, new and improved web-driven HPC workflow tools, and new graphical, interactive and visualization tools.

These tools don’t play well in a traditional HPC environment. In a virtualized environment, you gain the ability to run elastic, composable compute on top of a common set of hardware. By taking advantage of the software-defined components of a virtualized platform, an IT service provider can customize the environment and the experience for the researcher or end user to support their diverse needs, rather than offering a one-size-fits-all approach.

Q: What are HPC workloads like before virtualization versus after?

CG: Depending on how you deploy and configure your virtualized environment, the workloads themselves are no different when virtualized. You can set up a standard HPC scheduler, whether that is SLURM or another open source scheduling package, to interact with virtualized compute nodes in the same way it interacts with bare-metal compute nodes.
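To make that point concrete, here is a minimal, hypothetical slurm.conf fragment in which virtual machines are registered just like bare-metal compute nodes; the node names, partition name and resource counts are illustrative assumptions, not details from the interview:

# slurm.conf fragment: virtual machines registered as ordinary compute nodes
NodeName=vhpc[001-016] CPUs=32 RealMemory=192000 State=UNKNOWN
PartitionName=virtual Nodes=vhpc[001-016] Default=YES MaxTime=INFINITE State=UP

From the user’s point of view, jobs are submitted to this partition with the same sbatch and srun commands they would use on a bare-metal cluster.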

There are a lot of things that could be done to make the overall experience seamless and transparent to the end user, but what administrators, operators and service providers are getting on the backend is the ability to do things better. With virtualization, they can increase the overall utilization of HPC systems and drive flexibility in the hardware configurations. They also gain the ability to easily manage, update and deploy new hardware using a common set of automated and scalable best practices.

Q: People running HPC workloads want all the performance they can get. Naturally, they are concerned that virtualization will impact performance. What are your thoughts on that?

CG: If you look at the performance impact of a hypervisor today, it’s clear we have come a long way in reducing the overhead that [VMware] ESXi or any other hypervisor technology can impose on physical resources.

In this scenario, the maximum overhead on a physical resource is typically anywhere from 1 percent to 4 percent. If you think about the gains in operational efficiency, flexibility, programmability and ease of maintenance, you could say, “Yes, there may be a little bit of an impact on performance, but is 1 percent to 4 percent really worth arguing against all of the advantages I am getting from an operational and flexibility standpoint?”

In most cases, if we look even across a physical HPC infrastructure — except for very large leadership-class systems that are constantly tuned down to the microcode level — those operational efficiencies are going to outweigh any slight performance overhead that stems from running on top of a hypervisor.

To learn more

For the full story, see the Q&A with Chuck Gilbert: “Virtualizing HPC: A Perspective from the Office of the CTO.”

For more information about how you can accelerate your AI journey, explore AI Solutions from Dell Technologies and the Intel AI Builders.

Copyright © 2020 IDG Communications, Inc.