The GPU utilization challenge
Enterprises are increasingly investing in machine and deep learning applications that turn massive amounts of data into artificial intelligence that generates business value. In a parallel trend, organizations are investing in server accelerators to speed up the training of the machine and deep learning models that drive AI applications.
But are they getting the full value of those investments? Probably not, based on studies of GPU utilization. Many GPUs in the physical bare-metal world are stuck in silos and grossly underutilized. Enterprise surveys have shown that GPUs are utilized only 15–30 percent of the time. While some GPUs are shared within servers, many are dedicated to individual users.
In a typical workflow, a researcher sets up a large number of experiments, waits for them to finish, and then works to digest the results while the GPUs sit idle. Part of the problem is that the physical nature of GPU infrastructure does not allow for secure, multi-tenant access and sharing across teams. The result is money down the drain: the organization captures only a fraction of the value it could get from more fully utilized GPUs.
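To put the survey numbers in perspective, here is a quick back-of-the-envelope calculation of the idle capacity implied by 15–30 percent utilization. This is only an illustrative sketch; the per-GPU-hour cost is an assumed figure, not something from the surveys.

```python
# Back-of-the-envelope: idle GPU capacity at the survey's utilization rates.
# COST_PER_GPU_HOUR is an assumed illustrative figure, not survey data.

HOURS_PER_YEAR = 24 * 365     # 8,760 hours
COST_PER_GPU_HOUR = 2.50      # assumed illustrative rate (USD)

def idle_hours(utilization: float) -> float:
    """Hours per year a single GPU sits idle at a given utilization rate."""
    return HOURS_PER_YEAR * (1 - utilization)

for util in (0.15, 0.30):
    wasted = idle_hours(util)
    print(f"At {util:.0%} utilization: {wasted:,.0f} idle GPU-hours/year "
          f"(~${wasted * COST_PER_GPU_HOUR:,.0f} of unused capacity per GPU)")
```

Even at the optimistic end of the range, most of each accelerator's annual capacity goes unused, which is the gap a shared pool aims to close.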
This brings us to the news of the day: the launch of a new Dell Technologies solution for virtualizing GPUs and machine learning environments to offer GPUs as a Service (GPUaaS).
The solution: share the GPU goodness
GPU virtualization enables organizations to consolidate multiple siloed GPU clusters into a single shared resource pool that allows many users to share the workload acceleration of the GPUs — and drive up utilization. This is the end goal of the new Dell Technologies solution.
The solution combines VMware virtualization and container orchestration with the latest Dell EMC servers, networking and storage to provide robust infrastructure for machine and deep learning applications and users. VMware Cloud Foundation introduces the concept of a “workload domain,” a set of VMs and resources dedicated to a particular workload. A machine learning workload domain leverages VMware capabilities while optimizing server GPU usage, and high-speed PVRDMA-based networking reduces network overhead.
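In a container-orchestrated workload domain, GPU sharing typically surfaces to users as a schedulable resource. As a minimal sketch (assuming a Kubernetes cluster with the NVIDIA device plugin deployed; the pod name, image and entry point are hypothetical, not part of the Dell reference architecture), a training job might request a GPU from the pool like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-training-job                      # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: tensorflow/tensorflow:latest-gpu  # example GPU-enabled image
    command: ["python", "train.py"]          # hypothetical entry point
    resources:
      limits:
        nvidia.com/gpu: 1                    # one GPU scheduled from the shared pool
```

The scheduler places the pod on a node with a free GPU, and the GPU returns to the pool when the job completes, which is what drives utilization up across users.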
The virtualized GPU workload domain introduced in this solution combines the best of VMware virtualization software and Dell EMC infrastructure to provide a robust yet flexible solution for GPU users. This end-to-end solution includes guidance for deploying a GPU workload domain to meet common use cases for machine learning and high performance computing (HPC) applications.
Realizing the benefits
A lot of great things come out of this new reference architecture for virtualizing GPUs and machine learning environments. This architecture enables organizations to:
- Achieve higher utilization of each individual GPU
- Gain greater efficiency through the sharing of GPUs across users and applications
- Allow users to make use of partial or multiple GPUs on a case-by-case basis as their applications need them
- Enable an elastic GPU as a service (GPUaaS) model that supports the dynamic assignment of GPU resources based on an organization’s business needs and priorities
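The elastic GPUaaS model in the last bullet can be pictured as a shared pool from which users check out fractional or multiple GPUs and return them when done. The following Python sketch models that idea; the class and method names are illustrative and not part of the Dell or VMware stack.

```python
class GpuPool:
    """Illustrative model of an elastic GPU pool: users request fractional
    or multiple GPUs, and capacity returns to the pool on release."""

    def __init__(self, total_gpus: float):
        self.capacity = float(total_gpus)
        self.allocations: dict[str, float] = {}

    @property
    def available(self) -> float:
        """Capacity not currently assigned to any user."""
        return self.capacity - sum(self.allocations.values())

    def allocate(self, user: str, gpus: float) -> bool:
        """Grant the request if enough capacity remains; fractional values
        model vGPU-style partial-GPU sharing."""
        if gpus <= 0 or gpus > self.available:
            return False
        self.allocations[user] = self.allocations.get(user, 0.0) + gpus
        return True

    def release(self, user: str) -> None:
        """Return a user's GPUs to the shared pool."""
        self.allocations.pop(user, None)


pool = GpuPool(total_gpus=4)
pool.allocate("alice", 0.5)   # partial GPU for a small experiment
pool.allocate("bob", 2)       # multiple GPUs for a large training run
print(pool.available)         # prints 1.5
```

Because no GPU is permanently bound to one user, capacity freed by a finished experiment is immediately available to the next request, which is the mechanism behind the utilization gains listed above.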
Here’s the bottom line: as enterprises invest more heavily in the development of machine learning algorithms and AI-driven applications, many will make proportional investments in GPUs. Virtualization and the new Dell Technologies solutions can help organizations ensure that they make the most of these investments.
To learn more
For a closer look at the new solutions, visit Dell Technologies Reference Architectures. Learn more about VMware solutions with PowerEdge servers at delltechnologies.com/poweredge-vmware.