At the Dell EMC HPC and AI Innovation Lab in Austin, we’re focused heavily on the development, training and optimization of deep learning models that run in high performance computing environments. These models invariably come with a complex software stack that makes them difficult to deploy on individual servers, let alone dozens or hundreds of server nodes. This is why we use containers with Kubernetes orchestration to distribute and manage our applications.
As the name suggests, a container is a package that bundles up a software application and all of the pieces and parts needed to deploy the software on supported host systems. This includes the operating system's user-space libraries and tools and all of the application's software dependencies; the container shares the host's kernel rather than carrying a full virtualized operating system. In the case of a deep learning environment, the pieces that go into the container will also include a framework for deep learning, such as TensorFlow or PyTorch, and all the dependent libraries and packages for the framework.
The container approach enables an organization to duplicate a software environment across many server nodes without having to install and configure the application and its dependencies on each individual server. Each container runs on top of its host's operating system. This approach greatly streamlines the distribution and management of complex environments like those used in deep learning.
This brings us to Kubernetes, an open source project launched by Google and now hosted by the Cloud Native Computing Foundation (CNCF). Kubernetes is a container orchestration engine for automating the deployment, scaling and management of containerized applications. It groups the containers that make up an application into logical units for easy management and discovery.
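To make the idea of a "logical unit" concrete, here is a minimal sketch of a Kubernetes Pod, the object that groups containers, built as a Python dictionary and printed as JSON (which `kubectl apply -f` accepts alongside YAML). The field names follow the Kubernetes `v1` Pod schema, but the function name, pod name, container names, image names and label values are all hypothetical placeholders, not anything taken from our lab's configuration.

```python
import json


def build_training_pod(name, framework_image, sidecar_image):
    """Return a Pod manifest (as a dict) that groups two containers.

    The labels are what Kubernetes selectors match on, which is how
    other objects "discover" the pod. All values here are illustrative.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "labels": {"app": name, "workload": "dl-training"},
        },
        "spec": {
            "containers": [
                # Main container: the deep learning framework image.
                {"name": "trainer", "image": framework_image},
                # Helper container that runs alongside the trainer.
                {"name": "metrics", "image": sidecar_image},
            ],
        },
    }


if __name__ == "__main__":
    pod = build_training_pod(
        "dl-train-demo",
        "example.registry/dl-framework:latest",   # hypothetical image
        "example.registry/metrics-agent:latest",  # hypothetical image
    )
    print(json.dumps(pod, indent=2))
```

The two containers in this sketch are scheduled together on the same node and managed as one unit, and the labels give other Kubernetes objects a way to select the pod.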
In our HPC and AI Innovation Lab, we use containers and Kubernetes to speed up and streamline the production and distribution of deep learning training workloads to thousands of CPU and accelerator nodes in our Zenith and Rattler supercomputing clusters. This approach allows us to develop an entire deep learning environment on a single system, wrap it up in a container and use Kubernetes to distribute it to the host systems.
We then use Kubernetes to schedule and orchestrate jobs across many processors, similar to the way we used batch scheduling with the HPC systems of the past. With these capabilities, Kubernetes is helping us accelerate throughput for training deep learning workloads and reduce the time required to develop innovative solutions for natural language processing, image and video classification, recommendation engines, and more.
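For readers used to batch schedulers, the closest built-in Kubernetes analogue is the Job object. The sketch below builds a `batch/v1` Job manifest in Python and prints it as JSON. It is an illustration under stated assumptions, not the configuration we run in the lab: the function name, job name, image, command and worker count are hypothetical, and the `nvidia.com/gpu` resource name assumes a cluster where the NVIDIA device plugin is installed. Real multi-node training usually needs additional coordination between workers that this sketch does not attempt.

```python
import json


def build_training_job(name, image, command, workers, gpus_per_worker=0):
    """Return a batch/v1 Job manifest (as a dict) that runs `workers` pods.

    `parallelism` is how many pods may run at once; `completions` is how
    many must finish successfully before the Job counts as done.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")

    container = {
        "name": "trainer",
        "image": image,
        "command": list(command),
    }
    if gpus_per_worker > 0:
        # Assumption: the cluster exposes GPUs under this resource name.
        container["resources"] = {
            "limits": {"nvidia.com/gpu": gpus_per_worker}
        }

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "parallelism": workers,
            "completions": workers,
            "backoffLimit": 2,  # retries allowed before the Job is marked failed
            "template": {
                "spec": {
                    # Job pods must use "Never" or "OnFailure".
                    "restartPolicy": "Never",
                    "containers": [container],
                },
            },
        },
    }


if __name__ == "__main__":
    job = build_training_job(
        "dl-train-batch",
        "example.registry/dl-framework:latest",  # hypothetical image
        ["python", "train.py"],                  # hypothetical command
        workers=4,
        gpus_per_worker=1,
    )
    print(json.dumps(job, indent=2))
```

Submitting the resulting manifest is roughly the equivalent of queueing a batch job: Kubernetes finds nodes with enough free resources, starts the pods there and tracks them until the requested number complete.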
In projects like these, and in the work we do directly with Dell EMC customers, we tailor and tweak the underlying infrastructure for the workload, so each application gets the right amount of compute power, the right storage and the right network fabric. We also work to show customers the differences between open source and enterprise Kubernetes solutions, and to help them understand why Kubernetes has become the de facto standard for the development of deep learning applications.
Ultimately, Kubernetes is helping us shape new products for our customers based on deep learning and artificial intelligence. In this work in the lab, we are defining what our next-generation infrastructure solutions will look like, including those in the growing portfolio of Dell EMC Ready Solutions for AI.
Along the way, we’re showing our customers how they can accelerate time to value for deep learning solutions by doing Kubernetes the right way.
John Lockman is an HPC and AI data science systems engineering specialist in the Dell EMC HPC and AI Innovation Lab.