In projects under way in the Dell EMC HPC and AI Innovation Lab, data science teams are using containers orchestrated by Kubernetes to streamline and accelerate the development of deep learning solutions.

At the Dell EMC HPC and AI Innovation Lab in Austin, we're focused heavily on the development, training and optimization of deep learning models that run in high performance computing environments. These models invariably come with a complex software stack that makes them difficult to deploy on individual servers, let alone dozens or hundreds of server nodes. This is why we use containers with Kubernetes orchestration to distribute and manage our applications.

As the name suggests, a container is a package that bundles up a software application and all of the pieces and parts needed to deploy the software on supported host systems. This includes the operating system libraries and all of the software dependencies the application needs. In the case of a deep learning environment, the pieces that go into the container will also include a framework for deep learning, such as TensorFlow or PyTorch, and all the dependent libraries and packages for the framework.

The container approach enables an organization to reproduce a software environment across many server nodes without installing and configuring the software stack by hand on each server. The container runs on top of the host operating system and shares its kernel. This approach greatly streamlines the distribution and management of complex environments like those used in deep learning.

This brings us to Kubernetes, an open source project launched by Google and now hosted by the Cloud Native Computing Foundation (CNCF). Kubernetes is a container orchestration engine for automating the deployment, scaling and management of containerized applications. It groups the containers that make up an application into logical units for easy management and discovery.
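To make the idea of handing a containerized workload to Kubernetes more concrete, here is a minimal sketch in Python that builds a manifest for a Kubernetes `Job` as a plain dictionary and prints it as JSON. It is an illustration under stated assumptions, not the lab's actual configuration: the function name `training_job_manifest`, the job name `dl-train`, the image `registry.example.com/dl/tensorflow-train:latest` and the `train.py` entry point are all hypothetical. The `nvidia.com/gpu` resource name assumes a cluster where the NVIDIA device plugin is installed.

```python
import json


def training_job_manifest(name, image, workers, gpus_per_worker=0):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict.

    All argument values are supplied by the caller; nothing here is
    taken from a real cluster.
    """
    container = {
        "name": "trainer",
        "image": image,  # container image holding the framework and its dependencies
        "command": ["python", "train.py"],  # hypothetical training entry point
    }
    if gpus_per_worker > 0:
        # Assumes the NVIDIA device plugin exposes GPUs under this resource name.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus_per_worker}}

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        # Labels let Kubernetes group and discover the pods that belong to this job.
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "completions": workers,  # total pods that must finish successfully
            "parallelism": workers,  # pods allowed to run at the same time
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "restartPolicy": "Never",  # a Job requires Never or OnFailure
                    "containers": [container],
                },
            },
        },
    }


manifest = training_job_manifest(
    name="dl-train",
    image="registry.example.com/dl/tensorflow-train:latest",  # hypothetical image
    workers=4,
    gpus_per_worker=1,
)
print(json.dumps(manifest, indent=2))
```

Saved to a file, the printed output could be submitted with `kubectl apply -f`, since Kubernetes accepts manifests in JSON as well as YAML. Note that a plain `Job` like this only runs identical pods in parallel; coordinating those pods for true distributed training generally takes additional tooling that this sketch does not cover.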
In our HPC and AI Innovation Lab, we use containers and Kubernetes to speed up and streamline the production and distribution of deep learning training workloads to thousands of CPU and accelerator nodes in our Zenith and Rattler supercomputing clusters. This approach allows us to develop an entire deep learning environment on a single system, wrap it up in a container and distribute it to the host systems. We then use Kubernetes to schedule and orchestrate jobs across many processors, similar to the way we used batch scheduling with the HPC systems of the past.

With these capabilities, Kubernetes is helping us accelerate throughput for training deep learning workloads and reduce the time required to develop innovative solutions for natural language processing, image and video classification, recommendation engines, and more.

In projects like these, and in the work we do directly with Dell EMC customers, we tailor and tweak the underlying infrastructure for the workload, so each application gets the right amount of compute power, the right storage and the right network fabric. We also work to show customers the differences between open source and enterprise Kubernetes solutions, and to help them understand why Kubernetes has become the de facto standard for the development of deep learning applications.

Ultimately, Kubernetes is helping us shape new products for our customers based on deep learning and artificial intelligence. In this work in the lab, we are defining what our next-generation infrastructure solutions will look like, including those in the growing portfolio of Dell EMC Ready Solutions for AI. Along the way, we're showing our customers how they can accelerate time to value for deep learning solutions by doing Kubernetes the right way.

To learn more

To learn more about the resources available through the Dell EMC HPC and AI Innovation Lab, visit com/innovationlab.
To explore new HPC solutions for powering AI-driven applications, visit Dell EMC Ready Solutions for AI.

John Lockman is an HPC and AI data science systems engineering specialist in the Dell EMC HPC and AI Innovation Lab.