The traditional approach for artificial intelligence (AI) and deep learning projects has been to deploy them in the cloud. Because enterprise software development commonly leverages cloud environments, many IT groups assume that the same infrastructure approach will work just as well for AI model training.
For many nascent AI projects in the prototyping and experimentation phase, the cloud works just fine. But companies often discover that as data sets grow in volume and AI models grow in complexity, the escalating cost of compute cycles, data movement, and storage can spiral out of control. This phenomenon, called data gravity, is the cost and workflow latency incurred by moving large data sets from where they are created to where compute resources reside. It has led many companies to consider moving their AI training from the cloud back to an on-premises data center that is data-proximate.
Hybrid is a perfect fit for some AI projects
There’s an alternative worth exploring, one that avoids forcing an either/or choice between cloud and on-premises. A hybrid cloud infrastructure enables companies to take advantage of both environments: organizations use on-premises infrastructure for their ongoing “steady state” training demands, supplemented with cloud services for temporary spikes or unpredictable surges that exceed that capacity.
“The saying ‘Own the base, rent the spike’ captures the essence of this situation,” says Tony Paikeday, senior director of AI systems at NVIDIA. “Enterprise IT provisions on-prem infrastructure to support the steady-state volume of AI workloads and retains the ability to burst to the cloud whenever extra capacity is needed.”
This approach secures continuous availability of compute resources for developers while keeping the cost per training run low.
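The economics behind “own the base, rent the spike” can be illustrated with a small back-of-the-envelope calculation. All figures below are hypothetical assumptions for illustration, not vendor pricing:

```python
# Hypothetical cost sketch of "own the base, rent the spike."
# All rates and volumes are illustrative assumptions, not real pricing.

def all_cloud_cost(gpu_hours, cloud_rate):
    """Cost if every GPU-hour is rented from the cloud."""
    return gpu_hours * cloud_rate

def hybrid_cost(gpu_hours, base_capacity, on_prem_rate, cloud_rate):
    """Own capacity for the steady-state base; rent only the spike."""
    base_hours = min(gpu_hours, base_capacity)
    spike_hours = gpu_hours - base_hours
    return base_hours * on_prem_rate + spike_hours * cloud_rate

# Assumed monthly demand: a 10,000 GPU-hour base plus a 2,000-hour spike.
demand = 12_000
base = 10_000
cloud = all_cloud_cost(demand, cloud_rate=3.00)  # assumed $/GPU-hour
hybrid = hybrid_cost(demand, base, on_prem_rate=1.20, cloud_rate=3.00)

print(f"all-cloud: ${cloud:,.0f}  hybrid: ${hybrid:,.0f}")
# → all-cloud: $36,000  hybrid: $18,000
```

Under these assumed rates, owning the steady-state base and bursting only the spike to the cloud halves the monthly bill; the real crossover point depends on utilization and amortized hardware cost.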
With the rise of container orchestration platforms such as Kubernetes, enterprises can more effectively manage the allocation of compute resources that straddle cloud instances and on-prem hardware, such as NVIDIA DGX A100 systems.
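The burst-to-cloud placement decision that such platforms automate can be sketched in a few lines. This is a hand-rolled illustration, not how Kubernetes itself works; the capacity figure and job names are hypothetical:

```python
# Minimal sketch of burst-to-cloud job placement under
# "own the base, rent the spike." In practice an orchestration
# platform such as Kubernetes makes this decision; the capacity
# and job list here are hypothetical.

ON_PREM_GPUS = 8  # assumed steady-state on-prem capacity

def route_jobs(jobs):
    """Place each job on-prem while capacity lasts, then burst to cloud."""
    free = ON_PREM_GPUS
    placements = {}
    for name, gpus_needed in jobs:
        if gpus_needed <= free:
            placements[name] = "on-prem"
            free -= gpus_needed
        else:
            placements[name] = "cloud"  # rent the spike
    return placements

jobs = [("train-large", 6), ("tune-small", 2), ("experiment", 4)]
print(route_jobs(jobs))
# train-large and tune-small fill the 8 on-prem GPUs; experiment bursts to cloud
```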
For example, aerospace company Lockheed Martin runs experiments on smaller AI models using GPU-enabled cloud instances, and uses its DGX systems for training and inference on its largest projects. Although the AI team uses the cloud, the DGX systems remain its primary resource for large-scale GPU compute, because it is more difficult to conduct model and data parallelism across cloud instances, says Paikeday.
He stresses that there is no single answer for all companies when choosing among on-premises, cloud-only, and hybrid approaches.
“Different companies approach this from different angles, and some will naturally gravitate to cloud, based on where their data sets are created and live,” he says.
Others whose data lake resides on-prem, or even in a colocation facility, may eventually see a growing benefit in making their training infrastructure data-proximate, especially as their AI maturity grows.
“Others who have already invested in on-prem will say that it’s a natural extension of what they’ve got,” Paikeday says. “Somewhere these two camps will meet in the middle, and both will embrace a hybrid infrastructure. Because of the nature and uniqueness of AI model development, they will realize that companies can have a balance of both infrastructure types.”
Click here to learn more about the benefits of using a hybrid infrastructure for your AI model development with NVIDIA DGX A100 systems, powered by NVIDIA A100 Tensor Core GPUs and AMD EPYC CPUs.
About Keith Shaw:
Keith is a freelance digital journalist who has written about technology topics for more than 20 years.