The traditional approach for artificial intelligence (AI) and deep learning projects has been to deploy them in the cloud. Because it's common for enterprise software development to leverage cloud environments, many IT groups assume that this infrastructure approach will succeed for AI model training as well.

For many nascent AI projects in the prototyping and experimentation phase, the cloud works just fine. But companies often discover that as data sets grow in volume and AI models grow in complexity, the escalating cost of compute cycles, data movement, and storage can spiral out of control. This phenomenon, called data gravity, refers to the cost and workflow latency of bringing large data sets from where they're created to where compute resources reside. It has caused many companies to consider moving their AI training from the cloud back to an on-premises data center that is data-proximate.

Hybrid is a perfect fit for some AI projects

There's an alternative worth exploring, one that avoids forcing an either/or choice between cloud and on-premises. A hybrid cloud infrastructure approach enables companies to take advantage of both environments. In this case, organizations can use on-premises infrastructure for their ongoing "steady state" training demands, supplemented with cloud services for temporary spikes or unpredictable surges that exceed that capacity.

"The saying 'Own the base, rent the spike' captures the essence of this situation," says Tony Paikeday, senior director of AI systems at NVIDIA.
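As a rough illustration of the "own the base, rent the spike" economics, the sketch below compares serving a fluctuating GPU-hour demand entirely from the cloud against owning baseline capacity and renting only the overflow. All rates and demand figures are hypothetical assumptions for illustration, not quoted prices.

```python
# Hypothetical cost sketch of "own the base, rent the spike".
# Every rate and demand number here is an illustrative assumption.

CLOUD_RATE = 3.00   # assumed $/GPU-hour for on-demand cloud instances
OWNED_RATE = 1.20   # assumed amortized $/GPU-hour for on-prem hardware

# Assumed monthly GPU-hour demand: a steady base with occasional spikes.
monthly_demand = [800, 800, 1400, 800, 2000, 800]
BASE_CAPACITY = 800  # GPU-hours/month covered by owned systems

def cloud_only_cost(demand):
    """Cost if every GPU-hour is rented from the cloud."""
    return sum(hours * CLOUD_RATE for hours in demand)

def hybrid_cost(demand, base):
    """Cost if owned capacity covers the base and the cloud absorbs spikes."""
    total = 0.0
    for hours in demand:
        owned = min(hours, base)        # served on-prem at the amortized rate
        burst = max(hours - base, 0)    # overflow rented from the cloud
        total += owned * OWNED_RATE + burst * CLOUD_RATE
    return total

print(f"cloud-only: ${cloud_only_cost(monthly_demand):,.2f}")
print(f"hybrid:     ${hybrid_cost(monthly_demand, BASE_CAPACITY):,.2f}")
```

Under these assumed rates, the hybrid split comes out cheaper because the steady base runs at the lower amortized cost while only the overflow hours pay the cloud premium, which is the trade-off Paikeday describes.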
"Enterprise IT provisions on-prem infrastructure to support the steady-state volume of AI workloads and retains the ability to burst to the cloud whenever extra capacity is needed."

This approach secures continuous availability of compute resources for developers while ensuring the lowest cost per training run.

With the rise of container orchestration platforms such as Kubernetes, enterprises can more effectively manage the allocation of compute resources that straddle cloud instances and on-prem hardware, such as NVIDIA DGX A100 systems.

For example, aerospace company Lockheed Martin runs experiments on smaller AI models using GPU-enabled cloud instances and uses its DGX systems for training and inference on its largest projects. Although the AI team uses the cloud, the DGX systems remain its sole resource for GPU compute, because it is more difficult to conduct model and data parallelism across cloud instances, says Paikeday.

He stresses that there isn't a single answer for all companies when it comes to choosing among on-premises, cloud-only, and hybrid approaches.

"Different companies approach this from different angles, and some will naturally gravitate to the cloud, based on where their data sets are created and live," he says.

Others, whose data lakes reside on-prem or even in a colocation facility, may eventually see the growing benefit of making their training infrastructure data-proximate, especially as their AI maturity grows.

"Others who have already invested in on-prem will say that it's a natural extension of what they've got," Paikeday says. "Somewhere these two camps will meet in the middle, and both will embrace a hybrid infrastructure.
Because of the nature and uniqueness of AI model development, they will realize that companies can have a balance of both infrastructure types."

Learn more about the benefits of using a hybrid infrastructure for your AI model development with NVIDIA DGX systems, powered by NVIDIA A100 Tensor Core GPUs and AMD EPYC CPUs.

About Keith Shaw:

Keith is a freelance digital journalist who has written about technology topics for more than 20 years.