Abstract

The number of use cases for AI within an enterprise, including language modeling, cybersecurity, autonomous systems, and healthcare, continues to expand rapidly. Not only has the number of use cases grown, but model complexity and the number of data sources are growing as well. The systems required to process, train, and serve these next-generation models must grow accordingly. Training commonly uses dozens of GPUs to evaluate and optimize different model configurations and parameters, and training data must be readily accessible to all of those GPUs. In addition, organizations employ many AI researchers who must train numerous models simultaneously. Enterprises need the flexibility for multiple developers and researchers to share these resources as they refine their AI stack and bring it to production.


NVIDIA DGX BasePOD™ provides the underlying infrastructure and software to accelerate the deployment and execution of these AI workloads. Building on the success of NVIDIA DGX systems, DGX BasePOD is a prescriptive AI infrastructure for enterprises, eliminating the design challenges, lengthy deployment cycles, and management complexity traditionally associated with scaling AI infrastructure. Powered by NVIDIA Base Command™, DGX BasePOD provides the essential foundation for AI development optimized for the enterprise.

This reference architecture discusses the key components of DGX BasePOD and provides a prescriptive design for DGX BasePOD solutions.