DGX BasePOD Overview

DGX BasePOD is an integrated solution consisting of NVIDIA hardware and software components, MLOps solutions, and third-party storage. Leveraging best practices of scale-out system design with NVIDIA products and validated partner solutions, customers can implement an efficient and manageable platform for AI development. The designs in this DGX BasePOD reference architecture (RA) support developer needs, simplify IT manageability, and enable infrastructure scaling from two nodes to dozens with certified storage platforms from an industry-leading ecosystem. Optional MLOps solutions can be integrated with DGX BasePOD to provide a full-stack solution that shortens AI model development cycles and speeds the ROI of AI initiatives.

Figure 1 highlights the various components of NVIDIA DGX BasePOD. Each of these layers is an integration point that users would typically have to build and tune before an application could be deployed. The designs in the RA simplify system deployment and optimization using a validated, prescriptive architecture.


Figure 1. Layers of integration for DGX BasePOD

NVIDIA Networking

InfiniBand and Ethernet technologies enable networking functionality in DGX BasePOD. Proper networking is critical to ensuring that DGX BasePOD does not suffer bottlenecks or performance degradation under AI workloads. For more information on the products and technologies that enable this, refer to NVIDIA Networking.

Partner Storage Appliance

DGX BasePOD is built on a proven storage technology ecosystem. As NVIDIA validated storage partners introduce new storage technologies into the marketplace, they will qualify these new offerings with DGX BasePOD to ensure design compatibility and expected performance for known workloads. Every storage partner has performed rigorous testing to ensure that applications receive the highest performance and throughput when deployed with DGX BasePOD.

NVIDIA Software

NVIDIA Base Command

NVIDIA Base Command (Figure 2) powers every DGX BasePOD, enabling organizations to leverage the best of NVIDIA software innovation. Enterprises can unleash the full potential of their investment with a proven platform that includes enterprise-grade orchestration and cluster management, libraries that accelerate compute, storage, and network infrastructure, and an operating system (OS) optimized for AI workloads.


Figure 2. NVIDIA Base Command features and capabilities with DGX BasePOD

DGX BasePOD hardware is further optimized with acceleration libraries that maximize the performance of AI workloads across a GPU, a DGX system, and an entire DGX cluster, speeding data access, movement, and management from system I/O to storage to network fabric.

Base Command provides integrated cluster management from installation and provisioning to ongoing monitoring of systems—from one to hundreds of DGX systems. Base Command also supports multiple methods for workflow management. Either Slurm or Kubernetes can be used to allow for optimal scheduling and management of system resources within a multi-user environment.
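As a minimal sketch of the Slurm path, a batch script for a training job spanning two DGX systems might look like the following (the job name, script name, and time limit are hypothetical; the GPU counts assume eight GPUs per DGX system):

```shell
#!/bin/bash
#SBATCH --job-name=train-job      # hypothetical job name
#SBATCH --nodes=2                 # span two DGX systems
#SBATCH --ntasks-per-node=8       # one task per GPU
#SBATCH --gres=gpu:8              # request all eight GPUs on each node
#SBATCH --time=04:00:00           # illustrative wall-clock limit

# Launch one process per GPU across both nodes
srun python train.py
```

Slurm then schedules the job onto available DGX systems and tracks the GPU resources it consumes, so multiple users can share the cluster without conflicting allocations.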

NVIDIA NGC

NVIDIA NGC™ (Figure 3) provides software to meet the needs of data scientists, developers, and researchers with various levels of AI expertise.


Figure 3. NGC catalog overview

Software hosted on NGC is scanned against an aggregated set of common vulnerabilities and exposures (CVEs), as well as for crypto and private keys. It is tested and designed to scale to multiple GPUs and, in many cases, to multiple nodes, ensuring users maximize their investment in DGX systems.
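For example, optimized containers can be pulled from the NGC registry with Docker; the container name follows the NGC catalog convention, but the tag below is illustrative:

```shell
# Authenticate to the NGC registry (username is $oauthtoken,
# password is an NGC API key)
docker login nvcr.io

# Pull a GPU-optimized framework container (tag is illustrative)
docker pull nvcr.io/nvidia/pytorch:24.01-py3
```

The pulled container bundles the framework together with the CUDA libraries it was tested against, so the same image runs consistently across DGX systems in the cluster.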

NVIDIA AI Enterprise

NVIDIA AI Enterprise is the end-to-end software platform that brings generative AI into reach for every enterprise, providing the fastest and most efficient runtime for generative AI foundation models developed with the NVIDIA DGX platform. With production-grade security, stability, and manageability, it streamlines the development of generative AI solutions. NVIDIA AI Enterprise is included with DGX BasePOD, giving enterprise developers access to pretrained models, optimized frameworks, microservices, accelerated libraries, and enterprise support.