Introduction#

This document is intended for system administrators using a DGX B200 SuperPOD or DGX B200 BasePOD with NVIDIA Mission Control (NMC) 2.0 software. In this release, Mission Control integrates the NVIDIA Base Command Manager (BCM) tooling for cluster provisioning and infrastructure management with either Slurm or NVIDIA Run:ai for scheduling and orchestrating training, inference, and development workloads.

Base Command Manager#

Base Command Manager is used for the initial cluster bring-up, networking configuration, firmware upgrades, OS provisioning, and deployment of either Slurm or Kubernetes and Run:ai.
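As a quick orientation, BCM administration is typically performed from the head node with the cluster management shell, `cmsh`. A minimal sketch of inspecting a cluster (output and object names vary per installation):

```shell
# List the devices (head node, DGX compute nodes, switches) known to BCM.
cmsh -c "device; list"

# Show the software images available for OS provisioning.
cmsh -c "softwareimage; list"
```

These are read-only queries, so they are safe to run while becoming familiar with a newly deployed cluster; the Administrator Manual covers the full `cmsh` command set.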

For detailed steps on cluster bring-up or deployment, refer to the NVIDIA DGX SuperPOD and BasePOD Deployment Guide with DGX B200 Systems and NVIDIA Mission Control 2.0 document.

For complete onboarding of new admins and documentation on how an admin can use Base Command Manager, refer to the official NVIDIA Base Command Manager Administrator Manual. Note that the BCM documents are per-version and the SBOM for NMC 2.0 includes version 11.

Additional end-user documentation can be found in the NMC 2.0 for B200 Systems User Guide.

Slurm#

Slurm is deployed through a Base Command Manager wizard and enables topology-aware HPC scheduling for GPU workloads across the SuperPOD or BasePOD.
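Once Slurm is deployed, GPU jobs are submitted with batch scripts in the usual way. A minimal sketch (the partition name `defq` and the GPU count are assumptions for illustration; actual partition names depend on how the wizard configured the cluster):

```shell
#!/bin/bash
#SBATCH --job-name=gpu-smoke-test    # illustrative job name
#SBATCH --partition=defq             # assumption: default partition name
#SBATCH --nodes=1
#SBATCH --gres=gpu:8                 # request all eight GPUs on a DGX B200 node
#SBATCH --time=00:10:00

# Confirm GPU visibility on the allocated node.
srun nvidia-smi
```

Submitting with `sbatch` and checking the queue with `squeue` are covered in detail in the Slurm documentation referenced below.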

Deployment of Slurm is covered in the NMC 2.0 for B200 Deployment Guide.

The onboarding and administration of Slurm is covered in the official NVIDIA Base Command Manager Administrator Manual, with additional details in the official Slurm documentation.

Additional end-user documentation can be found in the NMC 2.0 for B200 Systems User Guide.

NVIDIA Run:ai#

NVIDIA Run:ai is deployed through a Base Command Manager wizard. The BCM wizard installs a Kubernetes cluster, all Run:ai dependencies, and then installs Run:ai onto the Kubernetes cluster. This enables topology-aware HPC-style scheduling for GPU workloads across the SuperPOD or BasePOD.
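After the wizard completes, the Kubernetes cluster and the Run:ai components it installed can be sanity-checked from any node with cluster credentials. A minimal sketch (the namespaces shown are common Run:ai defaults and may differ in a given installation):

```shell
# Verify the Kubernetes nodes deployed by the BCM wizard are Ready.
kubectl get nodes

# Check that the Run:ai control-plane and cluster pods are running
# (assumption: default namespaces runai-backend and runai).
kubectl get pods -n runai-backend
kubectl get pods -n runai
```

All pods reporting a Running or Completed status is a reasonable baseline before onboarding users; deployment-specific verification steps are in the NMC 2.0 for B200 Deployment Guide.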

Deployment of NVIDIA Run:ai is covered in the NMC 2.0 for B200 Deployment Guide.

The administration of Run:ai includes topics such as queue or quota management, system or job monitoring, authentication and authorization (authn/authz) configuration, job optimization, and much more. This is documented in the “Platform Management” section of the official Run:ai 2.22 documentation. Note that the Run:ai documents are per-version and the SBOM for NMC 2.0 includes version 2.22.

Additional end-user documentation can be found in the NMC 2.0 for B200 Systems User Guide.