Introduction

This document is intended for system administrators using a DGX B200/B300 SuperPOD or DGX B200/B300 BasePOD with NVIDIA Mission Control (NMC) 2.1 software. In this release, Mission Control integrates the NVIDIA Base Command Manager (BCM) tooling for cluster provisioning and infrastructure management with either Slurm or NVIDIA Run:ai for scheduling and orchestrating training, inference, or development workloads.

Base Command Manager

Base Command Manager is used for the initial cluster bring-up, networking configuration, firmware upgrades, OS provisioning, and deployment of either Slurm, or Kubernetes and Run:ai.

For detailed steps on cluster bring-up or deployment, contact NVIDIA Support.

For onboarding new administrators and for documentation on how to use Base Command Manager, refer to the official NVIDIA Base Command Manager Administrator Manual. Note that the BCM documents are per-version, and the software bill of materials (SBOM) for NMC 2.1 includes BCM version 11.

Additional end-user documentation can be found in the NMC 2.1 for B200/B300 Systems User Guide.

Slurm

Slurm is deployed through a Base Command Manager wizard and enables topology-aware HPC scheduling for GPU workloads across the SuperPOD or BasePOD.

The onboarding and administration of Slurm are covered in the official NVIDIA Base Command Manager Administrator Manual, with additional details in the official Slurm documentation.

Additional end-user documentation can be found in the NMC 2.1 for B200/B300 Systems User Guide.

NVIDIA Run:ai

NVIDIA Run:ai is deployed through a Base Command Manager wizard. The BCM wizard installs a Kubernetes cluster and all Run:ai dependencies, and then installs Run:ai onto that Kubernetes cluster. This enables topology-aware HPC-style scheduling for GPU workloads across the SuperPOD or BasePOD.

The administration of Run:ai includes topics such as queue or quota management, system or job monitoring, authentication (authn) or authorization (authz) configuration, and job optimization. These topics are documented in the “Platform Management” section of the official Run:ai 2.23 documentation. Note that the Run:ai documents are per-version, and the SBOM for NMC 2.1 includes Run:ai version 2.23.

Additional end-user documentation can be found in the NMC 2.1 for B200/B300 Systems User Guide.