NVIDIA DGX SuperPOD and BasePOD with DGX B200 Systems Deployment Guide with NVIDIA Mission Control 2.2#

Introduction#

This document provides the steps for deploying NVIDIA DGX SuperPOD and BasePOD with DGX B200 systems and NVIDIA Mission Control 2.2.

NVIDIA Mission Control#

NVIDIA Mission Control 2.2 for DGX B200 introduces the Autonomous Recovery Engine (ARE), a unified resiliency framework for Autonomous Hardware Recovery (AHR 28.4) and Autonomous Job Recovery (AJR 1.5).

The release includes NVIDIA Base Command Manager (BCM 11.31) and NVIDIA Run:AI 2.23, along with an Observability stack, providing full NMC capabilities for cluster management and workload orchestration.

NMC Reference#

NVIDIA Mission Control

NVIDIA Mission Control 2.2 release notes

NVIDIA DGX SuperPOD Reference Architecture

NVIDIA DGX BasePOD Reference Architecture

NVIDIA DGX SuperPOD and BasePOD with DGX B200 Systems Deployment Guide with NVIDIA Mission Control 2.0

Deploying NVIDIA DGX SuperPOD and BasePOD with DGX B200 and NMC 2.2#

The deployment and configuration are standardized across NVIDIA DGX B200 systems. Refer to the NVIDIA DGX SuperPOD and BasePOD with DGX B200 Systems Deployment Guide with NVIDIA Mission Control 2.0 for detailed instructions, with the following changes for NMC 2.2.

Control Plane Nodes#

The deployment requirements for NVIDIA Mission Control (NMC) 2.2 on the DGX B200 system architecture have been updated. The most significant change is an increase in the number of required control nodes, from 7 to 10.

This expansion accommodates new software components in the NMC 2.2 stack, which provide enhanced system management, orchestration, and monitoring capabilities.

Despite the increase in control node count, the process for setting up and provisioning the control nodes remains the same. Detailed steps, including defining and configuring the control nodes as prerequisites for the NMC installation, are outlined in the deployment guide.

Refer to the NMC guide for further details.

The following summarizes the control node roles for NMC 2.2 on B200:

Admin Access Nodes:#

  • BCM Head Nodes (x2, x86): Provide cluster management capabilities via GUI, CLI (CMSH), and API interfaces.

  • Admin Kubernetes Nodes (x3, x86): Host BCM-integrated infrastructure services, including the Observability Stack, Autonomous Hardware Recovery (AHR), Autonomous Job Recovery (AJR), and common Kubernetes services (Loki, Prometheus, etc.).

User Access Nodes:#

  • Slurm Nodes (x2, x86): Serve as the job submission interface, utilizing BCM-provisioned Slurm software.

  • User Kubernetes Nodes (x3, x86): Used for Run:ai orchestration (control plane and scheduler), common Kubernetes services (GPU Operator, Network Operator), and user workloads.

With the introduction of admin nodes, the former “k8s-nodes” have been renamed “k8s-user” nodes, and the new control nodes are “k8s-admin” nodes.

For NMC 2.2 and later, the six Kubernetes control nodes in the BCM configuration are required to use the specified category and Kubernetes cluster names. This is mandatory to ensure a seamless migration to future automation tools once they are released.

| Control Node Category | Control Node Type                                                    |
|-----------------------|----------------------------------------------------------------------|
| k8s-admin             | 3 control nodes for NMC admin k8s services (Observability, AJR, AHR) |
| k8s-user              | 3 control nodes for NMC user k8s services (Run:ai)                   |

Example output from a cluster:

[bcm11-headnode->category]% ls
Name (key)                Software image                    Nodes
------------------------ ---------------------------------- --------
default                   default-image                      1
dgx                       dgx-image                          0
dgx-b200-k8s              dgx-b200-image                     32
k8s-admin                 k8s-admin-image                    3
k8s-user                  k8s-user-image                     3
slogin                    slogin-image                       2
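The required categories can be defined from the BCM head node with cmsh before provisioning the control nodes. The following is a minimal sketch; the prompt and the software image names (`k8s-admin-image`, `k8s-user-image`) are illustrative and must match the images configured on your cluster:

```
# On the BCM head node, enter category mode and clone the default
# category for each required Kubernetes control node category.
# Category names k8s-admin and k8s-user are mandatory for NMC 2.2.
[bcm11-headnode]% category
[bcm11-headnode->category]% clone default k8s-admin
[bcm11-headnode->category*[k8s-admin*]]% set softwareimage k8s-admin-image
[bcm11-headnode->category*[k8s-admin*]]% commit
[bcm11-headnode->category]% clone default k8s-user
[bcm11-headnode->category*[k8s-user*]]% set softwareimage k8s-user-image
[bcm11-headnode->category*[k8s-user*]]% commit
```

After committing, `ls` in category mode should show the new categories, as in the example output above; nodes are then assigned to them via each node's `category` property in device mode.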

NMC Install Guides#

Refer to the following links for installing NMC components on B200 SuperPOD/BasePOD.

Run:ai#

Run:ai Guide for NMC 2.2 with Run:ai 2.23

NMC Admin-plane components#

NMC Admin-plane components - Guide/Link here