Introduction#

This installation guide provides step-by-step instructions for deploying and configuring the complete NVIDIA Mission Control 2.0 software stack on NVIDIA GB200 NVL72 systems. It covers the following:

  • Installation and configuration of all software components required to enable full NVIDIA Mission Control functionality

  • Software dependencies for each feature

  • Installation and deployment of the features themselves

  • Verification and testing procedures to ensure proper feature functionality

For control plane hardware details and requirements, see https://apps.nvidia.com/PID/ContentLibraries/Detail?id=1137731&srch=nmc%20hardware.

An overview of Mission Control is shown in Figure 1.

Figure 1 Mission Control Software Architecture#

Assumptions and Prerequisites#

Before installing the Mission Control software, confirm that the following items are complete in BCM 11, as outlined in the NVIDIA Mission Control Management Plane and Rack Setup with NVIDIA GB200 NVL72 Systems Installation Guide:

  • All networks are defined and all switches (in-band and out-of-band) are “Up”

  • All control plane nodes are configured and “Up”:

    • slogin

    • K8s-system-admin

    • K8s-system-user

  • High Availability (HA) setup is configured and failover is verified

  • The GB200 NVL72 rack setup is complete and the following items are verified:

    • All NVLink switch chips are online with NMX-C/T enabled

    • NVLink switch leader is assigned

    • Each GB200 NVL72 rack contains 9 NVLink switch trays

    • All 18 compute trays per rack are provisioned and in an “Up” state

    • Power control is established at rack, compute tray, and NVLink switch tray levels

  • NFS setup is complete and available

  • A valid BCM license with Mission Control enabled is installed
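The rack-level items in this checklist lend themselves to a scripted pre-flight check. The following is a minimal Python sketch, assuming the rack state has already been collected by some other means. `RackStatus`, `check_rack`, and the rack names are hypothetical and are not part of BCM or Mission Control; only the counts (9 NVLink switch trays, 18 compute trays) and the "Up" state come from the list above.

```python
from dataclasses import dataclass, field

# Per-rack counts stated in the prerequisites list.
EXPECTED_SWITCH_TRAYS = 9
EXPECTED_COMPUTE_TRAYS = 18


@dataclass
class RackStatus:
    """Hypothetical snapshot of one GB200 NVL72 rack (not a BCM data type)."""
    name: str
    switch_tray_count: int = 0
    compute_tray_states: list = field(default_factory=list)
    nvlink_leader_assigned: bool = False


def check_rack(rack):
    """Return a list of problems found; an empty list means the rack passes."""
    problems = []
    if rack.switch_tray_count != EXPECTED_SWITCH_TRAYS:
        problems.append(
            f"{rack.name}: expected {EXPECTED_SWITCH_TRAYS} NVLink switch trays, "
            f"found {rack.switch_tray_count}"
        )
    if len(rack.compute_tray_states) != EXPECTED_COMPUTE_TRAYS:
        problems.append(
            f"{rack.name}: expected {EXPECTED_COMPUTE_TRAYS} compute trays, "
            f"found {len(rack.compute_tray_states)}"
        )
    not_up = sum(1 for state in rack.compute_tray_states if state != "Up")
    if not_up:
        problems.append(f"{rack.name}: {not_up} compute tray(s) not in the Up state")
    if not rack.nvlink_leader_assigned:
        problems.append(f"{rack.name}: no NVLink switch leader assigned")
    return problems


# Illustrative data only; real values would come from the cluster inventory.
good = RackStatus("rack-a01", 9, ["Up"] * 18, True)
bad = RackStatus("rack-a02", 9, ["Up"] * 17 + ["Down"], False)

for rack in (good, bad):
    issues = check_rack(rack)
    print(rack.name, "OK" if not issues else issues)
```

Returning a list of problems rather than a single pass/fail flag makes it easier to report every failed prerequisite for a rack in one run.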

Mission Control Components#

The NVIDIA Mission Control 2.0 control plane provides a modular, scalable, and secure architecture based on the NVIDIA GB200 NVL72 platform. Through its flexible design, customers gain seamless access to all system capabilities and resources using a centralized administrative interface. This enables streamlined operations across distributed infrastructure components without the complexity of managing multiple administrative domains.

Admin Control Plane#

The NVIDIA Mission Control 2.0 admin control plane provides a centralized administrative interface for managing the cluster. It includes the following components:

  • Head Nodes - x86 (Cluster deployment, management, and monitoring):

    • Base Command Manager (BCM) providing:

      • GUI, CLI, and API interfaces

      • OS provisioning

      • Observability*

      • Network provisioning

      • Rack and inventory management

      • Power profiles*

      • Leak monitoring and CHP

      • Slurm workflow software

  • Admin Service Nodes - x86 (K8s and BCM-integrated services):

    • BCM-Integrated Services:

      • NMX Manager*

      • Observability stack*

      • Autonomous Hardware Recovery (AHR)*

      • Autonomous Job Recovery (AJR)*

      • Power Reservation Steering (PRS) Service*

    • Common K8s Services, including Loki, Prometheus, operators, and other components

    • BCM-Provisioned K8s

User Control Plane#

The NVIDIA Mission Control 2.0 user control plane provides the services through which users access the cluster. It includes the following components:

  • Slurm Nodes - Arm64 (User access to Slurm cluster):

    • BCM-provisioned Slurm submission software

  • User Service Nodes - x86 (K8s and BCM-integrated services):

    • Run:AI components:

      • Control plane

      • Scheduler

    • Common K8s Services, including GPU Operator, DRA, and Network Operator

    • BCM-Provisioned K8s

Compute Plane#

The NVIDIA Mission Control 2.0 compute plane consists of the execution hosts that run workloads. It includes the following components:

  • GB200 Rack (8 Racks per SU):

    • Execution hosts with compute trays containing CPU, GPU, and memory

    • 18 compute trays per rack with integrated hardware components
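As a quick sizing aid, the per-SU totals implied by these figures can be worked out directly. The short Python sketch below uses the rack and tray counts stated in this guide; the figure of 72 GPUs per rack is an assumption taken from the "NVL72" product name rather than from this section, and the variable names are illustrative.

```python
RACKS_PER_SU = 8              # stated in this section
COMPUTE_TRAYS_PER_RACK = 18   # stated in this section
SWITCH_TRAYS_PER_RACK = 9     # stated in the prerequisites
GPUS_PER_RACK = 72            # assumption: from the "NVL72" product name

compute_trays_per_su = RACKS_PER_SU * COMPUTE_TRAYS_PER_RACK
switch_trays_per_su = RACKS_PER_SU * SWITCH_TRAYS_PER_RACK
gpus_per_tray = GPUS_PER_RACK // COMPUTE_TRAYS_PER_RACK
gpus_per_su = RACKS_PER_SU * GPUS_PER_RACK

print(f"Compute trays per SU:       {compute_trays_per_su}")
print(f"NVLink switch trays per SU: {switch_trays_per_su}")
print(f"GPUs per compute tray:      {gpus_per_tray}")
print(f"GPUs per SU:                {gpus_per_su}")
```

Under these assumptions, one SU contains 144 compute trays, 72 NVLink switch trays, and 576 GPUs.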

Additional Systems#

The control plane integrates with the following additional infrastructure components:

  • BMS (Bare Metal Service) - Customer-provided API-compliant BMS*

  • NVLink Switches and InfiniBand Switches & UFM

  • Ethernet Switches

  • NFS Storage and HFS Storage

Key Features#

The NVIDIA Mission Control 2.0 control plane provides the following key features:

  • Supports both training and inference workloads

  • Provides centralized management for configuration and observability

  • Delivers a standardized, scalable, secure control plane for all NVIDIA GB200 NVL72 systems

Note

Components marked with an asterisk (*) are new in Mission Control 2.0.0.
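When the component lists in this section are tracked in a script or inventory file, the asterisk convention can be handled mechanically. The following Python sketch shows one way to separate the marker from the component name; `split_new_marker` is a hypothetical helper, and the labels are a subset copied from the lists above.

```python
def split_new_marker(label):
    """Split a component label into (name, is_new) using a trailing asterisk."""
    label = label.strip()
    if label.endswith("*"):
        return label[:-1].rstrip(), True
    return label, False


# A subset of component labels as written in this guide.
labels = [
    "OS provisioning",
    "Observability*",
    "Power profiles*",
    "NMX Manager*",
    "Autonomous Hardware Recovery (AHR)*",
    "Slurm workflow software",
]

new_components = [
    name for name, is_new in map(split_new_marker, labels) if is_new
]
print(new_components)
```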