Introduction#

This document details the following:

  • Installation and configuration of all software components required to enable full NVIDIA Mission Control functionality

  • Software dependencies for each feature

  • Installation and deployment of the features themselves

  • Verification and testing procedures to ensure proper feature functionality

For control plane hardware information and requirements, go to: https://apps.nvidia.com/PID/ContentLibraries/Detail?id=1137731&srch=nmc%20hardware

An overview of Mission Control is shown in Figure 1.

Figure 1. Mission Control Software Architecture#

Assumptions and Prerequisites#

Before installing Mission Control software, complete the following tasks within BCM 11, as outlined in the NVIDIA Mission Control Management Plane and Rack Setup with NVIDIA GB200 NVL72 Systems Installation Guide (a verification sketch follows this list):

  • All networks are defined and all switches (in-band and out-of-band) are “Up”

  • All control plane nodes are configured and “Up”:

    • K8s-CTRL-nodes

    • slogin

    • NMX-M

  • HA setup is configured and failover is verified

  • GB200 NVL72 rack(s) setup is completed:

    • All NVLink switch chips are online with NMX-C/T enabled

    • NVLink switch leader is assigned

    • Each GB200 NVL72 rack contains 9 NVLink switch chips

    • All 18 compute trays per rack are provisioned and in an “Up” state

    • Power control is established at rack, compute tray, and NVLink switch tray levels

  • NFS setup is complete and available

  • A valid BCM license with Mission Control enabled is installed
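The checklist above can be spot-checked from the active BCM head node. The following is a minimal sketch that shells out to cmsh and flags any device that does not report an UP state; the exact cmsh invocation, output format, and device naming shown here are assumptions to adapt to your deployment.

```python
import subprocess

# Spot-check sketch (assumptions: run on the active BCM head node, `cmsh` on
# the PATH, and "device; status" printing one device per line with its state
# in square brackets, e.g. "node001 .......... [   UP   ]").
def devices_not_up():
    out = subprocess.run(
        ["cmsh", "-c", "device; status"],
        check=True, capture_output=True, text=True,
    ).stdout
    problems = []
    for line in out.splitlines():
        if "[" not in line:
            continue                                  # skip banners / blank lines
        state = line.rsplit("[", 1)[1].strip(" ]")    # text inside the brackets
        if state.upper() != "UP":
            problems.append(line.strip())
    return problems

if __name__ == "__main__":
    bad = devices_not_up()
    print("All devices report UP." if not bad
          else "Devices not reporting UP:\n" + "\n".join(bad))
```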

Mission Control Components#

The NVIDIA Mission Control 2.0 control plane provides a modular, scalable, and secure architecture based on the NVIDIA GB200 NVL72 platform. Through its composable design, customers gain seamless access to all system capabilities and resources via a centralized administrative interface, enabling streamlined operations across distributed infrastructure components without the complexity of managing multiple administrative domains.

Admin Control Plane

  • Head Nodes - x86 (Cluster deployment, management, and monitoring):

    • Base Command Manager (BCM) providing:

      • GUI, CLI, and API interfaces

      • OS provisioning

      • Observability*

      • Network provisioning

      • Rack and inventory management

      • Power profiles*

      • Leak monitoring and CHP

      • Slurm workflow software

  • Admin Service Nodes - x86 (K8s and BCM-integrated services):

    • BCM-Integrated Services:

      • NMX Manager*

      • Observability stack*

      • Autonomous Hardware Recovery (AHR)*

      • Autonomous Job Recovery (AJR)*

      • Power Reservation Steering (PRS) Service*

    • Common K8s Services, including Loki, Prometheus, operators, and other components

    • BCM-Provisioned K8s
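The admin K8s services listed above can be sanity-checked once the cluster is deployed. The sketch below queries kubectl for pod status in a set of namespaces; the namespace names are hypothetical placeholders, not the names Mission Control actually uses, so substitute the values from your deployment.

```python
import json
import subprocess

# Sanity-check sketch for the admin K8s cluster (assumptions: kubectl is
# configured against the BCM-provisioned admin cluster, and the namespace
# names below are placeholders to replace with your deployment's values).
EXPECTED_NAMESPACES = ["monitoring", "loki", "nmx-m"]   # hypothetical names

def pods_not_running(namespace):
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    pods = json.loads(out).get("items", [])
    return [p["metadata"]["name"] for p in pods
            if p["status"].get("phase") not in ("Running", "Succeeded")]

if __name__ == "__main__":
    for ns in EXPECTED_NAMESPACES:
        bad = pods_not_running(ns)
        print(f"{ns}: {'OK' if not bad else 'not ready -> ' + ', '.join(bad)}")
```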

User Control Plane

  • Slurm Nodes - Arm64 (User access to Slurm cluster):

    • BCM-provisioned Slurm submission software

  • User Service Nodes - x86 (K8s and BCM-integrated services):

    • Run:AI components:

      • Control plane

      • Scheduler

    • Common K8s Services, including GPU Operator, DRA, and Network Operator

    • BCM-Provisioned K8s
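A quick way to confirm that the GPU Operator in the user K8s cluster has come up is to check whether the nodes visible to that cluster advertise GPU capacity. This is a minimal sketch, assuming kubectl points at the BCM-provisioned user cluster and that GPUs are exposed under the conventional nvidia.com/gpu resource name.

```python
import json
import subprocess

# Sketch: list GPU capacity advertised per node (assumptions: kubectl targets
# the user K8s cluster and the GPU resource name is nvidia.com/gpu).
def gpu_capacity_by_node():
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    nodes = json.loads(out)["items"]
    return {n["metadata"]["name"]:
            int(n["status"].get("capacity", {}).get("nvidia.com/gpu", 0))
            for n in nodes}

if __name__ == "__main__":
    for node, gpus in sorted(gpu_capacity_by_node().items()):
        print(f"{node}: {gpus} GPU(s) advertised")
```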

Compute Plane

  • GB200 Rack (8 racks per SU):

    • Execution hosts with compute trays containing CPU, GPU, and memory

    • 18 compute trays per rack with integrated hardware components
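Once the racks are provisioned into Slurm, the 18-compute-trays-per-rack figure above can be cross-checked from the slogin node by grouping node names by rack. The hostname-to-rack mapping used in this sketch is an assumption about the naming convention; adapt the grouping logic to your site.

```python
import collections
import subprocess

# Sketch: count Slurm nodes per rack (assumptions: `sinfo` is available on the
# slogin node, and the rack identifier is the hostname prefix before the last
# "-"-separated suffix, e.g. "rack01-tray05" -> "rack01"; hypothetical scheme).
def nodes_per_rack():
    out = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N"],
        check=True, capture_output=True, text=True,
    ).stdout
    counts = collections.Counter()
    for name in set(out.split()):
        rack = name.rsplit("-", 1)[0]     # hypothetical naming convention
        counts[rack] += 1
    return counts

if __name__ == "__main__":
    for rack, n in sorted(nodes_per_rack().items()):
        flag = "" if n == 18 else "  <-- expected 18 compute trays"
        print(f"{rack}: {n} nodes{flag}")
```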

Additional Systems

The control plane integrates with additional infrastructure components:

  • BMS (Bare Metal Service) - Customer-provided API-compliant BMS*

  • NVLink Switches, InfiniBand Switches, and UFM

  • Ethernet Switches

  • NFS Storage and HFS Storage

Key Features

  • Supports both training and inference workloads

  • Provides centralized management for configuration and observability

  • Delivers a standardized, scalable, secure control plane for all NVIDIA GB200 NVL72 systems

Note: Components marked with an asterisk (*) are new in Mission Control 2.0.0.