Introduction#

NVIDIA Mission Control is an integrated software platform designed for managing and orchestrating large-scale GPU computing clusters. It provides comprehensive deployment, management, and monitoring capabilities for high-performance computing (HPC) and AI workloads across different system configurations.

Platform Overview#

Mission Control delivers a unified control plane that simplifies the complexity of managing GPU-accelerated infrastructure. The platform combines cluster orchestration, workload scheduling, and system monitoring into a cohesive management framework built on Kubernetes and industry-standard tools.

Key capabilities include:

  • Automated Deployment: Streamlined provisioning and configuration of compute resources

  • Workload Orchestration: Advanced scheduling and resource allocation for AI/ML and HPC jobs

  • System Monitoring: Real-time observability and health tracking across the infrastructure

  • Slurm Integration: Native support for Slurm workload manager with BCM-provisioned clusters

Supported System Configurations#

Mission Control supports multiple system architectures, each optimized for different scale and performance requirements:

DGX B200/B300 Systems#

Architecture Version: v2.2

Figure 1 Mission Control 2.2 Software Architecture – B200/300#

The DGX B200/B300 configuration represents NVIDIA’s high-density GPU computing platform, with a unified architecture supporting both B200 and B300 systems.

Admin Control Plane#

Head Nodes - x86 (x2)

Cluster deployment, management, and monitoring functionality with Base Command Manager (BCM):

  • GUI: Base Command View for visual cluster management

  • CLI: CMSH command-line interface

  • API: RESTful API for programmatic automation

  • Core BCM Functions:

    • OS Provisioning and Observability

    • Network Provisioning and FW Update

    • Rack & Inventory Management and Health Check

    • Leak Monitoring & Control and Power Profiles

    • Slurm Workflow SW integration

Admin Service Nodes - x86 (x3)

Kubernetes and BCM-integrated services providing:

BCM-Integrated Services:

  • Mission Control-autonomous hardware recovery: Automated hardware fault detection and recovery

  • Mission Control-autonomous job recovery: Automatic job restart and fault tolerance

  • Domain Power Services (DPS, Early Preview): Power management and optimization

  • Observability Stack: Comprehensive observability and monitoring

Common K8s Services:

  • Loki, Prometheus, Operators, and other Kubernetes ecosystem tools

  • BCM-Provisioned K8s cluster management

User Control Plane#

Slurm Nodes - x86 (x2)

User access to Slurm cluster with BCM-Provisioned Slurm and Slurm Submission SW.

User Service Nodes - x86 (x3)

Run:ai and other user-facing Kubernetes services forming the code-execution platform, featuring:

  • Run:ai: AI workload orchestration with Control Plane and Scheduler

  • Common K8s Services: GPU Operator, DRA, Network Operator

  • BCM-Provisioned K8s: Kubernetes cluster provisioned by BCM

Compute Plane#

DGX B200 - 1SU (32 DGX per SU)

The compute plane consists of DGX systems organized in superunits (SU), each containing 32 DGX units. Each DGX node runs:

  • Slurm worker: HPC job execution

  • k8s + run:ai worker: AI/ML workload execution with Run:ai orchestration

  • Multiple CPU and GPU units per DGX system

Additional Systems#

  • IB Switches & UFM: InfiniBand networking and Unified Fabric Manager

  • Ethernet Switches: Standard Ethernet connectivity

  • NFS Storage: Network File System for shared storage

  • HFS Storage: High-performance file system

Key Features (B200/B300)#

  • Unified architecture supporting both DGX B200 and DGX B300 systems*

  • Separated Admin and User Control Planes for enhanced security

  • Mission Control-autonomous hardware recovery and Mission Control-autonomous job recovery

  • Run:ai integration with dedicated Control Plane and Scheduler

  • Domain Power Services (DPS) for power management (Early Preview)

  • Comprehensive observability stack with Loki and Prometheus

*Note: Some features are currently available for B200 only; support on B300 is planned for a future release.

GB200/GB300 NVL72 Systems#

Architecture Version: v2.2

Figure 2 Mission Control 2.2 Software Architecture – GB200/300#

The GB200/GB300 configuration delivers advanced capabilities with separated admin and user control planes, featuring ARM64-based architecture for next-generation systems.

Admin Control Plane#

Head Nodes - Arm (x2)

Cluster deployment, management, and monitoring with Base Command Manager (BCM):

  • GUI: Base Command View

  • CLI: CMSH for ARM64 systems

  • API: RESTful API

  • Core Functions:

    • OS Provisioning and Observability

    • Network Provisioning and FW Update

    • Rack & Inventory Management and Health Check

    • Leak Monitoring & Control and Power Profiles

    • Slurm Workflow SW

Admin Service Nodes - Arm (x3)

Kubernetes and BCM-integrated services:

BCM-Integrated Services:

  • NetQ: Network monitoring and observability

  • Mission Control-autonomous hardware recovery: Hardware fault management

  • Mission Control-autonomous job recovery: Job fault tolerance

  • Domain Power Services (DPS, Early Preview): Power management and optimization

  • Observability Stack: Comprehensive observability and monitoring

Common K8s Services:

  • Loki, Prometheus, Operators, etc.

  • BCM-Provisioned K8s infrastructure

User Control Plane#

Slurm Nodes - Arm (x2)

User access to Slurm cluster with BCM-Provisioned Slurm and Slurm Submission SW.

User Service Nodes - Arm (x3)

Kubernetes and BCM-integrated services for user workloads:

  • Run:ai: Control Plane and Scheduler for AI workload orchestration (New in Mission Control 2.2)

  • Common K8s Services: GPU Operator, DRA, Network Operator

  • BCM-Provisioned K8s: User-space Kubernetes cluster

Compute Plane#

GB200 Rack (8 Racks per SU)

The compute plane consists of GB200/GB300 racks, with 8 racks per superunit (SU) and compute nodes organized into compute trays. Each compute tray contains:

  • Slurm worker: HPC job execution

  • k8s + run:ai worker: AI/ML workload execution

  • CPU and GPU units: Combined CPU and GPU compute resources

Execution Hosts are organized in compute trays for optimal resource allocation.

Additional Systems#

  • BMS: Customer-provided API-compliant Building Management System

  • NVLink Switches: High-speed GPU interconnect

  • IB Switches & UFM: InfiniBand networking

  • Ethernet Switches: Standard networking

  • NFS Storage: Network file storage

  • HFS Storage: High-performance storage

Key Features (GB200/GB300)#

  • Native ARM64 architecture for head nodes and service nodes

  • Separated admin and user control planes for enhanced security and isolation

  • NetQ network monitoring and observability

  • Run:ai with dedicated control plane and scheduler

  • Mission Control-autonomous hardware recovery and Mission Control-autonomous job recovery

  • Customer-provided BMS integration support

  • NVLink switches for high-speed GPU interconnect

  • Compute tray organization for optimal resource management

Architecture Components#

Base Command Manager (BCM)#

The Base Command Manager serves as the central management interface for Mission Control, providing:

  • GUI: Web-based Base Command View for visual cluster management

  • CLI: CMSH command-line interface on both x86 and Arm64 systems

  • API: RESTful API for programmatic automation
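
For programmatic automation, cluster state can be queried over the REST API from any host that can reach the head nodes. The sketch below is a minimal illustration in Python, assuming a head node reachable at headnode-01 and certificate-based authentication already configured; the port, endpoint path, and response fields are placeholders rather than the documented BCM API, so consult the BCM API reference for the actual schema.

```python
import requests

# Illustrative only: the base URL, path, and response shape below are
# placeholders, not the documented BCM REST API. Check the BCM API
# reference for the real endpoints and authentication scheme.
BCM_HEAD_NODE = "https://headnode-01:8081"
CLIENT_CERT = ("/path/to/client.pem", "/path/to/client.key")  # assumed client-cert auth
CA_BUNDLE = "/path/to/bcm-ca.pem"                              # placeholder CA bundle

def list_node_status():
    """Ask the head node for per-device status (hypothetical endpoint)."""
    resp = requests.get(
        f"{BCM_HEAD_NODE}/rest/v1/devices",
        cert=CLIENT_CERT,
        verify=CA_BUNDLE,
        timeout=30,
    )
    resp.raise_for_status()
    for device in resp.json().get("devices", []):
        print(device.get("hostname"), device.get("status"))

if __name__ == "__main__":
    list_node_status()
```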

Core Functions#

  • OS provisioning and firmware updates

  • Network provisioning and health checks

  • Inventory management and power profiles

  • Observability and monitoring integration

  • Leak monitoring and control

  • Slurm workflow software integration

Run:ai Orchestration#

Run:ai provides advanced AI workload management with:

  • Intelligent scheduling and resource allocation

  • GPU sharing and fractionalization

  • Workload prioritization and fairness policies

  • Integration with common machine learning frameworks

  • Dedicated control plane and scheduler (v2.2)
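
As an illustration of how Run:ai sits on top of the user Kubernetes cluster, the sketch below creates a single-GPU pod that is handed to the Run:ai scheduler instead of the default Kubernetes scheduler. The scheduler name, namespace, project label, and image are assumptions; in practice most users submit through the Run:ai CLI or UI, and the project-to-namespace mapping depends on how Run:ai was configured on the user service nodes.

```python
from kubernetes import client, config

# Minimal sketch: a single-GPU pod routed to the Run:ai scheduler.
# The namespace and project label are assumptions; verify them against the
# Run:ai version deployed on your user service nodes.
config.load_kube_config()  # kubeconfig pointing at the user K8s cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="demo-training-job",
        namespace="runai-team-a",        # assumed project namespace
        labels={"project": "team-a"},    # assumed project label
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",  # hand scheduling to Run:ai
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/pytorch:24.01-py3",
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # one full GPU
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="runai-team-a", body=pod)
```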

Kubernetes Infrastructure#

Mission Control leverages Kubernetes for:

  • Container orchestration across control and compute planes

  • Service discovery and load balancing

  • Declarative configuration management

  • BCM-provisioned and common Kubernetes services (Loki, Prometheus, network operators)
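
A quick way to confirm these services are running on a BCM-provisioned cluster is to list their pods with the Kubernetes Python client, as in the sketch below. The namespace names are assumptions (common choices for the monitoring stack and the operators); adjust them to match how the cluster was provisioned.

```python
from kubernetes import client, config
from kubernetes.client.rest import ApiException

# Sanity check: list the pods behind the common services named above.
# Namespace names here are assumptions, not fixed by Mission Control.
config.load_kube_config()
core = client.CoreV1Api()

for ns in ("monitoring", "gpu-operator", "network-operator"):
    try:
        pods = core.list_namespaced_pod(namespace=ns)
    except ApiException:
        print(f"{ns}: namespace not found on this cluster")
        continue
    for pod in pods.items:
        print(f"{ns:20s} {pod.metadata.name:55s} {pod.status.phase}")
```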

Slurm Workload Manager#

BCM-provisioned Slurm integration enables:

  • Traditional HPC job scheduling

  • Resource allocation and queue management

  • Integration with existing HPC workflows

  • Slurm submission software for user access
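
The sketch below shows one way a user might submit a multi-node GPU job to the BCM-provisioned Slurm cluster, driving sbatch from Python on a Slurm node. The partition name, GPU count per node, and launch command are assumptions to adapt to your site configuration.

```python
import subprocess

# Minimal sketch: submit a two-node, 16-GPU job. The partition "defq",
# the per-node GPU count, and "train.py" are placeholders for your site.
batch_script = """\
#!/bin/bash
#SBATCH --job-name=demo-training
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --time=01:00:00

srun python train.py
"""

# sbatch reads the job script from stdin when no file argument is given.
result = subprocess.run(
    ["sbatch"], input=batch_script, text=True, capture_output=True, check=True
)
print(result.stdout.strip())  # e.g. "Submitted batch job 12345"
```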

Network and Storage Systems#

Mission Control integrates with enterprise infrastructure:

Networking:

  • InfiniBand switches and UFM for high-performance interconnect

  • Ethernet switches for standard networking

  • NVLink switches (GB200/GB300) for GPU-to-GPU communication

Storage:

  • NFS storage for shared filesystems

  • HFS storage for high-performance workloads

Autonomous Resiliency Engine (ARE)#

Mission Control-autonomous hardware recovery#

Automated detection and recovery from hardware failures:

  • Continuous health monitoring

  • Automatic fault isolation

  • Self-healing infrastructure capabilities

Mission Control-autonomous job recovery#

Intelligent job fault tolerance:

  • Automatic checkpoint and restart

  • Job state preservation

  • Minimal user intervention required

Observability Stack#

Comprehensive monitoring and observability:

  • Prometheus: Metrics collection and alerting

  • Loki: Log aggregation and querying

  • NetQ: Network monitoring and observability (GB200/GB300)

  • Real-time health tracking across infrastructure

  • Performance analytics and troubleshooting
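
For example, per-node GPU utilization can be pulled from Prometheus over its standard HTTP query API, as sketched below. The service URL and the DCGM_FI_DEV_GPU_UTIL metric (exposed by the DCGM exporter commonly deployed alongside the GPU Operator) are assumptions to verify in your cluster.

```python
import requests

# Minimal sketch: average GPU utilization per host via the Prometheus
# HTTP query API. URL and metric name are assumptions for your deployment.
PROMETHEUS_URL = "http://prometheus.monitoring.svc.cluster.local:9090"

def gpu_utilization():
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)"},
        timeout=30,
    )
    resp.raise_for_status()
    for sample in resp.json()["data"]["result"]:
        host = sample["metric"].get("Hostname", "unknown")
        value = sample["value"][1]
        print(f"{host}: {value}% average GPU utilization")

if __name__ == "__main__":
    gpu_utilization()
```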

Version Information#

This documentation covers Mission Control version 2.2, supporting the latest DGX B200/B300 and NVIDIA GB200/GB300 NVL72 system architectures.

Getting Started#

To begin using Mission Control:

  • Review the system requirements for your target architecture (B200/B300 or GB200/GB300)

  • Follow the installation procedures specific to your configuration

  • Configure the Base Command Manager for your cluster topology

  • Provision compute nodes and verify system health (a quick health-check sketch follows this list)

  • Set up user access through Slurm or Kubernetes interfaces
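
As a sketch of the health-verification step, the snippet below asks Slurm for the state of every compute node; nodes reported as down, drained, or failed warrant investigation before opening the cluster to users. The sinfo options used are standard Slurm, but running the script assumes you are on a Slurm node with the cluster already provisioned.

```python
import subprocess

# Minimal sketch: list every compute node and its Slurm state after
# provisioning, and flag anything that is not healthy.
result = subprocess.run(
    ["sinfo", "-N", "-h", "-o", "%N %t"],
    text=True, capture_output=True, check=True,
)

unhealthy = []
for line in result.stdout.splitlines():
    node, state = line.split()
    # States such as "down", "drain", or "fail" indicate nodes to inspect.
    if any(flag in state for flag in ("down", "drain", "fail")):
        unhealthy.append((node, state))

print(f"{len(unhealthy)} node(s) need attention: {unhealthy}")
```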

The following chapters provide detailed instructions for installation, configuration, and operation of Mission Control across all supported system architectures.

Important Notes#

  • GB200/GB300 systems require customer-provided API-compliant BMS

  • All systems support both Slurm and Kubernetes-based workload submission

  • ARM64 architecture is native for GB200/GB300 control planes

  • Some features for B300 are planned for future releases