Introduction#

NVIDIA Mission Control is an integrated software platform designed for managing and orchestrating large-scale GPU computing clusters. It provides comprehensive deployment, management, and monitoring capabilities for high-performance computing (HPC) and AI workloads across different system configurations.

Platform Overview#

Mission Control delivers a unified control plane that reduces the complexity of managing GPU-accelerated infrastructure. The platform combines cluster orchestration, workload scheduling, and system monitoring into a cohesive management framework built on Kubernetes and industry-standard tools.

Key capabilities include:

  • Automated Deployment: Streamlined provisioning and configuration of compute resources

  • Workload Orchestration: Advanced scheduling and resource allocation for AI/ML and HPC jobs

  • System Monitoring: Real-time observability and health tracking across the infrastructure

  • Slurm Integration: Native support for Slurm workload manager with BCM-provisioned clusters

Supported System Configurations#

Mission Control supports multiple system architectures, each optimized for different scale and performance requirements:

DGX B300 Systems#

Architecture Version: v2.1.0

Figure 1 Mission Control Software Architecture – B300 (v2.1.0)#

The DGX B300 configuration represents NVIDIA’s high-density GPU computing platform, featuring:

  • Compute Plane: DGX B300 - 1SU systems with 64-72 DGX units per superunit

  • Control Plane:

    • BCM Head Nodes (x86) - x2 nodes for cluster management

    • Run:AI Management Nodes (x86) - x3 nodes for AI workload orchestration

    • Admin Kubernetes Nodes (x86) - x3 nodes for infrastructure services

  • User Access: Slurm Nodes (x86) - x2 nodes for job submission

  • Key Features:

    • Run:AI integration for AI workload management

    • Autonomous Job Recovery (AJR) and Autonomous Hardware Recovery (AHR) capabilities

    • NetQ network monitoring and observability stack

DGX B200 Systems#

Architecture Version: v2.0.0

Figure 2 Mission Control Software Architecture – B200 (v2.0.0)#

The DGX B200 configuration provides a balanced architecture for production AI workloads:

  • Compute Plane: DGX B200 - 1SU systems with 32 DGX units per superunit

  • Control Plane:

    • BCM Head Nodes (x86) - x2 nodes

    • Run:AI Management Nodes (x86) - x3 nodes

  • User Access: Slurm Nodes (x86) - x2 nodes

  • Key Features:

    • CMSH CLI interface for Base Command Manager

    • Streamlined architecture focused on core orchestration capabilities

NVIDIA GB300 NVL72 Systems#

Architecture Version: v2.1.0

Figure 3 Mission Control Software Architecture – GB300 (v2.1.0)#

The GB300 configuration introduces Arm64-based control infrastructure for next-generation systems:

  • Compute Plane: GB300 Rack with 8 racks per superunit, organized in compute trays

  • Control Plane:

    • BCM Head Nodes (Arm64) - x2 nodes

    • Admin Kubernetes Nodes (x86) - x3 nodes

  • User Access:

    • Slurm Nodes (Arm64) - x2 nodes

    • User Kubernetes Nodes (Arm64) - x3 nodes for Run:AI and user workloads

  • Key Features:

    • Hybrid Arm64/x86 architecture

    • CMSH CLI interface for Base Command Manager

    • NVLink switches for high-speed interconnect

    • Customer-provided BMS (Building Management System) support

    • Leak monitoring and control capabilities

NVIDIA GB200 NVL72 Systems#

Architecture Version: v2.0.0

Figure 4 Mission Control Software Architecture – GB200 (v2.0.0)#

The GB200 configuration delivers advanced capabilities with separated control and user planes:

  • Compute Plane: GB200 Rack with 8 racks per superunit, organized in compute trays

  • Admin Control Plane:

    • Head Nodes (x86) - x2 nodes

    • Admin Service Nodes (x86) - x3 nodes with AHR, AJR, NMX Manager, and Observability Stack

  • User Control Plane:

    • Slurm Nodes (Arm64) - x2 nodes

    • User Service Nodes (x86) - x3 nodes with Run:AI orchestration

  • Key Features:

    • Separated admin and user control planes for enhanced security and isolation

    • Autonomous Hardware Recovery (AHR) and Autonomous Job Recovery (AJR)

    • NMX Manager for network fabric management

    • Customer-provided BMS integration

Architecture Components#

Base Command Manager (BCM)#

The Base Command Manager serves as the central management interface for Mission Control, providing:

  • GUI: Web-based Base Command View for visual cluster management

  • CLI: CMSH command-line shell, available on both x86 and Arm64 systems

  • API: RESTful API for programmatic automation

  • Core Functions:

    • OS provisioning and firmware updates

    • Network provisioning and health checks

    • Inventory management and power profiles

    • Observability and monitoring integration
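As an illustration of the CLI, a brief CMSH session might look like the following. This is a sketch only; the prompt and node names are hypothetical, and the exact commands available depend on your BCM release.

```shell
# Enter the CMSH shell on a BCM head node (session sketch; names are hypothetical)
$ cmsh
[headnode]% device
[headnode->device]% list      # show all managed nodes and their categories
[headnode->device]% status    # report power and provisioning state per node
[headnode->device]% quit
```

The same operations are also exposed through Base Command View and the RESTful API for automation.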

Run:AI Orchestration#

Run:AI provides advanced AI workload management with:

  • Intelligent scheduling and resource allocation

  • GPU sharing and fractionalization

  • Workload prioritization and fairness policies

  • Integration with common machine learning frameworks
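For example, GPU fractionalization can be requested at submission time through the Run:AI CLI. The sketch below uses placeholder project and image names, and flag spellings may differ between Run:AI releases; consult the Run:AI CLI reference for your version.

```shell
# Submit a workload that shares a GPU (project and image names are placeholders)
runai submit fractional-demo \
  --project team-a \
  --image nvcr.io/nvidia/pytorch:24.01-py3 \
  --gpu 0.5                                  # request half of one GPU

# Check scheduling status of the submitted workload
runai describe job fractional-demo --project team-a
```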

Kubernetes Infrastructure#

Mission Control leverages Kubernetes for:

  • Container orchestration across control and compute planes

  • Service discovery and load balancing

  • Declarative configuration management

  • BCM-provisioned and common Kubernetes services (Loki, Prometheus, network operators)
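As a sketch, the BCM-provisioned monitoring services can be inspected with standard kubectl commands. The namespace and label selectors below are assumptions for illustration and may differ in your deployment.

```shell
# Verify observability components are running ("monitoring" namespace is an assumption)
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus
kubectl get pods -n monitoring -l app.kubernetes.io/name=loki

# List service endpoints exposed by the observability stack
kubectl get svc -n monitoring
```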

Slurm Workload Manager#

BCM-provisioned Slurm integration enables:

  • Traditional HPC job scheduling

  • Resource allocation and queue management

  • Integration with existing HPC workflows

  • Slurm submission software for user access
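A minimal Slurm batch script for a GPU job might look like the following. Partition names and resource values are illustrative and should match your cluster's configuration.

```shell
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=defq        # partition name is cluster-specific
#SBATCH --nodes=1
#SBATCH --gres=gpu:8            # request all GPUs on one DGX node
#SBATCH --time=00:10:00

srun nvidia-smi                 # confirm GPU visibility on the allocated node
```

Submit the script with `sbatch job.sh` and monitor it with `squeue -u $USER`.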

Network and Storage Systems#

Mission Control integrates with enterprise infrastructure:

  • Networking: InfiniBand switches and UFM, Ethernet switches, and NVLink switches (GB200 and GB300 systems)

  • Storage: NFS storage for shared filesystems, HFS storage for high-performance workloads

Version Information#

This documentation covers Mission Control versions 2.0.0 and 2.1.0, supporting the latest DGX B300, DGX B200, NVIDIA GB300 NVL72, and NVIDIA GB200 NVL72 system architectures. New features introduced in version 2.1.0 include enhanced observability capabilities, network monitoring with NetQ, and expanded Arm64 support.

Getting Started#

To begin using Mission Control:

  1. Review the system requirements for your target architecture (B300, B200, GB300, or GB200)

  2. Follow the installation procedures specific to your configuration

  3. Configure the Base Command Manager for your cluster topology

  4. Provision compute nodes and verify system health

  5. Set up user access through Slurm or Kubernetes interfaces

The following chapters provide detailed instructions for installation, configuration, and operation of Mission Control across all supported system architectures.