Overview#

This guide provides system administrators with the information needed to manage and maintain NVIDIA Mission Control clusters across different system architectures. It covers deployment, configuration, monitoring, and troubleshooting procedures for DGX B300, DGX B200, DGX GB300, and DGX GB200 systems.

Target Audience#

This guide is intended for:

System administrators responsible for deploying and managing Mission Control clusters
Infrastructure engineers configuring GPU computing environments
Operations teams monitoring and maintaining production systems
Technical staff performing system updates and troubleshooting

Prerequisites#

Before using this guide, you should have:

Familiarity with Linux system administration
Basic understanding of Kubernetes concepts and container orchestration
Knowledge of networking fundamentals (Ethernet, InfiniBand)
Experience with storage systems (NFS, HSS)
Understanding of your specific system architecture (B300, B200, GB300, or GB200)

System Design#

The following diagram shows the logical design of the DGX SuperPOD:

The components shown in the diagram are described below:

Table 1 Component Descriptions#
DGX SuperPOD Component	Description
User Jumphost	The User Jumphost is the gateway into the DGX SuperPOD. It provides a single entry point into the cluster and additional security when required. It is not part of the DGX SuperPOD itself but of the corporate IT environment, and its function is defined and provided according to local IT requirements.
Admin Jumphost	The Admin Jumphost is the gateway into the DGX SuperPOD for administrators. It provides a single entry point into the cluster and additional security when required. It is not part of the DGX SuperPOD itself but of the corporate IT environment. Its function is defined and provided according to local IT requirements, and it might be the same host as the User Jumphost.
DGX Nodes / Compute Trays	The compute trays are where the user work gets done on the system. For DGX B-series systems (B300, B200), each DGX unit is a traditional GPU server in a standard rack configuration. For GB-series systems (GB300, GB200), compute resources are organized as compute trays within NVL72 racks, with each tray containing integrated CPU/GPU units.
Management Nodes	The management nodes provide the services necessary to support operation and monitoring of the DGX SuperPOD. Services, configured in high availability (HA) mode where needed, provide the highest system availability. See the Management Servers section below for details of each node and its function.
High-Speed Storage	High-speed storage (HSS) provides shared storage to all nodes in the DGX SuperPOD. This is where datasets, checkpoints, and other large files should be stored. High-speed storage typically holds large datasets that are being actively operated on by the DGX SuperPOD jobs. Data on the high-speed storage is a subset of all data housed in a data lake outside of the DGX SuperPOD.
Home Storage	Shared storage on a network file system (NFS) is allocated for user home directories as well as for cluster services.
InfiniBand Fabric Compute	The Compute InfiniBand Fabric is the high-speed network fabric connecting all compute nodes together to allow high-bandwidth and low-latency communication between nodes and racks.
InfiniBand Fabric Storage	The Storage InfiniBand Fabric is the high-speed network fabric dedicated to storage traffic. Storage traffic runs on its own fabric so that it does not interfere with node-to-node application traffic, which would degrade overall performance.
In-Band Network Fabric	The In-band Network Fabric provides fast Ethernet connectivity between all nodes in the DGX SuperPOD. It is used for TCP/IP-based communication and for provisioning and in-band management services.
Out-of-Band Network Fabric	The out-of-band Ethernet network is used for system management using the BMC and provides connectivity to manage all networking equipment.
NVLink	NVIDIA NVLink is a high-speed interconnect that allows multiple GPUs to communicate directly. Multi-Node NVLink is a capability enabled over an NVLink Switch network where multiple systems are interconnected to form a large GPU memory fabric also known as an NVLink Domain. Available on GB300 and GB200 systems.

Management Servers#

The following describes the function and services running on the management servers:

Table 2 DGX SuperPOD Management Servers#
Server Function	Services
Head Node	Head nodes serve three functions. Provisioning: centrally store and deploy OS images for the compute nodes, management nodes, and other services. This ensures that there is a single authoritative source defining what should be on each node, and a way to reprovision a node that needs to be reimaged. Workload Management: resource management and orchestration services that organize the resources and coordinate the scheduling of user jobs across the cluster. Metrics: system monitoring and reporting that gathers telemetry from each node. The data can be explored and analyzed through web services to gain better insight into the system.
Login/Slurm Nodes	Entry point to the DGX SuperPOD for users. CPU-based nodes that are Slurm clients with filesystems mounted to support development, job submission, job monitoring, and file management. Multiple nodes are included for redundancy and supporting user workloads. These hosts can also be used for container caching.
UFM Appliance	NVIDIA Unified Fabric Manager (UFM) for both storage and compute InfiniBand fabric. Manages InfiniBand switches and fabric topology.
NVLink Management Software	NVLink Management Software (NetQ) is an integrated platform for management and monitoring of NVLink connections. Available on GB300 and GB200 systems.
Admin/User Service Nodes	Kubernetes control plane nodes that host infrastructure services (admin space) and user workload orchestration services (user space). Configuration varies by architecture - see architecture-specific sections below.

Mission Control Architecture Overview#

Mission Control consists of three primary operational planes:

Control Plane#

The control plane manages cluster operations and provides administrative interfaces. It includes:

Base Command Manager (BCM): Central management platform with GUI, CLI, and API interfaces
Kubernetes Infrastructure: Container orchestration for services and workloads
Management Services: Monitoring, observability, health checking, and automated recovery

Architecture-Specific Notes:

B200: x86-based BCM Head Nodes, Admin Kubernetes Nodes
B300: x86-based BCM Head Nodes and Run:AI Management Nodes
GB300: Arm64-based BCM Head Nodes with x86 Admin Kubernetes Nodes
GB200: Separated Admin Control Plane (x86) and User Control Plane (Arm64/x86 hybrid)

User Access Plane#

The user access plane provides interfaces for job submission and workload management:

Slurm Nodes: Traditional HPC workload submission via Slurm workload manager
User Kubernetes Nodes: Direct access to Kubernetes for containerized workloads (GB300/GB200)
Run:AI Interface: AI-specific workload orchestration and GPU resource management

Architecture-Specific Notes:

B300/B200: x86-based Slurm Nodes for user access
GB300: Arm64-based Slurm Nodes and dedicated User Kubernetes Nodes
GB200: Arm64 Slurm Nodes with x86 User Service Nodes running Run:AI

Compute Plane#

The compute plane executes workloads on GPU-accelerated resources:

B300: 1SU configuration with 64 units (Spectrum Ethernet) or 72 units (InfiniBand) per scalable unit (SU)
B200: 1SU configuration with 32 units per SU
GB300: Rack-based configuration with 8 racks per SU, organized in compute trays
GB200: Rack-based configuration with 8 racks per SU, organized in compute trays

Each compute node runs Slurm worker processes and provides GPU resources to scheduled workloads.
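The per-SU figures above lend themselves to a quick sizing calculation. The following is a minimal sketch, not a supported tool: the dictionary keys and the `total_units` helper are hypothetical names, and multiplying by an SU count assumes a deployment scales in whole SUs, which this overview does not state explicitly.

```python
# Units per scalable unit (SU), copied from the Compute Plane list above.
# B-series entries count DGX systems; GB-series entries count NVL72 racks.
UNITS_PER_SU = {
    "B300-InfiniBand": 72,
    "B300-Ethernet": 64,
    "B200": 32,
    "GB300": 8,
    "GB200": 8,
}


def total_units(architecture: str, su_count: int) -> int:
    """Return the number of DGX systems (or racks) in `su_count` SUs.

    Assumption: a cluster is built from whole SUs of a single architecture.
    """
    if su_count < 1:
        raise ValueError("su_count must be at least 1")
    return UNITS_PER_SU[architecture] * su_count


print(total_units("B200", 1))             # 32 DGX systems
print(total_units("B300-InfiniBand", 2))  # 144 DGX systems
print(total_units("GB300", 3))            # 24 racks
```

Keeping the B300 InfiniBand and Ethernet variants as separate keys avoids silently picking one of the two SU sizes.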

Key Components by Architecture#

DGX B200 (v2.2.0)#

Control Infrastructure:

BCM Head Nodes (x86) - x2: Cluster management with GUI, CLI (CMSH), and API interfaces
Admin Kubernetes Nodes (x86) - x3: BCM-integrated infrastructure services including Observability Stack, Autonomous Hardware Recovery (AHR), and Autonomous Job Recovery (AJR)

User Access:

Slurm Nodes (x86) - x2: Job submission interface with BCM-provisioned Slurm software
User Kubernetes Nodes (x86) - x3: Run:AI orchestration (control plane and scheduler), common Kubernetes services (GPU Operator, Loki, Network Operator), and user workloads

Key Characteristics:

CMSH CLI for Base Command Manager operations
Balanced configuration for production AI workloads
BCM-provisioned Kubernetes and Slurm infrastructure
Separated admin and user control planes for enhanced security and operational isolation
Admin plane (x86) handles all infrastructure services and system management
User plane (x86) provides workload submission and AI orchestration
Autonomous Hardware Recovery (AHR) and Autonomous Job Recovery (AJR) on admin plane

DGX B300 (v2.1.0)#

Control Infrastructure:

BCM Head Nodes (x86) - x2: Cluster deployment, management, and monitoring
Run:AI Management Nodes (x86) - x3: AI workload orchestration with integrated Kubernetes and common services
Admin Kubernetes Nodes (x86) - x3: BCM-integrated infrastructure services including AHR, AJR, NetQ, and Observability Stack

User Access:

Slurm Nodes (x86) - x2: Job submission interface with BCM-provisioned Slurm software

Compute:

B300 - 1SU: High-density GPU compute with either 72 units per SU (InfiniBand configuration) or 64 units per SU (Spectrum Ethernet configuration)

New Features in v2.1.0:

NetQ network monitoring integration
Enhanced Observability Stack for comprehensive system monitoring
Autonomous Hardware Recovery (AHR) and Autonomous Job Recovery (AJR) capabilities
Separate Admin Kubernetes Nodes for infrastructure services

DGX B200 (v2.0.0)#

Control Infrastructure:

BCM Head Nodes (x86) - x2: Cluster management with GUI, CLI (CMSH), and API interfaces
Run:AI Management Nodes (x86) - x3: AI workload scheduling with integrated Kubernetes services including Run:AI control plane, scheduler, and common Kubernetes services (GPU Operator, Network Operator)

User Access:

Slurm Nodes (x86) - x2: Job submission interface with BCM-provisioned Slurm software

Compute:

B200 - 1SU: Production-scale GPU compute with 32 units per SU

Key Characteristics:

Streamlined architecture focused on core orchestration without separate admin infrastructure nodes
CMSH CLI for Base Command Manager operations
Balanced configuration for production AI workloads
Run:AI Management Nodes serve dual purpose: AI orchestration and common Kubernetes services
BCM-provisioned Kubernetes and Slurm infrastructure

DGX GB300 (v2.1.0)#

Control Infrastructure:

BCM Head Nodes (Arm64) - x2: Cluster management with CMSH CLI on Arm64 architecture
Admin Kubernetes Nodes (x86) - x3: BCM-integrated services including AHR, AJR, NetQ, NMX, and Observability Stack

User Access:

Slurm Nodes (Arm64) - x2: Job submission interface on Arm64 architecture
User Kubernetes Nodes (Arm64) - x3: Run:AI orchestration and user-space Kubernetes workloads on Arm64 architecture

Compute:

GB300 Rack: 8 racks per SU organized in compute trays with CPU/GPU units in NVL72 configuration

Key Characteristics:

Hybrid Arm64/x86 architecture: Arm64 for BCM head nodes and user access, x86 for admin infrastructure services
NVLink switches for high-speed GPU interconnect within and across racks
Customer-provided BMS integration via API-compliant interface
Advanced leak monitoring and control capabilities
Rack and inventory management for NVL72 systems
NetQ for comprehensive network fabric monitoring

DGX GB200 (v2.0.0)#

Admin Control Plane:

Head Nodes (x86) - x2: Cluster deployment and management
Admin Service Nodes (x86) - x3: BCM-integrated infrastructure services including NetQ, Observability Stack, Autonomous Hardware Recovery (AHR), and Autonomous Job Recovery (AJR)

User Control Plane:

Slurm Nodes (Arm64) - x2: Job submission interface on Arm64 architecture
User Service Nodes (x86) - x3: Run:AI orchestration (control plane and scheduler), common Kubernetes services (GPU Operator, DRA, Network Operator), and user workloads

Compute:

GB200 Rack: 8 racks per SU in compute tray configuration with NVL72 architecture

Key Characteristics:

Separated admin and user control planes for enhanced security and operational isolation
Admin plane (x86) handles all infrastructure services and system management
User plane (Arm64 + x86) provides workload submission and AI orchestration
Autonomous Hardware Recovery (AHR) and Autonomous Job Recovery (AJR) on admin plane
NetQ for NVLink fabric management and monitoring
NVLink switches for high-speed GPU communication within compute trays
Customer-provided BMS integration via API-compliant interface
Advanced leak monitoring and control capabilities
Rack and inventory management for NVL72 systems
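The GB-series node roles listed above mix Arm64 and x86 hosts, which matters when planning OS images and container builds. As a rough illustration, the roles can be captured as data and tallied per CPU architecture. The `NodeGroup` type, the `MANAGEMENT_NODES` table, and `nodes_per_arch` are hypothetical names introduced for this sketch; the counts are the ones given in this section.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    role: str
    cpu_arch: str  # "arm64" or "x86", as named in this guide
    count: int


# Control and user-access nodes per system, transcribed from the lists above.
MANAGEMENT_NODES = {
    "GB300 (v2.1.0)": [
        NodeGroup("BCM Head Nodes", "arm64", 2),
        NodeGroup("Admin Kubernetes Nodes", "x86", 3),
        NodeGroup("Slurm Nodes", "arm64", 2),
        NodeGroup("User Kubernetes Nodes", "arm64", 3),
    ],
    "GB200 (v2.0.0)": [
        NodeGroup("Head Nodes", "x86", 2),
        NodeGroup("Admin Service Nodes", "x86", 3),
        NodeGroup("Slurm Nodes", "arm64", 2),
        NodeGroup("User Service Nodes", "x86", 3),
    ],
}


def nodes_per_arch(system: str) -> Counter:
    """Count non-compute nodes of each CPU architecture for one system."""
    totals = Counter()
    for group in MANAGEMENT_NODES[system]:
        totals[group.cpu_arch] += group.count
    return totals


print(dict(nodes_per_arch("GB300 (v2.1.0)")))  # {'arm64': 7, 'x86': 3}
print(dict(nodes_per_arch("GB200 (v2.0.0)")))  # {'x86': 8, 'arm64': 2}
```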

Infrastructure Services#

Base Command Manager (BCM)#

BCM provides comprehensive cluster management capabilities:

Management Interfaces:

GUI: Web-based Base Command View for visual cluster administration
CLI: Command-line interface (CMSH) for scripting and automation
API: RESTful API for programmatic integration

Core Capabilities:

OS provisioning and deployment
Firmware and software updates
Network provisioning and configuration
Health checking and monitoring
Inventory management
Power profile management
Integration with observability tools

Architecture Notes:

Available on all architectures with architecture-specific CLI variants
x86-based on B300/B200 and GB200 admin plane
Arm64-based on GB300 head nodes

Kubernetes Services#

Mission Control leverages Kubernetes for service orchestration:

BCM-Provisioned Kubernetes:

All architectures include BCM-managed Kubernetes infrastructure for system services.

Common Kubernetes Services:

GPU Operator for GPU resource management
Network Operator for network fabric configuration
Loki for log aggregation
Prometheus for metrics collection
Additional operators as needed for the specific architecture

Architecture-Specific Deployment:

B300: Run:AI Management Nodes and Admin Kubernetes Nodes
B200: Admin Kubernetes Nodes (x86) and User Kubernetes Nodes (x86)
GB300: Admin Kubernetes Nodes (x86) and User Kubernetes Nodes (Arm64)
GB200: Admin Service Nodes and User Service Nodes with separated control
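On the hybrid systems, a container image built for one CPU architecture will not run on nodes of the other, so workloads may need to be pinned by architecture. A minimal sketch follows, assuming the nodes carry the standard `kubernetes.io/arch` label (values `amd64` and `arm64`); the `pod_manifest` helper, pod name, and image reference are hypothetical.

```python
import json

# Well-known Kubernetes node label and its values for the two CPU
# architectures named in this guide ("x86" corresponds to "amd64").
ARCH_LABEL = "kubernetes.io/arch"
ARCH_VALUES = {"x86": "amd64", "Arm64": "arm64"}


def pod_manifest(name: str, image: str, cpu_arch: str) -> dict:
    """Return a Pod manifest pinned to nodes of the given CPU architecture."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": {ARCH_LABEL: ARCH_VALUES[cpu_arch]},
            "containers": [{"name": name, "image": image}],
        },
    }


# Hypothetical image name; substitute one built for the target architecture.
manifest = pod_manifest("arch-demo", "registry.example.com/tools/demo:latest", "Arm64")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, which accepts JSON as well as YAML manifests.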

Run:AI Orchestration#

Run:AI provides AI-specific workload management:

Capabilities:

Control plane for AI workload scheduling
Intelligent GPU resource allocation
Scheduler for job prioritization and fairness
Integration with common K8s services
Workload monitoring and optimization

Deployment:

Runs on dedicated Run:AI Management Nodes (B300, B200 v2.0.0)
Integrated into User Kubernetes Nodes (B200 v2.2.0, GB300)
Deployed on User Service Nodes (GB200)

Slurm Workload Manager#

BCM-provisioned Slurm enables traditional HPC workflows:

Features:

Job scheduling and resource allocation
Queue management
Integration with existing HPC environments
Slurm submission software on dedicated Slurm nodes

Architecture Deployment:

x86 Slurm Nodes on B300/B200
Arm64 Slurm Nodes on GB300/GB200
BCM-provisioned Slurm workflow software
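Jobs are typically submitted from the Slurm nodes as batch scripts. The sketch below generates such a script from Python using standard `sbatch` directives (`--job-name`, `--nodes`, `--gpus-per-node`, `--time`). The `sbatch_script` helper, the job name, and the `train.py` command are hypothetical, and whether `--gpus-per-node` is the right way to request GPUs depends on how Slurm is configured on your cluster.

```python
def sbatch_script(
    job_name: str, nodes: int, gpus_per_node: int, time_limit: str, command: str
) -> str:
    """Return the text of a Slurm batch script for a multi-node GPU job."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        # srun launches the command inside the allocation made by sbatch.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: 2 nodes, 8 GPUs each, 1 hour limit.
print(sbatch_script("demo-train", 2, 8, "01:00:00", "python train.py"))
```

Write the returned text to a file on a Slurm node and submit it with `sbatch <file>`.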

Advanced Features#

Autonomous Hardware Recovery (AHR)#

AHR provides automated hardware fault detection and recovery:

Capabilities:

Continuous hardware health monitoring
Automatic fault detection and isolation
Self-healing capabilities for recoverable issues
Integration with BCM for administrative actions

Availability:

B200 (v2.2.0): Available on Admin Kubernetes Nodes
B300 (v2.1.0): Available on Admin Kubernetes Nodes
GB300 (v2.1.0): Available on Admin Kubernetes Nodes
GB200 (v2.0.0): Available on Admin Service Nodes
B200 (v2.0.0): Not available in this architecture

Autonomous Job Recovery (AJR)#

AJR enables automatic job restart and recovery:

Capabilities:

Job state monitoring
Automatic job restart on recoverable failures
Checkpoint and restart support
Integration with Slurm and Kubernetes schedulers

Availability:

B200 (v2.2.0): Available on Admin Kubernetes Nodes
B300 (v2.1.0): Available on Admin Kubernetes Nodes
GB300 (v2.1.0): Available on Admin Kubernetes Nodes
GB200 (v2.0.0): Available on Admin Service Nodes
B200 (v2.0.0): Not available in this architecture
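Automatic restart is only useful if the job can resume from saved state. The sketch below illustrates the application side of checkpoint and restart in general terms; it is not the AJR implementation or its API. The file layout, the function names, and the `fail_at` parameter (used only to simulate a failure) are all hypothetical.

```python
import json
import os


def load_step(path: str) -> int:
    """Return the last completed step, or 0 if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)["step"]
    except FileNotFoundError:
        return 0


def save_step(path: str, step: int) -> None:
    """Write the checkpoint atomically so a crash never leaves a partial file."""
    tmp = f"{path}.tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)  # atomic rename on the same filesystem


def run(path: str, total_steps: int, fail_at=None) -> int:
    """Resume from the checkpoint and run until total_steps is reached."""
    step = load_step(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated failure after step {step}")
        step += 1  # placeholder for one unit of real work
        save_step(path, step)
    return step
```

If a first call such as `run("job.ckpt", 5, fail_at=3)` fails after step 3, a second call `run("job.ckpt", 5)` resumes at step 4 instead of starting over, which is the behavior a restarted job needs in order to make progress.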

Observability Stack#

Comprehensive monitoring and observability infrastructure:

Components:

Metrics collection and aggregation
Log management and analysis
Health and performance dashboards
Alert management and notification

Availability:

B200 (v2.2.0): On Admin Kubernetes Nodes
B300 (v2.1.0): On Admin Kubernetes Nodes
GB300 (v2.1.0): On Admin Kubernetes Nodes
GB200 (v2.0.0): On Admin Service Nodes
B200 (v2.0.0): Integrated into Run:AI Management Nodes

Network Management#

NetQ (v2.1.0):

Network fabric monitoring and troubleshooting for modern data center networks.

B300 (v2.1.0): Available on Admin Kubernetes Nodes
GB300 (v2.1.0): Available on Admin Kubernetes Nodes
Not available on v2.0.0 architectures (B200, GB200)

NetQ (NVLink management):

NVLink fabric management and configuration for NVL72 rack systems.

GB300 (v2.1.0): Available on Admin Kubernetes Nodes
GB200 (v2.0.0): Available on Admin Service Nodes
Not applicable to B-series systems (B300, B200)

NVLink Switches:

High-speed GPU interconnect for direct GPU-to-GPU communication.

GB300: NVLink switches for GPU communication within and across racks
GB200: NVLink switches integrated with network fabric within compute trays
Not applicable to B-series systems which use traditional InfiniBand/Ethernet interconnect

Network and Storage Infrastructure#

Networking#

Mission Control integrates with enterprise network infrastructure:

InfiniBand:

IB Switches and Unified Fabric Manager (UFM)
High-bandwidth, low-latency interconnect for HPC and AI workloads
Available on all architectures

Ethernet:

Ethernet switches for management and data networks
Available on all architectures

NVLink:

NVLink switches for GPU interconnect (GB300, GB200)
High-speed GPU-to-GPU communication within compute trays

Storage Systems#

NFS Storage:

Network File System for shared storage
Home directories, shared datasets, and application data
Available on all architectures

High-Speed Storage (HSS):

High-performance file system for demanding workloads
Optimized for large-scale data processing
Available on all architectures

Additional Systems#

Customer-Provided BMS:

Building Management System (BMS) integration for GB-series systems (GB300, GB200)
API-compliant BMS for enhanced hardware management and control
Customer-provided component that integrates with BCM via API interface
Not applicable to B-series systems (B300, B200)

Administrative Tasks#

Common administrative tasks covered in this guide include:

Deployment and Configuration:

Initial cluster deployment
Network and storage configuration
User access setup
Service configuration and tuning

Monitoring and Maintenance:

System health monitoring
Performance analysis
Log management
Firmware and software updates

User Management:

User account provisioning
Access control configuration
Resource quota management
Job submission access

Troubleshooting:

Diagnostic procedures
Common issues and resolutions
Log analysis
Hardware fault isolation

Architecture Selection Guide#

Choose the appropriate architecture based on your requirements:

DGX B300:

Highest GPU density: 72 units per SU (InfiniBand) or 64 units per SU (Ethernet)
Advanced monitoring with NetQ (v2.1.0)
Autonomous recovery features (AHR/AJR) in v2.1.0
Traditional DGX node architecture in standard racks
x86-based infrastructure throughout
Separate Admin Kubernetes Nodes for infrastructure services

DGX B200:

Balanced production configuration: 32 units per SU
Proven v2.2.0 architecture
Streamlined v2.0.0 deployment without separate admin infrastructure nodes; v2.2.0 adds dedicated Admin and User Kubernetes Nodes
Traditional DGX node architecture in standard racks
x86-based infrastructure throughout
Cost-effective for production AI workloads

DGX GB300:

Next-generation NVL72 rack-based system
Hybrid Arm64/x86 architecture: Arm64 for control and user access, x86 for admin services
NVLink high-speed interconnect for GPU fabric
Separate User Kubernetes Nodes for user workloads
Advanced features: NetQ, NMX, AHR, AJR (v2.1.0)
Compute tray organization with 8 racks per SU

DGX GB200:

Enterprise NVL72 rack-based system with highest isolation
Separated admin and user control planes for security
Hybrid Arm64/x86 architecture: x86 admin plane, Arm64 Slurm nodes, x86 user services
Advanced autonomous recovery (AHR, AJR) in v2.0.0
NetQ for NVLink fabric management
Compute tray organization with 8 racks per SU
Ideal for multi-tenant environments requiring strong isolation

Document Conventions#

This guide uses the following conventions:

Bold text: UI elements, buttons, menu items
Monospace text: Commands, file paths, configuration values
Italic text: New terms, emphasis

Note

Architecture-specific procedures are clearly marked throughout the guide.

Warning

Always verify compatibility with your specific hardware configuration before making changes.

Tip

Consult the release notes for version-specific features and known issues.