NVIDIA Mission Control 2.2.0 Release Notes#
Overview#
NVIDIA Mission Control 2.2.0 brings configuration and feature parity with other DGX systems to the NVIDIA DGX B200, extending comprehensive cluster management capabilities to the new platform. This release includes Autonomous Resiliency Engine support with Slurm scheduler integration for workload-aware recovery, Grafana dashboard visualizations, and enhanced firmware management for DGX B200. It also continues support for DGX B300 (Base Command Manager, Run:ai, Mission Control LaunchPad access, and Kubernetes Security Policies only) and NVIDIA GB200/GB300 NVL72 systems.
Key Highlights:
Comprehensive NVIDIA Mission Control support for DGX B200: Includes autonomous hardware recovery, autonomous job recovery with Slurm scheduler integration, and Grafana visualizations.
Mission Control LaunchPad: New unified web interface for centralized access to all Mission Control components.
Enhanced Kubernetes Security: Updated policies supporting the new Mission Control LaunchPad interface.
Domain Power Service (DPS) - Early Preview: Comprehensive power management system for monitoring and optimizing data center power utilization (limited early preview access).
Continued platform support: DGX B300 (Base Command Manager, Run:ai, Mission Control LaunchPad access, and Kubernetes Security Policies only) and NVIDIA GB200/GB300 NVL72 systems (full feature set including Base Command Manager, Run:ai, NetQ, autonomous hardware recovery, Grafana Visualizations, and Kubernetes Security Policies).
Platform Support#
NVIDIA Mission Control 2.2.0 supports the following platforms:
DGX B200#
New in this release#
Autonomous Resiliency Engine:
autonomous job recovery (v1.5)
autonomous hardware recovery (v28.4.133)
autonomous hardware recovery config files and runbooks (v2.2.0)
NVIDIA Resiliency Extensions (NVRx) (v0.4.1)
Grafana Visualizations (v27.1.7)
Domain Power Service (v0.7.8) - Early Preview
Mission Control LaunchPad access (v0.2.11)
Supported features#
Base Command Manager (v11.31.0)
Run:ai (v2.23)
Kubernetes Security Policies (v2.0.9)
DGX B300#
New in this release#
Mission Control LaunchPad access (v0.2.11)
Domain Power Service (v0.7.8) - Early Preview
Supported features#
Base Command Manager (v11.31.0)
Run:ai (v2.23)
Kubernetes Security Policies (v2.0.9)
NVIDIA GB200 NVL72#
New in this release#
Mission Control LaunchPad access (v0.2.11)
Domain Power Service (v0.7.8) - Early Preview
Supported features#
Base Command Manager (v11.31.0)
Run:ai (v2.23)
Autonomous Resiliency Engine:
autonomous job recovery (v1.5)
autonomous hardware recovery (v28.4.133)
autonomous hardware recovery config files and runbooks (v2.2.0)
Grafana Visualizations (v27.1.7)
NetQ (v5.0.1)
Kubernetes Security Policies (v2.0.9)
NVIDIA GB300 NVL72#
New in this release#
Mission Control LaunchPad access (v0.2.11)
Domain Power Service (v0.7.8) - Early Preview
Supported features#
Base Command Manager (v11.31.0)
Run:ai (v2.23)
autonomous hardware recovery (v28.4.133)
NetQ (v5.0.1)
Kubernetes Security Policies (v2.0.9)
Feature Updates#
Autonomous Resiliency Engine#
NVIDIA Mission Control-autonomous job recovery#
Full support for NVIDIA GB200 NVL72 systems and DGX B200 systems with identical feature sets across both platforms.
Scheduler support: Integrated with Slurm workload manager for automated job monitoring, failure detection, and recovery operations.
Expanded NVIDIA GB200 NVL72 specific failure attributions, thereby enhancing job resilience.
Scheduler integration: Autonomous job recovery integrates with Slurm to provide:
Automated job failure detection and analysis.
Intelligent job requeue and recovery workflows.
Workload-aware health monitoring and diagnostics.
NVIDIA Mission Control-autonomous hardware recovery#
NVIDIA Mission Control 2.2.0 extends autonomous hardware recovery capabilities to DGX B200 systems, providing the same automated recovery and firmware management workflows available for NVIDIA GB200/GB300 NVL72 systems.
Autonomous hardware recovery support for DGX B200 includes:
Break/Fix Recovery: Automated detection and recovery from hardware failures.
Post-RMA Operations: Streamlined hardware replacement and system reintegration workflows.
Firmware Upgrade Management: Automated firmware update orchestration across system components.
Baseline Health Checks: Comprehensive system validation and health verification.
Grafana Visualizations#
The following comprehensive monitoring dashboards are now available for DGX B200, providing real-time visibility into system health, power consumption, network performance, and job execution metrics:
DGX System Overall monitoring Dashboard: Comprehensive cluster-wide overview and status.
GPU Health and Usage Dashboard: GPU health, utilization, and performance metrics.
DGX Health Status Dashboard: System health monitoring and diagnostics.
Slurm Dashboard: Workload scheduler monitoring and job tracking.
Alerts Dashboard: Centralized alert management and notification tracking.
Storage Dashboard: Storage system monitoring and capacity tracking.
Inventory Dashboard: Hardware and firmware version tracking with donut chart visualizations.
Kubernetes Security Policies#
Updated security policies are compatible with DGX B200/B300 and NVIDIA GB200/GB300 NVL72 systems. Enhanced policies now support the new Mission Control LaunchPad interface while maintaining security best practices across all supported platforms.
Mission Control LaunchPad#
A new unified web interface that provides centralized access to all NVIDIA Mission Control components. The Mission Control LaunchPad simplifies cluster management by consolidating access to Base Command Manager, Run:ai, Grafana Visualizations, and administrative tools for autonomous job recovery and autonomous hardware recovery in a single, intuitive interface.
Domain Power Service (DPS) - Early Preview#
Comprehensive power management system for monitoring and optimizing data center power utilization. Available for select large-scale deployments; contact your NVIDIA representative for access to this early preview capability.
Known Issues and Limitations#
Known Issue 1: Energy-Optimized Workload Power Profiles Support#
Customer Impact: Energy-optimized workload power profiles are not supported for NVIDIA GB300 NVL72 systems in this release.
Workaround: None available at this time.
Resolution: Support for energy-optimized workload power profiles depends on MaxQ Power profile functionality, which is planned for system software version 2.0.0. The feature will be enabled in a future release following successful validation.
Known Issue 2: Leak Detection Compatibility#
Customer Impact: Liquid leak detection alerts may not trigger automatically on NVIDIA GB200/GB300 NVL72 systems. Until a permanent fix is available in a future Base Command Manager release, administrators should manually configure the leak detection strings as described below.
Workaround Steps on Base Command Manager head nodes#
Verify the script location:
Run the following command to check if the script is available in your system path and to display its location:
which cm-manipulate-advanced-config.py
Then, run the following command to list details about the script (permissions, size, etc.) to confirm it is executable:
ls -l $(which cm-manipulate-advanced-config.py)
Load the required module (if the script is not found):
If the script is not found, load the required module with this command:
module load cmd
This ensures the cmd module is loaded, providing access to the required script and related utilities.
Update leak detection configuration:
Run the following command to update the configuration with all relevant property strings for leak detection:
cm-manipulate-advanced-config.py "RedFishServiceLeakMessageStrings=LeakDetector|Leak_Detector|leakage|Leakage"
Verify the configuration file:
Confirm that the changes are reflected in /cm/local/apps/cmd/etc/cmd.conf. You can open the file to verify the updated property strings:
cat /cm/local/apps/cmd/etc/cmd.conf
Or you may use a text editor of your choice.
Restart the CMD service:
Apply the updated configuration by restarting the service:
service cmd restart
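The checks in the steps above can also be scripted. The following is a minimal, read-only sketch, not an NVIDIA-provided tool: it confirms that the script is on the PATH and that the leak detection property appears in cmd.conf, and it changes nothing. The function names and the CMD_CONF override variable are illustrative, and the sketch assumes the property string is stored verbatim in the configuration file.

```shell
#!/bin/sh
# Read-only sketch of the verification steps in the workaround.
# script_available, leak_strings_configured, and CMD_CONF are
# illustrative names, not part of Base Command Manager.

SCRIPT_NAME='cm-manipulate-advanced-config.py'
CMD_CONF="${CMD_CONF:-/cm/local/apps/cmd/etc/cmd.conf}"
# Assumption: the property is written to cmd.conf exactly as passed
# to cm-manipulate-advanced-config.py.
EXPECTED='RedFishServiceLeakMessageStrings=LeakDetector|Leak_Detector|leakage|Leakage'

# Succeeds if the named command can be found on the PATH.
script_available() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if the given file is readable and contains the expected
# property string as a fixed (non-regex) match.
leak_strings_configured() {
    [ -r "$1" ] && grep -q -F -- "$EXPECTED" "$1"
}

if script_available "$SCRIPT_NAME"; then
    echo "found: $(command -v "$SCRIPT_NAME")"
else
    echo "$SCRIPT_NAME not on PATH; try: module load cmd"
fi

if leak_strings_configured "$CMD_CONF"; then
    echo "leak detection strings present in $CMD_CONF"
else
    echo "leak detection strings not found in $CMD_CONF"
fi
```

Because the sketch only reads state, it can be run both before and after the configuration update to compare results. The fixed-string match (`grep -F`) matters here because the value contains `|` characters, which an extended regular expression match would treat as alternation.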
Resolution: An upcoming BCM software release will include a permanent fix.
NVIDIA Mission Control 2.2.0 Software Components#
NVIDIA Base Command Manager - 11.31.0 (GB200/GB300 NVL72, DGX B200/B300)
NVIDIA Run:ai - 2.23 (GB200/GB300 NVL72, DGX B200/B300)
NetQ - 5.0.1 (GB200 NVL72, GB300 NVL72)
Grafana Visualizations - 27.1.7 (GB200 NVL72, DGX B200)
Kubernetes Security Policies - 2.0.9 (GB200/GB300 NVL72, DGX B200/B300)
Autonomous Resiliency Engine, includes:
autonomous job recovery - 1.5.0 (GB200 NVL72, DGX B200)
autonomous hardware recovery - 28.4.133 (GB200/GB300 NVL72, DGX B200)
autonomous hardware recovery config files and runbooks - 2.2.0 (GB200/GB300 NVL72, DGX B200)
Note
NVIDIA Resiliency Extensions (NVRx) is not bundled with Mission Control. It must be installed separately. Please follow the installation instructions at: https://nvidia.github.io/nvidia-resiliency-ext/
NVIDIA Mission Control 2.2.0 - Software Bill of Materials (SBOM) for DGX B200/B300 Systems#
NVIDIA Mission Control 2.2.0 - Software Bill of Materials (SBOM) for NVIDIA GB200/GB300 NVL72 Systems#
Software Downloads, Installation, and Activation#
Product Documentation#
Released NVIDIA Mission Control 2.2.0 documentation, featuring updates across recovery engines, telemetry dashboards, and support for DGX B200/B300 and NVIDIA GB200/GB300 NVL72 systems.