NVIDIA Mission Control 2.1.0 Release Notes#
Overview#
NVIDIA Mission Control 2.1.0 delivers important enhancements to firmware management, observability, and security capabilities for NVIDIA GB200 NVL72 and NVIDIA GB300 NVL72 – from OEMs and NVIDIA DGX – as well as NVIDIA DGX B200/B300. This release expands autonomous hardware recovery support to NVIDIA GB300 NVL72, introduces comprehensive firmware upgrade capabilities across critical system components, and provides enhanced Grafana dashboards with improved inventory visibility.
Key Highlights:
Support for NVIDIA GB300 NVL72 and DGX GB300 system (GA).
Support for DGX B300 system (GA).
Extended autonomous hardware recovery support for NVIDIA GB300 NVL72 systems with firmware upgrades, prolog health checks, and OEM enablement improvements.
Comprehensive firmware upgrade capabilities for compute trays, switches, Mellanox, and NVOS via autonomous hardware recovery.
Advanced BCM capabilities specific to NVIDIA Mission Control for DGX B300.
NetQ NVL support (NMX-M) for GB300 NVL72 and DGX GB300 systems.
Enhanced Grafana dashboards with improved rack navigation and new inventory views for NVIDIA GB200 NVL72 Systems.
Strengthened Kubernetes security with network policies and Kyverno capability restrictions.
NVIDIA Run:ai 2.23 support across NVIDIA GB200/GB300 NVL72 systems, DGX GB200/GB300 systems, and DGX B200/B300.
Continued support for DGX B200 and NVIDIA GB200 NVL72 Systems.
What’s New#
Platform Support#
NVIDIA Mission Control 2.1.0 extends support across multiple platform generations, including:
NVIDIA GB300 NVL72 and DGX GB300 Systems: First customer support (GA) with comprehensive firmware management and networking support.
NVIDIA DGX B300: First customer support (GA) with BCM, Run:ai, and InfiniBand (IB) support; Spectrum-X support available to select customers.
NVIDIA GB200 NVL72 and DGX GB200 Systems: Continued support, including OEM configurations.
NVIDIA DGX B200: Continued support.
NVIDIA GB300 NVL72 Systems and DGX GB300 Platform Support#
NVIDIA Mission Control 2.1.0 introduces comprehensive support for GB300 NVL72 systems:
BCM for Firmware Updates: Complete firmware update management and rack management support.
NetQ NVL Support (NMX-M): Network visibility and monitoring for NVIDIA GB300 NVL72 systems.
Autonomous Hardware Recovery support for Firmware Updates: Autonomous hardware recovery with firmware upgrades and prolog health checks.
NVIDIA DGX B300 Platform Support#
NVIDIA Mission Control 2.1.0 introduces support for DGX B300 systems with the following capabilities:
NVIDIA Base Command Manager Support: Full Base Command Manager (BCM) integration for DGX B300 cluster management.
NVIDIA Run:ai Support: Workload orchestration and GPU resource management for DGX B300.
InfiniBand (IB) Support: General availability. Provides high-speed, low-latency GPU networking.
Spectrum-X Support: Advanced networking capabilities (Directed Availability only).
NVIDIA Run:ai 2.23 Support#
NVIDIA Mission Control 2.1.0 now supports NVIDIA Run:ai 2.23 across NVIDIA GB200/GB300 NVL72 systems and DGX B200/B300 platforms, enabling advanced Kubernetes workload orchestration and GPU resource management across all supported systems.
NVIDIA NMX-Manager (NetQ) 5.0.1#
Updated to NMX/NetQ 5.0.1 providing enhanced network visibility, monitoring, and troubleshooting capabilities for supported platforms.
NVIDIA Mission Control Autonomous Hardware Recovery 28.4.129 and Autonomous Hardware Recovery Config Files and Runbooks 2.0.9#
The NVIDIA Mission Control autonomous hardware recovery feature now provides expanded functionality for upgrading, cycling, and verifying firmware and the corresponding OS across both NVIDIA GB200 NVL72 and GB300 NVL72 systems.
Firmware Upgrades via Autonomous Hardware Recovery#
The firmware upgrade process now supports the following components:
Compute trays: Automated firmware management for compute tray components.
Switches: Firmware upgrade capabilities for network switches.
Mellanox: Support for Mellanox networking component firmware updates.
NVOS: Operating system firmware management and verification.
Known Issues#
In MRT2, the HPL Performance run result might show failure even though the MR_HPL_TEST_BURN_IN run finished successfully.
NVIDIA GB300 NVL72 Systems Support#
Autonomous hardware recovery capabilities have been extended to NVIDIA GB300 NVL72 systems, providing the same level of automated firmware management and recovery workflows available for NVIDIA GB200 NVL72 Systems.
Grafana Visualizations (27.1.0)#
NVIDIA Mission Control now includes powerful dashboard upgrades for NVIDIA GB200 NVL72 environments, providing enhanced visibility and navigation capabilities.
Key Enhancements:#
Simplified Navigation: Rack selection is now based on the official BCM device inventory, reducing confusion and aligning dashboard views with how racks are actually assigned and managed.
Expanded Network Monitoring: Dedicated panels for Unified Fabric Manager (UFM) provide better oversight of network activity and potential bottlenecks.
Comprehensive Hardware Overview: A new Inventory Dashboard consolidates hardware and firmware information in one place, making it faster to confirm what’s deployed and up to date.
Consistent Look and Feel: All dashboards have been visually aligned for a more unified and professional presentation, making them easier to use and interpret across teams.
Details:#
Dashboards with a “Rack” dropdown have been updated to use names as defined in the “Racks” area of BCM, instead of assuming a specific hostname format.
Panels to monitor the InfiniBand UFM have been added to the “08. Network” dashboard.
A new “11. Inventory” dashboard uses donut charts to show the distribution of part numbers and firmware versions for key devices in the cluster.
Various panels have been reworked to use other metric sources and formulas to better represent the underlying cluster behavior.
Known Issues:#
The “03. BMS” dashboard requires that the BMS integration with BCM is configured and that the data matches the specifications of the expected data catalog.
Known Limitations#
Known Issue 1: Energy-Optimized Workload Power Profiles Support#
Description: Energy-optimized workload power profiles are not supported for NVIDIA GB300 NVL72 systems in this release.
Affected Versions: Current release for GB300 NVL72 systems.
Impact: Feature capability is unavailable until dependencies are met.
Workaround: None available at this time.
Resolution: Support for energy-optimized workload power profiles depends on MaxQ Power profile functionality, planned for system software version 2.0.0 in a future release.
Additional Notes:
Validation requirements scheduled for 1H 2026 include:
System software update validation for GB300 NVL72.
End-to-end validation of energy-optimized workload power profiles on GB300 NVL72 systems.
Feature will be enabled in a future release following successful validation.
Known Issue 2: Leak Detection Compatibility#
Description: In DGX GB300 NVL72 environments, backward compatibility for leak detection without broadcast communication may fail. This can affect switch leak policies and inhibit proper detection under certain configurations.
Affected Versions: DGX GB300 NVL72 Systems software and associated BCM-11.31.0 Redfish API endpoints.
Impact: This issue does not block the current release of BCM-11.31.0/NVIDIA Mission Control-2.1.0.
Workaround: On BCM head nodes, perform the following steps:
Verify the script location:
Run:
which cm-manipulate-advanced-config.py
This checks if the script is available in your system path and shows its location.
Then run:
ls -l $(which cm-manipulate-advanced-config.py)
This lists details about the script, including permissions and size, to confirm it is executable.
Load the required module (if the script is not found):
Run:
module load cmd
This ensures the cmd module is loaded, which provides access to the required script and related utilities.
Update leak detection configuration:
Run:
cm-manipulate-advanced-config.py "RedFishServiceLeakMessageStrings=LeakDetector|Leak_Detector|leakage|Leakage"
This command updates the configuration to include all relevant property strings for leak detection.
Update the configuration file:
Confirm that the changes are reflected in /cm/local/apps/cmd/etc/cmd.conf. You can open the file using
cat /cm/local/apps/cmd/etc/cmd.conf
or a text editor to verify the updated property strings.
Restart the CMD service:
Run:
service cmd restart
This applies the updated configuration by restarting the service.
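The workaround steps above can be collected into one script for repeated use on BCM head nodes. The following is a minimal sketch, not an NVIDIA-provided tool: the `DRY_RUN` switch and the `run` and `apply_leak_workaround` names are illustrative assumptions, while the commands, the setting string, and the configuration path are the ones listed in the steps. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: consolidates the leak-detection workaround steps for a BCM head node.
# DRY_RUN, run, and apply_leak_workaround are illustrative names, not part of BCM.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only; 0 = execute them
SCRIPT="cm-manipulate-advanced-config.py"
CONF="/cm/local/apps/cmd/etc/cmd.conf"
LEAK_SETTING='RedFishServiceLeakMessageStrings=LeakDetector|Leak_Detector|leakage|Leakage'

# Print a command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

apply_leak_workaround() {
  # Locate the script; load the cmd module if it is not on PATH.
  # Assumption: the "module" command is available in this shell session.
  if ! command -v "$SCRIPT" >/dev/null 2>&1; then
    run module load cmd
  fi
  # Update the leak detection property strings.
  run "$SCRIPT" "$LEAK_SETTING" || return 1
  # Confirm the setting is reflected in cmd.conf.
  run grep -F "RedFishServiceLeakMessageStrings" "$CONF" || return 1
  # Restart CMD so the updated configuration takes effect.
  run service cmd restart
}
```

Call `apply_leak_workaround` to review the printed commands first, then set `DRY_RUN=0` and call it again to execute them. The `grep` step stands in for the manual inspection of `cmd.conf` and stops the sequence before the restart if the setting is not found.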
Resolution: A permanent fix is planned for an upcoming BCM firmware release.
Additional Notes: Automated validation tools did not flag this due to unnoticed changes in property strings; improvements to validation coverage are being investigated.
Kubernetes Security Updates (2.0.8)#
Mission Control now includes enhanced security capabilities for Kubernetes environments through two new Helm charts and additional manifests for network security and API capability restrictions.
Network Policies Chart:#
Additional Calico policies can be applied to both user (k8s-system-user) and admin (k8s-system-admin) Kubernetes clusters to control network traffic. The policies are different for each cluster based on the services running in each environment.
Kyverno Policies Chart:#
Hardens the Kubernetes API by restricting capabilities to only those services that need them. Kyverno policies specific to the user and admin clusters are provided as part of the Helm chart to enforce security best practices and limit potential attack surfaces.
DGX B300 SPX2 Manifest:#
A new manifest containing the DGX B300 SPX2 Kyverno policy has been added. It is applied in the Run:ai Kubernetes cluster and provides hostnetwork policy enforcement specific to DGX B300 architecture requirements.
Kubernetes Manifests:#
The Kubernetes manifests from Mission Control 2.0 continue to be provided with the addition of the new DGX B300 manifest, ensuring backward compatibility while supporting new platform requirements.
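After the charts and manifests described above are applied, one way to confirm what is in place is to list the network policy and Kyverno policy objects on each cluster. The following is a minimal sketch under stated assumptions: it assumes a kubectl context exists for each cluster (the context names used in the usage note mirror the cluster names above and may differ in your deployment), and that Calico and Kyverno are installed with their upstream CRD names. `DRY_RUN`, `run`, and `list_security_policies` are illustrative names. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: list network and Kyverno policies on one cluster.
# Assumptions (not from the release notes): a kubectl context per cluster,
# and Calico/Kyverno CRDs registered under their upstream names.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only; 0 = execute them

# Print a command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

list_security_policies() {
  ctx="$1"   # kubectl context name (hypothetical)
  # Kubernetes-native NetworkPolicy objects in all namespaces.
  run kubectl --context "$ctx" get networkpolicies --all-namespaces
  # Calico-specific policies (CRD-backed resources).
  run kubectl --context "$ctx" get networkpolicies.crd.projectcalico.org --all-namespaces
  run kubectl --context "$ctx" get globalnetworkpolicies.crd.projectcalico.org
  # Kyverno cluster-wide and namespaced policies.
  run kubectl --context "$ctx" get clusterpolicies.kyverno.io
  run kubectl --context "$ctx" get policies.kyverno.io --all-namespaces
}
```

For example, `list_security_policies k8s-system-user` followed by `list_security_policies k8s-system-admin` prints the commands for both clusters; set `DRY_RUN=0` to run them. Because the policies differ per cluster, comparing the two listings is a quick check that each cluster received its own set.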
NVIDIA Base Command Manager 11.31: New Features#
NVIDIA Mission Control includes Base Command Manager 11 for core cluster management capabilities. A new version of Base Command Manager is now available and validated with NVIDIA Mission Control 2.1.0.
General Updates:#
NV Platform Info Support: Added NV platform information support and cmsh command for enhanced platform visibility and management.
Rack-Level Leak Policy: Changed default leak policy to rack level for improved resource management and isolation.
Slurm Updates: Updated Slurm versions to 25.05.5 (from 24.11) and 24.11.7, providing improved workload scheduling and management capabilities.
Topograph 3.5.0: Updated Topograph to version 3.5.0, enhancing topology visualization and network mapping capabilities.
CUDA 13.0.2: Updated CUDA toolkit to version 13.0.2 for improved GPU computing performance and compatibility.
PRS Setup for Slurm 25.05: Enabled PRS (Power Resource Scheduler) setup support for Slurm 25.05, allowing advanced power management for workloads.
Network Operator 25.10: Updated Network Operator to version 25.10 for enhanced networking capabilities and performance.
NVIDIA Mission Control 2.1.0 Software Components#
NVIDIA Base Command Manager - 11.31.0 (GB200/GB300 NVL72, DGX B200/B300)
NVIDIA Run:ai - 2.23 (GB200/GB300 NVL72, DGX B200/B300)
NVIDIA NMX Manager - 85.1.2000 (GB200 NVL72 only)
NetQ - 5.0.1 (GB200 NVL72, GB300 NVL72)
Grafana Visualizations - 27.1.0 (GB200 NVL72 only)
Kubernetes Security Policies - 2.0.7 (GB200/GB300 NVL72, DGX B200/B300)
Autonomous recovery engine, which includes:
autonomous job recovery - 1.3.1 (only on GB200 NVL72)
autonomous hardware recovery - 28.4.129 (only on GB200/GB300 NVL72)
autonomous hardware recovery-Config files and Runbooks-2.0.9
NVIDIA DGX GB200 System Software and Firmware-1.3.2 (recommended)
NVIDIA DGX GB300 System Software and Firmware-1.0.1 (recommended)
NVIDIA Mission Control 2.1.0 - Software Bill of Materials (SBOM) for NVIDIA GB200/GB300 NVL72#
NVIDIA Mission Control 2.1.0 - Software Bill of Materials (SBOM) for DGX B200/B300#
Where to Download and Install Mission Control 2.1.0#
Product Documentation#
Released NVIDIA Mission Control 2.1.0 Documentation, featuring updates across recovery engines, telemetry dashboards, and NVIDIA GB200/GB300 NVL72 system support.