NVIDIA Mission Control 2.3.0 Release Notes#
Overview#
NVIDIA Mission Control 2.3.0 extends platform support to DGX B300 and NVIDIA GB300 NVL72 systems with full autonomous resiliency capabilities, introduces air-gapped deployment support, adds new Grafana monitoring dashboards, and delivers updates across all Mission Control components including Base Command Manager v11.32.0, Domain Power Service v0.7.10, and Mission Control-LaunchPad v0.2.13.
Key Highlights:
Expanded Platform Support: Full autonomous hardware and job recovery capabilities extended to DGX B300 and NVIDIA GB300 NVL72 systems.
Air-Gapped Deployment Support: Support for air-gapped deployments introduced in Mission Control 2.3.0.
Enhanced Grafana Visualizations v27.3.2: New dashboards for NVIDIA GB300 NVL72 systems, with Beta availability for DGX B300 systems, including BMS, Rack Power Analysis, Network, and NVLink dashboards.
Mission Control LaunchPad v0.2.13: Adds Keycloak authentication integration and Helm chart enhancements for dynamic UI customization.
Base Command Manager v11.32.0: Updated core components, new HPC networking packages, decoupled TUI wizards, and GB300 rack support in Base View.
Domain Power Service v0.7.10: Built-in health-check CLI, standardized Helm resource naming, and multiple stability improvements.
Platform Support#
| Capability | DGX B200* | DGX B300* | DGX GB200 | OEM GB200 NVL72 | DGX GB300 | OEM GB300 NVL72 |
|---|---|---|---|---|---|---|
| NVIDIA Base Command Manager (11.32.0) with DGX OS 7.4.0 | Yes | Yes | Yes | Yes | Yes | Yes |
| NVIDIA Run:ai (2.23) | Yes | Yes | Yes | Yes | Yes | Yes |
| NVIDIA Mission Control-Autonomous Resiliency Engine | | | | | | |
| NVIDIA Mission Control-autonomous job recovery (1.5.0) | Yes | Yes | Yes | Yes | Yes | Yes |
| NVIDIA Mission Control-autonomous hardware recovery (29.1.82) and NVIDIA Mission Control-autonomous hardware recovery-Config files and Runbooks (2.3.26) | Yes | Yes | Yes | Yes | Yes | Yes |
| NVIDIA Resiliency Extensions (0.4.1) | Yes | Yes | Yes | Yes | Yes | Yes |
| Observability Stack | Yes | Yes | Yes | Yes | Yes | Yes |
| NVIDIA Mission Control-Grafana Dashboards (27.3.2) | Yes | Beta | Yes | Yes | Yes | Yes |
| NVIDIA NetQ (5.1.0) | N/A | N/A | Yes | Yes | Yes | Yes |
| NVIDIA Mission Control-LaunchPad (0.2.13) | Yes | Yes | Yes | Yes | Yes | Yes |
| NVIDIA Domain Power Service (0.7.10) (Early Preview) | Yes | Yes | Yes | Yes | Yes | Yes |
| Kubernetes Security Policies (2.0.12) | Yes | Yes | Yes | Yes | Yes | Yes |
| Air-Gapped Deployment | Yes | Yes | Yes | Yes | Yes | Yes |
Feature Updates#
Autonomous Resiliency Engine#
NVIDIA Mission Control - autonomous job recovery#
Autonomous Job Recovery v1.5.0 delivers the following updates:
Adds support for DGX B300 and NVIDIA GB300 NVL72 systems, with Slurm scheduler integration enabling automated job monitoring, failure detection, and recovery.
Continued full support for NVIDIA GB200 NVL72 and DGX B200 systems with identical feature sets across both platforms.
NVIDIA Mission Control - autonomous hardware recovery#
Autonomous Hardware Recovery v29.1.82 and Runbooks v2.3.26 include the following updates:
DGX B300 support: Baseline runbooks are now supported for B300, including HPL MXP, single-unit and multi-unit testing (SUT/MUT) flows, break/fix, and firmware upgrade.
NVIDIA GB300 NVL72 Systems updates: Break/Fix runbook updates, firmware upgrade integration in Break/Fix flow, re-enablement of all runbooks on GB300, and updated NVBANDWIDTH NIC references.
Switch and NVOS: Improved credential handling across switch-related actions, updated logic for finding active NMX controller, NVOS reinstall and upgrade command improvements, and ZTP settings updates.
Flexible Resource Query (FRQ): Added FRQ support across runbooks; RACK_NAME/rack_name parameters replaced with FRQ resource_tag and resource_value.
Firmware management: Firmware upgrade parallelization (switch and compute components run simultaneously), Powershelf validation included in SRT1, and improved handling of hosts during Mellanox upgrades.
Platform & API updates: Terraform/OpenTofu provider updated from v1.x to v2.x; AHR user directory is now configurable via Terraform/OpenTofu; API rate limiting and throttling with clear error messaging for 429 responses.
Air-Gapped Deployment: TUI AirGap Support added.
Automated AHR artifact deployment: Runbook Packages and dependency installation.
Support for four new resource types: Switch, Powershelf, power distribution unit (PDU), and cooling distribution unit (CDU).
Self-signed certificate management via TUI wizard.
Guided AHR upgrade via TUI wizard.
Multiple GPU types break/fix support in the same instance: Deployment of break/fix Runbook Packages for multiple GPU types/generations is now supported within a single AHR instance, eliminating the former restriction to one type.
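The Platform & API updates item above notes that the API now throttles requests and answers with 429 responses. A client-side retry loop for that situation might look like the following minimal sketch. The `send` callable and the `(status, headers, body)` response shape are hypothetical placeholders, and whether the AHR API sends a `Retry-After` header is an assumption; the sketch falls back to capped exponential backoff when the header is absent.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape used only for this sketch: (status, headers, body).
Response = Tuple[int, Mapping[str, str], str]


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() and retry while the server answers 429 Too Many Requests."""
    body = ""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        # Assumption: the server may send Retry-After in whole seconds.
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * (2 ** attempt)  # exponential backoff
        sleep(min(delay, max_delay))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts: {body}")
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting; in production code the default `time.sleep` applies.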
NVIDIA Resiliency Extension (NVRx)#
Note: NVIDIA Resiliency Extensions (NVRx) is not bundled with Mission Control. It must be installed separately. Please follow the installation instructions at: https://nvidia.github.io/nvidia-resiliency-ext/
Grafana Visualizations#
Grafana Visualizations v27.3.2 delivers an expanded set of monitoring dashboards, now available for DGX GB300 systems and in Beta for DGX B300 systems. The dashboards provide real-time visibility into system health, power consumption, network performance, and job execution metrics:
DGX System Overall Monitoring Dashboard: Comprehensive cluster-wide overview and status.
GPU Health and Usage Dashboard: GPU health, utilization, and performance metrics.
BMS Dashboard [Beta]: Rack-level power, cooling, and leak status.
DGX Health Status Dashboard: System health monitoring and diagnostics.
Rack Power Analysis Dashboard: In-depth view of power component health and per-component power utilization.
Slurm Dashboard: Workload scheduler monitoring and job tracking.
Alerts Dashboard: Centralized alert management and notification tracking.
Network Dashboard: Health, inventory, and performance of components throughout InfiniBand and Ethernet fabrics.
NVLink Dashboard: Health, inventory, and performance of components throughout the NVLink fabric.
Storage Dashboard: Storage system monitoring and capacity tracking.
Inventory Dashboard: Hardware and firmware version tracking with donut chart visualizations.
Mission Control LaunchPad#
Mission Control LaunchPad v0.2.13 includes the following updates:
Supports Authentication (Identity Only)#
Keycloak Integration: Added support for Keycloak as the authentication provider for identity verification (AuthN).
User-Provided IdP: Users are responsible for providing their own external Identity Provider (e.g., LDAP, OIDC) to interface with Keycloak.
Scope: This integration supports authentication only; authorization (RBAC/permissions) is not included.
Setup Guidance: Documentation is provided for downloading, installing, and configuring Keycloak within the environment.
Configuration#
Helm Chart Enhancement: Updated values.yaml to support dynamic card rendering in LaunchPad, enabling flexible UI customization via Helm.
Base Command Manager#
NVIDIA Base Command Manager v11.32.0 (released March 12, 2026) includes the following updates:
Updated Components:#
BaseOS 7.4.0 (x86_64 and ARM/aarch64)
Ubuntu 24.04.3
Slurm 25.05.6 and 25.11.3
CUDA 13.1 packages
OpenSSL 3.5.5
Pyxis 0.21.0
Enroot 4.0.1
NVIDIA HPC SDK 25.11
NVIDIA Container Toolkit 1.18.2
New HPC networking packages: cm-hpcx-doca-ofed-cuda12 and cm-hpcx-doca-ofed-cuda13
New Features:#
Decoupled TUI wizards for Domain Power Service, NetQ, Mission Control-autonomous job recovery, and Mission Control-autonomous hardware recovery delivered as separately installable packages, accessible through the existing cm-mission-control-setup interface.
Option in cm-kubernetes-setup to install Kubernetes software in a software image without re-running a full Kubernetes setup.
Improved Topograph integration: CMDaemon stops issuing requests after 50x responses and generates a minimal valid topology configuration when topology fetch fails.
Improved support in minimal configuration compute nodes to allow for offloading Slurm.
CMDaemon no longer writes the deprecated Slurm parameter AccountingStorageUser in slurm.conf.
New cmsh command to view active monitoring tasks: device taskrunnerdetails.
Device status monitoring mode added to Rack View in Base View.
GB300 racks highlighted in Rack View in Base View with a gold color.
Topograph settings added to WLM wizard in Base View.
SSO authentication support for COD-AWS.
Option in cmha to skip replication of Slurm accounting database when not on the head node.
Deprecated or Removed Features#
The Kubernetes Dashboard has been removed due to upstream deprecation.
ingress-nginx is now a deprecated feature in cm-kubernetes-setup.
Support for Grafana promtail installation has been removed from cm-kubernetes-setup.
Kubernetes Security Policies#
Kubernetes Security Policies v2.0.12 includes the following updates:
Added support for LaunchPad and Keycloak authentication traffic.
Updated policies to allow traffic to the database and to LaunchPad.
Domain Power Service (DPS) - Early Preview#
Domain Power Service v0.7.10 includes the following updates:
New Features#
Built-in health-check CLI subcommand for dps-server supporting gRPC and HTTP endpoints with configurable retries, intervals, and timeouts.
Init container on PRS deployment to wait for DPS server readiness via gRPC health check before starting.
imagePullSecrets support for Helm test pods (liveness, readiness, startup).
Resource requests and limits for all health-check init containers and test pods.
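To illustrate how retries, intervals, and timeouts interact in a health check like the one described above, here is a minimal Python sketch. It is not the dps-server CLI and does not reproduce its flags; `wait_until_healthy`, `http_probe`, and their parameters are hypothetical names chosen for this example.

```python
import time
import urllib.request
from typing import Callable


def wait_until_healthy(
    probe: Callable[[float], bool],
    retries: int = 5,
    interval: float = 2.0,
    timeout: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run probe(timeout) up to `retries` times, pausing `interval` seconds between tries."""
    for attempt in range(1, retries + 1):
        try:
            if probe(timeout):
                return True
        except OSError:
            # Connection refused, timeouts, and HTTP errors raised by urllib
            # are all OSError subclasses; treat them as "not ready yet".
            pass
        if attempt < retries:
            sleep(interval)
    return False


def http_probe(url: str) -> Callable[[float], bool]:
    """Build a probe that treats any HTTP 2xx answer from `url` as healthy."""

    def probe(timeout: float) -> bool:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300

    return probe
```

A caller could run `wait_until_healthy(http_probe(url), retries=10, interval=3.0)` with `url` set to whatever health endpoint the deployment exposes. The timeout bounds a single attempt, while retries multiplied by the interval bounds the total wait, which is the same trade-off an init container faces when waiting for a server to become ready.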
Changes#
Standardized all Helm resource names to use dps.fullname helper, enabling multiple DPS instances in the same namespace.
Replaced external container images with internally-owned images in all init containers and test pods.
Templatized gRPC ingress port to use .Values.dps.service.grpc.port instead of hardcoded 8080.
Increased PRS memory request from 384Mi to 512Mi.
Updated E2E test port-forward targets to use release-prefixed service names.
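The reason release-derived names allow several instances to share a namespace can be shown with a small Python model of a "fullname" helper. The actual dps.fullname template is not reproduced in these notes, so the sketch below assumes it follows the convention generated by `helm create` (override first, otherwise release name combined with chart name, truncated to 63 characters).

```python
def fullname(
    release_name: str,
    chart_name: str,
    name_override: str = "",
    fullname_override: str = "",
) -> str:
    """Model of the conventional Helm 'fullname' helper (assumed, not DPS-specific)."""

    def clip(value: str) -> str:
        # Kubernetes names are limited to 63 characters; drop one trailing "-".
        value = value[:63]
        return value[:-1] if value.endswith("-") else value

    if fullname_override:
        return clip(fullname_override)
    name = name_override or chart_name
    if name in release_name:
        # Avoid names like "dps-a-dps" when the release already mentions the chart.
        return clip(release_name)
    return clip(f"{release_name}-{name}")


# Two releases of the same chart yield distinct resource name prefixes.
names = {fullname("site1", "dps"), fullname("site2", "dps")}
```

Because every resource name is derived from the release, two releases of the same chart do not collide, whereas hardcoded names would clash on the second install.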
Fixed#
Fixed dps.fullname helper template to properly assign $fullName in all code paths.
Removed empty httpHeaders key from DPS UI liveness probe that could cause issues.
Known Issues and Limitations#
Known Issues: Domain Power Service#
The following known issues exist in Domain Power Service v0.7.10:
| NVBUG ID | Description |
|---|---|
| NVBUG-5969875 | Power distribution data shows via dpsctl but is not displayed by PDN Graphview in the UI. |
| NVBUG-5965403 | UI indicates power policy percent limit accepts 0–100% but rejects any value other than 100%. |
| NVBUG-5965415 | DPS UI allows proceeding to save a Power Policy without limits added, then reports limits as mandatory at the next step. |
| NVBUG-5951458 | DPS UI incorrectly displays “no topology file uploaded” even after a successful upload. |
| NVBUG-5951531 | DPS UI is missing an option to display DPS version info; the only available option is to log out. |
| NVBUG-5973217 | DPS Slurm integration PrologSlurmctld/EpilogSlurmctld scripts block job scheduling, causing jobs to requeue indefinitely. Workaround: Disable DPS prolog/epilog in slurm.conf. |
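As an illustration of the NVBUG-5973217 workaround, the following Python sketch comments out PrologSlurmctld and EpilogSlurmctld entries in slurm.conf text. The script paths in the sample and the idea of recognizing DPS hooks by a "dps" substring are assumptions made for this example; check the actual values in your configuration. On clusters where slurm.conf is generated by CMDaemon, a manual edit may be overwritten, so the change may need to be made through the cluster manager instead.

```python
import re

# Matches uncommented PrologSlurmctld= / EpilogSlurmctld= lines.
# Slurm configuration keys are case-insensitive.
_HOOK = re.compile(r"^\s*(PrologSlurmctld|EpilogSlurmctld)\s*=", re.IGNORECASE)


def disable_slurmctld_hooks(conf_text: str, marker: str = "dps") -> str:
    """Comment out slurmctld prolog/epilog lines whose text mentions `marker`."""
    out = []
    for line in conf_text.splitlines():
        if _HOOK.match(line) and marker.lower() in line.lower():
            out.append("#" + line)  # keep the original line for easy rollback
        else:
            out.append(line)
    trailing = "\n" if conf_text.endswith("\n") else ""
    return "\n".join(out) + trailing


# Hypothetical sample; real script paths will differ.
sample = (
    "ClusterName=demo\n"
    "PrologSlurmctld=/opt/example/dps-prolog.sh\n"
    "EpilogSlurmctld=/opt/example/dps-epilog.sh\n"
    "Prolog=/opt/example/node-prolog.sh\n"
)
patched = disable_slurmctld_hooks(sample)
```

Node-level `Prolog` and `Epilog` entries are left untouched because the pattern only matches the slurmctld variants. A change to slurm.conf typically takes effect after `scontrol reconfigure` or a slurmctld restart.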
Energy-Optimized Workload Power Profiles Support#
Customer Impact: Energy-optimized workload power profiles are not supported for NVIDIA GB300 NVL72 systems in this release.
Workaround: None available at this time.
Resolution: Support for energy-optimized workload power profiles depends on MaxQ Power profile functionality, which is planned for system software version 2.0.0. The feature will be enabled in a future release following successful validation.
Known Issues: Grafana Visualizations#
The following known issues exist in nmc-grafana-dashboards v27.3.2:
| NVBUG ID | Description |
|---|---|
| NVBUG-6057876 | Redfish metric polling on B200/B300 platforms can be slow. |
NVIDIA Mission Control 2.3.0 Software Components#
NVIDIA Base Command Manager - 11.32.0 (GB200/GB300 NVL72, DGX B200/B300)
NVIDIA Run:ai - 2.23 (GB200/GB300 NVL72, DGX B200/B300)
NetQ - 5.1.0 (GB200 NVL72, GB300 NVL72)
Grafana Visualizations - 27.3.2 (GB200/GB300 NVL72, DGX B200/B300)
Kubernetes Security Policies - 2.0.12 (GB200/GB300 NVL72, DGX B200/B300)
Domain Power Service (DPS) - Early Preview - 0.7.10
NVIDIA Mission Control-LaunchPad - 0.2.13
NVIDIA Mission Control-Air-Gap Tool for DGX B200/B300 and GB200/GB300 NVL72 Systems - 1.6.1
Autonomous Resiliency Engine, which includes:
autonomous job recovery - 1.5.0 (GB200/GB300 NVL72, DGX B200/B300)
autonomous hardware recovery - 29.1.82 (GB200/GB300 NVL72, DGX B200/B300)
autonomous hardware recovery-Config files and Runbooks - 2.3.26 (GB200/GB300 NVL72, DGX B200/B300)
Note
NVIDIA Resiliency Extensions (NVRx) is not bundled with Mission Control. It must be installed separately. Please follow the installation instructions at: https://nvidia.github.io/nvidia-resiliency-ext/
NVIDIA Mission Control 2.3.0 - Software Bill of Materials (SBOM) for DGX B200/B300 Systems#
NVIDIA Mission Control 2.3.0 - Software Bill of Materials (SBOM) for NVIDIA GB200/GB300 NVL72 Systems#
Software Downloads, Installation, and Activation#
Product Documentation#
NVIDIA DGX SuperPOD and BasePOD Deployment Guide with DGX B200 Systems.
Released NVIDIA Mission Control 2.3.0 Documentation, featuring updates across recovery engines and telemetry dashboards, plus DGX B200/B300 and NVIDIA GB200/GB300 NVL72 system support.