NVIDIA Mission Control 2.0.0 Release Notes#

Overview#

NVIDIA Mission Control 2.0.0 delivers a major upgrade to the platform’s fault-tolerance, observability, and orchestration capabilities. This release adds support for NVIDIA Run:ai 2.22 and introduces the Autonomous Recovery Engine, a unified resiliency framework for autonomous job and hardware recovery on NVIDIA GB200 NVL72 systems. Enhanced dashboards, including a new inventory view, improve infrastructure visibility, and a new wizard for NVIDIA Base Command Manager installation streamlines setup and integration workflows.

NVIDIA GB200 NVL72 (including DGX GB200):

  • Support for NVIDIA Run:ai for AI workload orchestration.

  • Autonomous Recovery Engine for system-level resiliency.

  • Telemetry and observability enhancements, with a customizable, pre-configured inventory dashboard that provides complete infrastructure visibility.

These updates are designed to streamline operations and improve infrastructure reliability across NVL72 deployments.

NVIDIA DGX B200:

  • Integration with NVIDIA Base Command Manager 11.25.08 for simplified software deployment and inventory visibility.

  • Support for Slurm and NVIDIA Run:ai 2.22, enabling distributed inference and smarter workload management.

These updates streamline cluster deployment and enhance uptime for DGX B200 systems.

What’s New#

Autonomous Recovery Engine#

NVIDIA Mission Control integrates three core components, namely autonomous job recovery, autonomous hardware recovery, and NVIDIA Resiliency Extension (NVRx), to deliver an end-to-end resiliency fabric for NVIDIA GB200 NVL72 systems.

Key autonomous job recovery capabilities include:#

  • Root Cause Insights with Fault Attribution and Characterization (FACT): Quickly understand why issues occurred with automated root cause analysis, helping reduce downtime and improve system reliability.

  • Job Auto-Recovery: Jobs now resume automatically, isolating problematic nodes to ensure smoother recovery without manual intervention.

  • Consistent Training Logs: Standardized logging formats across training workflows make it easier to track progress, troubleshoot, and maintain audit trails.

  • Intuitive Recovery Dashboard: The autonomous job recovery user interface, referred to as the cockpit UI, provides clear visibility into recovery actions, including summaries of what went wrong and how it was resolved.

  • Real-time Notifications: Stay informed with timely alerts for recovery actions and system anomalies, enabling swift response and ensuring smooth operations.
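The Job Auto-Recovery behavior described above (resume automatically while isolating problematic nodes) can be illustrated with a minimal sketch. This is not Mission Control's implementation or API; `run_with_auto_recovery`, `run_job`, and `min_nodes` are hypothetical names introduced only for illustration, and the sketch assumes the job runner can report which nodes it considers faulty.

```python
from typing import Callable, Iterable, List, Set


def run_with_auto_recovery(
    nodes: Iterable[str],
    run_job: Callable[[List[str]], Set[str]],
    min_nodes: int,
    max_attempts: int = 3,
) -> List[str]:
    """Retry a job, excluding nodes reported as faulty after each failed attempt.

    run_job is a hypothetical callback: it runs the job on the given nodes and
    returns the set of nodes it found faulty (an empty set means success).
    Returns the node list on which the job finally succeeded.
    """
    healthy = list(nodes)
    for _ in range(max_attempts):
        if len(healthy) < min_nodes:
            raise RuntimeError("not enough healthy nodes to resume the job")
        faulty = run_job(healthy)
        if not faulty:
            return healthy
        # Isolate the faulty nodes before the next attempt.
        healthy = [n for n in healthy if n not in faulty]
    raise RuntimeError(f"job did not recover after {max_attempts} attempts")
```

In this sketch a job that fails on one of three nodes is retried on the remaining two, provided `min_nodes` still allows it; a real system would also resume from the latest checkpoint rather than restart from scratch.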

Key autonomous hardware recovery capabilities include:#

  • Reliable Failover with Backup Support: The system now recovers smoothly from hardware issues with automatic failover and periodic backups, reducing the need for manual intervention.

  • Smarter Resource Management: Jobs and runs can be cancelled proactively to avoid wasting compute resources when hardware issues are detected.

  • Secure Authentication Built-in: Integration with a secret store ensures secure and seamless authentication for Base Command Manager (BCM) operations, enhancing system security.

  • Streamlined Compute and NVLink Switch Tray Recovery Workflows: Simplifies break/fix workflows for NVIDIA GB200 NVL72 tray failures, helping reduce downtime and accelerate resolution.

  • Faster Rack-Wide Recovery: Asynchronous operations enable parallel remediation across racks, speeding up recovery and minimizing impact on workloads.
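To show why asynchronous operations shorten rack-wide recovery, here is a small sketch using Python's standard `asyncio` module. It is an illustration only: `remediate_rack` and `remediate_all` are hypothetical names, and the `asyncio.sleep` call stands in for whatever break/fix steps a real workflow performs.

```python
import asyncio
from typing import Dict, List


async def remediate_rack(rack: str, delay_s: float = 0.01) -> str:
    """Stand-in for one rack's remediation workflow (hypothetical)."""
    await asyncio.sleep(delay_s)  # placeholder for real break/fix steps
    return f"{rack}: remediated"


async def remediate_all(racks: List[str]) -> Dict[str, str]:
    """Run remediation for all racks concurrently and collect per-rack results."""
    results = await asyncio.gather(*(remediate_rack(r) for r in racks))
    return dict(zip(racks, results))
```

Calling `asyncio.run(remediate_all(["rack-a", "rack-b"]))` runs both remediations concurrently, so total time is close to the slowest rack rather than the sum of all racks.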

Key NVIDIA Resiliency Extension (NVRx) capabilities include:#

  • Asynchronous Checkpointing: Checkpoints are saved in the background, ensuring training continues uninterrupted while preserving progress.

  • Straggler Detection: Identifies slow-performing nodes to maintain consistent throughput and improve overall training efficiency.

  • Framework Integration: Now compatible with popular deep learning frameworks such as PyTorch Lightning and NVIDIA NeMo, making it easier to integrate resiliency into existing workflows.
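The two techniques named above, asynchronous checkpointing and straggler detection, can be sketched in generic Python. This does not use or represent the NVRx API; `save_checkpoint_async`, `find_stragglers`, the JSON file format, and the median-based threshold are all assumptions chosen to keep the example self-contained.

```python
import copy
import json
import os
import statistics
import threading
from typing import Dict, List


def save_checkpoint_async(state: dict, path: str) -> threading.Thread:
    """Snapshot the state, then write it to disk on a background thread."""
    # Copy first so the training loop can keep mutating `state` safely.
    snapshot = copy.deepcopy(state)

    def _write() -> None:
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(snapshot, f)
        os.replace(tmp, path)  # atomic rename: readers never see a partial file

    thread = threading.Thread(target=_write)
    thread.start()
    return thread


def find_stragglers(step_times: Dict[str, float], factor: float = 1.5) -> List[str]:
    """Return nodes whose step time exceeds `factor` times the median step time."""
    median = statistics.median(step_times.values())
    return sorted(n for n, t in step_times.items() if t > factor * median)
```

The caller keeps training after `save_checkpoint_async` returns and only needs to `join()` the thread before exiting or before starting the next save to the same path. The 1.5x median threshold is an arbitrary illustrative choice, not a documented default.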

Key telemetry and observability enhancements include:#

NVIDIA Mission Control now includes powerful dashboard upgrades for NVIDIA GB200 NVL72 environments:

  • Simplified Navigation: Rack selection using NVIDIA Base Command Manager (included with NVIDIA Mission Control) is now based on the official device inventory, reducing confusion and aligning dashboard views with how racks are actually assigned and managed.

  • Expanded Network Monitoring: Dedicated panels for Unified Fabric Manager (UFM) provide better oversight of network activity and potential bottlenecks.

  • Comprehensive Hardware Overview: A new ‘Inventory Dashboard’ consolidates hardware and firmware information in one place, making it faster to confirm what’s deployed and up to date.

  • Consistent Look and Feel: All pre-built dashboards have been visually aligned for a unified presentation, making them easier to use and interpret across teams.

Support for NVIDIA Run:ai as a workload manager on NVIDIA GB200 NVL72 and DGX B200#

NVIDIA Mission Control 2.0.0 now supports NVIDIA Run:ai 2.22 on NVIDIA GB200 NVL72 and DGX B200 systems, enabling advanced workload orchestration and GPU resource management. This integration enhances scheduling flexibility, improves utilization, and simplifies AI operations.

New Updates Include:#

  • Distributed Inference Support: Submit distributed inference workloads across multiple nodes with gang scheduling, enabling scalable performance and efficient resource utilization.

  • Seamless Inference Updates: Apply updates to running inference workloads without disruption, ensuring high availability and uninterrupted service.

  • Smarter Workload Management: Flexible workload templates, categorization, and configurable priorities allow teams to standardize operations and align resource usage with business objectives.

  • Enhanced Visibility and Storage Control: Redesigned dashboards provide fine-grained GPU and quota metrics, while shared data volumes can now be created, managed, and shared across projects and departments directly from the UI.

  • Automatic Multi-Node NVLink: Compute domains are assigned automatically on supported hardware (e.g., GB200 NVL72) for distributed training workloads, ensuring high performance and low latency.
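Gang scheduling, mentioned above for distributed inference, means that all workers of a workload are placed together or not at all. The following is a minimal sketch of that all-or-nothing rule, not NVIDIA Run:ai's scheduler; `gang_schedule` and its parameters are hypothetical, and the first-fit placement order is an assumption.

```python
from typing import Dict, Optional


def gang_schedule(
    free_gpus: Dict[str, int], workers: int, gpus_per_worker: int
) -> Optional[Dict[str, int]]:
    """Place all workers or none. Returns {node: workers placed} or None."""
    placement: Dict[str, int] = {}
    remaining = workers
    for node, free in sorted(free_gpus.items()):
        if remaining == 0:
            break
        fit = min(free // gpus_per_worker, remaining)
        if fit > 0:
            placement[node] = fit
            remaining -= fit
    # All-or-nothing: a partial placement is discarded rather than started.
    return placement if remaining == 0 else None
```

With two nodes that each have four free GPUs, a request for two 4-GPU workers is placed one per node; if one node has only two free GPUs, the function returns `None` and nothing starts, which avoids a partially launched workload holding GPUs while waiting for the rest.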

NVIDIA Base Command Manager 11.25.08: New Features#

NVIDIA Mission Control includes Base Command Manager 11 for core cluster management capabilities. A new version of Base Command Manager is now available and validated with NVIDIA Mission Control 2.0.0. Refer to the NVIDIA Base Command Manager 11.25.08 release notes for installation or upgrade instructions.

NVIDIA Mission Control 2.0.0 empowers operators to:#

  • Minimize downtime with the Autonomous Recovery Engine resiliency fabric.

  • Maintain training continuity with fault-tolerant workflows.

  • Diagnose and repair known hardware issues automatically.

  • Gain in-depth visibility into system health and performance.

  • Orchestrate workloads efficiently with the NVIDIA Run:ai and Slurm workload managers, fully integrated with NVIDIA Base Command Manager (BCM).

  • Scale confidently across racks, nodes, and users.

Release Notes for each NVIDIA Mission Control 2.0.0 software component:#

NVIDIA Mission Control 2.0.0 - Software Bill of Materials (SBOM) for NVIDIA GB200 NVL72#

NVIDIA Mission Control 2.0.0 - Software Bill of Materials (SBOM) for DGX B200#

Where to Download and Install Mission Control 2.0.0#

Product Documentation#

The NVIDIA Mission Control 2.0.0 documentation has been released, featuring updates across recovery engines, telemetry dashboards, and NVIDIA GB200 NVL72 system support.

Additional Release Notes#