Mission Control autonomous hardware recovery 28.4 Release Notes
Release 28.4 includes several key platform enhancements:
Failover: Mission Control deployments now feature failover capabilities with automatic periodic backups and streamlined manual recovery processes.
Job and Run Cancellation: Cancel running jobs or remediation runs that are no longer needed. This control prevents resource waste and allows immediate course correction during testing, diagnostics, or inadvertent executions, saving time and compute resources.
Secret Store: Mission Control autonomous hardware recovery integrates with BCM for resource discovery and authentication. The secrets used to communicate with the BCM services are now persisted in encrypted data stores.
AHR Runbooks Release Notes (v1.0.4)
Break/Fix
NVIDIA Mission Control autonomous hardware recovery provides automated break/fix workflows to handle tray failures for GB200. These workflows execute a series of diagnostic steps to determine the cause of the failure, take the necessary repair steps, and create support tickets for issues that cannot be auto-resolved.
The automated break/fix workflow is designed to efficiently diagnose and remediate issues, with clear paths for different failure scenarios and comprehensive validation to ensure systems are properly restored to service.
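The diagnose, repair, and escalate pattern described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the product's implementation: `Diagnostic`, `Outcome`, `run_break_fix`, and the `open_ticket` callback are hypothetical names invented for this example, and the real runbooks may order or group their steps differently.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Diagnostic:
    """One diagnostic step (hypothetical structure, for illustration only)."""
    name: str
    check: Callable[[str], bool]                    # True when the tray passes
    repair: Optional[Callable[[str], bool]] = None  # True when the repair succeeded


@dataclass
class Outcome:
    tray: str
    status: str                          # "healthy", "repaired", or "ticketed"
    last_failed_check: Optional[str] = None
    log: List[str] = field(default_factory=list)


def run_break_fix(tray: str,
                  diagnostics: List[Diagnostic],
                  open_ticket: Callable[[str, str], None]) -> Outcome:
    """Run diagnostics in order; repair where possible, otherwise open a ticket."""
    outcome = Outcome(tray=tray, status="healthy")
    for diag in diagnostics:
        if diag.check(tray):
            outcome.log.append(f"{diag.name}: pass")
            continue

        outcome.log.append(f"{diag.name}: fail")
        outcome.last_failed_check = diag.name

        # Attempt the repair, then re-run the check to validate the fix.
        if diag.repair is not None and diag.repair(tray) and diag.check(tray):
            outcome.log.append(f"{diag.name}: repaired")
            outcome.status = "repaired"
            continue

        # No repair available, or the repair did not hold: escalate and stop.
        open_ticket(tray, diag.name)
        outcome.status = "ticketed"
        return outcome
    return outcome
```

In this sketch a failed check is re-run after its repair step, so a tray counts as repaired only once the same diagnostic passes again. The first failure that cannot be repaired ends the run with a ticket.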
Post RMA
The NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA workflow automates the process of bringing hardware components back into service after a Return Merchandise Authorization (RMA) replacement. This workflow ensures that replaced hardware is properly configured, firmware is updated to the correct versions, and the component is thoroughly validated before returning to production.
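As a rough illustration of the configure, update, and validate sequence described above, the sketch below runs ordered stages and stops at the first failure, so a component is reported ready only when every stage has passed. The names `firmware_mismatches` and `run_post_rma`, and the dictionary layout of a component, are assumptions made for this example and are not taken from the product.

```python
from typing import Callable, Dict, List, Tuple

# A stage is a (name, step) pair; the step returns True on success.
Stage = Tuple[str, Callable[[dict], bool]]


def firmware_mismatches(installed: Dict[str, str],
                        expected: Dict[str, str]) -> List[str]:
    """Return the parts whose installed version differs from the expected one."""
    return sorted(name for name, version in expected.items()
                  if installed.get(name) != version)


def run_post_rma(component: dict, stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (ready_for_production, names_of_completed_stages).
    """
    completed: List[str] = []
    for name, step in stages:
        if not step(component):
            return False, completed
        completed.append(name)
    return True, completed
```

A caller would pass stages such as configuration, firmware update, and validation in that order. The validation stage can use `firmware_mismatches` to confirm that no part is still on an unexpected version.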
Firmware Upgrades
NVIDIA Mission Control autonomous hardware recovery provides functionality for upgrading, cycling, and verifying firmware and the corresponding OS within your GB200 racks. The four distinct components for which firmware can be upgraded using this process are:
Compute trays
Switches
Mellanox
NVOS
Improved Health Checks
Health checks and alarm thresholds continue to be tuned to better reflect the needs of the system. In this release, we’ve removed a number of checks that have no impact on GPU workloads and improved the efficiency of many of the remaining checks.