Connecting to NVIDIA Mission Control autonomous hardware recovery#
NVIDIA Mission Control autonomous hardware recovery relies on BCM’s LDAP authentication, so users log in to NVIDIA Mission Control autonomous hardware recovery with their BCM credentials.
Upon successful authentication, the user session receives a short-lived JWT and a refresh token. The JWT carries the user’s identity and is refreshed by the UI when it expires.
An administrator can opt to manage users and groups in BCM; any changes are automatically reflected within NVIDIA Mission Control autonomous hardware recovery. There are two ways to access NVIDIA Mission Control autonomous hardware recovery: authenticate using SSO with your BCM identity, or go to the URL of the NVIDIA Mission Control autonomous hardware recovery UI, where you will be presented with a login screen.
Login screen
Once you click the login button, the authentication page appears, where you can enter your credentials.
Main landing page
Users are authorized to perform different activities within NVIDIA Mission Control autonomous hardware recovery through permission policies. Policies determine whether a user can view resources and execute actions (named and/or anonymous). Action execution can be limited to a maximum number of impacted resources and/or to specific resources. Permissions can also be attached to runbooks to allow or disallow certain users or groups.
New users must first be created in BCM before they can access NVIDIA Mission Control autonomous hardware recovery. Upon logging in, every user is assigned a default permission policy whose permissions are determined by the administrator. The administrator role has permission to perform any action in NVIDIA Mission Control autonomous hardware recovery. The configurator role has permission to create, edit, and delete any artifacts in NVIDIA Mission Control autonomous hardware recovery. By default, new users are granted the administrator and configurator roles until those privileges are overridden. Defaults can be modified in the Access Control section of the NVIDIA Mission Control autonomous hardware recovery UI.
Dashboard#
The NVIDIA Mission Control autonomous hardware recovery dashboard provides a comprehensive view of the status of all resources in the cluster, displaying the progress of testing at various stages for each node (control, compute, and switch nodes). It shows which tests are completed, which have passed, and which have failed. This allows users to easily track the overall health and status of the cluster, identify issues, and assess the readiness of each resource.
Cluster Validation#
The NVIDIA Mission Control autonomous hardware recovery portal enables efficient automation of baseline testing procedures. This comprehensive testing framework can be flexibly executed across various scales of infrastructure, from individual compute nodes to complete racks or multi-rack configurations. The system performs extensive validation of critical compute node components, including CPU, GPU, memory, and storage functionality, network connectivity, and firmware versioning. Additionally, it incorporates industry-standard performance benchmarks such as HPL, NCCL, and Nemotron (a large language model) to assess system capabilities. This streamlined approach significantly enhances both testing efficiency and thoroughness while reducing execution time.
Entrypoint of Automated Testing#
A centralized automated baseline testing interface has been established to facilitate streamlined test execution and management. This unified entry point provides comprehensive access to the testing framework, enabling efficient navigation and one-click execution of all testing procedures.
To access the baseline testing interface:
Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.
Locate the “DGX_SUPERPOD_BASELINE_TESTING” runbook using the search functionality (see the interface depicted below).
Upon selecting the “DGX_SUPERPOD_BASELINE_TESTING” runbook, you will be presented with the interface shown below.
More about initializing and customizing these runbooks can be found in the extended Automated Hardware Recovery guide.
Firmware checks#
NVIDIA Mission Control autonomous hardware recovery includes firmware checks that extract the current firmware versions of the trays and switches and compare them with the expected versions specified in the Golden Config file. The Golden Config file includes the expected versions for all components, such as the OS, HMC, and CX7, and is already prepopulated.
Testing Reports#
NVIDIA Mission Control autonomous hardware recovery provides reports for baseline testing, reflecting the status of nodes (compute and switches) at each test stage. These reports help identify and troubleshoot root causes.
To access the NVIDIA Mission Control autonomous hardware recovery reports, click on Resources in the side menu, then select Reports.
The Landing page contains two tabs: Report Templates and Published Reports.
Report Templates provide templates for each stage of Baseline testing. These templates include bar graphs that display the PASS or FAIL status for different nodes during the tests. However, these templates are static and do not store any test data. This means that while you can view the templates, you cannot save or modify the test results within them.
Health Checks & Alerts#
NVIDIA Mission Control autonomous hardware recovery provides a full suite of automated health checks to detect failures at the tray, rack, and system levels for GB200. In addition, system-wide health checks are performed by integrating with the UFM and NMX-M network control planes. Health check data is reported back to BCM’s BaseView and/or the in-cluster LGTM stack. These health checks run at two layers: at BCM job invocation, and as periodic health checks via NVIDIA Mission Control autonomous hardware recovery.
Alarms Dashboard#
The alarms dashboard is an overview of all alarms and the state of your system. In this view, alarms are summarized by the count of currently firing alarms, the alarms firing most frequently, and a configurable list of the most frequently firing, canceled, or resolved alarms. This is meant to be a starting point for any investigation of possible issues with your systems, and you may click any alarm for further details.
BCM Slurm Job Lifecycle Checks (Prolog and Epilog)#
When a Slurm job is submitted, the Autonomous Hardware Recovery Agent automatically runs a set of checks at the start and end of the job to validate node health and stability. These are known as Prolog and Epilog checks.
Prolog Checks (run before the job starts):
If a check fails, the node is marked as DRAIN, and the job is re-queued.
If it passes, the job proceeds normally.
Epilog Checks (run after the job finishes):
If a check fails, the node is also marked as DRAIN.
These scripts are automatically pushed to each node when the job runs, but they are not visible or configurable through the NVIDIA Mission Control autonomous hardware recovery UI. To review them, navigate to Shoreline_files/scripts/slurm in the NVIDIA Mission Control package.
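The shipped prolog and epilog scripts are the authoritative implementation. The snippet below is only a minimal sketch of the drain-on-failure pattern described above, assuming a single hypothetical probe (checking that the `nvidia` kernel module is loaded) and the standard `scontrol update` command; it is not the script from Shoreline_files/scripts/slurm.

```python
#!/usr/bin/env python3
"""Illustrative Slurm prolog sketch (not the shipped script): drain the node
and let Slurm re-queue the job if a health probe fails."""
import socket
import subprocess
import sys


def run_node_checks() -> list:
    """Hypothetical health probe returning a list of failure descriptions."""
    failures = []
    # Example probe: verify the NVIDIA kernel module is loaded.
    with open("/proc/modules") as modules:
        if not any(line.startswith("nvidia ") for line in modules):
            failures.append("nvidia kernel module not loaded")
    return failures


def main() -> int:
    node = socket.gethostname().split(".")[0]
    failures = run_node_checks()
    if failures:
        reason = "prolog check failed: " + "; ".join(failures)
        # Mark the node DRAIN so Slurm stops scheduling work onto it.
        subprocess.run(
            ["scontrol", "update", f"NodeName={node}",
             "State=DRAIN", f"Reason={reason}"],
            check=False,
        )
        return 1  # a non-zero prolog exit code causes Slurm to re-queue the job
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

An epilog variant would run the same probes after the job completes and drain the node on failure without re-queueing the job.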
Note: Prolog and Epilog checks are disabled by default and should only be enabled after the nodes are confirmed to be healthy. Use the following runbooks to manage them:
SLURM_CHECKS_ENABLE – enables the checks
SLURM_CHECKS_DISABLE – disables the checks
Unlike the Prolog and Epilog checks, Periodic Checks are defined within the NVIDIA Mission Control autonomous hardware recovery interface as Alarms and are detailed in the next section. They are automatically enabled for racks that pass Single Rack Testing, but may also be manually enabled or disabled for specific racks by running the “ALARMS_RACK_ENABLE” and “ALARMS_RACK_DISABLE” runbooks.
Periodic Health Checks (Alarms)#
Periodic Checks are separate from the Prolog and Epilog checks and run at regular intervals to monitor system health. These are managed as Alarms in the NVIDIA Mission Control autonomous hardware recovery UI.
Automatically enabled for racks that pass Single Rack Testing
Automatically disabled during firmware upgrade and Break/fix
Can be manually enabled or disabled at any time
These checks run against the alarm_base_no_switch_on_rack Resource Query. To control them manually, use the following runbooks:
ALARMS_RACK_ENABLE – enables periodic alarms for selected racks. Please note, a node carrying the maintenance tag will override these settings.
ALARMS_RACK_DISABLE – disables periodic alarms for selected racks
Periodic Checks are fully visible and configurable in the UI through the alarm section. Below is a list of the configured Alarms, grouped by their check interval:
Frequent Checks (5m)#
bmc_sensors#
Checks the sensors from the Baseboard Management Controller (BMC) to ensure the proper data is returned.
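As a rough sketch of this type of probe (not the alarm’s actual implementation), the snippet below shells out to `ipmitool sdr elist` in-band and treats any sensor whose status column is neither `ok` nor `ns` (no reading) as a failure.

```python
import subprocess

def bmc_sensors_ok() -> bool:
    """Read the BMC sensor repository in-band and flag any non-ok sensor status."""
    proc = subprocess.run(["ipmitool", "sdr", "elist"], capture_output=True, text=True)
    if proc.returncode != 0 or not proc.stdout.strip():
        return False  # the BMC did not return sensor data
    for line in proc.stdout.splitlines():
        # Typical row: "CPU0 Temp        | 01h | ok  |  3.1 | 45 degrees C"
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 3 and fields[2] not in ("ok", "ns"):
            print("sensor not ok:", fields[0], fields[2])
            return False
    return True

if __name__ == "__main__":
    print("bmc_sensors:", "PASS" if bmc_sensors_ok() else "FAIL")
```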
sysmem#
Checks that all expected memory DIMMs are present.
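A minimal sketch of a DIMM-count probe, assuming `dmidecode` is available and run as root; the expected count is a placeholder, and the shipped check may work differently.

```python
import subprocess

EXPECTED_DIMM_COUNT = 32  # assumption: set to the platform's expected DIMM population

def populated_dimm_count() -> int:
    """Count populated DIMM slots reported by `dmidecode -t memory` (requires root)."""
    out = subprocess.run(["dmidecode", "-t", "memory"],
                         capture_output=True, text=True).stdout
    count = 0
    for line in out.splitlines():
        line = line.strip()
        # Empty slots report "Size: No Module Installed".
        if line.startswith("Size:") and "No Module Installed" not in line:
            count += 1
    return count

if __name__ == "__main__":
    found = populated_dimm_count()
    status = "PASS" if found == EXPECTED_DIMM_COUNT else "FAIL"
    print(f"sysmem: found {found} DIMMs, expected {EXPECTED_DIMM_COUNT}: {status}")
```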
dns_host#
Checks the DNS configuration and resolution for the host.
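An illustrative resolution probe using the standard library; the names being checked are placeholders, not the check’s real target list.

```python
import socket

def dns_resolves(hostname: str) -> bool:
    """Return True if the name resolves through the host's configured resolvers."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

if __name__ == "__main__":
    # "bcm-headnode" is a placeholder; the real check targets cluster-specific names.
    for name in (socket.gethostname(), "bcm-headnode"):
        print(f"dns_host {name}:", "PASS" if dns_resolves(name) else "FAIL")
```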
eth_state#
Checks via ibstat that the CX7 devices are present, active, and in the LinkUp physical state, and that they are running at the expected transfer rate.
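A sketch of the same idea, parsing `ibstat` output and assuming an NDR-class expected rate of 400; the field names follow common `ibstat` formatting and the real check may parse differently.

```python
import subprocess

EXPECTED_RATE = "400"  # assumption: set to the fabric's expected port rate

def cx7_ports_ok() -> bool:
    """Parse `ibstat` and require every port to be Active/LinkUp at the expected rate."""
    out = subprocess.run(["ibstat"], capture_output=True, text=True).stdout
    if not out.strip():
        return False  # no devices reported
    ok, state, phys = True, None, None
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("State:"):
            state = line.split(":", 1)[1].strip()
        elif line.startswith("Physical state:"):
            phys = line.split(":", 1)[1].strip()
        elif line.startswith("Rate:"):
            rate = line.split(":", 1)[1].strip()
            if state != "Active" or phys != "LinkUp" or rate != EXPECTED_RATE:
                ok = False
    return ok

if __name__ == "__main__":
    print("eth_state:", "PASS" if cx7_ports_ok() else "FAIL")
```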
raid_count#
Checks that the RAID configuration matches the expected mdstat configuration.
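For illustration, a probe of this kind can be sketched against /proc/mdstat; the expected array count is a placeholder and the shipped check may compare more fields.

```python
EXPECTED_MD_ARRAYS = 1  # assumption: number of md arrays expected on the host

def mdstat_ok() -> bool:
    """Check /proc/mdstat for the expected array count and no degraded members."""
    arrays, degraded = 0, False
    with open("/proc/mdstat") as mdstat:
        for line in mdstat:
            if line.startswith("md"):
                arrays += 1
            # Status lines end with something like "[2/2] [UU]"; "_" marks a missing member.
            last = line.rsplit("[", 1)[-1]
            if line.strip().endswith("]") and "_" in last:
                degraded = True
    return arrays == EXPECTED_MD_ARRAYS and not degraded

if __name__ == "__main__":
    print("raid_count:", "PASS" if mdstat_ok() else "FAIL")
```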
gpu_temp_history#
Checks the System Event Log (SEL) history for GPU temperature issues.
gpu_alloc_temp#
Checks if the GPU temperatures are above a threshold.
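A minimal sketch of a threshold check via `nvidia-smi`; the 85 °C threshold is an assumption, not the configured value.

```python
import subprocess

TEMP_THRESHOLD_C = 85  # assumption: substitute the threshold configured for the platform

def gpu_temps_ok() -> bool:
    """Query GPU temperatures with nvidia-smi and compare them against the threshold."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout
    ok = True
    for line in out.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        if int(temp) >= TEMP_THRESHOLD_C:
            print(f"GPU {index} at {temp} C exceeds {TEMP_THRESHOLD_C} C")
            ok = False
    return ok

if __name__ == "__main__":
    print("gpu_alloc_temp:", "PASS" if gpu_temps_ok() else "FAIL")
```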
sysctl_rp_filter#
Verifies that reverse path filtering (rp_filter) is set to strict.
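A sketch of this verification, reading the per-interface rp_filter values from /proc/sys (1 = strict).

```python
import glob

def rp_filter_strict() -> bool:
    """Verify rp_filter is 1 (strict) for every interface entry under /proc/sys."""
    for path in glob.glob("/proc/sys/net/ipv4/conf/*/rp_filter"):
        with open(path) as setting:
            if setting.read().strip() != "1":
                print(f"{path} is not set to strict (1)")
                return False
    return True

if __name__ == "__main__":
    print("sysctl_rp_filter:", "PASS" if rp_filter_strict() else "FAIL")
```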
periodic_bmc_host_checks#
The following groups of periodic functional checks are a subset of the BCM Prolog checks that run at predefined intervals as NVIDIA Mission Control autonomous hardware recovery Alarms.
check_bmc_ipmi_version - Checks BMC IPMI version against an expected value
check_nvidia_module_loaded - Verifies the NVIDIA module is loaded in the host OS
check_host_os_version - Verifies the DGX OS version matches the expected value
check_nvsm_status - Verifies that the NVSM service is currently active
periodic_cpu_mem_checks#
check_cpu_health - Verifies CPU sockets and cores are present and online
check_dimm_count - Checks that all expected memory DIMMs are present
check_dimm_size - Checks that the size of each memory DIMM matches the expected values
check_memory_swap_size - Checks that the memory swap size matches the expected value
periodic_gpu_nvlink_checks#
check_gpu_pci - Checks that all GPUs are present on the lspci interface and with the correct link width and speed
check_gpu_error - Checks GPUs for ECC errors, retired pages, and throttles present
check_gpu_powerstate - Checks the powerstate for each GPU and compares against an expected value
check_gpu_param - Checks that specified GPU parameters are present and correct for the host
check_nvlink_health - Checks that the NVLinks for each GPU are active, running at the correct speed and at full bandwidth, have completed fabric registration, and belong to the same NVLink domain and partition
check_gpu_topology - Checks that there are no issues with the p2p topology within the node
check_gpu_telemetry - Checks that various sensors can be successfully read from the GPU via nvidia-smi
check_gpu_power_limit - Checks that the power limit is correct for each GPU
check_nvidia_inforom_ver - Checks that the inforom version is correct for each GPU
check_gpu_clock_info - Checks that the maximum clock speed is correct for each GPU
check_remapped_row - Checks if any remapped row events have occurred
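As an illustration of the kind of probing behind check_gpu_error and check_remapped_row, the sketch below queries ECC counters and row-remapping status through `nvidia-smi`; the exact fields and pass criteria used by the shipped checks may differ.

```python
import subprocess

def _csv(cmd: list) -> list:
    """Run an nvidia-smi query and return rows of stripped CSV fields."""
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    return [[f.strip() for f in line.split(",")] for line in out.strip().splitlines()]

def gpu_errors_ok() -> bool:
    """Fail on uncorrected volatile ECC errors or pending/failed row remappings."""
    ok = True
    for index, ecc in _csv(["nvidia-smi",
                            "--query-gpu=index,ecc.errors.uncorrected.volatile.total",
                            "--format=csv,noheader,nounits"]):
        if ecc not in ("0", "[N/A]"):
            print(f"GPU {index}: uncorrected ECC errors = {ecc}")
            ok = False
    for bus_id, pending, failure in _csv(["nvidia-smi",
                                          "--query-remapped-rows="
                                          "gpu_bus_id,remapped_rows.pending,remapped_rows.failure",
                                          "--format=csv,noheader"]):
        # Healthy GPUs report "No" (or 0, depending on driver) for both columns.
        if pending not in ("No", "0") or failure not in ("No", "0"):
            print(f"{bus_id}: remapped rows pending={pending} failure={failure}")
            ok = False
    return ok

if __name__ == "__main__":
    print("gpu error checks:", "PASS" if gpu_errors_ok() else "FAIL")
```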
periodic_network_checks#
check_ib_ber_and_ro - Checks if the PCI_WR_ORDERING field is set to relaxed and also the bit error rate of the CX7 using mlxlink
check_ib_port_rcv_errors - Checks InfiniBand device port receive (RCV) errors
check_ib_cables - Checks the cable info using mlxcables
check_bf3_speed - Validates that the BF3 devices are operating at the correct speed and that the proper number of devices are in the “Up” state. This check will run but will never fail
periodic_storage_checks#
check_pex_switch_health - Checks that the PEX switches are present, have the correct PCIe link speed and width, and the downstream devices have enumerated to lspci
check_cx7_config - Checks that the CX7 devices have the correct PCIe link speed and width via lspci and ACS config via setpci
check_nvme_health - Checks that the PCIe link speed and width of each NVMe device matches the expected value
check_storage_dir - Checks that the host has functional access to the home storage
check_storage_util - Checks that the used local storage on the host is below a given threshold
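For example, a used-space threshold probe along the lines of check_storage_util could look like this sketch (the 90% threshold is an assumption):

```python
import shutil

USAGE_THRESHOLD = 0.90  # assumption: alarm when local storage is more than 90% used

def local_storage_ok(path: str = "/") -> bool:
    """Compare used/total space of the filesystem containing `path` to the threshold."""
    usage = shutil.disk_usage(path)
    used_fraction = (usage.total - usage.free) / usage.total
    return used_fraction < USAGE_THRESHOLD

if __name__ == "__main__":
    print("check_storage_util:", "PASS" if local_storage_ok() else "FAIL")
```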
periodic_error_checks#
Checks journald for machine check events, Xid, AER, CPER, and I/O errors, “GPU has fallen off the bus” messages, and other generic errors.
Hourly Checks#
nfs_mounts#
Verifies required mount points.
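A minimal sketch of such a mount check; the required mount points listed are placeholders for the cluster’s actual NFS mounts.

```python
# Placeholders: replace with the mount points the cluster actually requires.
REQUIRED_MOUNTS = ["/home", "/cm/shared"]

def required_mounts_present() -> bool:
    """Check /proc/mounts for each required mount point."""
    with open("/proc/mounts") as mounts:
        mounted = {line.split()[1] for line in mounts}
    missing = [m for m in REQUIRED_MOUNTS if m not in mounted]
    if missing:
        print("missing mounts:", ", ".join(missing))
    return not missing

if __name__ == "__main__":
    print("nfs_mounts:", "PASS" if required_mounts_present() else "FAIL")
```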
daily_informational#
Checks whose severity and remediation may not be critical. This alarm is triggered only once per day, and results may be viewed in the resulting runbook run.
check_sel_event - Read the SEL events from the BMC and ensure none are asserted
check_dgx_os_version - Verifies the DGX OS version matches the expected value
check_gpu_vbios_ver - Checks the VBIOS version of the GPUs and compares against an expected value
check_nvme_fw_ver - Checks that the FW version for each NVMe matches an expected value
check_kernel_commandline_opt - Verifies that the specified kernel option(s) are present in the current kernel’s boot parameters
check_host_bios_ver - Verifies the system’s BIOS version
check_kernel_ver - Verifies the current version of the Linux kernel
check_host_package_versions - Queries the installed packages on the host
nv_container_cli_info - Retrieves information about the NVIDIA container CLI (driver and devices)
Daily Checks#
cpu_stepping#
Checks that the CPU stepping parameter is correct for each CPU.
numa_node_count#
Checks that the correct number of non-uniform memory access (NUMA) nodes is configured for the CPU cores.
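A sketch of the same idea, counting the nodes exposed under /sys/devices/system/node; the expected count of 2 is an assumption for a two-socket layout.

```python
import glob

EXPECTED_NUMA_NODES = 2  # assumption: e.g. one NUMA node per CPU socket

def numa_node_count() -> int:
    """Count NUMA nodes exposed under /sys/devices/system/node."""
    return len(glob.glob("/sys/devices/system/node/node[0-9]*"))

if __name__ == "__main__":
    found = numa_node_count()
    status = "PASS" if found == EXPECTED_NUMA_NODES else "FAIL"
    print(f"numa_node_count: found {found}, expected {EXPECTED_NUMA_NODES}: {status}")
```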
NVIDIA Mission Control autonomous hardware recovery Alarm Configuration#
There are several components to an alarm, with the key pieces being the Resource Query, Fire Query, Resolve Query, Check Interval and Automation. Below you can see an example configuration.
Resource Query#
The resource query allows you to customize the resources (hosts, pods, GPUs) on which the checks are performed. In the example above, `hosts | rack_name =~ “.*”` restricts the alarm to hosts that have a value set for the “rack_name” tag.
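For example, a narrower query such as `hosts | rack_name =~ “rack-b07”` (a hypothetical tag value) would restrict the same alarm to hosts in that one rack.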
Fire Query#
The fire query is a condition that, when true, will cause the alarm to begin firing. It will be run at each interval.
Resolve Query#
Similar to the fire query, the Resolve query is a condition that will resolve the alarm when true. Resolving an alarm will cause firing to cease and the state to change to Resolved.
Check Interval#
The interval at which the fire and resolve queries are checked.
Alarm States#
If an Alarm has triggered, it will be in one of the following three states:
Triggered#
This state means the alarm is currently firing. Any automation (break/fix) will subsequently be invoked to remediate the issue, potentially resolving the alarm. Alternatively, the user can cancel the alarm by clicking “Cancel alarm.”
Clicking into the triggering alarm will give you more details on what caused the alarm, metadata and resources relating to the alarm, and will also allow you to view log output from the check itself.
Resolved#
When the resolve query of an alarm evaluates to true for a firing alarm, its status changes to Resolved. Automation-triggered runbooks invoke break/fix operations that should be configured to result in a resolved alarm.
Canceled#
When a user cancels an alarm from the dashboard, or from the triggered alarm itself, its state will become Canceled. Also, if an alarm configuration is changed for an alarm in the Triggered state, it will be canceled since it was triggered against a defunct configuration.
Firmware Upgrades with NVIDIA Mission Control autonomous hardware recovery#
NVIDIA Mission Control autonomous hardware recovery provides functionality for upgrading, cycling, and verifying firmware and the corresponding OS within your GB200 racks. The four distinct components for which firmware can be upgraded using this process are:
Compute trays
Switches
Mellanox
NVOS
The workflow invocation is performed via autonomous hardware recovery’s Runbooks. To view all firmware upgrade related runbooks, search by the FIRMWARE_UPGRADE label.
NVIDIA Mission Control autonomous hardware recovery Break/Fix Workflow#
NVIDIA Mission Control autonomous hardware recovery provides automated break/fix workflows to handle tray failures for GB200. These workflows execute a series of diagnostic steps to determine the cause of the failure, take the necessary repair steps, and, when opted in to the support ticket service, create support tickets for issues that cannot be auto-resolved.
The automated break/fix workflow is designed to efficiently diagnose and remediate issues, with clear paths for different failure scenarios and comprehensive validation to ensure systems are properly restored to service.
Key Features#
Automatic Detection: Identifies drained nodes without manual intervention
Intelligent Triage: Routes to appropriate diagnostic workflows based on failure symptoms
Comprehensive Diagnostics: Performs thorough hardware and software checks
Automated Remediation: Attempts to resolve issues without human intervention when possible
Detailed Reporting: Provides comprehensive logs for RMA or further troubleshooting
Entrypoint of Break/Fix Workflow#
A centralized automated break/fix interface has been established to facilitate streamlined diagnostics and remediation. This unified entry point provides comprehensive access to the break/fix framework, enabling efficient navigation and implementation of all remediation procedures.
Break/Fix Workflow Components#
The break/fix system consists of several key components that work together to diagnose and remediate issues with compute trays. Below is a detailed explanation of each component:
BREAKFIX_TRIGGER#
The entry point runbook that:
Runs automatically every 5 minutes via time trigger
Checks for any drained nodes in BCM
Initiates the triage process for affected nodes
Routes to the appropriate diagnostic workflow
BREAKFIX_COMPUTE_TRAY_TRIAGE#
This runbook is automatically triggered by BREAKFIX_TRIGGER and performs comprehensive triage on drained compute trays.
GPU_RECOVERY#
This specialized diagnostic runbook is automatically invoked by BREAKFIX_COMPUTE_TRAY_TRIAGE when GPU-related issues are detected.
BREAKFIX_COMPUTE_TRAY_VALIDATION#
This runbook is automatically executed after remediation actions to validate system health as part of the automated workflow following triage and recovery operations.
BREAKFIX_DIAG_DUMP#
This runbook is automatically triggered when validation tests fail, collecting comprehensive diagnostic information for support ticket creation.
NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA#
The NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA workflow automates the process of bringing hardware components back into service after a Return Merchandise Authorization (RMA) replacement. This workflow ensures that replaced hardware is properly configured, firmware is updated to the correct versions, and the component is thoroughly validated before returning to production.
Key Features#
Automated Configuration: Configures replaced hardware components with proper settings
Firmware Updates: Updates firmware to match the required versions for the environment
Boot Order Correction: Ensures proper boot sequence for reliable operation
Comprehensive Validation: Performs thorough testing to verify hardware functionality
Seamless Integration: Automatically returns validated hardware to service
Post RMA Workflow Components#
The Post RMA workflow consists of several key steps that ensure replaced hardware is properly configured and validated:
Physical Replacement Procedures#
Compute Tray Removal: Detailed step-by-step instructions for safely removing failed compute trays, including power down procedures, cable disconnection, and proper handling
Compute Tray Installation: Comprehensive installation guide covering component migration (M.2 boot drive, E1.S cache drives, HMC, BMC, TPM), rail installation, and cable reconnection
Component Migration: Transfer of critical components from old tray to new tray while maintaining proper slot assignments and ESD protection
BCM Inventory Update#
ONLY REQUIRED FOR NEW HARDWARE: Updates BCM inventory information using the BREAKFIX_POST_RMA_UPDATE_BCM_INVENTORY runbook when a new compute tray is installed
Skip this step for repaired trays as MAC addresses remain unchanged
Ensures MAC addresses and other hardware identifiers are correctly registered in BCM (new MAC addresses are provided by the Enterprise Support team who manage serial numbers and asset inventory for customer deployments)
Enables proper management and monitoring of the replaced hardware
BMC Credential Management#
Creates necessary BMC credential files for secure access to hardware components
Establishes secure communication channels for configuration operations
BF3 Configuration#
Checks if BF3 devices are in NIC mode
Enables OPROM on BF3 devices to ensure proper initialization
Configures hardware components for optimal operation
Boot Order Correction#
Ensures the boot sequence is properly configured
Prevents boot failures and improves system reliability
Performs power reset through BMC after configuration changes
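The runbooks perform these steps internally; purely as an illustration of the underlying operations, an out-of-band boot-order change and power cycle could be issued with `ipmitool` along these lines (the BMC address and credentials are placeholders):

```python
import subprocess

# Placeholders: the real values come from BCM and the BMC credential files.
BMC_HOST = "203.0.113.10"
BMC_USER = "admin"
BMC_PASSWORD = "********"

def ipmi(*args: str) -> None:
    """Run an out-of-band ipmitool command against the tray's BMC."""
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
         "-U", BMC_USER, "-P", BMC_PASSWORD, *args],
        check=True,
    )

# Persistently prefer disk boot, then power-cycle so the change takes effect.
ipmi("chassis", "bootdev", "disk", "options=persistent")
ipmi("chassis", "power", "cycle")
```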
Connectivity Verification#
Verifies SSH connectivity to compute nodes
Checks BCM device status to ensure proper registration
Confirms network accessibility before proceeding with firmware updates
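As a rough illustration of this verification step (not the runbook’s implementation), the sketch below probes the node’s SSH port and asks BCM for the device status via `cmsh`; the cmsh invocation assumes it is run on the head node and the node name is a placeholder.

```python
import socket
import subprocess

def ssh_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Cheap reachability probe: can a TCP connection be opened to the node's SSH port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def bcm_device_status(node: str) -> str:
    """Ask BCM for the node's device status via cmsh (intended to run on the head node)."""
    out = subprocess.run(["cmsh", "-c", f"device; status {node}"],
                         capture_output=True, text=True).stdout
    return out.strip()

if __name__ == "__main__":
    node = "node001"  # placeholder node name
    print("ssh reachable:", ssh_reachable(node))
    print("bcm status:", bcm_device_status(node))
```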
Firmware Updates#
Compute Firmware: Updates compute firmware using BREAKFIX_FIRMWARE_UPGRADE_COMPUTE_POST_RMA runbook
Mellanox Firmware: Updates BF3 and CX7 firmware using BREAKFIX_FIRMWARE_UPGRADE_MELLANOX_POST_RMA runbook
Ensures all hardware components are running the correct firmware versions
System Validation#
Waits for hosts to come back online after each firmware update cycle
Verifies agent connectivity to ensure management capabilities
Runs comprehensive validation tests using BREAKFIX_COMPUTE_TRAY_VALIDATION
Opens nodes in BCM and validates Slurm readiness for successful nodes
Post RMA Workflow Results#
After successful completion of the Post RMA workflow:
Hardware Configuration:
Physical components (M.2, E1.S drives, HMC, BMC, TPM) properly migrated to new tray
BF3 devices configured in NIC mode with OPROM enabled
Boot order corrected for reliable system startup
Power management and connectivity verified
Firmware Updates:
Compute firmware updated to specified versions via BREAKFIX_FIRMWARE_UPGRADE_COMPUTE_POST_RMA
Mellanox BF3 and CX7 firmware updated to specified versions via BREAKFIX_FIRMWARE_UPGRADE_MELLANOX_POST_RMA
All firmware components validated against expected versions
System Integration:
SSH connectivity to compute nodes verified
BCM device status confirmed and registered
Agent connectivity established for management capabilities
Comprehensive validation tests passed via BREAKFIX_COMPUTE_TRAY_VALIDATION
Service Restoration:
Nodes automatically opened in BCM for job scheduling
Slurm readiness validated for successful nodes
Hardware returned to production service automatically
Failure Handling: For any components that fail validation:
System maintains them in non-production state with maintenance tags
Detailed error logs available in runbook execution cells
Manual intervention required to address specific failure causes
Nodes remain drained until issues are resolved
NVIDIA Mission Control autonomous hardware recovery Domain Triage#
The BREAKFIX_DOMAIN_TRIAGE runbook provides manual diagnostics and troubleshooting for NVSwitch and NVLink domain-level issues. This workflow can be automatically triggered by BREAKFIX_TRIGGER when an entire rack is in a broken state, or manually executed when domain-level problems are identified. It is designed to collect comprehensive diagnostic information when problems are detected at the domain level, facilitating efficient resolution and minimizing system downtime.
Domain Triage Workflow Components#
The Domain Triage workflow consists of several key steps that ensure thorough diagnosis of NVSwitch and NVLink domain issues:
Compute Node Management#
Adds AHR maintenance tags to all compute nodes in the affected rack
Drains compute nodes from Slurm to prevent workloads from running during diagnostics (no jobs will be scheduled on the entire rack)
NVSwitch Credential Management#
Retrieves BMC credentials for NVSwitches from BCM
Establishes secure access to NVSwitch components for diagnostics
Collects system rack serial numbers for identification
Diagnostic Data Collection#
Dumps BMC logs from NVSwitches to capture hardware-level events
Runs NVDebug tool to collect detailed information about NVSwitch status
Executes Nvlmapper tool to check NVLink status and connectivity
Runs PartnerDiag for comprehensive hardware diagnostics
Case Management#
Collects and organizes all diagnostic logs into a single package
Creates a Support ticket with all relevant diagnostic information
Attaches detailed logs to facilitate efficient troubleshooting
NVIDIA Mission Control autonomous hardware recovery Break/Fix Switch Post RMA#
Switch Post RMA Introduction#
The NVIDIA Mission Control autonomous hardware recovery Break/Fix Switch Post RMA workflow automates the process of bringing NVSwitch components back into service after a Return Merchandise Authorization (RMA) replacement. This comprehensive workflow includes both physical switch tray replacement procedures and automated software configuration to ensure that replaced switch hardware is properly configured, firmware is updated to the correct versions, and the component is thoroughly validated before returning to production.
Key Features#
Physical Replacement Procedures: Detailed instructions for safe switch tray removal and installation with proper cooling and power management
Compute Node Management: Adds maintenance tags and drains compute nodes during switch replacement to prevent workload interference
Switch Connectivity Verification: Establishes and verifies SSH connectivity to replaced switch components
Factory Reset and ZTP: Performs factory reset and monitors Zero Touch Provisioning for clean initialization
Firmware Updates: Updates switch firmware to match required versions using BREAKFIX_FIRMWARE_UPGRADE_SWITCH_POST_RMA
System Validation: Comprehensive testing including NMX controller verification, compute node reboots, and compute tray validation
Switch Post RMA Workflow Components#
The Switch Post RMA workflow consists of several key steps that ensure replaced switch hardware is properly configured and validated:
Physical Replacement Procedures#
Switch Tray Removal: Detailed instructions for powering down the entire rack, cooling procedures, cable disconnection, and safe tray removal
Switch Tray Installation: Comprehensive installation guide covering rail migration, tray insertion, cable reconnection, and power-on sequence
Compute Node Management#
Adds maintenance tags to all compute nodes in the affected rack
Drains compute nodes from Slurm to prevent workload interference during switch replacement
BCM Inventory Update#
ONLY REQUIRED FOR NEW HARDWARE: Updates BCM inventory information using BREAKFIX_POST_RMA_UPDATE_SWITCH_BCM_INVENTORY when a new switch is installed
Skip this step for repaired switches as MAC addresses remain unchanged
Ensures BMC MAC and COMe MAC addresses are correctly registered in BCM (new MAC addresses are provided by the Enterprise Support team who manage asset inventory for customer deployments)
Enables proper management and monitoring of the replaced switch hardware
Switch Connectivity and Configuration#
Retrieves switch IP and credentials from BCM
Verifies SSH connectivity to the switch node
Updates ZTP settings in BCM with NVOS image file configuration
Ensures the switch is reachable for configuration operations
Factory Reset and ZTP#
Performs factory default reset on the switch
Monitors Zero Touch Provisioning (ZTP) status until successful completion
Creates support tickets if ZTP fails
Switch BMC Credential Management#
Retrieves BMC credentials for NVSwitches from BCM
Establishes secure access to switch components for diagnostics
Collects system rack serial numbers for identification
Firmware Updates#
Upgrades switch firmware to the specified version using BREAKFIX_FIRMWARE_UPGRADE_SWITCH_POST_RMA
Verifies switch connectivity after firmware updates
System Health and Validation#
Performs switch tray health checks
Verifies NMX-C and NMX-T controller status on the active switch node
Reboots all compute nodes in the rack to ensure proper connectivity
Waits for agent connectivity to confirm successful recovery
Validates there are no inactive NVLinks
Runs comprehensive compute tray validation using BREAKFIX_COMPUTE_TRAY_VALIDATION