Connecting to NVIDIA Mission Control autonomous hardware recovery#

NVIDIA Mission Control autonomous hardware recovery authenticates users through BCM’s LDAP authentication, so users log in to NVIDIA Mission Control autonomous hardware recovery with their BCM credentials.

Upon successful authentication, the user session receives a short-lived JWT and a refresh token. The JWT is used for identity and is refreshed by the UI when it expires.
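The expiry-driven refresh described above can be illustrated with a minimal sketch. It assumes the token carries the standard `exp` claim; the helper names (`jwt_expiry`, `needs_refresh`, `make_demo_token`) are hypothetical and are not part of the product. The sketch only inspects claims and does not verify the signature.

```python
import base64
import json
import time
from typing import Optional


def jwt_expiry(token: str) -> int:
    """Return the `exp` claim (seconds since the epoch) from a JWT payload.

    The signature is NOT verified; this only reads the claims segment.
    """
    payload_b64 = token.split(".")[1]
    # JWT segments are unpadded base64url; restore padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return int(claims["exp"])


def needs_refresh(token: str, now: Optional[float] = None, skew_seconds: int = 0) -> bool:
    """True once the token has reached its expiry time (minus an optional skew)."""
    current = time.time() if now is None else now
    return current >= jwt_expiry(token) - skew_seconds


def make_demo_token(exp: int) -> str:
    """Build a demo token with a dummy signature (illustration only)."""
    def encode(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return f"{encode({'alg': 'none'})}.{encode({'sub': 'demo-user', 'exp': exp})}.sig"
```

A client following this pattern would call `needs_refresh` before each request and exchange the refresh token for a new JWT when it returns `True`.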

An administrator can opt to manage users and groups using BCM. Any changes are automatically reflected within NVIDIA Mission Control autonomous hardware recovery. There are two options for accessing NVIDIA Mission Control autonomous hardware recovery. The first option is to authenticate using SSO with your BCM identity. The other option is to go to the URL for the NVIDIA Mission Control autonomous hardware recovery UI, where you are presented with a login screen.

Login screen

Once you click the login button, the authentication page appears where you can enter credentials.

Main landing page

Users are authorized to perform different operations within NVIDIA Mission Control autonomous hardware recovery through configured permission policies. Policies determine whether the user can view resources and execute actions (named and/or anonymous). Action execution can be limited to a maximum number of impacted resources and/or to specific resources. Permissions can also be attached to runbooks to allow or disallow certain users or groups. Refer to the Access Control documentation for additional details.
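As a rough illustration of how such limits combine, the following sketch models a policy with an optional cap on impacted resources and an optional allow-list of resources. The `Policy` class and its field names are hypothetical and do not reflect the product's actual data model.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Policy:
    """Hypothetical permission policy; field names are illustrative only."""
    can_view: bool = True
    can_execute: bool = False
    max_impacted_resources: Optional[int] = None  # None means no limit
    allowed_resources: Optional[Set[str]] = None  # None means any resource


def may_execute(policy: Policy, targets: Set[str]) -> bool:
    """Return True when the policy permits running an action on `targets`."""
    if not policy.can_execute:
        return False
    if (policy.max_impacted_resources is not None
            and len(targets) > policy.max_impacted_resources):
        return False
    if (policy.allowed_resources is not None
            and not targets <= policy.allowed_resources):
        return False
    return True
```

For example, a policy with `max_impacted_resources=2` would reject an action that targets three nodes even if each node is individually allowed.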

New users must first be created in BCM before they can access NVIDIA Mission Control autonomous hardware recovery. Upon logging in, every user is assigned a default permission policy. The permissions of this policy are determined by the administrator. The administrator role has permission to perform any action in NVIDIA Mission Control autonomous hardware recovery. The configurator role has permission to create, edit, and delete any artifacts in NVIDIA Mission Control autonomous hardware recovery. By default, new users are granted the administer and configure roles until the privileges are overridden. Defaults can be modified in the Access Control section of the NVIDIA Mission Control autonomous hardware recovery UI.

GB300#

Automated Baseline Testing with NVIDIA Mission Control autonomous hardware recovery#

Introduction#

The NVIDIA Mission Control autonomous hardware recovery portal enables efficient automation of baseline testing procedures. This comprehensive testing framework can be flexibly executed across various scales of infrastructure, from individual compute nodes to complete racks or multiple rack configurations. The system performs extensive validation of critical compute node components, including CPU, GPU, memory, and storage functionality, network connectivity, and firmware versioning. Additionally, it incorporates industry-standard performance benchmarking tools such as HPL, NCCL, and Nemotron (a large language model) to assess system capabilities. This streamlined approach significantly enhances both testing efficiency and thoroughness while reducing execution time.

Baseline testing: Single Rack or Multi Rack test mode, then status, firmware checks, resource dashboard

Entrypoint of Automated Testing#

A centralized automated baseline testing interface has been established to facilitate streamlined test execution and management. This unified entry point provides comprehensive access to the testing framework, enabling efficient navigation and one-click execution of all testing procedures.

To access the baseline testing interface:

1. Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.
2. Locate the “DGX_SUPERPOD_BASELINE_TESTING” runbook using the search functionality (see the interface shown below).
3. Select the “DGX_SUPERPOD_BASELINE_TESTING” runbook to open the interface shown below. The subsequent sections of this documentation provide detailed guidance on executing the various testing procedures.

Run SRT (Single Rack Testing) Job#

This section describes how to initiate baseline testing in Single Rack mode when one or more racks are ready for testing.

1. Navigate to the “DGX_SUPERPOD_BASELINE_TESTING” runbook as described in the previous section.
2. Ensure the “DGX_SUPERPOD_BASELINE_TESTING_SRT” component is activated by toggling its switch control, which is located on the right side of its group of interface icons. The default deactivated state is visually distinct from the activated state. Note: When activated, the switch indicator displays as a green circle with a play icon.
3. Verify that the “DGX_SUPERPOD_BASELINE_TESTING_MRT” component is deactivated. When deactivated (‘Off’), its switch control displays as a grey circular icon containing a muted play symbol.
4. Select the “Save” button in the top right corner of the interface to preserve your settings.
5. After completing these preliminary steps, your runbook configuration should match the illustration below.
6. To initiate a run, select the “Create Run” button in the top right corner of the interface. A new window appears as shown below. For detailed information about each parameter, hover over the info icon beside it.

You are required to provide the following inputs for the runbook:

Resource filtering (flex query): Runbooks use a flexible resource query instead of hardcoding a single rack. You specify:
- resource_tag (required): Tag for the flex resource query. Use rack_name for rack-based filtering (e.g., to run on a specific rack or set of racks).
- resource_value (required): Value for the flex resource query. Set to the same value you would have used for the rack name (e.g., B05, A01, m06|m07 for multiple racks), or use “none” if you do not want to filter by that tag. This allows you to target any resource—not just a single hardcoded rack.
- FW_SOURCE_JSON_PATH: Specify the file path to the golden configuration JSON file included in the firmware package. This file defines the reference settings used for validation. If the file is not available, set this parameter to NA.
- IGNORE_LIST: Provide a list of nodes to exclude from the test, only if required. Leave the value as “none” if no nodes need to be ignored. This parameter supports regular expressions. Here are some examples:
  - Single node: node01
  - List format: ["node01", "node02"]
  - Pipe-delimited string: node01|node02
After entering the correct resource_tag and resource_value (e.g., resource_tag=rack_name, resource_value=m06 or m06|m07), select the “Create Run” button to initiate the process. A confirmation dialog will appear with a “View Run” link illustrated below. Selecting this link will redirect you to a new page displaying comprehensive job status and details. For additional information regarding job monitoring and results, please refer to the “Check Job Status and Result” section of this documentation.
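To make the three IGNORE_LIST formats concrete, here is a minimal sketch of how they could be normalized into a single regular expression. This is an assumption about the behavior, not the runbook's actual implementation; the function names are hypothetical, and matching the whole node name (rather than a substring) is an assumed choice.

```python
import json
import re
from typing import List, Optional, Pattern


def compile_ignore_list(value: str) -> Optional[Pattern[str]]:
    """Turn an IGNORE_LIST value into a compiled regex, or None for "none"."""
    text = value.strip()
    if not text or text.lower() == "none":
        return None
    if text.startswith("["):
        # List format: normalize typographic quotes, then parse as JSON.
        text = text.replace("\u201c", '"').replace("\u201d", '"')
        entries = json.loads(text)
        text = "|".join(str(entry) for entry in entries)
    # Single-node and pipe-delimited values are already valid regex alternations.
    return re.compile(text)


def filter_nodes(nodes: List[str], ignore_value: str) -> List[str]:
    """Return the nodes that remain after applying the ignore list."""
    pattern = compile_ignore_list(ignore_value)
    if pattern is None:
        return list(nodes)
    # Assumption: an entry must match the entire node name.
    return [node for node in nodes if not pattern.fullmatch(node)]
```

With nodes `node01`, `node02`, and `node03`, the values `node01|node02`, `["node01", "node02"]`, and `node0[12]` all leave only `node03` under this sketch.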

IMPORTANT NOTES:

How to get the value for resource_value (e.g. rack_name):
1. Note that the value (e.g. rack_name) is generated and captured automatically from BCM Inventory. Follow the steps below to get the value.
2. Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.
3. Click the “New Runbook” button at the top right; the page shown below appears.
4. In the central page, click “Op Statement” to create your first cell to query the resource.
5. Type “host” in the cell as your first query and press Enter to see all the host information, as in the example below.
6. You will see the value for the chosen tag (e.g. “rack_name”). If it does not show up, click “Show Panel”, type the tag name (e.g. “rack_name”) in Search, and ensure it is selected.
The predefined timeout for SRT is 4 hours. You can adjust it based on your requirements.

Run MRT (Multi Rack Testing) Job#

This section describes how to initiate baseline testing in Multi Rack mode.

1. Navigate to the “DGX_SUPERPOD_BASELINE_TESTING” runbook as described in the previous section.
2. Ensure the “DGX_SUPERPOD_BASELINE_TESTING_MRT” component is activated by toggling its switch control, which is located on the right side of its group of interface icons. The default deactivated state is visually distinct from the activated state. Note: When activated, the switch indicator displays as a green circle with a play icon.
3. Verify that the “DGX_SUPERPOD_BASELINE_TESTING_SRT” component is deactivated. When deactivated (‘Off’), its switch control displays as a grey circular icon containing a muted play symbol.
4. Select the “Save” button in the top right corner of the interface to preserve your settings.
5. After completing these preliminary steps, your runbook configuration should match the illustration below.
6. Select the “Create Run” button in the top right corner of the interface. A new window appears as illustrated below.

The identifier represents the resource filter. For rack-based runs, use resource_tag=rack_name and resource_value set to the rack designation. The system accommodates both single and multiple rack configurations, as detailed below:

Single Rack Format:
- Standard notation: resource_value (Example: m06)
Multiple Rack Format:
- Standard notation: resource_value|resource_value (Example: m06|m07)
- Critical: No spaces are permitted between rack identifiers and the delimiter (|)
After entering the correct resource_tag and resource_value, select the “Create Run” button to initiate the process. A confirmation dialog will appear with a “View Run” link illustrated below. Selecting this link will redirect you to a new page displaying comprehensive job status and details. For additional information regarding job monitoring and results, please refer to the “Check Job Status and Result” section of this documentation.
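The rack format rules above (pipe-delimited, no spaces around the delimiter) can be checked before a run is created. The following is a small sketch of such a check; `split_racks` is a hypothetical helper and is not part of the product.

```python
import re
from typing import List

# One or more rack identifiers separated by "|", with no whitespace
# and no empty entries anywhere in the value.
_RACK_LIST = re.compile(r"[^\s|]+(?:\|[^\s|]+)*")


def split_racks(resource_value: str) -> List[str]:
    """Validate a rack resource_value and return the individual rack names."""
    if _RACK_LIST.fullmatch(resource_value) is None:
        raise ValueError(
            f"invalid resource_value {resource_value!r}: use rack names "
            "separated by '|' with no spaces, for example m06|m07"
        )
    return resource_value.split("|")
```

Under this sketch, `m06` and `m06|m07` are accepted, while `m06 | m07` and `m06|` raise `ValueError`.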

IMPORTANT NOTES:

How to get the value for resource_value (e.g. rack_name):
1. Note that the value (e.g. rack_name) is generated and captured automatically from BCM Inventory. Follow the steps below to get the value.
2. Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.
3. Click the “New Runbook” button at the top right; the page shown below appears.
4. In the central page, click “Op Statement” to create your first cell to query the resource.
5. Type “host” in the cell as your first query and press Enter to see all the host information, as in the example below.
6. You will see the value for the chosen tag (e.g. “rack_name”). If it does not show up, click “Show Panel”, type the tag name (e.g. “rack_name”) in Search, and ensure it is selected.
The predefined timeout for SRT is 4 hours. You can adjust it based on your requirements.

Runbook Configurations#

Before you initiate real jobs, we’d like to provide you with a guide on how to check the Runbook Configurations.

Select “Runbook” from the left navigation panel, then use the search field on the right side of the page to find your runbook by name, as shown in the illustration below.

Below is the list of all Mission Control related runbooks, including the name and description of each.

Category	Runbook Name	Description
	DGX_SUPERPOD_BASELINE_TESTING	EntryPoint Runbook
SRT	DGX_SUPERPOD_BASELINE_TESTING_SRT	EntryPoint Runbook of SRT
	SRT1	EntryPoint Runbook of all single node health checks
	SINGLENODE_HEALTHCHECK_GPU_CPU	Baseline health checks for GPU and CPU
	SINGLENODE_HEALTHCHECK_MEMORY_STORAGE	Baseline health checks for Memory and Storage
	SINGLENODE_HEALTHCHECK_NETWORK	Baseline health checks for Network
	SINGLENODE_HEALTHCHECK_SOFTWARE	Baseline health checks for installed software
	SINGLENODE_HEALTHCHECK_FIRMWARE	Baseline health checks for firmware
	SRT2	EntryPoint Runbook of component testing
	SR_MEMORY_BENCHPRESS	Benchmark testing for memory
	SR_CUDA_SAMPLES	Benchmark testing for CUDA
	SR_P2P_IPERF	Benchmark testing for pairwise ethernet interfaces
	HPL_MXP_TEST_SINGLE_NODE_MPIRUN	HPL_MXP testing on single node separately
	HPL_MXP_TEST_MPIRUN	HPL_MXP testing on the single rack
	SR_NVBANDWIDTH	Bandwidth testing running in single nvldomain
	NCCL_TEST	NCCL testing on the single rack
	SRT3	EntryPoint Runbook of burn-in performance testing
	HPL_MXP_TEST_BURN_IN_MPIRUN	HPL_MXP testing on the single rack with long duration
MRT	DGX_SUPERPOD_BASELINE_TESTING_MRT	EntryPoint Runbook of MRT
	MRT1	EntryPoint Runbook of rack level connectivity testing
	MR_INFINIBAND_CHECK_UFM	InfiniBand connectivity check via UFM
	IB_PERF_TEST_SINGLE_NODE	InfiniBand performance test on single node
	MRT2	EntryPoint Runbook of multi-rack performance testing
	MR_HPL_TEST	HPL testing across multiple racks
	MR_NCCL_TEST	NCCL testing across multiple racks
	MR_HPL_TEST_BURN_IN	HPL testing across multiple racks with long duration
	MR_NCCL_TEST_BURN_IN	NCCL testing across multiple racks with long duration
	MRT3	EntryPoint Runbook of cluster level testing
	Nemotron_15B	LLM testing with mocked data

Runbook Interface Guide#

When accessing the runbook as shown in the example below, please note these important configuration elements:

Central Workspace#

The main content area displays your resource queries, commands, scripts, or nested runbooks.

Each row represents an individual cell
Each cell includes a play button for isolated execution
Toggle switches allow you to enable/disable specific cells

Configuration Panel (Right Side)#

The right panel contains several critical configuration sections:

Parameters: Contains all required inputs for runbook execution
Triggers: Configure automated execution methods:
- Alarm triggers
- Time Trigger (cron jobs)
- Other integrations like AlertManager
Users: Manage permissions for who may run or edit the runbook
Settings: General runbook configuration options
…: More runbook operations, including:
- Clone functionality
- Export options
- Delete runbook
- etc.

Check Job Status and Result#

Once you have initiated the SRT or MRT using the steps above, there are two options for checking the job status and results:

1. Click “View Run” right after you create the run, as described in the sections above, to be redirected to the run page.
2. Navigate to the “Runbook” section in the left panel, then click the “Run” button located in the upper left corner of the page as shown below. Important: Make sure you have selected the correct range in the upper right corner before proceeding.

Note: Runbooks can be nested within other runbooks. When this occurs, you may see an “Execution succeeded - View Run” link after a cell completes. Clicking this link redirects you to a detailed results page for the nested runbook execution.

In the following part of this section, we will walk you through the topics below.

Check Job status - Whether the job is Running, Completed, Aborted, Terminated or Timed out.
Check Job results - if the job passed or failed, along with detailed logs

Check Job Status#

At the top of each job execution, you will observe one of the following status indicators:

Status Types and Definitions#

Running: The job is currently executing. Progress is displayed as a percentage based on completed cells.
Completed: The job has finished execution. Note that completion status does not guarantee successful results. Please review the detailed output in the results section.
Aborted: The job terminated prematurely due to execution errors, such as cell syntax issues.
Terminated: The job was forcibly ended by the system.
Timed Out: The job exceeded its maximum allowed execution duration (the default timeout is 1 hour for a runbook and 1 minute for an action).
Canceled: The job was manually terminated by a user.
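For scripts that post-process run information, the statuses above could be modeled as an enumeration. This is only a sketch: the `RunStatus` type and helper names are hypothetical, and the two timeout constants simply restate the defaults listed above.

```python
from enum import Enum

# Default timeouts as stated in the status definitions.
DEFAULT_RUNBOOK_TIMEOUT_SECONDS = 60 * 60
DEFAULT_ACTION_TIMEOUT_SECONDS = 60


class RunStatus(Enum):
    RUNNING = "Running"
    COMPLETED = "Completed"
    ABORTED = "Aborted"
    TERMINATED = "Terminated"
    TIMED_OUT = "Timed Out"
    CANCELED = "Canceled"


def is_finished(status: RunStatus) -> bool:
    """Every status except Running means the job is no longer executing."""
    return status is not RunStatus.RUNNING


def results_ready(status: RunStatus) -> bool:
    """Only a Completed job has results to review; Completed alone does not
    mean the tests passed, so the cell output still has to be checked."""
    return status is RunStatus.COMPLETED
```

Keeping "finished" separate from "results ready" mirrors the note that a completed job does not guarantee successful results.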

Check Job Results#

When the job status displays “Completed,” you may proceed to review the job results. If the job status shows otherwise, you may click into the job for more details and troubleshooting.

Cell Structure Overview#

Each cell in the runbook contains three primary components:

Main Content Area: This section displays the executed script or command. On the right side of the cell, you’ll find several control icons. While most icons were detailed in the previous section, the “fx” icon is particularly valuable as it displays all parameters along with input/output values for the specific cell when hovering over it.
Execution Information Bar: Located in the middle of the cell, this light grey text line indicates the execution start time and duration of the operation.
Results Section: The bottom portion displays:

Exit code status
Execution location information
Complete command output (accessible by clicking the “Output” column contents)

Additional Features:

Configure output display preferences
Toggle the density
Download results in various formats using the download options menu

Streamlined Error Navigation: When a job contains numerous cells, manually checking for failures becomes inefficient. Use the “Error outline” feature in the middle panel to quickly locate problematic cells. Simply click any item in this list to automatically navigate to the corresponding failed cell.

Note: Reports are available for the major runbooks. For details, see the “Reports of Testings” section.

Handling Job Failures#

In the event of job or cell execution failures, the following remediation options are available:

Major Issue Resolution: Upon resolution of critical infrastructure issues (e.g., hardware replacement), a complete re-initialization of the SRT or MRT job is recommended.
Targeted Component Resolution: When specific components have been updated (e.g., firmware version upgrades), execute the relevant job or runbook within the existing session by selecting the “Run” button located in the upper-right interface section. This maintains all previously established parameters.
Individual Cell Correction: For isolated cell failures that have been addressed, execute the specific cell independently by activating the execution control (play button) positioned on the right margin of the cell interface. Note: This option might make it difficult to locate the job/run from the reports panel (refer to the “Reports of Testings” section below).

Firmware checks#

NVIDIA Mission Control autonomous hardware recovery includes firmware checks that extract the current firmware versions of the trays and switches, and compare them with the expected versions specified in the Source of Truth (SOT) file. The SOT file includes the expected versions for all components such as OS, HMC, ConnectX, etc., and is prepopulated. The SOT file can be obtained from the NVIS team.
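Conceptually, the check is a comparison of two version maps. The sketch below assumes the expected and current versions are available as simple component-to-version dictionaries; the real SOT file layout is not described here, and the version strings in the usage note are placeholders.

```python
from typing import Dict, List, Tuple


def firmware_mismatches(
    expected: Dict[str, str], current: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (component, expected_version, found_version) for each mismatch.

    A component that is absent from `current` is reported as "missing".
    """
    mismatches = []
    for component, want in sorted(expected.items()):
        have = current.get(component, "missing")
        if have != want:
            mismatches.append((component, want, have))
    return mismatches
```

For example, with expected versions `{"HMC": "1.0", "ConnectX": "2.0"}` and current versions `{"HMC": "1.0", "ConnectX": "1.9"}`, the function reports a single mismatch for `ConnectX`.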

Reports of Testings#

NVIDIA Mission Control autonomous hardware recovery provides reports for baseline testing, reflecting the status of nodes (compute and switches) at each test stage. These reports help identify and troubleshoot root causes.

To access the NVIDIA Mission Control autonomous hardware recovery reports, click on Resources in the side menu, then select Reports.

The Landing page contains two tabs: Report Templates and Published Reports.

Report Templates provide templates for each stage of Baseline testing. These templates include bar graphs that display the PASS or FAIL status for different nodes during the tests. However, these templates are static and do not store any test data. This means that while you can view the templates, you cannot save or modify the test results within them.

Health Checks & Alerts#

NVIDIA Mission Control autonomous hardware recovery provides a full suite of automated health checks to detect failures at the tray, rack, and system levels for GB200. In addition, system wide health checks are performed by integrating with the UFM and NetQ network control planes. Health check data is reported back to BCM’s BaseView and/or the in-cluster LGTM stack. These health checks are performed at two layers: BCM job invocation, and as periodic health checks through NVIDIA Mission Control autonomous hardware recovery.

Alarms Dashboard#

The alarms dashboard is an overview of all alarms and the state of your system. In this view, alarms are summarized by counts of alarms firing, alarms firing most frequently, and a configurable list of most frequently firing, canceled, or resolved alarms. This is meant to be a starting point for any investigations of possible issues with your systems, and you may click any alarm for further details.

Alarms dashboard overview

BCM Slurm Job Lifecycle Checks (Prolog and Epilog)#

When a Slurm job is submitted, the Autonomous Hardware Recovery Agent automatically runs a set of checks at the start and end of the job to validate node health and stability. These are known as Prolog and Epilog checks.

Prolog Checks (run before the job starts):

If a check fails, the node is marked as DRAIN, and the job is re-queued.
If it passes, the job proceeds normally.

Epilog Checks (run after the job finishes):

If a check fails, the node is also marked as DRAIN.
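The Prolog and Epilog outcomes described above can be summarized in a few lines. This is an illustrative sketch only; the `Outcome` type and function names are hypothetical and do not correspond to the shipped scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    drain_node: bool   # mark the node as DRAIN
    requeue_job: bool  # put the job back in the queue


def prolog_outcome(check_passed: bool) -> Outcome:
    """Before the job starts: a failure drains the node and re-queues the job."""
    if check_passed:
        return Outcome(drain_node=False, requeue_job=False)
    return Outcome(drain_node=True, requeue_job=True)


def epilog_outcome(check_passed: bool) -> Outcome:
    """After the job finishes: a failure drains the node; nothing is re-queued."""
    return Outcome(drain_node=not check_passed, requeue_job=False)
```

The only difference between the two phases in this sketch is the re-queue: a job that has already finished has nothing left to re-run.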

These scripts are automatically pushed to each node when the job runs, but they are not visible or configurable through the NVIDIA Autonomous Hardware Recovery UI. To review them, navigate to Shoreline_files/scripts/slurm in the NVIDIA Mission Control package.

Note: Prolog and Epilog checks are disabled by default and should only be enabled after the nodes are confirmed to be healthy. Use the following runbooks to manage them:

SLURM_CHECKS_ENABLE – enables the checks
SLURM_CHECKS_DISABLE – disables the checks

Unlike the Prolog and Epilog checks, Periodic Checks are defined within the NVIDIA Mission Control autonomous hardware recovery interface as Alarms, and are detailed in the next section. They are automatically enabled for racks that pass Single Rack Testing but may also be manually enabled or disabled for specific racks by running the ALARMS_ENABLE and ALARMS_DISABLE runbooks.

Periodic Health Checks (Alarms)#

Periodic Checks are separate from Prolog and Epilog and run at regular intervals to monitor system health. These are managed as Alarms in the NVIDIA Autonomous Hardware Recovery UI and behave as follows:

Automatically enabled for racks that pass Single Rack Testing
Automatically disabled during firmware upgrade and Break/fix
Can be manually enabled or disabled at any time

The alarm_base is the Resource Query they run against. To control them manually, use the following runbooks:

ALARMS_ENABLE – enables periodic alarms for selected racks. Note that a node having the maintenance tag will override these settings.
ALARMS_DISABLE – disables them
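The enable and disable rules above can be read as a small decision function. The sketch below is one possible interpretation: the `RackState` fields are hypothetical, and the precedence between a manual setting and the automatic disable during firmware upgrade or break/fix is an assumption, since it is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RackState:
    """Hypothetical per-rack state; field names are illustrative only."""
    passed_srt: bool = False
    in_firmware_upgrade: bool = False
    in_break_fix: bool = False
    # True after ALARMS_ENABLE, False after ALARMS_DISABLE, None if never run.
    manual_setting: Optional[bool] = None


def alarms_active(rack: RackState, node_has_maintenance_tag: bool = False) -> bool:
    """Decide whether periodic alarms would run for a node in this rack."""
    # A maintenance tag on the node overrides the enable setting.
    if node_has_maintenance_tag:
        return False
    # Assumed precedence: automatic disable during firmware upgrade or break/fix.
    if rack.in_firmware_upgrade or rack.in_break_fix:
        return False
    # A manual setting, when present, replaces the automatic default.
    if rack.manual_setting is not None:
        return rack.manual_setting
    # Automatic default: enabled once the rack has passed Single Rack Testing.
    return rack.passed_srt
```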

Periodic Checks are fully visible and configurable in the UI through the alarm section. The following is a list of the configured Alarms, grouped by their check interval:

Frequent Checks (5m)#

The system runs the following checks every 5 minutes.

bmc_sensors#

Checks the sensors from the Baseboard Management Controller (BMC) to ensure the proper data is returned.

sysmem#

Checks that all expected memory DIMMs are present.

dns_host#

Checks the DNS configuration and resolution for the host.

eth_state#

Checks, using ibstat, that the ConnectX devices are present, active, and in the physical LinkUp state, and that they match the expected transfer rate.

raid_count#

Checks that the RAID configuration matches the expected mdstat configuration.

gpu_temp_history#

Checks System Event Log (SEL) history looking for GPU temperature issues.

gpu_alloc_temp#

Checks if the GPU temperatures are above a threshold.

periodic_bmc_host_checks#

The following groups of periodic functional checks are a subset of the BCM Prolog checks that run at predefined intervals as NVIDIA Mission Control autonomous hardware recovery Alarms. These checks consist of:

check_bmc_ipmi_version : Checks BMC IPMI version against an expected value
check_nvidia_module_loaded : Verifies the NVIDIA module is loaded in the host OS
check_host_os_version : Verifies the DGX OS version matches the expected value
check_nvsm_status : Verifies the NVSM service is currently active

periodic_cpu_mem_checks#

The following group of periodic functional checks covers the CPU and system memory:

check_cpu_health : Verifies CPU sockets and cores are present and online
check_dimm_count : Checks that all expected memory DIMMs are present
check_dimm_size : Checks that the size of each memory DIMM matches the expected values
check_memory_swap_size : Checks that the memory swap size matches the expected value

periodic_gpu_nvlink_checks#

The following group of periodic functional checks performs GPU and NVLink related checks:

check_gpu_pci : Checks that all GPUs are present on the lspci interface and with the correct link width and speed
check_gpu_error : Checks GPUs for ECC errors, retired pages, and throttles present
check_gpu_powerstate : Checks the powerstate for each GPU and compares against an expected value
check_gpu_param : Checks that specified GPU parameters are present and correct for the host
check_nvlink_health : Checks that links are active for each GPU, that the speed is correct, that fabric registration has been completed, and that the links are running at full bandwidth and belong to the same NVLink domain and partition
check_gpu_topology : Checks that there are no issues with the p2p topology within the node
check_gpu_telemetry : Checks that various sensors can be successfully read from the GPU using nvidia-smi
check_gpu_power_limit : Checks that the power limit is correct for each GPU
check_nvidia_inforom_ver : Checks that the inforom version is correct for each GPU
check_gpu_clock_info : Checks that the maximum clock speed is correct for each GPU
check_remapped_row : Checks if any remapped row events have occurred

periodic_network_checks#

check_ib_ber_and_ro : Checks that the PCI_WR_ORDERING field is set to relaxed and checks the bit error rate of the CX7 using mlxlink
check_ib_port_rcv_errors : Checks InfiniBand device ports for RCV errors
check_ib_cables : Checks the cable info using mlxcables
check_bf3_speed : Validates that the BlueField devices are operating at the correct speed and that the proper number of devices are in the “Up” state. This check will run, but never fail.

periodic_storage_checks#

check_pex_switch_health : Checks that the PEX switches are present, have the correct PCIe link speed and width, and the downstream devices have enumerated to lspci
check_cx_config : Checks that the ConnectX devices have the correct PCIe link speed and width using lspci and ACS config using setpci
check_nvme_health : Checks that the PCIe link speed and width of each NVMe device matches the expected value
check_storage_dir : Checks that the host has functional access to the home storage
check_storage_util : Checks that the used local storage on the host is below a given threshold
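As an illustration of what a utilization check like check_storage_util involves, the sketch below uses Python's standard `shutil.disk_usage`. It is not the shipped check; the function names and the 0.9 default threshold are assumptions chosen for the example.

```python
import shutil


def used_fraction(path: str) -> float:
    """Fraction of the filesystem containing `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def storage_util_ok(path: str, max_used_fraction: float = 0.9) -> bool:
    """True when usage is below the threshold (0.9 is an example value)."""
    return used_fraction(path) < max_used_fraction
```

Calling `storage_util_ok("/")` on a host would report whether the root filesystem is under 90% full in this sketch.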

periodic_error_checks#

Checks journald for machine check events, Xid, AER, CPER, I/O, GPU fell off the bus, and other generic errors.
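A scan of this kind amounts to pattern matching over log lines. The sketch below works on any iterable of lines (for example, text exported from the journal); the pattern set and category names are assumptions for illustration and are not the patterns used by the actual check.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Illustrative patterns only; the real check may use different expressions.
ERROR_PATTERNS = {
    "machine_check": re.compile(r"machine check|\bmce\b", re.IGNORECASE),
    "xid": re.compile(r"\bXid\b"),
    "aer": re.compile(r"\bAER\b"),
    "cper": re.compile(r"\bCPER\b"),
    "io_error": re.compile(r"I/O error", re.IGNORECASE),
    "gpu_off_bus": re.compile(r"fallen off the bus", re.IGNORECASE),
}


def count_error_lines(lines: Iterable[str]) -> Dict[str, int]:
    """Count how many log lines match each error category."""
    counts: Counter = Counter()
    for line in lines:
        for name, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return dict(counts)
```

A line can count toward more than one category, for instance an Xid message that also reports a GPU falling off the bus.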

Hourly Checks#

nfs_mounts#

Verifies required mount points.

daily_informational#

This setting checks for issues where the severity and remediation may not be critical. This alarm will only be triggered once per day, and results may be viewed in the resulting runbook run.

check_sel_event : Reads the SEL events from the BMC and ensures none are asserted
check_dgx_os_version : Verifies the DGX OS version matches the expected value
check_gpu_vbios_ver : Checks the VBIOS version of the GPUs and compares against an expected value
check_nvme_fw_ver : Checks that the FW version for each NVMe matches an expected value
check_kernel_commandline_opt : Verifies the specified kernel option(s) is present in the current kernel’s boot parameters
check_host_bios_ver : Verifies the system’s BIOS version
check_kernel_ver : Verifies the current version of the Linux kernel
check_host_package_versions : Queries the installed packages on the host
nv_container_cli_info : Retrieves information about the NVIDIA container CLI (driver and devices)

Daily Checks#

cpu_stepping#

Checks that the CPU stepping parameter is correct for each CPU.

numa_node_count#

Checks that the correct number of non-uniform memory access (NUMA) nodes is configured with the CPU cores.

NVIDIA Mission Control autonomous hardware recovery Alarm Configuration#

There are several components to an alarm, with the key pieces being the Resource Query, Fire Query, Resolve Query, Check Interval, and Automation. An example configuration is shown in the following figure.

Alarm configuration example

Resource Query#

The resource query allows you to customize the resources (hosts, pods, gpus) on which the checks will be performed. In the preceding example, `hosts | rack_name =~ ".*"` checks alarms only on hosts that have a value set for the “rack_name” tag.
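The effect of a tag filter such as `rack_name =~ ".*"` can be mimicked in a few lines of Python. This sketch only illustrates the selection semantics described above; the host data shape and the `select_hosts` helper are hypothetical, and whether the product matches the whole tag value or a substring is an assumption (whole-value matching is used here).

```python
import re
from typing import Dict, List


def select_hosts(hosts: List[Dict], tag: str, pattern: str) -> List[str]:
    """Return names of hosts whose `tag` is set and matches `pattern`."""
    regex = re.compile(pattern)
    return [
        host["name"]
        for host in hosts
        if tag in host["tags"] and regex.fullmatch(host["tags"][tag])
    ]
```

A host without the tag is never selected, which is why the `".*"` pattern acts as a "tag is set" filter.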

Fire Query#

The fire query is a condition that, when true, will cause the alarm to begin firing. It will be run at each interval.

Resolve Query#

Similar to the fire query, the Resolve query is a condition that will resolve the alarm when true. Resolving an alarm will cause firing to cease and the state to change to Resolved.

Check Interval#

The interval at which the fire and resolve queries are checked.

Alarm States#

If an Alarm has triggered, it will be in one of the following three states:

Triggered#

This state means the alarm is currently firing. Any automation (break/fix) will subsequently be invoked to remediate any issues, potentially resolving the alarm. Alternatively, the user can cancel the alarm by clicking “Cancel alarm”.

Clicking into the triggering alarm will give you more details on what caused the alarm, metadata and resources relating to the alarm, and will also allow you to view log output from the check itself.

Alarm states screenshot

Resolved#

When the resolve query of an alarm evaluates to true for a firing alarm, the status will be changed to Resolved. Automation-triggered runbooks will invoke break/fix operations that should be configured to result in a resolved alarm.

Canceled#

When a user cancels an alarm from the dashboard, or from the triggered alarm itself, its state will become Canceled. Also, if an alarm configuration is changed for an alarm in the Triggered state, it will be canceled since it was triggered against a defunct configuration.
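The three states and their transitions can be summarized as a small state machine evaluated at each check interval. This is a sketch of the behavior described above, not the product's implementation; the names are hypothetical, and `None` stands for an alarm that has never triggered.

```python
from enum import Enum
from typing import Optional


class AlarmState(Enum):
    TRIGGERED = "Triggered"
    RESOLVED = "Resolved"
    CANCELED = "Canceled"


def next_state(
    state: Optional[AlarmState],
    fire: bool,
    resolve: bool,
    canceled_by_user: bool = False,
    config_changed: bool = False,
) -> Optional[AlarmState]:
    """Compute the alarm state after one evaluation of the queries."""
    if state is AlarmState.TRIGGERED:
        # A user cancel or a configuration change cancels a firing alarm.
        if canceled_by_user or config_changed:
            return AlarmState.CANCELED
        # The resolve query ends the firing state.
        if resolve:
            return AlarmState.RESOLVED
        return AlarmState.TRIGGERED
    # Not currently firing: only the fire query can start a new alarm.
    if fire:
        return AlarmState.TRIGGERED
    return state
```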

Firmware Upgrades with NVIDIA Mission Control autonomous hardware recovery#

NVIDIA Mission Control autonomous hardware recovery provides functionality for upgrading, cycling, and verifying firmware and the corresponding OS within your GB200 racks. The four distinct components for which firmware can be upgraded using this process are:

Compute trays
Switches
Mellanox
NVOS

The workflow invocation is performed using autonomous hardware recovery’s Runbooks. To view all Firmware upgrade related runbooks, you may search using the FIRMWARE_UPGRADE label.

Note

To do firmware updates within Base Command Manager or the nvfwupd tool itself, refer to the NVIDIA DGX GB200/GB300 Firmware Update Guide.

NVIDIA Mission Control autonomous hardware recovery Break/Fix Workflow#

NVIDIA Mission Control autonomous hardware recovery provides automated break/fix workflows to handle tray failures for GB200. These workflows execute a series of diagnostic steps to determine the cause of the failure and take the necessary repair steps. For issues that cannot be auto-resolved, they create support tickets when you have opted in to the support ticket service.

The automated break/fix workflow is designed to efficiently diagnose and remediate issues, with clear paths for different failure scenarios and comprehensive validation to ensure systems are properly restored to service.

Key Features#

Automatic Detection: Identifies drained nodes without manual intervention
Intelligent Triage: Routes to appropriate diagnostic workflows based on failure symptoms
Comprehensive Diagnostics: Performs thorough hardware and software checks
Automated Remediation: Attempts to resolve issues without human intervention when possible
Detailed Reporting: Provides comprehensive logs for RMA or further troubleshooting

Entrypoint of Break/Fix Workflow#

A centralized automated break/fix interface has been established to facilitate streamlined diagnostics and remediation. This unified entry point provides comprehensive access to the break/fix framework, enabling efficient navigation and implementation of all remediation procedures.

Break/Fix Workflow Components#

The break/fix system consists of several key components that work together to diagnose and remediate issues with compute trays. The following is a detailed explanation of each component:

BREAKFIX_TRIGGER#

The entry point runbook that:

Runs automatically every five minutes using the time trigger
Checks for any drained nodes in BCM
Initiates the triage process for affected nodes
Routes to the appropriate diagnostic workflow

BREAKFIX_COMPUTE_TRAY_TRIAGE#

This runbook is automatically triggered by BREAKFIX_TRIGGER and performs comprehensive triage on drained compute trays.

GPU_RECOVERY#

This specialized diagnostic runbook is automatically invoked by BREAKFIX_COMPUTE_TRAY_TRIAGE when GPU-related issues are detected.

BREAKFIX_COMPUTE_TRAY_VALIDATION#

This runbook is automatically executed after remediation actions to validate system health as part of the automated workflow following triage and recovery operations.

BREAKFIX_DIAG_DUMP#

This runbook is automatically triggered when validation tests fail, collecting comprehensive diagnostic information for support ticket creation.
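The way these runbooks chain together can be sketched as follows. This is a simplified model under stated assumptions, not the actual runbook logic: `has_gpu_issue` and `validate` are hypothetical callables that stand in for the real triage and validation steps, and the node names are made up.

```python
def run_breakfix(node, has_gpu_issue, validate):
    """Return the runbooks invoked, in order, for one drained node."""
    invoked = ["BREAKFIX_COMPUTE_TRAY_TRIAGE"]
    if has_gpu_issue(node):
        # Triage hands off to the GPU-specific runbook only when needed.
        invoked.append("GPU_RECOVERY")
    invoked.append("BREAKFIX_COMPUTE_TRAY_VALIDATION")
    if not validate(node):
        # Failed validation leads to diagnostic collection for a support ticket.
        invoked.append("BREAKFIX_DIAG_DUMP")
    return invoked


def breakfix_trigger(drained_nodes, has_gpu_issue, validate):
    """Model one pass of BREAKFIX_TRIGGER over the currently drained nodes."""
    return {node: run_breakfix(node, has_gpu_issue, validate) for node in drained_nodes}


# Stubbed example: node-b has a GPU issue and fails validation.
result = breakfix_trigger(
    ["node-a", "node-b"],
    has_gpu_issue=lambda n: n == "node-b",
    validate=lambda n: n != "node-b",
)
```

In the real system the trigger pass repeats every five minutes; the sketch models a single pass.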

NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA#

The NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA workflow automates the process of bringing hardware components back into service after a Return Merchandise Authorization (RMA) replacement. This workflow ensures that replaced hardware is properly configured, firmware is updated to the correct versions, and the component is thoroughly validated before returning to production.

Key Features#

Automated Configuration: Configures replaced hardware components with proper settings
Firmware Updates: Updates firmware to match the required versions for the environment
Boot Order Correction: Ensures proper boot sequence for reliable operation
Comprehensive Validation: Performs thorough testing to verify hardware functionality
Seamless Integration: Automatically returns validated hardware to service

Post RMA Workflow Components#

The Post RMA workflow consists of several key steps that ensure replaced hardware is properly configured and validated:

Physical Replacement Procedures#

Compute Tray Removal: Detailed step-by-step instructions for safely removing failed compute trays, including power down procedures, cable disconnection, and proper handling
Compute Tray Installation: Comprehensive installation guide covering component migration (M.2 boot drive, E1.S cache drives, HMC, BMC, TPM), rail installation, and cable reconnection
Component Migration: Transfer of critical components from old tray to new tray while maintaining proper slot assignments and ESD protection

BCM Inventory Update#

ONLY REQUIRED FOR NEW HARDWARE: Updates BCM inventory information using the BREAKFIX_POST_RMA_UPDATE_BCM_INVENTORY runbook when a new compute tray is installed
Skip this step for repaired trays as MAC addresses remain unchanged
Ensures MAC addresses and other hardware identifiers are correctly registered in BCM (new MAC addresses are provided by the Enterprise Support team who manage serial numbers and asset inventory for customer deployments)
Enables proper management and monitoring of the replaced hardware

BMC Credential Management#

Creates necessary BMC credential files for secure access to hardware components
Establishes secure communication channels for configuration operations

BlueField Configuration#

Checks if BlueField devices are in NIC mode
Enables OPROM on BlueField devices to ensure proper initialization
Configures hardware components for optimal operation

Boot Order Correction#

Ensures the boot sequence is properly configured
Prevents boot failures and improves system reliability
Performs power reset through BMC after configuration changes

Connectivity Verification#

Verifies SSH connectivity to compute nodes
Checks BCM device status to ensure proper registration
Confirms network accessibility before proceeding with firmware updates

Firmware Updates#

Compute Firmware: Updates compute firmware using the BREAKFIX_FIRMWARE_UPGRADE_COMPUTE_POST_RMA runbook
Mellanox Firmware: Updates BlueField and ConnectX firmware using the BREAKFIX_FIRMWARE_UPGRADE_MELLANOX_POST_RMA runbook
Ensures all hardware components are running the correct firmware versions

System Validation#

Waits for hosts to come back online after each firmware update cycle
Verifies agent connectivity to ensure management capabilities
Runs comprehensive validation tests using BREAKFIX_COMPUTE_TRAY_VALIDATION
Opens nodes in BCM and validates Slurm readiness for successful nodes
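The step sequence above, including the rule that the BCM inventory update applies only to new hardware, can be summarized in a short sketch. It assumes the steps run in the order in which this section lists them; `plan_post_rma` is a hypothetical helper, not a product function.

```python
POST_RMA_STEPS = [
    "Physical Replacement Procedures",
    "BCM Inventory Update",
    "BMC Credential Management",
    "BlueField Configuration",
    "Boot Order Correction",
    "Connectivity Verification",
    "Firmware Updates",
    "System Validation",
]


def plan_post_rma(new_hardware):
    """Return the steps to run; the inventory update is skipped for repaired trays."""
    return [
        step
        for step in POST_RMA_STEPS
        if new_hardware or step != "BCM Inventory Update"
    ]
```

Calling `plan_post_rma(new_hardware=False)` yields the same list without the inventory step, because a repaired tray keeps its MAC addresses.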

Post RMA Workflow Results#

After successful completion of the Post RMA workflow:

Hardware Configuration:

Physical components (M.2, E1.S drives, HMC, BMC, TPM) properly migrated to new tray
BlueField devices configured in NIC mode with OPROM enabled
Boot order corrected for reliable system startup
Power management and connectivity verified

Firmware Updates:

Compute firmware updated to specified versions using BREAKFIX_FIRMWARE_UPGRADE_COMPUTE_POST_RMA
Mellanox BlueField and ConnectX firmware updated to specified versions using BREAKFIX_FIRMWARE_UPGRADE_MELLANOX_POST_RMA
All firmware components validated against expected versions

System Integration:

SSH connectivity to compute nodes verified
BCM device status confirmed and registered
Agent connectivity established for management capabilities
Comprehensive validation tests passed using BREAKFIX_COMPUTE_TRAY_VALIDATION

Service Restoration:

Nodes automatically opened in BCM for job scheduling
Slurm readiness validated for successful nodes
Hardware returned to production service automatically

Failure Handling: For any components that fail validation:

System maintains them in non-production state with maintenance tags
Detailed error logs available in runbook execution cells
Manual intervention required to address specific failure causes
Nodes remain drained until issues are resolved

NVIDIA Mission Control autonomous hardware recovery Domain Triage#

The BREAKFIX_DOMAIN_TRIAGE runbook provides manual diagnostics and troubleshooting for NVSwitch and NVLink domain-level issues. Unlike the Break/Fix Workflow, it is not triggered automatically by BREAKFIX_TRIGGER; it is run manually, on demand, when domain-level problems are identified. The workflow collects comprehensive diagnostic information about domain-level problems, facilitating efficient resolution and minimizing system downtime.

Domain Triage Workflow Components#

The Domain Triage workflow consists of several key steps that ensure thorough diagnosis of NVSwitch and NVLink domain issues:

Compute Node Management#

Adds AHR maintenance tags to all compute nodes in the affected rack
Drains compute nodes from Slurm to prevent workloads from running during diagnostics (no jobs will be scheduled on the entire rack)

NVSwitch Credential Management#

Retrieves BMC credentials for NVSwitches from BCM
Establishes secure access to NVSwitch components for diagnostics
Collects system rack serial numbers for identification

Diagnostic Data Collection#

Dumps BMC logs from NVSwitches to capture hardware-level events
Runs NVDebug tool to collect detailed information about NVSwitch status
Executes Nvlmapper tool to check NVLink status and connectivity
Runs PartnerDiag for comprehensive hardware diagnostics

Case Management#

Collects and organizes all diagnostic logs into a single package
Creates a Support ticket with all relevant diagnostic information
Attaches detailed logs to facilitate efficient troubleshooting

NVIDIA Mission Control autonomous hardware recovery Break/Fix Switch Post RMA#

Switch Post RMA Introduction#

The NVIDIA Mission Control autonomous hardware recovery Break/Fix Switch Post RMA workflow automates the process of bringing NVSwitch components back into service after a Return Merchandise Authorization (RMA) replacement. This comprehensive workflow includes both physical switch tray replacement procedures and automated software configuration to ensure that replaced switch hardware is properly configured, firmware is updated to the correct versions, and the component is thoroughly validated before returning to production.

Key Features#

Physical Replacement Procedures: Detailed instructions for safe switch tray removal and installation with proper cooling and power management
Compute Node Management: Adds maintenance tags and drains compute nodes during switch replacement to prevent workload interference
Switch Connectivity Verification: Establishes and verifies SSH connectivity to replaced switch components
Factory Reset and ZTP: Performs factory reset and monitors Zero Touch Provisioning for clean initialization
Firmware Updates: Updates switch firmware to match required versions using BREAKFIX_FIRMWARE_UPGRADE_SWITCH_POST_RMA
System Validation: Comprehensive testing including NMX controller verification, compute node reboots, and compute tray validation

Switch Post RMA Workflow Components#

The Switch Post RMA workflow consists of several key steps that ensure replaced switch hardware is properly configured and validated:

Physical Replacement Procedures#

Switch Tray Removal: Detailed instructions for powering down the entire rack, cooling procedures, cable disconnection, and safe tray removal
Switch Tray Installation: Comprehensive installation guide covering rail migration, tray insertion, cable reconnection, and power-on sequence

Compute Node Management#

Adds maintenance tags to all compute nodes in the affected rack
Drains compute nodes from Slurm to prevent workload interference during switch replacement

BCM Inventory Update#

ONLY REQUIRED FOR NEW HARDWARE: Updates BCM inventory information using BREAKFIX_POST_RMA_UPDATE_SWITCH_BCM_INVENTORY when a new switch is installed
Skip this step for repaired switches as MAC addresses remain unchanged
Ensures BMC MAC and COMe MAC addresses are correctly registered in BCM (new MAC addresses are provided by the Enterprise Support team who manage asset inventory for customer deployments)
Enables proper management and monitoring of the replaced switch hardware

Switch Connectivity and Configuration#

Retrieves switch IP and credentials from BCM
Verifies SSH connectivity to the switch node
Updates ZTP settings in BCM with NVOS image file configuration
Ensures the switch is reachable for configuration operations

Factory Reset and ZTP#

Performs factory default reset on the switch
Monitors Zero Touch Provisioning (ZTP) status until successful completion
Creates support tickets if ZTP fails
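Monitoring ZTP until it completes is a poll-with-timeout pattern. The sketch below illustrates that pattern only. The `get_status` callable, the status strings (`"success"`, `"failed"`, `"in_progress"`), and the timeout values are all hypothetical and do not reflect how the runbook queries the switch.

```python
import time


def wait_for_ztp(get_status, timeout_s=1800, interval_s=30, sleep=time.sleep):
    """Poll get_status() until it reports success or failure, or until timeout."""
    waited = 0
    while waited <= timeout_s:
        status = get_status()
        if status in ("success", "failed"):
            return status
        sleep(interval_s)
        waited += interval_s
    return "timeout"


def needs_support_ticket(ztp_result):
    """Any outcome other than success would lead to a support ticket in this model."""
    return ztp_result != "success"
```

Passing `sleep` as a parameter keeps the loop testable without real delays.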

Switch BMC Credential Management#

Retrieves BMC credentials for NVSwitches from BCM
Establishes secure access to switch components for diagnostics
Collects system rack serial numbers for identification

Firmware Updates#

Upgrades switch firmware to the specified version using BREAKFIX_FIRMWARE_UPGRADE_SWITCH_POST_RMA
Verifies switch connectivity after firmware updates

System Health and Validation#

Performs switch tray health checks
Verifies NMX-C and NMX-T controller status on the active switch node
Reboots all compute nodes in the rack to ensure proper connectivity
Waits for agent connectivity to confirm successful recovery
Validates there are no inactive NVLinks
Runs comprehensive compute tray validation using BREAKFIX_COMPUTE_TRAY_VALIDATION

B200#

Dashboard#

NVIDIA Mission Control autonomous hardware recovery dashboard provides a comprehensive view of the status of all resources in the cluster, displaying the progress of testing at various stages for each node (control, compute and switch nodes). It shows which tests are completed, which ones have passed, and which have failed. This allows users to easily track the overall health and status of the cluster, identify any issues, and assess the readiness of each resource.

Navigating to Dashboards#

Visit the NVIDIA Mission Control autonomous hardware recovery homepage, navigate to the ‘Resources’ menu, and click on ‘Dashboards’ to access the dashboard page.
The Landing page contains two tabs: Dashboards and Dashboard Views.
The Dashboards tab lists all available dashboards. These dashboards reflect the current status of the cluster and of the tests.
A Dashboard View is a snapshot of the Dashboard that captures the data at a specific point in time. This snapshot remains static, preserving the exact state and data of the Dashboard as it appeared at that moment, regardless of any future changes or updates to the live data.

Cluster Validation#

The NVIDIA Mission Control autonomous hardware recovery portal enables efficient automation of baseline testing procedures for DGX B200 systems. This comprehensive testing framework can be flexibly executed across various scales of infrastructure, from individual compute nodes to multiple node configurations. The system performs extensive validation of critical compute node components, including CPU/GPU/Memory/storage functionality, network connectivity, and firmware versioning. Additionally, it incorporates industry-standard performance benchmarking tools such as HPL, NCCL, and Nemotron (Large Language Model) to assess system capabilities. This streamlined approach significantly enhances both testing efficiency and thoroughness while reducing execution time.

Note: In B200 systems, “Unit” refers to an individual compute node.

Entrypoint of Automated Testing#

A centralized automated baseline testing interface has been established to facilitate streamlined test execution and management. This unified entry point provides comprehensive access to the testing framework, enabling efficient navigation and one-click implementation of all testing procedures.

To access the baseline testing interface:

Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.
Locate the “DGX_SUPERPOD_BASELINE_TESTING” runbook using the search functionality.
Upon selecting the “DGX_SUPERPOD_BASELINE_TESTING” runbook, you will be presented with the following interface. The subsequent sections of this documentation will provide detailed guidance on executing various testing procedures.

Firmware checks#

NVIDIA Mission Control autonomous hardware recovery includes firmware checks that extract the current firmware versions of the trays and switches, and compare them with the expected versions specified in the Source of Truth (SOT) file. The SOT file includes the expected versions for all components such as OS, HMC, ConnectX, etc., and is prepopulated. Obtain the SOT file from the NVIS team.

The runbook extracts the expected versions of all firmware components and compares the current versions against them. If a run of the firmware validation runbook reports any version as not found, re-run the runbook.

Updating the SOT file#

Place the SOT JSON file on the headnode and provide the complete file path as the input FW_SOURCE_JSON_PATH to the DGX_SUPERPOD_BASELINE_TESTING runbook. To update the file, simply replace it with the new file and update the path in the input parameter of the runbook accordingly.
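The comparison that the firmware checks perform can be illustrated with a short sketch. The real SOT schema is not shown in this document, so the flat component-to-version mapping, the version strings, and the `compare_firmware` helper below are assumptions made purely for illustration.

```python
import json


def compare_firmware(expected, current):
    """Return rows of (component, expected, current, status), sorted by component."""
    rows = []
    for component, want in sorted(expected.items()):
        have = current.get(component)
        if have is None:
            status = "NOT_FOUND"
        elif have == want:
            status = "PASS"
        else:
            status = "FAIL"
        rows.append((component, want, have, status))
    return rows


# Placeholder SOT content and node readings; real values come from the SOT file
# passed as FW_SOURCE_JSON_PATH and from the nodes themselves.
sot = json.loads('{"OS": "7.1.0", "HMC": "1.2.3", "ConnectX": "28.43.1014"}')
current = {"OS": "7.1.0", "HMC": "1.2.0"}
rows = compare_firmware(sot, current)
```

Here the OS entry passes, HMC fails on a version mismatch, and ConnectX is reported as not found, which is the case in which re-running the runbook is recommended.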

Thresholds and Defaults#

The thresholds and default values for various tests are defined in the Golden Config File. The runbook picks the appropriate values and compares them against the values on the nodes. These include defaults such as the number of GPUs and the expected results for benchmarking tests (such as SUT2).

Golden Config File#

The Golden Config File (referred to as the defaults.env file on the trays and control nodes) contains the expected values for all benchmark thresholds, and other relevant settings. This file is distributed across all trays and loaded as environment variables, making its contents available during testing.
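To show how values from an env-style file can drive a check, the sketch below parses `KEY=VALUE` lines and compares an observed value against the expected one. The key names and values (`EXPECTED_GPU_COUNT`, `HPL_MIN_TFLOPS`) are hypothetical placeholders; the actual contents of defaults.env are not listed in this document.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip().strip('"')
    return values


sample = """
# hypothetical keys, for illustration only
EXPECTED_GPU_COUNT=8
export HPL_MIN_TFLOPS="100"
"""
defaults = parse_env(sample)
observed_gpus = 8
gpu_check = "PASS" if observed_gpus == int(defaults["EXPECTED_GPU_COUNT"]) else "FAIL"
```

The tests themselves source the file as shell environment variables; the Python parser is only a stand-in that makes the comparison step visible.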

Updating the Golden Config File#

To update the golden config values, edit config/<CHIP>/defaults.yaml (for example, config/B200/defaults.yaml or config/GB200/defaults.yaml). The chip-specific env files are generated automatically from these YAML files during deployment — do not edit the generated files under Shoreline_files/generated/ directly.

Once the changes are made and saved, use the NVIDIA Mission Control autonomous hardware recovery Runbook Deployment section (from the NVIDIA Mission Control AHR installation documentation) to apply them via OpenTofu. This process will create a File object on NVIDIA Mission Control autonomous hardware recovery, which pushes the updated file automatically to all control and compute trays. Various tests, including firmware checks, prolog, and epilog checks, source the defaults.env file and utilize the expected values, now available as environment variables.

Reports of Testings#

NVIDIA Mission Control autonomous hardware recovery provides reports for baseline testing, reflecting the status of nodes (compute and switches) at each test stage. These reports help identify and troubleshoot root causes.

To access the NVIDIA Mission Control autonomous hardware recovery reports, click on Resources in the side menu, then select Reports.

The Landing page contains two tabs: Report Templates and Published Reports.

Report Templates provide templates for each stage of Baseline testing. These templates include bar graphs that display the PASS or FAIL status for different nodes during the tests. However, these templates are static and do not store any test data. This means that while you can view the templates, you cannot save or modify the test results within them.

To generate a report that records the test results along with timestamps, click Publish. This action will create a new report based on the template, which will capture the current status of the tests. The report will display PASS for tests that were successful and FAIL for those that did not pass, with each status reflecting the most up-to-date information. The report also includes links to additional reports at the top of the page. These links allow you to access the detailed results of the individual SUT and MUT tests that make up each stage, giving you a deeper insight into the performance and status of each test within the overall baseline testing process.

Note: Reports reflect updated information only after the SUT and MUT tests have been executed.

Guide to initiate Reporting for Baseline Testing#

Navigate to the “DGX_SUPERPOD_BASELINE_TESTING_REPORT” under the Report Templates.
Optional: Although templates do not store test data, you can use one to view the current state of the test suite before publishing the reports. Clicking the refresh button at the top of the report loads the current data.
Click on “Publish” to generate a timestamped instance of the template that can be easily viewed and shared. Additionally, it automatically publishes all linked reports at the top of the page, ensuring that all related data is included and accessible.
Retain the auto-generated name for the report which includes the timestamp or provide your own name.
Once the process starts, the published reports will begin generating in the background. This includes “DGX_SUPERPOD_BASELINE_TESTING_REPORT” as well as all the SUT and MUT linked reports.
A pop-up notification will appear containing a hyperlink to access the published reports that are being currently generated.
When you publish the “DGX_SUPERPOD_BASELINE_TESTING_REPORT”, it automatically triggers the publication of all linked reports, including the associated SUT and MUT reports, with the data captured at that time.
All the reports will complete building in under a minute, and the “Linked Published Reports” will include the published reports for all the Linked Reports.

Navigating to Previously Published Reports#

Navigate to the Reports from the NVIDIA Mission Control autonomous hardware recovery Home Page
Click on “Published Reports”, and select the “DGX_SUPERPOD_BASELINE_TESTING_REPORT_timestamp” or any other report of interest.
You can also adjust the time range at the top of the screen to view reports generated within specific time frames. Options include viewing reports from the last 10 minutes, last 1 hour, or selecting a custom time range to explore older data beyond these periods.

Breakdown of a Published Report#

DGX_SUPERPOD_BASELINE_TESTING_REPORT#

“DGX_SUPERPOD_BASELINE_TESTING_REPORT” is the entry point for the Baseline Testing reports. It provides a comprehensive overview of all the SUT/MUT tests for the compute nodes.
Each stage is represented by a separate cell, displaying the results for that specific SUT or MUT test.
To perform a deeper analysis or understand the tests in each stage, click on the relevant report listed under “Linked Published Reports” at the top of the page.
Similarly, the reports for each stage are all available. You can either find the report of interest under Published Reports, or navigate to it from the parent report (DGX_SUPERPOD_BASELINE_TESTING_REPORT_<timestamp>).

Understanding the Published Reports#

The reports align with the SUT and MUT tests. Each cell in “DGX_SUPERPOD_BASELINE_TESTING_REPORT” represents a specific testing stage, such as SUT1 or SUT2. Within each of these “Linked Reports”, the cells represent individual tests, such as Singlenode Healthchecks, HPL, and NCCL. The layout of the graph for each of the cells is organized as follows:

The Y-axis displays the Rack name, allowing you to quickly identify the location of each resource within the cluster.
The X-axis represents the number of trays within each rack.

For example, in the following visualization, each bar indicates the number of trays within a rack. In this case, there are 2 trays per rack, as denoted by the number displayed on the bars. This provides a clear view of the test progress for each tray across different racks. Note: For NVL72, the report shows 18 trays per rack.

You can click on any bar in the graph to view a detailed list of the resources that passed or failed the tests within that specific bar. This allows you to drill down and see the test results for each tray in a particular rack or test stage.
Alternatively, you can click on the legend to filter and display all the passed or failed resources across the entire cluster, providing a comprehensive view of the overall test status.
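The bar graphs are, in effect, per-rack counts of PASS and FAIL results. A minimal sketch of that aggregation follows; the `(rack, tray, status)` tuple layout and the rack and tray names are assumptions for illustration, not the report's actual data format.

```python
from collections import Counter, defaultdict


def summarize(results):
    """Aggregate (rack, tray, status) tuples into {rack: {status: count}}."""
    per_rack = defaultdict(Counter)
    for rack, _tray, status in results:
        per_rack[rack][status] += 1
    return {rack: dict(counts) for rack, counts in per_rack.items()}


# Two trays per rack, as in the example visualization.
results = [
    ("rack1", "tray1", "PASS"),
    ("rack1", "tray2", "FAIL"),
    ("rack2", "tray1", "PASS"),
    ("rack2", "tray2", "PASS"),
]
summary = summarize(results)
```

Each inner dictionary corresponds to one bar: the rack on the Y-axis and the tray counts on the X-axis.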

Root Cause Analysis#

The report for each stage shows the trays that have passed and failed the test. To understand the actual issue and access the logs, you can follow these steps:

Navigate to the report with the PASS/FAIL values for which you want to conduct an RCA. In this example, let’s consider the SRT1 report, where there are a few failed tests for SINGLENODE_HEALTHCHECK_GPU_CPU.
Navigate to the report corresponding to SINGLENODE_HEALTHCHECK_GPU_CPU and scroll to the tests that are failing.
In this example, GPU VBIOS Version has failed for all the trays. Click on the bar for one of the racks to list the resources that the test failed on.
Click on the status (FAIL) which directly links you to the runbook where these tests failed. You can also do the same for tests that passed by clicking on the PASS status.
The errors in the runbook are outlined to the left of the screen. Navigate to the test that we are debugging (VBIOS). Alternatively, you can also scroll through the run to find the failed tests.
Click on the “Command filter excluded x/y resources” to get detailed output for each resource.
Click on the Output to view the detailed logs

Firmware Reports#

For Firmware Checks, NVIDIA Mission Control autonomous hardware recovery offers a tabular report detailing the status of each node. The report includes:

Pass/Fail Status: Indicates whether the firmware check for each node passed or failed.
Expected Version: Shows the firmware version that was expected for the node.
Current Version: Displays the actual firmware version currently installed on the node.

Navigating to Tabular Report for Firmware Checks#

You can access the SINGLENODE_HEALTHCHECK_FIRMWARE_<timestamp> report either through the Runs (Check Job Status and Result) section or by navigating to the Runs History (see Root Cause Analysis).
Scroll to the last cell of the execution and click on the output of the cell.
You can scroll horizontally and vertically to view the complete output. Additionally, you have the option to download the cell output for further analysis.
You can also navigate to the Reports section and view the tabular output by clicking on the status bar. This includes both the expected firmware version and the current firmware version.

Resource Dashboard#

NVIDIA Mission Control autonomous hardware recovery dashboard provides a comprehensive view of the status of all resources in the cluster, displaying the progress of testing at various stages for each node (control, compute and switch nodes). It shows which tests are completed, which ones have passed, and which have failed. This allows users to easily track the overall health and status of the cluster, identify any issues, and assess the readiness of each resource.

Navigating to the Resource Dashboard#

Visit the NVIDIA Mission Control autonomous hardware recovery homepage, navigate to the ‘Resources’ menu, and click on ‘Dashboards’ to access the dashboard page.
The Landing page contains two tabs: Dashboards and Dashboard Views.
The Dashboards tab lists all available dashboards. These dashboards reflect the current status of the cluster and of the tests.
A Dashboard View is a snapshot of the Dashboard that captures the data at a specific point in time. This snapshot remains static, preserving the exact state and data of the Dashboard as it appeared at that moment, regardless of any future changes or updates to the live data.

DGX_SUPERPOD_BASELINE_TESTING_DASHBOARD#

The DGX_SUPERPOD_BASELINE_TESTING_DASHBOARD provides a detailed view of the cluster’s resources and the status of the baseline tests. Specifically, it tracks the progress of two key testing phases for each resource: SUT and MUT.

Resource Status: The dashboard displays all resources within the cluster, such as compute nodes, control nodes, and switch nodes, along with their associated status.
Hostname & Rack Information: For each resource, you will see the hostname and rack name, along with a Tag sequence that indicates the stage of both the SUT and MUT tests.
Test Stage Progress: The rows in the dashboard reflect the current status of each test stage. Each stage (SUT and MUT) has a corresponding tag name that visually represents its test progress, showing whether the stage has been successfully completed, is in progress, or has failed.
Snapshot of Cluster Health: The dashboard offers a comprehensive snapshot of the cluster’s readiness. It allows users to identify potential issues early, track the completion status of tests, and quickly assess which resources are ready and which ones may require attention.

The dashboard allows you to sort the resources by Name, Rack, or Progress Bar, making it easier to organize and view the status of your cluster based on your preferred criteria.

Name: Sort resources alphabetically by their hostname for quick access.
Rack: Sort resources based on their rack assignment, ideal for organizing by physical location.
Progress Bar: Sort by test progress to focus on resources at different stages of testing or to identify incomplete tasks.

Creating a View / Snapshot#

To create a snapshot of the dashboard, follow these steps.

Click on the “Create View” button located at the top right corner of your screen.
Retain the auto-generated name for the dashboard which includes the timestamp or provide your own name.
Once the Dashboard View is created, a pop-up notification will appear with a hyperlink to access the Dashboard View.
The Dashboard View created can be downloaded as a CSV to further manipulate the data to generate reports.
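Once a Dashboard View has been downloaded as a CSV, it can be processed with standard tooling. The sketch below assumes a hypothetical column layout (`name`, `rack`, `progress`); the real export may use different column names, so adjust accordingly.

```python
import csv
import io

# Hypothetical export content; the actual CSV columns may differ.
SAMPLE_CSV = """name,rack,progress
node-02,rack-b,3
node-01,rack-a,5
node-03,rack-a,1
"""


def load_view(csv_text):
    """Read the CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def sort_view(rows, key):
    """Sort rows by a column; 'progress' is compared numerically."""
    if key == "progress":
        return sorted(rows, key=lambda r: int(r["progress"]))
    return sorted(rows, key=lambda r: r[key])


rows = load_view(SAMPLE_CSV)
by_rack = [r["name"] for r in sort_view(rows, "rack")]
by_progress = [r["name"] for r in sort_view(rows, "progress")]
```

This mirrors the Name, Rack, and Progress Bar sort options of the live dashboard, applied offline to a static snapshot.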

Automated Health Checks with NVIDIA Mission Control autonomous hardware recovery#

Introduction#

NVIDIA Mission Control autonomous hardware recovery provides a full suite of automated health checks to detect failures at the tray, rack, and system levels. In addition, system-wide health checks are performed by integrating with the UFM and NMX-M network control planes. Health check data is reported back to BCM’s BaseView and/or the in-cluster LGTM stack. These health checks are performed at two layers: at BCM job invocation, and as periodic health checks via NVIDIA Mission Control autonomous hardware recovery.

Alarms Dashboard#

The alarms dashboard is an overview of all alarms and the state of your system. In this view, alarms are summarized by counts of alarms firing, alarms firing most frequently, and a configurable list of most frequently firing, canceled, or resolved alarms. This is meant to be a starting point for any investigations of possible issues with your systems, and you may click any alarm for further details.

BCM Slurm Job Lifecycle Checks (Prolog and Epilog)#

When a Slurm job is submitted, the Autonomous Hardware Recovery Agent automatically runs a set of checks at the start and end of the job to validate node health and stability. These are known as Prolog and Epilog checks.

Prolog Checks (run before the job starts):
- If a check fails, the node is marked as DRAIN, and the job is re-queued.
- If it passes, the job proceeds normally.
Epilog Checks (run after the job finishes):
- If a check fails, the node is also marked as DRAIN.
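The Prolog and Epilog outcomes described above reduce to a small decision table. The following sketch encodes it; the function names and the returned dictionary shape are hypothetical and are not part of the Slurm scripts shipped with the product.

```python
def prolog_outcome(checks_passed):
    """Outcome of the checks that run before a job starts."""
    if checks_passed:
        return {"node_state": "unchanged", "job": "run"}
    # A failed Prolog check drains the node and re-queues the job.
    return {"node_state": "DRAIN", "job": "requeue"}


def epilog_outcome(checks_passed):
    """Outcome of the checks that run after a job finishes."""
    # The job has already finished, so only the node state is affected.
    return {"node_state": "unchanged" if checks_passed else "DRAIN"}
```

The asymmetry is the point: only a Prolog failure affects the job, because by Epilog time the job has already completed.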

These scripts are automatically pushed to each node when the job runs, but they are not visible or configurable through the NVIDIA Autonomous Hardware Recovery UI. To review them, navigate to Shoreline_files/scripts/slurm in the NVIDIA Mission Control package.

Note: Prolog and Epilog checks are disabled by default and should only be enabled after the nodes are confirmed to be healthy. Use the following runbooks to manage them:

SLURM_CHECKS_ENABLE – enables the checks
SLURM_CHECKS_DISABLE – disables the checks

Unlike the Prolog and Epilog checks, Periodic Checks are defined within the NVIDIA Mission Control autonomous hardware recovery interface as Alarms, and are detailed in the next section. They are automatically enabled for racks that pass Single Rack Testing, but can also be manually enabled or disabled for specific racks by running the “ALARMS_ENABLE” and “ALARMS_DISABLE” runbooks.

Periodic Health Checks (Alarms)#

Periodic Checks are separate from Prolog and Epilog and run at regular intervals to monitor system health. These are managed as Alarms in the NVIDIA Autonomous Hardware Recovery UI and behave as follows:

Automatically enabled for racks that pass Single Rack Testing
Automatically disabled during firmware upgrade and break/fix workflows
Can be manually enabled or disabled at any time

The alarm_base is the Resource Query they run against. To control them manually, use the following runbooks:

ALARMS_ENABLE – enables periodic alarms for selected racks. Note that a node having the maintenance tag will override these settings.
ALARMS_DISABLE – disables them

Periodic Checks are fully visible and configurable in the UI through the alarm section. The following is a list of the configured Alarms, grouped by their check interval:

Frequent Checks (5m)#

bmc_sensors#

Checks the sensors from the Baseboard Management Controller (BMC) to ensure the proper data is returned.

sysmem#

Checks that all expected memory DIMMs are present.

dns_host#

Checks the DNS configuration and resolution for the host.

eth_state#

Checks via ibstat that the ConnectX devices are present, active, and in the physical LinkUp state, and that they match the expected transfer rate.

raid_count#

Checks that the RAID configuration matches the expected mdstat configuration.

gpu_temp_history#

Checks System Event Log (SEL) history looking for GPU temperature issues.

gpu_alloc_temp#

Checks if the GPU temperatures are above a threshold.

periodic_bmc_host_checks#

The following groups of periodic functional checks are a subset of the BCM Prolog checks that run at predefined intervals as NVIDIA Mission Control autonomous hardware recovery Alarms.

check_bmc_ipmi_version - Checks BMC IPMI version against an expected value

check_nvidia_module_loaded - Verifies the NVIDIA module is loaded in the host OS

check_host_os_version - Verifies the DGX OS version matches the expected value

check_nvsm_status - Verifies the NVSM service is currently active

periodic_cpu_mem_checks#

check_cpu_health - Verifies CPU sockets and cores are present and online

check_dimm_count - Checks that all expected memory DIMMs are present

check_dimm_size - Checks that the size of each memory DIMM matches the expected values

check_memory_swap_size - Checks that the memory swap size matches the expected value

periodic_gpu_nvlink_checks#

check_gpu_pci - Checks that all GPUs are present on the lspci interface and with the correct link width and speed

check_gpu_error - Checks GPUs for ECC errors, retired pages, and throttles present

check_gpu_powerstate - Checks the powerstate for each GPU and compares against an expected value

check_gpu_param - Checks that specified GPU parameters are present and correct for the host

check_nvlink_health - Checks that, for each GPU, the links are active, the speed is correct, fabric registration has completed, the links are running at full bandwidth, and the GPUs belong to the same NVLink domain and partition.

check_gpu_topology - Checks that there are no issues with the p2p topology within the node

check_gpu_telemetry - Checks that various sensors can be successfully read from the GPU via nvidia-smi

check_gpu_power_limit - Checks that the power limit is correct for each GPU

check_nvidia_inforom_ver - Checks that the inforom version is correct for each GPU

check_gpu_clock_info - Checks that the maximum clock speed is correct for each GPU

check_remapped_row - Checks if any remapped row events have occurred

periodic_network_checks#

check_ib_ber_and_ro - Checks that the PCI_WR_ORDERING field is set to relaxed and checks the bit error rate of the ConnectX devices using mlxlink

check_ib_port_rcv_errors - Checks InfiniBand device ports for receive (RCV) errors

check_ib_cables - Checks the cable info using mlxcables

check_bf3_speed - Validates that the BlueField devices are operating at the correct speed and that the proper number of devices are in the “Up” state. This check will run but will never fail.

periodic_storage_checks#

check_pex_switch_health - Checks that the PEX switches are present, have the correct PCIe link speed and width, and the downstream devices have enumerated to lspci

check_cx_config - Checks that the ConnectX devices have the correct PCIe link speed and width via lspci and ACS config via setpci

check_nvme_health - Checks that the PCIe link speed and width of each NVMe device matches the expected value

check_storage_dir - Checks that the host has functional access to the home storage

check_storage_util - Checks that the used local storage on the host is below a given threshold

periodic_error_checks#

Checks journald for machine check events, Xid, AER, CPER, I/O, GPU fell off the bus, and other generic errors.

Hourly Checks#

nfs_mounts#

Verifies required mount points.

daily_informational#

Checks for which the severity and remediation may not be critical. This alarm will only be triggered once per day, and results may be viewed in the resulting runbook run.

check_sel_event - Reads the SEL events from the BMC and ensures none are asserted

check_dgx_os_version - Verifies the DGX OS version matches the expected value

check_gpu_vbios_ver - Checks the VBIOS version of the GPUs and compares against an expected value

check_nvme_fw_ver - Checks that the FW version for each NVMe matches an expected value

check_kernel_commandline_opt - Verifies the specified kernel option(s) is present in the current kernel’s boot parameters

check_host_bios_ver - Verifies the system’s BIOS version

check_kernel_ver - Verifies the current version of the Linux kernel

check_host_package_versions - Queries the installed packages on the host

nv_container_cli_info - Retrieves information about the NVIDIA container CLI (driver and devices)

Daily Checks#

cpu_stepping#

Checks that the CPU stepping parameter is correct for each CPU.

numa_node_count#

Checks that the correct number of non-uniform memory access (NUMA) nodes is configured with the CPU cores.

NVIDIA Mission Control autonomous hardware recovery Alarm Configuration#

There are several components to an alarm, with the key pieces being the Resource Query, Fire Query, Resolve Query, Check Interval and Automation. An example configuration is shown in the following figure.

Resource Query#

The resource query lets you select the resources (hosts, pods, GPUs) on which the checks are performed. In the preceding example, `hosts | name =~ ".*"` restricts the alarm to hosts that have a value set for the “name” tag.
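
To make the effect of a tag filter such as `name =~ ".*"` more concrete, here is a rough Python analogue. It is not the product's query engine: the `filter_by_tag` helper and the dictionary layout of a resource are hypothetical, and it assumes `=~` behaves like an unanchored regular-expression match, which the text does not confirm.

```python
import re


def filter_by_tag(resources: list, tag: str, pattern: str) -> list:
    """Keep resources that carry `tag` and whose tag value matches `pattern`."""
    regex = re.compile(pattern)
    selected = []
    for resource in resources:
        tags = resource.get("tags", {})
        # A resource without the tag can never match, whatever the pattern is.
        if tag in tags and regex.search(tags[tag]):
            selected.append(resource)
    return selected


# Hypothetical inventory: one host has a "name" tag, one does not.
hosts = [
    {"id": 1, "tags": {"name": "node01", "rack_name": "A01"}},
    {"id": 2, "tags": {"rack_name": "A01"}},
]
named_hosts = filter_by_tag(hosts, "name", ".*")
```

Under these assumptions only the first host is selected, which mirrors the "hosts that have a value set for the name tag" behavior described above.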

Fire Query#

The fire query is a condition that, when true, will cause the alarm to begin firing. It will be run at each interval.

Resolve Query#

Similar to the fire query, the Resolve query is a condition that will resolve the alarm when true. Resolving an alarm will cause firing to cease and the state to change to Resolved.

Check Interval#

The interval at which the fire and resolve queries are checked.

Automation#

You may use the Automation settings to have Runbooks triggered when an alarm fires, and you may also customize the informational messages that are displayed in the Alarm’s logs.

Alarm States#

If an Alarm has triggered, it will be in one of the following three states:

Triggered#

This state means the alarm is currently firing. Any automation (break/fix) will subsequently be invoked to remediate any issues, potentially resolving the alarm. Alternatively, the user can cancel the alarm by clicking “Cancel alarm”.

Clicking into the triggering alarm will give you more details on what caused the alarm, metadata and resources relating to the alarm, and will also allow you to view log output from the check itself.

Resolved#

When the resolve query of an alarm evaluates to true for a firing alarm, the status changes to Resolved. Runbooks triggered by Automation invoke break/fix operations, which should be configured so that they result in a resolved alarm.

Canceled#

When a user cancels an alarm from the dashboard, or from the triggered alarm itself, its state will become Canceled. Also, if an alarm configuration is changed for an alarm in the Triggered state, it will be canceled since it was triggered against a defunct configuration.
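
The three states can be read as a small state machine evaluated at each check interval. The sketch below is a simplified model for illustration only: `AlarmState`, `next_state`, and the extra `IDLE` state (an alarm that has not triggered) are hypothetical names, and it assumes that a resolved or canceled alarm can fire again when the fire query next becomes true.

```python
from enum import Enum


class AlarmState(Enum):
    IDLE = "Idle"  # hypothetical: alarm has not triggered yet
    TRIGGERED = "Triggered"
    RESOLVED = "Resolved"
    CANCELED = "Canceled"


def next_state(state, fire, resolve, user_cancel=False, config_changed=False):
    """Return the alarm state after one check interval."""
    if state is AlarmState.TRIGGERED:
        # A user cancel or a configuration change cancels a firing alarm.
        if user_cancel or config_changed:
            return AlarmState.CANCELED
        # The resolve query ends firing and moves the alarm to Resolved.
        if resolve:
            return AlarmState.RESOLVED
        return AlarmState.TRIGGERED
    # Assumption: from any non-firing state, a true fire query triggers the alarm.
    if fire:
        return AlarmState.TRIGGERED
    return state
```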

Firmware Upgrades with NVIDIA Mission Control autonomous hardware recovery#

Overview#

NVIDIA Mission Control autonomous hardware recovery provides functionality for upgrading, cycling, and verifying firmware and the corresponding OS within your GB300 racks. The distinct components for which firmware can be upgraded using this process are:

Compute trays
Switches
Mellanox
NVOS
Powershelf (PSU and PMC)

Asynchronous component workflows: Firmware upgrade workflows for different component types (compute tray, switch, Mellanox, and NVOS) run asynchronously; powershelf firmware upgrades are not included in asynchronous execution yet. You can schedule, run, and monitor each component workflow independently, and multiple firmware runs may be in progress for different components at the same time. The subsections below describe each workflow; use run history and Firmware Reports to track status across concurrent upgrades.

The workflow invocation is performed via autonomous hardware recovery’s Runbooks. To view all Firmware upgrade related runbooks, you may search by the FIRMWARE_UPGRADE label as shown below.

Using the filter will reduce the runbooks displayed to a list. In general, you will use these runbooks to upgrade the compute trays, switches, or switch NVOSes by following the steps in the next section.

Preparing the Upgrade#

In the firmware upgrade runbook, nvfwupd and nv action are used with the upgrade package file to determine the versions needing upgrade. You will need to obtain the firmware package and the Source of Truth (SOT) JSON file, the latter of which defines the referenced settings used for validation.

The SOT JSON may be obtained from the NVIS team, whereas the firmware packages may be downloaded from the NVIDIA Application Hub.
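
As a rough illustration of how the SOT file can be inspected before an upgrade, the following sketch walks the structure and lists the expected component versions. The field names (Milestones, BoardSKUs, Components, Component, Version) are taken from the truncated snippet in the next subsection; the `expected_versions` helper and the hand-written sample are hypothetical, and a real SOT file may contain additional nesting that this sketch ignores.

```python
import json


def expected_versions(sot: dict) -> list:
    """Return (milestone, board, category, component, version) tuples from a SOT dict."""
    rows = []
    for milestone in sot.get("Milestones", []):
        for board in milestone.get("BoardSKUs", []):
            # "Components" maps a category such as "Software" to a list of entries.
            for category, components in board.get("Components", {}).items():
                for comp in components:
                    rows.append((
                        milestone.get("Name"),
                        board.get("Name"),
                        category,
                        comp.get("Component"),
                        comp.get("Version"),
                    ))
    return rows


# Minimal hand-written sample that reuses values from the documented snippet.
sample_sot = json.loads("""
{
  "ProductName": "DGX-GB300-NVL72",
  "Milestones": [
    {
      "Name": "1.0.00GA",
      "BoardSKUs": [
        {
          "Name": "P4059",
          "Components": {
            "Software": [
              {"Component": "DOCA_Host", "Version": "3.1.0-091513"}
            ]
          }
        }
      ]
    }
  ]
}
""")
rows = expected_versions(sample_sot)
```

With a real file, `json.load(open(path))` would replace the inline sample.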

Source of Truth Snippet (truncated)#

{
    "ProductName": "DGX-GB300-NVL72",
    "SOTUniqueID": "1",
    "SOTType": "Release",
    "Milestones": [
        {
            "TemplateVersion": "0.5",
            "Id": "f32d9ee4-2df4-4544-9f2a-e4e19d7cd894",
            "Name": "1.0.00GA",
            "State": "Onboarded",
            "ReleaseDate": "2025-09-04T16:49:58.186442",
            "ReleaseCustomers": [],
            "Tests": [],
            "Packages": [],
            "BoardSKUs": [
                {
                    "SKUID": "699-24764-0001-TS3,699-24764-0001-TS1,692-24764-0001-000",
                    "Name": "P4059",
                    "Components": {
                        "Software": [
                            {
                                "Component": "DOCA_Host",
                                "Version": "3.1.0-091513",
                                "External": true,
                                "Sideload": true,
                                "Informational": false,
                                "Type": "Prod",
                                "Locations": [
                                    {
                                        "Location": "https://linux.mellanox.com/public/repo/doca/3.1.0-091513/ubuntu24.04/arm64-sbsa/doca-ofed_3.1.0-091513_arm64.deb",
                                        "LocationType": "HTTP",
                                        "Distro": "Ubuntu",
                                        "Architecture": "All",
                                        "PackageName": "GB300NVL72_DOCA_MFT",
                                        "PackageSubdirectory": "",
                                        "External": false
                                    }
                                ],
                                "SubComponents": []

Ordering Constraints#

The runbooks used to perform upgrades automatically determine the ordering of applicable packages and will AUX cycle nodes when appropriate. The following paragraphs describe this ordering for informational purposes only; no action is required from the user.

For the compute tray, older firmware packages require the BMC to be upgraded prior to HMC, but this is no longer the case with modern firmware. Both BMC and HMC can be upgraded within a single AC cycle.

For the switch tray, the BMC firmware should be updated first. SBIOS and CPLD packages may be upgraded within the same AC cycle. Our runbooks take care of this ordering for you.

Coordination with other jobs#

To prevent other tasks from utilizing the nodes undergoing the upgrade process, AHR will do two things:

It will tag the nodes with a special maintenance tag.
Subsequently, it will drain the node via Slurm on BCM.

In particular, this will prevent other upgrade processes from interfering, and will also bypass AHR’s breakfix workflow. This tag will be automatically removed upon successful completion of the upgrade, and the node will be undrained. If there’s an issue during the upgrade process, the tag and drain state will remain in place for further investigation. In that case, review the failures, then undrain the nodes and remove the maintenance tag once the nodes are deemed healthy.

If you need to remove the maintenance tags after the firmware upgrade process encounters an issue, troubleshooting has completed, or even after unsuccessful breakfix triage, you may do so using the CLEAR_MAINTENANCE_TAGS runbook, specifying resource_tag (e.g. rack_name) and resource_value (e.g. B05) as parameters.

Performing the Upgrade#

Entry-point runbook: FIRMWARE_UPGRADE

The FIRMWARE_UPGRADE runbook is the single entry point for upgrading both compute and switch firmware (and optionally Mellanox/NVOS when the corresponding paths are provided). It performs resource scoping, maintenance tagging, drain, and then invokes the appropriate child runbooks: FIRMWARE_UPGRADE_COMPUTE when compute packages are specified, and FIRMWARE_UPGRADE_SWITCH when switch packages are specified. After upgrades, it runs AUX cycle (if enabled), undrains nodes, removes the maintenance tag, and runs firmware validation. Use this runbook for combined compute-and-switch upgrades or for compute-only or switch-only upgrades by setting only the relevant package path(s).
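
The routing from parameters to child runbooks can be sketched as follows. This is a reading of the parameter descriptions in this section, not the runbook's actual implementation: the `child_runbooks` helper is hypothetical, the example paths use a made-up `example` subdirectory, and treating an empty or whitespace-only value as "not set" is an assumption.

```python
def child_runbooks(params: dict) -> list:
    """Guess which child runbooks FIRMWARE_UPGRADE would invoke for these parameters."""

    def is_set(key: str) -> bool:
        value = params.get(key) or ""
        return bool(value.strip())

    children = []
    # Compute packages, and the ConnectX/BlueField paths, go through the compute child.
    if any(is_set(k) for k in ("FWPKG_DIR_PATH_COMPUTE", "FWPKG_CX_FILE_PATH", "FWPKG_BF_FILE_PATH")):
        children.append("FIRMWARE_UPGRADE_COMPUTE")
    # Switch packages, and the NVOS bin file, go through the switch child.
    if any(is_set(k) for k in ("FWPKG_DIR_PATH_SWITCH", "NVOS_FILE_PATH")):
        children.append("FIRMWARE_UPGRADE_SWITCH")
    return children


# Hypothetical compute-only run scoped to rack A01.
base = "/cm/shared/apps/autonomous-hardware-recovery/var/firmware/example"
compute_only = {
    "resource_tag": "rack_name",
    "resource_value": "A01",
    "FW_SOURCE_JSON_PATH": f"{base}/sot.json",
    "FWPKG_DIR_PATH_COMPUTE": f"{base}/compute",
    "FWPKG_DIR_PATH_SWITCH": "",
}
```

Setting only the relevant package path or paths is what selects a compute-only, switch-only, or combined run.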

This section details the steps and parameters for each component. The runbooks take care of invoking the process, performing the upgrade, calling any ancillary runbooks, and ultimately performing validation of the upgraded component once complete.

_images/firmware_upgrade_parent_flowchart.png

Figure: Firmware Upgrade (high-level). Detailed flows: Compute Tray (fw_image01), Switch Tray (fw_image02), Switch NVOS (fw_image03), Mellanox (mellanox_fw_upgrade_flowchart).

Compute Tray#

Prerequisites#

Download the firmware upgrade packages.
Obtain the SOT JSON file from the NVIS team.
On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.
- Upgrade packages use the following naming convention:
  - nvfw_DGX*
  - nvfw_HGX*

Upgrading#

From the Runbooks view, select “Create Run” for the FIRMWARE_UPGRADE runbook (entry point for compute and switch). To upgrade compute firmware, provide the following parameters. The parent runbook invokes FIRMWARE_UPGRADE_COMPUTE when FWPKG_DIR_PATH_COMPUTE is set.

Required parameters:

resource_tag – Tag for flex resource query (use rack_name for rack filtering).
resource_value – Value for flex resource query (e.g., rack name or regex: A01, m06|m07).
FW_SOURCE_JSON_PATH – Path to the golden configuration JSON file included in the firmware package. This file defines the reference settings used for validation. If the file is not available, set this parameter to NA.
FWPKG_DIR_PATH_COMPUTE – Directory containing the compute firmware upgrade packages (nvfw_DGX*, nvfw_HGX*). Set to empty if you are only upgrading switch.

Optional parameters (commonly used):

FWPKG_DIR_PATH_SWITCH – Directory of the switch firmware packages. Set if you also want to upgrade switch in the same run; leave empty for compute-only.
NVOS_FILE_PATH – Full path to the NVOS bin file (if upgrading switch NVOS).
FWPKG_CX_FILE_PATH – Full path to the ConnectX firmware package. Set this (and FWPKG_BF_FILE_PATH) if you also need to upgrade Mellanox (ConnectX/BlueField) in the same run as compute.
FWPKG_BF_FILE_PATH – Full path to the BlueField firmware package. Set this (and FWPKG_CX_FILE_PATH) if you also need to upgrade Mellanox in the same run as compute.
FORCE_UPGRADE – When set to true, forces the firmware upgrade to proceed regardless of version checks; false follows standard upgrade rules.
IGNORE_LIST – List of nodes to exclude from scope (e.g., node01, or “none”).
AUX_CYCLE – When true (default), performs an AUX power cycle after upgrade; set to false to skip.

When you need to upgrade compute and Mellanox together, run FIRMWARE_UPGRADE with FWPKG_DIR_PATH_COMPUTE set and FWPKG_CX_FILE_PATH and FWPKG_BF_FILE_PATH set to the ConnectX and BlueField package paths. When you need to upgrade only compute firmware, run FIRMWARE_UPGRADE with FWPKG_DIR_PATH_COMPUTE set and FWPKG_DIR_PATH_SWITCH left empty. For advanced scenarios you may run the BreakFix_Firmware_Upgrade_Compute_nvfwupd runbook directly (e.g., single rack).

Switch Tray#

Prerequisites#

Download the firmware upgrade packages.
Obtain the golden configuration JSON file from the NVIS team.
On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.
- Upgrade packages use the following naming convention:
  - *nvfw*_0004_*
  - *nvfw*_0006_*
  - *nvfw*_0007_*

Upgrading#

When performing this upgrade, it should be noted that all of the rack’s 18 nodes will be drained from the Slurm pool, tagged with our maintenance tag, and subsequently AUX cycled.

To begin, from the Runbooks view, select “Create Run” for the FIRMWARE_UPGRADE runbook (entry point). To upgrade switch firmware, set FWPKG_DIR_PATH_SWITCH (and optionally NVOS_FILE_PATH for NVOS). The parent runbook invokes FIRMWARE_UPGRADE_SWITCH when FWPKG_DIR_PATH_SWITCH is set. Provide the same resource_tag, resource_value, and FW_SOURCE_JSON_PATH as for compute.

Required parameters (when upgrading switch):

resource_tag – Tag for flex resource query (use rack_name for rack filtering).
resource_value – Value for flex resource query (e.g., A01).
FW_SOURCE_JSON_PATH – Path to the golden configuration JSON file provided within the firmware package. If the file is not available, set this parameter to NA.
FWPKG_DIR_PATH_SWITCH – Directory of the switch firmware upgrade packages. Set to empty if you are only upgrading compute.

Optional: NVOS_FILE_PATH – Full path to the NVOS bin file if upgrading switch NVOS. FORCE_UPGRADE, AUX_CYCLE – same as for compute.

When you need to upgrade only switch firmware, run FIRMWARE_UPGRADE with FWPKG_DIR_PATH_SWITCH set and FWPKG_DIR_PATH_COMPUTE left empty. The FIRMWARE_UPGRADE_SWITCH child runbook is invoked by the parent and is not intended to be run directly. For advanced scenarios you may run the BreakFix_Switch_BMC_Upgrade_Nvfwupd runbook directly.

Switch NVOS#

Prerequisites#

Download the NVOS upgrade package bin file.
Obtain the golden configuration JSON file from the NVIS team.
On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.

Upgrading#

Use the FIRMWARE_UPGRADE runbook (entry point), same as for Switch Tray and Compute. To upgrade switch NVOS, set NVOS_FILE_PATH and optionally FWPKG_DIR_PATH_SWITCH (if also upgrading switch tray firmware). The parent runbook invokes FIRMWARE_UPGRADE_SWITCH, which performs both switch tray firmware and NVOS upgrade when NVOS_FILE_PATH is provided.

Required parameters:

resource_tag – Tag for flex resource query (use rack_name for rack filtering).
resource_value – Value for flex resource query (e.g., A01).
FW_SOURCE_JSON_PATH – Path to the golden configuration JSON file provided within the firmware package. If the file is not available, set this parameter to NA.
NVOS_FILE_PATH – Full path to the NVOS bin file including the file itself.

Optional: FWPKG_DIR_PATH_SWITCH – Set if you are also upgrading switch tray firmware in the same run; leave empty if only upgrading NVOS.

Alternatively, run BreakFix_NVOS_Upgrade directly with resource_value, NVOS_FILE_PATH, FW_SOURCE_JSON_PATH, and INPUT_SWITCH (from flex query or parent context).

Mellanox#

Figure: Mellanox FW Upgrade (ConnectX and BlueField) — from BreakFix_Firmware_Upgrade_Mellanox runbook.

Prerequisites#

Download the BlueField and ConnectX upgrade packages.
Obtain the golden configuration JSON file from the NVIS team.
On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.

Upgrading#

Use the FIRMWARE_UPGRADE runbook (entry point), same as for Compute and Switch. To upgrade Mellanox (ConnectX and BlueField) firmware, set FWPKG_CX_FILE_PATH and FWPKG_BF_FILE_PATH; you can set FWPKG_DIR_PATH_COMPUTE to a directory (or leave empty if only Mellanox). The parent runbook invokes FIRMWARE_UPGRADE_COMPUTE, which performs ConnectX and BlueField firmware upgrade when these paths are provided.

Required parameters:

resource_tag – Tag for flex resource query (use rack_name for rack filtering).
resource_value – Value for flex resource query (e.g., A01, or regex).
FW_SOURCE_JSON_PATH – Path to the golden configuration JSON file provided within the firmware package. If the file is not available, set this parameter to NA.
FWPKG_CX_FILE_PATH – Full path to the ConnectX firmware package file.
FWPKG_BF_FILE_PATH – Full path to the BlueField firmware package file (either .bin or .bfb package).

Optional: FWPKG_DIR_PATH_COMPUTE – Directory of compute firmware packages if you are also upgrading compute tray in the same run; leave empty for Mellanox-only.

Alternatively, run BreakFix_Firmware_Upgrade_Mellanox directly with FWPKG_CX_FILE_PATH, FWPKG_BF_FILE_PATH, FW_SOURCE_JSON_PATH, and INPUT_COMPUTE (from flex query or parent context).

Powershelf#

GB300 supports firmware upgrades for power shelves (LiteOn and Delta). Use the FIRMWARE_UPGRADE_POWERSHELF runbook. It upgrades PMC (Power Management Controller) first, then PSU, and finally runs SINGLENODE_HEALTHCHECK_FIRMWARE_POWERSHELF for validation.

Prerequisites#

Obtain the PSU and PMC firmware packages and the golden configuration JSON file (from the NVIS team or firmware package).
On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.

Upgrading#

From the Runbooks view, select “Create Run” for the FIRMWARE_UPGRADE_POWERSHELF runbook. Required parameters (per runbook):

resource_tag – Tag for flex resource query (use rack_name for rack filtering).
resource_value – Value for flex resource query (e.g., A01, or regex).
IGNORE_LIST – List of nodes to ignore for baseline tests. Accepted formats: single node (e.g. node01), list (e.g. ["node01", "node02"]), pipe-delimited (e.g. node01|node02), or “none” if no nodes should be ignored.
PSU_FILE_FULL_PATH – Full path to the directory and file name of the PSU firmware package.
PMC_FILE_FULL_PATH – Full path to the directory and file name of the PMC firmware package.
FW_SOURCE_JSON_PATH – Path to the golden configuration JSON file provided within the firmware package. This file defines the reference settings used for validation.
FORCE_UPGRADE – When set to true, forces the firmware upgrade to proceed regardless of version checks or validation; false follows standard upgrade rules.
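
The accepted IGNORE_LIST formats can be normalized with a few lines of code. The sketch below is only an illustration of the four formats listed above; `parse_ignore_list` is a hypothetical helper, and parsing the bracketed form as JSON is an assumption about how the runbook interprets it.

```python
import json


def parse_ignore_list(value: str) -> list:
    """Normalize an IGNORE_LIST value into a list of node names."""
    text = value.strip()
    if not text or text.lower() == "none":
        return []
    if text.startswith("["):
        # Assumption: the list form is JSON; typographic quotes are normalized first.
        text = text.replace("\u201c", '"').replace("\u201d", '"')
        return [str(node).strip() for node in json.loads(text)]
    # A single node is simply a pipe-delimited list with one element.
    return [node.strip() for node in text.split("|") if node.strip()]
```

For instance, `parse_ignore_list("node01|node02")` and `parse_ignore_list('["node01", "node02"]')` both produce the same two-element list.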

The runbook invokes POWERSHELF_COMPONENT_UPGRADE for PMC first, then PSU, and finally SINGLENODE_HEALTHCHECK_FIRMWARE_POWERSHELF to verify the upgrade.

Figure: Powershelf Firmware Upgrade Workflow – This diagram illustrates the end-to-end Powershelf firmware upgrade process. FIRMWARE_UPGRADE_POWERSHELF obtains the target powershelves via flex query, then invokes POWERSHELF_COMPONENT_UPGRADE for PMC (Power Management Controller) first and PSU next, each with optional exit on failure. The workflow concludes with SINGLENODE_HEALTHCHECK_FIRMWARE_POWERSHELF to validate the upgrade against the golden configuration JSON.

BF Firmware Bundle Extraction Guide#

This guide explains how to extract a BF firmware package provided as a BF Bundle (.bfb).

Important:
These steps must be performed on the compute node, as it already has the bfb-tool utility installed.

Prerequisites#

Ensure the following dependency is installed:

sudo apt install -y qemu-user-static

Extracting the BF Bundle

Note: The --bfb argument must use the complete (absolute) path to the .bfb file. Relative paths may cause the extraction to fail.

bfb-tool extract \
  --bfb /path/to/bf-fwbundle-<version>-prod.bfb \
  --opn 900-9D3B6-00CN-P_Ax

Using the Extracted Firmware

After extraction, go to the folder created in /tmp (named after the .bfb file). Inside this folder, open the subfolder corresponding to your OPN (e.g., 900-9D3B6-00CN-P_Ax). In that subfolder, locate the .bin firmware file, which should be used as the input to the runbook.
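
If you prefer to locate the extracted .bin file programmatically, something along these lines may help. It is a sketch under assumptions: the `find_extracted_bin` helper is hypothetical, and because the exact name of the folder created in /tmp is not spelled out here, the pattern matches any folder whose name starts with the .bfb file name without its extension.

```python
import glob
from pathlib import Path


def find_extracted_bin(bfb_path: str, opn: str, extract_root: str = "/tmp") -> list:
    """Return candidate .bin files under <extract_root>/<bfb name>*/<opn>/."""
    stem = Path(bfb_path).stem  # file name without the trailing .bfb
    # glob.escape guards against characters in the names being read as wildcards.
    pattern = f"{glob.escape(stem)}*/{glob.escape(opn)}/*.bin"
    return sorted(Path(extract_root).glob(pattern))
```

The first returned path is the value you would pass to the runbook as the BlueField firmware package; if the list is empty, check the folder layout manually.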

Troubleshooting#

1. Excluding a node from upgrade or runbook cells#

To exclude a particular node, you can add a cell near the start of your runbook (after INPUT_SWITCH is exported) with something like the following:

INPUT_SWITCH | name != "<node_name>" | export("INPUT_SWITCH")

This will export all the resources except <node_name> back into the INPUT_SWITCH variable.

Firmware upgrade includes a series of steps to ensure the nodes are removed from jobs being scheduled, complete the upgrade, and place the nodes back in service. Below are some common issues you may encounter during the firmware upgrade:

2. Nodes not reachable from headnode#

To upgrade firmware, the BMC IP must be accessible from the headnode. The runbook verifies node accessibility and automatically skips unreachable nodes.

Action: Ensure the node is online and accessible from the headnode, then rerun the firmware upgrade runbook.

3. Failed firmware upgrade#

The firmware upgrade failed to complete successfully. Possible causes include failures in the nvfwupd command, the NVOS upgrade, or the flint command, depending on the package.

Action: Verify logs by clicking on the Output, which includes the command’s stdout and stderr.

Note that the runbook does not automatically undrain the nodes or remove the maintenance tag when the firmware upgrade fails. After verifying that the failures are safe to ignore and the nodes are ready to return to the pool, undrain and untag the nodes using the UNDRAIN_AND_UNTAG_RACKS runbook. Provide resource_tag (e.g. rack_name) and resource_value (e.g. B05 or m06|m07) when prompted.

4. SSH failures for Switch Upgrade runbooks#

The switch firmware upgrade runs commands via SSH. The runbook dynamically retrieves the user and password from BCM using cmsh commands, which are then used to run the SSH commands.

Action: Verify SSH access to the switch from the headnode using the credentials stored in BCM.

5. Failed validation stage#

The last step in every firmware upgrade is validation. The runbook selects a subset of tests to verify upgrade success. Failures may result from upgrade issues, incorrect SOT JSON, or command failures when fetching component versions.

Action: Compare the expected and actual versions in the logs and check for any other errors.
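
When comparing expected and actual versions by hand, it can help to reduce the two sets to a list of mismatches. The following is a generic sketch, not the validation logic used by the runbook; `diff_versions` and the component names in the example are hypothetical.

```python
def diff_versions(expected: dict, actual: dict) -> dict:
    """Return {component: (expected, actual)} for every component that does not match."""
    mismatches = {}
    for component, want in expected.items():
        got = actual.get(component)  # None means the version could not be fetched
        if got != want:
            mismatches[component] = (want, got)
    return mismatches


# Hypothetical values copied from a validation log.
expected = {"BMC": "1.2.3", "HMC": "4.5.6"}
actual = {"BMC": "1.2.3", "HMC": "4.5.0"}
mismatches = diff_versions(expected, actual)
```

A component that is missing from the actual versions shows up with `None`, which usually points to a command failure when fetching the version rather than to a failed upgrade.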

Notes#

Netcat is used to check if nodes are back online after a reboot.
The runbook cannot exclude a subset of nodes. This means that if any nodes are down, the runbook will skip those nodes and upgrade the others.
Multiple racks cannot be upgraded simultaneously.
If nvfwupd does not perform an upgrade (for example, because the component is already at the specified version) and FORCE_UPGRADE is not set to true, the runbook will exit after removing the maintenance tag.
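
The Netcat-style reachability check mentioned in the notes above can be approximated in Python with a plain TCP connection attempt. This is a sketch, not the runbook's code: `port_open` and `wait_until_online` are hypothetical helpers, and the default port 22 and the retry values are assumptions, since the port used by the runbook is not documented here.

```python
import socket
import time


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_online(host: str, port: int = 22, attempts: int = 30, delay: float = 10.0) -> bool:
    """Poll the port until it answers or the attempts are exhausted."""
    for attempt in range(attempts):
        if port_open(host, port):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next probe
    return False
```

The same helper can be used from the headnode to confirm that a node is reachable before rerunning an upgrade.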

Links#

Please see Firmware Reports for more information on Firmware related reports.

NVIDIA Mission Control autonomous hardware recovery Break/Fix Workflow#

Break/Fix Introduction#

NVIDIA Mission Control autonomous hardware recovery provides automated break/fix workflows to handle tray failures for GB300. These workflows execute a series of diagnostic steps to determine the cause of a failure, take the necessary repair steps, and create support tickets for issues that cannot be resolved automatically.

The automated break/fix workflow is designed to efficiently diagnose and remediate issues, with clear paths for different failure scenarios and comprehensive validation to ensure systems are properly restored to service.

Figure: Compute Break/Fix Workflow (high-level) — Trigger, Triage, Validation, then Return to service or Run diag / Support ticket.

Figure: Compute Break/Fix Workflow (detailed) — Explains each phase: 1. Triage (connectivity check, SRAM UC check, leak detection, power cycle, GPU recovery), 2. Verification (BREAKFIX_COMPUTE_TRAY_VALIDATION, tests passed?), 3. Run diag (BREAKFIX_DIAG_DUMP, support ticket creation). All paths from triage converge on validation; failed validation leads to diagnostic dump and support ticket; success leads to return to service.

Key Features#

Automatic Detection: Identifies drained nodes without manual intervention
Intelligent Triage: Routes to appropriate diagnostic workflows based on failure symptoms
Comprehensive Diagnostics: Performs thorough hardware and software checks
Automated Remediation: Attempts to resolve issues without human intervention when possible
Detailed Reporting: Provides comprehensive logs for RMA or further troubleshooting

Entrypoint of Break/Fix Workflow#

A centralized automated break/fix interface has been established to facilitate streamlined diagnostics and remediation. This unified entry point provides comprehensive access to the break/fix framework, enabling efficient navigation and implementation of all remediation procedures.

To access the break/fix interface:

Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.
Locate the “BREAKFIX_TRIGGER” runbook through the search functionality.
Upon selecting the “BREAKFIX_TRIGGER” runbook, you will be presented with the interface showing the workflow for automated break/fix.

Run Break/Fix Workflow#

The BREAKFIX_TRIGGER is a time-triggered runbook that automatically runs every 5 minutes. When executed, it:

Gets drained nodes from BCM and proceeds only if the drain reason is one of the allowed reasons. Otherwise the runbook exits with no action.
Allowed drain reasons (runbook exits if no nodes have these reasons): Excluded by ARE, Prolog/Epilog error, Kill task failed, Duplicate jobid, Low RealMemory, Drained by CMDaemon, Test breakfix flow, or Not responding.
Processes one node per execution cycle from the drained node pool; applies a maintenance tag in AHR to prevent duplicate processing, and ensures drained nodes are handled sequentially across multiple runs.
Verifies the AHR agent is connected and accepts commands before proceeding; exits if the agent does not accept commands.
Gets the drain reason from BCM, updates it with “(AHR in-progress)”, and re-drains the node via BCM with the updated reason.
Invokes BREAKFIX_COMPUTE_TRAY_TRIAGE(INPUT, DRAIN_REASON) asynchronously to perform triage on the selected compute node.
No manual intervention is required to start the process.

The BREAKFIX_TRIGGER runbook is shared across GB300, GB200, and B200. It has no user-configurable scope parameters; it discovers drained nodes from BCM automatically.
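
The node selection performed on each trigger cycle can be sketched as follows. This is a simplified model for illustration: `select_node` is a hypothetical helper, the drained nodes are passed in as a plain dictionary rather than queried from BCM, and matching drain reasons by case-insensitive substring is an assumption, since the exact matching rule is not documented here.

```python
ALLOWED_DRAIN_REASONS = (
    "Excluded by ARE",
    "Prolog/Epilog error",
    "Kill task failed",
    "Duplicate jobid",
    "Low RealMemory",
    "Drained by CMDaemon",
    "Test breakfix flow",
    "Not responding",
)
IN_PROGRESS_SUFFIX = "(AHR in-progress)"


def select_node(drained: dict, in_maintenance: set):
    """Pick one drained node to triage, or return None if nothing qualifies.

    `drained` maps node name to drain reason; `in_maintenance` holds nodes
    that already carry the maintenance tag.
    """
    for node, reason in drained.items():
        if node in in_maintenance:
            continue  # already being processed by an earlier run
        # Assumption: a reason qualifies if it contains one of the allowed strings.
        if any(allowed.lower() in reason.lower() for allowed in ALLOWED_DRAIN_REASONS):
            return node, f"{reason} {IN_PROGRESS_SUFFIX}"
    return None
```

Returning a single node per call reflects the one-node-per-cycle behavior; the remaining drained nodes are picked up by later runs.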

One can see the time trigger settings on the right side of the Runbook under Triggers, where it shows that it is currently enabled and runs every 5 minutes.

You can also manually trigger the workflow:

Navigate to the “BREAKFIX_TRIGGER” runbook in the Runbooks section
Select the “Create Run” button positioned in the top right corner of the interface
After initiating the process, a confirmation dialog will appear with a “View Run” link
Selecting this link will redirect you to a page displaying comprehensive job status and details

The automated nature of this workflow ensures that system issues are addressed promptly without requiring constant monitoring or manual intervention.

Break/Fix Workflow Components#

The break/fix system consists of several key components that work together to diagnose and remediate issues with compute nodes. The following is a detailed explanation of each component:

BREAKFIX_TRIGGER#

The entry-point runbook (shared for GB300, GB200, and B200 compute break/fix) that:

Runs automatically every 5 minutes via time trigger.
Gets drained nodes from BCM and only proceeds if the drain reason is one of the allowed reasons: Excluded by ARE, Prolog/Epilog error, Kill task failed, Duplicate jobid, Low RealMemory, Drained by CMDaemon, Test breakfix flow, or Not responding.
Processes one node per run: selects a drained node not already in maintenance, verifies AHR agent accepts commands, sets the maintenance tag, gets and updates the drain reason with “(AHR in-progress)”, re-drains via BCM, then invokes BREAKFIX_COMPUTE_TRAY_TRIAGE(INPUT, DRAIN_REASON) asynchronously.
Has no user-facing parameters; node discovery and scope are driven by BCM drained state.
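
The selection logic described above can be sketched roughly as follows. This is an illustrative model only, not the runbook's implementation: the `DrainedNode` record, the exact-match comparison on drain reasons, and the function names are assumptions made for the example. The allowed reasons, the maintenance tag, and the "(AHR in-progress)" marker come from this section.

```python
from dataclasses import dataclass, field

# Allowed drain reasons as listed in this section.
ALLOWED_DRAIN_REASONS = {
    "Excluded by ARE",
    "Prolog/Epilog error",
    "Kill task failed",
    "Duplicate jobid",
    "Low RealMemory",
    "Drained by CMDaemon",
    "Test breakfix flow",
    "Not responding",
}

IN_PROGRESS_MARKER = "(AHR in-progress)"


@dataclass
class DrainedNode:
    # Hypothetical record; the real runbook reads this state from BCM and AHR.
    name: str
    drain_reason: str
    tags: set = field(default_factory=set)
    agent_accepts_commands: bool = True


def select_node(drained_nodes):
    """Return the first eligible drained node, or None if the run should exit."""
    for node in drained_nodes:
        if node.drain_reason not in ALLOWED_DRAIN_REASONS:
            continue  # reason is not handled by break/fix (exact match is an assumption)
        if "maintenance" in node.tags:
            continue  # already picked up by an earlier run
        return node
    return None


def start_breakfix(node):
    """Tag the node and return its updated drain reason, or None if the agent is unavailable."""
    if not node.agent_accepts_commands:
        return None  # the runbook exits when the agent does not accept commands
    node.tags.add("maintenance")
    node.drain_reason = f"{node.drain_reason} {IN_PROGRESS_MARKER}"
    return node.drain_reason
```

Because a selected node is tagged before triage starts, a later run skips it and moves on to the next drained node, which is how one-node-per-run processing stays sequential.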

BREAKFIX_COMPUTE_TRAY_TRIAGE#

This runbook is automatically triggered by BREAKFIX_TRIGGER and performs comprehensive triage on drained compute trays. It follows one of two workflows, depending on whether the compute node responds:

Workflow for Unresponsive Compute Nodes#

Initial Assessment
- Tests connectivity using ping to check if compute nodes are responsive or unresponsive
- Identifies nodes that are already unresponsive and require recovery
Leak Detection
- Checks if any leaking is reported through BCM
- Creates Support ticket immediately for any nodes with detected leaking (if opted in to support ticket service)
Recovery Process for Non-Leaking Nodes
- Initiates power cycle for nodes without leaking issues
- Waits and checks if hosts come back online
- Waits until the AHR agent is connected to confirm successful recovery
Failure Handling
- Creates Support ticket for hosts that fail to start up (if opted in to support ticket service)
Validation
- For recovered nodes, automatically runs BREAKFIX_COMPUTE_TRAY_VALIDATION to verify functionality

Workflow for Responsive Compute Nodes#

GPU Recovery Assessment
- Checks if any GPU recovery action is present
- Routes to GPU_RECOVERY runbook if GPU issues are detected
- Automatically runs BREAKFIX_COMPUTE_TRAY_VALIDATION for nodes without specific issues
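
As a reading aid, the two triage workflows can be condensed into a single decision function. This is a minimal sketch under stated assumptions: the function name, its boolean inputs, and the step labels other than the runbook names are hypothetical, and the real runbook gathers these facts from ping, BCM, and the AHR agent.

```python
def triage_plan(responsive, leaking=False, recovered=False, gpu_recovery_needed=False):
    """Return the ordered steps the triage description implies for one node."""
    if not responsive:
        if leaking:
            # Leaking nodes are not power cycled; a ticket is raised immediately.
            return ["create_support_ticket"]
        steps = ["power_cycle", "wait_for_host"]
        if recovered:
            steps += ["wait_for_agent", "BREAKFIX_COMPUTE_TRAY_VALIDATION"]
        else:
            steps.append("create_support_ticket")
        return steps
    if gpu_recovery_needed:
        return ["GPU_RECOVERY"]
    return ["BREAKFIX_COMPUTE_TRAY_VALIDATION"]
```

For example, `triage_plan(responsive=False, recovered=True)` yields a power cycle followed by validation, while `triage_plan(responsive=True)` goes straight to validation. Support tickets are created only if the support ticket service is opted in.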

GPU_RECOVERY#

This specialized diagnostic runbook is automatically invoked by BREAKFIX_COMPUTE_TRAY_TRIAGE when GPU-related issues are detected. It proceeds as follows:

Verification and Assessment
- Verifies the node is still drained from BCM
- Categorizes recovery actions (Reboot, Reset, or None)
Recovery Actions Based on Type
- For Reboot Action:
  - Reboots the node requiring GPU reboot
  - Waits for host to come back online
  - Automatically runs BREAKFIX_COMPUTE_TRAY_VALIDATION if host is up
  - Creates Support ticket for hosts that fail to start up (if opted in to support ticket service)
- For Reset Action:
  - Resets the GPU
  - Automatically runs BREAKFIX_COMPUTE_TRAY_VALIDATION to verify on successful GPU reset
  - If reset fails, automatically runs BREAKFIX_DIAG_DUMP and creates a Support Ticket (if opted in to support ticket service)
- For No Action Required:
  - Automatically runs BREAKFIX_COMPUTE_TRAY_VALIDATION directly
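
The recovery actions above amount to a small lookup table keyed by action type and outcome. The sketch below is illustrative only: the table name, the `succeeded` flag, and the step labels other than the runbook names are assumptions, and it omits the initial check that the node is still drained.

```python
# Keys are (recovery action, whether the reboot or reset succeeded).
GPU_RECOVERY_STEPS = {
    ("Reboot", True): ["reboot_node", "wait_for_host", "BREAKFIX_COMPUTE_TRAY_VALIDATION"],
    ("Reboot", False): ["reboot_node", "wait_for_host", "create_support_ticket"],
    ("Reset", True): ["reset_gpu", "BREAKFIX_COMPUTE_TRAY_VALIDATION"],
    ("Reset", False): ["reset_gpu", "BREAKFIX_DIAG_DUMP", "create_support_ticket"],
    ("None", True): ["BREAKFIX_COMPUTE_TRAY_VALIDATION"],
}


def gpu_recovery_steps(action, succeeded=True):
    """Look up the follow-up steps for a categorized GPU recovery action."""
    try:
        return GPU_RECOVERY_STEPS[(action, succeeded)]
    except KeyError:
        raise ValueError(f"unsupported combination: {action!r}, succeeded={succeeded}") from None
```

Only a failed GPU reset leads to BREAKFIX_DIAG_DUMP directly; the other paths reach it, if at all, through a failed validation.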

BREAKFIX_COMPUTE_TRAY_VALIDATION#

This runbook is automatically executed after triage and recovery operations to validate system health:

Comprehensive Testing
- Runs testing suites to validate the compute tray
- Executes the HPL test (for break/fix scenarios, HPL uses mpirun for single-node execution instead of Slurm)
- Runs the AHR prolog script so that nodes still failing prolog checks are not undrained; otherwise they would be drained again at the next Slurm invocation
Result Handling
- For failed tests, automatically runs BREAKFIX_DIAG_DUMP for detailed diagnostics
- For passed tests, undrains/untags the host to return it to service

BREAKFIX_DIAG_DUMP#

This runbook is automatically triggered when validation tests fail or a GPU reset fails. It collects comprehensive diagnostic information for support ticket creation:

Runs NVSSVT (NVIDIA System Software Validation Toolkit)
Collects NVSM (NVIDIA System Management) health dumps
Executes EUD (End User Diagnostics)
Runs Partnerdiag if necessary
Creates a consolidated diagnostic log dump package
Generates a Support ticket with diagnostic log for support (if opted in to support ticket service)

Prerequisites for EUD and Partnerdiag:

EUD: Binary must be installed on every compute node for execution
Partnerdiag: Binary must be installed on the head node under path /cm/shared/partnerdiag
If these binaries are not properly installed, EUD and Partnerdiag will be skipped during diagnostic collection
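
A quick way to anticipate which diagnostics will be skipped is to check for the binaries ahead of time. The sketch below assumes a single machine for simplicity, although EUD lives on each compute node and Partnerdiag on the head node. The function name is hypothetical, and the EUD location is supplied by the caller because this section does not state it.

```python
from pathlib import Path

# Head-node location stated in the prerequisites above.
PARTNERDIAG_PATH = Path("/cm/shared/partnerdiag")


def diagnostics_to_skip(eud_path, partnerdiag_path=PARTNERDIAG_PATH):
    """Return the names of diagnostics whose binaries are missing."""
    skipped = []
    if not Path(eud_path).exists():
        skipped.append("EUD")
    if not Path(partnerdiag_path).exists():
        skipped.append("Partnerdiag")
    return skipped
```

An empty result means both tools are present and diagnostic collection can run them.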

View Break/Fix Result#

Users can monitor break/fix operations and determine outcomes through multiple methods:

Accessing Break/Fix Results#

Via Runbook Execution View:

Click “Runs” in the upper left corner
Filter by “BREAKFIX_TRIGGER” to see all break/fix executions
Select a specific run to view detailed execution flow

Via Resource Run History:

Navigate to “Resources” in the left panel and search for the resource (e.g., “b06-p1-dgx-06-c05”)
Click on the resource name
View the “Run History” page which displays all runbooks the resource participated in, with execution timestamps and status
Filter or search for BREAKFIX related runbooks to see the complete history of remediation attempts for that specific node

Understanding Break/Fix Outcomes#

Successful Recovery Indicators:

Node status changes from “DRAINED” to “IDLE” or “ALLOCATED” in BCM
Maintenance tag is removed from the node
BREAKFIX_COMPUTE_TRAY_VALIDATION shows “PASSED” status
Node is automatically returned to service

Failed Recovery Indicators:

Node remains in “DRAINED” state
Support ticket is automatically created (if opted in to support ticket service)
BREAKFIX_DIAG_DUMP execution indicates diagnostic collection
Maintenance tag remains on the node
Drain reason is updated to add “(AHR complete)” to indicate AHR processing has finished. Note that “(AHR in-progress)” indicates Break/Fix is still processing the node.
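
When scanning many nodes, the drain-reason markers can be read programmatically. This is a minimal sketch: the function name and the returned labels are hypothetical, and because this section does not say whether "(AHR complete)" replaces or follows "(AHR in-progress)", the completed marker is checked first.

```python
def ahr_status(drain_reason):
    """Classify a drain reason by the AHR marker it carries."""
    if "(AHR complete)" in drain_reason:
        return "complete"  # break/fix finished without returning the node to service
    if "(AHR in-progress)" in drain_reason:
        return "in-progress"  # break/fix is still processing the node
    return "not-processed"
```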

Determining Recovery Path#

GPU Recovery Path:

Look for GPU_RECOVERY runbook execution in the workflow
Check if GPU reboot or reset actions were performed
Validation results indicate GPU functionality restoration

Power Cycle Recovery Path:

BREAKFIX_COMPUTE_TRAY_TRIAGE shows auxiliary power cycle execution
Node connectivity tests show successful recovery

Monitoring Ongoing Operations#

Break/fix operations run every 5 minutes automatically
Check the “maintenance” tag to see which nodes are currently being processed
Review recent BREAKFIX_TRIGGER executions to track system-wide break/fix operations

NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA#

Break/Fix Post RMA Introduction#

The NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA workflow automates the process of bringing hardware components back into service after a Return Merchandise Authorization (RMA) replacement. This workflow ensures that replaced hardware is properly configured, firmware is updated to the correct versions, and the component is thoroughly validated before returning to production.

Figure: Break/Fix Post RMA Workflow - This diagram illustrates the comprehensive post-RMA process for compute tray replacement, starting with BCM inventory updates for new MAC addresses, followed by BMC credential management, OPROM configuration, and boot order correction. The workflow then proceeds through firmware updates to ensure correct versions, concludes with system validation including agent connectivity verification and comprehensive testing via BREAKFIX_COMPUTE_TRAY_VALIDATION, and automatically returns validated hardware to service.

Key Features#

Automated Configuration: Configures replaced hardware components with proper settings
Firmware Updates: Updates firmware to match the required versions for the environment
Boot Order Correction: Ensures proper boot sequence for reliable operation
Comprehensive Validation: Performs thorough testing to verify hardware functionality
Seamless Integration: Automatically returns validated hardware to service

Post RMA Workflow Components#

The Post RMA workflow consists of several key steps that ensure replaced hardware is properly configured and validated:

Physical Replacement Procedures#

Compute Tray Removal: Detailed step-by-step instructions for safely removing failed compute trays, including power down procedures, cable disconnection, and proper handling
Compute Tray Installation: Comprehensive installation guide covering component migration (M.2 boot drive, E1.S cache drives, HMC, BMC, TPM), rail installation, and cable reconnection
Component Migration: Transfer of critical components from old tray to new tray while maintaining proper slot assignments and ESD protection

BCM Inventory Update#

ONLY REQUIRED FOR NEW HARDWARE: Updates BCM inventory information using the BREAKFIX_POST_RMA_UPDATE_BCM_INVENTORY runbook when a new compute tray is installed (GB300 uses a dedicated runbook with parameters for BMC MAC, BF3_1 MAC/Storage, host name, and optional tray serial number)
Skip this step for repaired trays as MAC addresses remain unchanged
Ensures MAC addresses and other hardware identifiers are correctly registered in BCM (new MAC addresses are provided by the Enterprise Support team who manage serial numbers and asset inventory for customer deployments)
Enables proper management and monitoring of the replaced hardware

BMC Credential Management#

Creates necessary BMC credential files for secure access to hardware components
Establishes secure communication channels for configuration operations

BlueField Configuration#

Checks if BlueField devices are in NIC mode
Enables OPROM on BlueField devices to ensure proper initialization
Configures hardware components for optimal operation

Boot Order Correction#

Ensures the boot sequence is properly configured
Prevents boot failures and improves system reliability
Performs power reset through BMC after configuration changes

Connectivity Verification#

Verifies SSH connectivity to compute nodes
Checks BCM device status to ensure proper registration
Confirms network accessibility before proceeding with firmware updates

Firmware Updates#

Upgrades compute firmware to the version in the specified package directory
Upgrades Mellanox BlueField and ConnectX firmware using the provided package file paths
Uses the golden configuration JSON (FW_SOURCE_JSON_PATH) for validation; all components are updated in a single run

System Validation#

Waits for hosts to come back online after each firmware update cycle
Verifies agent connectivity to ensure management capabilities
Runs comprehensive validation tests using BREAKFIX_COMPUTE_TRAY_VALIDATION
Opens nodes in BCM and validates Slurm readiness for successful nodes

Running the Post RMA Workflow#

Step 1: Update BCM Inventory (ONLY FOR NEW HARDWARE)#

Note: This step is ONLY required when installing a new compute tray. Skip this step for repaired trays as MAC addresses remain unchanged.

For new hardware replacement, update the BCM inventory with new MAC addresses and hardware identifiers using the BREAKFIX_POST_RMA_UPDATE_BCM_INVENTORY runbook. This runbook updates the BCM inventory accordingly and supports provisioning the compute node and running baseline tests.

Navigate to the “BREAKFIX_POST_RMA_UPDATE_BCM_INVENTORY” runbook in the Runbooks section
Configure the required parameters (MAC addresses and serial numbers are provided by the Enterprise Support team):
- HOST_NAME (required): Host name to update in BCM inventory (e.g., a07-p1-dgx-03-c08)
- BMC_MAC (required): MAC Address of the BMC
- BF3_1_MAC (required): MAC Address of Predictable Network Interface Name enP22p3s0f0np0
- BF3_1_STORAGE (optional): MAC Address of Predictable Network Interface Name enP22p3s0f1np1
- TRAY_SERIAL_NUMBER (optional): Serial Number of the Tray
Select “Create Run” to initiate the BCM inventory update
Monitor the execution progress to ensure successful completion
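
Before creating the run, it can help to sanity-check the values received from Enterprise Support. The sketch below is not part of the runbook: the function name is hypothetical, and the colon-separated MAC format is an assumption about what BCM accepts.

```python
import re

# Six colon-separated hex octets (assumed format), e.g. 00:11:22:33:44:55.
MAC_RE = re.compile(r"^[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}$")


def validate_inventory_params(params):
    """Return a list of problems found in the inventory-update parameters."""
    errors = []
    for key in ("HOST_NAME", "BMC_MAC", "BF3_1_MAC"):
        if not params.get(key):
            errors.append(f"{key} is required")
    for key in ("BMC_MAC", "BF3_1_MAC", "BF3_1_STORAGE"):
        value = params.get(key)
        if value and not MAC_RE.match(value):
            errors.append(f"{key} is not a valid MAC address: {value}")
    return errors
```

An empty list means the required parameters are present and every supplied MAC address is well formed. TRAY_SERIAL_NUMBER is optional and is not checked here.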

Step 2: Execute Main Post RMA Workflow#

After successfully updating the BCM inventory (if required for new hardware), execute the main Post RMA workflow:

Navigate to the “BREAKFIX_POST_RMA” runbook in the Runbooks section
Configure the required parameters:
- HOST_NAME (required): Name of the replaced host (e.g., a07-p1-dgx-03-c08)
- FWPKG_DIR_PATH (required): Full path to the directory containing all firmware packages or the individual package (.fwpkg) for compute nodes
- FWPKG_BF_FILE_PATH (required): Full path to the BlueField (BF3) firmware package file
- FWPKG_CX_FILE_PATH (required): Full path to the ConnectX firmware package file
- FW_SOURCE_JSON_PATH (required): Path to the golden configuration JSON from the firmware package; defines reference settings for validation. Set to NA if not available.
- resource_tag (optional): Tag for resource query (e.g., rack_name)
- resource_value (optional): Value for resource query (e.g., rack identifier)
Configure the required secrets (if not already configured):
- AHR_API_ENDPOINT: The API endpoint URL for NVIDIA Mission Control
- AHR_TOKEN: Authentication token for API access
To add or update these secrets:
1. In the Shoreline UI, go to Settings.
2. Click on the Secrets section.
3. Use the + Secret button to create:
  - AHR_API_ENDPOINT — Provide the correct API endpoint.
  - AHR_TOKEN — Provide the secure API token.
4. If a secret already exists, click its name to update the value.
5. Click Save to persist the changes.
🔐 These secrets will be securely injected into the action at runtime.
Select “Create Run” to initiate the workflow
Monitor the execution progress and results
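
The parameter rules above lend themselves to a pre-flight check. The sketch below is illustrative only: it assumes the check runs on the machine where the firmware files reside, and the function name and messages are hypothetical. It treats FW_SOURCE_JSON_PATH as either the literal NA or a readable JSON file, as described above.

```python
import json
from pathlib import Path

REQUIRED = (
    "HOST_NAME",
    "FWPKG_DIR_PATH",
    "FWPKG_BF_FILE_PATH",
    "FWPKG_CX_FILE_PATH",
    "FW_SOURCE_JSON_PATH",
)


def preflight(params):
    """Return a list of problems with BREAKFIX_POST_RMA parameters."""
    problems = [f"missing {key}" for key in REQUIRED if not params.get(key)]
    if problems:
        return problems
    fw = Path(params["FWPKG_DIR_PATH"])
    # Either a directory of packages or a single .fwpkg file is accepted.
    if not (fw.is_dir() or (fw.is_file() and fw.suffix == ".fwpkg")):
        problems.append(f"FWPKG_DIR_PATH is neither a directory nor a .fwpkg file: {fw}")
    for key in ("FWPKG_BF_FILE_PATH", "FWPKG_CX_FILE_PATH"):
        if not Path(params[key]).is_file():
            problems.append(f"{key} does not point to a file: {params[key]}")
    golden = params["FW_SOURCE_JSON_PATH"]
    if golden != "NA":
        try:
            with open(golden) as handle:
                json.load(handle)
        except (OSError, ValueError) as exc:
            problems.append(f"FW_SOURCE_JSON_PATH is not readable JSON: {exc}")
    return problems
```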

Note: The workflow includes detailed physical replacement instructions that must be followed before executing the automated portions. Ensure all physical replacement steps are completed as outlined in the runbook’s markdown sections.

NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA Powershelf#

Break/Fix Post RMA Powershelf Introduction#

The Break/Fix Post RMA Powershelf workflow brings a replaced power shelf tray back into service after an RMA. It is available for GB300 and GB200 and consists of two runbooks: updating BCM inventory for the new powershelf node (when applicable), then running the main powershelf post RMA workflow to verify BCM status, identify the PowerShelf manufacturer (LiteOn or Delta), and upgrade PSU and PMC firmware to the provided versions.

Figure: Powershelf Post RMA Workflow — BCM inventory update (if new HW), Check BCM device status, Determine manufacturer (LiteOn or Delta), Upgrade PMC and PSU firmware, Return to service.

Running the Post RMA Powershelf Workflow#

Step 1: Update BCM Inventory (ONLY FOR NEW POWERSHELF HARDWARE)#

Note: This step is ONLY required when installing a new power shelf tray. Skip for repaired trays where the BMC MAC is unchanged.

Use the BREAKFIX_POST_RMA_UPDATE_POWERSHELF_BCM_INVENTORY runbook to register the new powershelf in BCM:

Navigate to the “BREAKFIX_POST_RMA_UPDATE_POWERSHELF_BCM_INVENTORY” runbook in the Runbooks section.
Configure the required parameters:
- HOST_NAME (required): Host name to update in BCM inventory (e.g., b06-p01-pwr-01).
- BMC_MAC (required): MAC address of the powershelf BMC.
Select “Create Run” and monitor the execution.

Step 2: Execute Main Powershelf Post RMA Workflow#

After updating BCM inventory (if required for new hardware), run the main powershelf post RMA workflow:

Navigate to the “BREAKFIX_POWERSHELF_POST_RMA” runbook in the Runbooks section.
Configure the required parameters:
- resource_value (required): Scope value (e.g., rack name) for the workflow.
- HOST_NAME (required): Name of the replaced powershelf host (e.g., b06-p01-pwr-01).
- PSU_FILE_FULL_PATH (required): Full path to the PSU tar file for powershelf nodes.
- PMC_FILE_FULL_PATH (required): Full path to the PMC tar file for powershelf nodes.
- FW_SOURCE_JSON_PATH (required): Path to the golden configuration JSON from the firmware package; defines reference settings for validation.
Select “Create Run” to start the workflow. The runbook checks BCM device status, determines the PowerShelf manufacturer (LiteOn or Delta), and upgrades PSU and PMC firmware via the firmware upgrade step. Monitor execution until completion.

NVIDIA Mission Control autonomous hardware recovery Domain Triage#

Domain Triage Introduction#

The BREAKFIX_DOMAIN_TRIAGE runbook provides manual diagnostics and troubleshooting for NVSwitch and NVLink domain-level issues. Unlike the Break/Fix workflow, it is not triggered automatically by BREAKFIX_TRIGGER; run it manually, on demand, when domain-level problems are identified. The workflow collects comprehensive diagnostic information to facilitate efficient resolution and minimize system downtime. This runbook is available for GB300 and GB200 only.

Figure: Domain Triage Workflow - This diagram illustrates the comprehensive domain-level diagnostic process for NVSwitch and NVLink issues. The workflow begins with compute node management by adding maintenance tags and draining nodes to prevent workload interference. It then proceeds through NVSwitch credential management to establish secure access, followed by extensive diagnostic data collection including BMC log dumps, NVDebug analysis, Nvlmapper connectivity checks, and PartnerDiag hardware diagnostics. The process concludes with case management that consolidates all diagnostic information into a comprehensive package and creates support tickets with attached logs for efficient troubleshooting.

Domain Triage Workflow Components#

The Domain Triage workflow consists of several key steps that ensure thorough diagnosis of NVSwitch and NVLink domain issues:

Compute Node Management#

Adds AHR maintenance tags to all compute nodes in the affected rack
Drains compute nodes from Slurm to prevent workloads from running during diagnostics (no jobs will be scheduled on the entire rack)

NVSwitch Credential Management#

Retrieves BMC credentials for NVSwitches from BCM
Establishes secure access to NVSwitch components for diagnostics
Collects system rack serial numbers for identification

Diagnostic Data Collection#

Dumps BMC logs from NVSwitches to capture hardware-level events
Runs NVDebug tool to collect detailed information about NVSwitch status
Executes Nvlmapper tool to check NVLink status and connectivity (binary must be installed under /tmp/nvlmapper; see PID for download)

Switch Firmware Upgrade#

Upgrades NVSwitch firmware to the version provided in the specified package directory, using the golden configuration JSON for validation

PartnerDiag#

Runs PartnerDiag for comprehensive hardware diagnostics (binary must be installed under /tmp/partnerdiag; see PID for download)

Case Management#

Collects and organizes all diagnostic logs into a single package
Creates a support ticket with the consolidated diagnostic package (default message: NMC_BREAKFIX_NVSwitch; details: “Potential NVSwitch or NVLink issue”)
Attaches detailed logs to facilitate efficient troubleshooting

Running the Domain Triage Workflow#

Note: Domain Triage must be triggered manually on demand; unlike BREAKFIX_TRIGGER, it does not run on an automated trigger.

To manually execute the Domain Triage workflow:

Navigate to the “BREAKFIX_DOMAIN_TRIAGE” runbook in the Runbooks section
Configure the required parameters:
- resource_tag: Tag for the resource query (use rack_name for rack filtering).
- resource_value: Value for the resource query; set to the rack to diagnose (e.g., B05). Use the same value as you would for RACK_NAME.
- SWITCH_FWPKG_DIR_PATH: Full path to the directory containing all firmware packages for the switch node.
- FW_SOURCE_JSON_PATH: Path to the golden configuration JSON file (from the firmware package) that defines the reference settings used for validation.
Optionally set:
- IGNORE_LIST: Comma-separated list of nodes to exclude from scope (e.g., node01,node02), or none.
- MESSAGE: Message for the support ticket (default: NMC_BREAKFIX_NVSwitch).
- SYSTEM_SERIAL, BMC_CRED_FILE, LOG_FILE_PATH: For support ticket and credential/log paths when needed.
Select “Create Run” to initiate the workflow
Monitor the execution progress and results

Domain Triage Results#

Accessing Domain Triage Results#

Via Runbook Execution View:

Navigate to “Runbooks” in the left panel
Click “Runs” in the upper left corner
Filter by “BREAKFIX_DOMAIN_TRIAGE” to see all domain triage executions
Select a specific run to view detailed execution flow

Via Resource Run History:

Navigate to “Resources” in the left panel and search for rack name (e.g., “b07”)
Click on any compute node from the affected rack
View the “Run History” page to see all domain triage executions for that resource
Filter for “BREAKFIX_DOMAIN_TRIAGE” to see domain-level diagnostic attempts

Understanding Domain Triage Outcomes#

After successful completion of the Domain Triage workflow:

Diagnostic Data Collection:

BMC logs from all NVSwitches in the affected domain
NVDebug output containing detailed switch status information
Nvlmapper results showing NVLink connectivity status
PartnerDiag comprehensive hardware diagnostic reports

Support Ticket Creation:

Automated support ticket generation with consolidated diagnostic package (if opted in to support ticket service)
All collected logs attached to the ticket for support team analysis
Rack serial numbers and system identification included
Maintenance tags remain on compute nodes until issue resolution

Monitoring Domain Triage Progress#

During Execution:

Check compute node status - nodes should show maintenance tags
Verify diagnostic tool execution in runbook cells
Monitor support ticket creation process

Post-Execution:

Support teams receive comprehensive diagnostic information
All relevant logs are organized and accessible
System remains in maintenance mode pending resolution

NVIDIA Mission Control autonomous hardware recovery Break/Fix Switch Post RMA#

Switch Post RMA Introduction#

The NVIDIA Mission Control autonomous hardware recovery Break/Fix Switch Post RMA workflow automates the process of bringing NVSwitch components back into service after a Return Merchandise Authorization (RMA) replacement. This comprehensive workflow includes both physical switch tray replacement procedures and automated software configuration to ensure that replaced switch hardware is properly configured, firmware is updated to the correct versions, and the component is thoroughly validated before returning to production. This runbook is available for GB300 and GB200.

Figure: Switch Post RMA Workflow - This diagram illustrates the comprehensive switch replacement process following RMA procedures. The workflow begins with mandatory BCM inventory updates to register new MAC addresses, followed by switch connectivity verification and secure credential establishment. The process continues through factory reset and Zero Touch Provisioning for clean initialization, and firmware updates to match required versions. The workflow concludes with comprehensive system validation including NMX controller verification, rebooting the compute nodes, and thorough compute tray validation to ensure the replaced switch integrates seamlessly with the existing infrastructure.

Key Features#

Physical Replacement Procedures: Detailed instructions for safe switch tray removal and installation with proper cooling and power management
Compute Node Management: Adds maintenance tags and drains compute nodes during switch replacement to prevent workload interference
Switch Connectivity Verification: Establishes and verifies SSH connectivity to replaced switch components
Factory Reset and ZTP: Performs factory reset and monitors Zero Touch Provisioning for clean initialization
Firmware Updates: Updates switch firmware to the version in the specified package directory using the golden configuration JSON for validation
System Validation: Comprehensive testing including NMX controller verification, compute node reboots, and compute tray validation

Switch Post RMA Workflow Components#

The Switch Post RMA workflow consists of several key steps that ensure replaced switch hardware is properly configured and validated:

Physical Replacement Procedures#

Switch Tray Removal: Detailed instructions for powering down the entire rack, cooling procedures, cable disconnection, and safe tray removal
Switch Tray Installation: Comprehensive installation guide covering rail migration, tray insertion, cable reconnection, and power-on sequence

Compute Node Management#

Adds maintenance tags to all compute nodes in the affected rack
Drains compute nodes from Slurm to prevent workload interference during switch replacement

BCM Inventory Update#

ONLY REQUIRED FOR NEW HARDWARE: Updates BCM inventory information using BREAKFIX_POST_RMA_UPDATE_SWITCH_BCM_INVENTORY when a new switch is installed
Skip this step for repaired switches as MAC addresses remain unchanged
Ensures BMC MAC and COMe MAC addresses are correctly registered in BCM (new MAC addresses are provided by the Enterprise Support team who manage asset inventory for customer deployments)
Enables proper management and monitoring of the replaced switch hardware

Switch Connectivity and Configuration#

Retrieves switch IP and credentials from BCM
Verifies SSH connectivity to the switch node
Updates ZTP settings in BCM with NVOS image file configuration
Ensures the switch is reachable for configuration operations

Factory Reset and ZTP#

Performs factory default reset on the switch
Monitors Zero Touch Provisioning (ZTP) status until successful completion
Creates support tickets if ZTP fails and exits workflow (if opted in to support ticket service)
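
Monitoring ZTP until completion is a polling loop with a timeout. The sketch below shows the general pattern only: the status strings, the timeout and interval defaults, and the `get_status` callable are assumptions, since this section does not describe how the runbook queries ZTP status.

```python
import time


def wait_for_ztp(get_status, timeout_s=1800, interval_s=30,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until ZTP succeeds; return False on failure or timeout."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "success":
            return True
        if status == "failed" or clock() >= deadline:
            return False  # the caller would create a support ticket and exit
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays.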

Switch BMC Credential Management#

Retrieves BMC credentials for NVSwitches from BCM
Establishes secure access to switch components for diagnostics
Collects system rack serial numbers for identification

Firmware Updates#

Upgrades switch firmware to the version in the specified package directory (SWITCH_FWPKG_DIR_PATH) using the golden configuration JSON (FW_SOURCE_JSON_PATH) for validation
Verifies switch connectivity after firmware updates

System Health and Validation#

Performs switch tray health checks
Verifies NMX-C and NMX-T controller status on the active switch node
Reboots all compute nodes in the rack to ensure proper connectivity
Waits for agent connectivity to confirm successful recovery
Validates there are no inactive NVLinks
Runs comprehensive compute tray validation using BREAKFIX_COMPUTE_TRAY_VALIDATION

Running the Switch Post RMA Workflow#

Step 1: Update Switch BCM Inventory (ONLY FOR NEW HARDWARE)#

Note: This step is ONLY required when installing a new switch. Skip this step for repaired switches as MAC addresses remain unchanged.

For new hardware replacement, update the BCM inventory with new MAC addresses:

Navigate to the “BREAKFIX_POST_RMA_UPDATE_SWITCH_BCM_INVENTORY” runbook in the Runbooks section
Configure the required parameters (MAC addresses are provided by the Enterprise Support team):
- SWITCH_HOSTNAME: The hostname of the replaced switch component
- BMC_MAC: BMC MAC Address from Asset File
- COMe_MAC_1: COMe MAC 1 Address From Asset File
- COMe_MAC_2: COMe MAC 2 Address From Asset File
Select “Create Run” to initiate the BCM inventory update
Monitor the execution progress to ensure successful completion

Step 2: Execute Main Switch Post RMA Workflow#

After successfully updating the BCM inventory (if required for new hardware), proceed with the main Switch Post RMA workflow:

Navigate to the “BREAKFIX_SWITCH_POST_RMA” runbook in the Runbooks section
Configure the required parameters:
- SWITCH_HOST_NAME (required): Input replaced switch host name.
- resource_tag (required): Tag for the resource query (use rack_name for rack filtering).
- resource_value (required): Value for the resource query; set to the rack scope (e.g., A01, or use regex/list format as supported by flex query).
- SWITCH_FWPKG_DIR_PATH (required): Full path to the directory containing all firmware packages for the switch node.
- NVOS_IMAGE_FILE_NAME (required): Latest NVOS image file name from the headnode under the path /cm/local/apps/cmd/etc/htdocs/switch/image. If the latest image is not found, copy the image file to this path and provide the file name.
- FW_SOURCE_JSON_PATH (required): Path to the golden configuration JSON that defines the reference settings used for validation.
- IGNORE_LIST (optional): List of nodes to exclude from scope (e.g., node01, or none).
- MESSAGE (optional): Message for support ticket creation if ZTP fails (default: AHR_BREAKFIX_NVSwitch_POST_RMA).
Configure the required secrets (if not already configured): AHR_API_ENDPOINT and AHR_TOKEN (used for agent connectivity checks and compute tray validation). Add or update them under Settings → Secrets in the Shoreline UI.
Select “Create Run” to initiate the workflow
Monitor the execution progress and results

GB200#

Automated Baseline Testing with NVIDIA Mission Control autonomous hardware recovery#

Introduction#

The NVIDIA Mission Control autonomous hardware recovery portal enables efficient automation of baseline testing procedures. This comprehensive testing framework can be flexibly executed across various scales of infrastructure, from individual compute nodes to complete racks or multiple rack configurations. The system performs extensive validation of critical compute node components, including CPU, GPU, memory, and storage functionality, network connectivity, and firmware versioning. Additionally, it incorporates industry-standard performance benchmarking tools such as HPL, NCCL, and Nemotron (Large Language Model) to assess system capabilities. This streamlined approach enhances both testing efficiency and thoroughness while reducing execution time.

Entrypoint of Automated Testing#

A centralized automated baseline testing interface has been established to facilitate streamlined test execution and management. This unified entry point provides comprehensive access to the testing framework, enabling efficient navigation and one-click implementation of all testing procedures.

To access the baseline testing interface:

Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.
Locate the “DGX_SUPERPOD_BASELINE_TESTING” runbook through the search functionality (see the interface shown below).
Upon selecting the “DGX_SUPERPOD_BASELINE_TESTING” runbook, you will be presented with the interface shown below. The subsequent sections of this documentation will provide detailed guidance on executing various testing procedures.

Run SRT (Single Rack Testing) Job#

This section describes how to start baseline testing in a single-rack configuration when one or more racks are ready for testing.

Navigate to the “DGX_SUPERPOD_BASELINE_TESTING” runbook as described above.
Ensure the “DGX_SUPERPOD_BASELINE_TESTING_SRT” component is activated by toggling its switch control, located on the right side of its group of interface icons. Note: When activated, the switch displays as a green circle with a play icon.
Verify that the “DGX_SUPERPOD_BASELINE_TESTING_MRT” component is deactivated. When deactivated, its switch displays as a grey circle containing a muted play symbol.
Select the “Save” button positioned in the top right corner of the interface to preserve your settings.
Upon successful completion of these preliminary steps, your runbook configuration should reflect the specified parameters as illustrated below.
To initiate a run, select the Create Run button located at the top right corner of the interface. A new window will appear as shown below. For detailed information about each parameter, simply hover over the info icon beside it.

You are required to provide the following inputs for the runbook:

Resource filtering (flex query): Runbooks use a flexible resource query instead of hardcoding a single rack. You specify:
- resource_tag (required): Tag for the flex resource query. Use rack_name for rack-based filtering (e.g., to run on a specific rack or set of racks).
- resource_value (required): Value for the flex resource query. Set to the same value you would have used for the rack name (e.g., B05, A01, m06|m07 for multiple racks), or use “none” if you do not want to filter by that tag. This allows you to target any resource—not just a single hardcoded rack.
- FW_SOURCE_JSON_PATH: Specify the file path to the golden configuration JSON file included in the firmware package. This file defines the reference settings used for validation. If the file is not available, set this parameter to NA.
- IGNORE_LIST: Provide a list of nodes to exclude from the test, only if required. Leave the value as “none” if no nodes need to be ignored. This parameter supports regular expressions. Here are some examples:
  - Single node: node01
  - List format: ["node01", "node02"]
  - Pipe-delimited string: node01|node02
After entering the correct resource_tag and resource_value (e.g., resource_tag=rack_name, resource_value=m06 or m06|m07), select the “Create Run” button to start the run. A confirmation dialog will appear with a “View Run” link, as illustrated below. Selecting this link opens a new page showing the job status and details. For more information about job monitoring and results, see the “Check Job Status and Result” section of this documentation.
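The three IGNORE_LIST forms listed above can all be read as one regular expression. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes a node is ignored when the pattern matches its full hostname. The portal's actual matching rules may differ.

```python
import json
import re


def ignore_list_to_pattern(value):
    """Turn an IGNORE_LIST value into a compiled regex, or None for "none".

    Hypothetical helper that only mirrors the three documented input forms.
    """
    text = value.strip()
    if text.lower() == "none":
        return None
    if text.startswith("["):
        # List format, e.g. ["node01", "node02"], becomes node01|node02.
        text = "|".join(json.loads(text))
    return re.compile(text)


def is_ignored(node, pattern):
    """Assumption: a node is ignored only when the whole hostname matches."""
    return pattern is not None and pattern.fullmatch(node) is not None


pattern = ignore_list_to_pattern('["node01", "node02"]')
nodes = ["node01", "node02", "node03"]
print([n for n in nodes if not is_ignored(n, pattern)])
```

With full-match semantics, the pattern node01 does not accidentally exclude a host named node011. If the product uses substring matching instead, a broader set of nodes would be excluded.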

IMPORTANT NOTES:

How to get the value for resource_value (e.g. rack_name):
1. Note that the value (e.g. rack_name) is generated and captured automatically from BCM Inventory. The steps below show how to look it up.
2. Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.
3. Click the “New Runbook” button at the top right. You will see the page shown below.
4. In the central page, click “Op Statement” to create your first cell to query the resource.
5. Type “host” in the cell as your first query and press Enter to see all the host information, as in the example below.
6. The value for the chosen tag (e.g. “rack_name”) is displayed. If it does not show up, click “Show Panel”, type the tag name (e.g. “rack_name”) in Search, and ensure it is selected.
The predefined timeout for SRT is 4 hours. You can adjust it based on your requirements.

Run MRT (Multi Rack Testing) Job#

This section describes how to start baseline testing for a multi-rack configuration.

Navigate to the “DGX_SUPERPOD_BASELINE_TESTING” runbook using the navigation steps described above.
Ensure the “DGX_SUPERPOD_BASELINE_TESTING_MRT” component is activated by toggling its switch control, located on the right side of its group of interface icons. Note: When activated, the switch displays as a green circle with a play icon.
Verify that the “DGX_SUPERPOD_BASELINE_TESTING_SRT” component is deactivated. A deactivated switch displays as a grey circle containing a muted play symbol, signifying that it is ‘Off’.
Select the “Save” button positioned in the top right corner of the interface to preserve your settings.
Upon successful completion of these preliminary steps, your runbook configuration should reflect the specified parameters as illustrated below.
Select the “Create Run” button in the top right corner of the interface. A new window will appear, as illustrated below.

The resource_tag and resource_value inputs define the resource filter. For rack-based runs, use resource_tag=rack_name and set resource_value to the rack designation. Both single-rack and multiple-rack values are accepted, as detailed below:

Single Rack Format:
- Standard notation: resource_value (Example: m06)
Multiple Rack Format:
- Standard notation: resource_value|resource_value (Example: m06|m07)
- Critical: No spaces are permitted between rack identifiers and the delimiter (|)
After entering the correct resource_tag and resource_value, select the “Create Run” button to start the run. A confirmation dialog will appear with a “View Run” link, as illustrated below. Selecting this link opens a new page showing the job status and details. For more information about job monitoring and results, see the “Check Job Status and Result” section of this documentation.
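The single-rack and multiple-rack formats described above are easy to check before a run is created. The sketch below is a hypothetical pre-flight helper, not part of the product. It assumes rack names consist of letters, digits, underscores, or hyphens, and it applies only to rack-based values.

```python
import re

# Assumption: rack names look like m06 or B05. Adjust the character class
# if your BCM inventory uses other characters.
RACK_VALUE = re.compile(r"[A-Za-z0-9_-]+(\|[A-Za-z0-9_-]+)*")


def parse_rack_value(resource_value):
    """Return the list of racks, or raise ValueError on a malformed value."""
    if not RACK_VALUE.fullmatch(resource_value):
        raise ValueError(f"malformed resource_value: {resource_value!r}")
    return resource_value.split("|")


print(parse_rack_value("m06|m07"))
```

A value such as "m06 | m07" is rejected because of the spaces around the delimiter, which matches the rule stated above.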

IMPORTANT NOTES:

How to get the value for resource_value (e.g. rack_name):
1. Note that the value (e.g. rack_name) is generated and captured automatically from BCM Inventory. The steps below show how to look it up.
2. Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.
3. Click the “New Runbook” button at the top right. You will see the page shown below.
4. In the central page, click “Op Statement” to create your first cell to query the resource.
5. Type “host” in the cell as your first query and press Enter to see all the host information, as in the example below.
6. The value for the chosen tag (e.g. “rack_name”) is displayed. If it does not show up, click “Show Panel”, type the tag name (e.g. “rack_name”) in Search, and ensure it is selected.
The predefined timeout for SRT is 4 hours. You can adjust it based on your requirements.

Runbook Configurations#

Before you start real jobs, use this section to check the runbook configurations.

Select “Runbook” from the left navigation panel, then use the search field on the right side of the page to find your runbook by name, as shown in the illustration below.

Below is the list of all Mission Control related runbooks, with their names and descriptions.

Category	Runbook Name	Description
	DGX_SUPERPOD_BASELINE_TESTING	EntryPoint Runbook
SRT	DGX_SUPERPOD_BASELINE_TESTING_SRT	EntryPoint Runbook of SRT
	SRT1	EntryPoint Runbook of all single node health checks
	SINGLENODE_HEALTHCHECK_GPU_CPU	Baseline health checks for GPU and CPU
	SINGLENODE_HEALTHCHECK_MEMORY_STORAGE	Baseline health checks for Memory and Storage
	SINGLENODE_HEALTHCHECK_NETWORK	Baseline health checks for Network
	SINGLENODE_HEALTHCHECK_SOFTWARE	Baseline health checks for installed software
	SINGLENODE_HEALTHCHECK_FIRMWARE	Baseline health checks for firmware
	SRT2	EntryPoint Runbook of component testing
	SR_MEMORY_BENCHPRESS	Benchmark testing for memory
	SR_CUDA_SAMPLES	Benchmark testing for CUDA
	SR_P2P_IPERF	Benchmark testing for pairwise ethernet interfaces
	HPL_TEST_SINGLE_NODE	HPL testing on single node separately
	HPL_TEST	HPL testing on the single rack
	SR_NVBANDWIDTH	Bandwidth testing running in single nvldomain
	NCCL_TEST	NCCL testing on the single rack
	SRT3	EntryPoint Runbook of burn-in performance testing
	HPL_TEST_BURN_IN	HPL testing on the single rack with long duration
MRT	DGX_SUPERPOD_BASELINE_TESTING_MRT	EntryPoint Runbook of MRT
	MRT1	EntryPoint Runbook of rack level connectivity testing
	MR_INFINIBAND_CHECK	InfiniBand connectivity check
	MRT2	EntryPoint Runbook of multi-rack performance testing
	MR_HPL_TEST	HPL testing across multiple racks
	MR_NCCL_TEST	NCCL testing across multiple racks
	MR_HPL_TEST_BURN_IN	HPL testing across multiple racks with long duration
	MR_NCCL_TEST_BURN_IN	NCCL testing across multiple racks with long duration
	MRT3	EntryPoint Runbook of cluster level testing
	Nemotron_15B	LLM testing with mocked data
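The nesting of these runbooks can be pictured as a small tree. The Python sketch below encodes the table as a nested dictionary. The grouping of runbooks under each stage is inferred from the row order of the table and is an assumption, and find_stage is a hypothetical helper.

```python
# Grouping inferred from the row order of the table (an assumption).
RUNBOOK_TREE = {
    "DGX_SUPERPOD_BASELINE_TESTING_SRT": {
        "SRT1": [
            "SINGLENODE_HEALTHCHECK_GPU_CPU",
            "SINGLENODE_HEALTHCHECK_MEMORY_STORAGE",
            "SINGLENODE_HEALTHCHECK_NETWORK",
            "SINGLENODE_HEALTHCHECK_SOFTWARE",
            "SINGLENODE_HEALTHCHECK_FIRMWARE",
        ],
        "SRT2": [
            "SR_MEMORY_BENCHPRESS",
            "SR_CUDA_SAMPLES",
            "SR_P2P_IPERF",
            "HPL_TEST_SINGLE_NODE",
            "HPL_TEST",
            "SR_NVBANDWIDTH",
            "NCCL_TEST",
        ],
        "SRT3": ["HPL_TEST_BURN_IN"],
    },
    "DGX_SUPERPOD_BASELINE_TESTING_MRT": {
        "MRT1": ["MR_INFINIBAND_CHECK"],
        "MRT2": [
            "MR_HPL_TEST",
            "MR_NCCL_TEST",
            "MR_HPL_TEST_BURN_IN",
            "MR_NCCL_TEST_BURN_IN",
        ],
        "MRT3": ["Nemotron_15B"],
    },
}


def find_stage(runbook):
    """Return (entry_point, stage) for a leaf runbook, or None if unknown."""
    for entry_point, stages in RUNBOOK_TREE.items():
        for stage, children in stages.items():
            if runbook in children:
                return entry_point, stage
    return None


print(find_stage("NCCL_TEST"))
```

A lookup like this helps when a failed cell names a leaf runbook and you want to know which stage report to open.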

Runbook Interface Guide#

When accessing the runbook as shown in the example below, please note these important configuration elements:

Central Workspace#

The main content area displays your resource queries, commands, scripts, or nested runbooks.

Each row represents an individual cell
Each cell includes a play button for isolated execution
Toggle switches allow you to enable/disable specific cells

Configuration Panel (Right Side)#

The right panel contains several critical configuration sections:

Parameters: Contains all required inputs for runbook execution
Triggers: Configure automated execution methods:
- Alarm triggers
- Time Trigger (cron jobs)
- Other integrations like AlertManager
Users: Manage permissions for who may run or edit the runbook
Settings: General runbook configuration options
… more runbook operations including:
- Clone functionality
- Export options
- Delete runbook
- etc.

Check Job Status and Result#

Once you’ve initiated the SRT or MRT using the steps above, there are two options to check the job status and results:

Click “View Run” in the confirmation dialog that appears right after you create the run, as described in the sections above. This opens the run page.
Navigate to the “Runbook” section in the left panel, then click the “Run” button located in the upper left corner of the page as shown below. Important: Make sure you’ve selected the correct range in the upper right corner before proceeding.

Note: Runbooks can be nested within other runbooks. When this occurs, you may see an “Execution succeeded - View Run” link after a cell completes. Clicking this link opens a detailed results page for the nested runbook execution.

The rest of this section covers the following topics.

Check Job Status: whether the job is Running, Completed, Aborted, Terminated, Timed Out, or Canceled.
Check Job Results: whether the job passed or failed, along with detailed logs.

Check Job Status#

At the top of each job execution, you will observe one of the following status indicators:

Status Types and Definitions#

Running: The job is currently executing. Progress is displayed as a percentage based on completed cells.
Completed: The job has finished execution. Note that completion status does not guarantee successful results. Please review the detailed output in the results section.
Aborted: The job terminated prematurely due to execution errors, such as cell syntax issues.
Terminated: The job was forcibly ended by the system.
Timed Out: The job exceeded its maximum allowed execution duration (the default timeout is 1 hour for a runbook and 1 minute for an action).
Canceled: The job was manually terminated by a user.

Check Job Results#

When the job status displays “Completed,” you may proceed to review the job results. If the job status shows otherwise, you may click into the job for more details and troubleshooting.
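The status values and the guidance above can be summarized in a few lines of Python. This is a sketch for illustration: the JobStatus enum and next_step function are hypothetical names, and treating Running as "wait" is an assumption rather than documented behavior.

```python
from enum import Enum


class JobStatus(Enum):
    RUNNING = "Running"
    COMPLETED = "Completed"
    ABORTED = "Aborted"
    TERMINATED = "Terminated"
    TIMED_OUT = "Timed Out"
    CANCELED = "Canceled"


def next_step(status):
    """Suggest what to do for a given job status."""
    if status is JobStatus.RUNNING:
        return "wait"  # assumption: nothing to review until the job ends
    if status is JobStatus.COMPLETED:
        # Completed does not mean passed, so results still need review.
        return "review results"
    return "troubleshoot"


for status in JobStatus:
    print(status.value, "->", next_step(status))
```

Note that Completed maps to "review results" rather than "done", because a completed job can still contain failed tests.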

Cell Structure Overview#

Each cell in the runbook contains three primary components:

Main Content Area: This section displays the executed script or command. On the right side of the cell, you’ll find several control icons. While most icons were detailed in the previous section, the “fx” icon is particularly valuable as it displays all parameters along with input/output values for the specific cell when hovering over it.
Execution Information Bar: Located in the middle of the cell, this light grey text line indicates the execution start time and duration of the operation.
Results Section: The bottom portion displays:

Exit code status
Execution location information
Complete command output (accessible by clicking the “Output” column contents)

Additional Features:

Configure output display preferences
Toggle the density
Download results in various formats using the download options menu

Streamlined Error Navigation: When a job contains numerous cells, manually checking for failures becomes inefficient. Use the “Error outline” feature in the middle panel to quickly locate problematic cells. Simply click any item in this list to automatically navigate to the corresponding failed cell.

Note: Reports are available for the major runbooks. For details, see “Reports of Testings”.

Handling Job Failures#

In the event of job or cell execution failures, the following remediation options are available:

Major Issue Resolution: Upon resolution of critical infrastructure issues (e.g., hardware replacement), a complete re-initialization of the SRT or MRT job is recommended.
Targeted Component Resolution: When specific components have been updated (e.g., firmware version upgrades), execute the relevant job or runbook within the existing session by selecting the “Run” button located in the upper-right interface section. This maintains all previously established parameters.
Individual Cell Correction: For isolated cell failures that have been addressed, execute the specific cell independently by selecting the play button on the right margin of the cell. Note: This option might make it difficult to locate the job or run from the reports panel (see the “Reports of Testings” section below).

Firmware checks#

NVIDIA Mission Control autonomous hardware recovery includes firmware checks that extract the current firmware versions of the trays and switches, and compare them with the expected versions specified in the Source of Truth (SOT) file. The SOT file includes the expected versions for all components such as OS, HMC, ConnectX, etc., and is prepopulated. The SOT file can be obtained from the NVIS team.

The runbook extracts the expected versions of all firmware components and compares the current versions against them. If a run of the firmware validation runbook reports versions as not found, re-run the runbook.
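The comparison described above amounts to a lookup per component. The Python sketch below illustrates the idea only. The flat component-to-version mapping, the version strings, and the check_firmware function are all hypothetical, and the real SOT file layout may differ.

```python
def check_firmware(expected, current):
    """Compare current firmware versions against the expected ones.

    Both arguments map a component name to a version string
    (a hypothetical layout, not the actual SOT schema).
    Returns {component: (status, expected_version, current_version)}.
    """
    report = {}
    for component, want in expected.items():
        have = current.get(component)
        if have is None:
            # Mirrors the "versions not found" case: re-run the check.
            report[component] = ("NOT_FOUND", want, None)
        elif have == want:
            report[component] = ("PASS", want, have)
        else:
            report[component] = ("FAIL", want, have)
    return report


# Made-up versions for illustration.
expected = {"HMC": "1.2.3", "ConnectX": "28.40.1000", "OS": "7.0"}
current = {"HMC": "1.2.3", "ConnectX": "28.39.2048"}

for component, row in check_firmware(expected, current).items():
    print(component, *row)
```

Keeping "not found" separate from "FAIL" makes it clear which components need a re-run and which need an actual firmware update.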

Updating the SOT file#

Place the SOT JSON file on the headnode and provide the complete file path as the input FW_SOURCE_JSON_PATH to the DGX_SUPERPOD_BASELINE_TESTING runbook. To update the file, simply replace it with the new file and update the path in the input parameter of the runbook accordingly.

Thresholds and Defaults#

The thresholds and default values for various tests are defined in the Golden Config File. The runbook picks the appropriate values and compares them against the values on the nodes. This includes defaults such as the number of GPUs and the expected results for benchmarking tests (such as SRT2).

Golden Config File#

The Golden Config File (referred to as the defaults.env file on the trays and control nodes) contains the expected values for all benchmark thresholds and other relevant settings. This file is distributed across all trays and loaded as environment variables, making its contents available during testing.
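To make the idea of "expected values loaded as environment variables" concrete, here is a minimal Python sketch. It assumes the file uses simple KEY=VALUE lines, and the variable names EXPECTED_GPU_COUNT and HPL_MIN_TFLOPS are invented for illustration. The real defaults.env contents are not shown in this document.

```python
def parse_env(text):
    """Parse KEY=VALUE lines into a dict (assumed file format)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


# Hypothetical entries; not taken from a real defaults.env file.
sample = """
# expected values
EXPECTED_GPU_COUNT=4
export HPL_MIN_TFLOPS="100.5"
"""

env = parse_env(sample)
observed_gpus = 4
status = "PASS" if observed_gpus == int(env["EXPECTED_GPU_COUNT"]) else "FAIL"
print(status)
```

A test script on a tray would read the same values from its process environment instead of parsing the file itself.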

Updating the Golden Config File#

To update the golden config values, edit config/<CHIP>/defaults.yaml (for example, config/B200/defaults.yaml or config/GB200/defaults.yaml). The chip-specific env files are generated automatically from these YAML files during deployment. Do not edit the generated files under Shoreline_files/generated/ directly.

Once the changes are made and saved, use the NVIDIA Mission Control autonomous hardware recovery Runbook Deployment section (from the NVIDIA Mission Control AHR installation documentation) to apply them via OpenTofu. This process will create a File object on NVIDIA Mission Control autonomous hardware recovery, which pushes the updated file automatically to all control and compute trays. Various tests, including firmware checks, prolog, and epilog checks, source the defaults.env file and utilize the expected values, now available as environment variables.

Reports of Testings#

NVIDIA Mission Control autonomous hardware recovery provides reports for baseline testing, reflecting the status of nodes (compute and switches) at each test stage. These reports help identify and troubleshoot root causes.

To access the NVIDIA Mission Control autonomous hardware recovery reports, click on Resources in the side menu, then select Reports.

The Landing page contains two tabs: Report Templates and Published Reports.

Report Templates provide templates for each stage of Baseline testing. These templates include bar graphs that display the PASS or FAIL status for different nodes during the tests. However, these templates are static and do not store any test data. This means that while you can view the templates, you cannot save or modify the test results within them.

To generate a report that records the test results along with timestamps, click Publish. This action will create a new report based on the template, which will capture the current status of the tests. The report will display PASS for tests that were successful and FAIL for those that did not pass, with each status reflecting the most up-to-date information. The report also includes links to additional reports at the top of the page. These links allow you to access the detailed results of the individual SRT and MRT tests that make up each stage, giving you a deeper insight into the performance and status of each test within the overall baseline testing process.

Note: Reports reflect updated information only after the SRT and MRT tests have been executed.

Guide to initiate Reporting for Baseline Testing#

Navigate to the “DGX_SUPERPOD_BASELINE_TESTING_REPORT” under the Report Templates.
Optional: Templates do not store test data, but you can use one to preview the current state before publishing. Click the refresh button at the top of the report to load the current data.
Click on “Publish” to generate a timestamped instance of the template that can be easily viewed and shared. Additionally, it automatically publishes all linked reports at the top of the page, ensuring that all related data is included and accessible.
Retain the auto-generated name for the report which includes the timestamp or provide your own name.
Once the process starts, the published reports will begin generating in the background. This includes “DGX_SUPERPOD_BASELINE_TESTING_REPORTS” as well as all the SRT and MRT linked reports.
A pop-up notification will appear containing a hyperlink to the published reports that are currently being generated.
When you publish the “DGX_SUPERPOD_BASELINE_TESTING_REPORTS”, it automatically triggers the publication of all “linked reports”, including the associated SRT and MRT reports, with the data captured at that given time.
All the reports will complete building in under a minute, and the “Linked Published Reports” will include the published reports for all the Linked Reports.

Navigating to Previously Published Reports#

Navigate to Reports from the NVIDIA Mission Control autonomous hardware recovery home page.
Click on “Published Reports”, and select the “DGX_SUPERPOD_BASELINE_TESTING_REPORT_timestamp” or any other report of interest.
You can also adjust the time range at the top of the screen to view reports generated within specific time frames. Options include viewing reports from the last 10 minutes, last 1 hour, or selecting a custom time range to explore older data beyond these periods.

Breakdown of a Published Report#

DGX_SUPERPOD_BASELINE_TESTING_REPORT#

“DGX_SUPERPOD_BASELINE_TESTING_REPORT” is the entry point for the Baseline Testing reports. It provides a comprehensive overview of all the SRT/MRT tests for the compute nodes.
Each stage is represented by a separate cell, displaying the results for that specific SRT or MRT test.
To perform a deeper analysis or understand the tests in each stage, click on the relevant report listed under “Linked Published Reports” at the top of the page.
Similarly, all the reports for each stage are available. You can either find the report of interest under Published Reports, or reach it from the parent report (DGX_SUPERPOD_BASELINE_TESTING_REPORT_<timestamp>).

Understanding the Published Reports#

The reports align with the SRT and MRT tests. Each cell in “DGX_SUPERPOD_BASELINE_TESTING_REPORT” represents a specific testing stage, such as SRT1 or SRT2. Within each of these “Linked Reports”, the cells represent individual tests such as Singlenode Healthchecks, HPL, and NCCL. The layout of the graph for each of the cells is organized as follows:

The Y-axis displays the Rack name, allowing you to quickly identify the location of each resource within the cluster.
The X-axis represents the number of trays within each rack.

For example, in the following visualization, each bar indicates the number of trays within a rack. In this case, there are 2 trays per rack, as denoted by the number displayed on the bars. This provides a clear view of the test progress for each tray across different racks. Note: For NVL72, the report shows 18 trays per rack.

You can click on any bar in the graph to view a detailed list of the resources that passed or failed the tests within that specific bar. This allows you to drill down and see the test results for each tray in a particular rack or test stage.
Alternatively, you can click on the legend to filter and display all the passed or failed resources across the entire cluster, providing a comprehensive view of the overall test status.
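The per-rack bars described above are tallies of PASS and FAIL results grouped by rack. The Python sketch below reproduces that grouping on made-up data. The rack and tray names are hypothetical, and count_by_rack is not a product API.

```python
from collections import Counter

# Made-up results: (rack name, tray name, status).
results = [
    ("m06", "tray01", "PASS"),
    ("m06", "tray02", "FAIL"),
    ("m07", "tray01", "PASS"),
    ("m07", "tray02", "PASS"),
]


def count_by_rack(results):
    """Return {rack: Counter({'PASS': n, 'FAIL': m})}."""
    counts = {}
    for rack, _tray, status in results:
        counts.setdefault(rack, Counter())[status] += 1
    return counts


def failed_trays(results, rack):
    """List the trays that failed in one rack (the drill-down view)."""
    return [tray for r, tray, status in results if r == rack and status == "FAIL"]


for rack, counter in count_by_rack(results).items():
    print(rack, dict(counter))
print(failed_trays(results, "m06"))
```

Here count_by_rack corresponds to the bar heights, and failed_trays corresponds to clicking a bar to list the resources behind it.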

Root Cause Analysis#

The report for each stage shows the trays that have passed and failed the test. To understand the actual issue and access the logs, you can follow these steps:

Navigate to the report with PASS/FAIL values for which you want to conduct an RCA. In this example, consider the SRT1 report, where there are a few failed tests for SINGLENODE_HEALTHCHECK_GPU_CPU.
Navigate to the report corresponding to SINGLENODE_HEALTHCHECK_GPU_CPU and scroll to the tests that are failing.
In this example, GPU VBIOS Version has failed for all the trays. Click on the bar for one of the racks to list the resources on which the test failed.
Click on the status (FAIL), which links directly to the runbook run where these tests failed. You can do the same for tests that passed by clicking on the PASS status.
The errors in the run are outlined on the left of the screen. Navigate to the test being debugged (VBIOS). Alternatively, you can scroll through the run to find the failed tests.
Click on “Command filter excluded x/y resources” to get detailed output for each resource.
Click on the Output to view the detailed logs.

Firmware Reports#

For Firmware Checks, NVIDIA Mission Control autonomous hardware recovery offers a tabular report detailing the status of each node. The report includes:

Pass/Fail Status: Indicates whether the firmware check for each node passed or failed.
Expected Version: Shows the firmware version that was expected for the node.
Current Version: Displays the actual firmware version currently installed on the node.

Navigating to Tabular Report for Firmware Checks#

You can access the SINGLENODE_HEALTHCHECK_FIRMWARE_<timestamp> report either through the Runs (Check Job Status and Result) section or by navigating to the Runs History (see Root Cause Analysis).
Scroll to the last cell of the execution and click on the output of the cell.
You can scroll horizontally and vertically to view the complete output. Additionally, you have the option to download the cell output for further analysis.
You can also navigate to the Reports section and view the tabular output by clicking on the status bar. This includes both the expected firmware version and the current firmware version.

Resource Dashboard#

The NVIDIA Mission Control autonomous hardware recovery dashboard provides a comprehensive view of the status of all resources in the cluster, displaying the progress of testing at various stages for each node (control, compute, and switch nodes). It shows which tests are completed, which have passed, and which have failed. This allows users to track the overall health and status of the cluster, identify any issues, and assess the readiness of each resource.

Navigating to the Resource Dashboard#

Visit the NVIDIA Mission Control autonomous hardware recovery homepage, navigate to the ‘Resources’ menu, and click on ‘Dashboards’ to access the dashboard page.
The Landing page contains two tabs: Dashboards and Dashboard Views.
The Dashboards tab lists all available dashboards. These dashboards reflect the current status of the cluster and of the tests.
A Dashboard View is a snapshot of the Dashboard that captures the data at a specific point in time. This snapshot remains static, preserving the exact state and data of the Dashboard as it appeared at that moment, regardless of any future changes or updates to the live data.

DGX_SUPERPOD_BASELINE_TESTING_DASHBOARD#

The DGX_SUPERPOD_BASELINE_TESTING_DASHBOARD provides a detailed view of the cluster’s resources and the status of the baseline tests. Specifically, it tracks the progress of two key testing phases for each resource: SRT and MRT.

Resource Status: The dashboard displays all resources within the cluster, such as compute nodes, control nodes, and switch nodes, along with their associated status.
Hostname & Rack Information: For each resource, you will see the hostname and rack name, along with a Tag sequence that indicates the stage of both the SRT and MRT tests.
Test Stage Progress: The rows in the dashboard reflect the current status of each test stage. Each stage (SRT and MRT) has a corresponding tag name that visually represents its test progress, showing whether the stage has been successfully completed, is in progress, or has failed.
Snapshot of Cluster Health: The dashboard offers a comprehensive snapshot of the cluster’s readiness. It allows users to identify potential issues early, track the completion status of tests, and quickly assess which resources are ready and which ones may require attention.

The dashboard allows you to sort the resources by Name, Rack, or Progress Bar, making it easier to organize and view the status of your cluster based on your preferred criteria.

Name: Sort resources alphabetically by their hostname for quick access.
Rack: Sort resources based on their rack assignment, ideal for organizing by physical location.
Progress Bar: Sort by test progress to focus on resources at different stages of testing or to identify incomplete tasks.

Creating a View / Snapshot#

To create a snapshot of the dashboard, follow these steps.

Click on the “Create View” button located at the top right corner of your screen.
Retain the auto-generated name for the dashboard which includes the timestamp or provide your own name.
Once the Dashboard View is created, a pop-up notification will appear with a hyperlink to access the Dashboard View.
The resulting Dashboard View can be downloaded as a CSV file for further processing, such as generating reports.
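Once a Dashboard View has been exported as CSV, it can be processed with standard tools. The Python sketch below filters rows whose progress column contains FAIL. The column names (name, rack, progress) and the tag values are assumptions made for illustration. Check the header of your own export before adapting it.

```python
import csv
import io

# Stand-in for a downloaded file; columns and values are hypothetical.
sample_csv = """name,rack,progress
node01,m06,SRT1_PASS
node02,m06,SRT1_FAIL
node03,m07,SRT1_PASS
"""


def failed_rows(csv_text, column="progress"):
    """Return the rows whose given column contains the text FAIL."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if "FAIL" in row[column]]


for row in failed_rows(sample_csv):
    print(row["name"], row["rack"], row["progress"])
```

For a real export, replace the sample string with the contents of the downloaded file, for example by reading it with open(path, newline="").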

Automated Health Checks with NVIDIA Mission Control autonomous hardware recovery#

Introduction#

NVIDIA Mission Control autonomous hardware recovery provides a full suite of automated health checks to detect failures at the tray, rack, and system levels. In addition, system wide health checks are performed by integrating with the UFM and NMX-M network control planes. Health check data is reported back to BCM’s BaseView and/or the in-cluster LGTM stack. These health checks are performed at two layers: BCM job invocation, and as periodic health checks via NVIDIA Mission Control autonomous hardware recovery.

Alarms Dashboard#

The alarms dashboard is an overview of all alarms and the state of your system. In this view, alarms are summarized by counts of alarms firing, alarms firing most frequently, and a configurable list of most frequently firing, canceled, or resolved alarms. This is meant to be a starting point for any investigations of possible issues with your systems, and you may click any alarm for further details.

BCM Slurm Job Lifecycle Checks (Prolog and Epilog)#

When a Slurm job is submitted, the Autonomous Hardware Recovery Agent automatically runs a set of checks at the start and end of the job to validate node health and stability. These are known as Prolog and Epilog checks.

Prolog Checks (run before the job starts):
- If a check fails, the node is marked as DRAIN, and the job is re-queued.
- If it passes, the job proceeds normally.
Epilog Checks (run after the job finishes):
- If a check fails, the node is also marked as DRAIN.

These scripts are automatically pushed to each node when the job runs, but they are not visible or configurable through the NVIDIA Autonomous Hardware Recovery UI. To review them, navigate to Shoreline_files/scripts/slurm in the NVIDIA Mission Control package.

Note: Prolog and Epilog checks are disabled by default and should only be enabled after the nodes are confirmed to be healthy. Use the following runbooks to manage them:

SLURM_CHECKS_ENABLE – enables the checks
SLURM_CHECKS_DISABLE – disables the checks

Unlike the Prolog and Epilog checks, Periodic Checks are defined within the NVIDIA Mission Control autonomous hardware recovery interface as Alarms, and are detailed in the next section. They are automatically enabled for racks that pass Single Rack Testing, but may also be manually enabled or disabled for specific racks by running the “ALARMS_ENABLE” and “ALARMS_DISABLE” runbooks.

Periodic Health Checks (Alarms)#

Periodic Checks are separate from Prolog and Epilog and run at regular intervals to monitor system health. These are managed as Alarms in the NVIDIA Autonomous Hardware Recovery UI.

Automatically enabled for racks that pass Single Rack Testing
Automatically disabled during firmware upgrade and Break/fix
Can be manually enabled or disabled at any time

The alarm_base is the Resource Query that the alarms run against. To control the alarms manually, use the following runbooks:

ALARMS_ENABLE – enables periodic alarms for selected racks. Note that a node having the maintenance tag will override these settings.
ALARMS_DISABLE – disables them