Connecting to NVIDIA Mission Control autonomous hardware recovery#

NVIDIA Mission Control autonomous hardware recovery relies on BCM’s LDAP authentication, so users log in to NVIDIA Mission Control autonomous hardware recovery with their BCM credentials.

Upon successful authentication, the user session receives a short-lived JWT and a refresh token. The JWT is used for identity, and the UI refreshes it when it expires.

An administrator can opt to manage users and groups using BCM. Any changes are automatically reflected within NVIDIA Mission Control autonomous hardware recovery.

There are two ways to access NVIDIA Mission Control autonomous hardware recovery. The first is to authenticate using SSO with your BCM identity. The second is to go to the URL for the NVIDIA Mission Control autonomous hardware recovery UI, where you are presented with a login screen.

Login screen

_images/image7.png

Once you click the login button, the authentication page appears where you can enter credentials.

_images/image8.png

Main landing page

_images/image9.png

Users are authorized to perform different operations within NVIDIA Mission Control autonomous hardware recovery by configuring permission policies. Policies determine if the user can view resources and execute actions (named and/or anonymous). Action execution can be limited to a maximum number of impacted resources and / or specific resources. Permissions can also be attached to runbooks to allow / disallow certain users or groups. Please refer to documentation on Access Control for additional details.

New users must first be created in BCM before they can access NVIDIA Mission Control autonomous hardware recovery. Every user is assigned a default permission policy upon logging in. The permissions of this policy are determined by the administrator. The administrator role has permission to perform any action in NVIDIA Mission Control autonomous hardware recovery. The configurator role has permission to create, edit, and delete any artifacts in NVIDIA Mission Control autonomous hardware recovery. By default, new users are granted the administer and configure roles until the privileges are overridden. Defaults can be modified in the Access Control section of the NVIDIA Mission Control autonomous hardware recovery UI.

GB300#

Automated Baseline Testing with NVIDIA Mission Control autonomous hardware recovery#

Introduction#

The NVIDIA Mission Control autonomous hardware recovery portal automates baseline testing procedures. The testing framework can be executed at various scales of infrastructure, from individual compute nodes to complete racks or multiple-rack configurations. It validates critical compute node components, including CPU, GPU, memory, and storage functionality, network connectivity, and firmware versioning. It also incorporates industry-standard performance benchmarking tools such as HPL, NCCL, and Nemotron (a large language model) to assess system capabilities. This approach makes testing more efficient and thorough while reducing execution time.

Baseline testing: Single Rack or Multi Rack test mode, then status, firmware checks, resource dashboard

Entrypoint of Automated Testing#

A centralized automated baseline testing interface facilitates test execution and management. This unified entry point provides access to the entire testing framework, enabling efficient navigation and one-click execution of all testing procedures.

To access the baseline testing interface:

  • Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.

  • Locate the “DGX_SUPERPOD_BASELINE_TESTING” runbook using the search functionality (see the interface shown below).

  • Upon selecting the “DGX_SUPERPOD_BASELINE_TESTING” runbook, you will be presented with the following interface. The subsequent sections of this documentation will provide detailed guidance on executing various testing procedures.

Run SRT (Single Rack Testing) Job#

This section describes how to start baseline testing for a single-rack configuration when one or more racks are ready for testing.

  • Navigate to the “DGX_SUPERPOD_BASELINE_TESTING” runbook utilizing the previously outlined navigation protocol.

  • Ensure the “DGX_SUPERPOD_BASELINE_TESTING_SRT” component is activated by toggling its switch control. This control is located on the right side of its immediate group of interface icons. The switch indicator in its default deactivated state should be visually distinct from the activated state. Note: When activated, the switch indicator displays as a green circle with a play icon.

  • Verify that the “DGX_SUPERPOD_BASELINE_TESTING_MRT” component is in its deactivated state. This deactivated state is indicated by its corresponding switch control displaying as a grey circular icon containing a muted play symbol, signifying it is ‘Off’.

  • Select the “Save” button positioned in the top right corner of the interface to preserve your settings.

  • Upon successful completion of these preliminary steps, your runbook configuration should reflect the specified parameters as illustrated below.

    _images/image13.png
  • To initiate a run, select the Create Run button located at the top right corner of the interface. A new window will appear as shown below. For detailed information about each parameter, simply hover over the info icon beside it.

    _images/image14.png

    You are required to provide the following inputs for the runbook:

    Resource filtering (flex query): Runbooks use a flexible resource query instead of hardcoding a single rack. You specify:

    • resource_tag (required): Tag for the flex resource query. Use rack_name for rack-based filtering (e.g., to run on a specific rack or set of racks).

    • resource_value (required): Value for the flex resource query. Set to the same value you would have used for the rack name (e.g., B05, A01, m06|m07 for multiple racks), or use “none” if you do not want to filter by that tag. This allows you to target any resource—not just a single hardcoded rack.

    • FW_SOURCE_JSON_PATH: Specify the file path to the golden configuration JSON file included in the firmware package. This file defines the reference settings used for validation. If the file is not available, set this parameter to NA.

    • IGNORE_LIST: Provide a list of nodes to exclude from the test, only if required. Leave the value as “none” if no nodes need to be ignored. This parameter supports regular expressions. Here are some examples:

      • Single node: node01

      • List format: ["node01", "node02"]

      • Pipe-delimited string: node01|node02

  • After entering the correct resource_tag and resource_value (e.g., resource_tag=rack_name, resource_value=m06 or m06|m07), select the “Create Run” button to initiate the process. A confirmation dialog will appear with a “View Run” link illustrated below. Selecting this link will redirect you to a new page displaying comprehensive job status and details. For additional information regarding job monitoring and results, please refer to the “Check Job Status and Result” section of this documentation.

    _images/image15.png
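The three IGNORE_LIST forms listed above (single node, list, pipe-delimited string) can all be treated as one regular expression. The following is a minimal sketch of that idea. The function names are hypothetical, and the choice to match whole node names is an assumption; the runbook's actual parsing is not documented here.

```python
import json
import re


def ignore_pattern(ignore_list: str):
    """Turn an IGNORE_LIST value into a compiled regex, or None for "none"."""
    value = ignore_list.strip()
    if value.lower() == "none":
        return None
    if value.startswith("["):
        # List format, e.g. ["node01", "node02"]; join entries with "|".
        value = "|".join(json.loads(value))
    return re.compile(f"(?:{value})")


def filter_nodes(nodes, ignore_list):
    """Return the nodes that are not excluded by IGNORE_LIST."""
    pattern = ignore_pattern(ignore_list)
    if pattern is None:
        return list(nodes)
    # Assumption: an entry must match the whole node name to exclude it.
    return [n for n in nodes if not pattern.fullmatch(n)]
```

With this sketch, `node01|node02`, `["node01", "node02"]`, and `node0[12]` all exclude the same two nodes.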

    IMPORTANT NOTES:

  1. How to get the value for resource_value (e.g., rack_name):

    1. The value (e.g., rack_name) is generated and captured automatically from BCM Inventory. Follow the steps below to find it.

    2. Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.

      _images/image16.png
    3. Click the “New Runbook” button at the top right. The following page appears.

      _images/image17.png
    4. In the center of the page, click “Op Statement” to create your first cell for querying resources.

    5. Type “host” in the cell as your first query and press Enter to display all host information, as in the example below.

      _images/image18.png
    6. The value for the chosen tag (e.g., “rack_name”) is displayed. If it does not appear, click “Show Panel”, type the tag name (e.g., “rack_name”) in Search, and ensure it is selected.

      _images/image19.png
  2. The predefined timeout for SRT is 4 hours. You can adjust it based on your requirements.

Run MRT (Multi Rack Testing) Job#

This section describes how to start baseline testing for a multi-rack configuration.

  • Navigate to the “DGX_SUPERPOD_BASELINE_TESTING” runbook utilizing the previously outlined navigation protocol.

  • Ensure the “DGX_SUPERPOD_BASELINE_TESTING_MRT” component is activated by toggling the switch control. This control is located on the right side of its immediate group of interface icons. The switch indicator in its default deactivated state should be visually distinct from the activated state. Note: When activated, the switch indicator displays as a green circle with a play icon.

  • Verify that the “DGX_SUPERPOD_BASELINE_TESTING_SRT” component is in its deactivated state. This deactivated state is indicated by its corresponding switch control displaying as a grey circular icon containing a muted play symbol, signifying it is ‘Off’.

  • Select the “Save” button positioned in the top right corner of the interface to preserve your settings.

  • Upon successful completion of these preliminary steps, your runbook configuration should reflect the specified parameters as illustrated below.

    _images/image20.png
  • Select the “Create Run” button positioned in the top right corner of the interface. A new window appears, as illustrated below.

    _images/image14.png

    The identifier represents the resource filter. For rack-based runs, use resource_tag=rack_name and resource_value set to the rack designation. The system accommodates both single and multiple rack configurations, as detailed below:

    Single Rack Format:

    • Standard notation: resource_value (Example: m06)

    Multiple Rack Format:

    • Standard notation: resource_value|resource_value (Example: m06|m07)

    • Critical: No spaces are permitted between rack identifiers and the delimiter (|)

  • After entering the correct resource_tag and resource_value, select the “Create Run” button to initiate the process. A confirmation dialog will appear with a “View Run” link illustrated below. Selecting this link will redirect you to a new page displaying comprehensive job status and details. For additional information regarding job monitoring and results, please refer to the “Check Job Status and Result” section of this documentation.

    _images/image15.png
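Because spaces around the `|` delimiter are not permitted, it can help to validate a resource_value before creating a run. The following is a minimal sketch of such a check. The function name and the set of characters allowed in a rack identifier are assumptions made for illustration.

```python
import re

# One or more rack identifiers separated by "|" with no surrounding spaces.
# The allowed identifier characters are an assumption for this sketch.
_RESOURCE_VALUE = re.compile(r"[A-Za-z0-9_-]+(?:\|[A-Za-z0-9_-]+)*")


def parse_resource_value(value: str) -> list:
    """Split a resource_value such as "m06|m07" into rack identifiers.

    Raises ValueError when the value contains spaces or empty entries.
    """
    if not _RESOURCE_VALUE.fullmatch(value):
        raise ValueError(f"invalid resource_value: {value!r}")
    return value.split("|")
```

For example, `parse_resource_value("m06|m07")` yields the two rack identifiers, while `"m06 | m07"` is rejected.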

IMPORTANT NOTES:

  1. How to get the value for resource_value (e.g., rack_name):

    1. The value (e.g., rack_name) is generated and captured automatically from BCM Inventory. Follow the steps below to find it.

    2. Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.

      _images/image16.png
    3. Click the “New Runbook” button at the top right. The following page appears.

      _images/image17.png
    4. In the center of the page, click “Op Statement” to create your first cell for querying resources.

    5. Type “host” in the cell as your first query and press Enter to display all host information, as in the example below.

      _images/image18.png
    6. The value for the chosen tag (e.g., “rack_name”) is displayed. If it does not appear, click “Show Panel”, type the tag name (e.g., “rack_name”) in Search, and ensure it is selected.

      _images/image19.png
  2. The predefined timeout for SRT is 4 hours. You can adjust it based on your requirements.

Runbook Configurations#

Before you start real jobs, review the runbook configurations as described in this section.

Select “Runbook” from the left navigation panel, then use the search field on the right side of the page to find your runbook by name, as shown in the illustration below.

_images/image21.png

The following table lists all Mission Control related runbooks, including their names and descriptions.

| Category | Runbook Name | Description |
|----------|--------------|-------------|
|  | DGX_SUPERPOD_BASELINE_TESTING | EntryPoint Runbook |
| SRT | DGX_SUPERPOD_BASELINE_TESTING_SRT | EntryPoint Runbook of SRT |
|  | SRT1 | EntryPoint Runbook of all single node health checks |
|  | SINGLENODE_HEALTHCHECK_GPU_CPU | Baseline health checks for GPU and CPU |
|  | SINGLENODE_HEALTHCHECK_MEMORY_STORAGE | Baseline health checks for Memory and Storage |
|  | SINGLENODE_HEALTHCHECK_NETWORK | Baseline health checks for Network |
|  | SINGLENODE_HEALTHCHECK_SOFTWARE | Baseline health checks for installed software |
|  | SINGLENODE_HEALTHCHECK_FIRMWARE | Baseline health checks for firmware |
|  | SRT2 | EntryPoint Runbook of component testing |
|  | SR_MEMORY_BENCHPRESS | Benchmark testing for memory |
|  | SR_CUDA_SAMPLES | Benchmark testing for CUDA |
|  | SR_P2P_IPERF | Benchmark testing for pairwise ethernet interfaces |
|  | HPL_MXP_TEST_SINGLE_NODE_MPIRUN | HPL_MXP testing on single node separately |
|  | HPL_MXP_TEST_MPIRUN | HPL_MXP testing on the single rack |
|  | SR_NVBANDWIDTH | Bandwidth testing running in single nvldomain |
|  | NCCL_TEST | NCCL testing on the single rack |
|  | SRT3 | EntryPoint Runbook of burn-in performance testing |
|  | HPL_MXP_TEST_BURN_IN_MPIRUN | HPL_MXP testing on the single rack with long duration |
| MRT | DGX_SUPERPOD_BASELINE_TESTING_MRT | EntryPoint Runbook of MRT |
|  | MRT1 | EntryPoint Runbook of rack level connectivity testing |
|  | MR_INFINIBAND_CHECK_UFM | InfiniBand connectivity check via UFM |
|  | IB_PERF_TEST_SINGLE_NODE | InfiniBand performance test on single node |
|  | MRT2 | EntryPoint Runbook of multi-rack performance testing |
|  | MR_HPL_TEST | HPL testing cross multiple racks |
|  | MR_NCCL_TEST | NCCL testing cross multiple racks |
|  | MR_HPL_TEST_BURN_IN | HPL testing cross multiple racks with long duration |
|  | MR_NCCL_TEST_BURN_IN | NCCL testing cross multiple racks with long duration |
|  | MRT3 | EntryPoint Runbook of cluster level testing |
|  | Nemotron_15B | LLM testing with mocked data |

Runbook Interface Guide#

When accessing the runbook as shown in the example below, please note these important configuration elements:

_images/image20.png
Central Workspace#

The main content area displays your resource queries, commands, scripts, or nested runbooks.

  • Each row represents an individual cell

  • Each cell includes a play button for isolated execution

  • Toggle switches allow you to enable/disable specific cells

Configuration Panel (Right Side)#

The right panel contains several critical configuration sections:

  1. Parameters: Contains all required inputs for runbook execution.

  2. Triggers: Configure automated execution methods:

    • Alarm triggers

    • Time Trigger (cron jobs)

    • Other integrations like AlertManager

  3. Users: Manage permissions for who may run or edit the runbook.

  4. Settings: General runbook configuration options.

  5. More runbook operations, including:

    • Clone functionality

    • Export options

    • Delete runbook

    • etc.

Check Job Status and Result#

Once you’ve initiated the SRT or MRT using the steps above, there are two options to check the job status and results:

  1. Click “View Run” immediately after creating the run, as described in the sections above. This redirects you to the run page.

    _images/image15.png
  2. Navigate to the “Runbook” section in the left panel, then click the “Run” button located in the upper left corner of the page as shown below. Important: Make sure you’ve selected the correct range in the upper right corner before proceeding.

    _images/image25.png

    Note: Runbooks can be nested within other runbooks. When this occurs, you may see an “Execution succeeded - View Run” link after a cell completes. Clicking this link redirects you to a detailed results page for the nested runbook execution.

    _images/image26.png

The rest of this section covers the following topics:

  1. Check Job Status: whether the job is Running, Completed, Aborted, Terminated, Timed Out, or Canceled.

  2. Check Job Results: whether the job passed or failed, along with detailed logs.

Check Job Status#

At the top of each job execution, you will observe one of the following status indicators:

_images/image27.png
Status Types and Definitions#
  1. Running: The job is currently executing. Progress is displayed as a percentage based on completed cells.

  2. Completed: The job has finished execution. Note that completion status does not guarantee successful results. Please review the detailed output in the results section.

  3. Aborted: The job terminated prematurely due to execution errors, such as cell syntax issues.

  4. Terminated: The job was forcibly ended by the system.

  5. Timed Out: The job exceeded its maximum allowed execution duration (the default timeout is 1 hour for a runbook and 1 minute for an action).

  6. Canceled: The job was manually terminated by a user.
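The status definitions above can be summarized as a small lookup. The sketch below is illustrative only: the `RunStatus` enum and the `next_step` helper are hypothetical names, and the suggested actions simply restate the guidance in this section.

```python
from enum import Enum


class RunStatus(Enum):
    RUNNING = "Running"
    COMPLETED = "Completed"
    ABORTED = "Aborted"
    TERMINATED = "Terminated"
    TIMED_OUT = "Timed Out"
    CANCELED = "Canceled"


def next_step(status: RunStatus) -> str:
    """Suggest what an operator might do for a given run status."""
    if status is RunStatus.RUNNING:
        return "wait"
    if status is RunStatus.COMPLETED:
        # Completed does not imply success; cell results still need review.
        return "review results"
    # Aborted, Terminated, Timed Out, and Canceled runs need investigation.
    return "troubleshoot"
```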

Check Job Results#

When the job status displays “Completed,” you may proceed to review the job results. If the job status shows otherwise, you may click into the job for more details and troubleshooting.

Cell Structure Overview#

Each cell in the runbook contains three primary components:

_images/image28.png
  1. Main Content Area: This section displays the executed script or command. On the right side of the cell, you’ll find several control icons. While most icons were detailed in the previous section, the “fx” icon is particularly valuable as it displays all parameters along with input/output values for the specific cell when hovering over it.

    _images/image29.png
  2. Execution Information Bar: Located in the middle of the cell, this light grey text line indicates the execution start time and duration of the operation.

  3. Results Section: The bottom portion displays:

  • Exit code status

  • Execution location information

  • Complete command output (accessible by clicking the “Output” column contents)

Additional Features:

  • Configure output display preferences

  • Toggle the density

  • Download results in various formats using the download options menu

Streamlined Error Navigation: When a job contains numerous cells, manually checking for failures becomes inefficient. Use the “Error outline” feature in the middle panel to quickly locate problematic cells. Simply click any item in this list to automatically navigate to the corresponding failed cell.

_images/image30.png

Note: Reports are available for the major runbooks. For details, see the “Reports of Testings” section.

Handling Job Failures#

In the event of job or cell execution failures, the following remediation options are available:

  • Major Issue Resolution: Upon resolution of critical infrastructure issues (e.g., hardware replacement), a complete re-initialization of the SRT or MRT job is recommended.

  • Targeted Component Resolution: When specific components have been updated (e.g., firmware version upgrades), execute the relevant job or runbook within the existing session by selecting the “Run” button located in the upper-right interface section. This maintains all previously established parameters.

  • Individual Cell Correction: For isolated cell failures that have been addressed, execute the specific cell independently by activating the execution control (play button) positioned on the right margin of the cell interface. Note: This option might make it difficult to locate the job/run from the reports panel (refer to the “Reports of Testings” section below).

    _images/image30.png

Firmware checks#

NVIDIA Mission Control autonomous hardware recovery includes firmware checks that extract the current firmware versions of the trays and switches, and compare them with the expected versions specified in the Source of Truth (SOT) file. The SOT file includes the expected versions for all components such as OS, HMC, ConnectX, etc., and is prepopulated. The SOT file can be obtained from the NVIS team.
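The comparison of current firmware versions against the expected versions can be sketched as a simple dictionary diff. This is a minimal illustration, not the product's implementation: the function name is hypothetical, and the flat component-to-version mapping is an assumed shape, since the actual SOT file format is not described here.

```python
def firmware_mismatches(current: dict, expected: dict) -> dict:
    """Return {component: (current, expected)} for every version that differs.

    Components missing from `current` are reported with a current value of None.
    """
    return {
        component: (current.get(component), version)
        for component, version in expected.items()
        if current.get(component) != version
    }
```

An empty result means every component listed in the expected versions matches what was found on the system.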

Reports of Testings#

NVIDIA Mission Control autonomous hardware recovery provides reports for baseline testing, reflecting the status of nodes (compute and switches) at each test stage. These reports help identify and troubleshoot root causes.

To access the NVIDIA Mission Control autonomous hardware recovery reports, click on Resources in the side menu, then select Reports.

The Landing page contains two tabs: Report Templates and Published Reports.

Report Templates provide templates for each stage of Baseline testing. These templates include bar graphs that display the PASS or FAIL status for different nodes during the tests. However, these templates are static and do not store any test data. This means that while you can view the templates, you cannot save or modify the test results within them.

Health Checks & Alerts#

NVIDIA Mission Control autonomous hardware recovery provides a full suite of automated health checks to detect failures at the tray, rack, and system levels for GB200. In addition, system wide health checks are performed by integrating with the UFM and NetQ network control planes. Health check data is reported back to BCM’s BaseView and/or the in-cluster LGTM stack. These health checks are performed at two layers: BCM job invocation, and as periodic health checks through NVIDIA Mission Control autonomous hardware recovery.

Alarms Dashboard#

The alarms dashboard is an overview of all alarms and the state of your system. In this view, alarms are summarized by counts of alarms firing, alarms firing most frequently, and a configurable list of most frequently firing, canceled, or resolved alarms. This is meant to be a starting point for any investigations of possible issues with your systems, and you may click any alarm for further details.

Alarms dashboard overview

BCM Slurm Job Lifecycle Checks (Prolog and Epilog)#

When a Slurm job is submitted, the Autonomous Hardware Recovery Agent automatically runs a set of checks at the start and end of the job to validate node health and stability. These are known as Prolog and Epilog checks.

Prolog Checks (run before the job starts):

  • If a check fails, the node is marked as DRAIN, and the job is re-queued.

  • If it passes, the job proceeds normally.

Epilog Checks (run after the job finishes):

  • If a check fails, the node is also marked as DRAIN.
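The Prolog and Epilog outcomes described above can be modeled as a small decision function. This is a sketch for illustration only; the function name and the returned keys are hypothetical, and the real checks are shell scripts run by the agent.

```python
def lifecycle_outcome(phase: str, checks_passed: bool) -> dict:
    """Model the documented outcome of a Prolog or Epilog check.

    `phase` is "prolog" (before the job) or "epilog" (after the job).
    """
    if phase not in ("prolog", "epilog"):
        raise ValueError(f"unknown phase: {phase!r}")
    failed = not checks_passed
    return {
        # A failed check in either phase marks the node as DRAIN.
        "drain_node": failed,
        # Only a failed Prolog re-queues the job; an Epilog runs after it ends.
        "requeue_job": failed and phase == "prolog",
    }
```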

These scripts are automatically pushed to each node when the job runs, but they are not visible or configurable through the NVIDIA Autonomous Hardware Recovery UI. To review them, navigate to Shoreline_files/scripts/slurm in the NVIDIA Mission Control package.

Note: Prolog and Epilog checks are disabled by default and should only be enabled after the nodes are confirmed to be healthy. Use the following runbooks to manage them:

  • SLURM_CHECKS_ENABLE – enables the checks

  • SLURM_CHECKS_DISABLE – disables the checks

Unlike the Prolog and Epilog checks, Periodic Checks are defined within the NVIDIA Mission Control autonomous hardware recovery interface as Alarms, and are detailed in the next section. They are automatically enabled for racks that pass Single Rack Testing, but may also be manually enabled or disabled for specific racks by running the ALARMS_ENABLE and ALARMS_DISABLE runbooks.

Periodic Health Checks (Alarms)#

Periodic Checks are separate from Prolog and Epilog and run at regular intervals to monitor system health. These are managed as Alarms in the NVIDIA Autonomous Hardware Recovery UI and behave as follows:

  • Automatically enabled for racks that pass Single Rack Testing

  • Automatically disabled during firmware upgrade and Break/fix

  • Can be manually enabled or disabled at any time

The alarm_base is the Resource Query they run against. To control them manually, use the following runbooks:

  • ALARMS_ENABLE – enables periodic alarms for selected racks. Note that a node having the maintenance tag will override these settings.

  • ALARMS_DISABLE – disables them

Periodic Checks are fully visible and configurable in the UI through the alarm section. The following is a list of the configured Alarms, grouped by their check interval:

Frequent Checks (5m)#

The system runs the following checks at a regular five-minute interval.

bmc_sensors#

Checks the sensors from the Baseboard Management Controller (BMC) to ensure the proper data is returned.

sysmem#

Checks that all expected memory DIMMs are present.

dns_host#

Checks the DNS configuration and resolution for the host.

eth_state#

Checks, using ibstat, that the ConnectX devices are present, active, in the physical LinkUp state, and running at the expected transfer rate.

raid_count#

Checks that the RAID configuration matches the expected mdstat configuration.

gpu_temp_history#

Checks System Event Log (SEL) history looking for GPU temperature issues.

gpu_alloc_temp#

Checks if the GPU temperatures are above a threshold.

periodic_bmc_host_checks#

The following groups of periodic functional checks are a subset of the BCM Prolog checks that run at predefined intervals as NVIDIA Mission Control autonomous hardware recovery Alarms. These checks consist of:

  • check_bmc_ipmi_version : Checks BMC IPMI version against an expected value

  • check_nvidia_module_loaded : Verifies the NVIDIA module is loaded in the host OS

  • check_host_os_version : Verifies the DGX OS version matches the expected value

  • check_nvsm_status : Verifies the NVSM service is currently active

periodic_cpu_mem_checks#

The following group of periodic functional checks covers the CPU and system memory:

  • check_cpu_health : Verifies CPU sockets and cores are present and online

  • check_dimm_count : Checks that all expected memory DIMMs are present

  • check_dimm_size : Checks that the size of each memory DIMM matches the expected values

  • check_memory_swap_size : Checks that the memory swap size matches the expected value

periodic_network_checks#
  • check_ib_ber_and_ro : Checks that the PCI_WR_ORDERING field is set to relaxed and checks the bit error rate of the CX7 using mlxlink

  • check_ib_port_rcv_errors : Checks InfiniBand devices for port RCV errors

  • check_ib_cables : Checks the cable info using mlxcables

  • check_bf3_speed : Validates that the BlueField devices are operating at the correct speed and that the proper number of devices are in the “Up” state. This check runs but never fails.

periodic_storage_checks#
  • check_pex_switch_health : Checks that the PEX switches are present, have the correct PCIe link speed and width, and the downstream devices have enumerated to lspci

  • check_cx_config : Checks that the ConnectX devices have the correct PCIe link speed and width using lspci and ACS config using setpci

  • check_nvme_health : Checks that the PCIe link speed and width of each NVMe device matches the expected value

  • check_storage_dir : Checks that the host has functional access to the home storage

  • check_storage_util : Checks that the used local storage on the host is below a given threshold
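As an illustration of a threshold-style check such as check_storage_util, the sketch below measures used local storage with Python's standard `shutil.disk_usage`. The function names, the default path, and the 0.9 threshold are assumptions for this example; the shipped check and its threshold are not shown in this document.

```python
import shutil


def storage_utilization(path: str = "/") -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    return usage.used / usage.total if usage.total else 0.0


def storage_below_threshold(path: str = "/", threshold: float = 0.9) -> bool:
    """True when used storage is below `threshold` (an assumed default)."""
    return storage_utilization(path) < threshold
```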

periodic_error_checks#

Checks journald for machine check events, Xid, AER, CPER, I/O, GPU fell off the bus, and other generic errors.

Hourly Checks#

nfs_mounts#

Verifies required mount points.

daily_informational#

This setting checks for issues where the severity and remediation may not be critical. This alarm will only be triggered once per day, and results may be viewed in the resulting runbook run.

  • check_sel_event : Reads the SEL events from the BMC and ensures none are asserted

  • check_dgx_os_version : Verifies the DGX OS version matches the expected value

  • check_gpu_vbios_ver : Checks the VBIOS version of the GPUs and compares against an expected value

  • check_nvme_fw_ver : Checks that the FW version for each NVMe matches an expected value

  • check_kernel_commandline_opt : Verifies the specified kernel option(s) is present in the current kernel’s boot parameters

  • check_host_bios_ver : Verifies the system’s BIOS version

  • check_kernel_ver : Verifies the current version of the Linux kernel

  • check_host_package_versions : Queries the installed packages on the host

  • nv_container_cli_info : Retrieves information about the NVIDIA container CLI (driver and devices)

Daily Checks#

cpu_stepping#

Checks that the CPU stepping parameter is correct for each CPU.

numa_node_count#

Checks that the correct number of non-uniform memory access (NUMA) nodes is configured with the CPU cores.

NVIDIA Mission Control autonomous hardware recovery Alarm Configuration#

There are several components to an alarm, with the key pieces being the Resource Query, Fire Query, Resolve Query, Check Interval, and Automation. An example configuration is shown in the following figure.

Alarm configuration example

Resource Query#

The resource query allows you to customize the resources (hosts, pods, gpus) on which the checks will be performed. In the preceding example, the `hosts | rack_name =~ ".*"` query will only check alarms on hosts which have a value set for the “rack_name” tag.
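The effect of a tag filter with a regular expression can be emulated in a few lines. This sketch is not the query engine itself: the `match_tag` helper and the dictionary representation of hosts are hypothetical, and matching the whole tag value is an assumption.

```python
import re


def match_tag(hosts, tag, pattern):
    """Return hosts whose `tag` value fully matches the regex `pattern`.

    Hosts without the tag are skipped, which mirrors why a pattern of ".*"
    on rack_name only selects hosts that have a rack_name value.
    """
    regex = re.compile(pattern)
    return [h for h in hosts if tag in h and regex.fullmatch(h[tag])]
```

The same helper shows how a narrower pattern such as `m06|m07` restricts the selection to specific racks.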

Fire Query#

The fire query is a condition that, when true, will cause the alarm to begin firing. It will be run at each interval.

Resolve Query#

Similar to the fire query, the Resolve query is a condition that will resolve the alarm when true. Resolving an alarm will cause firing to cease and the state to change to Resolved.

Check Interval#

The interval at which the fire and resolve queries are checked.

Alarm States#

If an Alarm has triggered, it will be in one of the following three states:

Triggered#

This state means the alarm is currently firing. Any automation (break/fix) will subsequently be invoked to remediate any issues, potentially resolving the alarm. Alternatively, the user can cancel the alarm by clicking “Cancel alarm.”

Clicking into the triggering alarm will give you more details on what caused the alarm, metadata and resources relating to the alarm, and will also allow you to view log output from the check itself.

Alarm states screenshot

Resolved#

When the resolve query of an alarm evaluates to true for a firing alarm, the status changes to Resolved. Runbooks triggered by automation invoke break/fix operations, which should be configured to result in a resolved alarm.

Canceled#

When a user cancels an alarm from the dashboard, or from the triggered alarm itself, its state will become Canceled. Also, if an alarm configuration is changed for an alarm in the Triggered state, it will be canceled since it was triggered against a defunct configuration.
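The three alarm states and their transitions can be summarized as a small state function evaluated once per check interval. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and giving cancellation precedence over resolution within the same interval is an assumption rather than documented behavior.

```python
def next_alarm_state(state, fire, resolve, user_cancel=False, config_changed=False):
    """Advance an alarm by one check interval.

    `state` is None (never triggered), "Triggered", "Resolved", or "Canceled".
    `fire` and `resolve` are the results of the fire and resolve queries.
    """
    if state == "Triggered":
        # A user cancel or a configuration change cancels a firing alarm.
        if user_cancel or config_changed:
            return "Canceled"
        if resolve:
            return "Resolved"
        return "Triggered"
    # Not currently firing: the fire query decides whether the alarm triggers.
    return "Triggered" if fire else state
```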

Firmware Upgrades with NVIDIA Mission Control autonomous hardware recovery#

NVIDIA Mission Control autonomous hardware recovery provides functionality for upgrading, cycling, and verifying firmware and the corresponding OS within your GB200 racks. The four distinct components for which firmware can be upgraded using this process are:

  • Compute trays

  • Switches

  • Mellanox

  • NVOS

The workflow invocation is performed using autonomous hardware recovery’s Runbooks. To view all Firmware upgrade related runbooks, you may search using the FIRMWARE_UPGRADE label.

Note

To do firmware updates within Base Command Manager or the nvfwupd tool itself, refer to the NVIDIA DGX GB200/GB300 Firmware Update Guide.

NVIDIA Mission Control autonomous hardware recovery Break/Fix Workflow#

NVIDIA Mission Control autonomous hardware recovery provides automated break/fix workflows to handle tray failures for GB200. These workflows execute a series of diagnostic steps to determine the cause of the failure and take the necessary repair steps. For issues that cannot be resolved automatically, they create support tickets when you have opted in to the support ticket service.

The automated break/fix workflow is designed to efficiently diagnose and remediate issues, with clear paths for different failure scenarios and comprehensive validation to ensure systems are properly restored to service.

Key Features#

  • Automatic Detection: Identifies drained nodes without manual intervention

  • Intelligent Triage: Routes to appropriate diagnostic workflows based on failure symptoms

  • Comprehensive Diagnostics: Performs thorough hardware and software checks

  • Automated Remediation: Attempts to resolve issues without human intervention when possible

  • Detailed Reporting: Provides comprehensive logs for RMA or further troubleshooting

Entrypoint of Break/Fix Workflow#

A centralized automated break/fix interface has been established to facilitate streamlined diagnostics and remediation. This unified entry point provides comprehensive access to the break/fix framework, enabling efficient navigation and implementation of all remediation procedures.

Break/Fix Workflow Components#

The break/fix system consists of several key components that work together to diagnose and remediate issues with compute trays. The following is a detailed explanation of each component:

BREAKFIX_TRIGGER#

The entry point runbook that:

  • Runs automatically every five minutes using the time trigger

  • Checks for any drained nodes in BCM

  • Initiates the triage process for affected nodes

  • Routes to the appropriate diagnostic workflow

BREAKFIX_COMPUTE_TRAY_TRIAGE#

This runbook is automatically triggered by BREAKFIX_TRIGGER and performs comprehensive triage on drained compute trays.

GPU_RECOVERY#

This specialized diagnostic runbook is automatically invoked by BREAKFIX_COMPUTE_TRAY_TRIAGE when GPU-related issues are detected.

BREAKFIX_COMPUTE_TRAY_VALIDATION#

This runbook is automatically executed after remediation actions to validate system health as part of the automated workflow following triage and recovery operations.

BREAKFIX_DIAG_DUMP#

This runbook is automatically triggered when validation tests fail, collecting comprehensive diagnostic information for support ticket creation.
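
The way these runbooks hand off to one another can be sketched as a simple function. This is a simplified reading of the descriptions above, not the actual runbook logic: the function name and its three boolean inputs are hypothetical, and the real runbooks make these decisions internally.

```python
from typing import List


def breakfix_runbook_chain(
    node_drained: bool, gpu_issue: bool, validation_passed: bool
) -> List[str]:
    """List the runbooks that would run for one node, in order."""
    # The time-triggered entry point always runs and looks for drained nodes.
    chain = ["BREAKFIX_TRIGGER"]
    if not node_drained:
        return chain
    # Drained nodes are handed to triage.
    chain.append("BREAKFIX_COMPUTE_TRAY_TRIAGE")
    # Triage invokes the GPU diagnostic runbook only for GPU-related issues.
    if gpu_issue:
        chain.append("GPU_RECOVERY")
    # Validation runs after remediation actions.
    chain.append("BREAKFIX_COMPUTE_TRAY_VALIDATION")
    # A failed validation triggers diagnostic collection for a support ticket.
    if not validation_passed:
        chain.append("BREAKFIX_DIAG_DUMP")
    return chain
```

Under these assumptions, a drained node with a GPU issue that still fails validation passes through all five runbooks, while a healthy node never leaves BREAKFIX_TRIGGER.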

NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA#

The NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA workflow automates the process of bringing hardware components back into service after a Return Merchandise Authorization (RMA) replacement. This workflow ensures that replaced hardware is properly configured, firmware is updated to the correct versions, and the component is thoroughly validated before returning to production.

Key Features#

  • Automated Configuration: Configures replaced hardware components with proper settings

  • Firmware Updates: Updates firmware to match the required versions for the environment

  • Boot Order Correction: Ensures proper boot sequence for reliable operation

  • Comprehensive Validation: Performs thorough testing to verify hardware functionality

  • Seamless Integration: Automatically returns validated hardware to service

Post RMA Workflow Components#

The Post RMA workflow consists of several key steps that ensure replaced hardware is properly configured and validated:

Physical Replacement Procedures#

  • Compute Tray Removal: Detailed step-by-step instructions for safely removing failed compute trays, including power down procedures, cable disconnection, and proper handling

  • Compute Tray Installation: Comprehensive installation guide covering component migration (M.2 boot drive, E1.S cache drives, HMC, BMC, TPM), rail installation, and cable reconnection

  • Component Migration: Transfer of critical components from old tray to new tray while maintaining proper slot assignments and ESD protection

BCM Inventory Update#

  • ONLY REQUIRED FOR NEW HARDWARE: Updates BCM inventory information using the BREAKFIX_POST_RMA_UPDATE_BCM_INVENTORY runbook when a new compute tray is installed

  • Skip this step for repaired trays as MAC addresses remain unchanged

  • Ensures MAC addresses and other hardware identifiers are correctly registered in BCM (new MAC addresses are provided by the Enterprise Support team who manage serial numbers and asset inventory for customer deployments)

  • Enables proper management and monitoring of the replaced hardware

BMC Credential Management#

  • Creates necessary BMC credential files for secure access to hardware components

  • Establishes secure communication channels for configuration operations

BlueField Configuration#

  • Checks if BlueField devices are in NIC mode

  • Enables OPROM on BlueField devices to ensure proper initialization

  • Configures hardware components for optimal operation

Boot Order Correction#

  • Ensures the boot sequence is properly configured

  • Prevents boot failures and improves system reliability

  • Performs power reset through BMC after configuration changes

Connectivity Verification#

  • Verifies SSH connectivity to compute nodes

  • Checks BCM device status to ensure proper registration

  • Confirms network accessibility before proceeding with firmware updates

Firmware Updates#

  • Compute Firmware: Updates compute firmware using the BREAKFIX_FIRMWARE_UPGRADE_COMPUTE_POST_RMA runbook

  • Mellanox Firmware: Updates BlueField and ConnectX firmware using the BREAKFIX_FIRMWARE_UPGRADE_MELLANOX_POST_RMA runbook

  • Ensures all hardware components are running the correct firmware versions

System Validation#

  • Waits for hosts to come back online after each firmware update cycle

  • Verifies agent connectivity to ensure management capabilities

  • Runs comprehensive validation tests using BREAKFIX_COMPUTE_TRAY_VALIDATION

  • Opens nodes in BCM and validates Slurm readiness for successful nodes

Post RMA Workflow Results#

After successful completion of the Post RMA workflow:

Hardware Configuration:

  • Physical components (M.2, E1.S drives, HMC, BMC, TPM) properly migrated to new tray

  • BlueField devices configured in NIC mode with OPROM enabled

  • Boot order corrected for reliable system startup

  • Power management and connectivity verified

Firmware Updates:

  • Compute firmware updated to specified versions using BREAKFIX_FIRMWARE_UPGRADE_COMPUTE_POST_RMA

  • Mellanox BlueField and ConnectX firmware updated to specified versions using BREAKFIX_FIRMWARE_UPGRADE_MELLANOX_POST_RMA

  • All firmware components validated against expected versions

System Integration:

  • SSH connectivity to compute nodes verified

  • BCM device status confirmed and registered

  • Agent connectivity established for management capabilities

  • Comprehensive validation tests passed using BREAKFIX_COMPUTE_TRAY_VALIDATION

Service Restoration:

  • Nodes automatically opened in BCM for job scheduling

  • Slurm readiness validated for successful nodes

  • Hardware returned to production service automatically

Failure Handling: For any components that fail validation:

  • System maintains them in non-production state with maintenance tags

  • Detailed error logs available in runbook execution cells

  • Manual intervention required to address specific failure causes

  • Nodes remain drained until issues are resolved
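
The overall shape of the Post RMA workflow, an ordered series of steps followed by either service restoration or failure handling, can be sketched as below. This is an illustrative sketch, not the workflow implementation: `run_post_rma`, the step callables, and the `"in_service"` and `"drained"` labels are hypothetical, and stopping at the first failed step is an assumption made for simplicity.

```python
from typing import Callable, List, Tuple

# A step is a display name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_post_rma(steps: List[Step]) -> Tuple[str, List[str]]:
    """Run steps in order and report the resulting node state and a log."""
    log: List[str] = []
    for name, action in steps:
        if not action():
            # Assumption: the node stays drained until someone intervenes.
            log.append(f"FAILED: {name}")
            return ("drained", log)
        log.append(name)
    # All steps passed, so the node can be opened for scheduling.
    return ("in_service", log)


# Step names follow the component headings above; the lambdas are placeholders.
example_steps: List[Step] = [
    ("BCM Inventory Update", lambda: True),
    ("BMC Credential Management", lambda: True),
    ("BlueField Configuration", lambda: True),
    ("Boot Order Correction", lambda: True),
    ("Connectivity Verification", lambda: True),
    ("Firmware Updates", lambda: True),
    ("System Validation", lambda: True),
]
```

Calling `run_post_rma(example_steps)` returns `"in_service"` together with the seven step names. Replacing any placeholder with a callable that returns `False` yields `"drained"` and a log that ends at the failed step.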

NVIDIA Mission Control autonomous hardware recovery Domain Triage#

The BREAKFIX_DOMAIN_TRIAGE runbook provides diagnostics and troubleshooting for NVSwitch and NVLink domain-level issues. Unlike the Break/Fix Workflow, it is not triggered automatically by BREAKFIX_TRIGGER; run it manually, on demand, when a domain-level problem is identified. The workflow collects comprehensive diagnostic information to facilitate efficient resolution and minimize system downtime.

Domain Triage Workflow Components#

The Domain Triage workflow consists of several key steps that ensure thorough diagnosis of NVSwitch and NVLink domain issues:

Compute Node Management#

  • Adds AHR maintenance tags to all compute nodes in the affected rack

  • Drains compute nodes from Slurm to prevent workloads from running during diagnostics (no jobs will be scheduled on the entire rack)

NVSwitch Credential Management#

  • Retrieves BMC credentials for NVSwitches from BCM

  • Establishes secure access to NVSwitch components for diagnostics

  • Collects system rack serial numbers for identification

Diagnostic Data Collection#

  • Dumps BMC logs from NVSwitches to capture hardware-level events

  • Runs NVDebug tool to collect detailed information about NVSwitch status

  • Executes Nvlmapper tool to check NVLink status and connectivity

  • Runs PartnerDiag for comprehensive hardware diagnostics

Case Management#

  • Collects and organizes all diagnostic logs into a single package

  • Creates a Support ticket with all relevant diagnostic information

  • Attaches detailed logs to facilitate efficient troubleshooting

NVIDIA Mission Control autonomous hardware recovery Break/Fix Switch Post RMA#

Switch Post RMA Introduction#

The NVIDIA Mission Control autonomous hardware recovery Break/Fix Switch Post RMA workflow automates the process of bringing NVSwitch components back into service after a Return Merchandise Authorization (RMA) replacement. This comprehensive workflow includes both physical switch tray replacement procedures and automated software configuration to ensure that replaced switch hardware is properly configured, firmware is updated to the correct versions, and the component is thoroughly validated before returning to production.

Key Features#

  • Physical Replacement Procedures: Detailed instructions for safe switch tray removal and installation with proper cooling and power management

  • Compute Node Management: Adds maintenance tags and drains compute nodes during switch replacement to prevent workload interference

  • Switch Connectivity Verification: Establishes and verifies SSH connectivity to replaced switch components

  • Factory Reset and ZTP: Performs factory reset and monitors Zero Touch Provisioning for clean initialization

  • Firmware Updates: Updates switch firmware to match required versions using BREAKFIX_FIRMWARE_UPGRADE_SWITCH_POST_RMA

  • System Validation: Comprehensive testing including NMX controller verification, compute node reboots, and compute tray validation

Switch Post RMA Workflow Components#

The Switch Post RMA workflow consists of several key steps that ensure replaced switch hardware is properly configured and validated:

Physical Replacement Procedures#

  • Switch Tray Removal: Detailed instructions for powering down the entire rack, cooling procedures, cable disconnection, and safe tray removal

  • Switch Tray Installation: Comprehensive installation guide covering rail migration, tray insertion, cable reconnection, and power-on sequence

Compute Node Management#

  • Adds maintenance tags to all compute nodes in the affected rack

  • Drains compute nodes from Slurm to prevent workload interference during switch replacement

BCM Inventory Update#

  • ONLY REQUIRED FOR NEW HARDWARE: Updates BCM inventory information using BREAKFIX_POST_RMA_UPDATE_SWITCH_BCM_INVENTORY when a new switch is installed

  • Skip this step for repaired switches as MAC addresses remain unchanged

  • Ensures BMC MAC and COMe MAC addresses are correctly registered in BCM (new MAC addresses are provided by the Enterprise Support team who manage asset inventory for customer deployments)

  • Enables proper management and monitoring of the replaced switch hardware

Switch Connectivity and Configuration#

  • Retrieves switch IP and credentials from BCM

  • Verifies SSH connectivity to the switch node

  • Updates ZTP settings in BCM with NVOS image file configuration

  • Ensures the switch is reachable for configuration operations

Factory Reset and ZTP#

  • Performs factory default reset on the switch

  • Monitors Zero Touch Provisioning (ZTP) status until successful completion

  • Creates support tickets if ZTP fails
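
Monitoring ZTP "until successful completion" is a polling pattern, which can be sketched as follows. This is a sketch under stated assumptions: `wait_for_ztp`, the `get_status` callable, and the status strings `"success"`, `"failed"`, and `"in_progress"` are hypothetical. The status values and commands that the switch actually reports are not described here.

```python
import time
from typing import Callable


def wait_for_ztp(
    get_status: Callable[[], str],
    max_polls: int = 30,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a status source until ZTP succeeds, fails, or polls run out."""
    for attempt in range(max_polls):
        status = get_status()
        if status == "success":
            return True
        if status == "failed":
            return False
        # Still in progress: wait before the next poll, except after the last one.
        if attempt < max_polls - 1:
            sleep(interval_s)
    # Treat running out of polls as a failure.
    return False
```

A caller would open a support ticket when this function returns `False`, which corresponds to the last step in the list above. The `sleep` parameter exists only so that the loop can be exercised without real delays.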

Switch BMC Credential Management#

  • Retrieves BMC credentials for NVSwitches from BCM

  • Establishes secure access to switch components for diagnostics

  • Collects system rack serial numbers for identification

Firmware Updates#

  • Upgrades switch firmware to the specified version using BREAKFIX_FIRMWARE_UPGRADE_SWITCH_POST_RMA

  • Verifies switch connectivity after firmware updates

System Health and Validation#

  • Performs switch tray health checks

  • Verifies NMX-C and NMX-T controller status on the active switch node

  • Reboots all compute nodes in the rack to ensure proper connectivity

  • Waits for agent connectivity to confirm successful recovery

  • Validates there are no inactive NVLinks

  • Runs comprehensive compute tray validation using BREAKFIX_COMPUTE_TRAY_VALIDATION

B200#

Dashboard#

The NVIDIA Mission Control autonomous hardware recovery dashboard provides a comprehensive view of the status of all resources in the cluster, displaying the progress of testing at various stages for each node (control, compute, and switch nodes). It shows which tests are completed, which ones have passed, and which have failed. This allows users to easily track the overall health and status of the cluster, identify any issues, and assess the readiness of each resource.

Cluster Validation#

The NVIDIA Mission Control autonomous hardware recovery portal enables efficient automation of baseline testing procedures for DGX B200 systems. This comprehensive testing framework can be flexibly executed across various scales of infrastructure, from individual compute nodes to multiple node configurations. The system performs extensive validation of critical compute node components, including CPU/GPU/Memory/storage functionality, network connectivity, and firmware versioning. Additionally, it incorporates industry-standard performance benchmarking tools such as HPL, NCCL, and Nemotron (Large Language Model) to assess system capabilities. This streamlined approach significantly enhances both testing efficiency and thoroughness while reducing execution time.

Note: In B200 systems, “Unit” refers to an individual compute node.

_images/image10.png

Entrypoint of Automated Testing#

A centralized automated baseline testing interface has been established to facilitate streamlined test execution and management. This unified entry point provides comprehensive access to the testing framework, enabling efficient navigation and one-click implementation of all testing procedures.

To access the baseline testing interface:

  • Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.

  • Locate the “DGX_SUPERPOD_BASELINE_TESTING” runbook using the search functionality, as shown in the following image.

    _images/bcn_ahr_image01.png
  • Upon selecting the “DGX_SUPERPOD_BASELINE_TESTING” runbook, you will be presented with the following interface. The subsequent sections of this documentation will provide detailed guidance on executing various testing procedures.

    _images/bcn_ahr_image02.png

Firmware checks#

NVIDIA Mission Control autonomous hardware recovery includes firmware checks that extract the current firmware versions of the trays and switches, and compare them with the expected versions specified in the Source of Truth (SOT) file. The SOT file includes the expected versions for all components such as OS, HMC, ConnectX, etc., and is prepopulated. Obtain the SOT file from the NVIS team.

The runbook extracts the expected versions of all firmware components and compares the current versions against them. If any execution of the firmware validation runbook results in versions not found, please re-run the runbook.
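
The comparison that the firmware check performs can be illustrated with a short sketch. The real SOT file schema is not described here, so this example assumes a flat mapping from component name to version string. The function name, the `"NOT_FOUND"` label, and the sample version strings are hypothetical.

```python
from typing import Dict, Optional


def compare_firmware(
    expected: Dict[str, str], current: Dict[str, Optional[str]]
) -> Dict[str, str]:
    """Compare current versions against expected ones, per component."""
    results: Dict[str, str] = {}
    for component, want in expected.items():
        have = current.get(component)
        if have is None:
            # The version could not be read; the text above suggests a re-run.
            results[component] = "NOT_FOUND"
        elif have == want:
            results[component] = "PASS"
        else:
            results[component] = "FAIL"
    return results


# Made-up version strings, for illustration only.
expected_versions = {"OS": "1.0.0", "HMC": "2.0.0", "ConnectX": "3.0.0"}
current_versions = {"OS": "1.0.0", "HMC": "1.9.0", "ConnectX": None}
report = compare_firmware(expected_versions, current_versions)
```

With these sample inputs, `report` marks OS as `PASS`, HMC as `FAIL`, and ConnectX as `NOT_FOUND`.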

Updating the SOT file#

Place the SOT JSON file on the headnode and provide the complete file path as the input FW_SOURCE_JSON_PATH to the DGX_SUPERPOD_BASELINE_TESTING runbook. To update the file, simply replace it with the new file and update the path in the input parameter of the runbook accordingly.

Thresholds and Defaults#

The thresholds and default values for various tests are defined in the Golden Config File. The runbook picks the appropriate values and compares them against the values on the nodes. These include defaults such as the number of GPUs and the expected results for benchmarking tests (such as SUT2).

Golden Config File#

The Golden Config File (referred to as the defaults.env file on the trays and control nodes) contains the expected values for all benchmark thresholds and other relevant settings. This file is distributed across all trays and loaded as environment variables, making its contents available during testing.

_images/image31.png

Updating the Golden Config File#

To update the golden config values, edit config/<CHIP>/defaults.yaml (for example, config/B200/defaults.yaml or config/GB200/defaults.yaml). The chip-specific env files are generated automatically from these YAML files during deployment — do not edit the generated files under Shoreline_files/generated/ directly.

Once the changes are made and saved, use the NVIDIA Mission Control autonomous hardware recovery Runbook Deployment section (from the NVIDIA Mission Control AHR installation documentation) to apply them via OpenTofu. This process will create a File object on NVIDIA Mission Control autonomous hardware recovery, which pushes the updated file automatically to all control and compute trays. Various tests, including firmware checks, prolog, and epilog checks, source the defaults.env file and utilize the expected values, now available as environment variables.
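
To illustrate how a test might consume values from an env-style file such as defaults.env, here is a minimal sketch. It assumes simple `KEY=value` lines, optionally prefixed with `export`. The key names `EXPECTED_GPU_COUNT` and `EXAMPLE_BENCHMARK_MIN`, the sample values, and both function names are hypothetical and do not reflect the actual contents of the Golden Config File.

```python
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=value line
        # Drop surrounding double or single quotes, if any.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


SAMPLE = """
# hypothetical keys, for illustration only
EXPECTED_GPU_COUNT=8
export EXAMPLE_BENCHMARK_MIN="100.5"
"""

defaults = parse_env_text(SAMPLE)


def gpu_count_ok(found: int, config: Dict[str, str]) -> bool:
    """Compare a measured GPU count against the expected default."""
    return found == int(config["EXPECTED_GPU_COUNT"])
```

In the deployed system the file is sourced so that the values are available as environment variables. The parser above only stands in for that step so the example can run on its own.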

Reports of Testings#

NVIDIA Mission Control autonomous hardware recovery provides reports for baseline testing, reflecting the status of nodes (compute and switches) at each test stage. These reports help identify and troubleshoot root causes.

To access the NVIDIA Mission Control autonomous hardware recovery reports, click on Resources in the side menu, then select Reports.

_images/image32.png

The Landing page contains two tabs: Report Templates and Published Reports.

_images/image33.png

Report Templates provide templates for each stage of Baseline testing. These templates include bar graphs that display the PASS or FAIL status for different nodes during the tests. However, these templates are static and do not store any test data. This means that while you can view the templates, you cannot save or modify the test results within them.

To generate a report that records the test results along with timestamps, click Publish. This action will create a new report based on the template, which will capture the current status of the tests. The report will display PASS for tests that were successful and FAIL for those that did not pass, with each status reflecting the most up-to-date information. The report also includes links to additional reports at the top of the page. These links allow you to access the detailed results of the individual SUT and MUT tests that make up each stage, giving you a deeper insight into the performance and status of each test within the overall baseline testing process.

Note: Reports reflect updated information only after the SUT and MUT tests have been executed.

Guide to initiate Reporting for Baseline Testing#

  • Navigate to the “DGX_SUPERPOD_BASELINE_TESTING_REPORT” under the Report Templates.

  • Optional: Templates do not store test results, but you can use the template to preview the current state before publishing. Click the refresh button at the top of the report to load the current data.

    _images/image34.png
  • Click on “Publish” to generate a timestamped instance of the template that can be easily viewed and shared. Additionally, it automatically publishes all linked reports at the top of the page, ensuring that all related data is included and accessible.

    _images/image35.png
  • Retain the auto-generated name for the report which includes the timestamp or provide your own name.

  • Once the process starts, the published reports will begin generating in the background. This includes “DGX_SUPERPOD_BASELINE_TESTING_REPORT” as well as all the SUT and MUT linked reports.

  • A pop-up notification will appear containing a hyperlink to the published reports that are currently being generated.

    _images/image36.png
  • When you publish the “DGX_SUPERPOD_BASELINE_TESTING_REPORT”, it automatically triggers the publication of all linked reports, including the associated SUT and MUT reports, with the data captured at that time.

    _images/image37.png
  • All the reports will complete building in under a minute, and the “Linked Published Reports” will include the published reports for all the Linked Reports.

    _images/image38.png

Breakdown of a Published Report#

DGX_SUPERPOD_BASELINE_TESTING_REPORT#

  • “DGX_SUPERPOD_BASELINE_TESTING_REPORT” is the entry point for the Baseline Testing reports. It provides a comprehensive overview of all the SUT/MUT tests for the compute nodes.

  • Each stage is represented by a separate cell, displaying the results for that specific SUT or MUT test.

    _images/image41.png
  • To perform a deeper analysis or understand the tests in each stage, click on the relevant report listed under “Linked Published Reports” at the top of the page.

    _images/image42.png
  • Similarly, the reports for every stage are available. You can either find the report of interest under Published Reports, or navigate to it from the parent report (DGX_SUPERPOD_BASELINE_TESTING_REPORT_<timestamp>).

Understanding the Published Reports#

The reports align with the SUT and MUT tests. Each cell in “DGX_SUPERPOD_BASELINE_TESTING_REPORT” represents a specific testing stage, such as SUT1 or SUT2. Within each of these “Linked Reports”, the cells represent individual tests, such as Singlenode Healthchecks, HPL, or NCCL. The graph in each cell is laid out as follows:

  • The Y-axis displays the Rack name, allowing you to quickly identify the location of each resource within the cluster.

  • The X-axis represents the number of trays within each rack.

For example, in the following visualization, each bar indicates the number of trays within a rack. In this case, there are 2 trays per rack, as denoted by the number displayed on the bars. This provides a clear view of the test progress for each tray across different racks.

Note: For NVL72, the report shows 18 trays per rack.

_images/image43.png

  • You can click on any bar in the graph to view a detailed list of the resources that passed or failed the tests within that specific bar. This allows you to drill down and see the test results for each tray in a particular rack or test stage.

    _images/image44.png
  • Alternatively, you can click on the legend to filter and display all the passed or failed resources across the entire cluster, providing a comprehensive view of the overall test status.

    _images/image45.png
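
The bar graphs described above are per-rack counts of trays by status. The aggregation behind such a graph can be sketched as follows. This is only an illustration of the idea, not how the reports are built: the function name, the tuple layout, and the rack and tray names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def count_by_rack(
    results: Iterable[Tuple[str, str, str]]
) -> Dict[str, Dict[str, int]]:
    """Group (rack, tray, status) records into per-rack status counts."""
    per_rack = defaultdict(Counter)
    for rack, _tray, status in results:
        per_rack[rack][status] += 1
    return {rack: dict(counts) for rack, counts in per_rack.items()}


# Two trays per rack, as in the example visualization.
sample = [
    ("rack-a", "tray-01", "PASS"),
    ("rack-a", "tray-02", "FAIL"),
    ("rack-b", "tray-01", "PASS"),
    ("rack-b", "tray-02", "PASS"),
]
counts = count_by_rack(sample)
```

Here `counts["rack-a"]` holds one PASS and one FAIL, while `counts["rack-b"]` holds two PASS entries. Each rack corresponds to one row on the Y-axis, and the counts are the numbers shown on the bars.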

Root Cause Analysis#

The report for each stage shows the trays that have passed and failed the test. To understand the actual issue and access the logs, you can follow these steps:

  • Navigate to the report with PASS/FAIL values on which you want to conduct an RCA. In this example, consider the SRT1 report, where a few tests failed for SINGLENODE_HEALTHCHECK_GPU_CPU.

    _images/image46.png
  • Navigate to the report corresponding to SINGLENODE_HEALTHCHECK_GPU_CPU and scroll to the tests that are failing.

    _images/image47.png
  • In this example, GPU VBIOS Version has failed for all the trays. Click on the bar for one of the racks to list the resources that the test failed on.

    _images/image48.png
  • Click on the status (FAIL) which directly links you to the runbook where these tests failed. You can also do the same for tests that passed by clicking on the PASS status.

    _images/image49.png
  • The errors in the runbook are outlined to the left of the screen. Navigate to the test that we are debugging (VBIOS). Alternatively, you can also scroll through the run to find the failed tests.

    _images/image55.png _images/image56.png
  • Click on the “Command filter excluded x/y resources” to get detailed output for each resource.

    _images/image57.png
  • Click on the Output to view the detailed logs.

_images/image58.png _images/image59.png

Firmware Reports#

For Firmware Checks, NVIDIA Mission Control autonomous hardware recovery offers a tabular report detailing the status of each node. The report includes:

  • Pass/Fail Status: Indicates whether the firmware check for each node passed or failed.

  • Expected Version: Shows the firmware version that was expected for the node.

  • Current Version: Displays the actual firmware version currently installed on the node.

Resource Dashboard#

The NVIDIA Mission Control autonomous hardware recovery dashboard provides a comprehensive view of the status of all resources in the cluster, displaying the progress of testing at various stages for each node (control, compute, and switch nodes). It shows which tests are completed, which ones have passed, and which have failed. This allows users to easily track the overall health and status of the cluster, identify any issues, and assess the readiness of each resource.

DGX_SUPERPOD_BASELINE_TESTING_DASHBOARD#

The DGX_SUPERPOD_BASELINE_TESTING_DASHBOARD provides a detailed view of the cluster’s resources and the status of the baseline tests. Specifically, it tracks the progress of two key testing phases for each resource: SUT and MUT.

  • Resource Status: The dashboard displays all resources within the cluster, such as compute nodes, control nodes, and switch nodes, along with their associated status.

  • Hostname & Rack Information: For each resource, you will see the hostname and rack name, along with a Tag sequence that indicates the stage of both the SUT and MUT tests.

  • Test Stage Progress: The rows in the dashboard reflect the current status of each test stage. Each stage (SUT and MUT) has a corresponding tag name that visually represents its test progress, showing whether the stage has been successfully completed, is in progress, or has failed.

  • Snapshot of Cluster Health: The dashboard offers a comprehensive snapshot of the cluster’s readiness. It allows users to identify potential issues early, track the completion status of tests, and quickly assess which resources are ready and which ones may require attention.

    _images/image64.png

The dashboard allows you to sort the resources by Name, Rack, or Progress Bar, making it easier to organize and view the status of your cluster based on your preferred criteria.

  • Name: Sort resources alphabetically by their hostname for quick access.

  • Rack: Sort resources based on their rack assignment, ideal for organizing by physical location.

  • Progress Bar: Sort by test progress to focus on resources at different stages of testing or to identify incomplete tasks.

Creating a View / Snapshot#

To create a snapshot of the dashboard, follow these steps.

  • Click on the “Create View” button located at the top right corner of your screen.

    _images/image65.png
  • Retain the auto-generated name for the dashboard which includes the timestamp or provide your own name.

    _images/image66.png
  • Once the Dashboard View is created, a pop-up notification will appear with a hyperlink to access the Dashboard View.

    _images/image67.png
  • The Dashboard View created can be downloaded as a CSV to further manipulate the data to generate reports.

    _images/image68.png
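
As an example of manipulating the downloaded CSV, the sketch below counts failed resources per rack using only the Python standard library. The column names `name`, `rack`, and `status`, the sample rows, and the function name are hypothetical. Check the header row of your actual export and adjust accordingly.

```python
import csv
import io
from collections import Counter

# Stand-in for the contents of a downloaded Dashboard View CSV.
SAMPLE_CSV = """name,rack,status
node-01,rack-a,PASS
node-02,rack-a,FAIL
node-03,rack-b,PASS
"""


def failed_per_rack(csv_text: str) -> Counter:
    """Count rows whose status is FAIL, grouped by rack."""
    reader = csv.DictReader(io.StringIO(csv_text))
    failed: Counter = Counter()
    for row in reader:
        if row["status"] == "FAIL":
            failed[row["rack"]] += 1
    return failed


summary = failed_per_rack(SAMPLE_CSV)
```

To read a real file instead of the inline sample, open it with `open(path, newline="")` and pass the file object to `csv.DictReader` directly.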

Automated Health Checks with NVIDIA Mission Control autonomous hardware recovery#

Introduction#

NVIDIA Mission Control autonomous hardware recovery provides a full suite of automated health checks to detect failures at the tray, rack, and system levels. In addition, system wide health checks are performed by integrating with the UFM and NMX-M network control planes. Health check data is reported back to BCM’s BaseView and/or the in-cluster LGTM stack. These health checks are performed at two layers: BCM job invocation, and as periodic health checks via NVIDIA Mission Control autonomous hardware recovery.

Alarms Dashboard#

The alarms dashboard is an overview of all alarms and the state of your system. In this view, alarms are summarized by counts of alarms firing, alarms firing most frequently, and a configurable list of most frequently firing, canceled, or resolved alarms. This is meant to be a starting point for any investigations of possible issues with your systems, and you may click any alarm for further details.

Alarms dashboard overview

BCM Slurm Job Lifecycle Checks (Prolog and Epilog)#

When a Slurm job is submitted, the Autonomous Hardware Recovery Agent automatically runs a set of checks at the start and end of the job to validate node health and stability. These are known as Prolog and Epilog checks.

  • Prolog Checks (run before the job starts):

    • If a check fails, the node is marked as DRAIN, and the job is re-queued.

    • If it passes, the job proceeds normally.

  • Epilog Checks (run after the job finishes):

    • If a check fails, the node is also marked as DRAIN.
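
The Prolog and Epilog outcomes listed above can be written as a small decision function. This is a sketch of the documented behavior only: the function name and the action labels `"run_job"`, `"drain_node"`, and `"requeue_job"` are hypothetical and do not correspond to actual Slurm or agent commands.

```python
from typing import List


def job_check_actions(phase: str, check_passed: bool) -> List[str]:
    """Return the actions that follow a Prolog or Epilog check result."""
    if phase not in ("prolog", "epilog"):
        raise ValueError(f"unknown phase: {phase!r}")
    if check_passed:
        # A passing Prolog lets the job start; a passing Epilog needs no action.
        return ["run_job"] if phase == "prolog" else []
    # Any failed check drains the node.
    actions = ["drain_node"]
    if phase == "prolog":
        # Only a Prolog failure re-queues the job, since it has not run yet.
        actions.append("requeue_job")
    return actions
```

For instance, `job_check_actions("prolog", False)` returns `["drain_node", "requeue_job"]`, whereas `job_check_actions("epilog", False)` returns only `["drain_node"]`.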

These scripts are automatically pushed to each node when the job runs, but they are not visible or configurable through the NVIDIA Autonomous Hardware Recovery UI. To review them, navigate to Shoreline_files/scripts/slurm in the NVIDIA Mission Control package.

Note: Prolog and Epilog checks are disabled by default and should only be enabled after the nodes are confirmed to be healthy. Use the following runbooks to manage them:

  • SLURM_CHECKS_ENABLE – enables the checks

  • SLURM_CHECKS_DISABLE – disables the checks

Unlike the Prolog and Epilog checks, Periodic Checks are defined within the NVIDIA Mission Control autonomous hardware recovery interface as Alarms, and are detailed in the next section. They are automatically enabled for racks that pass Single Rack Testing, but may also be manually enabled or disabled for specific racks by running the “ALARMS_ENABLE” and “ALARMS_DISABLE” runbooks.

Periodic Health Checks (Alarms)#

Periodic Checks are separate from Prolog and Epilog and run at regular intervals to monitor system health. These are managed as Alarms in the NVIDIA Autonomous Hardware Recovery UI and behave as follows:

  • Automatically enabled for racks that pass Single Rack Testing

  • Automatically disabled during firmware upgrade and break/fix operations

  • Can be manually enabled or disabled at any time

The alarm_base is the Resource Query they run against. To control them manually, use the following runbooks:

  • ALARMS_ENABLE – enables periodic alarms for selected racks. Note that a node having the maintenance tag will override these settings.

  • ALARMS_DISABLE – disables them

Periodic Checks are fully visible and configurable in the UI through the alarm section. The following is a list of the configured Alarms, grouped by their check interval:

Frequent Checks (5m)#

bmc_sensors#

Checks the sensors from the Baseboard Management Controller (BMC) to ensure the proper data is returned.

sysmem#

Checks that all expected memory DIMMs are present.

dns_host#

Checks the DNS configuration and resolution for the host.

eth_state#

Checks via ibstat that the ConnectX devices are present, active, and in the physical LinkUp state, and that they match the expected transfer rate.

raid_count#

Checks that the RAID configuration matches the expected mdstat configuration.

gpu_temp_history#

Checks System Event Log (SEL) history looking for GPU temperature issues.

gpu_alloc_temp#

Checks if the GPU temperatures are above a threshold.

periodic_bmc_host_checks#

The following groups of periodic functional checks are a subset of the BCM Prolog checks that run at predefined intervals as NVIDIA Mission Control autonomous hardware recovery Alarms.

check_bmc_ipmi_version - Checks BMC IPMI version against an expected value

check_nvidia_module_loaded - Verifies the NVIDIA module is loaded in the host OS

check_host_os_version - Verifies the DGX OS version matches the expected value

check_nvsm_status - Verifies the NVSM service is currently active

periodic_cpu_mem_checks#

check_cpu_health - Verifies CPU sockets and cores are present and online

check_dimm_count - Checks that all expected memory DIMMs are present

check_dimm_size - Checks that the size of each memory DIMM matches the expected values

check_memory_swap_size - Checks that the memory swap size matches the expected value

periodic_gpu_nvlink_checks#

check_gpu_pci - Checks that all GPUs are present on the lspci interface and with the correct link width and speed

check_gpu_error - Checks GPUs for ECC errors, retired pages, and throttles present

check_gpu_powerstate - Checks the powerstate for each GPU and compares against an expected value

check_gpu_param - Checks that specified GPU parameters are present and correct for the host

check_nvlink_health - Checks that the links for each GPU are active, run at the correct speed and full bandwidth, have completed fabric registration, and belong to the same NVLink domain and partition

check_gpu_topology - Checks that there are no issues with the p2p topology within the node

check_gpu_telemetry - Checks that various sensors can be successfully read from the GPU via nvidia-smi

check_gpu_power_limit - Checks that the power limit is correct for each GPU

check_nvidia_inforom_ver - Checks that the inforom version is correct for each GPU

check_gpu_clock_info - Checks that the maximum clock speed is correct for each GPU

check_remapped_row - Checks if any remapped row events have occurred

periodic_network_checks#

check_ib_ber_and_ro - Checks if the PCI_WR_ORDERING field is set to relaxed and also the bit error rate of the ConnectX using mlxlink

check_ib_port_rcv_errors - Checks InfiniBand device ports for receive (RCV) errors

check_ib_cables - Checks the cable info using mlxcables

check_bf3_speed - Validates that the BlueField devices are operating at the correct speed and that the proper number of devices is in the “Up” state. This check will run but never fail.

periodic_storage_checks#

check_pex_switch_health - Checks that the PEX switches are present, have the correct PCIe link speed and width, and the downstream devices have enumerated to lspci

check_cx_config - Checks that the ConnectX devices have the correct PCIe link speed and width via lspci and ACS config via setpci

check_nvme_health - Checks that the PCIe link speed and width of each NVMe device matches the expected value

check_storage_dir - Checks that the host has functional access to the home storage

check_storage_util - Checks that the used local storage on the host is below a given threshold

periodic_error_checks#

Checks journald for machine check events, Xid, AER, CPER, I/O, GPU fell off the bus, and other generic errors.

Hourly Checks#

nfs_mounts#

Verifies required mount points.

daily_informational#

Checks for which the severity and remediation may not be critical. This alarm will only be triggered once per day, and results may be viewed in the resulting runbook run.

check_sel_event - Reads the SEL events from the BMC and ensures none are asserted

check_dgx_os_version - Verifies the DGX OS version matches the expected value

check_gpu_vbios_ver - Checks the VBIOS version of the GPUs and compares against an expected value

check_nvme_fw_ver - Checks that the FW version for each NVMe matches an expected value

check_kernel_commandline_opt - Verifies the specified kernel option(s) is present in the current kernel’s boot parameters

check_host_bios_ver - Verifies the system’s BIOS version

check_kernel_ver - Verifies the current version of the Linux kernel

check_host_package_versions - Queries the installed packages on the host

nv_container_cli_info - Retrieves information about the NVIDIA container CLI (driver and devices)

Daily Checks#

cpu_stepping#

Checks that the CPU stepping parameter is correct for each CPU.

numa_node_count#

Checks that the correct number of non-uniform memory access (NUMA) nodes is configured with the CPU cores.

NVIDIA Mission Control autonomous hardware recovery Alarm Configuration#

There are several components to an alarm, with the key pieces being the Resource Query, Fire Query, Resolve Query, Check Interval and Automation. An example configuration is shown in the following figure.

Alarm configuration example

Resource Query#

The resource query allows you to customize the resources (hosts, pods, GPUs) on which the checks will be performed. In the preceding example, the query `hosts | name =~ ".*"` will only check alarms on hosts that have a value set for the “name” tag.
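
As a rough illustration of how a query such as `hosts | name =~ ".*"` narrows the resource set, the following Python sketch models resources as dictionaries of tags and keeps only those whose tag is set and matches a regular expression. The `select_resources` helper and the sample hosts are hypothetical and are not part of the product, and treating `=~` as a full-string match is an assumption; the actual query language is evaluated by the service itself.

```python
import re

def select_resources(resources, tag, pattern):
    """Return resources whose `tag` is set and fully matches `pattern`."""
    regex = re.compile(pattern)
    selected = []
    for resource in resources:
        value = resource.get("tags", {}).get(tag)
        # A resource without the tag has no value to match, so it is skipped.
        if value is not None and regex.fullmatch(value):
            selected.append(resource)
    return selected

# Hypothetical sample data: host 2 has no "name" tag.
hosts = [
    {"id": 1, "tags": {"name": "node01", "rack_name": "A01"}},
    {"id": 2, "tags": {"rack_name": "A01"}},
    {"id": 3, "tags": {"name": "node03", "rack_name": "B05"}},
]

matched = select_resources(hosts, "name", ".*")
print([h["id"] for h in matched])
```

In this model, the host without a “name” tag is left out even though the pattern `.*` matches any string.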

Fire Query#

The fire query is a condition that, when true, will cause the alarm to begin firing. It will be run at each interval.

Resolve Query#

Similar to the fire query, the resolve query is a condition that will resolve the alarm when true. Resolving an alarm will cause firing to cease and the state to change to Resolved.

Check Interval#

The interval at which the fire and resolve queries are checked.

Automation#

You may use the Automation settings to have Runbooks triggered when an alarm fires, and you may also customize the informational messages that are displayed in the Alarm’s logs.

Alarm States#

If an Alarm has triggered, it will be in one of the following three states:

Triggered#

This state means the alarm is currently firing. Any automation (break/fix) will subsequently be invoked to remediate any issues, potentially resolving the alarm. Alternatively, the user can cancel the alarm by clicking “Cancel alarm.”

Clicking into the triggering alarm will give you more details on what caused the alarm, metadata and resources relating to the alarm, and will also allow you to view log output from the check itself.

Alarm states screenshot

Resolved#

When the resolve query of a firing alarm evaluates to true, the status changes to Resolved. Runbooks triggered by automation invoke break/fix operations, which should be configured so that they result in a resolved alarm.

Canceled#

When a user cancels an alarm from the dashboard, or from the triggered alarm itself, its state will become Canceled. Also, if an alarm configuration is changed for an alarm in the Triggered state, it will be canceled since it was triggered against a defunct configuration.
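
The transitions between these states can be summarized with a small state model. The Python sketch below is only an illustration of the rules described in this section. The `AlarmState` enum, the `next_state` function, and the `IDLE` name for an alarm that has not triggered are hypothetical. Letting a cancel take precedence over a resolve, and letting a Resolved or Canceled alarm fire again, are assumptions.

```python
from enum import Enum

class AlarmState(Enum):
    IDLE = "Idle"  # hypothetical name for an alarm that has not triggered
    TRIGGERED = "Triggered"
    RESOLVED = "Resolved"
    CANCELED = "Canceled"

def next_state(state, fire=False, resolve=False, canceled=False, config_changed=False):
    """Compute the state after one check interval or user action."""
    if state is AlarmState.TRIGGERED:
        # A user cancel or a configuration change cancels a firing alarm.
        if canceled or config_changed:
            return AlarmState.CANCELED
        # The resolve query only matters while the alarm is firing.
        if resolve:
            return AlarmState.RESOLVED
        return AlarmState.TRIGGERED
    # An alarm that is not firing starts firing when the fire query is true.
    if fire:
        return AlarmState.TRIGGERED
    return state

print(next_state(AlarmState.IDLE, fire=True))
print(next_state(AlarmState.TRIGGERED, resolve=True))
print(next_state(AlarmState.TRIGGERED, config_changed=True))
```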

Firmware Upgrades with NVIDIA Mission Control autonomous hardware recovery#

Overview#

NVIDIA Mission Control autonomous hardware recovery provides functionality for upgrading, cycling, and verifying firmware and the corresponding OS within your GB300 racks. The distinct components for which firmware can be upgraded using this process are:

  • Compute trays

  • Switches

  • Mellanox

  • NVOS

  • Powershelf (PSU and PMC)

Asynchronous component workflows: Firmware upgrade workflows for different component types (compute tray, switch, Mellanox, and NVOS) run asynchronously; powershelf firmware upgrades are not included in asynchronous execution yet. You can schedule, run, and monitor each component workflow independently, and multiple firmware runs may be in progress for different components at the same time. The subsections below describe each workflow; use run history and Firmware Reports to track status across concurrent upgrades.

The workflow invocation is performed via autonomous hardware recovery’s Runbooks. To view all Firmware upgrade related runbooks, you may search by the FIRMWARE_UPGRADE label as shown below.

_images/fw_image12.png

Applying the filter narrows the displayed runbooks to those related to firmware upgrades. In general, you will use these runbooks to upgrade the compute trays, switches, or switch NVOSes by following the steps in the next section.

_images/fw_image13.png

Preparing the Upgrade#

In the firmware upgrade runbook, nvfwupd and nv action are used with the upgrade package file to determine the versions needing upgrade. You will need to obtain the firmware package and the Source of Truth (SOT) JSON file, the latter of which defines the referenced settings used for validation.

The SOT JSON may be obtained from the NVIS team, whereas the firmware packages may be downloaded from the NVIDIA Application Hub.
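
To inspect which versions a SOT file expects before starting an upgrade, you could walk its structure with a short script. The Python sketch below is an illustration only: it uses the field names visible in the truncated Source of Truth snippet in this section (Milestones, BoardSKUs, Components, Software, Component, Version) and embeds a reduced sample so that it runs on its own. The `expected_versions` helper is hypothetical, and a real SOT file contains more fields and component categories than are handled here.

```python
import json

# Reduced sample built from fields shown in the truncated SOT snippet.
SOT_EXAMPLE = """
{
  "ProductName": "DGX-GB300-NVL72",
  "Milestones": [
    {
      "Name": "1.0.00GA",
      "BoardSKUs": [
        {
          "Name": "P4059",
          "Components": {
            "Software": [
              {"Component": "DOCA_Host", "Version": "3.1.0-091513"}
            ]
          }
        }
      ]
    }
  ]
}
"""

def expected_versions(sot):
    """Map (board SKU name, component) to the expected version string."""
    versions = {}
    for milestone in sot.get("Milestones", []):
        for sku in milestone.get("BoardSKUs", []):
            # Only the "Software" category is read in this sketch.
            for entry in sku.get("Components", {}).get("Software", []):
                versions[(sku["Name"], entry["Component"])] = entry["Version"]
    return versions

sot = json.loads(SOT_EXAMPLE)
for (sku_name, component), version in expected_versions(sot).items():
    print(sku_name, component, version)
```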

Source of Truth Snippet (truncated)#

{
    "ProductName": "DGX-GB300-NVL72",
    "SOTUniqueID": "1",
    "SOTType": "Release",
    "Milestones": [
        {
            "TemplateVersion": "0.5",
            "Id": "f32d9ee4-2df4-4544-9f2a-e4e19d7cd894",
            "Name": "1.0.00GA",
            "State": "Onboarded",
            "ReleaseDate": "2025-09-04T16:49:58.186442",
            "ReleaseCustomers": [],
            "Tests": [],
            "Packages": [],
            "BoardSKUs": [
                {
                    "SKUID": "699-24764-0001-TS3,699-24764-0001-TS1,692-24764-0001-000",
                    "Name": "P4059",
                    "Components": {
                        "Software": [
                            {
                                "Component": "DOCA_Host",
                                "Version": "3.1.0-091513",
                                "External": true,
                                "Sideload": true,
                                "Informational": false,
                                "Type": "Prod",
                                "Locations": [
                                    {
                                        "Location": "https://linux.mellanox.com/public/repo/doca/3.1.0-091513/ubuntu24.04/arm64-sbsa/doca-ofed_3.1.0-091513_arm64.deb",
                                        "LocationType": "HTTP",
                                        "Distro": "Ubuntu",
                                        "Architecture": "All",
                                        "PackageName": "GB300NVL72_DOCA_MFT",
                                        "PackageSubdirectory": "",
                                        "External": false
                                    }
                                ],
                                "SubComponents": []

Ordering Constraints#

The runbooks used to perform upgrades automatically determine the ordering of applicable packages and will AUX cycle nodes when appropriate. The following paragraphs describe this ordering for informational purposes only; no action is required from the user.

For the compute tray, older firmware packages require the BMC to be upgraded prior to HMC, but this is no longer the case with modern firmware. Both BMC and HMC can be upgraded within a single AC cycle.

For the switch tray, the BMC firmware should be updated first. SBIOS and CPLD packages may be upgraded within the same AC cycle. Our runbooks take care of this ordering for you.

Coordination with other jobs#

To prevent other tasks from utilizing the nodes undergoing the upgrade process, AHR will do two things:

  1. It will tag the nodes with a special maintenance tag.

  2. Subsequently, it will drain the node via Slurm on BCM.

In particular, this will prevent other upgrade processes from interfering, and will also bypass AHR’s breakfix workflow. This tag will be automatically removed upon successful completion of the upgrade, and the node will be undrained. If there’s an issue during the upgrade process, this tag and drain state will remain for further investigation. At this point of failure, the user should review the failures, and return the nodes to undrained and remove the maintenance tag once the nodes are deemed healthy.

If you need to remove the maintenance tags after the firmware upgrade process encounters an issue, troubleshooting has completed, or even after unsuccessful breakfix triage, you may do so using the CLEAR_MAINTENANCE_TAGS runbook, specifying resource_tag (e.g. rack_name) and resource_value (e.g. B05) as parameters.
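
The tag-and-drain behavior described in this section can be pictured as a wrapper around the upgrade step. The Python sketch below is a conceptual model only: the `upgrade_with_maintenance` function and the logged action names are hypothetical, and the real tagging and draining are performed by the runbooks through AHR and BCM rather than by code like this.

```python
def upgrade_with_maintenance(node, upgrade, log):
    """Tag and drain `node`, run `upgrade`, and restore the node only on success."""
    log.append(("tag_maintenance", node))
    log.append(("drain", node))
    try:
        upgrade(node)
    except Exception as error:
        # Failure: the tag and drain state stay in place for investigation.
        log.append(("upgrade_failed", node, str(error)))
        return False
    # Success: return the node to service.
    log.append(("undrain", node))
    log.append(("untag_maintenance", node))
    return True

def failing_upgrade(node):
    raise RuntimeError("simulated failure")

success_log, failure_log = [], []
upgrade_with_maintenance("node01", lambda node: None, success_log)
upgrade_with_maintenance("node02", failing_upgrade, failure_log)
print(success_log)
print(failure_log)
```

In the failure case the log ends with the failed upgrade, which mirrors the state in which you would review the failures and then clear the tag and drain state manually.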

Performing the Upgrade#

Entry-point runbook: FIRMWARE_UPGRADE

The FIRMWARE_UPGRADE runbook is the single entry point for upgrading both compute and switch firmware (and optionally Mellanox/NVOS when the corresponding paths are provided). It performs resource scoping, maintenance tagging, drain, and then invokes the appropriate child runbooks: FIRMWARE_UPGRADE_COMPUTE when compute packages are specified, and FIRMWARE_UPGRADE_SWITCH when switch packages are specified. After upgrades, it runs AUX cycle (if enabled), undrains nodes, removes the maintenance tag, and runs firmware validation. Use this runbook for combined compute-and-switch upgrades or for compute-only or switch-only upgrades by setting only the relevant package path(s).

This section details the steps and parameters for each component. The runbooks take care of invoking the process, performing the upgrade, calling any ancillary runbooks, and ultimately performing validation of the upgraded component once complete.
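
As a mental model of how the parent runbook chooses its child runbooks, the Python sketch below maps a set of parameters to the children that would be invoked, following the rules given in the component subsections. The `child_runbooks` function is hypothetical, and treating a single Mellanox path (rather than both) as sufficient to invoke the compute child is an assumption.

```python
def child_runbooks(params):
    """Return the child runbooks a FIRMWARE_UPGRADE run would invoke."""
    def is_set(name):
        # Empty, missing, or whitespace-only values count as "not set".
        return bool(str(params.get(name) or "").strip())

    children = []
    # Compute packages and Mellanox (ConnectX/BlueField) packages go through the compute child.
    if is_set("FWPKG_DIR_PATH_COMPUTE") or is_set("FWPKG_CX_FILE_PATH") or is_set("FWPKG_BF_FILE_PATH"):
        children.append("FIRMWARE_UPGRADE_COMPUTE")
    # Switch tray packages and the NVOS image go through the switch child.
    if is_set("FWPKG_DIR_PATH_SWITCH") or is_set("NVOS_FILE_PATH"):
        children.append("FIRMWARE_UPGRADE_SWITCH")
    return children

# Hypothetical subdirectory under the firmware directory used in this guide.
base = "/cm/shared/apps/autonomous-hardware-recovery/var/firmware/example"
print(child_runbooks({"FWPKG_DIR_PATH_COMPUTE": base, "FWPKG_DIR_PATH_SWITCH": ""}))
print(child_runbooks({"FWPKG_DIR_PATH_COMPUTE": base, "FWPKG_DIR_PATH_SWITCH": base}))
```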

_images/firmware_upgrade_parent_flowchart.png

Figure: Firmware Upgrade (high-level). Detailed flows: Compute Tray (fw_image01), Switch Tray (fw_image02), Switch NVOS (fw_image03), Mellanox (mellanox_fw_upgrade_flowchart).

Compute Tray#

_images/fw_image01.png

Prerequisites#

  1. Download the firmware upgrade packages.

  2. Obtain the SOT JSON file from the NVIS team.

  3. On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.

    • Upgrade packages use the following naming convention:

      • nvfw_DGX*

      • nvfw_HGX*

Upgrading#

From the Runbooks view, select “Create Run” for the FIRMWARE_UPGRADE runbook (entry point for compute and switch). To upgrade compute firmware, provide the following parameters. The parent runbook invokes FIRMWARE_UPGRADE_COMPUTE when FWPKG_DIR_PATH_COMPUTE is set.

Required parameters:

  1. resource_tag – Tag for flex resource query (use rack_name for rack filtering).

  2. resource_value – Value for flex resource query (e.g., rack name or regex: A01, m06|m07).

  3. FW_SOURCE_JSON_PATH – Path to the golden configuration JSON file included in the firmware package. This file defines the reference settings used for validation. If the file is not available, set this parameter to NA.

  4. FWPKG_DIR_PATH_COMPUTE – Directory containing the compute firmware upgrade packages (nvfw_DGX*, nvfw_HGX*). Set to empty if you are only upgrading switch.

Optional parameters (commonly used):

  • FWPKG_DIR_PATH_SWITCH – Directory of the switch firmware packages. Set if you also want to upgrade switch in the same run; leave empty for compute-only.

  • NVOS_FILE_PATH – Full path to the NVOS bin file (if upgrading switch NVOS).

  • FWPKG_CX_FILE_PATH – Full path to the ConnectX firmware package. Set this (and FWPKG_BF_FILE_PATH) if you also need to upgrade Mellanox (ConnectX/BlueField) in the same run as compute.

  • FWPKG_BF_FILE_PATH – Full path to the BlueField firmware package. Set this (and FWPKG_CX_FILE_PATH) if you also need to upgrade Mellanox in the same run as compute.

  • FORCE_UPGRADE – When set to true, forces the firmware upgrade to proceed regardless of version checks; false follows standard upgrade rules.

  • IGNORE_LIST – List of nodes to exclude from scope (e.g., node01, or “none”).

  • AUX_CYCLE – When true (default), performs an AUX power cycle after upgrade; set to false to skip.

When you need to upgrade compute and Mellanox together, run FIRMWARE_UPGRADE with FWPKG_DIR_PATH_COMPUTE set and FWPKG_CX_FILE_PATH and FWPKG_BF_FILE_PATH set to the ConnectX and BlueField package paths. When you need to upgrade only compute firmware, run FIRMWARE_UPGRADE with FWPKG_DIR_PATH_COMPUTE set and FWPKG_DIR_PATH_SWITCH left empty. For advanced scenarios you may run the BreakFix_Firmware_Upgrade_Compute_nvfwupd runbook directly (e.g., single rack).

Switch Tray#

_images/fw_image02.png

Prerequisites#

  1. Download the firmware upgrade packages.

  2. Obtain the golden configuration JSON file from the NVIS team.

  3. On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.

    • Upgrade packages use the following naming convention:

      • *nvfw*_0004_*

      • *nvfw*_0006_*

      • *nvfw*_0007_*
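
Before starting a run, it can be useful to confirm that the files you staged match the naming conventions for compute and switch packages. The Python sketch below checks file names against the patterns listed in this guide using shell-style wildcards. The `classify_package` helper and the sample file names (including the `.fwpkg` extension) are hypothetical.

```python
from fnmatch import fnmatchcase

# Naming conventions listed in the compute and switch prerequisites.
COMPUTE_PATTERNS = ["nvfw_DGX*", "nvfw_HGX*"]
SWITCH_PATTERNS = ["*nvfw*_0004_*", "*nvfw*_0006_*", "*nvfw*_0007_*"]

def classify_package(filename):
    """Return "compute", "switch", or None for an unrecognized file name."""
    if any(fnmatchcase(filename, pattern) for pattern in COMPUTE_PATTERNS):
        return "compute"
    if any(fnmatchcase(filename, pattern) for pattern in SWITCH_PATTERNS):
        return "switch"
    return None

# Hypothetical file names used only to exercise the patterns.
for name in ["nvfw_DGX_example.fwpkg", "example_nvfw_tray_0004_v1.fwpkg", "readme.txt"]:
    print(name, classify_package(name))
```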

Upgrading#

Note that this upgrade drains all 18 of the rack’s nodes from the Slurm pool, tags them with the maintenance tag, and subsequently AUX cycles them.

To begin, from the Runbooks view, select “Create Run” for the FIRMWARE_UPGRADE runbook (entry point). To upgrade switch firmware, set FWPKG_DIR_PATH_SWITCH (and optionally NVOS_FILE_PATH for NVOS). The parent runbook invokes FIRMWARE_UPGRADE_SWITCH when FWPKG_DIR_PATH_SWITCH is set. Provide the same resource_tag, resource_value, and FW_SOURCE_JSON_PATH as for compute.

Required parameters (when upgrading switch):

  1. resource_tag – Tag for flex resource query (use rack_name for rack filtering).

  2. resource_value – Value for flex resource query (e.g., A01).

  3. FW_SOURCE_JSON_PATH – Path to the golden configuration JSON file provided within the firmware package. If the file is not available, set this parameter to NA.

  4. FWPKG_DIR_PATH_SWITCH – Directory of the switch firmware upgrade packages. Set to empty if you are only upgrading compute.

Optional parameters:

  • NVOS_FILE_PATH – Full path to the NVOS bin file if upgrading switch NVOS.

  • FORCE_UPGRADE, AUX_CYCLE – Same as for compute.

When you need to upgrade only switch firmware, run FIRMWARE_UPGRADE with FWPKG_DIR_PATH_SWITCH set and FWPKG_DIR_PATH_COMPUTE left empty. The FIRMWARE_UPGRADE_SWITCH child runbook is invoked by the parent and is not intended to be run directly. For advanced scenarios you may run the BreakFix_Switch_BMC_Upgrade_Nvfwupd runbook directly.

Switch NVOS#

_images/fw_image03.png

Prerequisites#

  1. Download the NVOS upgrade package bin file.

  2. Obtain the golden configuration JSON file from the NVIS team.

  3. On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.

Upgrading#

Use the FIRMWARE_UPGRADE runbook (entry point), same as for Switch Tray and Compute. To upgrade switch NVOS, set NVOS_FILE_PATH and optionally FWPKG_DIR_PATH_SWITCH (if also upgrading switch tray firmware). The parent runbook invokes FIRMWARE_UPGRADE_SWITCH, which performs both switch tray firmware and NVOS upgrade when NVOS_FILE_PATH is provided.

Required parameters:

  1. resource_tag – Tag for flex resource query (use rack_name for rack filtering).

  2. resource_value – Value for flex resource query (e.g., A01).

  3. FW_SOURCE_JSON_PATH – Path to the golden configuration JSON file provided within the firmware package. If the file is not available, set this parameter to NA.

  4. NVOS_FILE_PATH – Full path to the NVOS bin file including the file itself.

Optional: FWPKG_DIR_PATH_SWITCH – Set if you are also upgrading switch tray firmware in the same run; leave empty if only upgrading NVOS.

Alternatively, run BreakFix_NVOS_Upgrade directly with resource_value, NVOS_FILE_PATH, FW_SOURCE_JSON_PATH, and INPUT_SWITCH (from flex query or parent context).

Mellanox#

_images/mellanox_fw_upgrade_flowchart.png

Figure: Mellanox FW Upgrade (ConnectX and BlueField) — from BreakFix_Firmware_Upgrade_Mellanox runbook.

Prerequisites#

  1. Download the BlueField and ConnectX upgrade packages.

  2. Obtain the golden configuration JSON file from the NVIS team.

  3. On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.

Upgrading#

Use the FIRMWARE_UPGRADE runbook (entry point), same as for Compute and Switch. To upgrade Mellanox (ConnectX and BlueField) firmware, set FWPKG_CX_FILE_PATH and FWPKG_BF_FILE_PATH; you can set FWPKG_DIR_PATH_COMPUTE to a directory (or leave empty if only Mellanox). The parent runbook invokes FIRMWARE_UPGRADE_COMPUTE, which performs ConnectX and BlueField firmware upgrade when these paths are provided.

Required parameters:

  1. resource_tag – Tag for flex resource query (use rack_name for rack filtering).

  2. resource_value – Value for flex resource query (e.g., A01, or regex).

  3. FW_SOURCE_JSON_PATH – Path to the golden configuration JSON file provided within the firmware package. If the file is not available, set this parameter to NA.

  4. FWPKG_CX_FILE_PATH – Full path to the ConnectX firmware package file.

  5. FWPKG_BF_FILE_PATH – Full path to the BlueField firmware package file (either .bin or .bfb package).

Optional: FWPKG_DIR_PATH_COMPUTE – Directory of compute firmware packages if you are also upgrading compute tray in the same run; leave empty for Mellanox-only.

Alternatively, run BreakFix_Firmware_Upgrade_Mellanox directly with FWPKG_CX_FILE_PATH, FWPKG_BF_FILE_PATH, FW_SOURCE_JSON_PATH, and INPUT_COMPUTE (from flex query or parent context).

Powershelf#

GB300 supports firmware upgrades for power shelves (LiteOn and Delta). Use the FIRMWARE_UPGRADE_POWERSHELF runbook. It upgrades PMC (Power Management Controller) first, then PSU, and finally runs SINGLENODE_HEALTHCHECK_FIRMWARE_POWERSHELF for validation.

Prerequisites#

  1. Obtain the PSU and PMC firmware packages and the golden configuration JSON file (from the NVIS team or firmware package).

  2. On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.

Upgrading#

From the Runbooks view, select “Create Run” for the FIRMWARE_UPGRADE_POWERSHELF runbook. Required parameters (per runbook):

  1. resource_tag – Tag for flex resource query (use rack_name for rack filtering).

  2. resource_value – Value for flex resource query (e.g., A01, or regex).

  3. IGNORE_LIST – List of nodes to ignore for baseline tests. Accepted formats: single node (e.g. node01), list (e.g. [“node01”, “node02”]), pipe-delimited (e.g. node01|node02), or “none” if no nodes should be ignored.

  4. PSU_FILE_FULL_PATH – Full path to the directory and file name of the PSU firmware package.

  5. PMC_FILE_FULL_PATH – Full path to the directory and file name of the PMC firmware package.

  6. FW_SOURCE_JSON_PATH – Path to the golden configuration JSON file provided within the firmware package. This file defines the reference settings used for validation.

  7. FORCE_UPGRADE – When set to true, forces the firmware upgrade to proceed regardless of version checks or validation; false follows standard upgrade rules.
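
The accepted IGNORE_LIST formats listed above can be normalized into a plain list of node names. The Python sketch below shows one way to do this. The `parse_ignore_list` helper is hypothetical and is not how the runbook itself parses the value, and treating the bracketed list form as JSON is an assumption.

```python
import json

def parse_ignore_list(value):
    """Normalize an IGNORE_LIST value into a list of node names."""
    text = value.strip()
    if not text or text.lower() == "none":
        return []
    if text.startswith("["):
        # Replace typographic quotes so the bracketed form parses as JSON.
        text = text.replace("\u201c", '"').replace("\u201d", '"')
        return [str(item) for item in json.loads(text)]
    # A single node or a pipe-delimited string.
    return [name.strip() for name in text.split("|") if name.strip()]

for raw in ["node01", '["node01", "node02"]', "node01|node02", "none"]:
    print(repr(raw), parse_ignore_list(raw))
```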

The runbook invokes POWERSHELF_COMPONENT_UPGRADE for PMC first, then PSU, and finally SINGLENODE_HEALTHCHECK_FIRMWARE_POWERSHELF to verify the upgrade.

Powershelf firmware upgrade workflow

Figure: Powershelf Firmware Upgrade Workflow – This diagram illustrates the end-to-end Powershelf firmware upgrade process. FIRMWARE_UPGRADE_POWERSHELF obtains the target powershelves via flex query, then invokes POWERSHELF_COMPONENT_UPGRADE for PMC (Power Management Controller) first and PSU next, each with optional exit on failure. The workflow concludes with SINGLENODE_HEALTHCHECK_FIRMWARE_POWERSHELF to validate the upgrade against the golden configuration JSON.

BF Firmware Bundle Extraction Guide#

This guide explains how to extract a BF firmware package provided as a BF Bundle (.bfb).

Important:
These steps must be performed on the compute node, as it already has the bfb-tool utility installed.

Prerequisites#

Ensure the following dependency is installed:

sudo apt install -y qemu-user-static

Extracting the BF Bundle

Note: The --bfb argument must use the complete (absolute) path to the .bfb file. Relative paths may cause the extraction to fail.

bfb-tool extract \
  --bfb /path/to/bf-fwbundle-<version>-prod.bfb \
  --opn 900-9D3B6-00CN-P_Ax

Using the Extracted Firmware

After extraction, go to the folder created in /tmp (named after the .bfb file). Inside this folder, open the subfolder corresponding to your OPN (e.g., 900-9D3B6-00CN-P_Ax). In that subfolder, locate the .bin firmware file, which should be used as the input to the runbook.
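
If you prefer to locate the extracted firmware file with a script, the Python sketch below searches the expected folder layout for `.bin` files. The `find_extracted_bins` helper is hypothetical, and it assumes that the folder under /tmp carries the name of the .bfb file without its extension; verify the actual folder name on your system. The demonstration runs against a temporary directory that imitates this layout, with made-up file names.

```python
import tempfile
from pathlib import Path

def find_extracted_bins(bfb_path, opn, tmp_root="/tmp"):
    """Return the .bin files found under <tmp_root>/<bundle folder>/<opn>/."""
    # Assumption: the folder is named after the .bfb file without its extension.
    bundle_dir = Path(tmp_root) / Path(bfb_path).stem
    return sorted((bundle_dir / opn).glob("*.bin"))

# Demonstration against a throwaway directory that mimics the layout.
with tempfile.TemporaryDirectory() as root:
    opn = "900-9D3B6-00CN-P_Ax"
    fake_dir = Path(root) / "bf-fwbundle-example-prod" / opn
    fake_dir.mkdir(parents=True)
    (fake_dir / "firmware-example.bin").touch()
    found = find_extracted_bins("/path/to/bf-fwbundle-example-prod.bfb", opn, tmp_root=root)
    found_names = [path.name for path in found]

print(found_names)
```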

Troubleshooting#

1. Excluding a node from upgrade or runbook cells#

To exclude a particular node, you can add a cell near the start of your runbook (after INPUT_SWITCH is exported) with something like the following:

INPUT_SWITCH | name != "<node_name>" | export("INPUT_SWITCH")

This will export all the resources except <node_name> back into the INPUT_SWITCH variable.

Firmware upgrade includes a series of steps to ensure the nodes are removed from jobs being scheduled, complete the upgrade, and place the nodes back in service. Below are some common issues you may encounter during the firmware upgrade:

2. Nodes not reachable from headnode#

To upgrade firmware, the BMC IP must be accessible from the headnode. The runbook verifies node accessibility and automatically skips unreachable nodes.

Action: Ensure the node is online and accessible from the headnode, then rerun the firmware upgrade runbook.

_images/fw_image04.png

3. Failed firmware upgrade#

The firmware upgrade failed to complete successfully. Possible causes include failures in the nvfwupd command, NVOS, or the flint command, depending on the package.

Action: Verify logs by clicking on the Output, which includes the command’s stdout and stderr.

_images/fw_image05.png _images/fw_image06.png

Note that the runbook does not automatically undrain or untag maintenance when the firmware upgrade fails. After verifying that the failures are safe to ignore and the nodes are ready to return to the pool, undrain and untag the nodes using the UNDRAIN_AND_UNTAG_RACKS runbook. Provide resource_tag (e.g. rack_name) and resource_value (e.g. B05 or m06|m07) when prompted.

_images/fw_image07.png

4. SSH failures for Switch Upgrade runbooks#

The switch firmware upgrade runs commands via SSH. The runbook dynamically retrieves the user and password from BCM using cmsh commands, which are then used to run the SSH commands.

Action: Verify SSH access to the switch from the headnode using the credentials stored in BCM.

_images/fw_image08.png

5. Failed validation stage#

The last step in every firmware upgrade is validation. The runbook selects a subset of tests to verify upgrade success. Failures may result from upgrade issues, incorrect SOT JSON, or command failures when fetching component versions.

Action: Compare the expected and actual versions in the logs and check for any other errors.
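
When comparing expected and actual versions from the logs, it helps to reduce the comparison to the components that differ. The Python sketch below shows the idea. The `version_mismatches` helper, the `ExampleBMC` component, and the "actual" values are hypothetical, and a component whose version could not be fetched is represented as missing from the actual data.

```python
def version_mismatches(expected, actual):
    """Return {component: (expected, actual)} for every component that differs."""
    mismatches = {}
    for component, want in expected.items():
        got = actual.get(component)  # None when the version could not be fetched
        if got != want:
            mismatches[component] = (want, got)
    return mismatches

# Hypothetical data: one matching component, one stale, one not reported.
expected = {"DOCA_Host": "3.1.0-091513", "ExampleBMC": "2.0", "ExampleCPLD": "1.4"}
actual = {"DOCA_Host": "3.1.0-091513", "ExampleBMC": "1.9"}

for component, (want, got) in version_mismatches(expected, actual).items():
    print(f"{component}: expected {want}, found {got}")
```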

Notes#

  • Netcat is used to check if nodes are back online after a reboot.

  • The runbook cannot exclude a subset of nodes. This means that if any nodes are down, the runbook will skip those nodes and upgrade the others.

  • Multiple racks cannot be upgraded simultaneously.

  • If nvfwupd does not perform an upgrade (for example, because the component is already at the specified version) and FORCE_UPGRADE is not set to true, the runbook will exit after removing the maintenance tag.
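
The first note above mentions that Netcat is used to check whether nodes are back online. If you want to perform a similar reachability check yourself, the Python sketch below attempts a TCP connection with a timeout. The `port_is_open` helper is hypothetical, and which port the runbook actually probes is not stated in this guide, so the demonstration uses a temporary local listener instead of a real node.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

# Demonstration with a local listener on an ephemeral port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

open_while_listening = port_is_open("127.0.0.1", port)
listener.close()
open_after_close = port_is_open("127.0.0.1", port, timeout=1.0)
print(open_while_listening, open_after_close)
```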

NVIDIA Mission Control autonomous hardware recovery Break/Fix Workflow#

Break/Fix Introduction#

NVIDIA Mission Control autonomous hardware recovery provides automated break/fix workflows to handle tray failures for GB300. These workflows execute a series of diagnostic steps to determine the cause of the failure, take the necessary repair steps, and create Support tickets for issues that cannot be resolved automatically.

The automated break/fix workflow is designed to efficiently diagnose and remediate issues, with clear paths for different failure scenarios and comprehensive validation to ensure systems are properly restored to service.

_images/image93.png

Figure: Compute Break/Fix Workflow (high-level) — Trigger, Triage, Validation, then Return to service or Run diag / Support ticket.

_images/image94.png

Figure: Compute Break/Fix Workflow (detailed) — Explains each phase: 1. Triage (connectivity check, SRAM UC check, leak detection, power cycle, GPU recovery), 2. Verification (BREAKFIX_COMPUTE_TRAY_VALIDATION, tests passed?), 3. Run diag (BREAKFIX_DIAG_DUMP, support ticket creation). All paths from triage converge on validation; failed validation leads to diagnostic dump and support ticket; success leads to return to service.

Key Features#

  • Automatic Detection: Identifies drained nodes without manual intervention

  • Intelligent Triage: Routes to appropriate diagnostic workflows based on failure symptoms

  • Comprehensive Diagnostics: Performs thorough hardware and software checks

  • Automated Remediation: Attempts to resolve issues without human intervention when possible

  • Detailed Reporting: Provides comprehensive logs for RMA or further troubleshooting

Entrypoint of Break/Fix Workflow#

A centralized automated break/fix interface has been established to facilitate streamlined diagnostics and remediation. This unified entry point provides comprehensive access to the break/fix framework, enabling efficient navigation and implementation of all remediation procedures.

To access the break/fix interface:

  • Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.

  • Locate the “BREAKFIX_TRIGGER” runbook through the search functionality

  • Upon selecting the “BREAKFIX_TRIGGER” runbook, you will be presented with the interface showing the workflow for automated break/fix.

    _images/image95.png

Run Break/Fix Workflow#

The BREAKFIX_TRIGGER is a time-triggered runbook that automatically runs every 5 minutes. When executed, it:

  • Gets drained nodes from BCM and proceeds only if the drain reason is one of the allowed reasons. Otherwise the runbook exits with no action.

  • Allowed drain reasons (runbook exits if no nodes have these reasons): Excluded by ARE, Prolog/Epilog error, Kill task failed, Duplicate jobid, Low RealMemory, Drained by CMDaemon, Test breakfix flow, or Not responding.

  • Processes one node per execution cycle from the drained node pool; applies a maintenance tag in AHR to prevent duplicate processing, and ensures drained nodes are handled sequentially across multiple runs.

  • Verifies the AHR agent is connected and accepts commands before proceeding; exits if the agent does not accept commands.

  • Gets the drain reason from BCM, updates it with “(AHR in-progress)”, and re-drains the node via BCM with the updated reason.

  • Invokes BREAKFIX_COMPUTE_TRAY_TRIAGE(INPUT, DRAIN_REASON) asynchronously to perform triage on the selected compute node.

  • No manual intervention is required to start the process.
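
The node selection performed in each cycle can be sketched as follows. This Python example is a simplified model of the steps listed above, not the runbook's implementation: the `pick_node` function and the sample nodes are hypothetical, and matching a drain reason by case-insensitive substring is an assumption.

```python
ALLOWED_DRAIN_REASONS = (
    "Excluded by ARE", "Prolog/Epilog error", "Kill task failed", "Duplicate jobid",
    "Low RealMemory", "Drained by CMDaemon", "Test breakfix flow", "Not responding",
)
IN_PROGRESS_SUFFIX = "(AHR in-progress)"

def pick_node(drained, in_maintenance):
    """Pick at most one drained node to triage, returning (node, updated_reason)."""
    for node, reason in drained.items():
        # Nodes already tagged for maintenance are being handled elsewhere.
        if node in in_maintenance:
            continue
        # Assumption: a reason is allowed when it contains one of the listed phrases.
        if any(allowed.lower() in reason.lower() for allowed in ALLOWED_DRAIN_REASONS):
            return node, f"{reason} {IN_PROGRESS_SUFFIX}"
    return None  # nothing to do in this cycle

# Hypothetical drained nodes and their drain reasons.
drained = {
    "node01": "Not responding",
    "node02": "Kill task failed",
    "node03": "Manual maintenance",
}
print(pick_node(drained, in_maintenance={"node01"}))
```

Because only one node is returned per call, the remaining drained nodes are picked up by later cycles, which matches the sequential handling described above.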

The BREAKFIX_TRIGGER runbook is shared across GB300, GB200, and B200. It has no user-configurable scope parameters; it discovers drained nodes from BCM automatically.

You can see the time trigger settings on the right side of the Runbook under Triggers, which shows that the trigger is currently enabled and runs every 5 minutes.

_images/image96.png

You can also manually trigger the workflow:

  • Navigate to the “BREAKFIX_TRIGGER” runbook in the Runbooks section

  • Select the “Create Run” button positioned in the top right corner of the interface

    _images/image97.png

  • After initiating the process, a confirmation dialog will appear with a “View Run” link

  • Selecting this link will redirect you to a page displaying comprehensive job status and details

The automated nature of this workflow ensures that system issues are addressed promptly without requiring constant monitoring or manual intervention.

Break/Fix Workflow Components#

The break/fix system consists of several key components that work together to diagnose and remediate issues with compute nodes. The following is a detailed explanation of each component:

BREAKFIX_TRIGGER#

The entry-point runbook (shared for GB300, GB200, and B200 compute break/fix) that:

  • Runs automatically every 5 minutes via time trigger.

  • Gets drained nodes from BCM and only proceeds if the drain reason is one of the allowed reasons: Excluded by ARE, Prolog/Epilog error, Kill task failed, Duplicate jobid, Low RealMemory, Drained by CMDaemon, Test breakfix flow, or Not responding.

  • Processes one node per run: selects a drained node not already in maintenance, verifies AHR agent accepts commands, sets the maintenance tag, gets and updates the drain reason with “(AHR in-progress)”, re-drains via BCM, then invokes BREAKFIX_COMPUTE_TRAY_TRIAGE(INPUT, DRAIN_REASON) asynchronously.

  • Has no user-facing parameters; node discovery and scope are driven by BCM drained state.
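The selection logic described above can be sketched in Python. This is a minimal illustration, not the runbook's actual implementation: the `DrainedNode` type, its field names, and the `select_node` and `mark_in_progress` helpers are hypothetical, and matching a drain reason by prefix is an assumption. Only the list of allowed reasons and the “(AHR in-progress)” marker come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Allowed drain reasons, as listed in the text above.
ALLOWED_DRAIN_REASONS = (
    "Excluded by ARE",
    "Prolog/Epilog error",
    "Kill task failed",
    "Duplicate jobid",
    "Low RealMemory",
    "Drained by CMDaemon",
    "Test breakfix flow",
    "Not responding",
)
IN_PROGRESS_MARKER = "(AHR in-progress)"


@dataclass
class DrainedNode:
    """Hypothetical view of a drained node; field names are illustrative."""
    name: str
    drain_reason: str
    in_maintenance: bool = False


def is_allowed(reason: str) -> bool:
    # Assumption: a reason qualifies if it starts with an allowed string.
    return any(reason.startswith(allowed) for allowed in ALLOWED_DRAIN_REASONS)


def select_node(nodes: Iterable[DrainedNode]) -> Optional[DrainedNode]:
    """Return the first eligible node, or None; one node is handled per run."""
    for node in nodes:
        if not node.in_maintenance and is_allowed(node.drain_reason):
            return node
    return None


def mark_in_progress(reason: str) -> str:
    """Append the in-progress marker unless it is already present."""
    if IN_PROGRESS_MARKER in reason:
        return reason
    return f"{reason} {IN_PROGRESS_MARKER}"
```

Returning a single node per call mirrors the one-node-per-run behavior: nodes already tagged for maintenance are skipped, so successive runs work through the drained pool sequentially.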

BREAKFIX_COMPUTE_TRAY_TRIAGE#

This runbook is automatically triggered by BREAKFIX_TRIGGER and performs comprehensive triage on drained compute trays.

_images/image98.png

The triage follows one of two workflows, depending on whether the compute node responds:

Workflow for Unresponsive Compute Nodes#
  1. Initial Assessment

    • Tests connectivity using ping to check if compute nodes are responsive or unresponsive

    • Identifies nodes that are already unresponsive and require recovery

  2. Leak Detection

    • Checks if any leaking is reported through BCM

    • Creates Support ticket immediately for any nodes with detected leaking (if opted in to support ticket service)

  3. Recovery Process for Non-Leaking Nodes

    • Initiates power cycle for nodes without leaking issues

    • Waits and checks if hosts come back online

    • Waits until the AHR agent is connected to confirm successful recovery

  4. Failure Handling

    • Creates Support ticket for hosts that fail to start up (if opted in to support ticket service)

  5. Validation
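Steps 2 through 4 of the unresponsive-node workflow can be read as a small decision function. The sketch below is illustrative only: the `Outcome` enum and the `triage_unresponsive` function are hypothetical names, the boolean inputs stand in for checks the runbook performs against BCM and the AHR agent, and treating a missing agent connection as a failed recovery is an assumption. Step 5 (Validation) is not modeled.

```python
from enum import Enum


class Outcome(Enum):
    SUPPORT_TICKET = "create support ticket (if opted in)"
    RECOVERED = "recovered"


def triage_unresponsive(leak_reported: bool,
                        online_after_power_cycle: bool,
                        agent_connected: bool) -> Outcome:
    """Decide the outcome for one unresponsive node (steps 2 to 4)."""
    # Step 2: a leak reported through BCM goes straight to a support
    # ticket; no power cycle is attempted for that node.
    if leak_reported:
        return Outcome.SUPPORT_TICKET
    # Step 3: after the power cycle, the host must come back online and
    # the AHR agent must reconnect for the recovery to count as successful.
    if online_after_power_cycle and agent_connected:
        return Outcome.RECOVERED
    # Step 4: hosts that fail to start up get a support ticket.
    return Outcome.SUPPORT_TICKET
```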

Workflow for Responsive Compute Nodes#
  1. GPU Recovery Assessment

GPU_RECOVERY#

This specialized diagnostic runbook is automatically invoked by BREAKFIX_COMPUTE_TRAY_TRIAGE when GPU-related issues are detected.

_images/image99.png

The runbook proceeds in two stages:

  1. Verification and Assessment

    • Verifies the node is still drained from BCM

    • Categorizes recovery actions (Reboot, Reset, or None)

  2. Recovery Actions Based on Type

    • For Reboot Action:

      • Reboots the node requiring GPU reboot

      • Waits for host to come back online

      • Automatically runs BREAKFIX_COMPUTE_TRAY_VALIDATION if host is up

      • Creates Support ticket for hosts that fail to start up (if opted in to support ticket service)

    • For Reset Action:

    • For No Action Required:

BREAKFIX_COMPUTE_TRAY_VALIDATION#

This runbook is automatically executed after remediation actions to validate system health as part of the automated workflow following triage and recovery operations.

_images/image100.png

The runbook validates system health in two stages:

  1. Comprehensive Testing

    • Runs testing suites to validate the compute tray

    • Executes the HPL test (HPL uses mpirun for single-node execution instead of Slurm in break/fix scenarios)

    • Runs AHR prolog script to prevent undraining of nodes that are still failing prolog checks, since undraining will result in them being drained again at the next Slurm invocation

  2. Result Handling

    • For failed tests, automatically runs BREAKFIX_DIAG_DUMP for detailed diagnostics

    • For passed tests, undrains/untags the host to return it to service

BREAKFIX_DIAG_DUMP#

This runbook is automatically triggered when validation tests fail, collecting comprehensive diagnostic information for support ticket creation.

_images/image101.png

The runbook performs the following steps:

  • Runs NVSSVT (NVIDIA System Software Validation Toolkit)

  • Collects NVSM (NVIDIA System Management) health dumps

  • Executes EUD (End User Diagnostics)

  • Runs Partnerdiag if necessary

  • Creates a consolidated diagnostic log dump package

  • Generates a Support ticket with the diagnostic log attached (if opted in to support ticket service)

Prerequisites for EUD and Partnerdiag:

  • EUD: Binary must be installed on every compute node for execution

  • Partnerdiag: Binary must be installed on the head node under path /cm/shared/partnerdiag

  • If these binaries are not properly installed, EUD and Partnerdiag will be skipped during diagnostic collection

View Break/Fix Result#

Users can monitor break/fix operations and determine outcomes through multiple methods:

Accessing Break/Fix Results#

Via Runbook Execution View:

  1. Click “Runs” in the upper left corner

  2. Filter by “BREAKFIX_TRIGGER” to see all break/fix executions

  3. Select a specific run to view detailed execution flow

    _images/image116.png

Via Resource Run History:

  1. Navigate to “Resources” in the left panel and search for the resource (e.g., “b06-p1-dgx-06-c05”)

    _images/image117.png
  2. Click on the resource name

    _images/image118.png
  3. View the “Run History” page which displays all runbooks the resource participated in, with execution timestamps and status

    _images/image119.png
  4. Filter or search for BREAKFIX related runbooks to see the complete history of remediation attempts for that specific node

    _images/image120.png

Understanding Break/Fix Outcomes#

Successful Recovery Indicators:

  • Node status changes from “DRAINED” to “IDLE” or “ALLOCATED” in BCM

  • Maintenance tag is removed from the node

  • BREAKFIX_COMPUTE_TRAY_VALIDATION shows “PASSED” status

  • Node is automatically returned to service

Failed Recovery Indicators:

  • Node remains in “DRAINED” state

  • Support ticket is automatically created (if opted in to support ticket service)

  • BREAKFIX_DIAG_DUMP execution indicates diagnostic collection

  • Maintenance tag remains on the node

  • Drain reason is updated to add “(AHR complete)” to indicate AHR processing has finished. Note that “(AHR in-progress)” indicates Break/Fix is still processing the node.
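Because the drain reason carries the AHR processing state, a script that reads drain reasons can classify nodes by looking for the two markers. The helper below is a minimal sketch: the function name `ahr_status` and its return labels are hypothetical, and how the drain reason string is obtained from BCM is left out.

```python
def ahr_status(drain_reason: str) -> str:
    """Classify AHR processing state from the markers in a drain reason."""
    # Check the completion marker first in case both markers are present.
    if "(AHR complete)" in drain_reason:
        return "complete"
    if "(AHR in-progress)" in drain_reason:
        return "in-progress"
    # No marker: Break/Fix has not picked up this node.
    return "not-processed"
```

For example, `ahr_status("Not responding (AHR in-progress)")` returns `"in-progress"`.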

Determining Recovery Path#

GPU Recovery Path:

  • Look for GPU_RECOVERY runbook execution in the workflow

  • Check if GPU reboot or reset actions were performed

  • Validation results indicate GPU functionality restoration

Power Cycle Recovery Path:

  • BREAKFIX_COMPUTE_TRAY_TRIAGE shows auxiliary power cycle execution

  • Node connectivity tests show successful recovery

Monitoring Ongoing Operations#

  • Break/fix operations run every 5 minutes automatically

  • Check the “maintenance” tag to see which nodes are currently being processed

  • Review recent BREAKFIX_TRIGGER executions to track system-wide break/fix operations

NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA#

Break/Fix Post RMA Introduction#

The NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA workflow automates the process of bringing hardware components back into service after a Return Merchandise Authorization (RMA) replacement. This workflow ensures that replaced hardware is properly configured, firmware is updated to the correct versions, and the component is thoroughly validated before returning to production.

_images/image102.png

Figure: Break/Fix Post RMA Workflow - This diagram illustrates the comprehensive post-RMA process for compute tray replacement, starting with BCM inventory updates for new MAC addresses, followed by BMC credential management, OPROM configuration, and boot order correction. The workflow then proceeds through firmware updates to ensure correct versions, concludes with system validation including agent connectivity verification and comprehensive testing via BREAKFIX_COMPUTE_TRAY_VALIDATION, and automatically returns validated hardware to service.

Key Features#

  • Automated Configuration: Configures replaced hardware components with proper settings

  • Firmware Updates: Updates firmware to match the required versions for the environment

  • Boot Order Correction: Ensures proper boot sequence for reliable operation

  • Comprehensive Validation: Performs thorough testing to verify hardware functionality

  • Seamless Integration: Automatically returns validated hardware to service

Post RMA Workflow Components#

The Post RMA workflow consists of several key steps that ensure replaced hardware is properly configured and validated:

Physical Replacement Procedures#

  • Compute Tray Removal: Detailed step-by-step instructions for safely removing failed compute trays, including power down procedures, cable disconnection, and proper handling

  • Compute Tray Installation: Comprehensive installation guide covering component migration (M.2 boot drive, E1.S cache drives, HMC, BMC, TPM), rail installation, and cable reconnection

  • Component Migration: Transfer of critical components from old tray to new tray while maintaining proper slot assignments and ESD protection

BCM Inventory Update#

  • ONLY REQUIRED FOR NEW HARDWARE: Updates BCM inventory information using the BREAKFIX_POST_RMA_UPDATE_BCM_INVENTORY runbook when a new compute tray is installed (GB300 uses a dedicated runbook with parameters for BMC MAC, BF3_1 MAC/Storage, host name, and optional tray serial number)

  • Skip this step for repaired trays as MAC addresses remain unchanged

  • Ensures MAC addresses and other hardware identifiers are correctly registered in BCM (new MAC addresses are provided by the Enterprise Support team who manage serial numbers and asset inventory for customer deployments)

  • Enables proper management and monitoring of the replaced hardware

BMC Credential Management#

  • Creates necessary BMC credential files for secure access to hardware components

  • Establishes secure communication channels for configuration operations

BlueField Configuration#

  • Checks if BlueField devices are in NIC mode

  • Enables OPROM on BlueField devices to ensure proper initialization

  • Configures hardware components for optimal operation

Boot Order Correction#

  • Ensures the boot sequence is properly configured

  • Prevents boot failures and improves system reliability

  • Performs power reset through BMC after configuration changes

Connectivity Verification#

  • Verifies SSH connectivity to compute nodes

  • Checks BCM device status to ensure proper registration

  • Confirms network accessibility before proceeding with firmware updates

Firmware Updates#

  • Upgrades compute firmware to the version in the specified package directory

  • Upgrades Mellanox BlueField and ConnectX firmware using the provided package file paths

  • Uses the golden configuration JSON (FW_SOURCE_JSON_PATH) for validation; all components are updated in a single run

System Validation#

  • Waits for hosts to come back online after each firmware update cycle

  • Verifies agent connectivity to ensure management capabilities

  • Runs comprehensive validation tests using BREAKFIX_COMPUTE_TRAY_VALIDATION

  • Opens nodes in BCM and validates Slurm readiness for successful nodes

Running the Post RMA Workflow#

Step 1: Update BCM Inventory (ONLY FOR NEW HARDWARE)#

Note: This step is ONLY required when installing a new compute tray. Skip this step for repaired trays as MAC addresses remain unchanged.

For new hardware replacement, update the BCM inventory with new MAC addresses and hardware identifiers using the BREAKFIX_POST_RMA_UPDATE_BCM_INVENTORY runbook. This runbook updates the BCM inventory accordingly and supports provisioning the compute node and running baseline tests.

  1. Navigate to the “BREAKFIX_POST_RMA_UPDATE_BCM_INVENTORY” runbook in the Runbooks section

    _images/image103.png
  2. Configure the required parameters (MAC addresses and serial numbers are provided by the Enterprise Support team):

    • HOST_NAME (required): Host name to update in BCM inventory (e.g., a07-p1-dgx-03-c08)

    • BMC_MAC (required): MAC Address of the BMC

    • BF3_1_MAC (required): MAC Address of Predictable Network Interface Name enP22p3s0f0np0

    • BF3_1_STORAGE (optional): MAC Address of Predictable Network Interface Name enP22p3s0f1np1

    • TRAY_SERIAL_NUMBER (optional): Serial Number of the Tray

  3. Select “Create Run” to initiate the BCM inventory update

    _images/image104.png
  4. Monitor the execution progress to ensure successful completion
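Before creating the run, it can help to sanity-check the values received from the Enterprise Support team. The sketch below is a hypothetical pre-flight check, not part of the runbook: `check_inventory_params` is an illustrative name, and the accepted MAC format (six hexadecimal octets separated by colons or hyphens) is an assumption about what the runbook accepts.

```python
import re

# Assumption: MAC addresses are six hex octets separated by ':' or '-'.
MAC_RE = re.compile(r"^[0-9A-Fa-f]{2}([:-][0-9A-Fa-f]{2}){5}$")

REQUIRED = ("HOST_NAME", "BMC_MAC", "BF3_1_MAC")
MAC_PARAMS = ("BMC_MAC", "BF3_1_MAC", "BF3_1_STORAGE")


def check_inventory_params(params: dict) -> list:
    """Return a list of problems; an empty list means the inputs look sane."""
    problems = []
    for name in REQUIRED:
        if not str(params.get(name) or "").strip():
            problems.append(f"{name} is required")
    for name in MAC_PARAMS:
        value = str(params.get(name) or "").strip()
        # Optional MAC parameters are only checked when a value is given.
        if value and not MAC_RE.match(value):
            problems.append(f"{name} is not a valid MAC address: {value!r}")
    return problems
```

Collecting all problems in a list, rather than stopping at the first, lets the operator correct every input in one pass.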

Step 2: Execute Main Post RMA Workflow#

After successfully updating the BCM inventory (if required for new hardware), execute the main Post RMA workflow:

  1. Navigate to the “BREAKFIX_POST_RMA” runbook in the Runbooks section

    _images/image105.png
  2. Configure the required parameters:

    • HOST_NAME (required): Name of the replaced host (e.g., a07-p1-dgx-03-c08)

    • FWPKG_DIR_PATH (required): Full path to the directory containing all firmware packages or the individual package (.fwpkg) for compute nodes

    • FWPKG_BF_FILE_PATH (required): Full path to the BlueField (BF3) firmware package file

    • FWPKG_CX_FILE_PATH (required): Full path to the ConnectX firmware package file

    • FW_SOURCE_JSON_PATH (required): Path to the golden configuration JSON from the firmware package; defines reference settings for validation. Set to NA if not available.

    • resource_tag (optional): Tag for resource query (e.g., rack_name)

    • resource_value (optional): Value for resource query (e.g., rack identifier)

  3. Configure the required secrets (if not already configured):

    • AHR_API_ENDPOINT: The API endpoint URL for NVIDIA Mission Control

    • AHR_TOKEN: Authentication token for API access

    To add or update these secrets:

    1. In the Shoreline UI, go to Settings.

    2. Click on the Secrets section.

    3. Use the + Secret button to create:

      • AHR_API_ENDPOINT — Provide the correct API endpoint.

      • AHR_TOKEN — Provide the secure API token.

    4. If a secret already exists, click its name to update the value.

    5. Click Save to persist the changes.

    🔐 These secrets will be securely injected into the action at runtime.

    _images/image106.png
  4. Select “Create Run” to initiate the workflow

    _images/image107.png
  5. Monitor the execution progress and results

Note: The workflow includes detailed physical replacement instructions that must be followed before executing the automated portions. Ensure all physical replacement steps are completed as outlined in the runbook’s markdown sections.
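The path parameters of BREAKFIX_POST_RMA can be checked before the run is created. The following is a minimal sketch under stated assumptions: `check_post_rma_paths` is a hypothetical helper, and it assumes the paths are visible from the machine where the check runs, which may not match where the runbook resolves them. The `NA` value for FW_SOURCE_JSON_PATH and the `.fwpkg` extension come from the parameter descriptions above.

```python
import os


def check_post_rma_paths(params: dict) -> list:
    """Return a list of problems found in the path parameters."""
    problems = []

    # FWPKG_DIR_PATH may be a directory of packages or a single .fwpkg file.
    fw = params.get("FWPKG_DIR_PATH", "")
    if not (os.path.isdir(fw) or (os.path.isfile(fw) and fw.endswith(".fwpkg"))):
        problems.append("FWPKG_DIR_PATH is not a directory or .fwpkg file")

    # The BlueField and ConnectX parameters each point at a single file.
    for name in ("FWPKG_BF_FILE_PATH", "FWPKG_CX_FILE_PATH"):
        if not os.path.isfile(params.get(name, "")):
            problems.append(f"{name} does not point to a file")

    # The golden configuration JSON may be set to NA when unavailable.
    golden = params.get("FW_SOURCE_JSON_PATH", "")
    if golden != "NA" and not os.path.isfile(golden):
        problems.append("FW_SOURCE_JSON_PATH is neither NA nor a file")

    return problems
```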

NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA Powershelf#

Break/Fix Post RMA Powershelf Introduction#

The Break/Fix Post RMA Powershelf workflow brings a replaced power shelf tray back into service after an RMA. It is available for GB300 and GB200 and consists of two runbooks: updating BCM inventory for the new powershelf node (when applicable), then running the main powershelf post RMA workflow to verify BCM status, identify the PowerShelf manufacturer (LiteOn or Delta), and upgrade PSU and PMC firmware to the provided versions.

_images/powershelf_post_rma_flowchart.png

Figure: Powershelf Post RMA Workflow — BCM inventory update (if new HW), Check BCM device status, Determine manufacturer (LiteOn or Delta), Upgrade PMC and PSU firmware, Return to service.

Running the Post RMA Powershelf Workflow#

Step 1: Update BCM Inventory (ONLY FOR NEW POWERSHELF HARDWARE)#

Note: This step is ONLY required when installing a new power shelf tray. Skip for repaired trays where the BMC MAC is unchanged.

Use the BREAKFIX_POST_RMA_UPDATE_POWERSHELF_BCM_INVENTORY runbook to register the new powershelf in BCM:

  1. Navigate to the “BREAKFIX_POST_RMA_UPDATE_POWERSHELF_BCM_INVENTORY” runbook in the Runbooks section.

  2. Configure the required parameters:

    • HOST_NAME (required): Host name to update in BCM inventory (e.g., b06-p01-pwr-01).

    • BMC_MAC (required): MAC address of the powershelf BMC.

  3. Select “Create Run” and monitor the execution.

Step 2: Execute Main Powershelf Post RMA Workflow#

After updating BCM inventory (if required for new hardware), run the main powershelf post RMA workflow:

  1. Navigate to the “BREAKFIX_POWERSHELF_POST_RMA” runbook in the Runbooks section.

  2. Configure the required parameters:

    • resource_value (required): Scope value (e.g., rack name) for the workflow.

    • HOST_NAME (required): Name of the replaced powershelf host (e.g., b06-p01-pwr-01).

    • PSU_FILE_FULL_PATH (required): Full path to the PSU tar file for powershelf nodes.

    • PMC_FILE_FULL_PATH (required): Full path to the PMC tar file for powershelf nodes.

    • FW_SOURCE_JSON_PATH (required): Path to the golden configuration JSON from the firmware package; defines reference settings for validation.

  3. Select “Create Run” to start the workflow. The runbook checks BCM device status, determines the PowerShelf manufacturer (LiteOn or Delta), and upgrades PSU and PMC firmware via the firmware upgrade step. Monitor execution until completion.

NVIDIA Mission Control autonomous hardware recovery Domain Triage#

Domain Triage Introduction#

The BREAKFIX_DOMAIN_TRIAGE runbook provides manual diagnostics and troubleshooting for NVSwitch and NVLink domain-level issues. Unlike the Break/Fix workflow, it is not started by BREAKFIX_TRIGGER; it runs only when triggered manually, on demand, after a domain-level problem has been identified. The workflow collects comprehensive diagnostic information at the domain level to facilitate efficient resolution and minimize system downtime. This runbook is available for GB300 and GB200 only.

_images/image108.png

Figure: Domain Triage Workflow - This diagram illustrates the comprehensive domain-level diagnostic process for NVSwitch and NVLink issues. The workflow begins with compute node management by adding maintenance tags and draining nodes to prevent workload interference. It then proceeds through NVSwitch credential management to establish secure access, followed by extensive diagnostic data collection including BMC log dumps, NVDebug analysis, Nvlmapper connectivity checks, and PartnerDiag hardware diagnostics. The process concludes with case management that consolidates all diagnostic information into a comprehensive package and creates support tickets with attached logs for efficient troubleshooting.

Domain Triage Workflow Components#

The Domain Triage workflow consists of several key steps that ensure thorough diagnosis of NVSwitch and NVLink domain issues:

Compute Node Management#

  • Adds AHR maintenance tags to all compute nodes in the affected rack

  • Drains compute nodes from Slurm to prevent workloads from running during diagnostics (no jobs will be scheduled on the entire rack)

NVSwitch Credential Management#

  • Retrieves BMC credentials for NVSwitches from BCM

  • Establishes secure access to NVSwitch components for diagnostics

  • Collects system rack serial numbers for identification

Diagnostic Data Collection#

  • Dumps BMC logs from NVSwitches to capture hardware-level events

  • Runs NVDebug tool to collect detailed information about NVSwitch status

  • Executes Nvlmapper tool to check NVLink status and connectivity (binary must be installed under /tmp/nvlmapper; see PID for download)

Switch Firmware Upgrade#

  • Upgrades NVSwitch firmware to the version provided in the specified package directory, using the golden configuration JSON for validation

PartnerDiag#

  • Runs PartnerDiag for comprehensive hardware diagnostics (binary must be installed under /tmp/partnerdiag; see PID for download)

Case Management#

  • Collects and organizes all diagnostic logs into a single package

  • Creates a support ticket with the consolidated diagnostic package (default message: NMC_BREAKFIX_NVSwitch; details: “Potential NVSwitch or NVLink issue”)

  • Attaches detailed logs to facilitate efficient troubleshooting

Running the Domain Triage Workflow#

Note: Domain Triage must be manually triggered on demand and is not an automated trigger runbook like BREAKFIX_TRIGGER.

To manually execute the Domain Triage workflow:

  1. Navigate to the “BREAKFIX_DOMAIN_TRIAGE” runbook in the Runbooks section

    _images/image109.png
  2. Configure the required parameters:

    • resource_tag: Tag for the resource query (use rack_name for rack filtering).

    • resource_value: Value for the resource query; set to the rack to diagnose (e.g., B05). Use the same value as you would for RACK_NAME.

    • SWITCH_FWPKG_DIR_PATH: Full path to the directory containing all firmware packages for the switch node.

    • FW_SOURCE_JSON_PATH: Path to the golden configuration JSON file (from the firmware package) that defines the reference settings used for validation.

  3. Optionally set:

    • IGNORE_LIST: Comma-separated list of nodes to exclude from scope (e.g., node01,node02), or none.

    • MESSAGE: Message for the support ticket (default: NMC_BREAKFIX_NVSwitch).

    • SYSTEM_SERIAL, BMC_CRED_FILE, LOG_FILE_PATH: For support ticket and credential/log paths when needed.

  4. Select “Create Run” to initiate the workflow

    _images/image110.png
  5. Monitor the execution progress and results
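The IGNORE_LIST format (a comma-separated list of nodes, or none) can be illustrated with a short parser. This is a sketch only: `parse_ignore_list` and `apply_ignore_list` are hypothetical helpers, and tolerating surrounding whitespace and any capitalization of `none` is an assumption rather than documented runbook behavior.

```python
def parse_ignore_list(value: str) -> list:
    """Turn an IGNORE_LIST value into a list of node names."""
    text = (value or "").strip()
    # "none" (or an empty value) means no nodes are excluded.
    if not text or text.lower() == "none":
        return []
    return [item.strip() for item in text.split(",") if item.strip()]


def apply_ignore_list(nodes: list, ignore_list: str) -> list:
    """Return the nodes that remain in scope after exclusions."""
    excluded = set(parse_ignore_list(ignore_list))
    return [node for node in nodes if node not in excluded]
```

With nodes `node01`, `node02`, and `node03` in the rack and `IGNORE_LIST` set to `node01,node02`, only `node03` remains in scope.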

Domain Triage Results#

Accessing Domain Triage Results#

Via Runbook Execution View:

  1. Navigate to “Runbooks” in the left panel

  2. Click “Runs” in the upper left corner

  3. Filter by “BREAKFIX_DOMAIN_TRIAGE” to see all domain triage executions

  4. Select a specific run to view detailed execution flow

    _images/image121.png

Via Resource Run History:

  1. Navigate to “Resources” in the left panel and search for rack name (e.g., “b07”)

    _images/image122.png
  2. Click on any compute node from the affected rack

    _images/image123.png
  3. View the “Run History” page to see all domain triage executions for that resource

    _images/image124.png
  4. Filter for “BREAKFIX_DOMAIN_TRIAGE” to see domain-level diagnostic attempts

    _images/image125.png

Understanding Domain Triage Outcomes#

After successful completion of the Domain Triage workflow:

Diagnostic Data Collection:

  • BMC logs from all NVSwitches in the affected domain

  • NVDebug output containing detailed switch status information

  • Nvlmapper results showing NVLink connectivity status

  • PartnerDiag comprehensive hardware diagnostic reports

Support Ticket Creation:

  • Automated support ticket generation with consolidated diagnostic package (if opted in to support ticket service)

  • All collected logs attached to the ticket for support team analysis

  • Rack serial numbers and system identification included

  • Maintenance tags remain on compute nodes until issue resolution

Monitoring Domain Triage Progress#

During Execution:

  • Check compute node status - nodes should show maintenance tags

  • Verify diagnostic tool execution in runbook cells

  • Monitor support ticket creation process

Post-Execution:

  • Support teams receive comprehensive diagnostic information

  • All relevant logs are organized and accessible

  • System remains in maintenance mode pending resolution

NVIDIA Mission Control autonomous hardware recovery Break/Fix Switch Post RMA#

Switch Post RMA Introduction#

The NVIDIA Mission Control autonomous hardware recovery Break/Fix Switch Post RMA workflow automates the process of bringing NVSwitch components back into service after a Return Merchandise Authorization (RMA) replacement. This comprehensive workflow includes both physical switch tray replacement procedures and automated software configuration to ensure that replaced switch hardware is properly configured, firmware is updated to the correct versions, and the component is thoroughly validated before returning to production. This runbook is available for GB300 and GB200.

_images/image111.png

Figure: Switch Post RMA Workflow - This diagram illustrates the comprehensive switch replacement process following RMA procedures. The workflow begins with mandatory BCM inventory updates to register new MAC addresses, followed by switch connectivity verification and secure credential establishment. The process continues through factory reset and Zero Touch Provisioning for clean initialization, and firmware updates to match required versions. The workflow concludes with comprehensive system validation including NMX controller verification, rebooting the compute nodes, and thorough compute tray validation to ensure the replaced switch integrates seamlessly with the existing infrastructure.

Key Features#

  • Physical Replacement Procedures: Detailed instructions for safe switch tray removal and installation with proper cooling and power management

  • Compute Node Management: Adds maintenance tags and drains compute nodes during switch replacement to prevent workload interference

  • Switch Connectivity Verification: Establishes and verifies SSH connectivity to replaced switch components

  • Factory Reset and ZTP: Performs factory reset and monitors Zero Touch Provisioning for clean initialization

  • Firmware Updates: Updates switch firmware to the version in the specified package directory using the golden configuration JSON for validation

  • System Validation: Comprehensive testing including NMX controller verification, compute node reboots, and compute tray validation

Switch Post RMA Workflow Components#

The Switch Post RMA workflow consists of several key steps that ensure replaced switch hardware is properly configured and validated:

Physical Replacement Procedures#

  • Switch Tray Removal: Detailed instructions for powering down the entire rack, cooling procedures, cable disconnection, and safe tray removal

  • Switch Tray Installation: Comprehensive installation guide covering rail migration, tray insertion, cable reconnection, and power-on sequence

Compute Node Management#

  • Adds maintenance tags to all compute nodes in the affected rack

  • Drains compute nodes from Slurm to prevent workload interference during switch replacement

BCM Inventory Update#

  • ONLY REQUIRED FOR NEW HARDWARE: Updates BCM inventory information using BREAKFIX_POST_RMA_UPDATE_SWITCH_BCM_INVENTORY when a new switch is installed

  • Skip this step for repaired switches as MAC addresses remain unchanged

  • Ensures BMC MAC and COMe MAC addresses are correctly registered in BCM (new MAC addresses are provided by the Enterprise Support team who manage asset inventory for customer deployments)

  • Enables proper management and monitoring of the replaced switch hardware

Switch Connectivity and Configuration#

  • Retrieves switch IP and credentials from BCM

  • Verifies SSH connectivity to the switch node

  • Updates ZTP settings in BCM with NVOS image file configuration

  • Ensures the switch is reachable for configuration operations

Factory Reset and ZTP#

  • Performs factory default reset on the switch

  • Monitors Zero Touch Provisioning (ZTP) status until successful completion

  • Creates support tickets if ZTP fails and exits workflow (if opted in to support ticket service)

Switch BMC Credential Management#

  • Retrieves BMC credentials for NVSwitches from BCM

  • Establishes secure access to switch components for diagnostics

  • Collects system rack serial numbers for identification

Firmware Updates#

  • Upgrades switch firmware to the version in the specified package directory (SWITCH_FWPKG_DIR_PATH) using the golden configuration JSON (FW_SOURCE_JSON_PATH) for validation

  • Verifies switch connectivity after firmware updates

System Health and Validation#

  • Performs switch tray health checks

  • Verifies NMX-C and NMX-T controller status on the active switch node

  • Reboots all compute nodes in the rack to ensure proper connectivity

  • Waits for agent connectivity to confirm successful recovery

  • Validates there are no inactive NVLinks

  • Runs comprehensive compute tray validation using BREAKFIX_COMPUTE_TRAY_VALIDATION

Running the Switch Post RMA Workflow#

Step 1: Update Switch BCM Inventory (ONLY FOR NEW HARDWARE)#

Note: This step is ONLY required when installing a new switch. Skip this step for repaired switches as MAC addresses remain unchanged.

For new hardware replacement, update the BCM inventory with new MAC addresses:

  1. Navigate to the “BREAKFIX_POST_RMA_UPDATE_SWITCH_BCM_INVENTORY” runbook in the Runbooks section

    _images/image112.png
  2. Configure the required parameters (MAC addresses are provided by the Enterprise Support team):

    • SWITCH_HOSTNAME: The hostname of the replaced switch component

    • BMC_MAC: BMC MAC Address from Asset File

    • COMe_MAC_1: COMe MAC 1 Address From Asset File

    • COMe_MAC_2: COMe MAC 2 Address From Asset File

    _images/image113.png
  3. Select “Create Run” to initiate the BCM inventory update

  4. Monitor the execution progress to ensure successful completion

Step 2: Execute Main Switch Post RMA Workflow#

After successfully updating the BCM inventory (if required for new hardware), proceed with the main Switch Post RMA workflow:

  1. Navigate to the “BREAKFIX_SWITCH_POST_RMA” runbook in the Runbooks section

    _images/image114.png
  2. Configure the required parameters:

    • SWITCH_HOST_NAME (required): Input replaced switch host name.

    • resource_tag (required): Tag for the resource query (use rack_name for rack filtering).

    • resource_value (required): Value for the resource query; set to the rack scope (e.g., A01, or use regex/list format as supported by flex query).

    • SWITCH_FWPKG_DIR_PATH (required): Full path to the directory containing all firmware packages for the switch node.

    • NVOS_IMAGE_FILE_NAME (required): File name of the latest NVOS image on the head node under the path /cm/local/apps/cmd/etc/htdocs/switch/image. If the latest image is not present there, copy the image file to this path and provide its file name.

    • FW_SOURCE_JSON_PATH (required): Path to the golden configuration JSON that defines the reference settings used for validation.

    • IGNORE_LIST (optional): List of nodes to exclude from scope (e.g., node01, or none).

    • MESSAGE (optional): Message for support ticket creation if ZTP fails (default: AHR_BREAKFIX_NVSwitch_POST_RMA).

  3. Configure the required secrets (if not already configured): AHR_API_ENDPOINT and AHR_TOKEN (used for agent connectivity checks and compute tray validation). Add or update them under Settings > Secrets in the Shoreline UI.

    _images/image115.png
  4. Select “Create Run” to initiate the workflow

  5. Monitor the execution progress and results
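Since NVOS_IMAGE_FILE_NAME expects a file name that exists under the head node image directory, a quick check on the head node can catch a missing image before the run starts. The helper below is a hypothetical sketch: `nvos_image_available` is an illustrative name, and rejecting values that contain a directory component is an assumption based on the parameter asking for a file name rather than a path.

```python
import os

# Image directory on the head node, as given in the parameter description.
NVOS_IMAGE_DIR = "/cm/local/apps/cmd/etc/htdocs/switch/image"


def nvos_image_available(file_name: str, image_dir: str = NVOS_IMAGE_DIR) -> bool:
    """Return True if file_name is a bare name of a file in image_dir."""
    # The parameter expects a file name only, so reject anything with a path.
    if not file_name or os.path.basename(file_name) != file_name:
        return False
    return os.path.isfile(os.path.join(image_dir, file_name))
```

If the function returns False, copy the image file into the directory first, as the parameter description instructs.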

GB200#

Automated Baseline Testing with NVIDIA Mission Control autonomous hardware recovery#

Introduction#

The NVIDIA Mission Control autonomous hardware recovery portal enables efficient automation of baseline testing procedures. This comprehensive testing framework can be executed at various scales of infrastructure, from individual compute nodes to complete racks or multiple-rack configurations. The system performs extensive validation of critical compute node components, including CPU, GPU, memory, and storage functionality, network connectivity, and firmware versioning. Additionally, it incorporates industry-standard performance benchmarking tools such as HPL, NCCL, and Nemotron (a large language model) to assess system capabilities. This streamlined approach significantly enhances both testing efficiency and thoroughness while reducing execution time.

Figure: Baseline Testing Workflow - Select Single Rack or Multi Rack test mode, then review status, firmware checks, and the resource dashboard.

Entrypoint of Automated Testing#

A centralized automated baseline testing interface streamlines test execution and management. This unified entry point provides access to the complete testing framework, with efficient navigation and one-click execution of all testing procedures.

To access the baseline testing interface:

  • Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.

  • Locate the “DGX_SUPERPOD_BASELINE_TESTING” runbook using the search functionality (see the interface shown below)

    _images/image11.png
  • Upon selecting the “DGX_SUPERPOD_BASELINE_TESTING” runbook, you will be presented with the interface shown below. The subsequent sections of this documentation will provide detailed guidance on executing various testing procedures.

    _images/image12.png

Run SRT (Single Rack Testing) Job#

This section describes how to initiate baseline testing in Single Rack mode when one or more racks are ready for testing.

  • Navigate to the “DGX_SUPERPOD_BASELINE_TESTING” runbook utilizing the previously outlined navigation protocol.

  • Ensure the “DGX_SUPERPOD_BASELINE_TESTING_SRT” component is activated by toggling its switch control. This control is located on the right side of its immediate group of interface icons. The switch indicator in its default deactivated state should be visually distinct from the activated state. Note: When activated, the switch indicator displays as a green circle with a play icon.

  • Verify that the “DGX_SUPERPOD_BASELINE_TESTING_MRT” component is in its deactivated state. This deactivated state is indicated by its corresponding switch control displaying as a grey circular icon containing a muted play symbol, signifying it is ‘Off’.

  • Select the “Save” button positioned in the top right corner of the interface to preserve your settings.

  • Upon successful completion of these preliminary steps, your runbook configuration should reflect the specified parameters as illustrated below.

    _images/image13.png
  • To initiate a run, select the Create Run button located at the top right corner of the interface. A new window will appear as shown below. For detailed information about each parameter, simply hover over the info icon beside it.

    _images/image14.png

    You are required to provide the following inputs for the runbook:

    Resource filtering (flex query): Runbooks use a flexible resource query instead of hardcoding a single rack. You specify:

    • resource_tag (required): Tag for the flex resource query. Use rack_name for rack-based filtering (e.g., to run on a specific rack or set of racks).

    • resource_value (required): Value for the flex resource query. Set to the same value you would have used for the rack name (e.g., B05, A01, m06|m07 for multiple racks), or use “none” if you do not want to filter by that tag. This allows you to target any resource—not just a single hardcoded rack.

    • FW_SOURCE_JSON_PATH: Specify the file path to the golden configuration JSON file included in the firmware package. This file defines the reference settings used for validation. If the file is not available, set this parameter to NA.

    • IGNORE_LIST: Provide a list of nodes to exclude from the test, only if required. Leave the value as “none” if no nodes need to be ignored. This parameter supports regular expressions. Here are some examples:

      • Single node: node01

      • List format: [“node01”, “node02”]

      • Pipe-delimited string: node01|node02

  • After entering the correct resource_tag and resource_value (e.g., resource_tag=rack_name, resource_value=m06 or m06|m07), select the “Create Run” button to initiate the process. A confirmation dialog will appear with a “View Run” link illustrated below. Selecting this link will redirect you to a new page displaying comprehensive job status and details. For additional information regarding job monitoring and results, please refer to the “Check Job Status and Result” section of this documentation.

    _images/image15.png

    IMPORTANT NOTES:

  1. How to get the value for resource_value (e.g. rack_name):

    1. Note that the value (e.g., rack_name) is generated and captured automatically from the BCM inventory. Follow the steps below to retrieve it.

    2. Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.

      _images/image16.png
    3. Click the “New Runbook” button at the top right. The page shown below appears.

      _images/image17.png
    4. In the central page, click “Op Statement” to create your first cell for querying the resource.

    5. Type “host” in the cell as your first query and press Enter to display all host information, as in the example below.

      _images/image18.png
    6. The value for the chosen tag (e.g., “rack_name”) is displayed. If it does not show up, click “Show Panel”, type the tag name (e.g., “rack_name”) in Search, and ensure it is selected.

      _images/image19.png
  2. The predefined timeout for SRT is 4 hours. You can adjust it based on your requirements.
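The IGNORE_LIST parameter described earlier in this section accepts a single node, a list, or a pipe-delimited string, and supports regular expressions. As a rough illustration of how those three formats can be reduced to one pattern, here is a minimal Python sketch. The helper names `ignore_pattern` and `filter_nodes` are hypothetical, and the assumption that a pattern must match the whole node name is ours; the runbook's actual parsing may differ.

```python
import json
import re


def ignore_pattern(ignore_list: str):
    """Turn an IGNORE_LIST value into a compiled regex, or None for "none".

    Hypothetical helper: it only illustrates the three documented formats.
    """
    value = ignore_list.strip()
    if not value or value.lower() == "none":
        return None
    if value.startswith("["):
        # List format, e.g. ["node01", "node02"]; typographic quotes are normalized first.
        names = json.loads(value.replace("“", '"').replace("”", '"'))
        value = "|".join(names)
    # Assumption: the pattern has to match the full node name.
    return re.compile(f"^(?:{value})$")


def filter_nodes(nodes, ignore_list):
    """Return the nodes that remain after applying the ignore pattern."""
    pattern = ignore_pattern(ignore_list)
    if pattern is None:
        return list(nodes)
    return [node for node in nodes if not pattern.match(node)]
```

For example, `filter_nodes(["node01", "node02", "node03"], "node01|node02")` keeps only `node03`, and passing "none" keeps every node.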

Run MRT (Multi Rack Testing) Job#

Guide to Initiating Baseline Testing Procedures for Multi-Rack Configuration.

  • Navigate to the “DGX_SUPERPOD_BASELINE_TESTING” runbook utilizing the previously outlined navigation protocol.

  • Ensure the “DGX_SUPERPOD_BASELINE_TESTING_MRT” component is activated by toggling the switch control. This control is located on the right side of its immediate group of interface icons. The switch indicator in its default deactivated state should be visually distinct from the activated state. Note: When activated, the switch indicator displays as a green circle with a play icon.

  • Verify that the “DGX_SUPERPOD_BASELINE_TESTING_SRT” component is in its deactivated state. This deactivated state is indicated by its corresponding switch control displaying as a grey circular icon containing a muted play symbol, signifying it is ‘Off’.

  • Select the “Save” button positioned in the top right corner of the interface to preserve your settings.

  • Upon successful completion of these preliminary steps, your runbook configuration should reflect the specified parameters as illustrated below.

    _images/image20.png
  • Select the “Create Run” button positioned in the top right corner of the interface. A new window will appear, as illustrated below.

    _images/image14.png

    The identifier represents the resource filter. For rack-based runs, use resource_tag=rack_name and resource_value set to the rack designation. The system accommodates both single and multiple rack configurations, as detailed below:

    Single Rack Format:

    • Standard notation: resource_value (Example: m06)

    Multiple Rack Format:

    • Standard notation: resource_value|resource_value (Example: m06|m07)

    • Critical: No spaces are permitted between rack identifiers and the delimiter (|)

  • After entering the correct resource_tag and resource_value, select the “Create Run” button to initiate the process. A confirmation dialog will appear with a “View Run” link illustrated below. Selecting this link will redirect you to a new page displaying comprehensive job status and details. For additional information regarding job monitoring and results, please refer to the “Check Job Status and Result” section of this documentation.

    _images/image15.png

IMPORTANT NOTES:

  1. How to get the value for resource_value (e.g. rack_name):

    1. Note that the value (e.g., rack_name) is generated and captured automatically from the BCM inventory. Follow the steps below to retrieve it.

    2. Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.

      _images/image16.png
    3. Click the “New Runbook” button at the top right. The page shown below appears.

      _images/image17.png
    4. In the central page, click “Op Statement” to create your first cell for querying the resource.

    5. Type “host” in the cell as your first query and press Enter to display all host information, as in the example below.

      _images/image18.png
    6. The value for the chosen tag (e.g., “rack_name”) is displayed. If it does not show up, click “Show Panel”, type the tag name (e.g., “rack_name”) in Search, and ensure it is selected.

      _images/image19.png
  2. The predefined timeout for SRT is 4 hours. You can adjust it based on your requirements.
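The resource_value formats described in this section (a single rack such as m06, or several racks joined by | with no spaces) can be checked before creating a run. The following is a minimal Python sketch of such a check. The function name `is_valid_resource_value` is hypothetical, and the assumption that rack identifiers contain only letters, digits, underscores, and hyphens is ours; adjust the pattern to match your naming scheme.

```python
import re

# Assumption: rack identifiers look like m06, B05, or A01.
_RACK = r"[A-Za-z0-9_-]+"
_RESOURCE_VALUE = rf"{_RACK}(?:\|{_RACK})*"


def is_valid_resource_value(value: str) -> bool:
    """Accept 'm06' or 'm06|m07'; reject spaces around the | delimiter."""
    return re.fullmatch(_RESOURCE_VALUE, value) is not None
```

With this check, `is_valid_resource_value("m06|m07")` returns True, while `is_valid_resource_value("m06 | m07")` returns False because of the spaces around the delimiter.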

Runbook Configurations#

Before you initiate real jobs, use the following guide to check the runbook configurations.

Select “Runbook” from the left navigation panel, then use the search field on the right side of the page to find your runbook by name, as shown in the illustration below.

_images/image21.png

Below is the list of all Mission Control related runbooks, including their names and descriptions.

| Category | Runbook Name | Description |
|---|---|---|
| | DGX_SUPERPOD_BASELINE_TESTING | EntryPoint Runbook |
| SRT | DGX_SUPERPOD_BASELINE_TESTING_SRT | EntryPoint Runbook of SRT |
| SRT | SRT1 | EntryPoint Runbook of all single node health checks |
| SRT | SINGLENODE_HEALTHCHECK_GPU_CPU | Baseline health checks for GPU and CPU |
| SRT | SINGLENODE_HEALTHCHECK_MEMORY_STORAGE | Baseline health checks for Memory and Storage |
| SRT | SINGLENODE_HEALTHCHECK_NETWORK | Baseline health checks for Network |
| SRT | SINGLENODE_HEALTHCHECK_SOFTWARE | Baseline health checks for installed software |
| SRT | SINGLENODE_HEALTHCHECK_FIRMWARE | Baseline health checks for firmware |
| SRT | SRT2 | EntryPoint Runbook of component testing |
| SRT | SR_MEMORY_BENCHPRESS | Benchmark testing for memory |
| SRT | SR_CUDA_SAMPLES | Benchmark testing for CUDA |
| SRT | SR_P2P_IPERF | Benchmark testing for pairwise ethernet interfaces |
| SRT | HPL_TEST_SINGLE_NODE | HPL testing on single node separately |
| SRT | HPL_TEST | HPL testing on the single rack |
| SRT | SR_NVBANDWIDTH | Bandwidth testing running in single nvldomain |
| SRT | NCCL_TEST | NCCL testing on the single rack |
| SRT | SRT3 | EntryPoint Runbook of burn-in performance testing |
| SRT | HPL_TEST_BURN_IN | HPL testing on the single rack with long duration |
| MRT | DGX_SUPERPOD_BASELINE_TESTING_MRT | EntryPoint Runbook of MRT |
| MRT | MRT1 | EntryPoint Runbook of rack level connectivity testing |
| MRT | MR_INFINIBAND_CHECK | InfiniBand connectivity check |
| MRT | MRT2 | EntryPoint Runbook of multi-rack performance testing |
| MRT | MR_HPL_TEST | HPL testing cross multiple racks |
| MRT | MR_NCCL_TEST | NCCL testing cross multiple racks |
| MRT | MR_HPL_TEST_BURN_IN | HPL testing cross multiple racks with long duration |
| MRT | MR_NCCL_TEST_BURN_IN | NCCL testing cross multiple racks with long duration |
| MRT | MRT3 | EntryPoint Runbook of cluster level testing |
| MRT | Nemotron_15B | LLM testing with mocked data |

Runbook Interface Guide#

When accessing the runbook as shown in the example below, please note these important configuration elements:

_images/image20.png
Central Workspace#

The main content area displays your resource queries, commands, scripts, or nested runbooks.

  • Each row represents an individual cell

  • Each cell includes a play button for isolated execution

  • Toggle switches allow you to enable/disable specific cells

Configuration Panel (Right Side)#

The right panel contains several critical configuration sections:

  1. Parameters: Contains all required inputs for runbook execution

  2. Triggers: Configure automated execution methods:

    • Alarm triggers

    • Time Trigger (cron jobs)

    • Other integrations like AlertManager

  3. Users: Manage permissions for who may run or edit the runbook

  4. Settings: General runbook configuration options

  5. More runbook operations, including:

    • Clone functionality

    • Export options

    • Delete runbook

    • etc.


Check Job Status and Result#

Once you’ve initiated the SRT or MRT job using the steps above, there are two ways to check the job status and results:

  1. Click “View Run” in the confirmation dialog that appears right after you create the run, as described in the sections above. This redirects you to the run page.

    _images/image15.png
  2. Navigate to the “Runbook” section in the left panel, then click the “Run” button located in the upper left corner of the page as shown below. Important: Make sure you’ve selected the correct range in the upper right corner before proceeding.

    _images/image25.png

    Note: Runbooks can be nested within other runbooks. When this occurs, you may see an “Execution succeeded - View Run” link after a cell completes. Clicking this link will redirect you to a detailed results page for the nested runbook execution.

    _images/image26.png

In the following part of this section, we will walk you through the topics below.

  1. Check Job Status: whether the job is Running, Completed, Aborted, Terminated, Timed Out, or Canceled.

  2. Check Job Results: whether the job passed or failed, along with detailed logs.

Check Job Status#

At the top of each job execution, you will observe one of the following status indicators:

_images/image27.png
Status Types and Definitions#
  1. Running: The job is currently executing. Progress is displayed as a percentage based on completed cells.

  2. Completed: The job has finished execution. Note that completion status does not guarantee successful results. Please review the detailed output in the results section.

  3. Aborted: The job terminated prematurely due to execution errors, such as cell syntax issues.

  4. Terminated: The job was forcibly ended by the system.

  5. Timed Out: The job exceeded its maximum allowed execution duration (the default timeout is 1 hour for a runbook and 1 minute for an action).

  6. Canceled: The job was manually terminated by a user.

Check Job Results#

When the job status displays “Completed,” you may proceed to review the job results. If the job status shows otherwise, you may click into the job for more details and troubleshooting.

Cell Structure Overview#

Each cell in the runbook contains three primary components:

_images/image28.png
  1. Main Content Area: This section displays the executed script or command. On the right side of the cell, you’ll find several control icons. While most icons were detailed in the previous section, the “fx” icon is particularly valuable as it displays all parameters along with input/output values for the specific cell when hovering over it.

    _images/image29.png
  2. Execution Information Bar: Located in the middle of the cell, this light grey text line indicates the execution start time and duration of the operation.

  3. Results Section: The bottom portion displays:

  • Exit code status

  • Execution location information

  • Complete command output (accessible by clicking the “Output” column contents)

Additional Features:

  • Configure output display preferences

  • Toggle the density

  • Download results in various formats using the download options menu

Streamlined Error Navigation: When a job contains numerous cells, manually checking for failures becomes inefficient. Use the “Error outline” feature in the middle panel to quickly locate problematic cells. Simply click any item in this list to automatically navigate to the corresponding failed cell.

_images/image30.png

Note: Reports are available for the major runbooks. For details, see the “Reports of Testings” section.

Handling Job Failures#

In the event of job or cell execution failures, the following remediation options are available:

  • Major Issue Resolution: Upon resolution of critical infrastructure issues (e.g., hardware replacement), a complete re-initialization of the SRT or MRT job is recommended.

  • Targeted Component Resolution: When specific components have been updated (e.g., firmware version upgrades), execute the relevant job or runbook within the existing session by selecting the “Run” button located in the upper-right interface section. This maintains all previously established parameters.

  • Individual Cell Correction: For isolated cell failures that have been addressed, execute the specific cell independently by activating the execution control (play button) positioned on the right margin of the cell interface. Note: This option might make it difficult to locate the job/run from the reports panel (refer to the “Reports of Testings” section below).

    _images/image30.png

Firmware checks#

NVIDIA Mission Control autonomous hardware recovery includes firmware checks that extract the current firmware versions of the trays and switches, and compare them with the expected versions specified in the Source of Truth (SOT) file. The SOT file includes the expected versions for all components such as OS, HMC, ConnectX, etc., and is prepopulated. The SOT file can be obtained from the NVIS team.

The runbook extracts the expected versions of all firmware components and compares the current versions against them. If any execution of the firmware validation runbook results in versions not found, please re-run the runbook.
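The comparison described above can be pictured as a lookup of expected versus current versions per component. The following is a minimal Python sketch of that idea only. The flat `{component: version}` layout, the `compare_firmware` function, and the version strings are all hypothetical placeholders; the actual SOT schema is not shown in this document.

```python
import json


def compare_firmware(expected: dict, current: dict) -> dict:
    """Compare current firmware versions against expected ones.

    Returns {component: (status, expected_version, current_version)}, where
    status is PASS, FAIL, or NOT_FOUND (the current version could not be read).
    """
    report = {}
    for component, want in expected.items():
        have = current.get(component)
        if have is None:
            status = "NOT_FOUND"
        elif have == want:
            status = "PASS"
        else:
            status = "FAIL"
        report[component] = (status, want, have)
    return report


# Placeholder SOT content; real files follow the schema of the supplied SOT file.
sot = json.loads('{"HMC": "1.0.0", "ConnectX": "2.0.0"}')
result = compare_firmware(sot, {"HMC": "1.0.0", "ConnectX": "1.9.0"})
```

In this sketch a NOT_FOUND status corresponds to the "versions not found" case mentioned above, for which re-running the runbook is the suggested remedy.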

Updating the SOT file#

Place the SOT JSON file on the headnode and provide the complete file path as the input FW_SOURCE_JSON_PATH to the DGX_SUPERPOD_BASELINE_TESTING runbook. To update the file, simply replace it with the new file and update the path in the input parameter of the runbook accordingly.

Thresholds and Defaults#

The thresholds and default values for various tests are defined in the Golden Config File. The runbook picks the appropriate values and compares them against the values on the nodes. This includes defaults such as the number of GPUs and the expected results for benchmarking tests (such as SRT2).

Golden Config File#

The Golden Config File (referred to as the defaults.env file on the trays and control nodes) contains the expected values for all benchmark thresholds, and other relevant settings. This file is distributed across all trays and loaded as environment variables, making its contents available during testing.

_images/image31.png
Updating the Golden Config File#

To update the golden config values, edit config/<CHIP>/defaults.yaml (for example, config/B200/defaults.yaml or config/GB200/defaults.yaml). The chip-specific env files are generated automatically from these YAML files during deployment — do not edit the generated files under Shoreline_files/generated/ directly.

Once the changes are made and saved, use the NVIDIA Mission Control autonomous hardware recovery Runbook Deployment section (from the NVIDIA Mission Control AHR installation documentation) to apply them via OpenTofu. This process will create a File object on NVIDIA Mission Control autonomous hardware recovery, which pushes the updated file automatically to all control and compute trays. Various tests, including firmware checks, prolog, and epilog checks, source the defaults.env file and utilize the expected values, now available as environment variables.
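To make the role of the defaults.env file more concrete, the following minimal Python sketch parses simple KEY=VALUE lines and compares an observed value against an expected one. The variable name `EXPECTED_GPU_COUNT`, the sample contents, and the `parse_env` helper are hypothetical; the real keys in defaults.env are not listed in this document, and the tests themselves source the file as environment variables rather than parsing it this way.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


# Hypothetical sample; real key names come from the generated defaults.env file.
sample = """
# expected values
EXPECTED_GPU_COUNT=4
"""
defaults = parse_env(sample)

observed_gpu_count = 4  # placeholder for a value measured on the node
gpu_count_ok = observed_gpu_count == int(defaults["EXPECTED_GPU_COUNT"])
```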

Reports of Testings#

NVIDIA Mission Control autonomous hardware recovery provides reports for baseline testing, reflecting the status of nodes (compute and switches) at each test stage. These reports help identify and troubleshoot root causes.

To access the NVIDIA Mission Control autonomous hardware recovery reports, click on Resources in the side menu, then select Reports.

_images/image32.png

The Landing page contains two tabs: Report Templates and Published Reports.

_images/image33.png

Report Templates provide templates for each stage of Baseline testing. These templates include bar graphs that display the PASS or FAIL status for different nodes during the tests. However, these templates are static and do not store any test data. This means that while you can view the templates, you cannot save or modify the test results within them.

To generate a report that records the test results along with timestamps, click Publish. This action will create a new report based on the template, which will capture the current status of the tests. The report will display PASS for tests that were successful and FAIL for those that did not pass, with each status reflecting the most up-to-date information. The report also includes links to additional reports at the top of the page. These links allow you to access the detailed results of the individual SRT and MRT tests that make up each stage, giving you a deeper insight into the performance and status of each test within the overall baseline testing process.

Note: Reports reflect updated information only after the SRT and MRT tests have been executed.

Guide to initiate Reporting for Baseline Testing#

  • Navigate to the “DGX_SUPERPOD_BASELINE_TESTING_REPORT” under the Report Templates.

  • Optional: Although templates do not store test results, you can use a template to view the current state before publishing the reports. Click the refresh button at the top of the report to load the current data.

    _images/image34.png
  • Click on “Publish” to generate a timestamped instance of the template that can be easily viewed and shared. Additionally, it automatically publishes all linked reports at the top of the page, ensuring that all related data is included and accessible.

    _images/image35.png
  • Retain the auto-generated name for the report which includes the timestamp or provide your own name.

  • Once the process starts, the published reports will begin generating in the background. This includes “DGX_SUPERPOD_BASELINE_TESTING_REPORT” as well as all the SRT and MRT linked reports.

  • A pop-up notification will appear containing a hyperlink to the published reports that are currently being generated.

    _images/image36.png
  • When you publish the “DGX_SUPERPOD_BASELINE_TESTING_REPORT”, it automatically triggers the publication of all “linked reports”, including the associated SRT and MRT reports, with the data captured at that time.

    _images/image37.png
  • All the reports will complete building in under a minute, and the “Linked Published Reports” will include the published reports for all the Linked Reports.

    _images/image38.png

Navigating to Previously Published Reports#

  • Navigate to the Reports from the NVIDIA Mission Control autonomous hardware recovery Home Page

    _images/image32.png
  • Click on “Published Reports”, and select the “DGX_SUPERPOD_BASELINE_TESTING_REPORT_timestamp” or any other report of interest.

    _images/image39.png
  • You can also adjust the time range at the top of the screen to view reports generated within specific time frames. Options include viewing reports from the last 10 minutes, last 1 hour, or selecting a custom time range to explore older data beyond these periods.

    _images/image40.png

Breakdown of a Published Report#

DGX_SUPERPOD_BASELINE_TESTING_REPORT#
  • “DGX_SUPERPOD_BASELINE_TESTING_REPORT” is the entry point for the Baseline Testing reports. It provides a comprehensive overview of all the SRT/MRT tests for the compute nodes.

  • Each stage is represented by a separate cell, displaying the results for that specific SRT or MRT test.

    _images/image41.png
  • To perform a deeper analysis or understand the tests in each stage, click on the relevant report listed under “Linked Published Reports” at the top of the page.

    _images/image42.png
  • Similarly, all the reports for each stage are available. You can either find the report of interest under Published Reports, or navigate to it from the parent report (DGX_SUPERPOD_BASELINE_TESTING_REPORT_<timestamp>).

Understanding the Published Reports#

The reports align with the SRT and MRT tests. Each cell in “DGX_SUPERPOD_BASELINE_TESTING_REPORT” represents a specific testing stage, such as SRT1 or SRT2. Within each of these “Linked Reports”, the cells represent individual tests, such as Singlenode Healthchecks, HPL, and NCCL. The layout of the graph for each cell is organized as follows:

  • The Y-axis displays the Rack name, allowing you to quickly identify the location of each resource within the cluster.

  • The X-axis represents the number of trays within each rack.

For example, in the following visualization, each bar graph indicates the number of trays within a rack. In this case, there are 2 trays per rack, as denoted by the number displayed on the bars. This provides a clear view of the test progress for each tray across different racks. Note: For NVL72, the report shows 18 trays per rack.

_images/image43.png
  • You can click on any bar in the graph to view a detailed list of the resources that passed or failed the tests within that specific bar. This allows you to drill down and see the test results for each tray in a particular rack or test stage.

    _images/image44.png
  • Alternatively, you can click on the legend to filter and display all the passed or failed resources across the entire cluster, providing a comprehensive view of the overall test status.

    _images/image45.png

Root Cause Analysis#

The report for each stage shows the trays that have passed and failed the test. To understand the actual issue and access the logs, you can follow these steps:

  • Navigate to the report with the PASS/FAIL values for which you want to conduct a root cause analysis (RCA). In this example, let’s consider the SRT1 report, where there are a few failed tests for SINGLENODE_HEALTHCHECK_GPU_CPU.

    _images/image46.png
  • Navigate to the report corresponding to SINGLENODE_HEALTHCHECK_GPU_CPU and scroll to the tests that are failing.

    _images/image47.png
  • In this example, GPU VBIOS Version has failed for all the trays. Click on the bar for one of the racks to list the resources that the test failed on.

    _images/image48.png
  • Click on the status (FAIL) which directly links you to the runbook where these tests failed. You can also do the same for tests that passed by clicking on the PASS status.

    _images/image49.png
  • The errors in the runbook are outlined to the left of the screen. Navigate to the test that we are debugging (VBIOS). Alternatively, you can also scroll through the run to find the failed tests.

    _images/image55.png _images/image56.png
  • Click on the “Command filter excluded x/y resources” to get detailed output for each resource.

    _images/image57.png
  • Click on the Output to view the detailed logs.

_images/image58.png _images/image59.png

Firmware Reports#

For Firmware Checks, NVIDIA Mission Control autonomous hardware recovery offers a tabular report detailing the status of each node. The report includes:

  • Pass/Fail Status: Indicates whether the firmware check for each node passed or failed.

  • Expected Version: Shows the firmware version that was expected for the node.

  • Current Version: Displays the actual firmware version currently installed on the node.

Navigating to Tabular Report for Firmware Checks#
  • You can access the SINGLENODE_HEALTHCHECK_FIRMWARE_<timestamp> report either through the Runs (Check Job Status and Result) section or by navigating to the Runs History (see Root Cause Analysis).

  • Scroll to the last cell of the execution and click on the output of the cell.

    _images/image60.png
  • You can scroll horizontally and vertically to view the complete output. Additionally, you have the option to download the cell output for further analysis.

    _images/image61.png
  • You can also navigate to the Reports section and view the tabular output by clicking on the status bar. This includes both the expected firmware version and the current firmware version.

    _images/image136.png

Resource Dashboard#

The NVIDIA Mission Control autonomous hardware recovery dashboard provides a comprehensive view of the status of all resources in the cluster, displaying the progress of testing at various stages for each node (control, compute, and switch nodes). It shows which tests are completed, which ones have passed, and which have failed. This allows users to easily track the overall health and status of the cluster, identify any issues, and assess the readiness of each resource.

Navigating to the Resource Dashboard#

  • Visit the NVIDIA Mission Control autonomous hardware recovery homepage, navigate to the ‘Resources’ menu, and click on ‘Dashboards’ to access the dashboard page.

    _images/image62.png
  • The Landing page contains two tabs: Dashboards and Dashboard Views.

    _images/image63.png
  • The Dashboards tab lists all available dashboards. These dashboards reflect the current status of the cluster and of the tests.

  • A Dashboard View is a snapshot of the Dashboard that captures the data at a specific point in time. This snapshot remains static, preserving the exact state and data of the Dashboard as it appeared at that moment, regardless of any future changes or updates to the live data.

DGX_SUPERPOD_BASELINE_TESTING_DASHBOARD#

The DGX_SUPERPOD_BASELINE_TESTING_DASHBOARD provides a detailed view of the cluster’s resources and the status of the baseline tests. Specifically, it tracks the progress of two key testing phases for each resource: SRT and MRT.

  • Resource Status: The dashboard displays all resources within the cluster, such as compute nodes, control nodes, and switch nodes, along with their associated status.

  • Hostname & Rack Information: For each resource, you will see the hostname and rack name, along with a Tag sequence that indicates the stage of both the SRT and MRT tests.

  • Test Stage Progress: The rows in the dashboard reflect the current status of each test stage. Each stage (SRT and MRT) has a corresponding tag name that visually represents its test progress, showing whether the stage has been successfully completed, is in progress, or has failed.

  • Snapshot of Cluster Health: The dashboard offers a comprehensive snapshot of the cluster’s readiness. It allows users to identify potential issues early, track the completion status of tests, and quickly assess which resources are ready and which ones may require attention.

    _images/image64.png

The dashboard allows you to sort the resources by Name, Rack, or Progress Bar, making it easier to organize and view the status of your cluster based on your preferred criteria.

  • Name: Sort resources alphabetically by their hostname for quick access.

  • Rack: Sort resources based on their rack assignment, ideal for organizing by physical location.

  • Progress Bar: Sort by test progress to focus on resources at different stages of testing or to identify incomplete tasks.

Creating a View / Snapshot#

To create a snapshot of the dashboard, follow these steps.

  • Click on the “Create View” button located at the top right corner of your screen.

    _images/image65.png
  • Retain the auto-generated name for the dashboard which includes the timestamp or provide your own name.

    _images/image66.png
  • Once the Dashboard View is created, a pop-up notification will appear with a hyperlink to access the Dashboard View.

    _images/image67.png
  • The Dashboard View created can be downloaded as a CSV to further manipulate the data to generate reports.

    _images/image68.png
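Once a Dashboard View has been downloaded as a CSV, it can be post-processed with standard tooling. The following minimal Python sketch counts stage statuses per rack. The column names `name`, `rack`, `srt`, and `mrt`, the sample rows, and the `summarize_by_rack` helper are hypothetical; check the header row of your exported file and adjust accordingly.

```python
import csv
import io
from collections import Counter


def summarize_by_rack(csv_text: str, stage_column: str = "srt") -> dict:
    """Count the values of one stage column per rack in a CSV export."""
    summary = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        summary.setdefault(row["rack"], Counter())[row[stage_column]] += 1
    return summary


# Made-up sample data with assumed column names.
sample = """name,rack,srt,mrt
node01,m06,PASS,PASS
node02,m06,FAIL,
node03,m07,PASS,PASS
"""
summary = summarize_by_rack(sample)
```

Here `summary["m06"]` holds one PASS and one FAIL, which mirrors the per-rack view offered by the reports.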

Automated Health Checks with NVIDIA Mission Control autonomous hardware recovery#

Introduction#

NVIDIA Mission Control autonomous hardware recovery provides a full suite of automated health checks to detect failures at the tray, rack, and system levels. In addition, system wide health checks are performed by integrating with the UFM and NMX-M network control planes. Health check data is reported back to BCM’s BaseView and/or the in-cluster LGTM stack. These health checks are performed at two layers: BCM job invocation, and as periodic health checks via NVIDIA Mission Control autonomous hardware recovery.

Alarms Dashboard#

The alarms dashboard is an overview of all alarms and the state of your system. In this view, alarms are summarized by counts of alarms firing, alarms firing most frequently, and a configurable list of most frequently firing, canceled, or resolved alarms. This is meant to be a starting point for any investigations of possible issues with your systems, and you may click any alarm for further details.

Alarms dashboard overview

BCM Slurm Job Lifecycle Checks (Prolog and Epilog)#

When a Slurm job is submitted, the Autonomous Hardware Recovery Agent automatically runs a set of checks at the start and end of the job to validate node health and stability. These are known as Prolog and Epilog checks.

  • Prolog Checks (run before the job starts):

    • If a check fails, the node is marked as DRAIN, and the job is re-queued.

    • If it passes, the job proceeds normally.

  • Epilog Checks (run after the job finishes):

    • If a check fails, the node is also marked as DRAIN.

These scripts are automatically pushed to each node when the job runs, but they are not visible or configurable through the NVIDIA Autonomous Hardware Recovery UI. To review them, navigate to Shoreline_files/scripts/slurm in the NVIDIA Mission Control package.
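The Prolog and Epilog outcomes listed above can be summarized as a small decision function. This is a minimal Python sketch of the documented behavior only, not the actual scripts; the names `Phase` and `job_lifecycle_actions` and the action labels are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    PROLOG = "prolog"  # runs before the job starts
    EPILOG = "epilog"  # runs after the job finishes


def job_lifecycle_actions(phase: Phase, check_passed: bool) -> list:
    """Return the actions described for a Prolog or Epilog check result."""
    if check_passed:
        # A passing Prolog lets the job proceed; a passing Epilog needs no action.
        return ["proceed"] if phase is Phase.PROLOG else []
    actions = ["drain_node"]  # a failed check marks the node as DRAIN
    if phase is Phase.PROLOG:
        actions.append("requeue_job")  # only a failed Prolog re-queues the job
    return actions
```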

Note: Prolog and Epilog checks are disabled by default and should only be enabled after the nodes are confirmed to be healthy. Use the following runbooks to manage them:

  • SLURM_CHECKS_ENABLE – enables the checks

  • SLURM_CHECKS_DISABLE – disables the checks

Unlike the Prolog and Epilog checks, Periodic Checks are defined within the NVIDIA Mission Control autonomous hardware recovery interface as Alarms, and are detailed in the next section. They are automatically enabled for racks that pass Single Rack Testing, but may also be manually enabled or disabled for specific racks by running the “ALARMS_ENABLE” and “ALARMS_DISABLE” runbooks.

Periodic Health Checks (Alarms)#

Periodic Checks are separate from Prolog and Epilog and run at regular intervals to monitor system health. These are managed as Alarms in the NVIDIA Autonomous Hardware Recovery UI.

  • Automatically enabled for racks that pass Single Rack Testing

  • Automatically disabled during firmware upgrade and break/fix

  • Can be manually enabled or disabled at any time

The alarm_base is the Resource Query they run against. To control them manually, use the following runbooks:

  • ALARMS_ENABLE – enables periodic alarms for selected racks. Note that a node having the maintenance tag will override these settings.

  • ALARMS_DISABLE – disables them

Periodic Checks are fully visible and configurable in the UI through the alarm section. The following is a list of the configured Alarms, grouped by their check interval:

Frequent Checks (5m)#

bmc_sensors#

Checks the sensors from the Baseboard Management Controller (BMC) to ensure the proper data is returned.

sysmem#

Checks that all expected memory DIMMs are present.

dns_host#

Checks the DNS configuration and resolution for the host.

eth_state#

Checks via ibstat that the ConnectX devices are present, active, and in the physical LinkUp state, and that they match the expected transfer rate.

raid_count#

Checks that the RAID configuration matches the expected mdstat configuration.

gpu_temp_history#

Checks System Event Log (SEL) history looking for GPU temperature issues.

gpu_alloc_temp#

Checks if the GPU temperatures are above a threshold.

periodic_bmc_host_checks#

The following groups of periodic functional checks are a subset of the BCM Prolog checks that run at predefined intervals as NVIDIA Mission Control autonomous hardware recovery Alarms.

check_bmc_ipmi_version - Checks BMC IPMI version against an expected value

check_nvidia_module_loaded - Verifies the NVIDIA module is loaded in the host OS

check_host_os_version - Verifies the DGX OS version matches the expected value

check_nvsm_status - Verifies the NVSM service is currently active

periodic_cpu_mem_checks#

check_cpu_health - Verifies CPU sockets and cores are present and online

check_dimm_count - Checks that all expected memory DIMMs are present

check_dimm_size - Checks that the size of each memory DIMM matches the expected values

check_memory_swap_size - Checks that the memory swap size matches the expected value

periodic_gpu_nvlink_checks#

check_gpu_pci - Checks that all GPUs are present on the lspci interface and with the correct link width and speed

check_gpu_error - Checks GPUs for ECC errors, retired pages, and throttles present

check_gpu_powerstate - Checks the powerstate for each GPU and compares against an expected value

check_gpu_param - Checks that specified GPU parameters are present and correct for the host

check_nvlink_health - Checks that, for each GPU, the links are active, run at the correct speed and at full bandwidth, have completed fabric registration, and belong to the same NVLink domain and partition

check_gpu_topology - Checks that there are no issues with the p2p topology within the node

check_gpu_telemetry - Checks that various sensors can be successfully read from the GPU via nvidia-smi

check_gpu_power_limit - Checks that the power limit is correct for each GPU

check_nvidia_inforom_ver - Checks that the inforom version is correct for each GPU

check_gpu_clock_info - Checks that the maximum clock speed is correct for each GPU

check_remapped_row - Checks if any remapped row events have occurred

periodic_network_checks#

check_ib_ber_and_ro - Checks that the PCI_WR_ORDERING field is set to relaxed, and checks the bit error rate of the ConnectX devices using mlxlink

check_ib_port_rcv_errors - Checks InfiniBand device ports for RCV errors

check_ib_cables - Checks the cable info using mlxcables

check_bf3_speed - Validates that the BlueField devices are operating at the correct speed and that the proper number of devices are in the “Up” state. This check will run, but never fail.

periodic_storage_checks#

check_pex_switch_health - Checks that the PEX switches are present, have the correct PCIe link speed and width, and the downstream devices have enumerated to lspci

check_cx_config - Checks that the ConnectX devices have the correct PCIe link speed and width via lspci and ACS config via setpci

check_nvme_health - Checks that the PCIe link speed and width of each NVMe device matches the expected value

check_storage_dir - Checks that the host has functional access to the home storage

check_storage_util - Checks that the used local storage on the host is below a given threshold

periodic_error_checks#

Checks journald for machine check events, Xid, AER, CPER, I/O, GPU fell off the bus, and other generic errors.
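
To illustrate the kind of scan this alarm performs, the sketch below classifies log lines by error family. It is not the alarm's implementation: the function name `classify_journal_lines` and the regular expressions are assumptions chosen for illustration, and the patterns the real check uses may differ.

```python
import re

# Illustrative patterns only; the alarm's real pattern list is not documented here.
ERROR_PATTERNS = {
    "xid": re.compile(r"NVRM: Xid"),
    "machine_check": re.compile(r"Machine check events logged"),
    "aer": re.compile(r"\bAER:"),
    "gpu_fell_off_bus": re.compile(r"GPU has fallen off the bus", re.IGNORECASE),
}


def classify_journal_lines(lines):
    """Group log lines by the error families they match (a line may match several)."""
    hits = {}
    for line in lines:
        for name, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(name, []).append(line)
    return hits


if __name__ == "__main__":
    sample = [
        "kernel: NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus.",
        "systemd[1]: Started Session 12 of user root.",
    ]
    for family, matched in classify_journal_lines(sample).items():
        print(family, len(matched))
```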

Hourly Checks#

nfs_mounts#

Verifies required mount points.

daily_informational#

Checks for which the severity and remediation may not be critical. This alarm will only be triggered once per day, and results may be viewed in the resulting runbook run.

check_sel_event - Reads the SEL events from the BMC and ensures none are asserted

check_dgx_os_version - Verifies the DGX OS version matches the expected value

check_gpu_vbios_ver - Checks the VBIOS version of the GPUs and compares against an expected value

check_nvme_fw_ver - Checks that the FW version for each NVMe matches an expected value

check_kernel_commandline_opt - Verifies the specified kernel option(s) is present in the current kernel’s boot parameters

check_host_bios_ver - Verifies the system’s BIOS version

check_kernel_ver - Verifies the current version of the Linux kernel

check_host_package_versions - Queries the installed packages on the host

nv_container_cli_info - Retrieves information about the NVIDIA container CLI (driver and devices)

Daily Checks#

cpu_stepping#

Checks that the CPU stepping parameter is correct for each CPU.

numa_node_count#

Checks that the correct number of non-uniform memory access (NUMA) nodes is configured with the CPU cores.

NVIDIA Mission Control autonomous hardware recovery Alarm Configuration#

There are several components to an alarm, with the key pieces being the Resource Query, Fire Query, Resolve Query, Check Interval and Automation. An example configuration is shown in the following figure.

Alarm configuration example

Resource Query#

The resource query allows you to customize the resources (hosts, pods, GPUs) on which the checks will be performed. In the preceding example, the query `hosts | rack_name =~ ".*"` restricts the alarm to hosts that have a value set for the “rack_name” tag.
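
The effect of a tag filter such as `rack_name =~ ".*"` can be mimicked in plain Python. This is a sketch under stated assumptions: the host dictionaries and the `select_hosts` helper are hypothetical, and it assumes the pattern must match the whole tag value, which may not be how the query language anchors its regular expressions.

```python
import re


def select_hosts(hosts, tag, pattern):
    """Return names of hosts that carry `tag` and whose value matches `pattern`.

    Hosts without the tag are skipped, mirroring the behavior described above.
    Full-match anchoring is an assumption made for this sketch.
    """
    regex = re.compile(pattern)
    return [
        host["name"]
        for host in hosts
        if tag in host["tags"] and regex.fullmatch(host["tags"][tag])
    ]


if __name__ == "__main__":
    inventory = [
        {"name": "node01", "tags": {"rack_name": "A01"}},
        {"name": "node02", "tags": {}},  # no rack_name tag, so never selected
        {"name": "node03", "tags": {"rack_name": "B05"}},
    ]
    print(select_hosts(inventory, "rack_name", ".*"))
```

With this inventory, `".*"` selects every host that has a rack_name at all, while a narrower pattern such as `"A.*"` limits the alarm to racks whose name starts with A.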

Fire Query#

The fire query is a condition that, when true, will cause the alarm to begin firing. It will be run at each interval.

Resolve Query#

Similar to the fire query, the Resolve query is a condition that will resolve the alarm when true. Resolving an alarm will cause firing to cease and the state to change to Resolved.

Check Interval#

The interval at which the fire and resolve queries are checked.

Automation#

You may use the Automation settings to have Runbooks triggered when an alarm fires, and you may also customize the informational messages that are displayed in the Alarm’s logs.

Alarm States#

If an Alarm has triggered, it will be in one of the following three states:

Triggered#

This state means the alarm is currently firing. Any automation (break/fix) will subsequently be invoked to remediate any issues, potentially resolving the alarm. Alternatively, the user can cancel the alarm by clicking “Cancel alarm”.

Clicking into the triggering alarm will give you more details on what caused the alarm, metadata and resources relating to the alarm, and will also allow you to view log output from the check itself.

Alarm states screenshot

Resolved#

When the resolve query of an alarm evaluates to true for a firing alarm, the status changes to Resolved. Runbooks triggered by Automation invoke break/fix operations, which should be configured so that a successful remediation results in a resolved alarm.

Canceled#

When a user cancels an alarm from the dashboard, or from the triggered alarm itself, its state will become Canceled. Also, if an alarm configuration is changed for an alarm in the Triggered state, it will be canceled since it was triggered against a defunct configuration.
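
The three states and their transitions can be modeled as a small state machine. The sketch below is an interpretation of the descriptions above, not product code: the `IDLE` state name, the `next_state` function, and the choice to let a cancellation take precedence over a resolve are all assumptions.

```python
from enum import Enum


class AlarmState(Enum):
    IDLE = "Idle"  # hypothetical name for an alarm that has not triggered yet
    TRIGGERED = "Triggered"
    RESOLVED = "Resolved"
    CANCELED = "Canceled"


def next_state(state, fire, resolve, canceled=False, config_changed=False):
    """Evaluate one check interval and return the resulting alarm state."""
    if state is AlarmState.TRIGGERED:
        # A user cancel or a configuration change cancels a firing alarm.
        if canceled or config_changed:
            return AlarmState.CANCELED
        # The resolve query ends the firing state.
        if resolve:
            return AlarmState.RESOLVED
        return AlarmState.TRIGGERED
    # Any non-firing alarm starts firing when the fire query is true.
    if fire:
        return AlarmState.TRIGGERED
    return state
```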

Firmware Upgrades with NVIDIA Mission Control autonomous hardware recovery#

Overview#

NVIDIA Mission Control autonomous hardware recovery provides functionality for upgrading, cycling, and verifying firmware and the corresponding OS within your GB200 racks. The four distinct components for which firmware can be upgraded using this process are:

  • Compute trays

  • Switches

  • Mellanox

  • NVOS

Asynchronous component workflows: Firmware upgrade workflows for different component types (compute tray, switch, Mellanox, and NVOS) run asynchronously; powershelf firmware upgrades are not included in asynchronous execution yet. You can schedule, run, and monitor each component workflow independently, and multiple firmware runs may be in progress for different components at the same time. The subsections below describe each workflow; use run history and Firmware Reports to track status across concurrent upgrades.

The workflow invocation is performed via autonomous hardware recovery’s Runbooks. To view all Firmware upgrade related runbooks, you may search by the FIRMWARE_UPGRADE label as shown below.

_images/fw_image12.png

Applying the filter narrows the Runbooks view to the firmware upgrade runbooks. In general, you will use these runbooks to upgrade the compute trays, switches, or switch NVOS by following the steps in the next section.

_images/fw_image13.png

Preparing the Upgrade#

In the firmware upgrade runbook, nvfwupd and nv action are used with the upgrade package file to determine the versions needing upgrade. You will need to obtain the firmware package and the Source of Truth (SOT) JSON file, the latter of which defines the reference settings used for validation.

The SOT JSON may be obtained from the NVIS team, whereas the firmware packages may be downloaded from the NVIDIA Application Hub.

Source of Truth Snippet (truncated)#

{
    "ProductName": "DGXGB200",
    "SOTUniqueID": "1",
    "SOTType": "Release",
    "Milestones": [
        {
            "TemplateVersion": "0.4",
            "Id": "272d4781-ed44-4579-b390-100aefcebcde",
            "Name": "1.0.00GA",
            "State": "Onboarded",
            "ReleaseDate": "2025-04-17T13:29:24.682615",
            "ReleaseCustomers": [],
            "Packages": [
                {
                    "Name": "GB200NVL72_Compute_Firmware",
                    "Type": "Tarball",
                    "Title": "Compute Firmware",
                    "Description": "Compute Firmware Package"
                },
                {
                    "Name": "GB200NVL72_Switch_Firmware",
                    "Type": "Tarball",
                    "Title": "Switch Tray Firmware",
                    "Description": "Switch Tray Firmware Package"
                },
                {
                    "Name": "GB200NVL72_drivers_cuda",
                    "Type": "Tarball",
                    "Title": "Drivers",
                    "Description": "Drivers Package"
                },
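
As a quick way to inspect such a file, the following sketch lists the packages declared per milestone. It assumes only the fields visible in the truncated snippet above (`Milestones`, `Name`, `Packages`); `SAMPLE_SOT` is a closed-off miniature written for this example, and a real SOT file contains more fields and entries. The helper name `list_packages` is hypothetical.

```python
import json

# Miniature stand-in for a real SOT file, limited to fields shown in the snippet.
SAMPLE_SOT = """
{
    "ProductName": "DGXGB200",
    "Milestones": [
        {
            "Name": "1.0.00GA",
            "Packages": [
                {"Name": "GB200NVL72_Compute_Firmware", "Type": "Tarball"},
                {"Name": "GB200NVL72_Switch_Firmware", "Type": "Tarball"},
                {"Name": "GB200NVL72_drivers_cuda", "Type": "Tarball"}
            ]
        }
    ]
}
"""


def list_packages(sot_text):
    """Return (milestone name, package name) pairs found in an SOT document."""
    sot = json.loads(sot_text)
    return [
        (milestone["Name"], package["Name"])
        for milestone in sot.get("Milestones", [])
        for package in milestone.get("Packages", [])
    ]


if __name__ == "__main__":
    for milestone, package in list_packages(SAMPLE_SOT):
        print(milestone, package)
```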

Ordering Constraints#

The runbooks used to perform upgrades automatically determine the ordering of applicable packages and will AUX cycle nodes when appropriate. The following paragraphs describe this ordering for informational purposes only; no action is required from the user.

For the compute tray, much older firmware packages require the BMC to be upgraded prior to HMC, but this is no longer the case with modern firmware. Both BMC and HMC can be upgraded within a single AC cycle.

For the switch tray, the BMC firmware should be updated first. SBIOS and CPLD packages may be upgraded within the same AC cycle. Our runbooks take care of this ordering for you.

Coordination with other jobs#

To prevent other tasks from utilizing the nodes undergoing the upgrade process, AHR will do two things:

  1. It will tag the nodes with a special maintenance tag.

  2. Subsequently, it will drain the node via Slurm on BCM.

In particular, this will prevent other upgrade processes from interfering, and will also bypass AHR’s break/fix workflow. The tag is automatically removed upon successful completion of the upgrade, and the node is undrained. If there is an issue during the upgrade process, the tag and drain state remain in place for further investigation. In that case, review the failures and, once the nodes are deemed healthy, undrain them and remove the maintenance tag.

If you need to remove the maintenance tags after the firmware upgrade process encounters an issue, troubleshooting has completed, or even after unsuccessful breakfix triage, you may do so using the CLEAR_MAINTENANCE_TAGS runbook, specifying resource_tag (e.g. rack_name) and resource_value (e.g. B05) as parameters.

Performing the Upgrade#

Entry-point runbook: FIRMWARE_UPGRADE

The FIRMWARE_UPGRADE runbook is the single entry point for upgrading both compute and switch firmware (and optionally Mellanox/NVOS when the corresponding paths are provided). It performs resource scoping, maintenance tagging, drain, and then invokes the appropriate child runbooks: FIRMWARE_UPGRADE_COMPUTE when compute packages are specified, and FIRMWARE_UPGRADE_SWITCH when switch packages are specified. After upgrades, it runs AUX cycle (if enabled), undrains nodes, removes the maintenance tag, and runs firmware validation. Use this runbook for combined compute-and-switch upgrades or for compute-only or switch-only upgrades by setting only the relevant package path(s).

This section details the steps and parameters for each component. The runbooks take care of invoking the process, performing the upgrade, calling any ancillary runbooks, and ultimately performing validation of the upgraded component once complete.
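
The way the parent runbook chooses its child runbooks can be pictured as a simple mapping from parameters to children. The sketch below is inferred from the parameter descriptions in this section and is not the runbook's actual logic; the `child_runbooks` helper is hypothetical, and it assumes that a missing or blank value means "not set".

```python
def child_runbooks(params):
    """Return the child runbooks FIRMWARE_UPGRADE would invoke for `params`.

    Inferred from the documentation for illustration; blank values count as unset.
    """

    def is_set(key):
        value = params.get(key) or ""
        return bool(value.strip())

    children = []
    # Compute packages and Mellanox (ConnectX/BlueField) go through the compute child.
    if any(is_set(k) for k in ("FWPKG_DIR_PATH_COMPUTE", "FWPKG_CX_FILE_PATH", "FWPKG_BF_FILE_PATH")):
        children.append("FIRMWARE_UPGRADE_COMPUTE")
    # Switch tray packages and the NVOS image go through the switch child.
    if any(is_set(k) for k in ("FWPKG_DIR_PATH_SWITCH", "NVOS_FILE_PATH")):
        children.append("FIRMWARE_UPGRADE_SWITCH")
    return children


if __name__ == "__main__":
    # The directory below is a made-up example path.
    compute_only = {
        "resource_tag": "rack_name",
        "resource_value": "A01",
        "FW_SOURCE_JSON_PATH": "NA",
        "FWPKG_DIR_PATH_COMPUTE": "/cm/shared/apps/autonomous-hardware-recovery/var/firmware/example",
        "FWPKG_DIR_PATH_SWITCH": "",
    }
    print(child_runbooks(compute_only))
```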

_images/firmware_upgrade_parent_flowchart.png

Figure: Firmware Upgrade (high-level). Detailed flows: Compute Tray (fw_image01), Switch Tray (fw_image02), Switch NVOS (fw_image03), Mellanox (mellanox_fw_upgrade_flowchart).

Compute Tray#

_images/fw_image01.png

Prerequisites#

  1. Download the firmware upgrade packages.

  2. Obtain the golden configuration JSON file from the NVIS team.

  3. On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.

    • Upgrade packages use the following naming convention:

      • nvfw_dgx*

      • nvfw_hgx*

Upgrading#

From the Runbooks view, select “Create Run” for the FIRMWARE_UPGRADE runbook (entry point for compute and switch). To upgrade compute firmware, provide the following parameters. The parent runbook invokes FIRMWARE_UPGRADE_COMPUTE when FWPKG_DIR_PATH_COMPUTE is set.

Required parameters:

  1. resource_tag – Tag for flex resource query (use rack_name for rack filtering).

  2. resource_value – Value for flex resource query (e.g., rack name or regex: A01, m06|m07).

  3. FW_SOURCE_JSON_PATH – Path to the golden configuration JSON file included in the firmware package. This file defines the reference settings used for validation. If the file is not available, set this parameter to NA.

  4. FWPKG_DIR_PATH_COMPUTE – Directory containing the compute firmware upgrade packages (nvfw_DGX*, nvfw_HGX*). Set to empty if you are only upgrading switch.

Optional parameters (commonly used):

  • FWPKG_DIR_PATH_SWITCH – Directory of the switch firmware packages. Set if you also want to upgrade switch in the same run; leave empty for compute-only.

  • NVOS_FILE_PATH – Full path to the NVOS bin file (if upgrading switch NVOS).

  • FWPKG_CX_FILE_PATH – Full path to the ConnectX firmware package. Set this (and FWPKG_BF_FILE_PATH) if you also need to upgrade Mellanox (ConnectX/BlueField) in the same run as compute.

  • FWPKG_BF_FILE_PATH – Full path to the BlueField firmware package. Set this (and FWPKG_CX_FILE_PATH) if you also need to upgrade Mellanox in the same run as compute.

  • FORCE_UPGRADE – When set to true, forces the firmware upgrade to proceed regardless of version checks; false follows standard upgrade rules.

  • IGNORE_LIST – List of nodes to exclude from scope (e.g., node01, or “none”).

  • AUX_CYCLE – When true (default), performs an AUX power cycle after upgrade; set to false to skip.

When you need to upgrade compute and Mellanox together, run FIRMWARE_UPGRADE with FWPKG_DIR_PATH_COMPUTE set and FWPKG_CX_FILE_PATH and FWPKG_BF_FILE_PATH set to the ConnectX and BlueField package paths. When you need to upgrade only compute firmware, run FIRMWARE_UPGRADE with FWPKG_DIR_PATH_COMPUTE set and FWPKG_DIR_PATH_SWITCH left empty. For advanced scenarios you may run the BreakFix_Firmware_Upgrade_Compute_nvfwupd runbook directly (e.g., single rack).

Switch Tray#

_images/fw_image02.png

Prerequisites#

  1. Download the firmware upgrade packages.

  2. Obtain the golden configuration JSON file from the NVIS team.

  3. On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.

    • Upgrade packages use the following naming convention:

      • *nvfw*_0004_*

      • *nvfw*_0006_*

      • *nvfw*_0007_*
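
Before starting a run, it can help to confirm that the files staged in the firmware directory follow this naming convention. The sketch below applies the three patterns with Python's `fnmatch` module; the file names in the example are invented for illustration and do not represent real package names.

```python
from fnmatch import fnmatchcase

# Patterns taken from the naming convention listed above.
SWITCH_PATTERNS = ("*nvfw*_0004_*", "*nvfw*_0006_*", "*nvfw*_0007_*")


def switch_packages(filenames):
    """Return the file names that match one of the switch package patterns."""
    return [
        name
        for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in SWITCH_PATTERNS)
    ]


if __name__ == "__main__":
    staged = [
        "nvfw_example_0004_1.0.fwpkg",  # invented name that follows the convention
        "nvfw_example_0009_1.0.fwpkg",  # invented name that does not
        "README.txt",
    ]
    print(switch_packages(staged))
```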

Upgrading#

When performing this upgrade, it should be noted that all of the rack’s 18 nodes will be drained from the Slurm pool, tagged with our maintenance tag, and subsequently AUX cycled.

To begin, from the Runbooks view, select “Create Run” for the FIRMWARE_UPGRADE runbook (entry point). To upgrade switch firmware, set FWPKG_DIR_PATH_SWITCH (and optionally NVOS_FILE_PATH for NVOS). The parent runbook invokes FIRMWARE_UPGRADE_SWITCH when FWPKG_DIR_PATH_SWITCH is set. Provide the same resource_tag, resource_value, and FW_SOURCE_JSON_PATH as for compute.

Required parameters (when upgrading switch):

  1. resource_tag – Tag for flex resource query (use rack_name for rack filtering).

  2. resource_value – Value for flex resource query (e.g., A01).

  3. FW_SOURCE_JSON_PATH – Path to the golden configuration JSON file provided within the firmware package. If the file is not available, set this parameter to NA.

  4. FWPKG_DIR_PATH_SWITCH – Directory of the switch firmware upgrade packages. Set to empty if you are only upgrading compute.

Optional: NVOS_FILE_PATH – Full path to the NVOS bin file if upgrading switch NVOS. FORCE_UPGRADE, AUX_CYCLE – same as for compute.

When you need to upgrade only switch firmware, run FIRMWARE_UPGRADE with FWPKG_DIR_PATH_SWITCH set and FWPKG_DIR_PATH_COMPUTE left empty. The FIRMWARE_UPGRADE_SWITCH child runbook is invoked by the parent and is not intended to be run directly. For advanced scenarios you may run the BreakFix_Switch_BMC_Upgrade_Nvfwupd runbook directly.

Switch NVOS#

_images/fw_image03.png

Prerequisites#

  1. Download the NVOS upgrade package bin file.

  2. Obtain the golden configuration JSON file from the NVIS team.

  3. On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.

Upgrading#

Use the FIRMWARE_UPGRADE runbook (entry point), same as for Switch Tray and Compute. To upgrade switch NVOS, set NVOS_FILE_PATH and optionally FWPKG_DIR_PATH_SWITCH (if also upgrading switch tray firmware). The parent runbook invokes FIRMWARE_UPGRADE_SWITCH, which performs both switch tray firmware and NVOS upgrade when NVOS_FILE_PATH is provided.

Required parameters:

  1. resource_tag – Tag for flex resource query (use rack_name for rack filtering).

  2. resource_value – Value for flex resource query (e.g., A01).

  3. FW_SOURCE_JSON_PATH – Path to the golden configuration JSON file provided within the firmware package. If the file is not available, set this parameter to NA.

  4. NVOS_FILE_PATH – Full path to the NVOS bin file including the file itself.

Optional: FWPKG_DIR_PATH_SWITCH – Set if you are also upgrading switch tray firmware in the same run; leave empty if only upgrading NVOS.

Alternatively, run BreakFix_NVOS_Upgrade directly with resource_value, NVOS_FILE_PATH, FW_SOURCE_JSON_PATH, and INPUT_SWITCH (from flex query or parent context).

Mellanox#

_images/mellanox_fw_upgrade_flowchart.png

Figure: Mellanox FW Upgrade (ConnectX and BlueField) — from BreakFix_Firmware_Upgrade_Mellanox runbook.

Prerequisites#

  1. Download the BlueField and ConnectX upgrade packages.

  2. Obtain the golden configuration JSON file from the NVIS team.

  3. On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.

Upgrading#

Use the FIRMWARE_UPGRADE runbook (entry point), same as for Compute and Switch. To upgrade Mellanox (ConnectX and BlueField) firmware, set FWPKG_CX_FILE_PATH and FWPKG_BF_FILE_PATH; you can set FWPKG_DIR_PATH_COMPUTE to a directory (or leave empty if only Mellanox). The parent runbook invokes FIRMWARE_UPGRADE_COMPUTE, which performs ConnectX and BlueField firmware upgrade when these paths are provided.

Required parameters:

  1. resource_tag – Tag for flex resource query (use rack_name for rack filtering).

  2. resource_value – Value for flex resource query (e.g., A01, or regex).

  3. FW_SOURCE_JSON_PATH – Path to the golden configuration JSON file provided within the firmware package. If the file is not available, set this parameter to NA.

  4. FWPKG_CX_FILE_PATH – Full path to the ConnectX firmware package file.

  5. FWPKG_BF_FILE_PATH – Full path to the BlueField firmware package file (either .bin or .bfb package).

Optional: FWPKG_DIR_PATH_COMPUTE – Directory of compute firmware packages if you are also upgrading compute tray in the same run; leave empty for Mellanox-only.

Alternatively, run BreakFix_Firmware_Upgrade_Mellanox directly with FWPKG_CX_FILE_PATH, FWPKG_BF_FILE_PATH, FW_SOURCE_JSON_PATH, and INPUT_COMPUTE (from flex query or parent context).

BF Firmware Bundle Extraction Guide#

This guide explains how to extract a BF firmware package provided as a BF Bundle (.bfb).

Important:
These steps must be performed on the compute node, as it already has the bfb-tool utility installed.

Prerequisites#

Ensure the following dependency is installed:

sudo apt install -y qemu-user-static

Extracting the BF Bundle

Note: The --bfb argument must use the complete (absolute) path to the .bfb file. Relative paths may cause the extraction to fail.

bfb-tool extract \
  --bfb /path/to/bf-fwbundle-<version>-prod.bfb \
  --opn 900-9D3B6-00CN-P_Ax

Using the Extracted Firmware

After extraction, go to the folder created in /tmp (named after the .bfb file). Inside this folder, open the subfolder corresponding to your OPN (e.g., 900-9D3B6-00CN-P_Ax). In that subfolder, locate the .bin firmware file, which should be used as the input to the runbook.
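
Locating the extracted .bin file can be scripted. The following sketch assumes the extraction folder under /tmp starts with the .bfb file name without its extension, which is one reading of "named after the .bfb file"; adjust the pattern if your folder is named differently. The helper name `find_extracted_bin` is hypothetical.

```python
import glob
from pathlib import Path


def find_extracted_bin(bfb_path, opn, tmp_root="/tmp"):
    """Return the sorted .bin files under <tmp_root>/<bfb name>*/<opn>/.

    The folder naming is an assumption made for this sketch.
    """
    stem = glob.escape(Path(bfb_path).stem)  # file name without the .bfb suffix
    pattern = f"{stem}*/{glob.escape(opn)}/*.bin"
    return sorted(Path(tmp_root).glob(pattern))


if __name__ == "__main__":
    # Placeholder path, matching the placeholder used in the extraction command.
    for candidate in find_extracted_bin("/path/to/bf-fwbundle-<version>-prod.bfb", "900-9D3B6-00CN-P_Ax"):
        print(candidate)
```

If the function returns exactly one path, that path is the natural candidate for FWPKG_BF_FILE_PATH; zero or several results suggest checking the extraction output by hand.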

Powershelf#

GB200 supports firmware upgrades for power shelves (LiteOn and Delta). Use the FIRMWARE_UPGRADE_POWERSHELF runbook. It upgrades PMC (Power Management Controller) first, then PSU, and finally runs SINGLENODE_HEALTHCHECK_FIRMWARE_POWERSHELF for validation.

Prerequisites#

  1. Obtain the PSU and PMC firmware packages and the golden configuration JSON file (from the NVIS team or firmware package).

  2. On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.

Upgrading#

From the Runbooks view, select “Create Run” for the FIRMWARE_UPGRADE_POWERSHELF runbook. Required parameters (per runbook):

  1. resource_tag – Tag for flex resource query (use rack_name for rack filtering).

  2. resource_value – Value for flex resource query (e.g., A01, or regex).

  3. IGNORE_LIST – List of nodes to ignore for baseline tests. Accepted formats: single node (e.g. node01), list (e.g. [“node01”, “node02”]), pipe-delimited (e.g. node01|node02), or “none” if no nodes should be ignored.

  4. PSU_FILE_FULL_PATH – Full path to the directory and file name of the PSU firmware package.

  5. PMC_FILE_FULL_PATH – Full path to the directory and file name of the PMC firmware package.

  6. FW_SOURCE_JSON_PATH – Path to the golden configuration JSON file provided within the firmware package. This file defines the reference settings used for validation.

  7. FORCE_UPGRADE – When set to true, forces the firmware upgrade to proceed regardless of version checks or validation; false follows standard upgrade rules.
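
The accepted IGNORE_LIST formats can be normalized into a plain list of node names. This sketch shows one way to do so; it assumes the list form uses JSON syntax with straight quotes, and the `parse_ignore_list` helper is hypothetical rather than the parser the runbook uses.

```python
import json


def parse_ignore_list(value):
    """Normalize an IGNORE_LIST value into a list of node names."""
    text = value.strip()
    # "none" (or an empty value) means no nodes are ignored.
    if not text or text.lower() == "none":
        return []
    # List form, e.g. ["node01", "node02"]; JSON syntax is assumed here.
    if text.startswith("["):
        return [str(node).strip() for node in json.loads(text)]
    # Single node or pipe-delimited form, e.g. node01|node02.
    return [part.strip() for part in text.split("|") if part.strip()]
```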

The runbook invokes POWERSHELF_COMPONENT_UPGRADE for PMC first, then PSU, and finally SINGLENODE_HEALTHCHECK_FIRMWARE_POWERSHELF to verify the upgrade.

Powershelf firmware upgrade workflow

Figure: Powershelf Firmware Upgrade Workflow – This diagram illustrates the end-to-end Powershelf firmware upgrade process. FIRMWARE_UPGRADE_POWERSHELF obtains the target powershelves via flex query, then invokes POWERSHELF_COMPONENT_UPGRADE for PMC (Power Management Controller) first and PSU next, each with optional exit on failure. The workflow concludes with SINGLENODE_HEALTHCHECK_FIRMWARE_POWERSHELF to validate the upgrade against the golden configuration JSON.

Troubleshooting#

1. Excluding a node from upgrade or runbook cells#

To exclude a particular node, you can add a cell near the start of your runbook (after INPUT_SWITCH is exported) with something like the following:

INPUT_SWITCH | name != "<node_name>" | export("INPUT_SWITCH")

This will export all the resources except <node_name> back into the INPUT_SWITCH variable.

Firmware upgrade includes a series of steps to ensure the nodes are removed from jobs being scheduled, complete the upgrade, and place the nodes back in service. Below are some common issues you may encounter during the firmware upgrade:

2. Nodes not reachable from headnode#

To upgrade firmware, the BMC IP must be accessible from the headnode. The runbook verifies node accessibility and automatically skips unreachable nodes.

Action: Ensure the node is online and accessible from the headnode, then rerun the firmware upgrade runbook.
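
To check reachability from the headnode before rerunning, a simple TCP probe (similar in spirit to `nc -z`) is often enough. The sketch below is an assumption-laden example: the `is_reachable` helper is hypothetical, and the port to probe depends on which BMC service you expect to answer; it is not taken from the runbook.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Documentation-range address and port 443 are placeholders, not real BMC values.
    print(is_reachable("192.0.2.10", 443, timeout=1.0))
```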

_images/fw_image04.png

3. Failed firmware upgrade#

The firmware upgrade failed to complete successfully. Possible causes include failures in the nvfwupd command, NVOS, or the flint command, depending on the package.

Action: Verify logs by clicking on the Output, which includes the command’s stdout and stderr.

_images/fw_image05.png _images/fw_image06.png

Note that the runbook does not automatically undrain or untag maintenance when the firmware upgrade fails. After verifying that the failures are safe to ignore and the nodes are ready to return to the pool, undrain and untag the nodes using the UNDRAIN_AND_UNTAG_RACKS runbook. Provide resource_tag (e.g. rack_name) and resource_value (e.g. B05 or m06|m07) when prompted.

_images/fw_image07.png

4. SSH failures for Switch Upgrade runbooks#

The switch firmware upgrade runs commands via SSH. The runbook dynamically retrieves the user and password from BCM using cmsh commands, which are then used to run the SSH commands.

Action: Verify SSH access to the switch from the headnode using the credentials stored in BCM.

_images/fw_image08.png

5. Failed validation stage#

The last step in every firmware upgrade is validation. The runbook selects a subset of tests to verify upgrade success. Failures may result from upgrade issues, incorrect SOT JSON, or command failures when fetching component versions.

Action: Compare the expected and actual versions in the logs and check for any other errors.
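
When the logs list many components, a small comparison helper makes the mismatches easier to spot. This is a generic sketch: the `version_mismatches` function and the component names in the example are hypothetical, and the expected and actual values would have to be copied from the SOT JSON and the run output.

```python
def version_mismatches(expected, actual):
    """Return {component: (expected, actual)} for every component that differs.

    A component missing from `actual` is reported with an actual value of None.
    """
    mismatches = {}
    for component, want in expected.items():
        got = actual.get(component)
        if got != want:
            mismatches[component] = (want, got)
    return mismatches


if __name__ == "__main__":
    # Component names and versions below are invented for illustration.
    expected = {"component_a": "1.2.3", "component_b": "4.5.6"}
    actual = {"component_a": "1.2.3", "component_b": "4.5.0"}
    print(version_mismatches(expected, actual))
```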

Notes#

  • Netcat is used to check if nodes are back online after a reboot.

  • The runbook cannot exclude a subset of nodes. This means that if any nodes are down, the runbook ignores those nodes and upgrades the others.

  • Multiple racks cannot be upgraded simultaneously.

  • If nvfwupd does not perform an upgrade (for example, because the component is already at the specified version) and FORCE_UPGRADE is not set to true, the runbook will exit after untagging from maintenance.

NVIDIA Mission Control autonomous hardware recovery Break/Fix Workflow#

Break/Fix Introduction#

NVIDIA Mission Control autonomous hardware recovery provides automated break/fix workflows to handle tray failures for GB200. These workflows execute a series of diagnostic steps to determine the cause of the failure, take the necessary repair steps, and create support tickets for issues that cannot be resolved automatically.

The automated break/fix workflow is designed to efficiently diagnose and remediate issues, with clear paths for different failure scenarios and comprehensive validation to ensure systems are properly restored to service.

_images/image93.png

Figure: Compute Break/Fix Workflow (high-level) — Trigger, Triage, Validation, then Return to service or Run diag / Support ticket.

_images/image94.png

Figure: Compute Break/Fix Workflow (detailed) — Explains each phase: 1. Triage (connectivity check, SRAM UC check, leak detection, power cycle, GPU recovery), 2. Verification (BREAKFIX_COMPUTE_TRAY_VALIDATION, tests passed?), 3. Run diag (BREAKFIX_DIAG_DUMP, support ticket creation). All paths from triage converge on validation; failed validation leads to diagnostic dump and support ticket; success leads to return to service.

Key Features#

  • Automatic Detection: Identifies drained nodes without manual intervention

  • Intelligent Triage: Routes to appropriate diagnostic workflows based on failure symptoms

  • Comprehensive Diagnostics: Performs thorough hardware and software checks

  • Automated Remediation: Attempts to resolve issues without human intervention when possible

  • Detailed Reporting: Provides comprehensive logs for RMA or further troubleshooting

Entrypoint of Break/Fix Workflow#

A centralized automated break/fix interface has been established to facilitate streamlined diagnostics and remediation. This unified entry point provides comprehensive access to the break/fix framework, enabling efficient navigation and implementation of all remediation procedures.

To access the break/fix interface:

  • Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.

  • Locate the “BREAKFIX_TRIGGER” runbook through the search functionality

  • Upon selecting the “BREAKFIX_TRIGGER” runbook, you will be presented with the interface showing the workflow for automated break/fix.

    _images/image95.png

Run Break/Fix Workflow#

The BREAKFIX_TRIGGER is a time-triggered runbook that automatically runs every 5 minutes. When executed, it:

  • Gets drained nodes from BCM and proceeds only if the drain reason is one of the allowed reasons. Otherwise the runbook exits with no action.

  • Allowed drain reasons (runbook exits if no nodes have these reasons): Excluded by ARE, Prolog/Epilog error, Kill task failed, Duplicate jobid, Low RealMemory, Drained by CMDaemon, Test breakfix flow, or Not responding.

  • Processes one node per execution cycle from the drained node pool; applies a maintenance tag in AHR to prevent duplicate processing, and ensures drained nodes are handled sequentially across multiple runs.

  • Verifies the AHR agent is connected and accepts commands before proceeding; exits if the agent does not accept commands.

  • Gets the drain reason from BCM, updates it with “(AHR in-progress)”, and re-drains the node via BCM with the updated reason.

  • Invokes BREAKFIX_COMPUTE_TRAY_TRIAGE(INPUT, DRAIN_REASON) asynchronously to perform triage on the selected compute node.

  • Requires no manual intervention to start the process.

The BREAKFIX_TRIGGER runbook is shared across GB300, GB200, and B200. It has no user-configurable scope parameters; it discovers drained nodes from BCM automatically.
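
The node selection described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the runbook's actual implementation: the `DrainedNode` record and the `is_allowed`, `select_node`, and `mark_in_progress` helpers are hypothetical names, and matching drain reasons by case-insensitive substring is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Allowed drain reasons, copied from the list above.
ALLOWED_DRAIN_REASONS = (
    "Excluded by ARE",
    "Prolog/Epilog error",
    "Kill task failed",
    "Duplicate jobid",
    "Low RealMemory",
    "Drained by CMDaemon",
    "Test breakfix flow",
    "Not responding",
)

IN_PROGRESS_MARKER = "(AHR in-progress)"


@dataclass
class DrainedNode:
    """Hypothetical record for a drained node; not a BCM or AHR type."""
    name: str
    drain_reason: str
    in_maintenance: bool = False


def is_allowed(reason: str) -> bool:
    # Assumption: a reason qualifies if it contains one of the allowed strings.
    lowered = reason.lower()
    return any(allowed.lower() in lowered for allowed in ALLOWED_DRAIN_REASONS)


def select_node(nodes: List[DrainedNode]) -> Optional[DrainedNode]:
    """Return the first eligible node, or None if the run should exit."""
    for node in nodes:
        if node.in_maintenance:
            continue  # already being processed by an earlier run
        if is_allowed(node.drain_reason):
            return node
    return None


def mark_in_progress(reason: str) -> str:
    """Append the in-progress marker once, leaving the original reason intact."""
    if IN_PROGRESS_MARKER in reason:
        return reason
    return f"{reason} {IN_PROGRESS_MARKER}"
```

Returning a single node per call mirrors the one-node-per-run behavior: nodes that are skipped because of the maintenance tag are left for the runs that already own them.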

You can see the time trigger settings on the right side of the runbook under Triggers, which shows that the trigger is currently enabled and runs every 5 minutes.

_images/image96.png

You can also manually trigger the workflow:

  • Navigate to the “BREAKFIX_TRIGGER” runbook in the Runbooks section

  • Select the “Create Run” button positioned in the top right corner of the interface

    _images/image97.png
  • After initiating the process, a confirmation dialog will appear with a “View Run” link

  • Selecting this link will redirect you to a page displaying comprehensive job status and details

The automated nature of this workflow ensures that system issues are addressed promptly without requiring constant monitoring or manual intervention.

Break/Fix Workflow Components#

The break/fix system consists of several key components that work together to diagnose and remediate issues with compute trays. The following is a detailed explanation of each component:

BREAKFIX_TRIGGER#

The entry-point runbook (shared for GB300, GB200, and B200 compute break/fix) that:

  • Runs automatically every 5 minutes via time trigger.

  • Gets drained nodes from BCM and only proceeds if the drain reason is one of the allowed reasons: Excluded by ARE, Prolog/Epilog error, Kill task failed, Duplicate jobid, Low RealMemory, Drained by CMDaemon, Test breakfix flow, or Not responding.

  • Processes one node per run: selects a drained node not already in maintenance, verifies AHR agent accepts commands, sets the maintenance tag, gets and updates the drain reason with “(AHR in-progress)”, re-drains via BCM, then invokes BREAKFIX_COMPUTE_TRAY_TRIAGE(INPUT, DRAIN_REASON) asynchronously.

  • Has no user-facing parameters; node discovery and scope are driven by BCM drained state.

BREAKFIX_COMPUTE_TRAY_TRIAGE#

This runbook is automatically triggered by BREAKFIX_TRIGGER and performs comprehensive triage on drained compute trays.

_images/image98.png

The triage follows one of two main workflows, depending on whether the node is responsive:

Workflow for Unresponsive Compute Nodes#
  1. Initial Assessment

    • Tests connectivity using ping to check if compute nodes are responsive or unresponsive

    • Identifies nodes that are already unresponsive and require recovery

  2. Leak Detection

    • Checks if any leaking is reported through BCM

    • Creates Support ticket immediately for any nodes with detected leaking (if opted in to support ticket service)

  3. Recovery Process for Non-Leaking Nodes

    • Initiates power cycle for nodes without leaking issues

    • Waits and checks if hosts come back online

    • Waits until the AHR agent is connected to confirm successful recovery

  4. Failure Handling

    • Creates Support ticket for hosts that fail to start up (if opted in to support ticket service)

  5. Validation

Workflow for Responsive Compute Nodes#
  1. GPU Recovery Assessment

GPU_RECOVERY#

This specialized diagnostic runbook is automatically invoked by BREAKFIX_COMPUTE_TRAY_TRIAGE when GPU-related issues are detected.

_images/image99.png

The runbook performs the following steps:

  1. Verification and Assessment

    • Verifies the node is still drained from BCM

    • Categorizes recovery actions (Reboot, Reset, or None)

  2. Recovery Actions Based on Type

    • For Reboot Action:

      • Reboots the node requiring GPU reboot

      • Waits for host to come back online

      • Automatically runs BREAKFIX_COMPUTE_TRAY_VALIDATION if host is up

      • Creates Support ticket for hosts that fail to start up (if opted in to support ticket service)

    • For Reset Action:

    • For No Action Required:

BREAKFIX_COMPUTE_TRAY_VALIDATION#

This runbook is automatically executed after remediation actions to validate system health as part of the automated workflow following triage and recovery operations.

_images/image100.png

The validation proceeds as follows:

  1. Comprehensive Testing

    • Runs testing suites to validate the compute tray

    • Executes HPL test (HPL uses mpirun for single-node execution instead of Slurm in break/fix scenarios)

    • Runs AHR prolog script to prevent undraining of nodes that are still failing prolog checks, since undraining will result in them being drained again at the next Slurm invocation

  2. Result Handling

    • For failed tests, automatically runs BREAKFIX_DIAG_DUMP for detailed diagnostics

    • For passed tests, undrains/untags the host to return it to service

BREAKFIX_DIAG_DUMP#

This runbook is automatically triggered when validation tests fail, collecting comprehensive diagnostic information for support ticket creation.

_images/image101.png

The runbook performs the following steps:

  • Runs NVSSVT (NVIDIA System Software Validation Toolkit)

  • Collects NVSM (NVIDIA System Management) health dumps

  • Executes EUD (End User Diagnostics)

  • Runs Partnerdiag if necessary

  • Creates a consolidated diagnostic log dump package

  • Generates a Support ticket that includes the diagnostic log (if opted in to support ticket service)

Prerequisites for EUD and Partnerdiag:

  • EUD: Binary must be installed on every compute node for execution

  • Partnerdiag: Binary must be installed on the head node under path /cm/shared/partnerdiag

  • If these binaries are not properly installed, EUD and Partnerdiag will be skipped during diagnostic collection
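
The skip behavior for missing binaries can be illustrated with a short Python sketch. The `plan_diagnostics` function and its `eud_binary` argument are hypothetical (the text does not name the EUD install path), and treating a non-empty `/cm/shared/partnerdiag` directory as "installed" is an assumption. For brevity the sketch checks both paths locally, although the text places EUD on the compute nodes and Partnerdiag on the head node.

```python
from pathlib import Path
from typing import Dict

# Head-node location named in the prerequisites above.
PARTNERDIAG_DIR = Path("/cm/shared/partnerdiag")


def plan_diagnostics(eud_binary: Path,
                     partnerdiag_dir: Path = PARTNERDIAG_DIR) -> Dict[str, bool]:
    """Report which optional diagnostics could run; missing ones are skipped."""
    # Assumption: Partnerdiag counts as installed if its directory is non-empty.
    partnerdiag_ready = partnerdiag_dir.is_dir() and any(partnerdiag_dir.iterdir())
    return {
        "EUD": eud_binary.is_file(),
        "Partnerdiag": partnerdiag_ready,
    }
```

A result of `False` for either entry corresponds to that diagnostic being skipped during collection rather than to a failure of the runbook.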

View Break/Fix Result#

Users can monitor break/fix operations and determine outcomes through multiple methods:

Accessing Break/Fix Results#

Via Runbook Execution View:

  1. Click “Runs” in the upper left corner

  2. Filter by “BREAKFIX_TRIGGER” to see all break/fix executions

  3. Select a specific run to view detailed execution flow

    _images/image116.png

Via Resource Run History:

  1. Navigate to “Resources” in the left panel and search for the resource (e.g., “b06-p1-dgx-06-c05”)

    _images/image117.png
  2. Click on the resource name

    _images/image118.png
  3. View the “Run History” page which displays all runbooks the resource participated in, with execution timestamps and status

    _images/image119.png
  4. Filter or search for BREAKFIX related runbooks to see the complete history of remediation attempts for that specific node

    _images/image120.png

Understanding Break/Fix Outcomes#

Successful Recovery Indicators:

  • Node status changes from “DRAINED” to “IDLE” or “ALLOCATED” in BCM

  • Maintenance tag is removed from the node

  • BREAKFIX_COMPUTE_TRAY_VALIDATION shows “PASSED” status

  • Node is automatically returned to service

Failed Recovery Indicators:

  • Node remains in “DRAINED” state

  • Support ticket is automatically created (if opted in to support ticket service)

  • BREAKFIX_DIAG_DUMP execution indicates diagnostic collection

  • Maintenance tag remains on the node

  • Drain reason is updated to add “(AHR complete)” to indicate AHR processing has finished. Note that “(AHR in-progress)” indicates Break/Fix is still processing the node.
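
When scanning many drained nodes, the two markers can be read programmatically from the drain reason. The following Python sketch is illustrative only: `ahr_state` is a hypothetical helper, and giving "(AHR complete)" precedence when both markers are present is an assumption, since the text does not say whether the in-progress marker is removed on completion.

```python
IN_PROGRESS = "(AHR in-progress)"
COMPLETE = "(AHR complete)"


def ahr_state(drain_reason: str) -> str:
    """Classify a drain reason by the AHR marker it carries."""
    # Assumption: the completion marker wins if both markers are present.
    if COMPLETE in drain_reason:
        return "complete"
    if IN_PROGRESS in drain_reason:
        return "in-progress"
    return "not-processed"
```

For example, `ahr_state("Not responding (AHR in-progress)")` returns `"in-progress"`, which indicates that Break/Fix is still working on the node.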

Determining Recovery Path#

GPU Recovery Path:

  • Look for GPU_RECOVERY runbook execution in the workflow

  • Check if GPU reboot or reset actions were performed

  • Validation results indicate GPU functionality restoration

Power Cycle Recovery Path:

  • BREAKFIX_COMPUTE_TRAY_TRIAGE shows auxiliary power cycle execution

  • Node connectivity tests show successful recovery

Monitoring Ongoing Operations#

  • Break/fix operations run every 5 minutes automatically

  • Check the “maintenance” tag to see which nodes are currently being processed

  • Review recent BREAKFIX_TRIGGER executions to track system-wide break/fix operations

NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA#

Break/Fix Post RMA Introduction#

The NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA workflow automates the process of bringing hardware components back into service after a Return Merchandise Authorization (RMA) replacement. This workflow ensures that replaced hardware is properly configured, firmware is updated to the correct versions, and the component is thoroughly validated before returning to production.

_images/image102.png

Figure: Break/Fix Post RMA Workflow - This diagram illustrates the comprehensive post-RMA process for compute tray replacement, starting with BCM inventory updates for new MAC addresses, followed by BMC credential management, OPROM configuration, and boot order correction. The workflow then proceeds through firmware updates to ensure correct versions, concludes with system validation including agent connectivity verification and comprehensive testing via BREAKFIX_COMPUTE_TRAY_VALIDATION, and automatically returns validated hardware to service.

Key Features#

  • Automated Configuration: Configures replaced hardware components with proper settings

  • Firmware Updates: Updates firmware to match the required versions for the environment

  • Boot Order Correction: Ensures proper boot sequence for reliable operation

  • Comprehensive Validation: Performs thorough testing to verify hardware functionality

  • Seamless Integration: Automatically returns validated hardware to service

Post RMA Workflow Components#

The Post RMA workflow consists of several key steps that ensure replaced hardware is properly configured and validated:

Physical Replacement Procedures#

  • Compute Tray Removal: Detailed step-by-step instructions for safely removing failed compute trays, including power down procedures, cable disconnection, and proper handling

  • Compute Tray Installation: Comprehensive installation guide covering component migration (M.2 boot drive, E1.S cache drives, HMC, BMC, TPM), rail installation, and cable reconnection

  • Component Migration: Transfer of critical components from old tray to new tray while maintaining proper slot assignments and ESD protection

BCM Inventory Update#

  • ONLY REQUIRED FOR NEW HARDWARE: Updates BCM inventory information using the BREAKFIX_POST_RMA_UPDATE_BCM_INVENTORY runbook when a new compute tray is installed

  • Skip this step for repaired trays as MAC addresses remain unchanged

  • Ensures MAC addresses and other hardware identifiers are correctly registered in BCM (new MAC addresses are provided by the Enterprise Support team who manage serial numbers and asset inventory for customer deployments)

  • Enables proper management and monitoring of the replaced hardware

BMC Credential Management#

  • Creates necessary BMC credential files for secure access to hardware components

  • Establishes secure communication channels for configuration operations

BlueField Configuration#

  • Checks if BlueField devices are in NIC mode

  • Enables OPROM on BlueField devices to ensure proper initialization

  • Configures hardware components for optimal operation

Boot Order Correction#

  • Ensures the boot sequence is properly configured

  • Prevents boot failures and improves system reliability

  • Performs power reset through BMC after configuration changes

Connectivity Verification#

  • Verifies SSH connectivity to compute nodes

  • Checks BCM device status to ensure proper registration

  • Confirms network accessibility before proceeding with firmware updates

Firmware Updates#

  • Upgrades compute firmware to the version in the specified package directory

  • Upgrades Mellanox BlueField and ConnectX firmware using the provided package file paths

  • Uses the golden configuration JSON (FW_SOURCE_JSON_PATH) for validation; all components are updated in a single run

System Validation#

  • Waits for hosts to come back online after each firmware update cycle

  • Verifies agent connectivity to ensure management capabilities

  • Runs comprehensive validation tests using BREAKFIX_COMPUTE_TRAY_VALIDATION

  • Opens nodes in BCM and validates Slurm readiness for successful nodes

Running the Post RMA Workflow#

Step 1: Update BCM Inventory (ONLY FOR NEW HARDWARE)#

Note: This step is ONLY required when installing a new compute tray. Skip this step for repaired trays as MAC addresses remain unchanged.

For new hardware replacement, update the BCM inventory with new MAC addresses and hardware identifiers:

  1. Navigate to the “BREAKFIX_POST_RMA_UPDATE_BCM_INVENTORY” runbook in the Runbooks section

    _images/image103.png
  2. Configure the required parameters (MAC addresses and serial numbers are provided by the Enterprise Support team):

    • HOST_NAME: The hostname of the replaced hardware component (e.g., a07-p1-dgx-03-c08)

    • BMC_MAC: MAC Address of the BMC

    • BF3_0_MAC: MAC Address of Predictable Network Interface Name enP6p3s0f0np0

    • BF3_1_MAC: MAC Address of Predictable Network Interface Name enP22p3s0f0np0

    • BF3_0_STORAGE: MAC Address of Predictable Network Interface Name enP6p3s0f1np1

    • BF3_1_STORAGE: MAC Address of Predictable Network Interface Name enP22p3s0f1np1

    • BF3_0_BMC: MAC Address of Interface Name ethX

    • BF3_1_BMC: MAC Address of Interface Name ethY

    • TRAY_SERIAL_NUMBER: Serial Number of the Tray

  3. Select “Create Run” to initiate the BCM inventory update

    _images/image104.png
  4. Monitor the execution progress to ensure successful completion
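
Because a mistyped MAC address would register the wrong hardware, it can help to check the values before creating the run. The Python sketch below is a hedged example, not part of the runbook: `check_inventory_params` is a hypothetical helper, the accepted MAC notation (six hex octets separated by colons or hyphens) is an assumption about what the runbook expects, and the uniqueness check assumes every interface has a distinct MAC address.

```python
import re
from typing import Dict, List

# Six hex octets separated consistently by ":" or "-".
MAC_RE = re.compile(r"^[0-9A-Fa-f]{2}([:-])[0-9A-Fa-f]{2}(\1[0-9A-Fa-f]{2}){4}$")

# MAC-valued parameters of BREAKFIX_POST_RMA_UPDATE_BCM_INVENTORY, as listed above.
MAC_PARAMS = (
    "BMC_MAC", "BF3_0_MAC", "BF3_1_MAC",
    "BF3_0_STORAGE", "BF3_1_STORAGE", "BF3_0_BMC", "BF3_1_BMC",
)


def check_inventory_params(params: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the inputs look sane."""
    problems: List[str] = []
    for name in ("HOST_NAME", "TRAY_SERIAL_NUMBER"):
        if not params.get(name, "").strip():
            problems.append(f"{name} is missing")
    seen: Dict[str, str] = {}
    for name in MAC_PARAMS:
        value = params.get(name, "").strip()
        if not MAC_RE.match(value):
            problems.append(f"{name} is not a valid MAC address: {value!r}")
            continue
        key = value.lower().replace("-", ":")  # normalize before comparing
        if key in seen:
            problems.append(f"{name} duplicates {seen[key]}")
        else:
            seen[key] = name
    return problems
```

The check only validates the shape of the inputs; the authoritative values still come from the Enterprise Support team.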

Step 2: Execute Main Post RMA Workflow#

After successfully updating the BCM inventory (if required for new hardware), proceed with the main Post RMA workflow:

To execute the Post RMA workflow:

  1. Navigate to the “BREAKFIX_POST_RMA” runbook in the Runbooks section

    _images/image105.png
  2. Configure the required parameters:

    • HOST_NAME (required): Name of the replaced host (e.g., a07-p1-dgx-03-c08)

    • FWPKG_DIR_PATH (required): Full path to the directory containing all firmware packages or the individual package (.fwpkg) for compute nodes

    • FWPKG_BF_FILE_PATH (required): Full path to the BlueField (BF3) firmware package file

    • FWPKG_CX_FILE_PATH (required): Full path to the ConnectX firmware package file

    • FW_SOURCE_JSON_PATH (required): Path to the golden configuration JSON from the firmware package; defines reference settings for validation. Set to NA if not available.

    • resource_tag (optional): Tag for resource query (e.g., rack_name)

    • resource_value (optional): Value for resource query (e.g., rack identifier)

  3. Configure the required secrets (if not already configured):

    • AHR_API_ENDPOINT: The API endpoint URL for NVIDIA Mission Control

    • AHR_TOKEN: Authentication token for API access

    To add or update these secrets:

    1. In the Shoreline UI, go to Settings.

    2. Click on the Secrets section.

    3. Use the + Secret button to create:

      • AHR_API_ENDPOINT — Provide the correct API endpoint.

      • AHR_TOKEN — Provide the secure API token.

    4. If a secret already exists, click its name to update the value.

    5. Click Save to persist the changes.

    🔐 These secrets will be securely injected into the action at runtime.

    _images/image106.png
  4. Select “Create Run” to initiate the workflow

    _images/image107.png
  5. Monitor the execution progress and results

Note: The workflow includes detailed physical replacement instructions that must be followed before executing the automated portions. Ensure all physical replacement steps are completed as outlined in the runbook’s markdown sections.
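
A pre-flight check of the BREAKFIX_POST_RMA parameters can catch a wrong path before the run starts. The Python sketch below is an illustration under assumptions: `check_post_rma_params` is a hypothetical helper, and it assumes the check is executed on the host where the firmware paths are resolved. The special value NA for FW_SOURCE_JSON_PATH follows the parameter description above.

```python
import json
from pathlib import Path
from typing import Dict, List

REQUIRED = (
    "HOST_NAME", "FWPKG_DIR_PATH", "FWPKG_BF_FILE_PATH",
    "FWPKG_CX_FILE_PATH", "FW_SOURCE_JSON_PATH",
)


def check_post_rma_params(params: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the inputs look sane."""
    problems = [f"{name} is required" for name in REQUIRED
                if not params.get(name, "").strip()]
    if problems:
        return problems

    # FWPKG_DIR_PATH may be a directory of packages or a single .fwpkg file.
    fw = Path(params["FWPKG_DIR_PATH"])
    if not (fw.is_dir() or (fw.is_file() and fw.suffix == ".fwpkg")):
        problems.append("FWPKG_DIR_PATH is neither a directory nor a .fwpkg file")

    for name in ("FWPKG_BF_FILE_PATH", "FWPKG_CX_FILE_PATH"):
        if not Path(params[name]).is_file():
            problems.append(f"{name} does not point to a file")

    golden = params["FW_SOURCE_JSON_PATH"]
    if golden != "NA":  # NA means no golden configuration is available
        try:
            with open(golden, encoding="utf-8") as handle:
                json.load(handle)
        except (OSError, ValueError) as exc:
            problems.append(f"FW_SOURCE_JSON_PATH is not readable JSON: {exc}")
    return problems
```

The optional `resource_tag` and `resource_value` parameters are deliberately left unchecked, since the text gives no constraints for them.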

NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA Powershelf#

Break/Fix Post RMA Powershelf Introduction#

The Break/Fix Post RMA Powershelf workflow brings a replaced power shelf tray back into service after an RMA. It is available for GB300 and GB200 and consists of two runbooks: updating BCM inventory for the new powershelf node (when applicable), then running the main powershelf post RMA workflow to verify BCM status, identify the PowerShelf manufacturer (LiteOn or Delta), and upgrade PSU and PMC firmware to the provided versions.

_images/powershelf_post_rma_flowchart.png

Figure: Powershelf Post RMA Workflow — BCM inventory update (if new HW), Check BCM device status, Determine manufacturer (LiteOn or Delta), Upgrade PMC and PSU firmware, Return to service.

Running the Post RMA Powershelf Workflow#

Step 1: Update BCM Inventory (ONLY FOR NEW POWERSHELF HARDWARE)#

Note: This step is ONLY required when installing a new power shelf tray. Skip for repaired trays where the BMC MAC is unchanged.

Use the BREAKFIX_POST_RMA_UPDATE_POWERSHELF_BCM_INVENTORY runbook to register the new powershelf in BCM:

  1. Navigate to the “BREAKFIX_POST_RMA_UPDATE_POWERSHELF_BCM_INVENTORY” runbook in the Runbooks section.

  2. Configure the required parameters:

    • HOST_NAME (required): Host name to update in BCM inventory (e.g., b06-p01-pwr-01).

    • BMC_MAC (required): MAC address of the powershelf BMC.

  3. Select “Create Run” and monitor the execution.

Step 2: Execute Main Powershelf Post RMA Workflow#

After updating BCM inventory (if required for new hardware), run the main powershelf post RMA workflow:

  1. Navigate to the “BREAKFIX_POWERSHELF_POST_RMA” runbook in the Runbooks section.

  2. Configure the required parameters:

    • resource_value (required): Scope value (e.g., rack name) for the workflow.

    • HOST_NAME (required): Name of the replaced powershelf host (e.g., b06-p01-pwr-01).

    • PSU_FILE_FULL_PATH (required): Full path to the PSU tar file for powershelf nodes.

    • PMC_FILE_FULL_PATH (required): Full path to the PMC tar file for powershelf nodes.

    • FW_SOURCE_JSON_PATH (required): Path to the golden configuration JSON from the firmware package; defines reference settings for validation.

  3. Select “Create Run” to start the workflow. The runbook checks BCM device status, determines the PowerShelf manufacturer (LiteOn or Delta), and upgrades PSU and PMC firmware via the firmware upgrade step. Monitor execution until completion.

B200#

Automated Baseline Testing with NVIDIA Mission Control autonomous hardware recovery#

Introduction#

The NVIDIA Mission Control autonomous hardware recovery portal enables efficient automation of baseline testing procedures for DGX B200 systems. This comprehensive testing framework can be flexibly executed across various scales of infrastructure, from individual compute nodes to multiple node configurations. The system performs extensive validation of critical compute node components, including CPU/GPU/Memory/storage functionality, network connectivity, and firmware versioning. Additionally, it incorporates industry-standard performance benchmarking tools such as HPL, NCCL, and Nemotron (Large Language Model) to assess system capabilities. This streamlined approach significantly enhances both testing efficiency and thoroughness while reducing execution time.

Note: For B200, unit means an individual compute node.

Figure: Baseline testing workflow. Select Single Unit or Multi Unit test mode, then review status, firmware checks, and the resource dashboard.

Entrypoint of Automated Testing#

A centralized automated baseline testing interface has been established to facilitate streamlined test execution and management. This unified entry point provides comprehensive access to the testing framework, enabling efficient navigation and one-click implementation of all testing procedures.

To access the baseline testing interface:

  • Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.

  • Locate the “DGX_SUPERPOD_BASELINE_TESTING” runbook through the search functionality (see the interface shown below)

    _images/bcn_ahr_image01.png
  • Upon selecting the “DGX_SUPERPOD_BASELINE_TESTING” runbook, you will be presented with the interface shown below. The subsequent sections of this documentation will provide detailed guidance on executing various testing procedures.

    _images/bcn_ahr_image02.png

Run SUT (Single Unit Testing) Job#

This section describes how to start baseline testing in Single Unit mode when one or more nodes are ready for testing.

  • Navigate to the “DGX_SUPERPOD_BASELINE_TESTING” runbook as described in the previous section.

  • Ensure the “DGX_SUPERPOD_BASELINE_TESTING_SUT” component is activated by toggling its switch control, which is located on the right side of its group of interface icons. When activated, the switch displays as a green circle with a play icon; when deactivated, it displays as a grey circle with a muted play icon.

  • Verify that the “DGX_SUPERPOD_BASELINE_TESTING_MUT” component is in its deactivated state. This deactivated state is indicated by its corresponding switch control displaying as a grey circular icon containing a muted play symbol, signifying it is ‘Off’.

  • Select the “Save” button positioned in the top right corner of the interface to preserve your settings.

  • Upon successful completion of these preliminary steps, your runbook configuration should reflect the specified parameters as illustrated below.

    _images/bcn_ahr_image02.png
  • To initiate a run, select the Create Run button located at the top right corner of the interface. A new window will appear as shown below. For detailed information about each parameter, simply hover over the info icon beside it.

    _images/bcn_ahr_image03.png

    You are required to provide the following inputs for the runbook:

    • FW_SOURCE_JSON_PATH: Specify the file path to the golden configuration JSON file included in the firmware package. This file defines the reference settings used for validation. If the file is not available, set this parameter to NA.

    • IGNORE_LIST: Provide a list of nodes to exclude from the test, only if required. Leave the value as “none” if no nodes need to be ignored. This parameter supports regular expressions. Here are some examples:

      • Single node: node01

      • List format: [“node01”, “node02”]

      • Pipe-delimited string: node01|node02

  • After entering the correct value, select the “Create Run” button to initiate the process. A confirmation dialog will appear with a “View Run” link illustrated below. Selecting this link will redirect you to a new page displaying comprehensive job status and details. For additional information regarding job monitoring and results, please refer to the “Check Job Status and Result” section of this documentation.

    _images/bcn_ahr_image04.png
Note: The predefined timeout for SUT is 6 hours. You can adjust it based on your requirements.
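
The three IGNORE_LIST formats described in this section can all be reduced to a list of regular-expression patterns. The Python sketch below shows one way to do that; it is not the runbook's parser. `parse_ignore_list` and `is_ignored` are hypothetical names, and requiring a pattern to match the whole node name is an assumption.

```python
import json
import re
from typing import List


def parse_ignore_list(value: str) -> List[str]:
    """Turn an IGNORE_LIST value into a list of regular-expression patterns."""
    text = value.strip()
    if not text or text.lower() == "none":
        return []
    if text.startswith("["):
        # Accept typographic quotes in case the value was pasted from a document.
        normalized = text.replace("\u201c", '"').replace("\u201d", '"')
        try:
            items = json.loads(normalized)
        except ValueError:
            items = None  # not a JSON list, e.g. a pattern such as "[a-c]01"
        if isinstance(items, list):
            return [str(item) for item in items]
    # A single node name or a pipe-delimited string is already one pattern.
    return [text]


def is_ignored(node: str, patterns: List[str]) -> bool:
    # Assumption: a pattern must match the whole node name.
    return any(re.fullmatch(pattern, node) for pattern in patterns)
```

A pipe-delimited value needs no splitting because `node01|node02` is itself a valid regular-expression alternation.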

Run MUT (Multi Unit Testing) Job#

This section describes how to start baseline testing in Multi Unit mode.

  • Navigate to the “DGX_SUPERPOD_BASELINE_TESTING” runbook as described in the previous section.

  • Ensure the “DGX_SUPERPOD_BASELINE_TESTING_MUT” component is activated by toggling its switch control, which is located on the right side of its group of interface icons. When activated, the switch displays as a green circle with a play icon; when deactivated, it displays as a grey circle with a muted play icon.

  • Verify that the “DGX_SUPERPOD_BASELINE_TESTING_SUT” component is in its deactivated state. This deactivated state is indicated by its corresponding switch control displaying as a grey circular icon containing a muted play symbol, signifying it is ‘Off’.

  • Select the “Save” button positioned in the top right corner of the interface to preserve your settings.

  • Upon successful completion of these preliminary steps, your runbook configuration should reflect the specified parameters as illustrated below.

    _images/bcn_ahr_image05.png
  • Select the “Create Run” button positioned in the top right corner of the interface. A new window will appear as illustrated below.

    _images/bcn_ahr_image06.png
  • After entering the correct Required parameter values, select the “Create Run” button to initiate the process. A confirmation dialog will appear with a “View Run” link illustrated below. Selecting this link will redirect you to a new page displaying comprehensive job status and details. For additional information regarding job monitoring and results, please refer to the “Check Job Status and Result” section of this documentation.

    _images/bcn_ahr_image07.png
Note: The predefined timeout for MUT is 48 hours. You can adjust it based on your requirements.

Runbook Configurations#

Before you start real jobs, review the runbook configurations as described in this section.

Select “Runbook” from the left navigation panel, then use the search field on the right side of the page to find your runbook by name, as shown in the illustration below.

_images/bcn_ahr_image08.png

The following table lists all Mission Control related runbooks for B200 systems, with their names and descriptions.

| Category | Runbook Name | Description |
|----------|--------------|-------------|
| | DGX_SUPERPOD_BASELINE_TESTING | EntryPoint Runbook |
| SUT | DGX_SUPERPOD_BASELINE_TESTING_SUT | EntryPoint Runbook of SUT (Single Unit Testing) |
| | SUT1 | EntryPoint Runbook of all single node health checks |
| | SINGLENODE_HEALTHCHECK_GPU_CPU | Baseline health checks for GPU and CPU |
| | SINGLENODE_HEALTHCHECK_MEMORY_STORAGE | Baseline health checks for Memory and Storage |
| | SINGLENODE_HEALTHCHECK_NETWORK | Baseline health checks for Network |
| | SINGLENODE_HEALTHCHECK_SOFTWARE | Baseline health checks for installed software |
| | SINGLENODE_HEALTHCHECK_FIRMWARE | Baseline health checks for firmware |
| | SUT2 | EntryPoint Runbook of component testing |
| | MEMORY_BENCHPRESS | Benchmark testing for memory |
| | CUDA_SAMPLES | Benchmark testing for CUDA |
| | HPL_TEST_SINGLE_NODE_MPIRUN | HPL testing on single node separately |
| | NVBANDWIDTH_SINGLE_NODE_MPIRUN | Bandwidth testing running on single node |
| | NCCL_TEST_SINGLE_NODE_MPIRUN | NCCL testing on single node |
| | SUT3 | EntryPoint Runbook of burn-in performance testing |
| | HPL_TEST_BURN_IN_SINGLE_NODE_MPIRUN | HPL testing on the single node with long duration |
| MUT | DGX_SUPERPOD_BASELINE_TESTING_MUT | EntryPoint Runbook of MUT (Multi Unit Testing) |
| | MUT1 | EntryPoint Runbook of rack level connectivity testing |
| | INFINIBAND_CHECK | InfiniBand connectivity validation |
| | MUT2 | EntryPoint Runbook of multi-rack performance testing |
| | NVBANDWIDTH_MPIRUN | Bandwidth testing across multiple nodes |
| | P2P_IPERF_MPIRUN | Point-to-point network performance testing |
| | HPL_TEST_MPIRUN | HPL testing across multiple nodes |
| | NCCL_TEST_MPIRUN | NCCL testing across multiple nodes |
| | HPL_TEST_BURN_IN_MPIRUN | HPL testing across multiple nodes with long duration |
| | MUT3 | EntryPoint Runbook of cluster level testing |
| | Nemotron_15B_MPIRUN | LLM testing with mocked data |

Runbook Interface Guide#

When accessing the runbook as shown in the example below, please note these important configuration elements:

_images/bcn_ahr_image02.png
Central Workspace#

The main content area displays your resource queries, commands, scripts, or nested runbooks.

  • Each row represents an individual cell

  • Each cell includes a play button for isolated execution

  • Toggle switches allow you to enable/disable specific cells

Configuration Panel (Right Side)#

The right panel contains several critical configuration sections:

  1. Parameters: Contains all required inputs for runbook execution

  2. Triggers: Configure automated execution methods:

    • Alarm triggers

    • Time Trigger (cron jobs)

    • Other integrations like AlertManager

  3. Users: Manage permissions for who may run or edit the runbook

  4. Settings: General runbook configuration options

  5. More: Additional runbook operations, including:

    • Clone functionality

    • Export options

    • Delete runbook

Check Job Status and Result#

This section provides comprehensive guidance on monitoring runbook execution progress and interpreting results.

Check Job Status#

To monitor the status of a runbook execution:

  1. After creating a run, click the “View Run” link in the confirmation dialog

  2. Alternatively, navigate to “Runs” in the left panel and select your specific execution

  3. The run details page displays:

    • Overall run status

    • Start time and duration

    • Each cell’s execution status

    • Resource filtering information

    • Output from each step

_images/bcn_ahr_image09.png
Status Types and Definitions#
  1. Running: The job is currently executing. Progress is displayed as a percentage based on completed cells.

  2. Completed: The job has finished execution. Note that completion status does not guarantee successful results. Please review the detailed output in the results section.

  3. Aborted: The job terminated prematurely due to execution errors, such as cell syntax issues.

  4. Terminated: The job was forcibly ended by the system.

  5. Timed Out: The job exceeded its maximum allowed execution duration (default timeout is 1 hour for runbook, 1 minute for action).

  6. Canceled: The job was manually terminated by a user.

Check Job Results#

When the job status displays “Completed,” you may proceed to review the job results. If the job status shows otherwise, you may click into the job for more details and troubleshooting.

Cell Structure Overview#

Each cell in the runbook contains three primary components:

_images/bcn_ahr_image10.png
  1. Main Content Area: This section displays the executed script or command. On the right side of the cell, you’ll find several control icons. While most icons were detailed in the previous section, the “fx” icon is particularly valuable as it displays all parameters along with input/output values for the specific cell when hovering over it.

    _images/image29.png
  2. Execution Information Bar: Located in the middle of the cell, this light grey text line indicates the execution start time and duration of the operation.

  3. Results Section: The bottom portion displays:

    • Exit code status

    • Execution location information

    • Complete command output (accessible by clicking the “Output” column contents)

Additional Features:

  • Configure output display preferences

  • Toggle the density

  • Download results in various formats using the download options menu

Streamlined Error Navigation: When a job contains numerous cells, manually checking for failures becomes inefficient. Use the “Error outline” feature in the middle panel to quickly locate problematic cells. Simply click any item in this list to automatically navigate to the corresponding failed cell.

_images/image30.png

Note: Reports are available for the major runbooks. For details, see the “Reports of Testing” section.

Handling Job Failures#

In the event of job or cell execution failures, the following remediation options are available:

  • Major Issue Resolution: Upon resolution of critical infrastructure issues (e.g., hardware replacement), a complete re-initialization of the SUT or MUT job is recommended.

  • Targeted Component Resolution: When specific components have been updated (e.g., firmware version upgrades), execute the relevant job or runbook within the existing session by selecting the “Run” button located in the upper-right interface section. This maintains all previously established parameters.

  • Individual Cell Correction: For isolated cell failures that have been addressed, execute the specific cell independently by activating the execution control (play button) positioned on the right margin of the cell interface. Note: This option might make it difficult to locate the job/run in the reports panel (refer to the Reports of Testing section below).

    _images/image30.png

Firmware checks#

NVIDIA Mission Control autonomous hardware recovery includes firmware checks that extract the current firmware versions of the trays and switches, and compare them with the expected versions specified in the Source of Truth (SOT) file. The SOT file includes the expected versions for all components such as OS, HMC, ConnectX, etc., and is prepopulated. The SOT file can be obtained from the NVIS team.

The runbook extracts the expected versions of all firmware components and compares the current versions against them.
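The comparison described above can be pictured with a small sketch. This is an illustration only, not the runbook's implementation: the function name, the component names, and the version strings are hypothetical, and it assumes the expected and current versions have already been collected into dictionaries.

```python
def compare_firmware(expected, current):
    """Return one row per component: (component, expected, current, status)."""
    rows = []
    for component, want in sorted(expected.items()):
        have = current.get(component)  # None when the version could not be read
        status = "PASS" if have == want else "FAIL"
        rows.append((component, want, have, status))
    return rows


# Hypothetical values for illustration only.
expected = {"HMC": "1.0.0", "ConnectX": "2.0.0"}
current = {"HMC": "1.0.0", "ConnectX": "1.9.0"}

for row in compare_firmware(expected, current):
    print(row)
```

A component whose current version cannot be read is treated as a failure here; whether the actual check behaves the same way is an assumption.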

Updating the SOT file#

Place the SOT JSON file on the headnode and provide the complete file path as the input FW_SOURCE_JSON_PATH to the DGX_SUPERPOD_BASELINE_TESTING runbook. To update the file, simply replace it with the new file and update the path in the input parameter of the runbook accordingly.

Thresholds and Defaults#

The thresholds and default values for various tests are defined in the Golden Config File. The runbook picks the appropriate values and compares them against the values on the nodes. This includes defaults such as number of GPUs and expected benchmarks for benchmarking tests (such as SUT2).

Golden Config File#

The Golden Config File (referred to as the defaults.env file on the trays and control nodes) contains the expected values for all benchmark thresholds and other relevant settings. This file is distributed across all trays and loaded as environment variables, making its contents available during testing.

_images/image31.png
Updating the Golden Config File#

To update the golden config values, edit config/<CHIP>/defaults.yaml (for example, config/B200/defaults.yaml or config/GB200/defaults.yaml). The chip-specific env files are generated automatically from these YAML files during deployment — do not edit the generated files under Shoreline_files/generated/ directly.

Once the changes are made and saved, use the NVIDIA Mission Control autonomous hardware recovery Runbook Deployment section (from the NVIDIA Mission Control AHR installation documentation) to apply them via OpenTofu. This process will create a File object on NVIDIA Mission Control autonomous hardware recovery, which pushes the updated file automatically to all control and compute trays. Various tests, including firmware checks, prolog, and epilog checks, source the defaults.env file and utilize the expected values, now available as environment variables.
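To illustrate how a test might consume the defaults.env values as environment variables, here is a minimal sketch. It assumes the file uses shell-style KEY=VALUE lines, which is not confirmed by this guide, and the key names shown are hypothetical placeholders rather than real entries from the Golden Config File.

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Drop surrounding quotes so "7" and 7 compare the same way.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Hypothetical contents; real key names come from the generated defaults.env.
sample = """
# expected hardware counts
EXPECTED_GPU_COUNT=8
EXPECTED_OS_VERSION="7.0.2"
"""

defaults = parse_env(sample)
os.environ.update(defaults)  # make the values visible to child processes
print(defaults)
```

The actual tests source the file from the shell; the Python parser above only mirrors that behavior for simple lines and does not handle shell expansion.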

Reports of Testings#

NVIDIA Mission Control autonomous hardware recovery provides reports for baseline testing, reflecting the status of nodes (compute and switches) at each test stage. These reports help identify and troubleshoot root causes.

To access the NVIDIA Mission Control autonomous hardware recovery reports, click on Resources in the side menu, then select Reports.

_images/image32.png

The Landing page contains two tabs: Report Templates and Published Reports.

_images/landing_reports_bchips.png

Report Templates provide templates for each stage of Baseline testing. These templates include bar graphs that display the PASS or FAIL status for different nodes during the tests. However, these templates are static and do not store any test data. This means that while you can view the templates, you cannot save or modify the test results within them.

To generate a report that records the test results along with timestamps, click Publish. This action will create a new report based on the template, which will capture the current status of the tests. The report will display PASS for tests that were successful and FAIL for those that did not pass, with each status reflecting the most up-to-date information. The report also includes links to additional reports at the top of the page. These links allow you to access the detailed results of the individual SUT and MUT tests that make up each stage, giving you a deeper insight into the performance and status of each test within the overall baseline testing process.

Note: Reports reflect updated information only after the SUT and MUT tests have been executed.

Guide to initiate Reporting for Baseline Testing#

  • Navigate to the “DGX_SUPERPOD_BASELINE_TESTING_REPORT” under the Report Templates.

  • [Optional] Although templates do not store test results, you can use one to view the current state before publishing the reports. Clicking the refresh button at the top of the report will load the current data.

    _images/b200_baseline_report.png
  • Click on “Publish” to generate a timestamped instance of the template that can be easily viewed and shared. Additionally, it automatically publishes all linked reports at the top of the page, ensuring that all related data is included and accessible.

    _images/b200_publish_report.png
  • Retain the auto-generated name for the report which includes the timestamp or provide your own name.

  • Once the process starts, the published reports will begin generating in the background. This includes “DGX_SUPERPOD_BASELINE_TESTING_REPORT” as well as all the SUT and MUT linked reports.

  • A pop-up notification will appear containing a hyperlink to access the published reports that are being currently generated.

    _images/b200_building_report.png
  • When you publish the “DGX_SUPERPOD_BASELINE_TESTING_REPORT”, it automatically triggers the publication of all “linked reports”, including the associated SUT and MUT reports, with the data captured at that given time.

  • All the reports will complete building in under a minute, and the “Linked Published Reports” will include the published reports for all the Linked Reports.

Navigating to Previously Published Reports#

  • Navigate to the Reports from the NVIDIA Mission Control autonomous hardware recovery Home Page

    _images/image32.png
  • Click on “Published Reports”, and select the “DGX_SUPERPOD_BASELINE_TESTING_REPORT_timestamp” or any other report of interest.

    _images/b200_published_reports.png
  • You can also adjust the time range at the top of the screen to view reports generated within specific time frames. Options include viewing reports from the last 10 minutes, last 1 hour, or selecting a custom time range to explore older data beyond these periods.

Breakdown of a Published Report#

DGX_SUPERPOD_BASELINE_TESTING_REPORT#
  • “DGX_SUPERPOD_BASELINE_TESTING_REPORT”, is the entry point for the Baseline Testing reports. It provides a comprehensive overview of all the SUT/MUT tests for the compute nodes.

  • Each stage is represented by a separate cell, displaying the results for that specific SUT or MUT test.

    _images/b200_baseline_report.png
  • To perform a deeper analysis or understand the tests in each stage, click on the relevant report listed under “Linked Published Reports” at the top of the page.

  • Reports for every stage are available in the same way. You can either find the report of interest under Published Reports, or navigate to it from the parent report (DGX_SUPERPOD_BASELINE_TESTING_REPORT_<timestamp>).

Understanding the Published Reports#

The reports align with the SUT and MUT tests. Each cell in “DGX_SUPERPOD_BASELINE_TESTING_REPORT” represents a specific testing stage such as SUT1, SUT2 etc. Within each of these “Linked Reports”, the cells represent individual tests such as Singlenode Healthchecks, HPL, NCCL etc. The layout of the graph for each of the cells is organized as follows:

  • The X-axis represents all the nodes.

  • The Y-axis is not defined, which groups all the nodes into a single bar in the bar chart.

For example, in the following visualization, each bar represents the number of nodes within a cluster. In this case, there are 32 nodes per cluster, as indicated by the number displayed on each bar. This provides a clear view of the test progress for each tray across different clusters.

_images/b200_sut1.png
  • You can click on any bar in the graph to view a detailed list of the resources that passed or failed the tests within that specific bar. This allows you to drill down and see the test results for each tray in a particular rack or test stage.

    _images/b200_sut1_expanded.png
  • Alternatively, you can click on the legend to filter and display all the passed or failed resources across the entire cluster, providing a comprehensive view of the overall test status.

Root Cause Analysis#

The report for each stage shows the trays that have passed and failed the test. To understand the actual issue and access the logs, you can follow these steps:

  • Navigate to the report with PASS/FAIL values for which you want to conduct an RCA. In this example, consider the SUT1 report, where a few tests have failed for SINGLENODE_HEALTHCHECK_GPU_CPU.

    _images/b200_gpu_cpu.png
  • Navigate to the report corresponding to SINGLENODE_HEALTHCHECK_GPU_CPU and scroll to the tests that are failing.

    _images/image47.png
  • In this example, GPU Inforom Version has failed on 17 nodes. Click on the bar for one of the nodes to list the resources that the test failed on.

    _images/b200_gpu_inforom.png
  • Click on the status (FAIL) which directly links you to the runbook where these tests failed. You can also do the same for tests that passed by clicking on the PASS status.

    _images/b200_inforom_fail.png
  • The errors in the runbook are outlined to the left of the screen. Navigate to the test that we are debugging (Inforom Version). Alternatively, you can also scroll through the run to find the failed tests.

    _images/image55.png _images/image56.png
  • Click on the “Command filter excluded x/y resources” to get detailed output for each resource.

    _images/image57.png
  • Click on the Output to view the detailed logs.

_images/image58.png _images/image59.png

Firmware Reports#

For Firmware Checks, NVIDIA Mission Control autonomous hardware recovery offers a tabular report detailing the status of each node. The report includes:

  • Pass/Fail Status: Indicates whether the firmware check for each node passed or failed.

  • Expected Version: Shows the firmware version that was expected for the node.

  • Current Version: Displays the actual firmware version currently installed on the node.

Navigating to Tabular Report for Firmware Checks#
  • You can access the SINGLENODE_HEALTHCHECK_FIRMWARE_<timestamp> report either through the Runs (Check Job Status and Result) section or by navigating to the Runs History (see Root Cause Analysis).

  • Scroll to the last cell of the execution and click on the output of the cell.

    _images/image60.png
  • You can scroll horizontally and vertically to view the complete output. Additionally, you have the option to download the cell output for further analysis.

    _images/image61.png
  • You can also navigate to the Reports section and view the tabular output by clicking on the status bar. This includes both the expected firmware version and the current firmware version.

    _images/b200_cx7_fw_report.png

Resource Dashboard#

The NVIDIA Mission Control autonomous hardware recovery dashboard provides a comprehensive view of the status of all resources in the cluster, displaying the progress of testing at various stages for each node (control, compute and switch nodes). It shows which tests are completed, which ones have passed, and which have failed. This allows users to easily track the overall health and status of the cluster, identify any issues, and assess the readiness of each resource.

Navigating to the Resource Dashboard#

  • Visit the NVIDIA Mission Control autonomous hardware recovery homepage, navigate to the ‘Resources’ menu, and click on ‘Dashboards’ to access the dashboard page.

    _images/bcn_ahr_image18.png
  • The Landing page contains two tabs: Dashboards and Dashboard Views.

    _images/bcn_ahr_image19.png
  • The Dashboards tab lists all available dashboards. These dashboards reflect the current status of the cluster and its tests.

  • A Dashboard View is a snapshot of the Dashboard that captures the data at a specific point in time. This snapshot remains static, preserving the exact state and data of the Dashboard as it appeared at that moment, regardless of any future changes or updates to the live data.

DGX_SUPERPOD_BASELINE_TESTING_DASHBOARD#

The DGX_SUPERPOD_BASELINE_TESTING_DASHBOARD provides a detailed view of the cluster’s resources and the status of the baseline tests. Specifically, it tracks the progress of two key testing phases for each resource: SUT and MUT.

  • Resource Status: The dashboard displays all resources within the cluster, such as compute nodes, control nodes, and switch nodes, along with their associated status.

  • Hostname: For each resource, you will see the hostname, along with a Tag sequence that indicates the stage of both the SUT and MUT tests.

  • Test Stage Progress: The rows in the dashboard reflect the current status of each test stage. Each stage (SUT and MUT) has a corresponding tag name that visually represents its test progress, showing whether the stage has been successfully completed, is in progress, or has failed.

  • Snapshot of Cluster Health: The dashboard offers a comprehensive snapshot of the cluster’s readiness. It allows users to identify potential issues early, track the completion status of tests, and quickly assess which resources are ready and which ones may require attention.

    _images/b200_dashboard.png

The dashboard allows you to sort the resources by Name, or Progress Bar, making it easier to organize and view the status of your cluster based on your preferred criteria.

  • Name: Sort resources alphabetically by their hostname for quick access.

  • Progress Bar: Sort by test progress to focus on resources at different stages of testing or to identify incomplete tasks.

Creating a View / Snapshot#

To create a snapshot of the dashboard, follow these steps.

  • Click on the “Create View” button located at the top right corner of your screen.

    _images/b200_dashboard_view.png
  • Retain the auto-generated name for the dashboard which includes the timestamp or provide your own name.

    _images/b200_create_dashboardView.png
  • Once the Dashboard View is created, a pop-up notification will appear with a hyperlink to access the Dashboard View.

    _images/image67.png
  • The Dashboard View created can be downloaded as a CSV to further manipulate the data to generate reports.

    _images/b200_dashoard_view.png
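As one way to post-process a downloaded Dashboard View, the sketch below counts rows per status value with Python's standard csv module. The column names and rows are hypothetical; the real header depends on the exported file, so adjust the column argument to match it.

```python
import csv
import io
from collections import Counter

# Hypothetical export; real column names come from the downloaded file.
sample_csv = """hostname,status
node01,PASS
node02,FAIL
node03,PASS
"""


def summarize(csv_text, column="status"):
    """Count rows per value of the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


print(summarize(sample_csv))
```

For a file on disk, read its contents with open(path, newline="") and pass the text to the same function.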

Automated Health Checks with NVIDIA Mission Control autonomous hardware recovery#

Introduction#

NVIDIA Mission Control autonomous hardware recovery provides a full suite of automated health checks to detect failures at the tray, node, and system levels. In addition, system wide health checks are performed by integrating with the UFM and NMX-M network control planes. Health check data is reported back to BCM’s BaseView and/or the in-cluster LGTM stack. These health checks are performed at two layers: BCM job invocation, and as periodic health checks via NVIDIA Mission Control autonomous hardware recovery.

Alarms Dashboard#

The alarms dashboard is an overview of all alarms and the state of your system. In this view, alarms are summarized by counts of alarms firing, alarms firing most frequently, and a configurable list of most frequently firing, canceled, or resolved alarms. This is meant to be a starting point for any investigations of possible issues with your systems, and you may click any alarm for further details.

Alarms dashboard overview

BCM Slurm Job Lifecycle Checks (Prolog and Epilog)#

When a Slurm job is submitted, the Autonomous Hardware Recovery Agent automatically runs a set of checks at the start and end of the job to validate node health and stability. These are known as Prolog and Epilog checks.

  • Prolog Checks (run before the job starts):

    • If a check fails, the node is marked as DRAIN, and the job is re-queued.

    • If it passes, the job proceeds normally.

  • Epilog Checks (run after the job finishes):

    • If a check fails, the node is also marked as DRAIN.

These scripts are automatically pushed to each node when the job runs, but they are not visible or configurable through the NVIDIA Autonomous Hardware Recovery UI. To review them, navigate to Shoreline_files/scripts/slurm in the NVIDIA Mission Control package.
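The prolog and epilog outcomes described above can be summarized as a small decision table. The sketch below is only a restatement of that behavior in Python; the function names and return values are hypothetical and do not reproduce the scripts shipped in the package.

```python
def prolog_outcome(check_passed):
    """Actions taken before a job starts."""
    if check_passed:
        return {"node_state": "UNCHANGED", "job": "RUN"}
    # A failed prolog check drains the node and sends the job back to the queue.
    return {"node_state": "DRAIN", "job": "REQUEUE"}


def epilog_outcome(check_passed):
    """Actions taken after a job finishes."""
    return {"node_state": "UNCHANGED" if check_passed else "DRAIN"}


print(prolog_outcome(False))
print(epilog_outcome(False))
```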

Note: Prolog and Epilog checks are disabled by default and should only be enabled after the nodes are confirmed to be healthy. Use the following runbooks to manage them:

  • SLURM_CHECKS_ENABLE – enables the checks

  • SLURM_CHECKS_DISABLE – disables the checks

Unlike the Prolog and Epilog checks, Periodic Checks are defined within the NVIDIA Mission Control autonomous hardware recovery interface as Alarms, and are detailed in the next section. They will be automatically enabled for racks that pass Single Rack Testing but may also be manually enabled or disabled for specific racks by running the “ALARMS_ENABLE” and “ALARMS_DISABLE” runbooks.

Periodic Health Checks (Alarms)#

Periodic Checks are separate from Prolog and Epilog and run at regular intervals to monitor system health. These are managed as Alarms in the NVIDIA Autonomous Hardware Recovery UI.

  • Automatically enabled for racks that pass Single Rack Testing

  • Automatically disabled during firmware upgrade and Break/fix

  • Can be manually enabled or disabled at any time

The alarm_base is the Resource Query they run against. To control them manually, use the following runbooks:

  • ALARMS_ENABLE – enables periodic alarms for selected racks. Note that a node having the maintenance tag will override these settings.

  • ALARMS_DISABLE – disables them

Periodic Checks are fully visible and configurable in the UI through the alarm section. The following is a list of the configured Alarms, grouped by their check interval:

Frequent Checks (5m)#

bmc_sensors#

Checks the sensors from the Baseboard Management Controller (BMC) to ensure the proper data is returned.

sysmem#

Checks that all expected memory DIMMs are present.

dns_host#

Checks the DNS configuration and resolution for the host.

eth_state#

Checks that the ConnectX devices are present, active, and in the physical LinkUp state via ibstat, and that they match the expected transfer rate.

raid_count#

Checks that the RAID configuration matches the expected mdstat configuration.

gpu_temp_history#

Checks System Event Log (SEL) history looking for GPU temperature issues.

gpu_alloc_temp#

Checks if the GPU temperatures are above a threshold.

periodic_bmc_host_checks#

The following groups of periodic functional checks are a subset of the BCM Prolog checks that run at predefined intervals as NVIDIA Mission Control autonomous hardware recovery Alarms.

check_bmc_ipmi_version - Checks BMC IPMI version against an expected value

check_nvidia_module_loaded - Verifies the NVIDIA module is loaded in the host OS

check_host_os_version - Verifies the DGX OS version matches the expected value

check_nvsm_status - Verifies the NVSM service is currently active

periodic_cpu_mem_checks#

check_cpu_health - Verifies CPU sockets and cores are present and online

check_dimm_count - Checks that all expected memory DIMMs are present

check_dimm_size - Checks that the size of each memory DIMM matches the expected values

check_memory_swap_size - Checks that the memory swap size matches the expected value

periodic_gpu_nvlink_checks#

check_gpu_pci - Checks that all GPUs are present on the lspci interface and with the correct link width and speed

check_gpu_error - Checks GPUs for ECC errors, retired pages, and throttles present

check_gpu_powerstate - Checks the powerstate for each GPU and compares against an expected value

check_gpu_param - Checks that specified GPU parameters are present and correct for the host

check_nvlink_health - Checks that links are active for each GPU, that the speed is correct, that fabric registration has completed, and that the links are running at full bandwidth and belong to the same NVLink domain and partition.

check_gpu_topology - Checks that there are no issues with the p2p topology within the node

check_gpu_telemetry - Checks that various sensors can be successfully read from the GPU via nvidia-smi

check_gpu_power_limit - Checks that the power limit is correct for each GPU

check_nvidia_inforom_ver - Checks that the inforom version is correct for each GPU

check_gpu_clock_info - Checks that the maximum clock speed is correct for each GPU

check_remapped_row - Checks if any remapped row events have occurred

periodic_network_checks#

check_ib_ber_and_ro - Checks if the PCI_WR_ORDERING field is set to relaxed and also the bit error rate of the ConnectX using mlxlink

check_ib_port_rcv_errors - Checks InfiniBand device ports for RCV errors

check_ib_cables - Checks the cable info using mlxcables

check_bf3_speed - Validates that the BlueField devices are operating at the correct speed and that the proper number of devices are in the “Up” state. This check will run but never fail.

periodic_storage_checks#

check_pex_switch_health - Checks that the PEX switches are present, have the correct PCIe link speed and width, and the downstream devices have enumerated to lspci

check_cx_config - Checks that the ConnectX devices have the correct PCIe link speed and width via lspci and ACS config via setpci

check_nvme_health - Checks that the PCIe link speed and width of each NVMe device matches the expected value

check_storage_dir - Checks that the host has functional access to the home storage

check_storage_util - Checks that the used local storage on the host is below a given threshold

periodic_error_checks#

Checks journald for machine check events, Xid, AER, CPER, I/O, GPU fell off the bus, and other generic errors.

Hourly Checks#

nfs_mounts#

Verifies required mount points.

daily_informational#

Checks for which the severity and remediation may not be critical. This alarm will only be triggered once per day, and results may be viewed in the resulting runbook run.

check_sel_event - Reads the SEL events from the BMC and ensures none are asserted

check_dgx_os_version - Verifies the DGX OS version matches the expected value

check_gpu_vbios_ver - Checks the VBIOS version of the GPUs and compares against an expected value

check_nvme_fw_ver - Checks that the FW version for each NVMe matches an expected value

check_kernel_commandline_opt - Verifies the specified kernel option(s) is present in the current kernel’s boot parameters

check_host_bios_ver - Verifies the system’s BIOS version

check_kernel_ver - Verifies the current version of the Linux kernel

check_host_package_versions - Queries the installed packages on the host

nv_container_cli_info - Retrieves information about the NVIDIA container CLI (driver and devices)

Daily Checks#

cpu_microcode_version#

Checks the CPU microcode version and compares against an expected value.

cpu_stepping#

Checks that the CPU stepping parameter is correct for each CPU.

numa_node_count#

Checks that the correct count of Non-uniform memory access (NUMA) nodes is configured with the CPU cores.

NVIDIA Mission Control autonomous hardware recovery Alarm Configuration#

There are several components to an alarm, with the key pieces being the Resource Query, Fire Query, Resolve Query, Check Interval and Automation. An example configuration is shown in the following figure.

Alarm configuration example

Resource Query#

The resource query allows you to customize the resources (hosts, pods, gpus) on which the checks will be performed. In the preceding example, the query `hosts | rack_name =~ ".*"` checks alarms only on hosts that have a value set for the “rack_name” tag.
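The effect of a tag filter like the one above can be approximated in Python. This is an analogy only, not the query language's implementation: the host records, the select_hosts helper, and the tag values are hypothetical.

```python
import re


def select_hosts(hosts, tag, pattern):
    """Keep hosts whose tag exists and matches the regular expression."""
    regex = re.compile(pattern)
    return [
        h["name"] for h in hosts
        if tag in h["tags"] and regex.search(h["tags"][tag])
    ]


hosts = [
    {"name": "node01", "tags": {"rack_name": "rack-a"}},
    {"name": "node02", "tags": {}},  # no rack_name tag, so it is skipped
]
print(select_hosts(hosts, "rack_name", ".*"))  # ['node01']
```

A narrower pattern restricts the alarm to specific racks in the same way; in this sketch, "rack-a" would select only hosts tagged with that rack.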

Fire Query#

The fire query is a condition that, when true, will cause the alarm to begin firing. It will be run at each interval.

Resolve Query#

Similar to the fire query, the Resolve query is a condition that will resolve the alarm when true. Resolving an alarm will cause firing to cease and the state to change to Resolved.

Check Interval#

The interval at which the fire and resolve queries are checked.

Automation#

You may use the Automation settings to have Runbooks triggered when an alarm fires, and you may also customize the informational messages that are displayed in the Alarm’s logs.

Alarm States#

If an Alarm has triggered, it will be in one of the following three states:

Triggered#

This state means the alarm is currently firing. Any automation (break/fix) will subsequently be invoked to remediate any issues, potentially resolving the alarm. Alternatively, the user can cancel the alarm by clicking “Cancel alarm”.

Clicking into the triggering alarm will give you more details on what caused the alarm, metadata and resources relating to the alarm, and will also allow you to view log output from the check itself.

Alarm states screenshot

Resolved#

When the resolve query of an alarm evaluates to true for a firing alarm, the status will be changed to Resolved. Automation triggered runbooks will invoke break/fix operations that should be configured to result in a resolved alarm.

Canceled#

When a user cancels an alarm from the dashboard, or from the triggered alarm itself, its state will become Canceled. Also, if an alarm configuration is changed for an alarm in the Triggered state, it will be canceled since it was triggered against a defunct configuration.
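The three states and their transitions can be summarized as a small state function. This is a sketch under stated assumptions, not the product's logic: the function name and arguments are hypothetical, a cancellation is assumed to take precedence over a resolve in the same interval, and an alarm that is not currently firing is assumed to trigger again when its fire query becomes true.

```python
def next_state(state, fire, resolve, canceled=False, config_changed=False):
    """Return the alarm state after one check interval.

    state is None (never triggered), "Triggered", "Resolved", or "Canceled".
    """
    if state == "Triggered":
        if canceled or config_changed:
            return "Canceled"
        if resolve:
            return "Resolved"
        return "Triggered"
    # Not currently firing: the fire query decides whether to trigger.
    return "Triggered" if fire else state


state = None
for fire, resolve in [(False, False), (True, False), (False, True)]:
    state = next_state(state, fire, resolve)
    print(state)
```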

Firmware Upgrades with NVIDIA Mission Control autonomous hardware recovery#

Overview#

NVIDIA Mission Control autonomous hardware recovery provides functionality for upgrading, cycling, and verifying firmware and the corresponding OS within your B200 racks. The three distinct components for which firmware can be upgraded using this process are:

  • Motherboard Tray (CPU, PCH, BMC)

  • GPU Tray (GPU, NVSwitches, HMC)

  • Mellanox Devices

The workflow invocation is performed via autonomous hardware recovery’s Runbooks. To view all Firmware upgrade related runbooks, you may search by the FIRMWARE_UPGRADE label as shown below.

_images/fw_image12.png

Using the filter will reduce the runbooks displayed to a list. In general, you will use these runbooks to upgrade the compute trays and Mellanox devices, by following the steps in the next section.

_images/b200_fw_runbooks.png

Preparing the Upgrade#

In the firmware upgrade runbook, nvfwupd and mlxfwmanager are used with the upgrade package file to determine the versions needing upgrade. You will need to obtain the firmware package and the Source of Truth (SOT) JSON file, the latter of which defines the referenced settings used for validation.

The SOT JSON may be obtained from the NVIS team, whereas the firmware packages may be downloaded from the NVIDIA Application Hub.
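Before running an upgrade, it can help to list the expected versions that an SOT file declares. The sketch below assumes the layout visible in the truncated Source of Truth snippet in this section (BoardSKUs, Components, Software, Firmware, Component, Version); the expected_versions helper is hypothetical, the sample document is cut down to those keys, and a complete SOT file may contain structure this sketch does not handle.

```python
import json


def expected_versions(sot):
    """Map (SKU name, kind, component) -> expected version string."""
    versions = {}
    for sku in sot.get("BoardSKUs", []):
        components = sku.get("Components", {})
        for kind in ("Software", "Firmware"):
            for entry in components.get(kind, []):
                # Component names may carry stray whitespace, so strip it.
                key = (sku.get("Name", ""), kind, entry["Component"].strip())
                versions[key] = entry["Version"]
    return versions


# A cut-down document that mirrors the keys of the SOT snippet.
sample = json.loads("""
{
  "BoardSKUs": [
    {
      "Name": "DGX B200",
      "Components": {
        "Software": [{"Component": "DGX OS", "Version": "7.0.2 RC6"}],
        "Firmware": [{"Component": "DGX B200 Motherboard Tray FW ", "Version": "1.3.2"}]
      }
    }
  ]
}
""")

for key, version in expected_versions(sample).items():
    print(key, "->", version)
```

To inspect a real file, load it with json.load(open(path)) and pass the result to the same function.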

Source of Truth Snippet (truncated)#

{
  "TemplateVersion": "0.5",
  "Id": "8e755bd7-f6af-4621-96d3-d55a5f0fe45e",
  "Name": "Viking_Release_1.3.2 (25.09.1)",
  "State": "QA Tested",
  "ReleaseDate": "2025-07-01T06:28:08.398Z",
  "ReleaseCustomers": [],
  "Tests": [],
  "Packages": [],
  "BoardSKUs": [
    {
      "SKUID": "P4387 B200",
      "Name": "DGX B200",
      "Components": {
        "Software": [
          {
            "Component": "DGX OS",
            "Version": "7.0.2 RC6",
            "Comments": "Install ISO",
            "External": false,
            "Sideload": false,
            "Informational": false,
            "Locations": [
              {
                "Location": "",
                "LocationType": "HTTPS",
                "Distro": "All",
                "Architecture": "All",
                "PackageName": "",
                "External": false
              }
            ],
            "SubComponents": []
          }
        ],
        "Firmware": [
          {
            "Component": "DGX B200 Motherboard Tray FW ",
            "Version": "1.3.2",
            "SKUID": "",
            "Vendor": "",
            "Comments": "Subcomponents Parsed from FWPKG",
            "External": false,
            "Sideload": false,
            "Bundle": "nvfw_DGX_250629.1.0.fwpkg",
            "Type": "Prod",
            "Informational": false,
            "Locations": [

Ordering Constraints#

The runbooks automatically determine the ordering of applicable packages and will AUX cycle nodes when appropriate. The following paragraph describes this ordering, but is for informational purposes only; no action is required from the user.

Verify#

For the compute tray, much older firmware packages require the BMC to be upgraded prior to HMC, but this is no longer the case with modern firmware. Both BMC and HMC can be upgraded within a single AC cycle.

Coordination with other jobs#

To prevent other tasks from utilizing the nodes undergoing the upgrade process, AHR will do two things:

  1. It will tag the nodes with a special maintenance tag.

  2. Subsequently, it will drain the node on BCM.

In particular, this will prevent other upgrade processes from interfering, and will also bypass AHR’s breakfix workflow. This tag will be automatically removed upon successful completion of the upgrade, and the node will be undrained. If there’s an issue during the upgrade process, this tag and drain state will remain for further investigation. At this point, the user should review the failures and, once the nodes are deemed healthy, undrain them and remove the maintenance tag.

If you need to remove the maintenance tags after the firmware upgrade process encounters an issue, troubleshooting has completed, or even after unsuccessful breakfix triage, you may do so using the CLEAR_MAINTENANCE_TAGS runbook.
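The tag-and-drain coordination can be pictured with an in-memory sketch. It is not AHR's implementation: the node dictionary, the run_upgrade helper, and the literal tag name "maintenance" are hypothetical stand-ins for the real tagging and BCM drain operations.

```python
def run_upgrade(node, upgrade):
    """Tag and drain a node, run upgrade(node), and clean up only on success."""
    node["tags"].add("maintenance")   # step 1: tag the node
    node["drained"] = True            # step 2: drain the node
    try:
        ok = bool(upgrade(node))
    except Exception:
        ok = False
    if ok:
        node["tags"].discard("maintenance")
        node["drained"] = False
    # On failure the tag and drain state stay in place for investigation.
    return ok


node = {"tags": set(), "drained": False}
print(run_upgrade(node, lambda n: False), node)
```

After a failed run, the node in this sketch remains tagged and drained until someone clears it, which mirrors the manual cleanup step described above.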

Performing the Upgrade#

This section details the steps involved to upgrade each firmware type. The runbooks below will take care of invoking the process, performing the upgrade, calling any ancillary runbooks, and ultimately performing validation of the upgraded component once complete.

_images/B200_fw_upgrade.png

Prerequisites#

  1. Download the firmware upgrade packages.

  2. Obtain the golden configuration SOT JSON file.

  3. On the headnode, place the JSON and packages in a subdirectory of /cm/shared/apps/autonomous-hardware-recovery/var/firmware/, which is accessible to the shoreline user.

    • Upgrade packages contain motherboard, GPU and network components among others.
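A quick way to confirm that a file sits in a subdirectory of the firmware directory named in step 3 is sketched below. The in_firmware_subdir helper and the example subdirectory and file names are hypothetical; the check only inspects the path string and does not verify that the file exists or that the shoreline user can read it.

```python
from pathlib import PurePosixPath

FIRMWARE_ROOT = PurePosixPath(
    "/cm/shared/apps/autonomous-hardware-recovery/var/firmware"
)


def in_firmware_subdir(path):
    """True when path is a file inside a subdirectory of FIRMWARE_ROOT."""
    p = PurePosixPath(path)
    try:
        relative = p.relative_to(FIRMWARE_ROOT)
    except ValueError:
        return False
    # At least one directory level plus the file name itself.
    return len(relative.parts) >= 2 and ".." not in relative.parts


# Hypothetical subdirectory and file name.
print(in_firmware_subdir(str(FIRMWARE_ROOT / "release-a" / "sot.json")))
```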

Upgrading#

From the Runbooks view, select “Create Run” for the FIRMWARE_UPGRADE runbook. This will prompt you for the following required parameters, and perform the upgrade. Additionally, it will automatically select the appropriate packages for the corresponding components e.g. DGX and HGX.

  1. NODES - Provide the node names using a regular expression. Examples: node01, node01|node02, or node-*.

  2. FORCE_UPGRADE - When set to true, forces the firmware upgrade to proceed regardless of version checks or validation; false follows standard upgrade rules.

  3. BMC_FIRMWARE_PATH - Provide the entire path of the firmware upgrade package for BMC/DGX packages (motherboard)

  4. HGX_FIRMWARE_PATH - Provide the entire path of the firmware upgrade package for HGX package (GPU Tray)

  5. MELLANOX_FIRMWARE_PATH - The complete path of the Mellanox firmware package

  6. FW_SOURCE_JSON_PATH - Path to the golden configuration JSON file provided within the firmware package. This file defines the reference settings used for validation. If the file is not available, set this parameter to NA.
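Because NODES is a regular expression, it is worth testing a pattern against known node names before starting a run. The sketch below uses Python's re module purely as an illustration; how the runbook anchors or interprets the pattern is not specified in this guide and may differ, and the node names and the matches helper are hypothetical.

```python
import re

names = ["node01", "node02", "node-01", "mynode01"]


def matches(pattern, names, full=True):
    """Return the names accepted by pattern, anchored (full) or unanchored."""
    test = re.fullmatch if full else re.search
    return [n for n in names if test(pattern, n)]


print(matches("node01", names))               # ['node01']
print(matches("node01|node02", names))        # ['node01', 'node02']
print(matches("node-*", names))               # []
print(matches("node-.*", names))              # ['node-01']
print(matches("node-*", names, full=False))   # all four names
```

In strict regular-expression syntax, `node-*` means "node" followed by zero or more hyphens, so under anchored matching `node-.*` is the unambiguous way to select every name that starts with "node-".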

When run, the FIRMWARE_UPGRADE runbook invokes other runbooks as necessary. Should you need to upgrade a single component, you may run the following runbooks directly.

  1. BreakFix_Firmware_Upgrade_Compute_nvfwupd_BMC

  2. BreakFix_Firmware_Upgrade_Compute_nvfwupd_HGX

  3. BreakFix_Firmware_Upgrade_Mellanox

AUX_CYCLE is an additional parameter in the individual runbooks that you may change to perform an AUX power cycle.

Note: The firmware upgrade automation follows the general outline described above. Please check the firmware upgrade documentation and, if it differs from the standard procedure mentioned above, take the necessary steps for any additional components or requirements.

Troubleshooting#

1. Nodes not reachable from headnode#

To upgrade firmware, the BMC IP must be accessible from the headnode. The runbook verifies node accessibility and automatically skips unreachable nodes.

Action: Ensure the node is online and accessible from the headnode, then rerun the firmware upgrade runbook.

_images/fw_image04.png
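To check reachability by hand before rerunning the runbook, a TCP connection test from the headnode is usually enough. The sketch below behaves roughly like `nc -z`. The default port 443 and the function names are assumptions for illustration; the runbook's own check may use a different port or method.

```python
import socket
from typing import Callable, Iterable, List, Tuple


def tcp_probe(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def partition_by_reachability(
    bmc_hosts: Iterable[str],
    probe: Callable[[str], bool] = tcp_probe,
) -> Tuple[List[str], List[str]]:
    """Split BMC addresses into (reachable, unreachable) lists."""
    reachable, unreachable = [], []
    for host in bmc_hosts:
        (reachable if probe(host) else unreachable).append(host)
    return reachable, unreachable
```

Calling `partition_by_reachability` with the BMC addresses of the nodes you plan to upgrade returns the addresses that would likely be skipped as the second list.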

2. Failed firmware upgrade#

The firmware upgrade failed to complete successfully. Possible causes include failures in the nvfwupd or mlxfwmanager command, depending on the package.

Action: Verify logs by clicking on the Output, which includes the command’s stdout and stderr.

_images/fw_image05.png _images/fw_image06.png

Note that the runbook does not automatically undrain the nodes or remove the maintenance tag when the firmware upgrade fails. After verifying that the failures are safe to ignore and the nodes are ready to return to the pool, undrain and untag the nodes using the UNDRAIN_AND_UNTAG runbook:

_images/b200_fw_undrain_untag.png

3. Failed validation stage#

The final step in every firmware upgrade is validation. The runbook verifies the current firmware versions against the source of truth (SOT), which is the JSON file supplied in FW_SOURCE_JSON_PATH. Failures may result from upgrade issues, an incorrect SOT JSON file, or command failures when retrieving component versions. Validation may also include components that were not upgraded in the firmware release.

Action: Compare the expected and actual versions in the logs and check for any other errors.
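When the logs list many components, comparing the expected and actual versions programmatically can be quicker. The sketch below assumes both sides have already been reduced to a flat mapping of component name to version string. That layout, the component names, and the version numbers are hypothetical; the actual SOT JSON schema is not described here.

```python
from typing import Dict, Optional, Tuple


def diff_versions(
    expected: Dict[str, str], actual: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each mismatched component to (expected, actual).

    The actual value is None when no version was reported for the component.
    """
    mismatches = {}
    for component, want in expected.items():
        got = actual.get(component)
        if got != want:
            mismatches[component] = (want, got)
    return mismatches


# Hypothetical component names and versions, for illustration only.
expected = {"BMC": "1.2.3", "HGX": "4.5.6", "NIC": "7.8.9"}
actual = {"BMC": "1.2.3", "HGX": "4.5.0"}

for component, (want, got) in diff_versions(expected, actual).items():
    print(f"{component}: expected {want}, found {got or 'no version reported'}")
```

A component that appears in the expected set but reports no version is listed separately from a plain mismatch, which helps distinguish command failures from upgrade failures.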

Notes#

  • Netcat is used to check if nodes are back online after a reboot.

  • If nvfwupd does not perform an upgrade (for example, because the component is already at the specified version) and FORCE_UPGRADE is not set to true, the runbook exits after removing the maintenance tag.

NVIDIA Mission Control autonomous hardware recovery Break/Fix Workflow#

Break/Fix Introduction#

NVIDIA Mission Control autonomous hardware recovery provides automated break/fix workflows to handle node failures for DGX B200 systems. These workflows execute a series of diagnostic steps to determine the cause of the failure, take the necessary repair steps, and create Support tickets for issues that cannot be resolved automatically.

The automated break/fix workflow is designed to efficiently diagnose and remediate issues, with clear paths for different failure scenarios and comprehensive validation to ensure systems are properly restored to service.

Note: In B200 systems, the break/fix workflow operates at the compute node level (individual DGX B200 servers), whereas in GB200 systems, it operates at the compute tray level (liquid-cooled units containing multiple GPU modules).

_images/bcn_ahr_image12.png _images/bcn_ahr_image13.png

Figure: Compute Break/Fix Workflow - This diagram illustrates the end-to-end automated break/fix process for B200 systems, starting from the BREAKFIX_TRIGGER runbook that monitors for nodes drained with specific reasons every 5 minutes. The workflow shows how nodes are triaged through BREAKFIX_COMPUTE_TRAY_TRIAGE, which routes to either GPU_RECOVERY for GPU-specific issues or directly to validation. All paths converge on BREAKFIX_COMPUTE_TRAY_VALIDATION for comprehensive testing, with failed nodes proceeding to BREAKFIX_DIAG_DUMP for diagnostic collection and support ticket creation, while successful nodes are returned to service.
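The routing described in the figure caption can be summarized as a small state function. This is a simplified sketch for orientation only, not the runbook implementation: it covers just the paths named in the caption, omits failure branches inside triage and GPU recovery, and uses the hypothetical labels `RETURN_TO_SERVICE` and `SUPPORT_TICKET` for the two terminal outcomes.

```python
def next_step(current: str, gpu_issue: bool = False, passed: bool = False) -> str:
    """Return the step that follows `current` in the simplified flow."""
    if current == "BREAKFIX_TRIGGER":
        return "BREAKFIX_COMPUTE_TRAY_TRIAGE"
    if current == "BREAKFIX_COMPUTE_TRAY_TRIAGE":
        return "GPU_RECOVERY" if gpu_issue else "BREAKFIX_COMPUTE_TRAY_VALIDATION"
    if current == "GPU_RECOVERY":
        return "BREAKFIX_COMPUTE_TRAY_VALIDATION"
    if current == "BREAKFIX_COMPUTE_TRAY_VALIDATION":
        return "RETURN_TO_SERVICE" if passed else "BREAKFIX_DIAG_DUMP"
    if current == "BREAKFIX_DIAG_DUMP":
        return "SUPPORT_TICKET"
    raise ValueError(f"unknown step: {current}")


def trace(gpu_issue: bool, passed: bool) -> list:
    """Follow the flow from the trigger until a terminal outcome is reached."""
    steps = ["BREAKFIX_TRIGGER"]
    while steps[-1] not in ("RETURN_TO_SERVICE", "SUPPORT_TICKET"):
        steps.append(next_step(steps[-1], gpu_issue=gpu_issue, passed=passed))
    return steps


print(" -> ".join(trace(gpu_issue=True, passed=False)))
```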

Key Features#

  • Automatic Detection: Identifies drained nodes without manual intervention

  • Intelligent Triage: Routes to appropriate diagnostic workflows based on failure symptoms

  • Comprehensive Diagnostics: Performs thorough hardware and software checks

  • Automated Remediation: Attempts to resolve issues without human intervention when possible

  • Detailed Reporting: Provides comprehensive logs for RMA or further troubleshooting

Entrypoint of Break/Fix Workflow#

A centralized automated break/fix interface has been established to facilitate streamlined diagnostics and remediation. This unified entry point provides comprehensive access to the break/fix framework, enabling efficient navigation and implementation of all remediation procedures.

To access the break/fix interface:

  • Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.

  • Locate the “BREAKFIX_TRIGGER” runbook through the search functionality

  • Upon selecting the “BREAKFIX_TRIGGER” runbook, you will be presented with the interface showing the workflow for automated break/fix.

    _images/image95.png

Run Break/Fix Workflow#

The BREAKFIX_TRIGGER is a time-triggered runbook that automatically runs every 5 minutes. When executed, it:

  • Checks BCM for drained nodes

  • Checks the drain reason in BCM and proceeds only if the reason is in the allowed list (Excluded by ARE, Prolog error, Kill task failed, Epilog error, Duplicate jobid, Low RealMemory, or Not responding); otherwise, the runbook exits

  • Processes one node per execution cycle from the Slurm drained node pool and applies a maintenance tag in AHR to prevent duplicate processing, so that all drained nodes are handled sequentially across multiple runs

  • Initiates the appropriate triage workflow for the drained node being processed

  • Begins diagnostic procedures based on the drain reason and node symptoms

  • Requires no manual intervention to start the process
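The selection logic in the list above can be sketched as follows. This is an illustration under stated assumptions rather than the runbook's code: the case-insensitive substring match, the function names, and the input shapes (a mapping of node name to drain reason, and a set of nodes that already carry the maintenance tag) are all hypothetical.

```python
from typing import Dict, Optional, Set

# Allowed drain reasons as listed in this section.
ALLOWED_DRAIN_REASONS = (
    "Excluded by ARE",
    "Prolog error",
    "Kill task failed",
    "Epilog error",
    "Duplicate jobid",
    "Low RealMemory",
    "Not responding",
)


def is_allowed(reason: str) -> bool:
    """Assumed matching rule: case-insensitive substring match."""
    lowered = reason.lower()
    return any(allowed.lower() in lowered for allowed in ALLOWED_DRAIN_REASONS)


def pick_next_node(drained: Dict[str, str], in_maintenance: Set[str]) -> Optional[str]:
    """Return the first untagged node with an allowed drain reason, or None."""
    for node, reason in drained.items():
        if node not in in_maintenance and is_allowed(reason):
            return node
    return None


# Hypothetical drained pool, for illustration only.
drained = {
    "node01": "Manual maintenance",
    "node02": "Prolog error",
    "node03": "Not responding",
}
print(pick_next_node(drained, in_maintenance=set()))
print(pick_next_node(drained, in_maintenance={"node02"}))
```

Because only one node is returned per call and tagged nodes are skipped, repeated cycles work through the pool one node at a time, which matches the sequential handling described above.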

You can see the time trigger settings on the right side of the runbook under Triggers, which shows that the trigger is currently enabled and runs every 5 minutes.

_images/image96.png

You can also manually trigger the workflow:

  • Navigate to the “BREAKFIX_TRIGGER” runbook in the Runbooks section

  • Select the “Create Run” button positioned in the top right corner of the interface

    _images/image97.png
  • After initiating the process, a confirmation dialog will appear with a “View Run” link

  • Selecting this link will redirect you to a page displaying comprehensive job status and details

The automated nature of this workflow ensures that system issues are addressed promptly without requiring constant monitoring or manual intervention.

Break/Fix Workflow Components#

The break/fix system consists of several key components that work together to diagnose and remediate issues with compute nodes. The following is a detailed explanation of each component:

BREAKFIX_TRIGGER#

The entry point runbook that:

  • Runs automatically every 5 minutes via time trigger

  • Checks for any drained nodes in BCM

  • Initiates the triage process for affected nodes

  • Routes to the appropriate diagnostic workflow

BREAKFIX_COMPUTE_TRAY_TRIAGE#

This runbook is automatically triggered by BREAKFIX_TRIGGER and performs comprehensive triage on drained compute nodes.

_images/image98.png

Triage follows one of two main workflows, depending on whether the node responds:

Workflow for Unresponsive Compute Nodes#
  1. Initial Assessment

    • Tests connectivity using ping to check if compute nodes are responsive or unresponsive

    • Identifies nodes that are already unresponsive and require recovery

  2. Recovery Process

    • Initiates power cycle for nodes that are down

    • Waits and checks if hosts come back online

    • Waits until the AHR agent is connected to confirm successful recovery

  3. Failure Handling

    • Creates Support ticket for hosts that fail to start up (if opted in to support ticket service)

  4. Validation

Workflow for Responsive Compute Nodes#
  1. GPU Recovery Assessment

GPU_RECOVERY#

This specialized diagnostic runbook is automatically invoked by BREAKFIX_COMPUTE_TRAY_TRIAGE when GPU-related issues are detected.

_images/image99.png

The runbook focuses on GPU-related issues and proceeds as follows:

  1. Verification and Assessment

    • Verifies the node is still drained from BCM

    • Categorizes recovery actions (Reboot, Reset, or None)

  2. Recovery Actions Based on Type

    • For Reboot Action:

      • Reboots the node requiring GPU reboot

      • Waits for host to come back online

      • Automatically runs BREAKFIX_COMPUTE_TRAY_VALIDATION if host is up

      • Creates Support ticket for hosts that fail to start up (if opted in to support ticket service)

    • For Reset Action:

    • For No Action Required:

BREAKFIX_COMPUTE_TRAY_VALIDATION#

This runbook is automatically executed after remediation actions to validate system health as part of the automated workflow following triage and recovery operations.

_images/image100.png

The runbook validates system health as follows:

  1. Comprehensive Testing

    • Runs testing suites to validate the compute node

    • Executes HPL test (for break/fix scenarios, HPL uses mpirun for single-node execution instead of Slurm)

    • Runs AHR prolog script to prevent undraining of nodes that are still failing prolog checks, since undraining will result in them being drained again at the next Slurm invocation

  2. Result Handling

    • For failed tests, automatically runs BREAKFIX_DIAG_DUMP for detailed diagnostics

    • For passed tests, undrains/untags the host to return it to service

BREAKFIX_DIAG_DUMP#

This runbook is automatically triggered when validation tests fail, collecting comprehensive diagnostic information for support ticket creation.

_images/image101.png

The runbook performs the following steps:

  • Runs NVSSVT (NVIDIA System Software Validation Toolkit)

  • Collects NVSM (NVIDIA System Management) health dumps

  • Executes EUD (End User Diagnostics)

  • Runs Partnerdiag if necessary

  • Creates a consolidated diagnostic log dump package

  • Generates a Support ticket with diagnostic log for support (if opted in to support ticket service)

Prerequisites for EUD and Partnerdiag:

  • EUD: Binary must be installed on every compute node for execution

  • Partnerdiag: Binary must be installed on the head node under path /cm/shared/partnerdiag

  • If these binaries are not properly installed, EUD and Partnerdiag will be skipped during diagnostic collection

View Break/Fix Result#

Users can monitor break/fix operations and determine outcomes through multiple methods:

Accessing Break/Fix Results#

Via Runbook Execution View:

  1. Click “Runs” in the upper left corner

  2. Filter by “BREAKFIX_TRIGGER” to see all break/fix executions

  3. Select a specific run to view detailed execution flow

    _images/image116.png

Via Resource Run History:

  1. Navigate to “Resources” in the left panel and search for the resource (e.g., “node01”)

    _images/image117.png
  2. Click on the resource name

    _images/image118.png
  3. View the “Run History” page which displays all runbooks the resource participated in, with execution timestamps and status

    _images/image119.png
  4. Filter or search for BREAKFIX related runbooks to see the complete history of remediation attempts for that specific node

    _images/image120.png

Understanding Break/Fix Outcomes#

Successful Recovery Indicators:

  • Node status changes from “DRAINED” to “IDLE” or “ALLOCATED” in BCM

  • Maintenance tag is removed from the node

  • BREAKFIX_COMPUTE_TRAY_VALIDATION shows “PASSED” status

  • Node is automatically returned to service

Failed Recovery Indicators:

  • Node remains in “DRAINED” state

  • Support ticket is automatically created (if opted in to support ticket service)

  • BREAKFIX_DIAG_DUMP execution indicates diagnostic collection

  • Maintenance tag remains on the node

  • Drain reason is updated to add “(AHR complete)” to indicate AHR processing has finished. Note that “(AHR in-progress)” indicates Break/Fix is still processing the node.
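If you list drain reasons outside the UI, the two markers mentioned above can be used to tell whether Break/Fix has finished with a node. The sketch below only looks for the marker text quoted in this section. The surrounding drain reason format and the example strings are assumptions.

```python
def ahr_state(drain_reason: str) -> str:
    """Classify a drain reason by the AHR marker it contains."""
    if "(AHR in-progress)" in drain_reason:
        return "in-progress"
    if "(AHR complete)" in drain_reason:
        return "complete"
    return "not-processed"


# Hypothetical drain reasons, for illustration only.
reasons = {
    "node01": "Prolog error (AHR in-progress)",
    "node02": "Not responding (AHR complete)",
    "node03": "Low RealMemory",
}
for node, reason in reasons.items():
    print(node, ahr_state(reason))
```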

Determining Recovery Path#

GPU Recovery Path:

  • Look for GPU_RECOVERY runbook execution in the workflow

  • Check if GPU reboot or reset actions were performed

  • Validation results indicate GPU functionality restoration

Power Cycle Recovery Path:

  • BREAKFIX_COMPUTE_TRAY_TRIAGE shows auxiliary power cycle execution

  • Node connectivity tests show successful recovery

Monitoring Ongoing Operations#

  • Break/fix operations run every 5 minutes automatically

  • Check the “maintenance” tag to see which nodes are currently being processed

  • Review recent BREAKFIX_TRIGGER executions to track system-wide break/fix operations

NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA#

Break/Fix Post RMA Introduction#

The NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA workflow automates the process of bringing hardware components back into service after a Return Merchandise Authorization (RMA) replacement. This workflow ensures that replaced hardware is properly configured, firmware is updated to the correct versions, and the component is thoroughly validated before returning to production.

For B200 systems, the Post RMA workflow focuses on individual compute node replacement, including GPU tray replacement following the procedures outlined in the DGX B200 Field Replaceable Units (FRU) documentation.

_images/bcn_ahr_image14.png

Figure: Break/Fix Post RMA Workflow for B200 - This diagram illustrates the comprehensive post-RMA process for B200 compute node replacement, starting with physical GPU tray replacement (if needed), followed by power-on, boot order verification, and firmware updates to ensure correct versions. The workflow then proceeds through system validation including agent connectivity verification and comprehensive testing via BREAKFIX_COMPUTE_TRAY_VALIDATION, and automatically returns validated hardware to service.

Key Features#

  • Automated Configuration: Configures replaced hardware components with proper settings

  • Firmware Updates: Updates firmware to match the required versions for the environment

  • Comprehensive Validation: Ensures hardware meets all operational requirements before return to service

  • Automatic Service Restoration: Returns validated nodes to production automatically

Post RMA Workflow Components#

The Post RMA workflow consists of several automated steps:

Physical Replacement Procedures#

GPU Tray Replacement Instructions (Two people are needed for this procedure)

  1. Power down the system (system administrator)

    • Cordon the node from job scheduler so no more workloads are sent to the system

    • Shut down the system from the console

    • Confirm the system is turned off

    • Wait 5 minutes before working on the system to make sure the internal components have cooled off

    • Unplug all power cords from the power supplies

    • Note: Unlike GB200, B200 GPU tray has no front cables to disconnect

  2. Prepare workspace and confirm clearance

    • Confirm there is enough space and clearance at the back of the rack to access and remove the GPU tray from the system

    • Prepare a solid flat surface that will hold the GPU tray

    • If needed, unplug cables that are in the way or remove PDUs that might be blocking access

    • If obstructions can’t be removed, the system may need to be pulled out of the rack using a mechanical lift

  3. Release GPU tray from the system

    • Loosen the thumbscrews that hold the release levers in place

    • Pull the ejection levers out to disengage the midplane connectors

    • Note: The levers help eject the GPU tray connectors from the midplane

  4. Pull the GPU tray out of the system

    • Fully extend the levers and begin pulling the tray out

    • Use extreme caution during this procedure to avoid damaging the sliding mechanism

    • The tray will stop sliding and lock halfway out. Use the buttons on both sides to release the tray from the locking mechanism

    • Note: Do not use the handles to carry the tray as they will bend and may break

  5. With the help of another person, remove the GPU tray completely

    • Slowly pull out the GPU tray with the help of another technician (requirement due to weight and size)

    • Two people are necessary to support the weight of the component

    • Pull the tray all the way out until fully released from the system

  6. Install the new GPU tray

    • With the help of another person, install the new GPU tray

    • Pick up the new tray with the help of your partner

    • Fully extend the levers so they don’t get in the way during insertion

    • Insert the GPU tray into the slot

    • Two people are necessary to support the weight of the component

  7. Secure the GPU tray in the system

    • Use the levers to help with the mating of the connectors from the GPU tray to the midplane

    • Close the GPU tray levers to lock the tray in place

    • Tighten the GPU tray thumbscrews to secure the tray

  8. Install all power cords

    • Plug in all the power cords to the power supplies

  9. Power on the system and continue with automated workflow

    • Power on system from the console

    • Wait for system to boot up completely

    • Continue with GPU tray post RMA workflow in autonomous hardware recovery (AHR)

Power and Connectivity Verification#

After physical replacement:

  • System powers on successfully

  • BMC connectivity is established

  • Host OS boots properly

  • AHR agent connection is verified

Firmware Updates#

The workflow automatically:

  • Updates BMC firmware to the required version

  • Updates HGX firmware to the required version

  • Updates Mellanox firmware to the required version

System Validation#

Comprehensive validation through BREAKFIX_COMPUTE_TRAY_VALIDATION:

  • Hardware component detection

  • GPU functionality tests

  • Memory and storage validation

  • Network connectivity verification

  • Performance benchmarking (HPL test)

  • Prolog checks to ensure Slurm compatibility

Running the Post RMA Workflow#

Step 1: Update BCM Inventory (ONLY FOR NEW HARDWARE)#

Note: This step is ONLY required when installing a new compute tray. Skip this step for repaired trays as MAC addresses remain unchanged.

For new hardware replacement, update the BCM inventory with new MAC addresses and hardware identifiers:

  1. Navigate to the “BREAKFIX_POST_RMA_UPDATE_BCM_INVENTORY” runbook in the Runbooks section

    _images/image103.png
  2. Configure the required parameters (MAC addresses and serial numbers are provided by the Enterprise Support team):

    • HOST_NAME: The hostname of the replaced hardware component (e.g., a07-p1-dgx-03-c08)

    • BF_PORT2_0_MAC: First BlueField Port 2 MAC address (enp170s0f1np1)

    • BF_PORT2_1_MAC: Second BlueField Port 2 MAC address (enp41s0f1np1)

    • BMC_MAC: MAC Address of the BMC (optional)

    • NODE_IDENTITY_MAC: Node Identity MAC (optional; defaults to the first BlueField Port 2 MAC address if not specified)

    • TRAY_SERIAL_NUMBER: Serial Number of the Tray (optional)

  3. Select “Create Run” to initiate the BCM inventory update

    _images/bcn_ahr_image15.png
  4. Monitor the execution progress to ensure successful completion
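Because a mistyped MAC address leads to an incorrect inventory entry, it can help to check the values before entering them. The following sketch assembles the parameters listed above and validates the MAC fields. The helper name and the colon-separated MAC format are assumptions; the runbook may accept other notations.

```python
import re
from typing import Dict, Optional

# Assumed notation: six colon-separated hexadecimal octets.
MAC_PATTERN = re.compile(r"[0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5}")


def build_inventory_params(
    host_name: str,
    bf_port2_0_mac: str,
    bf_port2_1_mac: str,
    bmc_mac: Optional[str] = None,
    node_identity_mac: Optional[str] = None,
    tray_serial_number: Optional[str] = None,
) -> Dict[str, str]:
    """Collect the runbook parameters and reject malformed MAC addresses."""
    params = {
        "HOST_NAME": host_name,
        "BF_PORT2_0_MAC": bf_port2_0_mac,
        "BF_PORT2_1_MAC": bf_port2_1_mac,
        # Mirrors the documented default: fall back to the first Port 2 MAC.
        "NODE_IDENTITY_MAC": node_identity_mac or bf_port2_0_mac,
    }
    if bmc_mac:
        params["BMC_MAC"] = bmc_mac
    if tray_serial_number:
        params["TRAY_SERIAL_NUMBER"] = tray_serial_number
    for name, value in params.items():
        if name.endswith("_MAC") and not MAC_PATTERN.fullmatch(value):
            raise ValueError(f"{name} is not a colon-separated MAC address: {value!r}")
    return params


# Hypothetical MAC addresses, for illustration only.
print(build_inventory_params("a07-p1-dgx-03-c08", "aa:bb:cc:dd:ee:01", "aa:bb:cc:dd:ee:02"))
```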

Step 2: Execute Main Post RMA Workflow#

After updating the BCM inventory (if required for new hardware), execute the main Post RMA workflow:

  1. Complete Physical Replacement

    • Follow the GPU tray replacement procedures outlined previously

    • Ensure system is powered on and booting

  2. Navigate to the BREAKFIX_POST_RMA Runbook

    • Access the NVIDIA Mission Control autonomous hardware recovery portal

    • Go to Runbooks section

    • Search for “BREAKFIX_POST_RMA”

    _images/bcn_ahr_image16.png
  3. Configure the required secrets (if not already configured):

    • AHR_API_ENDPOINT: The API endpoint URL for NVIDIA Mission Control

    • AHR_TOKEN: Authentication token for API access

    To add or update these secrets:

    1. In the NVIDIA Mission Control autonomous hardware recovery UI, go to Settings.

    2. Click on the Secrets section.

    3. Use the + Secret button to create:

      • AHR_API_ENDPOINT — Provide the correct API endpoint.

      • AHR_TOKEN — Provide the secure API token.

    4. If a secret already exists, click its name to update the value.

    5. Click Save to persist the changes.

    🔐 These secrets will be securely injected into the action at runtime.

    _images/image106.png
  4. Configure Required Parameters

    _images/bcn_ahr_image17.png
    • HOST_NAME: The hostname of the replaced node (e.g., “node01”)

    • BMC_FIRMWARE_PATH: Full path to the BMC firmware package directory or .fwpkg file

    • HGX_FIRMWARE_PATH: Full path to the HGX firmware package directory or .fwpkg file

    • MELLANOX_FIRMWARE_PATH: Full path to the Mellanox firmware package directory or .fwpkg file

    • FW_SOURCE_JSON_PATH: Full path to the firmware source JSON file

    Example:

    HOST_NAME: a04-p01-dgx-04-c16
    BMC_FIRMWARE_PATH: /cm/shared/firmware/nvfw_DGX_250629.1.0.fwpkg
    HGX_FIRMWARE_PATH: /cm/shared/firmware/nvfw_HGX_DGXB100-B200x8_250828.1.1.fwpkg
    MELLANOX_FIRMWARE_PATH: /cm/shared/firmware/network
    FW_SOURCE_JSON_PATH: /cm/shared/firmware/firmware_source.json
    
  5. Create and Monitor the Run

    • Click “Create Run” to initiate the workflow

    • Click “View Run” in the confirmation dialog

    • Monitor progress through each workflow step

  6. Wait for Completion

    • The workflow will take approximately 2-3 hours depending on firmware updates required

    • Monitor for any errors or failures during execution
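Since a run takes hours, a quick check of the parameter values before selecting “Create Run” can save a failed attempt. The sketch below only verifies that each value is present and that the path parameters are full (absolute) paths. It does not touch the file system, and the helper name is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict, List

PATH_PARAMETERS = (
    "BMC_FIRMWARE_PATH",
    "HGX_FIRMWARE_PATH",
    "MELLANOX_FIRMWARE_PATH",
    "FW_SOURCE_JSON_PATH",
)


def check_post_rma_params(params: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    if not params.get("HOST_NAME"):
        problems.append("HOST_NAME is missing")
    for name in PATH_PARAMETERS:
        value = params.get(name)
        if not value:
            problems.append(f"{name} is missing")
        elif not PurePosixPath(value).is_absolute():
            problems.append(f"{name} is not a full path: {value}")
    return problems


# Values taken from the parameter example in step 4.
params = {
    "HOST_NAME": "a04-p01-dgx-04-c16",
    "BMC_FIRMWARE_PATH": "/cm/shared/firmware/nvfw_DGX_250629.1.0.fwpkg",
    "HGX_FIRMWARE_PATH": "/cm/shared/firmware/nvfw_HGX_DGXB100-B200x8_250828.1.1.fwpkg",
    "MELLANOX_FIRMWARE_PATH": "/cm/shared/firmware/network",
    "FW_SOURCE_JSON_PATH": "/cm/shared/firmware/firmware_source.json",
}
print(check_post_rma_params(params) or "basic checks passed")
```

To also confirm that the files exist, run an equivalent check on the head node, where the shared firmware directory is mounted.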

Post RMA Workflow Results#

Successful Post RMA Indicators#

  • All Steps Complete: Each workflow component shows “Complete” status

  • Firmware Updated: All firmware versions match the golden configuration

  • Validation Passed: BREAKFIX_COMPUTE_TRAY_VALIDATION completes successfully

  • Node Returned to Service: Node is automatically undrained and available for workloads

  • Tags Removed: Maintenance tags are cleared from the node

Failed Post RMA Indicators#

  • Firmware Update Failure: One or more firmware components failed to update

  • Boot Failure: System fails to boot after firmware update

  • Validation Failure: Hardware tests fail during BREAKFIX_COMPUTE_TRAY_VALIDATION

  • Agent Connection Timeout: AHR agent fails to connect within timeout period

Troubleshooting Post RMA Failures#

Firmware Update Failures:

  1. Verify firmware package path is correct and accessible

  2. Check BMC connectivity and credentials

  3. Review firmware package compatibility with hardware

  4. Manually retry firmware update step

Boot Failures:

  1. Check power connections

  2. Verify BMC is accessible

  3. Review console logs for boot errors

  4. Check boot order configuration

Validation Failures:

  1. Review BREAKFIX_COMPUTE_TRAY_VALIDATION logs for specific test failures

  2. Check hardware installation (GPU tray properly seated)

  3. Verify all power connections

  4. Review GPU detection and functionality

  5. If failures persist, perform additional hardware troubleshooting or request another RMA

Agent Connection Timeouts:

  1. Verify network connectivity

  2. Check firewall rules

  3. Ensure AHR agent service is running

  4. Review agent logs on the compute node

Accessing Post RMA Results#

Via Runbook Execution:

  1. Navigate to “Runs” in the left panel

  2. Filter by “BREAKFIX_POST_RMA”

  3. Select the specific run for your node

  4. Review each step’s execution status and logs

Via Resource History:

  1. Navigate to “Resources”

  2. Search for the replaced node

  3. Click on the node name

  4. Review “Run History” for BREAKFIX_POST_RMA execution

  5. Check timestamps and outcomes

B300#

Automated Baseline Testing with NVIDIA Mission Control autonomous hardware recovery#

B300 systems support automated baseline testing with NVIDIA Mission Control autonomous hardware recovery. The following sections describe the baseline testing procedures for B300, covering both SUT and MUT.

Introduction#

The NVIDIA Mission Control autonomous hardware recovery portal enables efficient automation of baseline testing procedures for GPU B300 systems. This comprehensive testing framework can be flexibly executed across various scales of infrastructure, from individual compute nodes to multiple node configurations. The system performs extensive validation of critical compute node components, including CPU/GPU/Memory/storage functionality, network connectivity, and firmware versioning. Additionally, it incorporates industry-standard performance benchmarking tools such as HPL, NCCL, and Nemotron (Large Language Model) to assess system capabilities. This streamlined approach significantly enhances both testing efficiency and thoroughness while reducing execution time.

Note: For B300, unit means an individual compute node.

Figure: Baseline Testing - Single Unit or Multi Unit test mode, followed by status, firmware checks, and resource dashboard.

Entrypoint of Automated Testing#

A centralized automated baseline testing interface has been established to facilitate streamlined test execution and management. This unified entry point provides comprehensive access to the testing framework, enabling efficient navigation and one-click implementation of all testing procedures.

To access the baseline testing interface:

  • Access the NVIDIA Mission Control autonomous hardware recovery portal using your credentials and navigate to the Runbooks section.

  • Locate the “DGX_SUPERPOD_BASELINE_TESTING” runbook through the search functionality (see the interface shown below)

    _images/bcn_ahr_image01.png
  • Upon selecting the “DGX_SUPERPOD_BASELINE_TESTING” runbook, you will be presented with the interface shown below. The subsequent sections of this documentation will provide detailed guidance on executing various testing procedures.

    _images/bcn_ahr_image02.png

Run SUT (Single Unit Testing) Job#

This section describes how to start baseline testing in the single unit configuration when one or more nodes are ready for testing.

  • Navigate to the “DGX_SUPERPOD_BASELINE_TESTING” runbook as described in the previous section.

  • Ensure the “DGX_SUPERPOD_BASELINE_TESTING_SUT” component is activated by toggling its switch control, which is located on the right side of the component’s group of interface icons. When activated, the switch displays as a green circle with a play icon.

  • Verify that the “DGX_SUPERPOD_BASELINE_TESTING_MUT” component is deactivated. When deactivated, its switch displays as a grey circle with a muted play symbol, indicating ‘Off’.

  • Select the “Save” button positioned in the top right corner of the interface to preserve your settings.

  • Upon successful completion of these preliminary steps, your runbook configuration should reflect the specified parameters as illustrated below.

    _images/bcn_ahr_image02.png
  • To initiate a run, select the Create Run button located at the top right corner of the interface. A new window will appear as shown below. For detailed information about each parameter, simply hover over the info icon beside it.

    _images/bcn_ahr_image03.png

    You are required to provide the following inputs for the runbook:

    • FW_SOURCE_JSON_PATH: Specify the file path to the golden configuration JSON file included in the firmware package. This file defines the reference settings used for validation. If the file is not available, set this parameter to NA.

    • IGNORE_LIST: Provide a list of nodes to exclude from the test, only if required. Leave the value as “none” if no nodes need to be ignored. This parameter supports regular expressions. Here are some examples:

      • Single node: node01

      • List format: ["node01", "node02"]

      • Pipe-delimited string: node01|node02

  • After entering the correct value, select the “Create Run” button to initiate the process. A confirmation dialog will appear with a “View Run” link illustrated below. Selecting this link will redirect you to a new page displaying comprehensive job status and details. For additional information regarding job monitoring and results, please refer to the “Check Job Status and Result” section of this documentation.

    _images/bcn_ahr_image04.png
  Note: The predefined timeout for SUT is 6 hours. You can adjust it based on your requirements.
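The three IGNORE_LIST formats shown earlier in this section can all be reduced to a single regular expression. The sketch below illustrates one way to do that and to preview which nodes a value would exclude. It assumes Python `re` semantics with whole-name matching and a JSON-style list with straight quotes. The function names are hypothetical, and the runbook's own parsing may differ.

```python
import json
import re
from typing import Optional


def ignore_pattern(value: str) -> Optional[re.Pattern]:
    """Convert an IGNORE_LIST value into a compiled pattern, or None for "none"."""
    text = value.strip()
    if text.lower() == "none":
        return None
    if text.startswith("["):
        # List format, for example ["node01", "node02"].
        text = "|".join(json.loads(text))
    return re.compile(text)


def is_ignored(node: str, value: str) -> bool:
    """Return True if the node name is matched in full by the IGNORE_LIST value."""
    pattern = ignore_pattern(value)
    return pattern is not None and pattern.fullmatch(node) is not None


for value in ("none", "node01", '["node01", "node02"]', "node01|node02"):
    print(value, "->", [n for n in ("node01", "node02", "node03") if is_ignored(n, value)])
```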

Run MUT (Multi Unit Testing) Job#

This section describes how to start baseline testing in the multi-unit configuration.

  • Navigate to the “DGX_SUPERPOD_BASELINE_TESTING” runbook as described in the previous sections.

  • Ensure the “DGX_SUPERPOD_BASELINE_TESTING_MUT” component is activated by toggling its switch control, which is located on the right side of the component’s group of interface icons. When activated, the switch displays as a green circle with a play icon.

  • Verify that the “DGX_SUPERPOD_BASELINE_TESTING_SUT” component is deactivated. When deactivated, its switch displays as a grey circle with a muted play symbol, indicating ‘Off’.

  • Select the “Save” button positioned in the top right corner of the interface to preserve your settings.

  • Upon successful completion of these preliminary steps, your runbook configuration should reflect the specified parameters as illustrated below.

    _images/bcn_ahr_image05.png
  • Select the “Create Run” button positioned in the top right corner of the interface. A new window appears, as illustrated below.

    _images/bcn_ahr_image06.png
  • After entering the correct Required parameter values, select the “Create Run” button to initiate the process. A confirmation dialog will appear with a “View Run” link illustrated below. Selecting this link will redirect you to a new page displaying comprehensive job status and details. For additional information regarding job monitoring and results, please refer to the “Check Job Status and Result” section of this documentation.

    _images/bcn_ahr_image07.png
  Note: The predefined timeout for MUT is 48 hours. You can adjust it based on your requirements.

Runbook Configurations#

Before you start real jobs, use the following guide to check the runbook configurations.

Select “Runbook” from the left navigation panel, then use the search field on the right side of the page to find your runbook by name, as shown in the illustration below.

_images/bcn_ahr_image08.png

The following table lists all Mission Control related runbooks for B300 systems, with their names and descriptions.

| Category | Runbook Name | Description |
|---|---|---|
| | DGX_SUPERPOD_BASELINE_TESTING | EntryPoint Runbook |
| SUT | DGX_SUPERPOD_BASELINE_TESTING_SUT | EntryPoint Runbook of SUT (Single Unit Testing) |
| | SUT1 | EntryPoint Runbook of all single node health checks |
| | SINGLENODE_HEALTHCHECK_GPU_CPU | Baseline health checks for GPU and CPU |
| | SINGLENODE_HEALTHCHECK_MEMORY_STORAGE | Baseline health checks for Memory and Storage |
| | SINGLENODE_HEALTHCHECK_NETWORK | Baseline health checks for Network |
| | SINGLENODE_HEALTHCHECK_SOFTWARE | Baseline health checks for installed software |
| | SINGLENODE_HEALTHCHECK_FIRMWARE | Baseline health checks for firmware |
| | SUT2 | EntryPoint Runbook of component testing |
| | MEMORY_BENCHPRESS | Benchmark testing for memory |
| | CUDA_SAMPLES | Benchmark testing for CUDA |
| | HPL_MXP_TEST_SINGLE_NODE_MPIRUN | HPL testing on single node separately |
| | NVBANDWIDTH_SINGLE_NODE_MPIRUN | Bandwidth testing running on single node |
| | NCCL_TEST_SINGLE_NODE_MPIRUN | NCCL testing on single node |
| | SUT3 | EntryPoint Runbook of burn-in performance testing |
| | HPL_MXP_TEST_BURN_IN_SINGLE_NODE_MPIRUN | HPL testing on the single node with long duration |
| MUT | DGX_SUPERPOD_BASELINE_TESTING_MUT | EntryPoint Runbook of MUT (Multi Unit Testing) |
| | MUT1 | EntryPoint Runbook of rack level connectivity testing |
| | INFINIBAND_CHECK | InfiniBand connectivity validation |
| | MUT2 | EntryPoint Runbook of multi-rack performance testing |
| | NVBANDWIDTH_MPIRUN | Bandwidth testing across multiple nodes |
| | P2P_IPERF_MPIRUN | Point-to-point network performance testing |
| | HPL_MXP_TEST_MPIRUN | HPL testing across multiple nodes |
| | NCCL_TEST_MPIRUN | NCCL testing across multiple nodes |
| | HPL_MXP_TEST_BURN_IN_MPIRUN | HPL testing across multiple nodes with long duration |
| | MUT3 | EntryPoint Runbook of cluster level testing |
| | Nemotron_15B_MPIRUN | LLM testing with mocked data |

Runbook Interface Guide#

When accessing the runbook as shown in the example below, please note these important configuration elements:

_images/bcn_ahr_image02.png
Central Workspace#

The main content area displays your resource queries, commands, scripts, or nested runbooks.

  • Each row represents an individual cell

  • Each cell includes a play button for isolated execution

  • Toggle switches allow you to enable/disable specific cells

Configuration Panel (Right Side)#

The right panel contains several critical configuration sections:

  1. Parameters: Contains all required inputs for runbook execution.

  2. Triggers: Configures automated execution methods:

    • Alarm triggers

    • Time triggers (cron jobs)

    • Other integrations, such as AlertManager

  3. Users: Manages permissions for who may run or edit the runbook.

  4. Settings: General runbook configuration options.

  5. More runbook operations, including:

    • Clone functionality

    • Export options

    • Delete runbook

Check Job Status and Result#

This section provides comprehensive guidance on monitoring runbook execution progress and interpreting results.

Check Job Status#

To monitor the status of a runbook execution:

  1. After creating a run, click the “View Run” link in the confirmation dialog

  2. Alternatively, navigate to “Runs” in the left panel and select your specific execution

  3. The run details page displays:

    • Overall run status

    • Start time and duration

    • Each cell’s execution status

    • Resource filtering information

    • Output from each step

_images/bcn_ahr_image09.png
Status Types and Definitions#
  1. Running: The job is currently executing. Progress is displayed as a percentage based on completed cells.

  2. Completed: The job has finished execution. Note that completion status does not guarantee successful results. Please review the detailed output in the results section.

  3. Aborted: The job terminated prematurely due to execution errors, such as cell syntax issues.

  4. Terminated: The job was forcibly ended by the system.

  5. Timed Out: The job exceeded its maximum allowed execution duration. The default timeout is 1 hour for a runbook and 1 minute for an action.

  6. Canceled: The job was manually terminated by a user.
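
The status definitions above can be summarized in a short Python sketch. The enum, the helper functions, and the idea of evaluating statuses in client code are illustrative assumptions and not part of the product API. Only the status names, the cell-based progress rule, and the default timeouts come from this section. The rounding used for the percentage is also an assumption.

```python
from enum import Enum


class RunStatus(Enum):
    RUNNING = "Running"
    COMPLETED = "Completed"
    ABORTED = "Aborted"
    TERMINATED = "Terminated"
    TIMED_OUT = "Timed Out"
    CANCELED = "Canceled"


# Default timeouts stated in this section, in seconds.
DEFAULT_TIMEOUT_SECONDS = {"runbook": 60 * 60, "action": 60}


def progress_percent(completed_cells: int, total_cells: int) -> int:
    """Progress of a running job as a whole-number percentage of completed cells."""
    if total_cells <= 0:
        return 0
    return (100 * completed_cells) // total_cells


def next_step(status: RunStatus) -> str:
    """Suggest a follow-up for a status (hypothetical helper)."""
    if status is RunStatus.RUNNING:
        return "wait"
    if status is RunStatus.COMPLETED:
        # Completed does not guarantee success: cell output still needs review.
        return "review results"
    return "open the run and troubleshoot"


print(progress_percent(3, 12), next_step(RunStatus.COMPLETED))
```

The key design point is that "Completed" is treated as a prompt to review results rather than as a success signal.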

Check Job Results#

When the job status displays “Completed,” you may proceed to review the job results. If the job status shows otherwise, you may click into the job for more details and troubleshooting.

Cell Structure Overview#

Each cell in the runbook contains three primary components:

_images/bcn_ahr_image10.png
  1. Main Content Area: This section displays the executed script or command. On the right side of the cell, you’ll find several control icons. While most icons were detailed in the previous section, the “fx” icon is particularly valuable as it displays all parameters along with input/output values for the specific cell when hovering over it.

    _images/image29.png
  2. Execution Information Bar: Located in the middle of the cell, this light grey text line indicates the execution start time and duration of the operation.

  3. Results Section: The bottom portion displays:

  • Exit code status

  • Execution location information

  • Complete command output (accessible by clicking the “Output” column contents)

Additional Features:

  • Configure output display preferences

  • Toggle the density

  • Download results in various formats using the download options menu

Streamlined Error Navigation: When a job contains numerous cells, manually checking for failures becomes inefficient. Use the “Error outline” feature in the middle panel to quickly locate problematic cells. Simply click any item in this list to automatically navigate to the corresponding failed cell.

_images/image30.png

Note: Reports are available for the major runbooks. For details, see the “Reports of Testings” section.

Handling Job Failures#

In the event of job or cell execution failures, the following remediation options are available:

  • Major Issue Resolution: Upon resolution of critical infrastructure issues (e.g., hardware replacement), a complete re-initialization of the SUT or MUT job is recommended.

  • Targeted Component Resolution: When specific components have been updated (e.g., firmware version upgrades), execute the relevant job or runbook within the existing session by selecting the “Run” button located in the upper-right interface section. This maintains all previously established parameters.

  • Individual Cell Correction: For isolated cell failures that have been addressed, execute the specific cell independently by clicking the play button on the right side of the cell. Note: Runs started this way can be difficult to locate from the reports panel (see the “Reports of Testings” section below).

    _images/image30.png

Firmware checks#

NVIDIA Mission Control autonomous hardware recovery includes firmware checks that extract the current firmware versions of the trays and switches, and compare them with the expected versions specified in the Source of Truth (SOT) file. The SOT file includes the expected versions for all components such as OS, HMC, ConnectX, etc., and is prepopulated. The SOT file can be obtained from the NVIS team.

The runbook extracts the expected versions of all firmware components and compares the current versions against them.
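
The comparison described above can be pictured with a minimal Python sketch. It is not the runbook's implementation. It assumes the SOT file is a flat JSON mapping from component name to expected version, which this guide does not confirm, and every version string below is made up for illustration.

```python
import json


def compare_firmware(expected: dict, current: dict) -> list:
    """Return one (component, expected, current, status) row per expected component."""
    rows = []
    for component, want in sorted(expected.items()):
        have = current.get(component)  # None when the node did not report it
        status = "PASS" if have == want else "FAIL"
        rows.append((component, want, have, status))
    return rows


# Hypothetical SOT content; the real file layout may differ.
sot_text = '{"OS": "1.0", "HMC": "2.1", "ConnectX": "28.40"}'
expected_versions = json.loads(sot_text)

# Hypothetical versions as they might be read from one tray.
current_versions = {"OS": "1.0", "HMC": "2.0"}

for row in compare_firmware(expected_versions, current_versions):
    print(row)
```

In this sketch a component that is missing from the node's report is treated as a failure, which is one reasonable choice rather than documented behavior.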

Updating the SOT file#

Place the SOT JSON file on the headnode and provide the complete file path as the input FW_SOURCE_JSON_PATH to the DGX_SUPERPOD_BASELINE_TESTING runbook. To update the file, simply replace it with the new file and update the path in the input parameter of the runbook accordingly.

Thresholds and Defaults#

The thresholds and default values for various tests are defined in the Golden Config File. The runbook picks the appropriate values and compares them against the values on the nodes. This includes defaults such as the number of GPUs and the expected benchmarks for benchmarking tests (such as SUT2).

Golden Config File#

The Golden Config File (referred to as the defaults.env file on the trays and control nodes) contains the expected values for all benchmark thresholds, and other relevant settings. This file is distributed across all trays and loaded as environment variables, making its contents available during testing.

_images/image31.png
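
As an illustration of how a test might consume such a file, the following Python sketch parses KEY=VALUE lines and checks one observed value against its expected default. The file format details and both variable names are assumptions made for this example. The real defaults.env contents are not reproduced in this guide.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment
        values[key.strip()] = value.strip().strip('"')
    return values


# Hypothetical file content; the real variable names are not shown in this guide.
sample = """
# expected hardware counts
EXPECTED_GPU_COUNT=8
export EXAMPLE_MIN_BANDWIDTH_GBPS="100.5"
"""

defaults = parse_env(sample)
measured_gpus = 8  # value that a health check might observe on a node
result = "PASS" if measured_gpus == int(defaults["EXPECTED_GPU_COUNT"]) else "FAIL"
print(result)
```
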
Updating the Golden Config File#

To update the golden config values, edit config/<CHIP>/defaults.yaml (for example, config/B200/defaults.yaml or config/GB200/defaults.yaml). The chip-specific env files are generated automatically from these YAML files during deployment. Do not edit the generated files under Shoreline_files/generated/ directly.

Once the changes are made and saved, use the NVIDIA Mission Control autonomous hardware recovery Runbook Deployment section (from the NVIDIA Mission Control AHR installation documentation) to apply them via OpenTofu. This process will create a File object on NVIDIA Mission Control autonomous hardware recovery, which pushes the updated file automatically to all control and compute trays. Various tests, including firmware checks, prolog, and epilog checks, source the defaults.env file and utilize the expected values, now available as environment variables.
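
The YAML-to-env generation step is handled by the deployment tooling and is not shown in this guide. Purely as a mental model, the sketch below flattens a nested mapping (standing in for an already parsed defaults.yaml) into KEY=VALUE lines. The UPPER_SNAKE naming convention and the sample keys are assumptions, not the generator's documented behavior.

```python
def flatten(mapping: dict, prefix: str = "") -> dict:
    """Flatten nested keys into UPPER_SNAKE names (an assumed naming convention)."""
    flat = {}
    for key, value in mapping.items():
        name = f"{prefix}{key}".upper()
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = str(value)
    return flat


def render_env(mapping: dict) -> str:
    """Render KEY=VALUE lines in sorted order so regenerated files diff cleanly."""
    return "\n".join(f"{k}={v}" for k, v in sorted(flatten(mapping).items())) + "\n"


# Hypothetical defaults, standing in for a parsed config/<CHIP>/defaults.yaml.
defaults = {"gpu": {"count": 8}, "hpl": {"min_gflops": 1000}}
print(render_env(defaults), end="")
```

Because the env file is derived output, any manual change to it would be overwritten at the next deployment, which is why edits belong in the YAML source.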

Reports of Testings#

NVIDIA Mission Control autonomous hardware recovery provides reports for baseline testing, reflecting the status of nodes (compute and switches) at each test stage. These reports help identify and troubleshoot root causes.

To access the NVIDIA Mission Control autonomous hardware recovery reports, click on Resources in the side menu, then select Reports.

_images/image32.png

The Landing page contains two tabs: Report Templates and Published Reports.

_images/landing_reports_bchips.png

Report Templates provide templates for each stage of Baseline testing. These templates include bar graphs that display the PASS or FAIL status for different nodes during the tests. However, these templates are static and do not store any test data. This means that while you can view the templates, you cannot save or modify the test results within them.

To generate a report that records the test results along with timestamps, click Publish. This action will create a new report based on the template, which will capture the current status of the tests. The report will display PASS for tests that were successful and FAIL for those that did not pass, with each status reflecting the most up-to-date information. The report also includes links to additional reports at the top of the page. These links allow you to access the detailed results of the individual SUT and MUT tests that make up each stage, giving you a deeper insight into the performance and status of each test within the overall baseline testing process.

Note: Reports reflect updated information only after the SUT and MUT tests have been executed.

Guide to initiate Reporting for Baseline Testing#

  • Navigate to the “DGX_SUPERPOD_BASELINE_TESTING_REPORT” under the Report Templates.

  • [Optional] Templates do not store test data, but you can use one to preview the current state of the test suite before publishing. Click the refresh button at the top of the report to load the current data.

    _images/b200_baseline_report.png
  • Click on “Publish” to generate a timestamped instance of the template that can be easily viewed and shared. Additionally, it automatically publishes all linked reports at the top of the page, ensuring that all related data is included and accessible.

    _images/b200_publish_report.png
  • Retain the auto-generated name for the report which includes the timestamp or provide your own name.

  • Once the process starts, the published reports will begin generating in the background. This includes “DGX_SUPERPOD_BASELINE_TESTING_REPORT” as well as all the linked SUT and MUT reports.

  • A pop-up notification will appear containing a hyperlink to access the published reports that are being currently generated.

    _images/b200_building_report.png
  • When you publish the “DGX_SUPERPOD_BASELINE_TESTING_REPORT”, it automatically triggers the publication of all linked reports, including the associated SUT and MUT reports, with the data captured at that time.

  • All the reports will complete building in under a minute, and the “Linked Published Reports” will include the published reports for all the Linked Reports.

Navigating to Previously Published Reports#

  • Navigate to the Reports from the NVIDIA Mission Control autonomous hardware recovery Home Page

    _images/image32.png
  • Click on “Published Reports”, and select the “DGX_SUPERPOD_BASELINE_TESTING_REPORT_timestamp” or any other report of interest.

    _images/b200_published_reports.png
  • You can also adjust the time range at the top of the screen to view reports generated within specific time frames. Options include viewing reports from the last 10 minutes, last 1 hour, or selecting a custom time range to explore older data beyond these periods.

Breakdown of a Published Report#

DGX_SUPERPOD_BASELINE_TESTING_REPORT#
  • “DGX_SUPERPOD_BASELINE_TESTING_REPORT” is the entry point for the Baseline Testing reports. It provides a comprehensive overview of all the SUT/MUT tests for the compute nodes.

  • Each stage is represented by a separate cell, displaying the results for that specific SUT or MUT test.

    _images/b200_baseline_report.png
  • To perform a deeper analysis or understand the tests in each stage, click on the relevant report listed under “Linked Published Reports” at the top of the page.

  • Similarly, the reports for each stage are all available. You can either find the report of interest under Published Reports, or navigate to it from the parent report (DGX_SUPERPOD_BASELINE_TESTING_REPORT_<timestamp>).

Understanding the Published Reports#

The reports align with the SUT and MUT tests. Each cell in “DGX_SUPERPOD_BASELINE_TESTING_REPORT” represents a specific testing stage, such as SUT1 or SUT2. Within each of these “Linked Reports”, the cells represent individual tests, such as Singlenode Healthchecks, HPL, and NCCL. The layout of the graph for each of the cells is organized as follows:

  • The X-axis represents all the nodes.

  • The Y-axis is not defined, which groups all the nodes into a single bar in the bar chart.

For example, in the following visualization, each bar represents the number of nodes within a cluster. In this case, there are 32 nodes per cluster, as indicated by the number displayed on each bar. This provides a clear view of the test progress for each tray across different clusters.

_images/b200_sut1.png
  • You can click on any bar in the graph to view a detailed list of the resources that passed or failed the tests within that specific bar. This allows you to drill down and see the test results for each tray in a particular rack or test stage.

    _images/b200_sut1_expanded.png
  • Alternatively, you can click on the legend to filter and display all the passed or failed resources across the entire cluster, providing a comprehensive view of the overall test status.
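
The bar and drill-down behavior described above amounts to grouping per-node results and counting statuses. The following Python sketch models that idea only. The tuple layout, group names, and node names are hypothetical and do not reflect how the report stores its data.

```python
from collections import Counter, defaultdict


def bar_counts(results: list) -> dict:
    """Count PASS/FAIL nodes per group, as a stacked bar chart would."""
    bars = defaultdict(Counter)
    for group, node, status in results:
        bars[group][status] += 1
    return {group: dict(counts) for group, counts in bars.items()}


def drill_down(results: list, group: str, status: str) -> list:
    """List the nodes behind one bar segment, like clicking a bar in the report."""
    return sorted(node for g, node, s in results if g == group and s == status)


# Hypothetical test results: (group, node, status).
results = [
    ("cluster-a", "node01", "PASS"),
    ("cluster-a", "node02", "FAIL"),
    ("cluster-a", "node03", "PASS"),
    ("cluster-b", "node04", "PASS"),
]
print(bar_counts(results))
print(drill_down(results, "cluster-a", "FAIL"))
```
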

Root Cause Analysis#

The report for each stage shows the trays that have passed and failed the test. To understand the actual issue and access the logs, you can follow these steps:

  • Navigate to the report whose PASS/FAIL values you want to investigate. In this example, consider the SUT1 report, where a few tests have failed for SINGLENODE_HEALTHCHECK_GPU_CPU.

    _images/b200_gpu_cpu.png
  • Navigate to the report corresponding to SINGLENODE_HEALTHCHECK_GPU_CPU and scroll to the tests that are failing.

    _images/image47.png
  • In this example, GPU Inforom Version has failed on 17 nodes. Click on the bar for one of the nodes to list the resources that the test failed on.

    _images/b200_gpu_inforom.png
  • Click on the status (FAIL), which links directly to the runbook where these tests failed. You can do the same for tests that passed by clicking on the PASS status.

    _images/b200_inforom_fail.png
  • The errors in the runbook are outlined to the left of the screen. Navigate to the test that we are debugging (Inforom Version). Alternatively, you can also scroll through the run to find the failed tests.

    _images/image55.png _images/image56.png
  • Click on the “Command filter excluded x/y resources” to get detailed output for each resource.

    _images/image57.png
  • Click on the Output to view the detailed logs

_images/image58.png _images/image59.png

Firmware Reports#

For Firmware Checks, NVIDIA Mission Control autonomous hardware recovery offers a tabular report detailing the status of each node. The report includes:

  • Pass/Fail Status: Indicates whether the firmware check for each node passed or failed.

  • Expected Version: Shows the firmware version that was expected for the node.

  • Current Version: Displays the actual firmware version currently installed on the node.

Navigating to Tabular Report for Firmware Checks#
  • You can access the SINGLENODE_HEALTHCHECK_FIRMWARE_<timestamp> report either through the Runs (Check Job Status and Result) section or by navigating to the Runs History (see Root Cause Analysis).

  • Scroll to the last cell of the execution and click on the output of the cell.

    _images/image60.png
  • You can scroll horizontally and vertically to view the complete output. Additionally, you have the option to download the cell output for further analysis.

    _images/image61.png
  • You can also navigate to the Reports section and view the tabular output by clicking on the status bar. This includes both the expected firmware version and the current firmware version.

    _images/b200_cx7_fw_report.png

Resource Dashboard#

The NVIDIA Mission Control autonomous hardware recovery dashboard provides a comprehensive view of the status of all resources in the cluster, displaying the progress of testing at various stages for each node (control, compute and switch nodes). It shows which tests are completed, which ones have passed, and which have failed. This allows users to easily track the overall health and status of the cluster, identify any issues, and assess the readiness of each resource.

Navigating to the Resource Dashboard#

  • Visit the NVIDIA Mission Control autonomous hardware recovery homepage, navigate to the ‘Resources’ menu, and click on ‘Dashboards’ to access the dashboard page.

    _images/bcn_ahr_image18.png
  • The Landing page contains two tabs: Dashboards and Dashboard Views.

    _images/bcn_ahr_image19.png
  • The Dashboards tab lists all available dashboards. These dashboards reflect the current status of the cluster and of the tests.

  • A Dashboard View is a snapshot of the Dashboard that captures the data at a specific point in time. This snapshot remains static, preserving the exact state and data of the Dashboard as it appeared at that moment, regardless of any future changes or updates to the live data.

DGX_SUPERPOD_BASELINE_TESTING_DASHBOARD#

The DGX_SUPERPOD_BASELINE_TESTING_DASHBOARD provides a detailed view of the cluster’s resources and the status of the baseline tests. Specifically, it tracks the progress of two key testing phases for each resource: SUT and MUT.

  • Resource Status: The dashboard displays all resources within the cluster, such as compute nodes, control nodes, and switch nodes, along with their associated status.

  • Hostname: For each resource, you will see the hostname, along with a Tag sequence that indicates the stage of both the SUT and MUT tests.

  • Test Stage Progress: The rows in the dashboard reflect the current status of each test stage. Each stage (SUT and MUT) has a corresponding tag name that visually represents its test progress, showing whether the stage has been successfully completed, is in progress, or has failed.

  • Snapshot of Cluster Health: The dashboard offers a comprehensive snapshot of the cluster’s readiness. It allows users to identify potential issues early, track the completion status of tests, and quickly assess which resources are ready and which ones may require attention.

    _images/b200_dashboard.png

The dashboard allows you to sort the resources by Name or Progress Bar, making it easier to organize and view the status of your cluster based on your preferred criteria.

  • Name: Sort resources alphabetically by their hostname for quick access.

  • Progress Bar: Sort by test progress to focus on resources at different stages of testing or to identify incomplete tasks.

Creating a View / Snapshot#

To create a snapshot of the dashboard, follow these steps.

  • Click on the “Create View” button located at the top right corner of your screen.

    _images/b200_dashboard_view.png
  • Retain the auto-generated name for the dashboard which includes the timestamp or provide your own name.

    _images/b200_create_dashboardView.png
  • Once the Dashboard View is created, a pop-up notification will appear with a hyperlink to access the Dashboard View.

    _images/image67.png
  • The Dashboard View created can be downloaded as a CSV to further manipulate the data to generate reports.

    _images/b200_dashoard_view.png
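
Once downloaded, the CSV can be processed with ordinary tooling. The Python sketch below shows one possible way to summarize it. The column names (Name, SUT, MUT) and the PASS/FAIL cell values are assumptions made for this example, so adjust them to match the header row of the file you actually export.

```python
import csv
import io
from collections import Counter

# Hypothetical export; the real column names and tag values may differ.
exported = """Name,SUT,MUT
node01,PASS,PASS
node02,FAIL,
node03,PASS,FAIL
"""

rows = list(csv.DictReader(io.StringIO(exported)))

# Count how many resources reported each SUT status.
sut_summary = Counter(row["SUT"] for row in rows)

# Resources with a FAIL in any stage column need attention.
needs_attention = [
    row["Name"] for row in rows if "FAIL" in (row["SUT"], row["MUT"])
]

print(dict(sut_summary))
print(needs_attention)
```

To read a real file, replace `io.StringIO(exported)` with an opened file handle, for example `open(path, newline="")`.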

NVIDIA Mission Control autonomous hardware recovery Active Node Updates#

Applies to all GPU platforms in this guide (GB300, GB200, B200, B300).

Active Node Updates Introduction#

The NVIDIA Mission Control autonomous hardware recovery Active Node Updates workflows provide manual failover procedures for updating the active headnode and Slurm controller node designations when the primary nodes go down. These runbooks must be executed manually by administrators to switch to backup nodes when hardware failures occur, ensuring continued system operations.

Important: These are manual procedures that must be initiated by administrators when primary nodes fail. The system does not fail over automatically; administrators must run these runbooks to designate new active nodes.

Key Features#

  • Manual Failover: Switch active node designations when primary nodes fail

  • Resource Tag Management: Automatically manages ACTIVE_HEADNODE and ACTIVE_SLURM_CONTROLLER tags

  • System Validation: Verifies updated node assignments after changes

Active Headnode Update#

The ACTIVE_HEADNODE_UPDATE runbook allows administrators to change which node is designated as the active headnode in the system. This workflow removes the ACTIVE_HEADNODE tag from the current headnode and assigns it to a new specified node.

Workflow Components#

  1. Untag Current Headnode: Removes the ACTIVE_HEADNODE tag from the currently designated headnode

  2. Tag New Headnode: Assigns the ACTIVE_HEADNODE tag to the specified new headnode

  3. Update Resource Definition: Updates the headnode resource definition to point to the new active node

  4. Validation: Verifies the new headnode assignment is correct

Running the Active Headnode Update#

To execute the Active Headnode Update workflow:

  1. Navigate to the “ACTIVE_HEADNODE_UPDATE” runbook in the Runbooks section

    _images/image132.png
  2. Configure the required parameter:

    • ACTIVE_HEADNODE: Input the hostname of the new active headnode

    _images/image133.png
  3. Select “Create Run” to initiate the workflow

  4. Monitor the execution progress and results

The runbook will:

  • Remove the ACTIVE_HEADNODE tag from any currently tagged nodes

  • Apply the ACTIVE_HEADNODE tag with value “True” to the specified node

  • Update the headnode resource definition to reference the new active node

  • Display the updated headnode configuration for validation
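
The tag bookkeeping performed by this runbook can be modeled in a few lines of Python. This sketch only illustrates the untag, tag, and validate sequence on an in-memory dictionary. It does not call NVIDIA Mission Control autonomous hardware recovery, and the hostnames and data layout are hypothetical. The same logic would apply to the ACTIVE_SLURM_CONTROLLER tag.

```python
def move_active_tag(resources: dict, tag: str, new_node: str) -> dict:
    """Return a copy of resources with `tag` set to "True" on new_node only."""
    if new_node not in resources:
        raise KeyError(f"unknown resource: {new_node}")
    updated = {name: dict(tags) for name, tags in resources.items()}
    for tags in updated.values():
        tags.pop(tag, None)          # untag any currently tagged node
    updated[new_node][tag] = "True"  # tag the new active node
    return updated


def active_nodes(resources: dict, tag: str) -> list:
    """Validation step: list every node that currently carries the tag."""
    return sorted(n for n, tags in resources.items() if tags.get(tag) == "True")


# Hypothetical inventory: hostname -> tags.
inventory = {"head-01": {"ACTIVE_HEADNODE": "True"}, "head-02": {}}
after = move_active_tag(inventory, "ACTIVE_HEADNODE", "head-02")
print(active_nodes(after, "ACTIVE_HEADNODE"))
```

Removing the tag from every node before applying it ensures that exactly one node carries it afterward, which is what the validation step checks.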

Active Slurm Controller Update#

The ACTIVE_SLURM_CONTROLLER_UPDATE runbook allows administrators to change which node is designated as the active Slurm controller in the system. This workflow removes the ACTIVE_SLURM_CONTROLLER tag from the current controller and assigns it to a new specified node.

Workflow Components#

  1. Untag Current Controller: Removes the ACTIVE_SLURM_CONTROLLER tag from the currently designated Slurm controller

  2. Tag New Controller: Assigns the ACTIVE_SLURM_CONTROLLER tag to the specified new controller node

  3. Update Resource Definition: Updates the slurm_controller resource definition to point to the new active node

  4. Validation: Verifies the new Slurm controller assignment is correct

Running the Active Slurm Controller Update#

To execute the Active Slurm Controller Update workflow:

  1. Navigate to the “ACTIVE_SLURM_CONTROLLER_UPDATE” runbook in the Runbooks section

    _images/image134.png
  2. Configure the required parameter:

    • ACTIVE_SLURM_CONTROLLER: Input the hostname of the new active Slurm controller

    _images/image135.png
  3. Select “Create Run” to initiate the workflow

  4. Monitor the execution progress and results

The runbook will:

  • Remove the ACTIVE_SLURM_CONTROLLER tag from any currently tagged nodes

  • Apply the ACTIVE_SLURM_CONTROLLER tag with value “True” to the specified node

  • Update the slurm_controller resource definition to reference the new active node

  • Display the updated Slurm controller configuration for validation

Important Notes#

  • Manual Execution Required: These runbooks must be run manually when primary nodes fail - there is no automatic failover

  • Failover Scenario: Use these runbooks when the primary headnode or Slurm controller becomes unavailable

  • AHR Tag Management Only: These runbooks only update AHR’s understanding of active nodes - they do not perform actual service migration

  • Prerequisites: Ensure backup nodes are properly configured and accessible before running these workflows

  • Post-Execution: Verify that dependent systems recognize the new active node assignments

NVIDIA Mission Control autonomous hardware recovery UI Tips and Troubleshooting#

Selecting Attributes to Display#

You can customize parts of the UI, like Runbooks, to show only the attributes you’re interested in by using the column selector or attribute settings. This helps declutter the interface and focus on the most relevant data.

After executing a command, for example:

INPUT union headnode | export("INPUT_HEADNODE")

you will see an output block like the one shown below.

_images/image75.png

Click on the “Show panel” button in the Output section. This opens an interactive interface that allows you to select the attributes you’d like to view in the UI.

_images/image76.png

Viewing Logs for Excluded Resources#

To investigate why certain resources were excluded, open the Run associated with the Runbook, and click on the “Action filter excluded resources” section within the corresponding Cell. This will open a detailed log view explaining the reasons behind each exclusion.

For example, in the screenshot below, you can see that the action filter excluded 1 out of 16 resources:

_images/image77.png

Once expanded, detailed logs will be visible, showing which resource was excluded and why, such as a version mismatch:

_images/image78.png

This log output helps in identifying failed checks, allowing targeted debugging.