GB200/GB300 Rack Firmware Update

Rack Firmware Updates

Firmware updates for a GB200/GB300 NVL72 rack using Base Command Manager (BCM) 11 software can be performed once all the GB200/GB300 compute trays, NVLink Switch trays, and power shelves are up in BCM. The latest FW/SW recipe must be followed for the installation to succeed on all devices. The firmware can also be updated with the standalone nvfwupd tool, as documented here. This section provides instructions to upgrade the firmware for each major GB200/GB300 rack component (compute tray, NVLink Switch tray, power shelf).

Note

FW packages for DGX SuperPOD are unique and different from the GB200 reference architecture package.

Reference: DGX GB200/GB300 Compute Tray Files Required for Update on DGX SuperPOD

The following table lists the general file names to expect for the DGX GB200 compute tray and NVLink Switch firmware updates. For more information, see the specific DGX GB200/GB300 SW/FW Release Notes on the NVIDIA Enterprise Support Portal. Specific filenames for each release are listed in the section “Multi-Node System Software Stack Package Contents”.

Table 4 DGX GB200 Compute Tray and NVLink Switch Tray Files Required for Update on DGX SuperPOD

Component            Filename
-------------------  ------------------------------------------------------
DGX GB200 SW/FW Release Notes
Compute BMC bundle   nvfw_DGX-GBX00_0023_<date>.*_custom_prod-signed.fwpkg
Compute HMC bundle   nvfw_HGX-GBX00_0023_<date>.*_custom_prod-signed.fwpkg
BF3                  fw-Bluefield-3-rel-*.bin
CX7                  fw-ConnectX7-rel-*.bin
Switch NVOS          nvos-amd64-*.bin
Switch BMC bundle    nvfw_GB200-P4978_0004.*.fwpkg
Switch BIOS bundle   nvfw_GB200-P4978_0006.*.fwpkg
Switch CPLD bundle   nvfw_GB200-P4978_0007.*.fwpkg
Powershelf PSU       NVIDIA_5500_APP_.*.tar
Powershelf PMC       common-pmc-3.*.tar

Firmware updates for the GB200/GB300 compute trays can be done by:

  • BCM 11 integrated firmware update tool

  • Standalone nvfwupd tool

GB200/GB300 compute tray firmware update: general steps

  1. Obtain the compute tray package.

  2. Ensure that the compute tray BMC has the username “admin” enabled and that its credentials are known. If the username “admin” does not exist or is disabled, create and enable it before the compute tray update. BCM and any other rack management system should migrate to “admin” as the default BMC account, because the previously used “root” account will be disabled.

    Note

    The “root” username is disabled going forward.

  3. If using BCM to do the firmware (FW) update:

    • Place the files in /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200

    • Confirm in the compute tray bmcsettings (at the node level or category level) that the firmware management mode is set to GB200.

  4. Check the current node’s FW versions against the update packages.

  5. Execute a dry-run to confirm the FW will update to the expected versions.

  6. Update the BMC package first (Compute BMC bundle), then the compute tray package (Compute HMC bundle). AC power-cycle the trays after each component update is complete.

NVLink Switch tray firmware update: general steps

  1. Obtain the NVLink Switch firmware.

  2. If using BCM to do the firmware update:

    • Place the files in /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200sw.

    • Confirm that in the NVLink Switch device bmcsettings, the firmware management mode is set to GB200sw.

  3. Check the current NVLink Switch firmware versions against the update packages.

  4. Execute a dry-run to confirm the firmware will update to the expected version.

  5. Update the tray level firmware first in this order:

    • BMC+FPGA+ERoT (Switch BMC bundle)

    • CPLD1 CPLD2 CPLD3 CPLD4 (Switch CPLD bundle)

    • SBIOS+ERoT (Switch BIOS bundle)

  6. Update the Switch NVOS from within the OS or using ZTP.

  7. Reboot the switch trays after each component update is complete, to apply and activate the new firmware.
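
The directory choice in the general steps above can be sketched as a small shell helper. This is only an illustration and not part of BCM: the function name fw_target_dir is hypothetical, and the patterns are taken from the filename table in this section. Files that match neither pattern (for example NVOS, BF3, CX7, or power shelf files) are reported as unknown rather than guessed.

```shell
# Hypothetical helper: map a firmware package filename to the BCM
# firmware subdirectory (gb200 for compute trays, gb200sw for NVLink
# Switch trays). Patterns follow the filename table in this section.
fw_target_dir() {
    case "$(basename "$1")" in
        nvfw_DGX-GBX00_*.fwpkg|nvfw_HGX-GBX00_*.fwpkg)
            echo "gb200" ;;
        nvfw_GB200-P4978_*.fwpkg)
            echo "gb200sw" ;;
        *)
            # Not a compute tray or switch tray bundle known to this sketch.
            echo "unknown" ;;
    esac
}
```

For example, calling fw_target_dir with nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg prints gb200, which corresponds to /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200.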

Compute Tray Firmware Update Process

The following sections provide instructions to update the firmware for the GB200/GB300 compute trays using the BCM/NVIDIA Mission Control integrated firmware update tool and the standalone nvfwupd tool.

Method 1: BCM/NVIDIA Mission Control integrated firmware update for compute tray

To use the firmware update tool in BCM 11, an NVIDIA Mission Control enabled license must be registered:

  1. Place firmware update packages in the correct BCM directory: /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200

  2. Copy the prod-signed.fwpkg images to the BCM head node. The files must be placed in the following directory to be visible to the firmware command.

    scp <BINARY_FILES> user@<HEAD_NODE>:/cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200
    

    Reference: BCM file directory structure for firmware updates.

    /cm/local/apps/cmd/etc/htdocs/bios/firmware/
    
    README.md b200/ gb200/ gb200sw/ gh200/ h100/ ilo/
    

    Note

    The gb200 folder is for compute tray firmware, the gb200sw folder is for NVLink Switch firmware.

  3. Use the firmware info command in BCM to list the firmware packages that are available on the head node. The output shows each file and the component it targets.

    $ cmsh -c 'device; firmware info'
    [BCM11-HEAD-01->device]% firmware info

    Device         Filename                                                   Component      Version                             State      Progress  Result  Size     Date
    -------------  ---------------------------------------------------------  -------------  ----------------------------------  ---------  --------  ------  -------  --------------------
    BCM11-HEAD-01  nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg    GB200-BMC      DGX-GBX00_0024_250215.1.0_custom    available  N/A       -       64MiB    2025-02-15, 16:39:41
    BCM11-HEAD-01  nvfw_GB200-P4978_0004_250213.1.0_prod-signed.fwpkg         GB200-Switch   GB200-P4978_0004_250213.1.0         available  N/A       -       75MiB    2025-02-13, 10:23:28
    BCM11-HEAD-01  nvfw_GB200-P4978_0006_250205.1.0_prod-signed.fwpkg         GB200-Switch   GB200-P4978_0006_250205.1.0         available  N/A       -       16.2MiB  2025-02-05, 15:11:49
    BCM11-HEAD-01  nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg  GB200-Switch   GB200-P4978_0007_250121.1.2_custom  available  N/A       -       1.64MiB  2025-01-21, 13:55:30
    BCM11-HEAD-01  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg    GB200-Compute  HGX-GBX00_0023_250223.1.1_custom    available  N/A       -       114MiB   2025-02-23, 20:20:42
    

    Note

    This will display the file names and target (such as GB200 or Switch) of all available firmware binaries. If the files do not show up with this command, they cannot be flashed by the BCM firmware manager. The officially released packages will have a common filename structure starting with nvfw_DGX-GBX00_<IDENTIFIER>_<DATE>.

  4. Confirm GB200 Tray BMC Access/Connectivity.

    • The BMC of each node needs to be configured in BCM. This should be done at the category level. Ensure that no bmc settings are added at the node level so that the compute trays inherit the settings from the category level.

    • Enter cmsh and show the current BMC settings for a given node or use the category level for GB200 compute trays since all their default passwords are the same (for DGX).

      # Category level
      category; use <dgx-category>; bmcsettings; show

      # Device level. Only use the device level to confirm that nothing has been set.
      device; use <device name>; bmcsettings; show

      # Settings that have not been set before are indicated by an asterisk in the prompt:
      [bcm11-headnode->device*[s03-p1-dgx-01-c06*]->bmcsettings*]%

      # Use this command to clear uncommitted changes
      refresh
      
    • Populate the bmcsettings fields in the dgx-gb200 category if they are not already populated.

      $ cmsh
      category use dgx-gb200; bmcsettings
      set username root # or admin if username admin is enabled
      set password <bmc password>
      set userid 1
      set firmwaremanagemode gb200
      commit
      

      Note

      It is critical that the firmware management mode here is set to gb200.

  5. Test that the BMC is configured by reading the current FW component versions.

    # At the specific device level

     $ cmsh -c 'device use <dgx-node-name>; firmware status'
    
     [BCM11-HEAD-01->device[s03-p1-dgx-01-c06]]% firmware status
    
     Device                Filename                   Component                  Version                    State     Progress  Result  Size  Date
     --------------------- ------------------------- -------------------------- -------------------------- --------- --------- ------- ----- -----
     s03-p1-dgx-01-c06      CX7_0                     28.42.1270                 current                    N/A       N/A      N/A    N/A   N/A
     s03-p1-dgx-01-c06      CX7_1                     28.42.1270                 current                    N/A       N/A      N/A    N/A   N/A
     s03-p1-dgx-01-c06      CX7_2                     28.42.1270                 current                    N/A       N/A      N/A    N/A   N/A
     s03-p1-dgx-01-c06      CX7_3                     28.42.1270                 current                    N/A       N/A      N/A    N/A   N/A
     s03-p1-dgx-01-c06      FW_BMC_0                  GB200Nvl-24.12-8           current                    N/A       N/A      N/A    N/A   N/A
     s03-p1-dgx-01-c06      FW_CPLD_0                 0x00 0x0b 0x03 0x04        current                    N/A       N/A      N/A    N/A   N/A
     s03-p1-dgx-01-c06      FW_CPLD_1                 0x00 0x0b 0x03 0x04        current                    N/A       N/A      N/A    N/A   N/A
     s03-p1-dgx-01-c06      FW_CPLD_2                 0x00 0x10 0x01 0x0f        current                    N/A       N/A      N/A    N/A   N/A
     s03-p1-dgx-01-c06      FW_CPLD_3                 0x00 0x10 0x01 0x0f        current                    N/A       N/A      N/A    N/A   N/A
     s03-p1-dgx-01-c06      FW_ERoT_BMC_0             01.03.0262.0000_n04        current                    N/A       N/A      N/A    N/A   N/A
     s03-p1-dgx-01-c06      Full_FW_Image_NIC_Slot_4  32.42.1000                 current                    N/A       N/A      N/A    N/A   N/A
     s03-p1-dgx-01-c06      Full_FW_Image_NIC_Slot_7  32.42.1000                 current                    N/A       N/A      N/A    N/A   N/A
     s03-p1-dgx-01-c06      UEFI                      buildbrain-gcid-38635631   current                    N/A       N/A      N/A    N/A   N/A
    
     # Alternatively, from device mode, query a specific device

     cmsh -c 'device; firmware status -n s03-p1-dgx-01-c06'

     # At the category level, to see all of the compute tray FW at once

     cmsh -c 'device; firmware status -c dgx-gb200'

     # At the rack level

     cmsh -c 'device; firmware status -r <rack location>'
    
  6. As a validation step prior to executing the flash operation, the dry-run option will show exactly what is changing when the firmware is flashed:

    • Perform a flash dry-run of the BMC firmware.

      cmsh -c 'device; firmware flash nvfw_DGX-GBX00_0023_241223.1.0_custom_prod-signed.fwpkg --dry-run -n <DEVICE_NAME>'

      # <DEVICE_NAME> can contain a bracketed range to apply the change to multiple devices simultaneously.
      # For example, s03-p1-dgx-01-c0[1-2] runs the command against both s03-p1-dgx-01-c01 and s03-p1-dgx-01-c02.

      # Device names can also be comma-separated to run against multiple individual devices,
      # for example s03-p1-dgx-01-c01,s03-p1-dgx-01-c02

      Example: dry-run output
      
      Device            Component        Target           Version              Package version      Up to date       Action           Result   Error
      ----------------- ---------------- ---------------- -------------------- -------------------- ---------------- ---------------- -------- --------------------------------
      s03-p1-dgx-01-c06 HGX_FW_BMC_0     HGX_FW_BMC_0     GB200Nvl-25.01-D     GB200Nvl-25.01-E     no               install          good
      s03-p1-dgx-01-c06 HGX_FW_CPU_0     HGX_FW_CPU_0     02.03.19             02.03.20             no               install          good
      s03-p1-dgx-01-c06 HGX_FW_CPU_1     HGX_FW_CPU_1     02.03.19             02.03.20             no               install          good
      s03-p1-dgx-01-c06 HGX_FW_ERoT_BMC_0 HGX_FW_ERoT_BMC_0 01.04.0008.0000_n04 01.04.0008.0000_n04 yes              skip             good
      s03-p1-dgx-01-c06 HGX_FW_ERoT_CPU_0 HGX_FW_ERoT_CPU_0 01.04.0008.0000_n04 01.04.0008.0000_n04 yes              skip             good
      s03-p1-dgx-01-c06 HGX_FW_ERoT_CPU_1 HGX_FW_ERoT_CPU_1 01.04.0008.0000_n04 01.04.0008.0000_n04 yes              skip             good
      s03-p1-dgx-01-c06 HGX_FW_ERoT_FPGA_0 HGX_FW_ERoT_FPGA_0 01.04.0008.0000_n04 01.04.0008.0000_n04 yes            skip             good
      s03-p1-dgx-01-c06 HGX_FW_ERoT_FPGA_1 HGX_FW_ERoT_FPGA_1 01.04.0008.0000_n04 01.04.0008.0000_n04 yes            skip             good
      s03-p1-dgx-01-c06 HGX_FW_FPGA_0    HGX_FW_FPGA_0    1.20                1.20                  yes              skip             good
      s03-p1-dgx-01-c06 HGX_FW_FPGA_1    HGX_FW_FPGA_1    1.20                1.20                  yes              skip             good
      s03-p1-dgx-01-c06 HGX_FW_GPU_0     HGX_FW_GPU_0     97.00.82.00.13      97.00.82.00.19        no               install          good
      s03-p1-dgx-01-c06 HGX_FW_GPU_1     HGX_FW_GPU_1     97.00.82.00.13      97.00.82.00.19        no               install          good
      s03-p1-dgx-01-c06 HGX_FW_GPU_2     HGX_FW_GPU_2     97.00.82.00.13      97.00.82.00.19        no               install          good
      s03-p1-dgx-01-c06 HGX_FW_GPU_3     HGX_FW_GPU_3     97.00.82.00.13      97.00.82.00.19        no               install          good
      
    • Ensure that the components that are not up to date are going to be updated to the expected package versions.

  7. Start the firmware update.

    $ cmsh -c 'device; firmware flash nvfw_DGX-GBX00_0023_241223.1.0_custom_prod-signed.fwpkg -n <DEVICE_NAME>'
    
  8. Once the payload has been uploaded to the node, the Result column shows good.

    [BCM11-HEAD-01->device]% firmware flash nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg -n s03-p1-dgx-01-c[04-06]
    
    Device              Firmware Package                                      Result
    ------------------- ---------------------------------------------------- -------
    s03-p1-dgx-01-c04   nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg good
    s03-p1-dgx-01-c05   nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg good
    s03-p1-dgx-01-c06   nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg good
    
  9. When the command completes, periodically check the status of the update until it has completed.

    The status shows a percentage while flashing is in progress and a “complete” message when the flash has finished.

    Example output from firmware status command:
    $ cmsh -c 'device; firmware status -n <DEVICE_NAME>'
    
    s03-p1-dgx-01-c06
    nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_BMC_0
    GB200Nvl-25.01-D flashing 0.0% 114MiB
    
    s03-p1-dgx-01-c06
    nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_0
    02.03.19 flashing 0.0% 114MiB
    
    s03-p1-dgx-01-c06
    nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_1
    02.03.19 flashing 0.0% 114MiB
    
    s03-p1-dgx-01-c06
    nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_0
    97.00.82.00.13 flashing 0.0% 114MiB
    
    s03-p1-dgx-01-c06
    nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_1
    97.00.82.00.13 flashing 0.0% 114MiB
    
    s03-p1-dgx-01-c06
    nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_2
    97.00.82.00.13 flashing 0.0% 114MiB
    
    s03-p1-dgx-01-c06
    nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_3
    97.00.82.00.13 flashing 0.0%
    
    s03-p1-dgx-01-c06
    nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_BMC_0
    GB200Nvl-25.01-D -> GB200Nvl-25.01-E pending N/A success:
    medium-specific reset or dc power cycle or ac power cy+ 114MiB
    
    s03-p1-dgx-01-c06
    nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_0
    02.03.19 -> 02.03.20 pending N/A success: medium-specific reset or dc
    power cycle or ac power cy+ 114MiB
    
    s03-p1-dgx-01-c06
    nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_1
    02.03.19 -> 02.03.20 pending N/A success: medium-specific reset or dc
    power cycle or ac power cy+ 114MiB
    
    s03-p1-dgx-01-c06
    nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_0
    97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific
    reset or dc power cycle or ac power cy+ 114MiB
    
    s03-p1-dgx-01-c06
    nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_1
    97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific
    reset or dc power cycle or ac power cy+ 114MiB
    
    s03-p1-dgx-01-c06
    nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_2
    97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific
    reset or dc power cycle or ac power cy+ 114MiB
    
    s03-p1-dgx-01-c06
    nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_3
    97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific
    reset or dc power cycle or ac power cy+ 114MiB
    

    At the end of the BMC update, the administrator can AC power cycle the compute tray to complete the BMC update, then proceed with updating other components.

  10. Activating firmware using the AC power cycle (AUX_PWR_CYCLE)

    Note

    The GB200/GB300 compute tray has two levels of power:

    • Primary (system) power: This is the power supplied to the compute tray CPUs and GPUs. This must be powered off before the aux_cycle process.

    • Standby (AUX) power: This is the power supplied to the BMC and low-level components. Cycling standby power is an automated process that temporarily removes power from the compute tray, reinitializing all hardware components. The BMC will be unavailable for several minutes during the aux_cycle process. Once completed, the primary power can be toggled on again.

    Once both components have completed the firmware update, perform the AC power cycle using either of the two methods below.
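
The periodic status check in step 9 can be reduced to a single word with a small filter. This is a rough sketch rather than a BCM feature: the function name fw_flash_state is hypothetical, and it assumes the state keywords flashing and pending appear in the firmware status output exactly as in the example above.

```shell
# Hypothetical helper: summarize `firmware status` text read from stdin.
# Prints "flashing" if any component is still being flashed,
# "pending" if none are flashing but at least one awaits a power cycle,
# and "idle" otherwise.
fw_flash_state() {
    status_text=$(cat)
    if printf '%s\n' "$status_text" | grep -q ' flashing '; then
        echo "flashing"
    elif printf '%s\n' "$status_text" | grep -q ' pending '; then
        echo "pending"
    else
        echo "idle"
    fi
}
```

It could be used as cmsh -c 'device; firmware status -n <DEVICE_NAME>' | fw_flash_state, proceeding to the power cycle only once the result is pending.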

Power Cycle Methods

Two primary methods are available to perform an AC (auxiliary) power cycle of the GB200/GB300 compute tray after firmware updates:

Power Cycle Method 1: AUX_PWR_CYCLE using Redfish

To perform the power cycle using Redfish API calls directly to the BMC:

  1. From the head node, power down the system:

    curl -k -u ${USER}:${PASS} -H "Content-Type: application/json" -X POST -d '{"ResetType": "ForceOff"}' https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset
    
  2. Perform the AC power cycle (removal of auxiliary power):

    curl -k -u ${USER}:${PASS} -H "Content-Type: application/json" -X POST -d '{"ResetType":"AuxPowerCycle"}' https://${BMCIP}/redfish/v1/Chassis/BMC_0/Actions/Oem/NvidiaChassis.AuxPowerReset
    
  3. After the cycle, power on the system using Redfish:

    curl -k -u ${USER}:${PASS} -H "Content-Type: application/json" -X POST -d '{"ResetType": "On"}' https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset
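
The first two Redfish calls above can be wrapped in a pair of shell functions so that the sequence can be rehearsed before it is sent to a BMC. This is a sketch under stated assumptions: the function names redfish_post and aux_power_cycle, the DRY_RUN switch, and the OFF_WAIT pause (defaulting to 15 seconds) are all hypothetical; only the endpoints and request bodies come from the steps above.

```shell
# Hypothetical helpers, not part of BCM or the BMC firmware.
# BMCIP, USER, and PASS are read from the environment, as in the curl
# commands above. With DRY_RUN=1 the request is printed instead of sent.
redfish_post() {
    path=$1
    body=$2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "POST https://${BMCIP}${path} ${body}"
    else
        curl -k --fail -u "${USER}:${PASS}" \
            -H "Content-Type: application/json" \
            -X POST -d "${body}" "https://${BMCIP}${path}"
    fi
}

# Force the system off, pause, then request the auxiliary power cycle.
# The pause length is an assumption; adjust OFF_WAIT as needed.
aux_power_cycle() {
    redfish_post /redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset \
        '{"ResetType": "ForceOff"}' || return 1
    [ "${DRY_RUN:-0}" = "1" ] || sleep "${OFF_WAIT:-15}"
    redfish_post /redfish/v1/Chassis/BMC_0/Actions/Oem/NvidiaChassis.AuxPowerReset \
        '{"ResetType":"AuxPowerCycle"}'
}
```

Running aux_power_cycle with DRY_RUN=1 prints the two requests in order without contacting the BMC. The power-on request is left out on purpose, because the BMC is unavailable for several minutes after the auxiliary power cycle.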
    

Examples: Powering on nodes using cmsh

While not part of the AC power cycle itself, the following commands can be used to power on nodes after the update process as needed:

  • Power on a single compute node:

    cmsh -c 'device; use <compute node under test>; power on'

  • Power on multiple nodes in a category:

    cmsh -c 'device; foreach -c dgx-gb200 (power on)'

  • Power on all nodes in a category:

    cmsh -c 'device; power on -c dgx-gb200'

  • Power on specific nodes by name:

    cmsh -c 'device; power on -n <specific nodes>'
    

Power Cycle Method 2: BCM “power auxcycle” Command (available in 11.25.08 and later)

An AC power cycle can also be performed via the BCM command line within the device context.

  1. Ensure the node is powered off:

    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
    rf0 ...................... [   ON    ] dgx-gb200-m06-c1
    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power off
    rf0 ...................... [   OFF   ] dgx-gb200-m06-c1
    

    Note

    If the node is still ON when executing the power auxcycle command, an error message will be returned:

    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power auxcycle
    rf0 ...................... [  FAILED ] dgx-gb200-m06-c1 (System power is not OFF)
    
  2. After confirming the node is OFF, perform the auxiliary power cycle:

    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power auxcycle
    rf0 ...................... [AUX CYCLE]
    
  3. During auxcycle, the BMC will be unavailable for several minutes. “power status” will indicate failure until the process is complete:

    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
    rf0 ...................... [  FAILED ] dgx-gb200-m06-c1 (Unable to establish session)
    
  4. When auxcycle completes, the node status will return to OFF:

    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
    rf0 ...................... [   OFF   ] dgx-gb200-m06-c1
    
  5. Power on the node:

    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power on
    rf0 ...................... [   ON    ] dgx-gb200-m06-c1
    
  6. If issues arise, debug output can help identify the root cause. Run the flash command with the debug options enabled:

    $ cmsh -c 'device; firmware flash nvfw_DGX-GBX00_0023_241223.1.0_custom_prod-signed.fwpkg -n <device name> -v --debug'
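
The wait in steps 3 and 4 above, where power status fails until the BMC is reachable again and then reports OFF, can be scripted with a generic retry loop. The sketch below is an illustration only: wait_until and node_is_off are hypothetical names, and the OFF match assumes the power status output format shown in the steps above.

```shell
# Hypothetical helper: retry a command until it succeeds or the attempt
# limit is reached.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
wait_until() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Succeeds when `power status` for the given node reports [ OFF ].
# A FAILED line (for example while the BMC is unreachable) does not match.
node_is_off() {
    cmsh -c "device; use $1; power status" 2>/dev/null | grep -q '\[ *OFF *]'
}
```

For example, wait_until 30 20 node_is_off dgx-gb200-m06-c1 checks every 20 seconds for up to 30 attempts; the attempt count and delay are arbitrary values chosen for illustration.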
    

Method 2: Standalone nvfwupd tool for compute tray

If the license does not support NVIDIA Mission Control, the built-in cm-nvfwupd command will not work. To use the standalone nvfwupd tool, follow the steps below:

  • Download the standalone nvfwupd tool from the NVIDIA Enterprise Support Portal. This tool can be used independently of BCM.

  • Alternatively, install the nvfwupd package from the CUDA apt repository.

  1. Get the correct firmware update packages for the update. To see the full contents of a .fwpkg file, use the show_pkg_content command.

    $ ./nvfwupd show_pkg_content -p ./nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg
    
  2. Get current state of the hardware with show_version.

     root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 show_version \
         -p ./nvfw_GB200-P4972_0012_250214.1.0_custom_prod-signed.fwpkg \
         ./nvfw_GB200-P4975_0011_250206.1.1_custom_recovery_prod-signed.fwpkg
    
     System Model: GB200 NVL
    
     Part number: 699-24764-0001-RC1
    
     Serial number: 1334524170073
    
     Packages: ['GB200-P4972_0012_250214.1.0_custom',
    'GB200-P4975_0011_250206.1.1_custom_recovery']
     Connection Status: Successful
    
     Firmware Devices:
    
     AP Name                  Sys Version              Pkg Version                Up-To-Date
     -----------------------  -----------------------  -------------------------  ----------
     CX7_0                    28.43.2108               N/A                        No
     CX7_1                    28.43.2108               N/A                        No
     CX7_2                    28.43.2108               N/A                        No
     CX7_3                    28.43.2108               N/A                        No
     FW_BMC_0                 GB200Nvl-25.01-D         GB200Nvl-25.01-E           No
     FW_CPLD_0                0x00 0x0b 0x03 0x04      N/A                        No
     FW_CPLD_1                0x00 0x0b 0x03 0x04      N/A                        No
     FW_CPLD_2                0x00 0x10 0x01 0x0f      N/A                        No
     FW_CPLD_3                0x00 0x10 0x01 0x0f      N/A                        No
     FW_ERoT_BMC_0            01.04.0008.0000_n04      01.04.0008.0000_n04        Yes
     Full_FW_Image_NIC_Slot_4 32.43.2408               N/A                        No
     Full_FW_Image_NIC_Slot_7 32.43.2408               N/A                        No
     UEFI                     buildbrain-gcid-39281046 N/A                        No
     HGX_FW_BMC_0             GB200Nvl-25.01-D         N/A                        No
     HGX_FW_CPLD_0            0.1C                     N/A                        No
     HGX_FW_CPU_0             02.03.19                 N/A                        No
     HGX_FW_CPU_1             02.03.19                 N/A                        No
     HGX_FW_ERoT_BMC_0        01.04.0008.0000_n04      01.03.0196.0001            Yes
     HGX_FW_ERoT_CPU_0        01.04.0008.0000_n04      01.03.0196.0001            Yes
     HGX_FW_ERoT_CPU_1        01.04.0008.0000_n04      01.03.0196.0001            Yes
     HGX_FW_ERoT_FPGA_0       01.04.0008.0000_n04      01.03.0196.0001            Yes
     HGX_FW_ERoT_FPGA_1       01.04.0008.0000_n04      01.03.0196.0001            Yes
     HGX_FW_FPGA_0            1.20                     N/A                        No
     HGX_FW_FPGA_1            1.20                     N/A                        No
     HGX_FW_GPU_0             97.00.82.00.13           1.0.61.0                   No
     HGX_FW_GPU_1             97.00.82.00.13           1.0.61.0                   No
     HGX_FW_GPU_2             97.00.82.00.13           1.0.61.0                   No
     HGX_FW_GPU_3             97.00.82.00.13           1.0.61.0                   No
     HGX_InfoROM_GPU_0        G548.0201.00.06          N/A                        No
     HGX_InfoROM_GPU_1        G548.0201.00.06          N/A                        No
     HGX_InfoROM_GPU_2        G548.0201.00.06          N/A                        No
     HGX_InfoROM_GPU_3        G548.0201.00.06          N/A                        No
     HGX_PCIeSwitchConfig_0   01151024                 N/A                        No
    
     -----------------------------------------------------------------------------------------------
    
     Error Code: 0
    
  3. Create the payload JSON files for the BMC and the compute tray:

    # Reference: UpdateBMC.json for updating the BMC:

    {
      "Targets": []
    }

    # Reference: UpdateCompute.json for updating HGX:

    {
      "Targets": ["/redfish/v1/Chassis/HGX_Chassis_0"]
    }
    
  4. Run the BMC update first.

    ./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 \
        update_fw -s UpdateBMC.json \
        -p ./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg
    
  5. Power off the system, then do an AUX Power cycle.

    ./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 \
        activate_fw -c PWR_OFF

    # Wait 15 seconds

    ./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 \
        activate_fw -c RF_AUX_PWR_CYCLE
    
  6. Check if the BMC update was successful.

    Reference: Successful BMC update:

    root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 \
        update_fw -s UpdateBMC.json \
        -p ./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg
    
    Updating ip address: ip=XXXX
    
    FW package:
    ['./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg']
    
    Ok to proceed with firmware update? <Y/N>
    
    y
    
    {"@odata.id": "/redfish/v1/TaskService/Tasks/3", "@odata.type":
    "#Task.v1_4_3.Task", "Id": "3", "TaskState": "Running", "TaskStatus":
    "OK"}
    
    FW update started, Task Id: 3
    
    Wait for Firmware Update to Start...
    
    TaskState: Running
    
    PercentComplete: 20
    
    TaskStatus: OK
    
    TaskState: Running
    
    PercentComplete: 40
    
    TaskStatus: OK
    
    TaskState: Running
    
    PercentComplete: 60
    
    TaskStatus: OK
    
    TaskState: Completed
    
    PercentComplete: 100
    
    TaskStatus: OK
    
    Firmware update successful!
    
    Overall Time Taken: 0:13:01
    
    Refer to 'NVIDIA Firmware Update Document' on activation steps for new firmware to take effect.
    
    ----------------------------------------------------------------------
    Error Code: 0
    
  7. Perform the full compute tray flash. Ensure that the system is fully booted into its OS so that the GPU VBIOS updates can be applied.

    ./nvfwupd -t ip=<rf0 ip> user=admin password=<bmc password> servertype=GB200 \
        update_fw -s UpdateCompute.json \
        -p ./nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg
    
  8. As with the BMC update in step 5, power down the system and then perform an AUX power cycle.

  9. Power on the machine, let it provision/boot up, then check the firmware level again.

    Example output from the show_version command:
    root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=10.78.194.13 user=admin password=<bmc password> servertype=GB200 show_version \
        -p ./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg \
        ./nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg
    
    System Model: GB200 NVL
    
    Part number: 692-13809-2404-RC1
    
    Serial number: 1330125050101
    
    Packages: ['DGX-GBX00_0024_250215.1.0_custom', 'HGX-GBX00_0023_250223.1.1_custom']
    
    Connection Status: Successful
    
    Firmware Devices:
    
    AP Name                   Sys Version              Pkg Version                Up-To-Date
    ------------------------- ------------------------ -------------------------- ----------
     CX7_0                     28.43.2108               N/A                        No
     CX7_1                     28.43.2108               N/A                        No
     CX7_2                     28.43.2108               N/A                        No
     CX7_3                     28.43.2108               N/A                        No
     FW_BMC_0                  GB200Nvl-25.01-E         GB200Nvl-25.01-E           Yes
     FW_CPLD_0                 0x00 0x0b 0x03 0x04      N/A                        No
     FW_CPLD_1                 0x00 0x0b 0x03 0x04      N/A                        No
     FW_CPLD_2                 0x00 0x10 0x01 0x0f      N/A                        No
     FW_CPLD_3                 0x00 0x10 0x01 0x0f      N/A                        No
     FW_ERoT_BMC_0             01.04.0008.0000_n04      01.04.0008.0000_n04        Yes
     Full_FW_Image_NIC_Slot_4  32.43.2408               N/A                        No
     Full_FW_Image_NIC_Slot_7  32.43.2408               N/A                        No
     UEFI                      buildbrain-gcid-39556194 N/A                        No
     HGX_FW_BMC_0              GB200Nvl-25.01-E         GB200Nvl-25.01-E           Yes
     HGX_FW_CPLD_0             0.1C                     0.1C                       Yes
     HGX_FW_CPU_0              02.03.20                 02.03.20                   Yes
     HGX_FW_CPU_1              02.03.20                 02.03.20                   Yes
     HGX_FW_ERoT_BMC_0         01.04.0008.0000_n04      01.04.0008.0000_n04        Yes
     HGX_FW_ERoT_CPU_0         01.04.0008.0000_n04      01.04.0008.0000_n04        Yes
     HGX_FW_ERoT_CPU_1         01.04.0008.0000_n04      01.04.0008.0000_n04        Yes
     HGX_FW_ERoT_FPGA_0        01.04.0008.0000_n04      01.04.0008.0000_n04        Yes
     HGX_FW_ERoT_FPGA_1        01.04.0008.0000_n04      01.04.0008.0000_n04        Yes
     HGX_FW_FPGA_0             1.20                     1.20                       Yes
     HGX_FW_FPGA_1             1.20                     1.20                       Yes
     HGX_FW_GPU_0              97.00.82.00.19           97.00.82.00.19             Yes
     HGX_FW_GPU_1              97.00.82.00.19           97.00.82.00.19             Yes
     HGX_FW_GPU_2              97.00.82.00.19           97.00.82.00.19             Yes
     HGX_FW_GPU_3              97.00.82.00.19           97.00.82.00.19             Yes
     HGX_InfoROM_GPU_0         G548.0201.00.06          N/A                        No
     HGX_InfoROM_GPU_1         G548.0201.00.06          N/A                        No
     HGX_InfoROM_GPU_2         G548.0201.00.06          N/A                        No
     HGX_InfoROM_GPU_3         G548.0201.00.06          N/A                        No
     HGX_PCIeSwitchConfig_0    01151024                 N/A                        No
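
On a rack with many trays it can help to reduce this table to the components that need attention: rows where the package supplies a version (Pkg Version is not N/A) but Up-To-Date is No. The following is a minimal sketch, assuming the column layout shown above and that the show_version output is piped in on standard input. The helper name list_outdated is hypothetical and is not part of nvfwupd.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nvfwupd): reads show_version output on
# stdin and prints the components that the package covers but that are not
# yet up to date.
list_outdated() {
    awk '
        # The dashed rule under the column headers marks the start of the rows.
        /^[ \t]*-+[ \t]+-+/ { in_table = 1; next }

        # Sys Version may contain spaces (CPLD rows), so the last two fields
        # are used: $(NF-1) is Pkg Version and $NF is Up-To-Date.
        in_table && NF >= 4 && $NF == "No" && $(NF-1) != "N/A" { print $1 }
    '
}
```

For example, the show_version command above can be piped into list_outdated. With the sample output shown here it prints nothing, because every component covered by the two packages already reports Yes.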
    

Applying and verifying firmware update success#

After all required firmware is installed, the compute node needs an AC cycle to fully apply the updates. This procedure can be used to bring the nodes down and back up. First connect to the GB200 tray BMC OS, then:

  1. Power off the host.

    # Check that the current power state is On
    curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0 \
        | jq '."PowerState"'
    
    # Shut down the OS (graceful shutdown)
    curl -k -u ${USER}:${PASS} \
        https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset \
        -d '{"ResetType": "GracefulShutdown"}' -X POST
    
    # Force power off (alternative to the graceful shutdown)
    curl -k -u ${USER}:${PASS} \
        https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset \
        -d '{"ResetType": "ForceOff"}' -X POST
    
  2. AC cycle the node.

    curl -k -u ${USER}:${PASS} \
        https://${BMCIP}/redfish/v1/Chassis/BMC_0/Actions/Oem/NvidiaChassis.AuxPowerReset \
        -d '{"ResetType":"AuxPowerCycleForce"}' -X POST
    
  3. Wait for the BMC to respond to ping again (this should take 2-3 minutes). Once the BMC responds, bring the host back up.

    # Check the current power state (if it is already On, no further action is required)
    curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0 | jq '."PowerState"'
    
    # Power on
    curl -k -u ${USER}:${PASS} \
        https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset \
        -d '{"ResetType": "On"}' -X POST
    
  4. When the BMC and host are back up, validate that the firmware install was successful.

    $ cmsh -c 'device; firmware status -n <DEVICE_NAME>'
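
The power-off, AUX power cycle, and power-on steps above can be wrapped in a few shell functions when the same sequence has to be repeated on several trays. This is a sketch under stated assumptions, not a supported tool: USER, PASS, and BMCIP are expected to be set as in the curl examples, the function names are hypothetical, and the retry counts and delays are illustrative guesses. sed is used in place of jq only to avoid an extra dependency in the sketch.

```shell
#!/usr/bin/env bash
# Sketch of the power-off, AUX power cycle, power-on sequence described above.
# USER, PASS and BMCIP must be set, as in the curl examples.
# Function names, retry counts and delays are assumptions for illustration.

redfish() {                 # redfish <path> [extra curl arguments...]
    local path="$1"
    shift
    curl -sS -k -u "${USER}:${PASS}" "https://${BMCIP}${path}" "$@"
}

power_state() {             # prints the PowerState value, for example On or Off
    redfish /redfish/v1/Systems/System_0 2>/dev/null |
        sed -n 's/.*"PowerState"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
        head -n 1
}

host_reset() {              # host_reset GracefulShutdown|ForceOff|On
    redfish /redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset \
        -X POST -d "{\"ResetType\": \"$1\"}"
}

aux_power_cycle() {
    redfish /redfish/v1/Chassis/BMC_0/Actions/Oem/NvidiaChassis.AuxPowerReset \
        -X POST -d '{"ResetType":"AuxPowerCycleForce"}'
}

wait_for_state() {          # wait_for_state <state> [tries] [delay_seconds]
    local want="$1" tries="${2:-30}" delay="${3:-10}" i=0
    while [ "$i" -lt "$tries" ]; do
        [ "$(power_state)" = "$want" ] && return 0
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

ac_cycle() {
    host_reset GracefulShutdown
    # Fall back to a forced power-off if the host does not reach Off in time.
    wait_for_state Off || { host_reset ForceOff; wait_for_state Off || return 1; }
    aux_power_cycle
    sleep 60                # assumed delay so the BMC has time to go down
    until ping -c 1 -W 2 "$BMCIP" >/dev/null 2>&1; do sleep 10; done
    # The Redfish service may come up later than ping, so wait for a valid answer.
    until [ -n "$(power_state)" ]; do sleep 10; done
    [ "$(power_state)" = "On" ] || host_reset On
}
```

After ac_cycle returns and the host has booted, the cmsh firmware status check from step 4 still applies.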
    

Power Shelf Firmware Update Process#

There are several vendors for power shelves on DGX GB200/GB300 NVL72 systems. The following instructions are for power shelves made by Delta.

  1. Flash the PMC with the latest version.

    The response will contain the task number.

    $ curl -k -u admin:password -H "Content-Type: application/octet-stream" -X POST -T <FIRMWARE_FILE> https://<BMC_IP>/redfish/v1/UpdateService/update
    
  2. Verify that the flash is completed.

    $ curl -k -u admin:password -X GET https://<BMC_IP>/redfish/v1/TaskService/Tasks/<TASK_NUMBER>
    
  3. Check the PMC version.

    $ curl -k -u admin:password -X GET https://<BMC_IP>/redfish/v1/Managers/PMC_0
    
  4. Complete a PSU update by flashing the PSU with the latest firmware.

    • Repeat Steps 1 and 2, but use the PSU firmware image as the <FIRMWARE_FILE>.

    • Run the following command to check the PSU version and health, using the FirmwareVersion, Status, and Health parameters in the output.

    $ curl -k -u admin:password -X GET https://<BMC_IP>/redfish/v1/Chassis/chassis/PowerSubsystem/PowerSupplies/<PS_NUMBER>
    
    Note

    A PSU firmware update temporarily powers off the PSU, so it is recommended that the rack be idle during the PSU update process.
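
Step 2 usually has to be repeated until the flash finishes, so the task query can be wrapped in a polling loop. The sketch below assumes that the PMC reports the standard Redfish TaskState property with values such as Running, Completed, and Exception. This has not been confirmed for the Delta power shelf, so check the actual response first. BMC_IP, CREDS, and the function names are placeholders chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: poll a Redfish task until it reaches a terminal state.
# Assumes the PMC reports the standard Redfish "TaskState" property.
# BMC_IP, CREDS (user:password) and the function names are placeholders.

task_state() {              # task_state <task_number>
    curl -sS -k -u "$CREDS" -X GET \
        "https://${BMC_IP}/redfish/v1/TaskService/Tasks/$1" |
        sed -n 's/.*"TaskState"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
        head -n 1
}

wait_for_task() {           # wait_for_task <task_number> [tries] [delay_seconds]
    local task="$1" tries="${2:-60}" delay="${3:-10}" i=0 state=""
    while [ "$i" -lt "$tries" ]; do
        state="$(task_state "$task")"
        case "$state" in
            Completed)
                return 0 ;;
            Exception|Killed|Cancelled)
                echo "task $task ended in state $state" >&2
                return 1 ;;
        esac
        sleep "$delay"
        i=$((i + 1))
    done
    echo "task $task did not finish in time (last state: ${state:-unknown})" >&2
    return 2
}
```

A call such as wait_for_task <TASK_NUMBER> returns 0 once the flash has completed, 1 if the task failed, and 2 if it is still running after the allotted number of tries.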

BlueField and CX7 FW Update Process#

Prior to installation, copy the firmware binary to the compute tray host or place it in a shared directory. The binary file name follows this format:

fw-ConnectX7-rel-28_42_1270-900-24768-0002_Ax-UEFI-14.35.15-FlexBoot-3.7.500.signed.bin

The general steps to install NVIDIA networking firmware are as follows:

  1. Start the MST service.

    $ mst start
    
  2. Query the devices to find the /dev/mst paths of the devices.

    $ mst status -v
    
  3. Read the current version of firmware on a given device.

    $ flint -d /dev/mst/mt4129_pciconf0 q full
    
  4. Flash the firmware on the device.

    • Change to the directory where the firmware binary is stored, then burn the image.

      $ flint -d /dev/mst/mt4129_pciconf0 -i fw-ConnectX7-rel-28_42_1270-900-24768-0002_Ax-UEFI-14.35.15-FlexBoot-3.7.500.signed.bin burn
      
    • Repeat this for all four CX7 devices.

  5. Reset the CX7 and reboot the host.

    $ mlxfwreset -d mlx5_0 reset
    
    The device names can be read from the mst status -v output (step 2), for example:

    PCI devices:
    ------------
    DEVICE_TYPE      MST                      PCI          RDMA   NET           NUMA
    ConnectX7(rev:0) /dev/mst/mt4129_pciconf0 0000:03:00.0 mlx5_0 net-ibp3s0    0
    ConnectX7(rev:0) /dev/mst/mt4129_pciconf1 0002:03:00.0 mlx5_1 net-ibP2p3s0  0
    ConnectX7(rev:0) /dev/mst/mt4129_pciconf2 0010:03:00.0 mlx5_4 net-ibP16p3s0 1
    ConnectX7(rev:0) /dev/mst/mt4129_pciconf3 0012:03:00.0 mlx5_5 net-ibP18p3s0 1
    
  6. For BlueField 3, the process is the same except that the device path starts with /dev/mst/mt41692 (for example, /dev/mst/mt41692_pciconf0).
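
Because step 4 is repeated for every CX7, the device paths can be taken from the mst status -v output instead of being typed by hand. The following sketch assumes the output format shown in the example above (device type in the first column, MST path in the second). The function names cx7_devices and flash_all_cx7 are hypothetical, and the yes | flint ... burn form mirrors the reference script later in this section.

```shell
#!/usr/bin/env bash
# Sketch: burn one firmware image to every ConnectX-7 listed by `mst status -v`.
# Function names are hypothetical; the column layout is assumed to match the
# example output in this section.

cx7_devices() {             # reads `mst status -v` output on stdin, prints MST paths
    awk '$1 ~ /^ConnectX7/ { print $2 }'
}

flash_all_cx7() {           # flash_all_cx7 <firmware.bin>
    local image="$1" devs dev
    devs="$(mst status -v | cx7_devices)"
    [ -n "$devs" ] || { echo "no ConnectX-7 devices found" >&2; return 1; }
    for dev in $devs; do
        echo "Burning $image to $dev"
        # `yes` answers the confirmation prompt, as in the reference script.
        yes | flint -d "$dev" -i "$image" burn || return 1
    done
}
```

Run mst start first, then call flash_all_cx7 with the path to the CX7 image. The reset and host reboot from step 5 are still required afterwards.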

Combined CX-7 and BlueField Update#

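To update the NICs on all nodes in the dgx-gb200 category, run the reference script through pdsh and write one log file per host:

$ pdsh -g category=dgx-gb200 '/home/nvis/<FW_UPDATE_DIR>/nicupdate.sh > /home/nvis/<FW_UPDATE_DIR>/$(hostname)_fw_upgrade_$(date +%Y%m%d-%H%M%S).log'

<FW_UPDATE_DIR> is the directory where the firmware update files are stored.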

Reference script: NIC updates (Both BF3 and CX-7) - nicupdate.sh:

# Start the MST service
mst start

# Query the current CX-7 firmware
flint -d /dev/mst/mt4129_pciconf0 q full
flint -d /dev/mst/mt4129_pciconf1 q full
flint -d /dev/mst/mt4129_pciconf2 q full
flint -d /dev/mst/mt4129_pciconf3 q full

# Query the current BlueField 3 firmware
flint -d /dev/mst/mt41692_pciconf0 q full
flint -d /dev/mst/mt41692_pciconf1 q full

# Firmware image locations
basedir=/home/<USERNAME>/fw_0.9_releases/mellanox
bf3file=fw-BlueField-3-rel-32_43_2408-900-9D3B6-00CN-P_Ax-NVME-20.4.1-UEFI-21.4.13-UEFI-22.4.14-UEFI-14.36.21-FlexBoot-3.7.500.signed.bin
cx7file=fw-ConnectX7-rel-28_43_2110-900-24768-0002_Ax-UEFI-14.36.21-FlexBoot-3.7.500.signed.bin

# CX-7 update
yes | flint -d /dev/mst/mt4129_pciconf0 -i $basedir/$cx7file b
yes | flint -d /dev/mst/mt4129_pciconf1 -i $basedir/$cx7file b
yes | flint -d /dev/mst/mt4129_pciconf2 -i $basedir/$cx7file b
yes | flint -d /dev/mst/mt4129_pciconf3 -i $basedir/$cx7file b

# BlueField 3 update
yes | flint -d /dev/mst/mt41692_pciconf0 -i $basedir/$bf3file b
yes | flint -d /dev/mst/mt41692_pciconf1 -i $basedir/$bf3file b
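
The image file names carry the firmware version after -rel- (for example 28_43_2110), which makes it possible to skip devices that already run the target version. The sketch below is an illustration only. It assumes the file naming pattern used in this section and that flint q prints a line beginning with "FW Version:". Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare the version encoded in an image file name with the version
# reported by the device. Function names are hypothetical; the "FW Version:"
# line in the flint query output is an assumption to verify on the system.

fw_version_from_file() {    # .../fw-ConnectX7-rel-28_43_2110-... -> 28.43.2110
    basename "$1" |
        sed -n 's/^fw-[A-Za-z0-9-]*-rel-\([0-9]*\)_\([0-9]*\)_\([0-9]*\)-.*/\1.\2.\3/p'
}

needs_update() {            # needs_update <mst_device> <firmware.bin>
    local have want
    have="$(flint -d "$1" q | awk '/^FW Version:/ { print $3 }')"
    want="$(fw_version_from_file "$2")"
    # Succeeds only when a target version was parsed and it differs.
    [ -n "$want" ] && [ "$have" != "$want" ]
}
```

In the reference script, each burn line could then be guarded, for example: needs_update /dev/mst/mt4129_pciconf0 $basedir/$cx7file && yes | flint -d /dev/mst/mt4129_pciconf0 -i $basedir/$cx7file b.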

DGX OS Update#

Compatible drivers and software packages need to be installed to align with the new firmware.

  1. Clone the OS image in BCM.

  2. Boot one node with the new image.

  3. Install the MFT, DOCA, and NVIDIA driver packages.

 # Make sure the DOCA packages are served from the external repository

 $ cat /etc/apt/sources.list.d/doca.sources

 Types: deb
 URIs: https://linux.mellanox.com/public/repo/doca/DGX_GBxx_latest_DOCA/ubuntu24.04/arm64-sbsa/
 Suites: /
 Signed-By: /usr/share/keyrings/GPG-KEY-Mellanox.gpg

 # Install doca package
 $ sudo apt-get update
 $ sudo apt install doca-all

 # Install driver package
 $ sudo dpkg -i nvidia-driver-local-repo-ubuntu2404-570.158.01_1.0-1_arm64.deb
 $ sudo cp /var/nvidia-driver-local-repo-ubuntu2404-570.158.01/nvidia-driver-local-5778B6CA-keyring.gpg /usr/share/keyrings/
 $ sudo mv /etc/apt/sources.list.d/cuda-compute-repo.sources /etc/apt/sources.list.d/cuda-compute-repo.sources.disabled
 $ sudo apt update
 $ sudo apt install nvidia-driver-570-open
 $ sudo apt-get install nvidia-imex-570
 $ sudo apt-get install nvidia-fabricmanager-570
 $ sudo apt-get install libnvidia-nscq-570

 # Check doca packages
 $ sudo dpkg -l | grep 2.10.0-093520

 # Check driver package
 $ sudo dpkg -l | grep 570.158

  4. Save the changes into the image.

  5. Reboot the compute node into the new image with AUTO install.

  6. Set all nodes to boot from the new image and reboot.

Operational Security Requirements#

Factory Reset after Debug Token Usage

After using one or more Debug Tokens on the compute tray, the operator must remove the Debug Tokens and factory reset the non-volatile storage of the BMC, HMC, and CPU. The following Redfish APIs provide the factory reset functionality:

Resetting the HMC R/W filesystem:

curl -k -u $USER:$PASS -X POST https://${TARGET_HOSTNAME}/redfish/v1/Managers/HGX_BMC_0/Actions/Manager.ResetToDefaults -d '{"ResetToDefaultsType": "ResetAll"}'

Erasing the HMC eMMC:

curl -k -u $USER:$PASS -X POST https://${TARGET_HOSTNAME}/redfish/v1/Managers/HGX_BMC_0/Actions/Oem/eMMC.SecureErase

Erasing the Grace “R/W” SPI flashes (Perform on both Grace modules):

curl -k -u $USER:$PASS -X POST https://${TARGET_HOSTNAME}/redfish/v1/Chassis/HGX_ProcessorModule_0/Actions/Oem/NvidiaProcessor.VariableSpiErase
curl -k -u $USER:$PASS -X POST https://${TARGET_HOSTNAME}/redfish/v1/Chassis/HGX_ProcessorModule_1/Actions/Oem/NvidiaProcessor.VariableSpiErase

Resetting the BMC R/W filesystem:

curl -k -u $USER:$PASS -X POST https://${TARGET_HOSTNAME}/redfish/v1/Managers/BMC_0/Actions/Manager.ResetToDefaults -d '{"ResetToDefaultsType": "ResetAll"}'

Erasing the BMC eMMC:

curl -k -u $USER:$PASS -X POST https://${TARGET_HOSTNAME}/redfish/v1/Managers/BMC_0/Actions/Oem/eMMC.SecureErase
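
The six calls above can be grouped so that none is forgotten and the sequence stops at the first failure. This is a sketch under assumptions: the order simply follows the listing above, and whether a reset restarts the BMC or HMC and therefore needs a wait before the next call is not stated here and must be checked on the target system. The function names are hypothetical, and curl -f is added so that an HTTP error makes the call fail.

```shell
#!/usr/bin/env bash
# Sketch: issue the factory-reset calls listed above, stopping at the first error.
# USER, PASS and TARGET_HOSTNAME must be set, as in the curl examples.
# Function names are hypothetical. No waits are inserted between calls; add
# them if a reset restarts the BMC or HMC on your system.

redfish_post() {            # redfish_post <path below /redfish/v1> [json_body]
    local url="https://${TARGET_HOSTNAME}/redfish/v1$1"
    if [ -n "${2:-}" ]; then
        curl -sS -f -k -u "$USER:$PASS" -X POST "$url" -d "$2"
    else
        curl -sS -f -k -u "$USER:$PASS" -X POST "$url"
    fi
}

factory_reset_all() {
    local reset_all='{"ResetToDefaultsType": "ResetAll"}'
    # HMC R/W filesystem and eMMC
    redfish_post /Managers/HGX_BMC_0/Actions/Manager.ResetToDefaults "$reset_all" &&
    redfish_post /Managers/HGX_BMC_0/Actions/Oem/eMMC.SecureErase &&
    # Grace R/W SPI flashes, both modules
    redfish_post /Chassis/HGX_ProcessorModule_0/Actions/Oem/NvidiaProcessor.VariableSpiErase &&
    redfish_post /Chassis/HGX_ProcessorModule_1/Actions/Oem/NvidiaProcessor.VariableSpiErase &&
    # BMC R/W filesystem and eMMC
    redfish_post /Managers/BMC_0/Actions/Manager.ResetToDefaults "$reset_all" &&
    redfish_post /Managers/BMC_0/Actions/Oem/eMMC.SecureErase
}
```

Calling factory_reset_all returns a non-zero status as soon as one request fails, so the operator can see which reset did not complete and rerun it by hand.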