GB200 Rack Firmware Update#

Rack Firmware Updates#

Firmware updates using Base Command Manager (BCM) 11 software for a GB200 NVL72 rack can be done once all the GB200 compute trays, NVLink Switch trays, and power shelves are up in BCM. The latest FW/SW recipe must be followed for the installation to succeed on all devices. The firmware can also be updated with the standalone nvfwupd tool, as described under Method 2 later in this section. This section provides instructions to upgrade the firmware for each major GB200 rack component (compute tray, NVLink Switch, power shelf).

Note

FW packages for DGX SuperPOD are unique and different from the GB200 reference architecture package.

Reference: DGX GB200 Compute Tray Files Required for Update on DGX SuperPOD (As of BCM 11 1.2 GA)

Table 4 DGX GB200 Compute Tray Files Required for Update on DGX SuperPOD (As of BCM 11 1.2 GA)#

Component                      DGX FW Recipe Version  Filename
-----------------------------  ---------------------  -----------------------------------------------------
DGX GB200 SW/FW Release Notes  1.0.00                 -
Compute BMC bundle             -                      nvfw_DGX-GBX00_0023_<date>.*_custom_prod-signed.fwpkg
Compute HMC bundle             -                      nvfw_HGX-GBX00_0023_<date>.*_custom_prod-signed.fwpkg
BF3                            32.44.1600             fw-Bluefield-3-rel-32_44_1600.*.bin
CX7                            28.44.2506             fw-ConnectX7-rel-28_44_2506.*.bin
MFT                            4.31.0-6012            mft-4.31.0-6012.*.tgz
Switch NVOS                    25.02.2151             nvos-amd64-25.02.2151.bin
Switch BMC bundle              -                      nvfw_GB200-P4978_0004.*.fwpkg
Switch BIOS bundle             -                      nvfw_GB200-P4978_0006.*.fwpkg
Switch CPLD bundle             -                      nvfw_GB200-P4978_0007.*.fwpkg
Switch ONIE                    5.3.0013               onie-updater-x86_64.*.unsigned
Powershelf PSU                 0104                   NVIDIA_5500_APP_0104.*.tar
Powershelf PMC                 3.1.3                  common-pmc-3.1.3.*tar

GB200 compute tray firmware update—general steps

  1. Obtain the compute tray package.

  2. Place the files in /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200.

  3. Confirm that in the Compute device bmcsettings, the firmware management mode is set to GB200.

  4. Check the current node’s FW versions against the update packages.

  5. Execute a dry-run to confirm the FW will update to the expected versions.

  6. Update the BMC package first (Compute BMC bundle), then the compute tray package (Compute HMC bundle). AC power-cycle the trays after each component update is complete.
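Steps 1 and 2 can be sanity-checked before opening cmsh. The following is a minimal sketch, not part of BCM: the function name `missing_fw_patterns` is hypothetical, and the wildcard patterns in the usage comment are assumptions derived from the file names in the table above.

```shell
# Hypothetical helper (not part of BCM): print every glob pattern that has
# no matching file in the given firmware directory.
missing_fw_patterns() {
    fw_dir=$1
    shift
    for pattern in "$@"; do
        found=no
        # $pattern is left unquoted on purpose so the shell expands the glob.
        for f in "$fw_dir"/$pattern; do
            if [ -e "$f" ]; then
                found=yes
                break
            fi
        done
        [ "$found" = yes ] || printf '%s\n' "$pattern"
    done
}

# Example: check that both compute tray bundles were copied to the head node.
# missing_fw_patterns /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200 \
#     'nvfw_DGX-GBX00_*_prod-signed.fwpkg' 'nvfw_HGX-GBX00_*_prod-signed.fwpkg'
```

An empty result means every pattern matched at least one file. This only checks presence in the directory; `firmware info` remains the authoritative check for what BCM can flash.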

NVLink Switch tray firmware update—General Steps

  1. Obtain the NVLink Switch firmware.

  2. Place the files in /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200sw.

  3. Confirm that in the NVLink Switch device bmcsettings, the firmware management mode is set to GB200sw.

  4. Check the current NVLink Switch firmware versions against the update packages.

  5. Execute a dry-run to confirm the firmware will update to the expected version.

  6. Update the tray level firmware first in this order:

    1. BMC+FPGA+ERoT (Switch BMC bundle)

    2. CPLD1 CPLD2 CPLD3 CPLD4 (Switch CPLD bundle)

    3. SBIOS+ERoT (Switch BIOS bundle)

  7. Update the NVOS from within the OS or using ZTP. (Switch NVOS)

  8. Reboot the switch trays after each component update is complete, to apply and activate the new firmware.
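The tray-level order in step 6 can be captured in a small helper so that bundles are always flashed in the same sequence. This is a sketch under assumptions: the function names are hypothetical, and the mapping of the 0004, 0007, and 0006 identifiers to the BMC, CPLD, and BIOS bundles is taken from the file table for this recipe and may differ in other releases.

```shell
# Hypothetical helper: rank a switch tray bundle by the tray-level update
# order listed above (BMC first, then CPLD, then BIOS).
switch_bundle_rank() {
    case $1 in
        *GB200-P4978_0004*) echo 1 ;;  # Switch BMC bundle (BMC+FPGA+ERoT)
        *GB200-P4978_0007*) echo 2 ;;  # Switch CPLD bundle
        *GB200-P4978_0006*) echo 3 ;;  # Switch BIOS bundle (SBIOS+ERoT)
        *) echo 9 ;;                   # unknown bundle: sort last
    esac
}

# Print the given bundle file names, one per line, in update order.
order_switch_bundles() {
    for f in "$@"; do
        printf '%s %s\n' "$(switch_bundle_rank "$f")" "$f"
    done | sort -n | cut -d' ' -f2-
}
```

The ordered list can then drive a loop that flashes one bundle, reboots the tray, and moves on to the next.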

Note

Firmware updates for the GB200 compute trays and NVLink Switch can be done using:

  1. BCM 11 integrated firmware update manager

  2. Standalone nvfwupd tool.

Compute Tray Firmware Update Process#

Method 1—BCM/NVIDIA Mission Control integrated firmware update for compute tray#

To use the firmware update tool in BCM 11, an NVIDIA Mission Control enabled license must be registered:

  1. Place firmware update packages in the correct BCM directory.

    /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200

  2. Copy the prod-signed.fwpkg images up to the BCM head node. The files must be placed in the following directory to be visible to the firmware command.

scp <binary files> user@<head node>:/cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200

Reference: BCM file directory structure for firmware updates.

/cm/local/apps/cmd/etc/htdocs/bios/firmware/

README.md b200/ gb200/ gb200sw/ gh200/ h100/ ilo/

#The gb200 folder is for compute tray firmware, the gb200sw folder is for NVLink Switch firmware
  1. Use the firmware info command in BCM to gather information on the current firmware levels of the nodes. This command provides details about the files and what their purpose is.

$ cmsh;device;firmware info
[BCM11-HEAD-01->device]% firmware info

Device        Filename                                         Component      Version                        State      Progress  Result   Size      Date
------------- ------------------------------------------------ ------------- ------------------------------ ---------- --------- -------- --------- ---------------------
BCM11-HEAD-01 nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg  GB200-BMC     DGX-GBX00_0024_250215.1.0_custom   available  N/A      -        64MiB    2025-02-15, 16:39:41
BCM11-HEAD-01 nvfw_GB200-P4978_0004_250213.1.0_prod-signed.fwpkg      GB200-Switch  GB200-P4978_0004_250213.1.0        available  N/A      -        75MiB    2025-02-13, 10:23:28
BCM11-HEAD-01 nvfw_GB200-P4978_0006_250205.1.0_prod-signed.fwpkg      GB200-Switch  GB200-P4978_0006_250205.1.0        available  N/A      -        16.2MiB  2025-02-05, 15:11:49
BCM11-HEAD-01 nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg GB200-Switch  GB200-P4978_0007_250121.1.2_custom available  N/A      -        1.64MiB  2025-01-21, 13:55:30
BCM11-HEAD-01 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  GB200-Compute HGX-GBX00_0023_250223.1.1_custom   available  N/A      -        114MiB   2025-02-23, 20:20:42

Note

This will display the file names and target (such as GB200 or Switch) of all available firmware binaries. If the files do not show up with this command, they cannot be flashed by the BCM firmware manager. The officially released packages will have a common filename structure starting with nvfw_DGX-GBX00_<identifier>_<date>.
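The filename structure can be used to read the platform, identifier, and build fields off a package name without opening it. The sketch below is illustrative only: `describe_fwpkg` is a hypothetical name, and it assumes the underscore-separated layout seen in the example listing above.

```shell
# Hypothetical helper: split a firmware package file name into the fields
# implied by the nvfw_<platform>_<identifier>_<date>... naming structure.
describe_fwpkg() {
    name=${1##*/}                                          # drop any directory part
    platform=$(printf '%s\n' "$name" | cut -d_ -f2)        # e.g. DGX-GBX00
    identifier=$(printf '%s\n' "$name" | cut -d_ -f3)      # e.g. 0024
    build=$(printf '%s\n' "$name" | cut -d_ -f4)           # e.g. 250215.1.0
    printf 'platform=%s identifier=%s build=%s\n' "$platform" "$identifier" "$build"
}

describe_fwpkg nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg
# prints: platform=DGX-GBX00 identifier=0024 build=250215.1.0
```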

  1. Confirm GB200 Tray BMC Access/Connectivity.

    1. The BMC of each node needs to be configured in BCM. This should be done at the category level. Ensure that no bmc settings are added at the node level so that the compute trays inherit the settings from the category level.

    2. Enter cmsh and show the current BMC settings for a given node or use the category level for GB200 compute trays since all their default passwords are the same (for DGX).

  #category level
  category; use <dgx-category>;bmcsettings; show

  #device level
  device; use <device name>; bmcsettings; show

  Use the device level only to confirm that nothing has been set there.

  If no settings exist at the device level, the prompt shows asterisks,
  which mark the settings as new and uncommitted:

  [bcm11-headnode->device*[s03-p1-dgx-01-c06*]->bmcsettings*]%

  #Use this command to clear uncommitted changes

  refresh

    3. Populate the bmcsettings fields in the dgx-gb200 category if they are not already populated.
$ cmsh;category use dgx-gb200;bmcsettings;
set username root
set password 0penBmc # Or whatever the password is
set userid 1
set firmwaremanagemode gb200
commit

Note

It is critical that the firmware management mode here is set to gb200.

  1. Test that the BMC is configured by reading the current FW component versions.

#At the specific device level

$ cmsh; device use <dgx-node-name>; firmware status

[BCM11-HEAD-01->device[s03-p1-dgx-01-c06]]% firmware status

Device                Filename                   Component                  Version                    State     Progress  Result  Size  Date
--------------------- ------------------------- -------------------------- -------------------------- --------- --------- ------- ----- -----
s03-p1-dgx-01-c06      CX7_0                     28.42.1270                 current                    N/A       N/A      N/A    N/A   N/A
s03-p1-dgx-01-c06      CX7_1                     28.42.1270                 current                    N/A       N/A      N/A    N/A   N/A
s03-p1-dgx-01-c06      CX7_2                     28.42.1270                 current                    N/A       N/A      N/A    N/A   N/A
s03-p1-dgx-01-c06      CX7_3                     28.42.1270                 current                    N/A       N/A      N/A    N/A   N/A
s03-p1-dgx-01-c06      FW_BMC_0                  GB200Nvl-24.12-8           current                    N/A       N/A      N/A    N/A   N/A
s03-p1-dgx-01-c06      FW_CPLD_0                 0x00 0x0b 0x03 0x04        current                    N/A       N/A      N/A    N/A   N/A
s03-p1-dgx-01-c06      FW_CPLD_1                 0x00 0x0b 0x03 0x04        current                    N/A       N/A      N/A    N/A   N/A
s03-p1-dgx-01-c06      FW_CPLD_2                 0x00 0x10 0x01 0x0f        current                    N/A       N/A      N/A    N/A   N/A
s03-p1-dgx-01-c06      FW_CPLD_3                 0x00 0x10 0x01 0x0f        current                    N/A       N/A      N/A    N/A   N/A
s03-p1-dgx-01-c06      FW_ERoT_BMC_0             01.03.0262.0000_n04        current                    N/A       N/A      N/A    N/A   N/A
s03-p1-dgx-01-c06      Full_FW_Image_NIC_Slot_4  32.42.1000                 current                    N/A       N/A      N/A    N/A   N/A
s03-p1-dgx-01-c06      Full_FW_Image_NIC_Slot_7  32.42.1000                 current                    N/A       N/A      N/A    N/A   N/A
s03-p1-dgx-01-c06      UEFI                      buildbrain-gcid-38635631   current                    N/A       N/A      N/A    N/A   N/A

#Alternatively, at the device prompt look at a specific device

cmsh; device;firmware status -n s03-p1-dgx-01-c06

#At the category level to see all of the compute tray FW in one shot

cmsh; device;firmware status -c dgx-gb200

#At the rack level

cmsh; device;firmware status -r <rack location>
  1. As a validation step prior to executing the flash operation, the dry-run option will show exactly what is changing when the firmware is flashed:

    1. Perform a flash dry-run of the BMC firmware.

      cmsh -c 'device; firmware flash nvfw_DGX-GBX00_0023_241223.1.0_custom_prod-signed.fwpkg --dry-run -n <device name>'
      
      #The <device name> can contain a bracket range to apply the change to multiple devices simultaneously
      #For example, s03-p1-dgx-01-c0[1-2] runs the command against both s03-p1-dgx-01-c01 and s03-p1-dgx-01-c02
      
      #Device names can also be comma separated to run against multiple individual devices
      #For example, s03-p1-dgx-01-c01,s03-p1-dgx-01-c02
      
      Example: Dry run output
      
      Device             Component           Target              Version              Package version      Up to date  Action   Result  Error
      -----------------  ------------------  ------------------  -------------------  -------------------  ----------  -------  ------  -----
      s03-p1-dgx-01-c06  HGX_FW_BMC_0        HGX_FW_BMC_0        GB200Nvl-25.01-D     GB200Nvl-25.01-E     no          install  good
      s03-p1-dgx-01-c06  HGX_FW_CPU_0        HGX_FW_CPU_0        02.03.19             02.03.20             no          install  good
      s03-p1-dgx-01-c06  HGX_FW_CPU_1        HGX_FW_CPU_1        02.03.19             02.03.20             no          install  good
      s03-p1-dgx-01-c06  HGX_FW_ERoT_BMC_0   HGX_FW_ERoT_BMC_0   01.04.0008.0000_n04  01.04.0008.0000_n04  yes         skip     good
      s03-p1-dgx-01-c06  HGX_FW_ERoT_CPU_0   HGX_FW_ERoT_CPU_0   01.04.0008.0000_n04  01.04.0008.0000_n04  yes         skip     good
      s03-p1-dgx-01-c06  HGX_FW_ERoT_CPU_1   HGX_FW_ERoT_CPU_1   01.04.0008.0000_n04  01.04.0008.0000_n04  yes         skip     good
      s03-p1-dgx-01-c06  HGX_FW_ERoT_FPGA_0  HGX_FW_ERoT_FPGA_0  01.04.0008.0000_n04  01.04.0008.0000_n04  yes         skip     good
      s03-p1-dgx-01-c06  HGX_FW_ERoT_FPGA_1  HGX_FW_ERoT_FPGA_1  01.04.0008.0000_n04  01.04.0008.0000_n04  yes         skip     good
      s03-p1-dgx-01-c06  HGX_FW_FPGA_0       HGX_FW_FPGA_0       1.20                 1.20                 yes         skip     good
      s03-p1-dgx-01-c06  HGX_FW_FPGA_1       HGX_FW_FPGA_1       1.20                 1.20                 yes         skip     good
      s03-p1-dgx-01-c06  HGX_FW_GPU_0        HGX_FW_GPU_0        97.00.82.00.13       97.00.82.00.19       no          install  good
      s03-p1-dgx-01-c06  HGX_FW_GPU_1        HGX_FW_GPU_1        97.00.82.00.13       97.00.82.00.19       no          install  good
      s03-p1-dgx-01-c06  HGX_FW_GPU_2        HGX_FW_GPU_2        97.00.82.00.13       97.00.82.00.19       no          install  good
      s03-p1-dgx-01-c06  HGX_FW_GPU_3        HGX_FW_GPU_3        97.00.82.00.13       97.00.82.00.19       no          install  good
      
    2. Ensure that the components that are not up to date will be updated to the expected package versions.
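For a large number of nodes, the dry-run table can be filtered so that only the rows that will actually change remain. This is a rough sketch, assuming the column layout of the example output above (Action as the seventh whitespace-separated field); `list_planned_installs` is a hypothetical name, not a BCM command.

```shell
# Hypothetical helper: read dry-run table output on stdin and print
# "component current-version -> package-version" for rows marked install.
list_planned_installs() {
    awk '$7 == "install" { print $2, $4, "->", $5 }'
}

# Example:
# cmsh -c 'device; firmware flash <package> --dry-run -n <device name>' | list_planned_installs
```

Header and separator rows are ignored because their seventh field is not `install`. Version strings that contain spaces would shift the columns, so this filter only suits packages whose versions are single tokens, as in the example.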

  2. Start the firmware update.

$ cmsh -c 'device; firmware flash nvfw_DGX-GBX00_0023_241223.1.0_custom_prod-signed.fwpkg -n <device name>'
  1. Once the payload has been uploaded to the node, the Result column shows good.

[BCM11-HEAD-01->device]% firmware flash nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg -n s03-p1-dgx-01-c[04-06]

Device              Firmware Package                                      Result
------------------- ---------------------------------------------------- -------
s03-p1-dgx-01-c04   nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg good
s03-p1-dgx-01-c05   nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg good
s03-p1-dgx-01-c06   nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg good
  1. After the command returns, check the status of the update periodically until it has completed.

The output shows a percentage while flashing is in progress and a success message when the flash has finished.

$ cmsh -c 'device; firmware status -n <device name>'

  s03-p1-dgx-01-c06
  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_BMC_0
  GB200Nvl-25.01-D flashing 0.0% 114MiB

  s03-p1-dgx-01-c06
  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_0
  02.03.19 flashing 0.0% 114MiB

  s03-p1-dgx-01-c06
  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_1
  02.03.19 flashing 0.0% 114MiB

  s03-p1-dgx-01-c06
  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_0
  97.00.82.00.13 flashing 0.0% 114MiB

  s03-p1-dgx-01-c06
  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_1
  97.00.82.00.13 flashing 0.0% 114MiB

  s03-p1-dgx-01-c06
  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_2
  97.00.82.00.13 flashing 0.0% 114MiB

  s03-p1-dgx-01-c06
  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_3
  97.00.82.00.13 flashing 0.0%

  s03-p1-dgx-01-c06
  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_BMC_0
  GB200Nvl-25.01-D -> GB200Nvl-25.01-E pending N/A success:
  medium-specific reset or dc power cycle or ac power cy+ 114MiB

  s03-p1-dgx-01-c06
  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_0
  02.03.19 -> 02.03.20 pending N/A success: medium-specific reset or dc
  power cycle or ac power cy+ 114MiB

  s03-p1-dgx-01-c06
  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_1
  02.03.19 -> 02.03.20 pending N/A success: medium-specific reset or dc
  power cycle or ac power cy+ 114MiB

  s03-p1-dgx-01-c06
  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_0
  97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific
  reset or dc power cycle or ac power cy+ 114MiB

  s03-p1-dgx-01-c06
  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_1
  97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific
  reset or dc power cycle or ac power cy+ 114MiB

  s03-p1-dgx-01-c06
  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_2
  97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific
  reset or dc power cycle or ac power cy+ 114MiB

  s03-p1-dgx-01-c06
  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_3
  97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific
  reset or dc power cycle or ac power cy+ 114MiB
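The periodic check can be automated with a simple polling loop. The sketch below is an assumption-laden illustration rather than a BCM feature: `wait_for_flash` is a hypothetical name, and it treats the update as finished once the status output no longer contains the word `flashing`, which matches the example output above but has not been verified against every state BCM can report.

```shell
# Hypothetical helper: run a status command repeatedly until its output no
# longer contains the "flashing" state, or give up after max_tries checks.
# Usage: wait_for_flash <max_tries> <sleep_seconds> <command> [args...]
wait_for_flash() {
    max_tries=$1
    pause=$2
    shift 2
    try=1
    while [ "$try" -le "$max_tries" ]; do
        out=$("$@") || return 2     # the status command itself failed
        case $out in
            *flashing*) ;;          # at least one component is still flashing
            *) return 0 ;;          # nothing is flashing any more
        esac
        sleep "$pause"
        try=$((try + 1))
    done
    return 1                        # still flashing after max_tries checks
}

# Example (check every 60 seconds, up to 30 times):
# wait_for_flash 30 60 cmsh -c 'device; firmware status -n s03-p1-dgx-01-c06'
```

A return value of 0 only means flashing has stopped; the Result column still needs to be read to confirm success and to see which activation step is required.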

At the end of the BMC update, the administrator must activate the installed firmware before updating other components. The success message indicates the operation required to activate the installed firmware:

  * power aux_cycle: ac power cycle
  * power reset: medium_specific_reset or dc power cycle
  * bmcreset: reset bmc

  1. Activating firmware using the AC power cycle

Note

The GB200 compute tray has two levels of power:

  1. The primary (system) power is the power supplied to the compute tray CPUs and GPUs. This must be powered off before the aux_cycle process.

  2. The standby (AUX) power is the power that is supplied to the BMC and low-level components.

Cycling standby power is an automated process that temporarily removes power from the compute tray, reinitializing all hardware components. The BMC will be unavailable for several minutes during the aux_cycle process. Once completed, the primary power can be toggled on again.

Once both components have completed the firmware update, perform the AC power cycle using either of the following two methods.
  1. Power Cycle Method 1—by AUX_PWR_CYCLE (Redfish)

    #From the head node, do this first to power down the system
    
    curl -k -u ${USER}:${PASS} -H "Content-Type: application/json" -X POST -d '{"ResetType": "ForceOff"}' https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset
    
    #Do this next to effectively AC Power cycle (removal of auxiliary power)
    
    curl -k -u ${USER}:${PASS} -H "Content-Type: application/json" -X POST -d '{"ResetType":"AuxPowerCycle"}' https://${BMCIP}/redfish/v1/Chassis/BMC_0/Actions/Oem/NvidiaChassis.AuxPowerReset
    
    #Use redfish to power on
    
    curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset -d '{"ResetType": "On"}' -X POST
    
    #Or use cmsh to power on the node
    
    cmsh;device;use <compute node under test>;power on
    
    #Or to do multiples
    
    cmsh;device;foreach -c dgx-gb200 (power on)
    
    #or
    
    cmsh;device;power on -c dgx-gb200 #this does all nodes in the category
    
    cmsh;device;power on -n <specific nodes>
    
  2. Power Cycle Method 2—by BCM power auxcycle command (available in 11.25.07 and later)

    #From the cmsh device context, first power off the node
    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
    rf0 ...................... [   ON    ] dgx-gb200-m06-c1
    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power off
    rf0 ...................... [   OFF   ] dgx-gb200-m06-c1
    
    #Note: if the node is still ON when the power auxcycle command is executed, you will get an error message
    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
    rf0 ...................... [   ON    ] dgx-gb200-m06-c1
    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power auxcycle
    rf0 ...................... [  FAILED ] dgx-gb200-m06-c1 (System power is not OFF)
    
    #After the node is power OFF, then execute the power auxcycle command
    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
    rf0 ...................... [   OFF   ] dgx-gb200-m06-c1
    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power auxcycle
    rf0 ...................... [AUX CYCLE]
    
    #The auxcycle will make the BMC unavailable for several minutes, therefore power status command will fail
    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
    rf0 ...................... [  FAILED ] dgx-gb200-m06-c1 (Unable to establish session)
    
    #After the auxcycle process is complete, the BMC will be available again and power status command will succeed reporting the primary power is OFF
    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
    rf0 ...................... [   OFF   ] dgx-gb200-m06-c1
    
    #Finally power on the node
    [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power on
    rf0 ...................... [   ON    ] dgx-gb200-m06-c1
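The Redfish calls from Method 1 can be wrapped so that the sequence is reviewed before anything is sent to a BMC. The following is a sketch only: `redfish_post`, `aux_cycle_sequence`, `DRY_RUN`, `BMC_USER`, and `BMC_PASS` are hypothetical names introduced for illustration, and 192.0.2.10 is a documentation-only address. The endpoints and request bodies are the ones shown in Method 1.

```shell
# Hypothetical wrapper around the Redfish POST used in Method 1. With
# DRY_RUN=1 (the default in this sketch) the curl command is only printed.
redfish_post() {
    rf_path=$1
    rf_body=$2
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf "curl -k -u <user>:<pass> -H 'Content-Type: application/json' -X POST -d '%s' https://%s%s\n" \
            "$rf_body" "$BMCIP" "$rf_path"
    else
        curl -k -u "${BMC_USER}:${BMC_PASS}" -H 'Content-Type: application/json' \
            -X POST -d "$rf_body" "https://${BMCIP}${rf_path}"
    fi
}

# Power off, AUX power cycle, then power on, in the order described above.
aux_cycle_sequence() {
    redfish_post /redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset '{"ResetType": "ForceOff"}'
    # A real run must wait here until the system power state reports off.
    redfish_post /redfish/v1/Chassis/BMC_0/Actions/Oem/NvidiaChassis.AuxPowerReset '{"ResetType":"AuxPowerCycle"}'
    # A real run must wait here until the BMC responds again (several minutes).
    redfish_post /redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset '{"ResetType": "On"}'
}

BMCIP=192.0.2.10
DRY_RUN=1
aux_cycle_sequence
```

Keeping the dry-run mode as the default is a deliberate choice: an AUX power cycle takes the BMC offline, so printing the three requests first gives a chance to confirm the target address.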
    

Note

In ipmitool, the Reset command performs a warm-reset which is equivalent to Ctrl-Alt-Del. The power cycle reset is the same as pressing the power button to turn the machine off, followed by pressing the power button again to turn the machine on. Keep in mind this will not activate ERoT, CPLD, or FPGA components.

  1. If issues arise, rerun the flash command with the verbose and debug options enabled; the debug output can help identify the root cause.

$ cmsh -c 'device; firmware flash nvfw_DGX-GBX00_0023_241223.1.0_custom_prod-signed.fwpkg -n <device name> -v --debug'

Method 2—Standalone nvfwupd tool for compute tray#

If the license does not support NVIDIA Mission Control, the built-in cm-nvfwupd command will not work. In that case, obtain the standalone tool in one of two ways:

  * Download the standalone nvfwupd tool from the enterprise support portal. This tool can be used independently of BCM.

  * Or install the nvfwupd package from the cuda apt repository.

  1. Get the correct firmware update packages for the update. To see the full contents of a .fwpkg file, use the show_pkg_content command.

$ ./nvfwupd show_pkg_content -p ./nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg
  1. Get current state of the hardware with show_version.

root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 show_version \
  -p ./nvfw_GB200-P4972_0012_250214.1.0_custom_prod-signed.fwpkg \
  ./nvfw_GB200-P4975_0011_250206.1.1_custom_recovery_prod-signed.fwpkg

System Model: GB200 NVL

Part number: 699-24764-0001-RC1

Serial number: 1334524170073

Packages: ['GB200-P4972_0012_250214.1.0_custom',
'GB200-P4975_0011_250206.1.1_custom_recovery']
Connection Status: Successful

Firmware Devices:

AP Name                  Sys Version              Pkg Version                Up-To-Date
-----------------------  -----------------------  -------------------------  ----------
CX7_0                    28.43.2108               N/A                        No
CX7_1                    28.43.2108               N/A                        No
CX7_2                    28.43.2108               N/A                        No
CX7_3                    28.43.2108               N/A                        No
FW_BMC_0                 GB200Nvl-25.01-D         GB200Nvl-25.01-E           No
FW_CPLD_0                0x00 0x0b 0x03 0x04      N/A                        No
FW_CPLD_1                0x00 0x0b 0x03 0x04      N/A                        No
FW_CPLD_2                0x00 0x10 0x01 0x0f      N/A                        No
FW_CPLD_3                0x00 0x10 0x01 0x0f      N/A                        No
FW_ERoT_BMC_0            01.04.0008.0000_n04      01.04.0008.0000_n04        Yes
Full_FW_Image_NIC_Slot_4 32.43.2408               N/A                        No
Full_FW_Image_NIC_Slot_7 32.43.2408               N/A                        No
UEFI                     buildbrain-gcid-39281046 N/A                        No
HGX_FW_BMC_0             GB200Nvl-25.01-D         N/A                        No
HGX_FW_CPLD_0            0.1C                     N/A                        No
HGX_FW_CPU_0             02.03.19                 N/A                        No
HGX_FW_CPU_1             02.03.19                 N/A                        No
HGX_FW_ERoT_BMC_0        01.04.0008.0000_n04      01.03.0196.0001            Yes
HGX_FW_ERoT_CPU_0        01.04.0008.0000_n04      01.03.0196.0001            Yes
HGX_FW_ERoT_CPU_1        01.04.0008.0000_n04      01.03.0196.0001            Yes
HGX_FW_ERoT_FPGA_0       01.04.0008.0000_n04      01.03.0196.0001            Yes
HGX_FW_ERoT_FPGA_1       01.04.0008.0000_n04      01.03.0196.0001            Yes
HGX_FW_FPGA_0            1.20                     N/A                        No
HGX_FW_FPGA_1            1.20                     N/A                        No
HGX_FW_GPU_0             97.00.82.00.13           1.0.61.0                   No
HGX_FW_GPU_1             97.00.82.00.13           1.0.61.0                   No
HGX_FW_GPU_2             97.00.82.00.13           1.0.61.0                   No
HGX_FW_GPU_3             97.00.82.00.13           1.0.61.0                   No
HGX_InfoROM_GPU_0        G548.0201.00.06          N/A                        No
HGX_InfoROM_GPU_1        G548.0201.00.06          N/A                        No
HGX_InfoROM_GPU_2        G548.0201.00.06          N/A                        No
HGX_InfoROM_GPU_3        G548.0201.00.06          N/A                        No
HGX_PCIeSwitchConfig_0   01151024                 N/A                        No

-----------------------------------------------------------------------------------------------

Error Code: 0
  1. Create the payload .jsons for the BMC and the compute tray:

Reference: UpdateBMC.json for updating BMC:

{
    "Targets": []
}

Reference: UpdateCompute.json for updating HGX:

{
    "Targets": ["/redfish/v1/Chassis/HGX_Chassis_0"]
}
  1. Run the BMC update first.

./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 \
  update_fw -s UpdateBMC.json -p ./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg
  1. Power off the system, then do an AUX Power cycle.

./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 \
  activate_fw -c PWR_OFF

#wait 15 seconds

./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 \
  activate_fw -c RF_AUX_PWR_CYCLE
  1. Check if the BMC update was successful.

Reference: Successful BMC update:

root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 \
  update_fw -s UpdateBMC.json -p ./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg

Updating ip address: ip=XXXX

FW package:
['./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg']

Ok to proceed with firmware update? <Y/N>

y

{"@odata.id": "/redfish/v1/TaskService/Tasks/3", "@odata.type":
"#Task.v1_4_3.Task", "Id": "3", "TaskState": "Running", "TaskStatus":
"OK"}

FW update started, Task Id: 3

Wait for Firmware Update to Start...

TaskState: Running

PercentComplete: 20

TaskStatus: OK

TaskState: Running

PercentComplete: 40

TaskStatus: OK

TaskState: Running

PercentComplete: 60

TaskStatus: OK

TaskState: Completed

PercentComplete: 100

TaskStatus: OK

Firmware update successful!

Overall Time Taken: 0:13:01

Refer to 'NVIDIA Firmware Update Document' on activation steps for new
firmware to take effect.

----------------------------------------------------------------------
Error Code: 0
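When the update is scripted, the captured tool output can be checked for the two markers that the reference run above ends with. This is a minimal sketch, assuming the output format shown above; `nvfwupd_log_ok` is a hypothetical helper, not an nvfwupd subcommand.

```shell
# Hypothetical helper: read captured update_fw output on stdin and succeed
# only if the task reached Completed and the tool reported error code 0.
nvfwupd_log_ok() {
    log=$(cat)
    printf '%s\n' "$log" | grep -q '^TaskState: Completed' || return 1
    printf '%s\n' "$log" | grep -q '^Error Code: 0$' || return 1
}

# Example:
# ./nvfwupd -t ... update_fw -s UpdateBMC.json -p <package> | tee bmc_update.log
# nvfwupd_log_ok < bmc_update.log && echo "BMC update reported success"
```

A passing check still leaves the firmware inactive until the power off and AUX power cycle step has been carried out.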
  1. Do the full compute tray flash. Ensure that the system is fully booted into its OS so that the GPU VBIOS updates can be applied.

./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 \
  update_fw -s UpdateCompute.json -p ./nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg
  1. As with the BMC update, power off the system and then do an AUX power cycle.

  2. Power on the machine, let it provision/boot up, then check the firmware level again.

root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=10.78.194.13 user=root password=0penBmc servertype=GB200 show_version \
  -p ./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg \
  ./nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg

System Model: GB200 NVL

Part number: 692-13809-2404-RC1

Serial number: 1330125050101

Packages: ['DGX-GBX00_0024_250215.1.0_custom',
'HGX-GBX00_0023_250223.1.1_custom']

Connection Status: Successful

Firmware Devices:

AP Name                   Sys Version              Pkg Version                Up-To-Date
------------------------- ------------------------ -------------------------- ----------
CX7_0                     28.43.2108               N/A                        No
CX7_1                     28.43.2108               N/A                        No
CX7_2                     28.43.2108               N/A                        No
CX7_3                     28.43.2108               N/A                        No
FW_BMC_0                  GB200Nvl-25.01-E         GB200Nvl-25.01-E           Yes
FW_CPLD_0                 0x00 0x0b 0x03 0x04      N/A                        No
FW_CPLD_1                 0x00 0x0b 0x03 0x04      N/A                        No
FW_CPLD_2                 0x00 0x10 0x01 0x0f      N/A                        No
FW_CPLD_3                 0x00 0x10 0x01 0x0f      N/A                        No
FW_ERoT_BMC_0             01.04.0008.0000_n04      01.04.0008.0000_n04        Yes
Full_FW_Image_NIC_Slot_4  32.43.2408               N/A                        No
Full_FW_Image_NIC_Slot_7  32.43.2408               N/A                        No
UEFI                      buildbrain-gcid-39556194 N/A                        No
HGX_FW_BMC_0              GB200Nvl-25.01-E         GB200Nvl-25.01-E           Yes
HGX_FW_CPLD_0             0.1C                     0.1C                       Yes
HGX_FW_CPU_0              02.03.20                 02.03.20                   Yes
HGX_FW_CPU_1              02.03.20                 02.03.20                   Yes
HGX_FW_ERoT_BMC_0         01.04.0008.0000_n04      01.04.0008.0000_n04        Yes
HGX_FW_ERoT_CPU_0         01.04.0008.0000_n04      01.04.0008.0000_n04        Yes
HGX_FW_ERoT_CPU_1         01.04.0008.0000_n04      01.04.0008.0000_n04        Yes
HGX_FW_ERoT_FPGA_0        01.04.0008.0000_n04      01.04.0008.0000_n04        Yes
HGX_FW_ERoT_FPGA_1        01.04.0008.0000_n04      01.04.0008.0000_n04        Yes
HGX_FW_FPGA_0             1.20                     1.20                       Yes
HGX_FW_FPGA_1             1.20                     1.20                       Yes
HGX_FW_GPU_0              97.00.82.00.19           97.00.82.00.19             Yes
HGX_FW_GPU_1              97.00.82.00.19           97.00.82.00.19             Yes
HGX_FW_GPU_2              97.00.82.00.19           97.00.82.00.19             Yes
HGX_FW_GPU_3              97.00.82.00.19           97.00.82.00.19             Yes
HGX_InfoROM_GPU_0         G548.0201.00.06          N/A                        No
HGX_InfoROM_GPU_1         G548.0201.00.06          N/A                        No
HGX_InfoROM_GPU_2         G548.0201.00.06          N/A                        No
HGX_InfoROM_GPU_3         G548.0201.00.06          N/A                        No
HGX_PCIeSwitchConfig_0    01151024                 N/A                        No
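In the table above, rows with N/A in the Pkg Version column are components the supplied packages do not cover, so their "No" is expected. A short filter can list only the components that a package does cover but that are still out of date. This is a sketch that assumes the column layout shown above; `list_outdated` is a hypothetical name.

```shell
# Hypothetical helper: read show_version output on stdin and print the
# components whose Pkg Version is not N/A but whose Up-To-Date value is No.
list_outdated() {
    awk 'NF >= 4 && $NF == "No" && $(NF-1) != "N/A" { print $1 }'
}

# Example:
# ./nvfwupd -t ... show_version -p <packages> | list_outdated
```

An empty result after the final power cycle suggests that every component covered by the packages is at the packaged version.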

Applying and verifying firmware update success#

After all required firmware is installed, the compute node needs an AC cycle to fully apply the updates. This procedure can be used to bring the nodes down and back up. First connect to the GB200 tray BMC OS, then:

  1. Power off the host.

# Checks that the current status is on

curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0 | jq '."PowerState"'



# Shuts down the OS

# Graceful shutdown:

curl -k -u ${USER}:${PASS} \
  https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset \
  -d '{"ResetType": "GracefulShutdown"}' -X POST

# Force power off:

curl -k -u ${USER}:${PASS} \
  https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset \
  -d '{"ResetType": "ForceOff"}' -X POST
  2. AC cycle the node.

curl -k -u ${USER}:${PASS} \
  https://${BMCIP}/redfish/v1/Chassis/BMC_0/Actions/Oem/NvidiaChassis.AuxPowerReset \
  -d '{"ResetType":"AuxPowerCycleForce"}' -X POST

  3. Wait for the BMC to respond to ping again (this should take 2-3 minutes). Once the BMC responds, bring the host back up.

# Check that the current power state is Off (if it is On, no further action is required)
curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0 | jq '."PowerState"'

# Power on
curl -k -u ${USER}:${PASS} \
  https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset \
  -d '{"ResetType": "On"}' -X POST

  4. When the BMC and host are back up, validate that the firmware installation was successful.

cmsh -c 'device; firmware status -n <device name>'
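The wait for the BMC to answer ping again can be scripted rather than watched by hand. The following is a minimal sketch under stated assumptions: the function name `wait_for_bmc` is hypothetical, the default timeout of 300 seconds is chosen only to leave margin over the 2-3 minutes mentioned in the procedure, and the `-W 2` reply timeout assumes the Linux (iputils) `ping`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait until a BMC answers ping again.
# Usage: wait_for_bmc <bmc-ip> [timeout-seconds] [interval-seconds]
wait_for_bmc() {
    local host="$1" timeout="${2:-300}" interval="${3:-10}" waited=0
    # Send one echo request per attempt and wait up to 2 seconds for the reply.
    until ping -c 1 -W 2 "$host" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "BMC $host did not answer within ${timeout}s" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "BMC $host is reachable"
}
```

For example, `wait_for_bmc "${BMCIP}" && <power-on command from the procedure>` would issue the power-on request only after the BMC responds. The elapsed time is approximate because it counts only the sleep intervals, not the time spent in each ping attempt.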

Power Shelf Firmware Update Process#

Power shelves for DGX GB200 NVL72 systems come from several vendors. The following instructions apply to shelves made by Delta.

  1. Flash the PMC with the latest version.

The response will contain the task number.

curl -k -u admin:password -H "Content-Type: application/octet-stream" -X POST -T <FIRMWARE_FILE> https://<BMC_IP>/redfish/v1/UpdateService/update
  2. Verify that the flash has completed.

curl -k -u admin:password -X GET https://<BMC_IP>/redfish/v1/TaskService/Tasks/<Task_Number>
  3. Check the PMC version.

curl -k -u admin:password -X GET https://<BMC_IP>/redfish/v1/Managers/SMC
  4. Update the PSU by flashing it with the latest firmware.

    1. Repeat steps 1 and 2, setting <FIRMWARE_FILE> to the PSU firmware image.

    2. Run the following command and check the PSU version and health in the FirmwareVersion and Status/Health properties of the output.

curl -k -u admin:password -X GET https://<BMC_IP>/redfish/v1/Chassis/chassis/PowerSubsystem/PowerSupplies/<PS_NUMBER>

Note

A PSU firmware update will temporarily power off the PSU, so we recommend that the rack is idle during the PSU update process.
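Checking that a flash task has completed can be automated by polling the task resource. The following is a minimal sketch, not a vendor-provided tool. It assumes that the task resource returned by the shelf carries the standard Redfish `TaskState` property and that `jq` is installed. The function names, the 10-second poll interval, and the default of 60 polls are all hypothetical choices. The `admin:password` credentials mirror the placeholders used in the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a Redfish TaskState value to finished / failed / pending.
classify_task_state() {
    case "$1" in
        Completed)                  echo finished ;;
        Exception|Killed|Cancelled) echo failed ;;
        *)                          echo pending ;;
    esac
}

# Hypothetical helper: poll a task until it finishes, fails, or the poll budget runs out.
# Usage: wait_for_task <bmc-ip> <task-number> [max-polls]
wait_for_task() {
    local bmc="$1" task="$2" polls="${3:-60}" state i=0
    while [ "$i" -lt "$polls" ]; do
        # Assumption: the response body contains a top-level TaskState property.
        state=$(curl -sk -u admin:password \
            "https://${bmc}/redfish/v1/TaskService/Tasks/${task}" | jq -r '.TaskState')
        case "$(classify_task_state "$state")" in
            finished) echo "Task ${task}: ${state}"; return 0 ;;
            failed)   echo "Task ${task}: ${state}" >&2; return 1 ;;
        esac
        sleep 10
        i=$((i + 1))
    done
    echo "Task ${task} still pending after ${polls} polls" >&2
    return 1
}
```

Calling `wait_for_task <BMC_IP> <Task_Number>` after the upload in step 1 would then block until the task reports a final state. Any state that is not recognized, including an empty response, is treated as pending, so the poll limit is what prevents an endless loop.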

BlueField and CX7 FW Update Process#

Before installation, copy the firmware binary to the compute tray host or place it in a shared directory. The binary file name has the following format:

fw-ConnectX7-rel-28_42_1270-900-24768-0002_Ax-UEFI-14.35.15-FlexBoot-3.7.500.signed.bin

The general steps to install NVIDIA networking firmware are as follows:

  1. Start the MST service.

mst start
  2. Query the devices to find their /dev/mst paths.

mst status -v
  3. Read the current version of firmware on a given device.

flint -d /dev/mst/mt4129_pciconf0 q full
  4. Flash the firmware on the device.

    1. cd to the directory where the binary is stored, then burn the image.

      flint -d /dev/mst/mt4129_pciconf0 -i fw-ConnectX7-rel-28_42_1270-900-24768-0002_Ax-UEFI-14.35.15-FlexBoot-3.7.500.signed.bin b

    2. Repeat this for all four CX7 devices.

  5. Reset the CX7 and reboot the host.

mlxfwreset -d mlx5_0 reset

PCI devices:

------------
DEVICE_TYPE      MST                      PCI          RDMA   NET           NUMA
ConnectX7(rev:0) /dev/mst/mt4129_pciconf0 0000:03:00.0 mlx5_0 net-ibp3s0    0
ConnectX7(rev:0) /dev/mst/mt4129_pciconf1 0002:03:00.0 mlx5_1 net-ibP2p3s0  0
ConnectX7(rev:0) /dev/mst/mt4129_pciconf2 0010:03:00.0 mlx5_4 net-ibP16p3s0 1
ConnectX7(rev:0) /dev/mst/mt4129_pciconf3 0012:03:00.0 mlx5_5 net-ibP18p3s0 1
  6. For BlueField 3, the process is the same, except that the device path begins with /dev/mst/mt41692 (for example, /dev/mst/mt41692_pciconf0).
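Repeating the flash step for every device can be expressed as a loop over the MST device nodes. The following is a minimal sketch, assuming the device naming shown in the listing above (`mt4129_pciconfN` for CX7, `mt41692_pciconfN` for BlueField 3). The function names are hypothetical. Nodes whose name contains a dot (such as a `pciconf0.1` secondary-port node, if one is present) are skipped so that each device is flashed only once.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list MST device nodes for one device-ID prefix.
# Usage: list_mst_devices <prefix> [directory]   (directory defaults to /dev/mst)
list_mst_devices() {
    local prefix="$1" dir="${2:-/dev/mst}" dev
    for dev in "$dir"/"$prefix"_pciconf*; do
        [ -e "$dev" ] || continue      # the glob matched nothing
        case "${dev##*/}" in
            *.*) continue ;;           # skip secondary-port nodes such as pciconf0.1
        esac
        printf '%s\n' "$dev"
    done
}

# Hypothetical helper: burn one image onto every listed device,
# stopping at the first failure.
# Usage: flash_all <prefix> <image-file> [directory]
flash_all() {
    local prefix="$1" image="$2" dir="${3:-/dev/mst}" dev
    for dev in $(list_mst_devices "$prefix" "$dir"); do
        echo "Flashing $dev"
        # "yes" answers flint's confirmation prompt.
        yes | flint -d "$dev" -i "$image" b || return 1
    done
}
```

With this sketch, `flash_all mt4129 <CX7 image>` would cover the four CX7 devices and `flash_all mt41692 <BF3 image>` the BlueField 3 devices. The reset and host reboot from step 5 are still required afterwards.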

Combined CX-7 and BlueField Update#

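To update the CX-7 and BlueField 3 firmware on all nodes in a category at once, run the reference script through pdsh and write a per-host log:

pdsh -g category=dgx-gb200 '/home/nvis/(dir where the firmware update is)/nicupdate.sh > /home/nvis/(dir where the firmware update is)/$(hostname)_fw_upgrade_$(date +'%Y%m%d-%H%M%S').log'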

Reference script: NIC updates (Both BF3 and CX-7) - nicupdate.sh:

# Start MST and query the current CX-7 firmware
mst start
flint -d /dev/mst/mt4129_pciconf0 q full
flint -d /dev/mst/mt4129_pciconf1 q full
flint -d /dev/mst/mt4129_pciconf2 q full
flint -d /dev/mst/mt4129_pciconf3 q full

# Query the current BlueField 3 firmware
flint -d /dev/mst/mt41692_pciconf0 q full
flint -d /dev/mst/mt41692_pciconf1 q full

# Firmware image locations
basedir=/home/<user>/fw_0.9_releases/mellanox
bf3file=fw-BlueField-3-rel-32_43_2408-900-9D3B6-00CN-P_Ax-NVME-20.4.1-UEFI-21.4.13-UEFI-22.4.14-UEFI-14.36.21-FlexBoot-3.7.500.signed.bin
cx7file=fw-ConnectX7-rel-28_43_2110-900-24768-0002_Ax-UEFI-14.36.21-FlexBoot-3.7.500.signed.bin

# Burn the CX-7 firmware (yes answers the confirmation prompt)
yes | flint -d /dev/mst/mt4129_pciconf0 -i $basedir/$cx7file b
yes | flint -d /dev/mst/mt4129_pciconf1 -i $basedir/$cx7file b
yes | flint -d /dev/mst/mt4129_pciconf2 -i $basedir/$cx7file b
yes | flint -d /dev/mst/mt4129_pciconf3 -i $basedir/$cx7file b

# Burn the BlueField 3 firmware
yes | flint -d /dev/mst/mt41692_pciconf0 -i $basedir/$bf3file b
yes | flint -d /dev/mst/mt41692_pciconf1 -i $basedir/$bf3file b
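The reference script burns every device unconditionally. One possible refinement is to skip devices that already run the target version. The following is a minimal sketch with hypothetical function names. It assumes the image file names follow the `fw-<product>-rel-<major>_<minor>_<patch>...` pattern seen above and that the `flint ... q` output contains a line beginning with `FW Version:`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the dotted version from an image file name,
# e.g. fw-ConnectX7-rel-28_43_2110-... -> 28.43.2110
fw_version_from_filename() {
    printf '%s\n' "${1##*/}" |
        sed -n 's/^fw-.*-rel-\([0-9][0-9]*\)_\([0-9][0-9]*\)_\([0-9][0-9]*\).*/\1.\2.\3/p'
}

# Hypothetical helper: extract the running version from flint query output on stdin.
# Assumption: the output has a line that starts with "FW Version:".
fw_version_from_query() {
    awk '/^FW Version:/ { print $3; exit }'
}

# Succeeds (exit status 0) when the device does not already run the image version.
# Usage: needs_update <mst-device> <image-file>
needs_update() {
    local current target
    current=$(flint -d "$1" q | fw_version_from_query)
    target=$(fw_version_from_filename "$2")
    [ "$current" != "$target" ]
}
```

A burn line in the script could then be guarded as `needs_update "$dev" "$basedir/$cx7file" && yes | flint -d "$dev" -i "$basedir/$cx7file" b`. The comparison is a plain string match, so it only detects whether the versions differ, not which one is newer.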