Compute Tray Firmware Update Process

Method 1 - BCM/NVIDIA Mission Control Integrated Firmware Update for Compute Tray

To use the firmware update tool in BCM 11, an NVIDIA Mission Control enabled license must be registered.

  1. Place the firmware update packages in the correct BCM directory: /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200

  2. Copy the prod-signed.fwpkg images to the BCM head node. The files must be placed in the following directory to be visible to the ‘firmware’ command:

    scp <binary files> user@<headnode>:/cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200
    

    Reference: BCM file directory structure for firmware updates.

    /cm/local/apps/cmd/etc/htdocs/bios/firmware/
    
    README.md b200/ gb200/ gb200sw/ gh200/ h100/ ilo/
    
    # The gb200 folder is for compute tray firmware, the gb200sw folder is
    # for NVLink Switch firmware
    
  3. Use the firmware info command in BCM to list the firmware packages available on the head node. The output shows each file and the component it targets.

    cmsh;device;firmware info
    
    [T06-HEAD-01->device]% firmware info
    
    Device       Filename                                                   Component      Version                             State      Progress  Result  Size     Date
    -----------  ---------------------------------------------------------  -------------  ----------------------------------  ---------  --------  ------  -------  --------------------
    T06-HEAD-01  nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg    GB200-BMC      DGX-GBX00_0024_250215.1.0_custom    available  N/A               64MiB    2025-02-15, 16:39:41
    T06-HEAD-01  nvfw_GB200-P4978_0004_250213.1.0_prod-signed.fwpkg         GB200-Switch   GB200-P4978_0004_250213.1.0         available  N/A               75MiB    2025-02-13, 10:23:28
    T06-HEAD-01  nvfw_GB200-P4978_0006_250205.1.0_prod-signed.fwpkg         GB200-Switch   GB200-P4978_0006_250205.1.0         available  N/A               16.2MiB  2025-02-05, 15:11:49
    T06-HEAD-01  nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg  GB200-Switch   GB200-P4978_0007_250121.1.2_custom  available  N/A               1.64MiB  2025-01-21, 13:55:30
    T06-HEAD-01  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg    GB200-Compute  HGX-GBX00_0023_250223.1.1_custom    available  N/A               114MiB   2025-02-23, 20:20:42
    

    Note

    This will display the file names and target (such as GB200 or Switch) of all available firmware binaries. If the files do not show up with this command, they cannot be flashed by the update tool. The officially released packages will have a common filename structure starting with nvfw_DGX-GBX00_<identifier>_<date>.

  5. Confirm GB200/GB300 Tray BMC Access/Connectivity.

    1. The BMC of each node needs to be configured in BCM. This should be done at the category level. Ensure that no bmcsettings are added at the node level so that the compute trays inherit the settings from the category level.

    2. Enter cmsh and show the current BMC settings for a given node, or use the category level for GB200 compute trays, since their default passwords are all the same (for DGX).

      #category level
      category; use <dgx-category>;bmcsettings; show
      
      #device level
      device; use <device name>; bmcsettings; show
      

      Only use the device level to confirm that nothing has been set.

      If nothing has been set at the device level, the prompt shows asterisks, which cmsh uses to mark uncommitted changes:

      [bcm11-headnode->device*[a08-p1-dgx-04-c18*]->bmcsettings*]%
      
      #use this command to clear uncommitted changes
      refresh
      
    3. Populate the bmcsettings fields in the dgx-gb200 category if they are not already populated.

      cmsh;category use dgx-gb200;bmcsettings;
      set username admin
      set password <Password of choice>
      set userid 1
      set firmwaremanagemode gb200
      commit
      

      Note

      It is critical that the firmware management mode here is set to gb200.

    4. Test that the BMC is configured by reading the current FW versions.

      #at the device level
      cmsh; device use <dgx-node-name>; firmware status
      
      [maple->device[dgx-gb200-m07-c1]]% firmware status
      
      Device            Filename  Component                 Version                   State    Progress  Result  Size  Date
      ----------------  --------  ------------------------  ------------------------  -------  --------  ------  ----  ----
      dgx-gb200-m07-c1            CX7_0                     28.42.1270                current  N/A       N/A
      dgx-gb200-m07-c1            CX7_1                     28.42.1270                current  N/A       N/A
      dgx-gb200-m07-c1            CX7_2                     28.42.1270                current  N/A       N/A
      dgx-gb200-m07-c1            CX7_3                     28.42.1270                current  N/A       N/A
      dgx-gb200-m07-c1            FW_BMC_0                  GB200Nvl-24.12-8          current  N/A       N/A
      dgx-gb200-m07-c1            FW_CPLD_0                 0x00 0x0b 0x03 0x04       current  N/A       N/A
      dgx-gb200-m07-c1            FW_CPLD_1                 0x00 0x0b 0x03 0x04       current  N/A       N/A
      dgx-gb200-m07-c1            FW_CPLD_2                 0x00 0x10 0x01 0x0f       current  N/A       N/A
      dgx-gb200-m07-c1            FW_CPLD_3                 0x00 0x10 0x01 0x0f       current  N/A       N/A
      dgx-gb200-m07-c1            FW_ERoT_BMC_0             01.03.0262.0000_n04       current  N/A       N/A
      dgx-gb200-m07-c1            Full_FW_Image_NIC_Slot_4  32.42.1000                current  N/A       N/A
      dgx-gb200-m07-c1            Full_FW_Image_NIC_Slot_7  32.42.1000                current  N/A       N/A
      dgx-gb200-m07-c1            UEFI                      buildbrain-gcid-38635631  current  N/A       N/A
      
      #At the category level to see all of the compute tray FW in one shot
      cmsh; device;firmware -c dgx-gb200 status
      
      #At the rack level
      cmsh; device;firmware -r <rack location> status
      
  6. As a validation step prior to executing the flash, a dry-run command is supported to show exactly what will be changing when the firmware is applied:

    1. Perform a dry run of the BMC FW

      cmsh;device; firmware flash nvfw_DGX-GBX00_0023_241223.1.0_custom_prod-signed.fwpkg --dry-run -n <device name>
      

      The <device name> can include a range expression to apply the change to multiple devices simultaneously:

      • dgx-gb200-r1-c[1-2] - This will run the command against both dgx-gb200-r1-c1 and dgx-gb200-r1-c2

      • Device names can also be comma separated to run against multiple individual devices: dgx-gb200-r1-c1,dgx-gb200-r1-c2

      Example: Dry run output

      s03-p1-dgx-01-c06 HGX_FW_BMC_0 HGX_FW_BMC_0 GB200Nvl-25.01-D GB200Nvl-25.01-E no install good
      s03-p1-dgx-01-c06 HGX_FW_CPLD_0 HGX_FW_CPLD_0 0.1C 0.1C yes skip good
      s03-p1-dgx-01-c06 HGX_FW_CPU_0 HGX_FW_CPU_0 02.03.19 02.03.20 no install good
      s03-p1-dgx-01-c06 HGX_FW_CPU_1 HGX_FW_CPU_1 02.03.19 02.03.20 no install good
      s03-p1-dgx-01-c06 HGX_FW_ERoT_BMC_0 HGX_FW_ERoT_BMC_0 01.04.0008.0000_n04 01.04.0008.0000_n04 yes skip good
      s03-p1-dgx-01-c06 HGX_FW_ERoT_CPU_0 HGX_FW_ERoT_CPU_0 01.04.0008.0000_n04 01.04.0008.0000_n04 yes skip good
      s03-p1-dgx-01-c06 HGX_FW_ERoT_CPU_1 HGX_FW_ERoT_CPU_1 01.04.0008.0000_n04 01.04.0008.0000_n04 yes skip good
      s03-p1-dgx-01-c06 HGX_FW_ERoT_FPGA_0 HGX_FW_ERoT_FPGA_0 01.04.0008.0000_n04 01.04.0008.0000_n04 yes skip good
      s03-p1-dgx-01-c06 HGX_FW_ERoT_FPGA_1 HGX_FW_ERoT_FPGA_1 01.04.0008.0000_n04 01.04.0008.0000_n04 yes skip good
      s03-p1-dgx-01-c06 HGX_FW_FPGA_0 HGX_FW_FPGA_0 1.20 1.20 yes skip good
      s03-p1-dgx-01-c06 HGX_FW_FPGA_1 HGX_FW_FPGA_1 1.20 1.20 yes skip good
      s03-p1-dgx-01-c06 HGX_FW_GPU_0 HGX_FW_GPU_0 97.00.82.00.13 97.00.82.00.19 no install good
      s03-p1-dgx-01-c06 HGX_FW_GPU_1 HGX_FW_GPU_1 97.00.82.00.13 97.00.82.00.19 no install good
      s03-p1-dgx-01-c06 HGX_FW_GPU_2 HGX_FW_GPU_2 97.00.82.00.13 97.00.82.00.19 no install good
      s03-p1-dgx-01-c06 HGX_FW_GPU_3 HGX_FW_GPU_3 97.00.82.00.13 97.00.82.00.19 no install good
      
    2. Ensure that the values that are going to be updated are the expected versions.

  7. Start the firmware update.

    cmsh -c 'device; firmware flash nvfw_DGX-GBX00_0023_250614.1.0_custom_prod-signed.fwpkg -n <device name>'
    
  8. Once the payload has been uploaded to the node, the Result column reports good.

    [T06-HEAD-01->device]% firmware flash nvfw_DGX-GBX00_0023_250614.1.0_custom_prod-signed.fwpkg -n s03-p1-dgx-01-c{04..06}
    
    Device              flashing file                                         Result
    ------------------  ---------------------------------------------------- --------
    s03-p1-dgx-01-c04   nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg good
    s03-p1-dgx-01-c05   nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg good
    s03-p1-dgx-01-c06   nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg good
    
  9. When the command completes, check the status of the update until it has finished. The status shows a completion percentage while flashing is in progress and a success result when the flash has finished.

      cmsh -c 'device; firmware status -n <device name>'
    
      s03-p1-dgx-01-c06 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_BMC_0 GB200Nvl-25.01-D flashing 0.0% 114MiB
      s03-p1-dgx-01-c06 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_0 02.03.19 flashing 0.0% 114MiB
      s03-p1-dgx-01-c06 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_1 02.03.19 flashing 0.0% 114MiB
      s03-p1-dgx-01-c06 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_0 97.00.82.00.13 flashing 0.0% 114MiB
      s03-p1-dgx-01-c06 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_1 97.00.82.00.13 flashing 0.0% 114MiB
      s03-p1-dgx-01-c06 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_2 97.00.82.00.13 flashing 0.0% 114MiB
      s03-p1-dgx-01-c06 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_3 97.00.82.00.13 flashing 0.0%
    
      s03-p1-dgx-01-c06 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_BMC_0 GB200Nvl-25.01-D -> GB200Nvl-25.01-E pending N/A success: medium-specific reset or dc power cycle or ac power cy+ 114MiB
      s03-p1-dgx-01-c06 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_0 02.03.19 -> 02.03.20 pending N/A success: medium-specific reset or dc power cycle or ac power cy+ 114MiB
      s03-p1-dgx-01-c06 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_1 02.03.19 -> 02.03.20 pending N/A success: medium-specific reset or dc power cycle or ac power cy+ 114MiB
      s03-p1-dgx-01-c06 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_0 97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific reset or dc power cycle or ac power cy+ 114MiB
      s03-p1-dgx-01-c06 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_1 97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific reset or dc power cycle or ac power cy+ 114MiB
      s03-p1-dgx-01-c06 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_2 97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific reset or dc power cycle or ac power cy+ 114MiB
      s03-p1-dgx-01-c06 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_3 97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific reset or dc power cycle or ac power cy+ 114MiB
    
    At the end of the BMC update, the administrator can AC Cycle the GB200
    node(s) to complete the BMC update, then proceed with updating other
    components.
    

    Note

    It is important to AC/AUX cycle the target host after the CPLD and BMC updates because the BMC has limited memory and cannot store another firmware package. AC cycling clears the memory and applies changes, allowing the HMC update to proceed successfully.

  10. Perform an AC cycle after each .fwpkg firmware update completes.

    Note

    The GB200/GB300 compute tray has two levels of power:

    • Primary (system) power: This is the power supplied to the compute tray CPUs and GPUs. This must be powered off before the aux_cycle process.

    • Standby (AUX) power: This is the power supplied to the BMC and low-level components. Cycling standby power is an automated process that temporarily removes power from the compute tray, reinitializing all hardware components. The BMC will be unavailable for several minutes during the aux_cycle process. Once completed, the primary power can be toggled on again.

    Power Cycle Method 1: AUX_PWR_CYCLE using Redfish

    To perform the power cycle using Redfish API calls directly to the BMC:

    1. From the head node, power down the system:

      curl -k -u ${USER}:${PASS} -H "Content-Type: application/json" -X POST \
          -d '{"ResetType": "ForceOff"}' \
          https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset
      
    2. Perform the AC power cycle (removal of auxiliary power):

      curl -k -u ${USER}:${PASS} -H "Content-Type: application/json" -X POST \
          -d '{"ResetType":"AuxPowerCycle"}' \
          https://${BMCIP}/redfish/v1/Chassis/BMC_0/Actions/Oem/NvidiaChassis.AuxPowerReset
      
    3. After the cycle, power on the system using Redfish:

      curl -k -u ${USER}:${PASS} \
          https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset \
          -d '{"ResetType": "On"}' -X POST
      

    Examples: Powering on nodes using cmsh

    While not part of the AC power cycle itself, the following commands can be used to power on nodes after the update process as needed:

    • Power on a single compute node:

      cmsh;device;use <compute node under test>;power on
      
    • Power on multiple nodes in a category:

      cmsh;device;foreach -c dgx-gb200 (power on)
      
    • Power on all nodes in a category:

      cmsh;device;power on -c dgx-gb200
      
    • Power on specific nodes by name:

      cmsh;device;power on -n <specific nodes>
      

    Power Cycle Method 2: BCM “power auxcycle” Command (available in BCM 11.25.08 and later)

    An AC power cycle can also be performed via the BCM command line within the device context.

    1. Ensure the node is powered off:

      [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
      rf0 ...................... [   ON    ] dgx-gb200-m06-c1
      [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power off
      rf0 ...................... [   OFF   ] dgx-gb200-m06-c1
      

      Note

      If the node is still ON when executing the power auxcycle command, an error message will be returned:

      [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power auxcycle
      rf0 ...................... [  FAILED ] dgx-gb200-m06-c1 (System power is not OFF)
      
    2. After confirming the node is OFF, perform the auxiliary power cycle:

      [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power auxcycle
      rf0 ...................... [AUX CYCLE]
      
    3. During auxcycle, the BMC will be unavailable for several minutes. “power status” will indicate failure until the process is complete:

      [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
      rf0 ...................... [  FAILED ] dgx-gb200-m06-c1 (Unable to establish session)
      
    4. When auxcycle completes, the node status will return to OFF:

      [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
      rf0 ...................... [   OFF   ] dgx-gb200-m06-c1
      
    5. Power on the node:

      [BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power on
      rf0 ...................... [   ON    ] dgx-gb200-m06-c1
      
  11. If issues arise, debug output can help identify the root cause. Run the flash command with the debug options enabled:

    cmsh -c 'device; firmware flash nvfw_DGX-GBX00_0023_241223.1.0_custom_prod-signed.fwpkg -n <device name> -v --debug'
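
The dry-run review in step 6 can be partly automated by listing only the components that would change. The following is a minimal sketch, assuming each dry-run row has the eight whitespace-separated fields seen in the example output (device, component, component, current version, new version, up-to-date flag, action, result). The source shows no header row, so this column reading is an assumption, and pending_installs is a hypothetical helper name.

```shell
# Print one line per component that the dry run marks for installation.
# Assumed column layout (no header row is shown in the example output):
#   device  component  component  current  new  up-to-date  action  result
pending_installs() {
    awk '$7 == "install" { printf "%s %s %s -> %s\n", $1, $2, $4, $5 }'
}
```

Assuming the dry run can be invoked through cmsh -c in the same way as the flash command in step 7, its output could be piped through the filter: cmsh -c 'device; firmware flash <package> --dry-run -n <device name>' | pending_installs. An empty result would then suggest that nothing is scheduled for installation on the selected devices.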
    

Method 2 - Stand Alone nvfwupd Tool for Compute Tray

If the license does not support NVIDIA Mission Control, the built-in cm-nvfwupd will not work. Download the latest standalone nvfwupd tool (v2.0.7 or later) from the Enterprise Portal; see “Announcement: nvfwupd tool version” there. This method is used independently of BCM.

Note

These instructions only cover the update of a single compute tray. The standalone tool supports simultaneous upgrades of multiple systems and of multiple component types, such as compute trays and NVLink switches together. Refer to Chapters 17, 18, and 19 of the NVIDIA Firmware Update Guide that is included with the nvfwupd tool.

Get the correct firmware update packages. To see the full contents of a .fwpkg file, use the show_pkg_content command.

./nvfwupd show_pkg_content -p \
    ./nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg

Get current state of the hardware with show_version.

./nvfwupd -t ip=<rf0 ip> user=${USER} password=${PASS} \
    servertype=GB200 show_version -p ./nvfw_GB200-P4972_0012_250214.1.0_custom_prod-signed.fwpkg \
    ./nvfw_GB200-P4975_0011_250206.1.1_custom_recovery_prod-signed.fwpkg

System Model: GB200 NVL

Part number: 699-24764-0001-RC1

Serial number: 1334524170073

Packages: ['GB200-P4972_0012_250214.1.0_custom', 'GB200-P4975_0011_250206.1.1_custom_recovery']

Connection Status: Successful

Firmware Devices:

AP Name                Sys Version             Pkg Version                Up-To-Date
---------------------- ---------------------- -------------------------- ----------
CX7_0                  28.43.2108             N/A                        No
CX7_1                  28.43.2108             N/A                        No
CX7_2                  28.43.2108             N/A                        No
CX7_3                  28.43.2108             N/A                        No
FW_BMC_0               GB200Nvl-25.01-D       GB200Nvl-25.01-E           No
FW_CPLD_0              0x00 0x0b 0x03 0x04    N/A                        No
FW_CPLD_1              0x00 0x0b 0x03 0x04    N/A                        No
FW_CPLD_2              0x00 0x10 0x01 0x0f    N/A                        No
FW_CPLD_3              0x00 0x10 0x01 0x0f    N/A                        No
FW_ERoT_BMC_0          01.04.0008.0000_n04    01.04.0008.0000_n04        Yes
Full_FW_Image_NIC_Slot_4 32.43.2408           N/A                        No
Full_FW_Image_NIC_Slot_7 32.43.2408           N/A                        No
UEFI                   buildbrain-gcid-39281046 N/A                      No
HGX_FW_BMC_0           GB200Nvl-25.01-D       N/A                        No
HGX_FW_CPLD_0          0.1C                   N/A                        No
HGX_FW_CPU_0           02.03.19               N/A                        No
HGX_FW_CPU_1           02.03.19               N/A                        No
HGX_FW_ERoT_BMC_0      01.04.0008.0000_n04    01.03.0196.0001            Yes
HGX_FW_ERoT_CPU_0      01.04.0008.0000_n04    01.03.0196.0001            Yes
HGX_FW_ERoT_CPU_1      01.04.0008.0000_n04    01.03.0196.0001            Yes
HGX_FW_ERoT_FPGA_0     01.04.0008.0000_n04    01.03.0196.0001            Yes
HGX_FW_ERoT_FPGA_1     01.04.0008.0000_n04    01.03.0196.0001            Yes
HGX_FW_FPGA_0          1.20                   N/A                        No
HGX_FW_FPGA_1          1.20                   N/A                        No
HGX_FW_GPU_0           97.00.82.00.13         1.0.61.0                   No
HGX_FW_GPU_1           97.00.82.00.13         1.0.61.0                   No
HGX_FW_GPU_2           97.00.82.00.13         1.0.61.0                   No
HGX_FW_GPU_3           97.00.82.00.13         1.0.61.0                   No
HGX_InfoROM_GPU_0      G548.0201.00.06        N/A                        No
HGX_InfoROM_GPU_1      G548.0201.00.06        N/A                        No
HGX_InfoROM_GPU_2      G548.0201.00.06        N/A                        No
HGX_InfoROM_GPU_3      G548.0201.00.06        N/A                        No
HGX_PCIeSwitchConfig_0 01151024               N/A                        No
------------------------------------------------------------------------------------
Error Code: 0
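
To pick out the components that the supplied packages would actually change, the show_version table can be filtered. The following is a minimal sketch, assuming the last two whitespace-separated fields of each table row are Pkg Version and Up-To-Date, as in the output above. outdated_components is a hypothetical helper name, not part of nvfwupd.

```shell
# Print the AP Name of every row that is not up to date AND for which the
# package actually carries a version (Pkg Version is not N/A).
# Fields are counted from the end of the line because Sys Version may
# itself contain spaces (for example "0x00 0x0b 0x03 0x04").
outdated_components() {
    awk '$NF == "No" && $(NF-1) != "N/A" { print $1 }'
}
```

The show_version output could be piped straight into the filter, for example ./nvfwupd -t ... show_version -p <packages> | outdated_components. Rows that show N/A as the package version are skipped because the supplied packages contain nothing for them.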

Create the payload JSON files for the BMC and the compute tray.

Reference: UpdateBMC.json

{
    "Targets": []
}

Reference: UpdateCompute.json

{
    "Targets": ["/redfish/v1/Chassis/HGX_Chassis_0"]
}
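
The two payload files can be generated from the shell rather than typed by hand. The following is a minimal sketch; write_payload is a hypothetical helper, and the file names in the commented calls are the ones used in the references above.

```shell
# write_payload FILE TARGETS_JSON
# Writes a payload of the form {"Targets": TARGETS_JSON} to FILE.
write_payload() {
    printf '{\n    "Targets": %s\n}\n' "$2" > "$1"
}

# Example calls (commented out so that sourcing this file writes nothing):
# write_payload UpdateBMC.json     '[]'
# write_payload UpdateCompute.json '["/redfish/v1/Chassis/HGX_Chassis_0"]'
```

Whichever file names are chosen, the name passed to the -s option of update_fw has to match the file that was actually written.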

Run the BMC update first.

./nvfwupd -t ip=<rf0 ip> user=${USER} password=${PASS} servertype=GB200 \
    update_fw -s BMC_Full.json -p \
    ./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg

Power off the system, then do an AC Cycle.

./nvfwupd -t ip=<rf0 ip> user=${USER} password=${PASS} servertype=GB200 \
    activate_fw -c PWR_OFF

# wait 15 seconds

./nvfwupd -t ip=<rf0 ip> user=${USER} password=${PASS} servertype=GB200 \
    activate_fw -c RF_AUX_PWR_CYCLE
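
The power-off, wait, and AUX cycle sequence above can be wrapped in one function so that the second step only runs if the first succeeded. The following is a minimal sketch under stated assumptions: NVFWUPD and AUX_WAIT are hypothetical variables introduced here for illustration, and USER and PASS hold the BMC credentials as in the commands above.

```shell
NVFWUPD=${NVFWUPD:-./nvfwupd}   # path to the tool (assumed location)
AUX_WAIT=${AUX_WAIT:-15}        # seconds to wait between the two steps

# power_off_and_aux_cycle BMC_IP
# Runs PWR_OFF, waits, then runs RF_AUX_PWR_CYCLE.
# Stops and returns non-zero if the power-off step fails.
power_off_and_aux_cycle() {
    "$NVFWUPD" -t ip="$1" user="$USER" password="$PASS" servertype=GB200 \
        activate_fw -c PWR_OFF || return 1
    sleep "$AUX_WAIT"
    "$NVFWUPD" -t ip="$1" user="$USER" password="$PASS" servertype=GB200 \
        activate_fw -c RF_AUX_PWR_CYCLE
}
```

A call such as power_off_and_aux_cycle <rf0 ip> would then replace the two manual invocations. The sketch relies on nvfwupd returning a non-zero exit status on failure, which is an assumption worth confirming before depending on it.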

Check if the BMC update was successful.

Reference: Successful BMC update.

./nvfwupd -t ip=<rf0 ip> user=${USER} password=${PASS} servertype=GB200 update_fw -s BMC_Full.json -p ./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg

Updating ip address: ip=XXXX

FW package:
['./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg']

Ok to proceed with firmware update? <Y/N>

y

{"@odata.id": "/redfish/v1/TaskService/Tasks/3", "@odata.type":
 "#Task.v1_4_3.Task", "Id": "3", "TaskState": "Running", "TaskStatus":
 "OK"}

FW update started, Task Id: 3

Wait for Firmware Update to Start...

TaskState: Running

PercentComplete: 20

TaskStatus: OK

TaskState: Running

PercentComplete: 40

TaskStatus: OK

TaskState: Running

PercentComplete: 60

TaskStatus: OK

TaskState: Completed

PercentComplete: 100

TaskStatus: OK

Firmware update successful!

Overall Time Taken: 0:13:01

Refer to ‘NVIDIA Firmware Update Document’ on activation steps for new firmware to take effect.
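
nvfwupd reports progress itself, as the output above shows. If the session is interrupted, the Redfish task can also be queried directly. The following is a minimal sketch, assuming the BMC serves the task at the @odata.id path printed above and accepts the same credentials. task_state, wait_for_task, and POLL_INTERVAL are hypothetical names, and BMCIP, USER, and PASS are expected to be set by the caller.

```shell
# Extract the TaskState value from a Redfish task JSON document on stdin.
# Uses sed so that jq is not required for this step.
task_state() {
    sed -n 's/.*"TaskState"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# wait_for_task TASK_URI
# Polls the task until it reaches Completed (returns 0) or leaves the
# in-progress states defined by the Redfish Task schema (returns 1).
wait_for_task() {
    while :; do
        state=$(curl -sk -u "${USER}:${PASS}" "https://${BMCIP}$1" | task_state)
        case $state in
            Completed) return 0 ;;
            New|Starting|Pending|Running) sleep "${POLL_INTERVAL:-30}" ;;
            *) echo "task state: ${state:-unknown}" >&2; return 1 ;;
        esac
    done
}
```

For the run shown above, the call would be wait_for_task /redfish/v1/TaskService/Tasks/3. A Completed task still needs the activation steps that follow before the new firmware takes effect.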

Do the full compute tray flash (HGX). Ensure that the system is fully booted into its OS so that the GPU VBIOS updates can be applied.

./nvfwupd -t ip=<rf0 ip> user=${USER} password=${PASS} servertype=GB200 update_fw -s Compute_Full.json -p ./nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg

As with the BMC update, power down the system and then perform an AUX power cycle.

Power on the machine, let it provision and boot, then check the firmware levels again:

./nvfwupd -t ip=<rf0 ip> user=${USER} password=${PASS} servertype=GB200 show_version -p ./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg ./nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg

System Model: GB200 NVL

Part number: 692-13809-2404-RC1

Serial number: 1330125050101

Packages: ['DGX-GBX00_0024_250215.1.0_custom',
'HGX-GBX00_0023_250223.1.1_custom']

Connection Status: Successful

Firmware Devices:

AP Name                  Sys Version             Pkg Version                Up-To-Date
------------------------ ---------------------- -------------------------- ----------
CX7_0                    28.43.2108             N/A                        No
CX7_1                    28.43.2108             N/A                        No
CX7_2                    28.43.2108             N/A                        No
CX7_3                    28.43.2108             N/A                        No
FW_BMC_0                 GB200Nvl-25.01-E       GB200Nvl-25.01-E           Yes
FW_CPLD_0                0x00 0x0b 0x03 0x04    N/A                        No
FW_CPLD_1                0x00 0x0b 0x03 0x04    N/A                        No
FW_CPLD_2                0x00 0x10 0x01 0x0f    N/A                        No
FW_CPLD_3                0x00 0x10 0x01 0x0f    N/A                        No
FW_ERoT_BMC_0            01.04.0008.0000_n04    01.04.0008.0000_n04        Yes
Full_FW_Image_NIC_Slot_4 32.43.2408             N/A                        No
Full_FW_Image_NIC_Slot_7 32.43.2408             N/A                        No
UEFI                     buildbrain-gcid-39556194 N/A                      No
HGX_FW_BMC_0             GB200Nvl-25.01-E       GB200Nvl-25.01-E           Yes
HGX_FW_CPLD_0            0.1C                   0.1C                       Yes
HGX_FW_CPU_0             02.03.20               02.03.20                   Yes
HGX_FW_CPU_1             02.03.20               02.03.20                   Yes
HGX_FW_ERoT_BMC_0        01.04.0008.0000_n04    01.04.0008.0000_n04        Yes
HGX_FW_ERoT_CPU_0        01.04.0008.0000_n04    01.04.0008.0000_n04        Yes
HGX_FW_ERoT_CPU_1        01.04.0008.0000_n04    01.04.0008.0000_n04        Yes
HGX_FW_ERoT_FPGA_0       01.04.0008.0000_n04    01.04.0008.0000_n04        Yes
HGX_FW_ERoT_FPGA_1       01.04.0008.0000_n04    01.04.0008.0000_n04        Yes
HGX_FW_FPGA_0            1.20                   1.20                       Yes
HGX_FW_FPGA_1            1.20                   1.20                       Yes
HGX_FW_GPU_0             97.00.82.00.19         97.00.82.00.19             Yes
HGX_FW_GPU_1             97.00.82.00.19         97.00.82.00.19             Yes
HGX_FW_GPU_2             97.00.82.00.19         97.00.82.00.19             Yes
HGX_FW_GPU_3             97.00.82.00.19         97.00.82.00.19             Yes
HGX_InfoROM_GPU_0        G548.0201.00.06        N/A                        No
HGX_InfoROM_GPU_1        G548.0201.00.06        N/A                        No
HGX_InfoROM_GPU_2        G548.0201.00.06        N/A                        No
HGX_InfoROM_GPU_3        G548.0201.00.06        N/A                        No
HGX_PCIeSwitchConfig_0   01151024               N/A                        No

Applying and Verifying Firmware Update Success

First connect to the GB200 tray BMC OS, then:

  1. Power off the host.

    # Checks that the current status is on
    curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0 | jq '."PowerState"'
    
    # Shuts down the OS
    # Graceful shutdown:
    curl -k -u ${USER}:${PASS} \
        https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset \
        -d '{"ResetType": "GracefulShutdown"}' -X POST
    
    # Force power off:
    curl -k -u ${USER}:${PASS} \
        https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset \
        -d '{"ResetType": "ForceOff"}' -X POST
    
  2. AC (AUX) cycle the node.

    curl -k -u ${USER}:${PASS} \
        https://${BMCIP}/redfish/v1/Chassis/BMC_0/Actions/Oem/NvidiaChassis.AuxPowerReset \
        -d '{"ResetType":"AuxPowerCycleForce"}' -X POST
    
  3. Wait for the BMC to ping again (should take 2-3 min). Once the BMC pings, bring the host back up.

    # Checks that the current status is off (if it is 'on' no further action required)
    curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0 | jq '."PowerState"'
    
    # Power On
    curl -k -u ${USER}:${PASS} \
        https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset \
        -d '{"ResetType": "On"}' -X POST
    
  4. When the BMC and host are back up, validate that the firmware install was successful.

    cmsh -c 'device; firmware status -n <device name>'
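
The power-off, AUX cycle, wait, and power-on steps above can be chained into one script. The following is a minimal sketch under stated assumptions: redfish_post, wait_for_bmc, aux_cycle_and_power_on, OFF_WAIT, and BMC_DOWN_WAIT are hypothetical names, the wait times are illustrative defaults, and ping -W takes seconds as it does on Linux. BMCIP, USER, and PASS are expected to be set by the caller.

```shell
# redfish_post PATH JSON: POST a JSON body to the BMC.
redfish_post() {
    curl -sk -u "${USER}:${PASS}" -H "Content-Type: application/json" \
        -X POST -d "$2" "https://${BMCIP}$1"
}

# wait_for_bmc TRIES DELAY: ping the BMC until it answers or TRIES run out.
wait_for_bmc() {
    tries=$1
    while [ "$tries" -gt 0 ]; do
        ping -c 1 -W 2 "$BMCIP" >/dev/null 2>&1 && return 0
        tries=$((tries - 1))
        sleep "$2"
    done
    return 1
}

# Force the host off, AUX cycle the tray, wait for the BMC, power on.
aux_cycle_and_power_on() {
    redfish_post /redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset \
        '{"ResetType": "ForceOff"}' || return 1
    sleep "${OFF_WAIT:-15}"          # let the host finish powering off
    redfish_post /redfish/v1/Chassis/BMC_0/Actions/Oem/NvidiaChassis.AuxPowerReset \
        '{"ResetType":"AuxPowerCycleForce"}' || return 1
    sleep "${BMC_DOWN_WAIT:-30}"     # give the BMC time to go down first
    wait_for_bmc 30 10 || return 1   # up to about 5 minutes of retries
    redfish_post /redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset \
        '{"ResetType": "On"}'
}
```

The sketch uses ForceOff for brevity; where the OS is running, the GracefulShutdown request from step 1 is the gentler choice. The firmware status check in step 4 still has to be run afterwards to confirm the result.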