GB200 Rack Firmware Update#
Rack Firmware Updates#
Firmware updates using Base Command Manager (BCM) 11 software for a GB200 NVL72 rack can be done once all the GB200 compute trays, NVLink Switch trays, and power shelves are up in BCM. The latest FW/SW recipe must be followed for the installation on all devices to be successful. There are also methods to update the firmware using the standalone nvfwupd tool. This section provides instructions to upgrade the firmware for each major GB200 rack component (compute tray, NVLink switch, power shelf).
Note
FW packages for DGX SuperPOD are unique and different from the GB200 reference architecture package.
Reference: DGX GB200 Compute Tray Files Required for Update on DGX SuperPOD (As of BCM 11 1.2 GA)
Component | DGX FW Recipe Version | Filename |
---|---|---|
DGX GB200 SW/FW Release Notes | 1.0.00 | |
Compute BMC bundle | | nvfw_DGX-GBX00_0023_<date>.*_custom_prod-signed.fwpkg |
Compute HMC bundle | | nvfw_HGX-GBX00_0023_<date>.*_custom_prod-signed.fwpkg |
BF3 | 32.44.1600 | fw-Bluefield-3-rel-32_44_1600.*.bin |
CX7 | 28.44.2506 | fw-ConnectX7-rel-28_44_2506.*.bin |
MFT | 4.31.0-6012 | mft-4.31.0-6012.*.tgz |
Switch NVOS | 25.02.2151 | nvos-amd64-25.02.2151.bin |
Switch BMC bundle | | nvfw_GB200-P4978_0004.*.fwpkg |
Switch BIOS bundle | | nvfw_GB200-P4978_0006.*.fwpkg |
Switch CPLD bundle | | nvfw_GB200-P4978_0007.*.fwpkg |
Switch ONIE | 5.3.0013 | onie-updater-x86_64.*.unsigned |
Powershelf PSU | 0104 | NVIDIA_5500_APP_0104.*.tar |
Powershelf PMC | 3.1.3 | common-pmc-3.1.3.*tar |
GB200 compute tray firmware update—general steps
Obtain the compute tray package.
Place the files in /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200.
Confirm that in the Compute device bmcsettings, the firmware management mode is set to GB200.
Check the current node’s FW versions against the update packages.
Execute a dry-run to confirm the FW will update to the expected versions.
Update the BMC package first (Compute BMC bundle), then the compute tray package (Compute HMC bundle). AC power-cycle the trays after each component update is complete.
NVLink Switch tray firmware update—General Steps
Obtain the NVLink Switch firmware.
Place the files in /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200sw.
Confirm that in the NVLink Switch device bmcsettings, the firmware management mode is set to GB200sw.
Check the current NVLink Switch firmware versions against the update packages.
Execute a dry-run to confirm the firmware will update to the expected version.
Update the tray level firmware first in this order:
BMC+FPGA+ERoT (Switch BMC bundle)
CPLD1 CPLD2 CPLD3 CPLD4 (Switch CPLD bundle)
SBIOS+EROT (Switch BIOS bundle)
Update the NVOS from within the OS or using ZTP. (Switch NVOS)
Reboot the switch trays after each component update is complete, to apply and activate the new firmware.
Note
Firmware updates for the GB200 compute trays and NVLink Switch can be done using:
BCM 11 integrated firmware update manager
Standalone nvfwupd tool
Compute Tray Firmware Update Process#
Method 1—BCM/NVIDIA Mission Control integrated firmware update for compute tray#
To use the firmware update tool in BCM 11, an NVIDIA Mission Control enabled license must be registered:
Place firmware update packages in the correct BCM directory.
/cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200
Copy the prod-signed.fwpkg images up to the BCM head node. The files must be placed in the following directory to be visible to the firmware command.
scp <binary files> user@<head node>:/cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200
Reference: BCM file directory structure for firmware updates.
/cm/local/apps/cmd/etc/htdocs/bios/firmware/
README.md b200/ gb200/ gb200sw/ gh200/ h100/ ilo/
#The gb200 folder is for compute tray firmware, the gb200sw folder is for NVLink Switch firmware
Use the firmware info command in BCM to gather information on the current firmware levels of the nodes. This command provides details about the files and what their purpose is.
$ cmsh -c 'device; firmware info'
[BCM11-HEAD-01->device]% firmware info
Device Filename Component Version State Progress Result Size Date
------------- ------------------------------------------------ ------------- ------------------------------ ---------- --------- -------- --------- ---------------------
BCM11-HEAD-01 nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg GB200-BMC DGX-GBX00_0024_250215.1.0_custom available N/A - 64MiB 2025-02-15, 16:39:41
BCM11-HEAD-01 nvfw_GB200-P4978_0004_250213.1.0_prod-signed.fwpkg GB200-Switch GB200-P4978_0004_250213.1.0 available N/A - 75MiB 2025-02-13, 10:23:28
BCM11-HEAD-01 nvfw_GB200-P4978_0006_250205.1.0_prod-signed.fwpkg GB200-Switch GB200-P4978_0006_250205.1.0 available N/A - 16.2MiB 2025-02-05, 15:11:49
BCM11-HEAD-01 nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg GB200-Switch GB200-P4978_0007_250121.1.2_custom available N/A - 1.64MiB 2025-01-21, 13:55:30
BCM11-HEAD-01 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg GB200-Compute HGX-GBX00_0023_250223.1.1_custom available N/A - 114MiB 2025-02-23, 20:20:42
Note
This will display the file names and target (such as GB200 or Switch) of all available firmware binaries. If the files do not show up with this command, they cannot be flashed by the BCM firmware manager. The officially released packages will have a common filename structure starting with nvfw_DGX-GBX00_<identifier>_<date>.
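The filename structure described in the note above can be checked before copying files to the head node. The following is a minimal sketch, not part of BCM: the function name parse_fwpkg_name is hypothetical, and it assumes that the identifier and the leading part of the date field are purely numeric, as they are in the example filenames on this page.

```shell
# Split a firmware package filename of the form
#   nvfw_<platform>_<identifier>_<date>...fwpkg
# into its leading fields. Prints "platform identifier date" and returns 0,
# or prints nothing and returns non-zero if the name does not match.
# (Hypothetical helper; the numeric-field assumption is ours.)
parse_fwpkg_name() {
    name=${1##*/}    # drop any directory prefix
    printf '%s\n' "$name" |
        sed -n 's/^nvfw_\([^_][^_]*\)_\([0-9][0-9]*\)_\([0-9][0-9]*\)\..*\.fwpkg$/\1 \2 \3/p' |
        grep .       # non-zero exit status when sed printed nothing
}

parse_fwpkg_name nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg
# prints: DGX-GBX00 0024 250215
```

A file that fails this check is not necessarily invalid; the firmware info command remains the authoritative test of whether BCM can see a package.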
Confirm GB200 Tray BMC Access/Connectivity.
The BMC of each node needs to be configured in BCM. This should be done at the category level. Ensure that no BMC settings are added at the node level so that the compute trays inherit the settings from the category level.
Enter cmsh and show the current BMC settings for a given node or use the category level for GB200 compute trays since all their default passwords are the same (for DGX).
#category level
category; use <dgx-category>;bmcsettings; show
#device level
device; use <device name>; bmcsettings; show
Use the device level only to confirm that nothing has been set. If the settings have never been set on the device, the prompt is marked with asterisks:
[bcm11-headnode->device*[s03-p1-dgx-01-c06*]->bmcsettings*]%
#Use this command to clear uncommitted changes
refresh
Populate the bmcsettings fields in the dgx-gb200 category if they are not already populated.
$ cmsh;category use dgx-gb200;bmcsettings;
set username root
set password 0penBmc # Or whatever the password is
set userid 1
set firmwaremanagemode gb200
commit
Note
It is critical that the firmware management mode here is set to gb200.
Test that the BMC is configured by reading the current FW component versions.
#At the specific device level
$ cmsh -c 'device use <dgx-node-name>; firmware status'
[BCM11-HEAD-01->device[s03-p1-dgx-01-c06]]% firmware status
Device Filename Component Version State Progress Result Size Date
--------------------- ------------------------- -------------------------- -------------------------- --------- --------- ------- ----- -----
s03-p1-dgx-01-c06 CX7_0 28.42.1270 current N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06 CX7_1 28.42.1270 current N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06 CX7_2 28.42.1270 current N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06 CX7_3 28.42.1270 current N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06 FW_BMC_0 GB200Nvl-24.12-8 current N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06 FW_CPLD_0 0x00 0x0b 0x03 0x04 current N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06 FW_CPLD_1 0x00 0x0b 0x03 0x04 current N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06 FW_CPLD_2 0x00 0x10 0x01 0x0f current N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06 FW_CPLD_3 0x00 0x10 0x01 0x0f current N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06 FW_ERoT_BMC_0 01.03.0262.0000_n04 current N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06 Full_FW_Image_NIC_Slot_4 32.42.1000 current N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06 Full_FW_Image_NIC_Slot_7 32.42.1000 current N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06 UEFI buildbrain-gcid-38635631 current N/A N/A N/A N/A N/A
#Alternatively, from the device mode, look at a specific device
cmsh -c 'device; firmware status -n s03-p1-dgx-01-c06'
#At the category level to see all of the compute tray FW in one shot
cmsh -c 'device; firmware status -c dgx-gb200'
#At the rack level
cmsh -c 'device; firmware status -r <rack location>'
As a validation step prior to executing the flash operation, the dry-run option will show exactly what is changing when the firmware is flashed:
Perform a flash dry-run of the BMC firmware.
cmsh -c 'device; firmware flash nvfw_DGX-GBX00_0023_241223.1.0_custom_prod-signed.fwpkg --dry-run -n <device name>'
#The <device name> can contain a bracketed range to apply the change to multiple devices simultaneously:
#s03-p1-dgx-01-c0[1-2] runs the command against both s03-p1-dgx-01-c01 and s03-p1-dgx-01-c02
#Device names can also be comma separated to run against multiple individual devices, for example s03-p1-dgx-01-c01,s03-p1-dgx-01-c02
Example: Dry run output
Device             Component           Target              Version              Package version      Up to date  Action   Result  Error
-----------------  ------------------  ------------------  -------------------  -------------------  ----------  -------  ------  -----
s03-p1-dgx-01-c06  HGX_FW_BMC_0        HGX_FW_BMC_0        GB200Nvl-25.01-D     GB200Nvl-25.01-E     no          install  good
s03-p1-dgx-01-c06  HGX_FW_CPU_0        HGX_FW_CPU_0        02.03.19             02.03.20             no          install  good
s03-p1-dgx-01-c06  HGX_FW_CPU_1        HGX_FW_CPU_1        02.03.19             02.03.20             no          install  good
s03-p1-dgx-01-c06  HGX_FW_ERoT_BMC_0   HGX_FW_ERoT_BMC_0   01.04.0008.0000_n04  01.04.0008.0000_n04  yes         skip     good
s03-p1-dgx-01-c06  HGX_FW_ERoT_CPU_0   HGX_FW_ERoT_CPU_0   01.04.0008.0000_n04  01.04.0008.0000_n04  yes         skip     good
s03-p1-dgx-01-c06  HGX_FW_ERoT_CPU_1   HGX_FW_ERoT_CPU_1   01.04.0008.0000_n04  01.04.0008.0000_n04  yes         skip     good
s03-p1-dgx-01-c06  HGX_FW_ERoT_FPGA_0  HGX_FW_ERoT_FPGA_0  01.04.0008.0000_n04  01.04.0008.0000_n04  yes         skip     good
s03-p1-dgx-01-c06  HGX_FW_ERoT_FPGA_1  HGX_FW_ERoT_FPGA_1  01.04.0008.0000_n04  01.04.0008.0000_n04  yes         skip     good
s03-p1-dgx-01-c06  HGX_FW_FPGA_0       HGX_FW_FPGA_0       1.20                 1.20                 yes         skip     good
s03-p1-dgx-01-c06  HGX_FW_FPGA_1       HGX_FW_FPGA_1       1.20                 1.20                 yes         skip     good
s03-p1-dgx-01-c06  HGX_FW_GPU_0        HGX_FW_GPU_0        97.00.82.00.13       97.00.82.00.19       no          install  good
s03-p1-dgx-01-c06  HGX_FW_GPU_1        HGX_FW_GPU_1        97.00.82.00.13       97.00.82.00.19       no          install  good
s03-p1-dgx-01-c06  HGX_FW_GPU_2        HGX_FW_GPU_2        97.00.82.00.13       97.00.82.00.19       no          install  good
s03-p1-dgx-01-c06  HGX_FW_GPU_3        HGX_FW_GPU_3        97.00.82.00.13       97.00.82.00.19       no          install  good
Ensure that the components that are not up to date will be updated to the expected package versions.
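The dry-run review can be partly scripted. The following is a minimal sketch, assuming the dry-run output is whitespace-separated with the column order of the example output (Component in field 2, Version in field 4, Package version in field 5, Action in field 7). The function name pending_installs is hypothetical and is not a BCM command.

```shell
# Print "component: current -> package" for every dry-run row whose
# Action column is "install". Reads the dry-run table on stdin.
# (Hypothetical helper; relies on the column order shown in the example.)
pending_installs() {
    awk '$7 == "install" { printf "%s: %s -> %s\n", $2, $4, $5 }'
}

# Demo on two rows taken from the example output.
pending_installs <<'EOF'
Device             Component      Target         Version   Package version  Up to date  Action   Result
s03-p1-dgx-01-c06  HGX_FW_CPU_0   HGX_FW_CPU_0   02.03.19  02.03.20         no          install  good
s03-p1-dgx-01-c06  HGX_FW_FPGA_0  HGX_FW_FPGA_0  1.20      1.20             yes         skip     good
EOF
# prints: HGX_FW_CPU_0: 02.03.19 -> 02.03.20
```

In practice the function would be fed from the dry-run command itself, for example `cmsh -c 'device; firmware flash <package> --dry-run -n <device name>' | pending_installs`. If a future BCM release changes the column layout, the field numbers need to be adjusted.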
Start the firmware update.
$ cmsh -c 'device; firmware flash nvfw_DGX-GBX00_0023_241223.1.0_custom_prod-signed.fwpkg -n <device name>'
Once the payload is uploaded to the node, the Result column shows good.
[BCM11-HEAD-01->device]% firmware flash nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg -n s03-p1-dgx-01-c[04-06]
Device Firmware Package Result
------------------- ---------------------------------------------------- -------
s03-p1-dgx-01-c04 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg good
s03-p1-dgx-01-c05 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg good
s03-p1-dgx-01-c06 nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg good
When the command completes, periodically check the status of the update until it has completed.
The status shows a percentage while flashing is in progress and a completion message once the flash has finished.
$ cmsh -c 'device; firmware status -n <device name>'
s03-p1-dgx-01-c06
nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_BMC_0
GB200Nvl-25.01-D flashing 0.0% 114MiB
s03-p1-dgx-01-c06
nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_0
02.03.19 flashing 0.0% 114MiB
s03-p1-dgx-01-c06
nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_1
02.03.19 flashing 0.0% 114MiB
s03-p1-dgx-01-c06
nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_0
97.00.82.00.13 flashing 0.0% 114MiB
s03-p1-dgx-01-c06
nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_1
97.00.82.00.13 flashing 0.0% 114MiB
s03-p1-dgx-01-c06
nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_2
97.00.82.00.13 flashing 0.0% 114MiB
s03-p1-dgx-01-c06
nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_3
97.00.82.00.13 flashing 0.0%
s03-p1-dgx-01-c06
nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_BMC_0
GB200Nvl-25.01-D -> GB200Nvl-25.01-E pending N/A success:
medium-specific reset or dc power cycle or ac power cy+ 114MiB
s03-p1-dgx-01-c06
nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_0
02.03.19 -> 02.03.20 pending N/A success: medium-specific reset or dc
power cycle or ac power cy+ 114MiB
s03-p1-dgx-01-c06
nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_CPU_1
02.03.19 -> 02.03.20 pending N/A success: medium-specific reset or dc
power cycle or ac power cy+ 114MiB
s03-p1-dgx-01-c06
nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_0
97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific
reset or dc power cycle or ac power cy+ 114MiB
s03-p1-dgx-01-c06
nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_1
97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific
reset or dc power cycle or ac power cy+ 114MiB
s03-p1-dgx-01-c06
nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_2
97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific
reset or dc power cycle or ac power cy+ 114MiB
s03-p1-dgx-01-c06
nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg HGX_FW_GPU_3
97.00.82.00.13 -> 97.00.82.00.19 pending N/A success: medium-specific
reset or dc power cycle or ac power cy+ 114MiB
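The periodic status check can be wrapped in a small helper. The following is a minimal sketch, assuming the State values flashing and pending appear as separate words in the firmware status output, as in the examples above. The function name flash_state, the NODE variable, and the 60-second interval are hypothetical.

```shell
# Classify `firmware status` output read from stdin:
#   flashing - at least one component is still being written
#   pending  - nothing is flashing, but an activation step is pending
#   idle     - neither state appears in the output
# (Hypothetical helper; it only looks for the State keywords.)
flash_state() {
    awk '
        { for (i = 1; i <= NF; i++) {
              if ($i == "flashing") flashing = 1
              if ($i == "pending")  pending = 1
          } }
        END {
            if (flashing)     print "flashing"
            else if (pending) print "pending"
            else              print "idle"
        }'
}

# Possible polling loop (not executed here; NODE is an assumed variable):
#   while [ "$(cmsh -c "device; firmware status -n $NODE" | flash_state)" = flashing ]; do
#       sleep 60
#   done
```

A result of pending means the flash has finished and the activation step named in the Result column still has to be carried out.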
At the end of the BMC update, the administrator must activate the installed firmware, then proceed with updating other components. The success message will indicate the operation required to activate the installed firmware.
* power aux_cycle: ac power cycle
* power reset: medium_specific_reset or dc power cycle
* bmcreset: reset bmc
Activating firmware using the AC power cycle
Note
The GB200 compute tray has two levels of power.
1. The primary (system) power is the power supplied to the compute tray CPUs and GPUs. This must be powered off before the aux_cycle process.
2. The standby (AUX) power is the power that is supplied to the BMC and low-level components. Cycling standby power is an automated process that temporarily removes power from the compute tray, reinitializing all hardware components. The BMC will be unavailable for several minutes during the aux_cycle process. Once completed, the primary power can be toggled on again.
Once both components have completed the firmware update, perform the AC power cycle using either of the following two methods.
Power Cycle Method 1—by AUX_PWR_CYCLE (Redfish)
#From the head node, do this first to power down the system
curl -k -u ${USER}:${PASS} -H "Content-Type: application/json" -X POST -d '{"ResetType": "ForceOff"}' https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset
#Do this next to effectively AC Power cycle (removal of auxiliary power)
curl -k -u ${USER}:${PASS} -H "Content-Type: application/json" -X POST -d '{"ResetType":"AuxPowerCycle"}' https://${BMCIP}/redfish/v1/Chassis/BMC_0/Actions/Oem/NvidiaChassis.AuxPowerReset
#Use redfish to power on
curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset -d '{"ResetType": "On"}' -X POST
#Or use cmsh to power on the node
cmsh;device;use <compute node under test>;power on
#Or to do multiples
cmsh;device;foreach -c dgx-gb200 (power on)
#or
cmsh;device;power on -c dgx-gb200 #this does all nodes in the category
cmsh;device;power on -n <specific nodes>
Power Cycle Method 2—by BCM power auxcycle command (available in 11.25.07 and later)
#From the cmsh device context, first power off the node
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
rf0 ...................... [ ON ] dgx-gb200-m06-c1
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power off
rf0 ...................... [ OFF ] dgx-gb200-m06-c1
#Note: if the node is still ON when the power auxcycle command is executed, you will get an error message
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
rf0 ...................... [ ON ] dgx-gb200-m06-c1
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power auxcycle
rf0 ...................... [ FAILED ] dgx-gb200-m06-c1 (System power is not OFF)
#After the node is powered OFF, execute the power auxcycle command
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
rf0 ...................... [ OFF ] dgx-gb200-m06-c1
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power auxcycle
rf0 ...................... [AUX CYCLE]
#The auxcycle will make the BMC unavailable for several minutes, therefore the power status command will fail
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
rf0 ...................... [ FAILED ] dgx-gb200-m06-c1 (Unable to establish session)
#After the auxcycle process is complete, the BMC will be available again and the power status command will succeed, reporting that the primary power is OFF
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
rf0 ...................... [ OFF ] dgx-gb200-m06-c1
#Finally power on the node
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power on
rf0 ...................... [ ON ] dgx-gb200-m06-c1
Note
In ipmitool, the Reset command performs a warm-reset which is equivalent to Ctrl-Alt-Del. The power cycle reset is the same as pressing the power button to turn the machine off, followed by pressing the power button again to turn the machine on. Keep in mind this will not activate ERoT, CPLD, or FPGA components.
If issues arise, debug output can help identify the root cause. Run the flash command with the debug options enabled:
$ cmsh -c 'device; firmware flash nvfw_DGX-GBX00_0023_241223.1.0_custom_prod-signed.fwpkg -n <device name> -v --debug'
Method 2—Standalone nvfwupd tool for compute tray#
If the license does not support NVIDIA Mission Control, the built-in cm-nvfwupd command will not work.
* Download the standalone nvfwupd tool from the enterprise support portal. This tool can be used independently of BCM.
* Or install the nvfwupd package from the cuda apt repository.
Get the correct firmware update packages for the update. To see the full contents of a .fwpkg file, use the show_pkg_content command.
$ ./nvfwupd show_pkg_content -p ./nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg
Get current state of the hardware with show_version.
root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 show_version \
    -p ./nvfw_GB200-P4972_0012_250214.1.0_custom_prod-signed.fwpkg ./nvfw_GB200-P4975_0011_250206.1.1_custom_recovery_prod-signed.fwpkg
System Model: GB200 NVL
Part number: 699-24764-0001-RC1
Serial number: 1334524170073
Packages: ['GB200-P4972_0012_250214.1.0_custom',
'GB200-P4975_0011_250206.1.1_custom_recovery']
Connection Status: Successful
Firmware Devices:
AP Name Sys Version Pkg Version Up-To-Date
----------------------- ----------------------- ------------------------- ----------
CX7_0 28.43.2108 N/A No
CX7_1 28.43.2108 N/A No
CX7_2 28.43.2108 N/A No
CX7_3 28.43.2108 N/A No
FW_BMC_0 GB200Nvl-25.01-D GB200Nvl-25.01-E No
FW_CPLD_0 0x00 0x0b 0x03 0x04 N/A No
FW_CPLD_1 0x00 0x0b 0x03 0x04 N/A No
FW_CPLD_2 0x00 0x10 0x01 0x0f N/A No
FW_CPLD_3 0x00 0x10 0x01 0x0f N/A No
FW_ERoT_BMC_0 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
Full_FW_Image_NIC_Slot_4 32.43.2408 N/A No
Full_FW_Image_NIC_Slot_7 32.43.2408 N/A No
UEFI buildbrain-gcid-39281046 N/A No
HGX_FW_BMC_0 GB200Nvl-25.01-D N/A No
HGX_FW_CPLD_0 0.1C N/A No
HGX_FW_CPU_0 02.03.19 N/A No
HGX_FW_CPU_1 02.03.19 N/A No
HGX_FW_ERoT_BMC_0 01.04.0008.0000_n04 01.03.0196.0001 Yes
HGX_FW_ERoT_CPU_0 01.04.0008.0000_n04 01.03.0196.0001 Yes
HGX_FW_ERoT_CPU_1 01.04.0008.0000_n04 01.03.0196.0001 Yes
HGX_FW_ERoT_FPGA_0 01.04.0008.0000_n04 01.03.0196.0001 Yes
HGX_FW_ERoT_FPGA_1 01.04.0008.0000_n04 01.03.0196.0001 Yes
HGX_FW_FPGA_0 1.20 N/A No
HGX_FW_FPGA_1 1.20 N/A No
HGX_FW_GPU_0 97.00.82.00.13 1.0.61.0 No
HGX_FW_GPU_1 97.00.82.00.13 1.0.61.0 No
HGX_FW_GPU_2 97.00.82.00.13 1.0.61.0 No
HGX_FW_GPU_3 97.00.82.00.13 1.0.61.0 No
HGX_InfoROM_GPU_0 G548.0201.00.06 N/A No
HGX_InfoROM_GPU_1 G548.0201.00.06 N/A No
HGX_InfoROM_GPU_2 G548.0201.00.06 N/A No
HGX_InfoROM_GPU_3 G548.0201.00.06 N/A No
HGX_PCIeSwitchConfig_0 01151024 N/A No
-----------------------------------------------------------------------------------------------
Error Code: 0
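To pick out only the components that the supplied packages would actually change, the show_version table can be filtered. The following is a minimal sketch, assuming the last two whitespace-separated fields of each row are Pkg Version and Up-To-Date, as in the output above. The function name needs_update is hypothetical and is not an nvfwupd subcommand.

```shell
# List components that the supplied packages would change: rows whose last
# column (Up-To-Date) is "No" and whose Pkg Version is not "N/A".
# Counting fields from the end keeps multi-word Sys Version values
# (for example the CPLD rows) from shifting the columns.
needs_update() {
    awk '$NF == "No" && $(NF-1) != "N/A" { print $1 }'
}

# Possible use (not executed here):
#   ./nvfwupd -t ip=<rf0 ip> user=root password=<password> servertype=GB200 \
#       show_version -p <package files> | needs_update
```

Rows that show N/A in the Pkg Version column are not covered by the supplied packages, so they are left out even though their Up-To-Date value is No.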
Create the payload JSON files for the BMC and the compute tray:
Reference: UpdateBMC.json for updating BMC:
{
"Targets" :[]
}
Reference: UpdateCompute.json for updating HGX:
{
"Targets" :["/redfish/v1/Chassis/HGX_Chassis_0"]
}
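The two payload files can be generated rather than typed by hand. The following is a minimal sketch whose file contents are taken from the references above; the function name write_payloads and the optional target directory argument are hypothetical.

```shell
# Write UpdateBMC.json and UpdateCompute.json into a directory
# (default: the current directory). Existing files are overwritten.
# (Hypothetical helper; only the JSON content comes from this guide.)
write_payloads() {
    dir=${1:-.}
    printf '%s\n' '{' '    "Targets": []' '}' \
        > "$dir/UpdateBMC.json"
    printf '%s\n' '{' '    "Targets": ["/redfish/v1/Chassis/HGX_Chassis_0"]' '}' \
        > "$dir/UpdateCompute.json"
}

# Possible use (not executed here): write_payloads /path/to/nvfwupd/dir
```

Keeping both files next to the nvfwupd binary lets the update_fw commands refer to them by bare filename.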
Run the BMC update first.
./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 update_fw -s UpdateBMC.json \
    -p ./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg
Power off the system, then do an AUX Power cycle.
./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 activate_fw -c PWR_OFF
#wait 15 seconds
./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 activate_fw -c RF_AUX_PWR_CYCLE
Check if the BMC update was successful.
Reference: Successful BMC update:
root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 update_fw -s UpdateBMC.json \
    -p ./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg
Updating ip address: ip=XXXX
FW package:
['./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg']
Ok to proceed with firmware update? <Y/N>
y
{"@odata.id": "/redfish/v1/TaskService/Tasks/3", "@odata.type":
"#Task.v1_4_3.Task", "Id": "3", "TaskState": "Running", "TaskStatus":
"OK"}
FW update started, Task Id: 3
Wait for Firmware Update to Start...
TaskState: Running
PercentComplete: 20
TaskStatus: OK
TaskState: Running
PercentComplete: 40
TaskStatus: OK
TaskState: Running
PercentComplete: 60
TaskStatus: OK
TaskState: Completed
PercentComplete: 100
TaskStatus: OK
Firmware update successful!
Overall Time Taken: 0:13:01
Refer to 'NVIDIA Firmware Update Document' on activation steps for new
firmware to take effect.
----------------------------------------------------------------------
Error Code: 0
Do the full compute tray flash. Ensure that the system is fully up and booted into its OS so that the GPU VBIOS updates can be applied.
./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 update_fw -s UpdateCompute.json \
    -p ./nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg
As with the BMC update, power down the system and then do an AUX power cycle.
Power on the machine, let it provision/boot up, then check the firmware level again.
root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=10.78.194.13 user=root password=0penBmc servertype=GB200 show_version \
    -p ./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg ./nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg
System Model: GB200 NVL
Part number: 692-13809-2404-RC1
Serial number: 1330125050101
Packages: ['DGX-GBX00_0024_250215.1.0_custom',
'HGX-GBX00_0023_250223.1.1_custom']
Connection Status: Successful
Firmware Devices:
AP Name Sys Version Pkg Version Up-To-Date
------------------------- ------------------------ -------------------------- ----------
CX7_0 28.43.2108 N/A No
CX7_1 28.43.2108 N/A No
CX7_2 28.43.2108 N/A No
CX7_3 28.43.2108 N/A No
FW_BMC_0 GB200Nvl-25.01-E GB200Nvl-25.01-E Yes
FW_CPLD_0 0x00 0x0b 0x03 0x04 N/A No
FW_CPLD_1 0x00 0x0b 0x03 0x04 N/A No
FW_CPLD_2 0x00 0x10 0x01 0x0f N/A No
FW_CPLD_3 0x00 0x10 0x01 0x0f N/A No
FW_ERoT_BMC_0 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
Full_FW_Image_NIC_Slot_4 32.43.2408 N/A No
Full_FW_Image_NIC_Slot_7 32.43.2408 N/A No
UEFI buildbrain-gcid-39556194 N/A No
HGX_FW_BMC_0 GB200Nvl-25.01-E GB200Nvl-25.01-E Yes
HGX_FW_CPLD_0 0.1C 0.1C Yes
HGX_FW_CPU_0 02.03.20 02.03.20 Yes
HGX_FW_CPU_1 02.03.20 02.03.20 Yes
HGX_FW_ERoT_BMC_0 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
HGX_FW_ERoT_CPU_0 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
HGX_FW_ERoT_CPU_1 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
HGX_FW_ERoT_FPGA_0 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
HGX_FW_ERoT_FPGA_1 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
HGX_FW_FPGA_0 1.20 1.20 Yes
HGX_FW_FPGA_1 1.20 1.20 Yes
HGX_FW_GPU_0 97.00.82.00.19 97.00.82.00.19 Yes
HGX_FW_GPU_1 97.00.82.00.19 97.00.82.00.19 Yes
HGX_FW_GPU_2 97.00.82.00.19 97.00.82.00.19 Yes
HGX_FW_GPU_3 97.00.82.00.19 97.00.82.00.19 Yes
HGX_InfoROM_GPU_0 G548.0201.00.06 N/A No
HGX_InfoROM_GPU_1 G548.0201.00.06 N/A No
HGX_InfoROM_GPU_2 G548.0201.00.06 N/A No
HGX_InfoROM_GPU_3 G548.0201.00.06 N/A No
HGX_PCIeSwitchConfig_0 01151024 N/A No
Applying and verifying firmware update success#
After all required firmware is installed, the compute node needs an AC cycle to fully apply the updates. This procedure can be used to bring the nodes down and back up. First connect to the GB200 tray BMC OS, then:
Power off the host.
# Checks that the current status is on
curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0 | jq '."PowerState"'
# Shuts down the OS (graceful shutdown)
curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset -d '{"ResetType": "GracefulShutdown"}' -X POST
# Force power off
curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset -d '{"ResetType": "ForceOff"}' -X POST
AC cycle the node.
curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Chassis/BMC_0/Actions/Oem/NvidiaChassis.AuxPowerReset -d '{"ResetType":"AuxPowerCycleForce"}' -X POST
Wait for the BMC to ping again (should take 2-3 min). Once the BMC pings, bring the host back up.
# Checks that the current status is off (if it is 'on', no further action is required)
curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0 | jq '."PowerState"'
# Power on
curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset -d '{"ResetType": "On"}' -X POST
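The wait for the BMC to respond again after the AC cycle can be scripted instead of checked by hand. The following is a minimal sketch of a generic retry helper. The function name wait_until, the 600-second timeout, and the 10-second interval are assumptions; the guide only states that the BMC should answer pings again after about 2-3 minutes.

```shell
# Retry a probe command until it succeeds or the timeout (seconds) expires.
# Usage: wait_until <timeout> <interval> <command> [args...]
# Returns 0 as soon as the probe succeeds, 1 once the timeout is reached.
# (Hypothetical helper, not part of BCM or nvfwupd.)
wait_until() {
    timeout=$1 interval=$2
    shift 2
    elapsed=0
    until "$@"; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}

# Possible use (not executed here; BMCIP as in the commands above):
#   wait_until 600 10 ping -c 1 "$BMCIP" || echo "BMC did not come back" >&2
```

A successful ping only shows that the BMC network stack is up; the PowerState query above is still the check that confirms Redfish is answering before the host is powered on.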
When the BMC and host are back up, validate that the firmware install was successful.
cmsh -c 'device; firmware status -n <device name>'
NVLink Switch Firmware Update Process#
For the NVLink Switch, the firmware updates consist of the firmware of the switch itself and the NVOS software.
NVLink Switch tray assumptions#
Non-scale-out design (NVL72x1)—all NVLink ports are connected to MN-NVLink cable cartridge.
All tray interfaces are set to receive IPs through DHCP.
All steps in Chapters 2, 3, and 4 have been completed.
The rack inventory import process or manual entry process must be completed, and all switch entries must appear in the cmsh devices list.
Example: NVLink Switch BCM switch device list
root@BCM11-HEAD-01:~# cmsh -c "device; list -t switch -f hostname:15,mac:20,ip:12,status:11 | grep -i nvsw"
S03-P1-NVSW-01 E0:9D:73:F0:4C:DE 10.78.195.1 [ UP ]+
S03-P1-NVSW-02 E0:9D:73:3F:EB:28 10.78.195.2 [ UP ]+
S03-P1-NVSW-03 E0:9D:73:3F:E7:30 10.78.195.3 [ UP ]+
S03-P1-NVSW-04 E0:9D:73:3F:EA:C8 10.78.195.4 [ UP ]+
S03-P1-NVSW-05 E0:9D:73:3F:E4:F0 10.78.195.5 [ UP ]+
S03-P1-NVSW-06 E0:9D:73:3F:E2:C8 10.78.195.6 [ UP ]+
S03-P1-NVSW-07 E0:9D:73:3F:E2:50 10.78.195.7 [ UP ]+
S03-P1-NVSW-08 E0:9D:73:3F:E5:18 10.78.195.8 [ UP ]+
S03-P1-NVSW-09 E0:9D:73:3F:E4:F8 10.78.195.9 [ UP ]+
S04-P1-NVSW-01 E0:9D:73:F0:41:4E 10.78.195.31 [ UP ]+
S04-P1-NVSW-02 E0:9D:73:F0:59:16 10.78.195.32 [ UP ]+
S04-P1-NVSW-03 E0:9D:73:F0:41:8E 10.78.195.33 [ UP ]+
S04-P1-NVSW-04 E0:9D:73:F0:41:36 10.78.195.34 [ UP ]+
S04-P1-NVSW-05 E0:9D:73:F0:41:A6 10.78.195.35 [ UP ]+
S04-P1-NVSW-06 E0:9D:73:F0:45:36 10.78.195.36 [ UP ]+
S04-P1-NVSW-07 E0:9D:73:F0:4D:7E 10.78.195.37 [ UP ]+
S04-P1-NVSW-08 E0:9D:73:F0:3D:56 10.78.195.38 [ UP ]+
S04-P1-NVSW-09 E0:9D:73:F0:4D:B6 10.78.195.39 [ UP ]+
Note
For switches, the cm-lite daemon needs to be up and running for the switch to appear as [UP].
Example: NVLink Switch BCM switch information
[BCM11-HEAD-01->device[a05-p1-nvsw-01]]% show
+--------------------------+---------------------------------------------+
| Parameter | Value |
+==========================+=============================================+
| Hostname | a05-p1-nvsw-01 |
+--------------------------+---------------------------------------------+
| IP | 7.241.3.1 |
+--------------------------+---------------------------------------------+
| Network | ipminet2 |
+--------------------------+---------------------------------------------+
| Revision | |
+--------------------------+---------------------------------------------+
| Type | Switch |
+--------------------------+---------------------------------------------+
| Mac | E0:9D:73:3F:E0:50 |
+--------------------------+---------------------------------------------+
| Model | |
+--------------------------+---------------------------------------------+
| Ports | 0 |
+--------------------------+---------------------------------------------+
| Kind | nvlink |
+--------------------------+---------------------------------------------+
| Control script | |
+--------------------------+---------------------------------------------+
| Control script timeout | 5 |
+--------------------------+---------------------------------------------+
| SNMP Settings | <submode> |
+--------------------------+---------------------------------------------+
| Lowest port | 1 |
+--------------------------+---------------------------------------------+
| Uplinks | |
+--------------------------+---------------------------------------------+
| Disable port detection | yes |
+--------------------------+---------------------------------------------+
| Disable port mapping | no |
+--------------------------+---------------------------------------------+
| Activation | Sun, 23 Feb 2025 12:55:30 PST |
+--------------------------+---------------------------------------------+
| Rack | A05:19 |
+--------------------------+---------------------------------------------+
| Chassis | < not set > |
+--------------------------+---------------------------------------------+
| Access Settings | <submode> |
+--------------------------+---------------------------------------------+
| Priority | 0 |
+--------------------------+---------------------------------------------+
| VLAN cache time | 5m |
+--------------------------+---------------------------------------------+
| Has client daemon | yes |
+--------------------------+---------------------------------------------+
| ZTP Settings | <submode> |
+--------------------------+---------------------------------------------+
| Subnet manager | no |
+--------------------------+---------------------------------------------+
| Disable SNMP | yes |
+--------------------------+---------------------------------------------+
| GUID | 00000000-0000-0000-0000-000000000000 |
+--------------------------+---------------------------------------------+
| Services | <0 in submode> |
+--------------------------+---------------------------------------------+
| NV configuration mode | AUTO |
+--------------------------+---------------------------------------------+
| Members | |
+--------------------------+---------------------------------------------+
| Management network | ipminet2 |
+--------------------------+---------------------------------------------+
| Power control | rf0 |
+--------------------------+---------------------------------------------+
| Custom power script | |
+--------------------------+---------------------------------------------+
| Custom power script arg | |
+--------------------------+---------------------------------------------+
| Power distribution units | |
+--------------------------+---------------------------------------------+
| Default gateway metric | 0 |
+--------------------------+---------------------------------------------+
| Switch ports | |
+--------------------------+---------------------------------------------+
| Interfaces | <3 in submode> |
+--------------------------+---------------------------------------------+
| BMC Settings | <submode> |
+--------------------------+---------------------------------------------+
| Userdefined1 | |
+--------------------------+---------------------------------------------+
| Userdefined2 | |
+--------------------------+---------------------------------------------+
| User defined resources | |
+--------------------------+---------------------------------------------+
| Supports GNSS | no |
+--------------------------+---------------------------------------------+
| Custom ping script | |
+--------------------------+---------------------------------------------+
| Custom ping script arg | |
+--------------------------+---------------------------------------------+
| Partition base | |
+--------------------------+---------------------------------------------+
| Part number | |
+--------------------------+---------------------------------------------+
| Serial number | |
+--------------------------+---------------------------------------------+
| Notes | <0B> |
+--------------------------+---------------------------------------------+
| Prometheus metric | <0 in submode> |
| forwarders | |
+--------------------------+---------------------------------------------+
Example: BCM NVLink Switch interfaces output
[BCM11-HEAD-01->device[B05-P1-NVSW-01]->interfaces]% list
Type Network device name IP Network Start if
------------ -------------------- ---------------- ---------------- --------
bmc rf0 7.241.5.21 ipminet3 always
physical eth0 7.241.5.1 ipminet3 always
physical eth1 7.241.5.11 ipminet3 always
All NVLink Switches per rack are reachable by their BMC and COMe0/COMe1 port IP addresses:
Copper connections confirmed.
Speed/Bandwidth (200G for COMe0 and COMe1).
IP Address assigned by BCM to the COMe0 and COMe1 network (ipminetx).
Logical connectivity (access):
SSH to NVLink switch BMC can be done (default user/pass = root/JulietBmc@123)
SSH to NVOS on each NVLink switch can be done (default user/pass = admin/Juliet1234).
Note
If the NVLink Switch has any issues and the default NVOS password above is not working, try admin/admin.
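The reachability checks above can be scripted. The following is a minimal sketch, assuming the interfaces `list` output has been saved or piped in as text. The function names `extract_ips` and `check_reachable` are hypothetical helpers for illustration and are not BCM commands.

```shell
# Hypothetical helper: print "<device name> <IP>" for each data row of the
# cmsh interfaces `list` output read from stdin.
extract_ips() {
  awk '$1 == "bmc" || $1 == "physical" { print $2, $3 }'
}

# Hypothetical helper: ping each address once and report the result.
# Uses the Linux iputils ping options -c (count) and -W (timeout in seconds).
check_reachable() {
  while read -r name ip; do
    if ping -c 1 -W 2 "$ip" >/dev/null 2>&1; then
      echo "$name $ip reachable"
    else
      echo "$name $ip UNREACHABLE"
    fi
  done
}

# Sample input copied from the example interfaces output above.
sample_interfaces='Type Network device name IP Network Start if
------------ -------------------- ---------------- ---------------- --------
bmc rf0 7.241.5.21 ipminet3 always
physical eth0 7.241.5.1 ipminet3 always
physical eth1 7.241.5.11 ipminet3 always'

printf '%s\n' "$sample_interfaces" | extract_ips
```

On a head node, the output of `extract_ips` could be piped into `check_reachable` to test each BMC and COMe address in turn.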
Method 1—automatic NVOS update and firmware update (supported in BCM 11 GA)#
For DGX SuperPOD configurations, this is the method enabled by BCM network automation.
Set up ZTP in BCM and enable ZTP on the NVLink Switches themselves.
Restart the NVLink Switch trays to trigger ZTP, which upgrades NVOS and then runs a separate script to perform the tray-level firmware updates.
Method 2—BCM/NVIDIA Mission Control firmware update integrated process for NVLink Switch#
Get a summary of the firmware update files uploaded to BCM in the
/cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200sw
directory. If none exist, upload the flash files to that directory.
Verify the files with this command:
cmsh -c 'device; firmware info'
Ensure that all the files show up with the GB200-Switch designation.
Example: Firmware update file list for NVLink Switch devices:
cmsh -c 'device; firmware info' | grep -i GB200-Switch
#Or get it from the individual node entry
[BCM11-HEAD-01->device[a05-p1-nvsw-09]]% firmware info
Device Filename Component Version State Progress Result Size Date
---------------- --------------------------------------------------------- ---------------- ----------------------------------- ---------- -------- -------- -------- ---------------------
BCM11-HEAD-01 nvfw_GB200-P4978_0000_250213.1.0_dbg-signed.fwpkg GB200-Switch GB200-P4978_0000_250213.1.0 available N/A 71MiB 2025-02-13, 10:05:51
BCM11-HEAD-01 nvfw_GB200-P4978_0002_250205.1.0_dbg-signed.fwpkg GB200-Switch GB200-P4978_0002_250205.1.0 available N/A 16.2MiB 2025-02-05, 15:49:59
BCM11-HEAD-01 nvfw_GB200-P4978_0003_250121.1.2_custom_dbg-signed.fwpkg GB200-Switch GB200-P4978_0003_250121.1.2_custom available N/A 1.64MiB 2025-01-21, 13:55:25
Use the firmware status command from the BCM device submenu to find the current firmware levels of the NVLink Switch.
Example: Firmware status command from BCM
#Do for individual node
[BCM11-HEAD-01->device]% firmware status -n a05-p1-nvsw-09
#Do for all nodes
[BCM11-HEAD-01->device]% firmware status -t switch | grep -i nvsw
#Can also pull at the rack level if desired
[BCM11-HEAD-01->device]% firmware status -r <rack location> | grep -i nvsw
Example: Firmware status command output
Device Filename Component Version State Progress Result Size Date
---------------- -------------------------------- ---------------- -------------------- -------- -------- -------- -------- --------
a05-p1-nvsw-09 ASIC 35.2014.1698 current N/A N/A
a05-p1-nvsw-09 BIOS 0ACTV_00.01.012 current N/A N/A
a05-p1-nvsw-09 BMC 88.0002.0956 current N/A N/A
a05-p1-nvsw-09 CPLD1 CPLD000370_REV0500 current N/A N/A
a05-p1-nvsw-09 CPLD2 CPLD000377_REV0800 current N/A N/A
a05-p1-nvsw-09 CPLD3 CPLD000373_REV0800 current N/A N/A
a05-p1-nvsw-09 CPLD4 CPLD000390_REV0300 current N/A N/A
a05-p1-nvsw-09 EROT 01.04.0018.0000_n04 current N/A N/A
a05-p1-nvsw-09 EROT-ASIC1 01.04.0018.0000_n04 current N/A N/A
a05-p1-nvsw-09 EROT-ASIC2 01.04.0018.0000_n04 current N/A N/A
a05-p1-nvsw-09 EROT-BMC 01.04.0018.0000_n04 current N/A N/A
a05-p1-nvsw-09 EROT-CPU 01.04.0018.0000_n04 current N/A N/A
a05-p1-nvsw-09 EROT-FPGA 01.04.0018.0000_n04 current N/A N/A
a05-p1-nvsw-09 FPGA 0.1A current N/A N/A
a05-p1-nvsw-09 SSD CE00A400 current N/A N/A
a05-p1-nvsw-09 transceiver N/A current N/A N/A
Ensure that all NVLink Switch BMCs have their firmware management mode set to gb200sw.
#within CMSH
device
foreach -t switch (bmcsettings; get firmwaremanagemode)
#If not set
foreach -n S03-P1-NVSW-[01..09] (bmcsettings; set firmwaremanagemode gb200sw; commit)
To compare the installed versions against those in a firmware update file and determine whether an update is needed, pass the file name to the firmware flash command with the --dry-run option.
#Single Switch
cmsh -c 'device; firmware flash -n s03-p1-nvsw-04 nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg --dry-run'
#Multiple Switches
cmsh -c 'device; firmware flash -n S03-P1-NVSW-[01-09] nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg --dry-run'
If the changes look correct, remove the --dry-run switch to apply the updates.
Update the tray-level firmware first, in this order:
BMC+FPGA+ERoT (Switch BMC bundle).
CPLD1 CPLD2 CPLD3 CPLD4 (Switch CPLD bundle).
SBIOS+ERoT (Switch BIOS bundle).
Use the firmware status -n <switch host name> command to check update progress.
Once complete, do a power reset of the NVLink Switch to reboot and activate the new firmware versions.
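The update order above can be encoded so that bundles are always processed in the same sequence. This is a minimal sketch, assuming the bundle numbers in the file names follow the firmware recipe table (0004 for the Switch BMC bundle, 0007 for the CPLD bundle, 0006 for the BIOS bundle). The helper names `order_switch_bundles` and `print_dry_run_commands` are hypothetical, and the script only prints the cmsh commands rather than running them.

```shell
# Hypothetical helper: print the given bundle file names in flash order:
# BMC bundle (0004), then CPLD bundle (0007), then BIOS bundle (0006).
order_switch_bundles() {
  local id f
  for id in 0004 0007 0006; do
    for f in "$@"; do
      case "$f" in
        nvfw_GB200-P4978_"$id"_*) printf '%s\n' "$f" ;;
      esac
    done
  done
}

# Hypothetical helper: print (do not run) one dry-run command per bundle.
# $1 is the node name or node range; the remaining arguments are file names.
print_dry_run_commands() {
  local nodes="$1"
  shift
  order_switch_bundles "$@" | while read -r pkg; do
    printf "cmsh -c 'device; firmware flash -n %s %s --dry-run'\n" "$nodes" "$pkg"
  done
}

# File names are given in arbitrary order; the output is in flash order.
print_dry_run_commands 's03-p1-nvsw-04' \
  nvfw_GB200-P4978_0006_250205.1.0_prod-signed.fwpkg \
  nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg \
  nvfw_GB200-P4978_0004_250213.1.0_prod-signed.fwpkg
```

Printing the commands first makes it possible to review them before pasting them into a shell. Files whose names do not match one of the three bundle numbers are silently skipped.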
Example:
[BCM11-HEAD-01->device]% firmware status -n a18-p1-nvsw-09
Device Filename Component Version State Progress Result Size Date
---------------- -------------------------------- ---------------- -------------------- ---------- -------- ------------------- -------- --------
a18-p1-nvsw-09 ASIC 35.2015.1686 current N/A N/A
a18-p1-nvsw-09 BIOS 0ACTV_00.01.012 current N/A N/A
a18-p1-nvsw-09 BMC 88.0002.0956 completed N/A success: activated N/A
a18-p1-nvsw-09 CPLD1 CPLD000370_REV0500 current N/A N/A
a18-p1-nvsw-09 CPLD2 CPLD000377_REV0800 current N/A N/A
a18-p1-nvsw-09 CPLD3 CPLD000373_REV0800 current N/A N/A
a18-p1-nvsw-09 CPLD4 CPLD000390_REV0300 current N/A N/A
a18-p1-nvsw-09 EROT 01.04.0018.0000_n04 completed N/A success: activated N/A
a18-p1-nvsw-09 EROT-ASIC1 01.04.0018.0000_n04 current N/A N/A
a18-p1-nvsw-09 EROT-ASIC2 01.04.0018.0000_n04 current N/A N/A
a18-p1-nvsw-09 EROT-BMC 01.04.0018.0000_n04 current N/A N/A
a18-p1-nvsw-09 EROT-CPU 01.04.0018.0000_n04 current N/A N/A
a18-p1-nvsw-09 EROT-FPGA 01.04.0018.0000_n04 current N/A N/A
a18-p1-nvsw-09 FPGA 0.1A current N/A N/A
a18-p1-nvsw-09 SSD CE00A400 current N/A N/A
a18-p1-nvsw-09 transceiver N/A current N/A N/A
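When many switches are updated at once, it can help to show only the components whose state is something other than current. This is a minimal sketch, assuming the firmware status output is available as text with one row per line and an empty Filename column, so that the component, version, and state are the second, third, and fourth fields. The helper name `fw_not_current` is hypothetical.

```shell
# Hypothetical helper: for device $1, print "<component> <version> <state>"
# for every row on stdin whose state is not "current".
fw_not_current() {
  awk -v dev="$1" '$1 == dev && NF >= 4 && $4 != "current" { print $2, $3, $4 }'
}

# Sample rows copied from the example output above.
sample_status='Device Filename Component Version State Progress Result Size Date
a18-p1-nvsw-09 ASIC 35.2015.1686 current N/A N/A
a18-p1-nvsw-09 BMC 88.0002.0956 completed N/A success: activated N/A
a18-p1-nvsw-09 EROT 01.04.0018.0000_n04 completed N/A success: activated N/A
a18-p1-nvsw-09 FPGA 0.1A current N/A N/A'

printf '%s\n' "$sample_status" | fw_not_current a18-p1-nvsw-09
```

For the sample rows, only the BMC and EROT lines are printed, which matches the two components reported as completed in the example.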
Method 3— Standalone nvfwupd tool firmware update process for NVLink Switch#
Updating firmware with the nvfwupd tool is an alternative to the BCM firmware upgrade process. This method is highly manual.
To start, run
module load cm-nvfwupd
(if the NVIDIA Mission Control enabled license is active); otherwise, run the command from the location of the nvfwupd tool.
Assess NVLink Switch firmware levels from the nvfwupd tool.
nvfwupd -t ip=<switch IP> user=admin password=Juliet@1234 servertype=gb200switch show_version
Compare the NVLink Switch versions found above with the versions in the update package.
# nvfwupd -t ip=<switch IP> user=admin password=Juliet@1234 servertype=gb200switch show_version -p <file to compare version to>
# In this example all three NVLink Switch update files are passed to nvfwupd to compare the versions of all upgradeable components.
root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=<NVLink Switch COMe0 IP> user=admin password=Juliet@1234 servertype=gb200switch show_version -p ~/fw_0.9_releases/switch/nvfw_GB200-P4978_0004_250213.1.0_prod-signed.fwpkg ~/fw_0.9_releases/switch/nvfw_GB200-P4978_0006_250205.1.0_prod-signed.fwpkg ~/fw_0.9_releases/switch/nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg
System Model: N5400_LD
Part number: 920-9K36K-00MV-GS0
Serial number: MT250660041K
Packages: ['GB200-P4978_0004_250213.1.0', 'GB200-P4978_0006_250205.1.0',
'GB200-P4978_0007_250121.1.2_custom']
Connection Status: Successful
Firmware Devices:
AP Name Sys Version Pkg Version Up-To-Date
------------- -------------------- ---------------------- ----------
ASIC 35.2014.1652 N/A No
BIOS 0ACTV_00.01.012 00.01.012 Yes
BMC 88.0002.0929 88.0002.0930 No
CPLD1 CPLD000370_REV0500 CPLD000370_REV0500 Yes
CPLD2 CPLD000377_REV0600 CPLD000377_REV0600 Yes
CPLD3 CPLD000373_REV0500 CPLD000373_REV0500 Yes
CPLD4 CPLD000390_REV0200 CPLD000390_REV0200 Yes
EROT 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
EROT-ASIC1 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
EROT-ASIC2 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
EROT-BMC 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
EROT-CPU 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
EROT-FPGA 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
FPGA 0.1A 0.1A Yes
SSD CE00A400 N/A No
transceiver N/A N/A No
----------------------------------------------------------------------------
Error Code: 0
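The comparison table can be reduced to just the components that have a newer version available in the supplied packages. This is a minimal sketch, assuming the show_version output has been captured as text. The helper name `fw_needs_update` is hypothetical. It skips rows whose package version is N/A, because no supplied package covers those components.

```shell
# Hypothetical helper: from nvfwupd show_version output on stdin, print
# "<component> <system version> -> <package version>" for each row that is
# marked "No" in the Up-To-Date column and has a real package version.
fw_needs_update() {
  awk 'NF == 4 && $4 == "No" && $3 != "N/A" { print $1, $2, "->", $3 }'
}

# Sample rows copied from the example output above.
sample_versions='AP Name Sys Version Pkg Version Up-To-Date
------------- -------------------- ---------------------- ----------
ASIC 35.2014.1652 N/A No
BIOS 0ACTV_00.01.012 00.01.012 Yes
BMC 88.0002.0929 88.0002.0930 No
CPLD1 CPLD000370_REV0500 CPLD000370_REV0500 Yes
SSD CE00A400 N/A No
transceiver N/A N/A No'

printf '%s\n' "$sample_versions" | fw_needs_update
```

For the sample rows, only the BMC line is printed, which indicates that the Switch BMC bundle is the one that needs to be flashed.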
Flash the NVLink Switch with the relevant package.
#Replace <switch IP> with the IP address of the switch
nvfwupd -t ip=<switch IP> user=admin password=Juliet@1234 servertype=gb200switch update_fw -p /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200sw/nvfw_GB200-P4978_0000_241217.1.0_dbg-signed.fwpkg
Update the tray level firmware first in this order:
BMC+FPGA+ERoT (Switch BMC bundle).
CPLD1 CPLD2 CPLD3 CPLD4 (Switch CPLD bundle).
SBIOS+EROT (Switch BIOS bundle).
After a BMC update, the switch will need a power cycle.
Method 1—NVLink Switch AUX Power Cycle using the nvfwupd tool
Reference: NVLink Switch NVUE power cycle:
root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=10.78.195.1 user=admin password=Juliet@1234 servertype=gb200switch activate_fw -c NVUE_PWR_CYCLE
Power cycle task was created with ID 4
Status for Job Id 4: {'detail': 'File delete successfully', 'http_status': 200, 'issue': [], 'percentage': '', 'state': 'running', 'status': 'File delete successfully', 'timeout': 5, 'type': '', 'warnings': []}
Method 2—NVLink Switch AUX Power Cycle using NVLink Switch BMC
Connect to the switch BMC (ssh root@<bmc ip> / pass 0penBmc) and run the following command:
echo 1 > /sys/devices/platform/ahb/ahb:apb/ahb:apb:bus@1e78a000/1e78a300.i2c-bus/i2c-5/5-0031/mlxreg-io/hwmon/hwmon*/aux_pwr_cycle
Note
The CPLD and SBIOS versions can be updated sequentially without a power cycle between them. The firmware update command will automatically trigger an AC cycle on the next reboot.
After reboot, check the firmware versions to ensure the update is complete.
Reference: NVLink Switch Successful BMC Update
root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=<NVLink Switch COMe0 IP> user=admin password=Juliet@1234 servertype=gb200switch show_version -p ~/fw_0.9_releases/switch/nvfw_GB200-P4978_0004_250213.1.0_prod-signed.fwpkg ~/fw_0.9_releases/switch/nvfw_GB200-P4978_0006_250205.1.0_prod-signed.fwpkg ~/fw_0.9_releases/switch/nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg
System Model: N5400_LD
Part number: 920-9K36K-00MV-GS0
Serial number: MT250660041K
Packages: ['GB200-P4978_0004_250213.1.0', 'GB200-P4978_0006_250205.1.0',
'GB200-P4978_0007_250121.1.2_custom']
Connection Status: Successful
Firmware Devices:
AP Name Sys Version Pkg Version Up-To-Date
------------- ------------------ ---------------------- ----------
ASIC 35.2014.1652 N/A No
BIOS 0ACTV_00.01.012 00.01.012 Yes
BMC 88.0002.0930 88.0002.0930 Yes
CPLD1 CPLD000370_REV0500 CPLD000370_REV0500 Yes
CPLD2 CPLD000377_REV0600 CPLD000377_REV0600 Yes
CPLD3 CPLD000373_REV0500 CPLD000373_REV0500 Yes
CPLD4 CPLD000390_REV0200 CPLD000390_REV0200 Yes
EROT 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
EROT-ASIC1 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
EROT-ASIC2 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
EROT-BMC 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
EROT-CPU 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
EROT-FPGA 01.04.0008.0000_n04 01.04.0008.0000_n04 Yes
FPGA 0.1A 0.1A Yes
SSD CE00A400 N/A No
transceiver N/A N/A No
------------------------------------------------------------------------
Error Code: 0
Method 4—firmware updates within NVOS for NVLink Switch#
If the installed license does not support the NVIDIA Mission Control feature but firmware updates are still needed, they can be performed from NVOS itself.
Assess NVLink Switch firmware levels from the NVOS.
nv show platform firmware
Example: Log in to the NVLink Switch and get the firmware version information:
#Firmware
admin@S04-P1-NVSW-01:~$ nv show platform firmware
Name Actual FW Part Number FW Source
------------- ------------------ ----------------------------- ----------
ASIC 35.2014.1652 920-9K36W-00MV-GS0_Ax default
BIOS 0ACTV_00.01.012 N/A N/A
BMC 88.0002.0929 692-13809-1404-000 N/A
CPLD1 CPLD000370_REV0500 0x0172 N/A
CPLD2 CPLD000377_REV0600 0x0179 N/A
CPLD3 CPLD000373_REV0500 0x0175 N/A
CPLD4 CPLD000390_REV0200 0x0186 N/A
EROT 01.04.0008.0000_n04 N/A N/A
EROT-ASIC1 01.04.0008.0000_n04 N/A N/A
EROT-ASIC2 01.04.0008.0000_n04 N/A N/A
EROT-BMC 01.04.0008.0000_n04 N/A N/A
EROT-CPU 01.04.0008.0000_n04 N/A N/A
EROT-FPGA 01.04.0008.0000_n04 N/A N/A
FPGA 0.1A N/A N/A
SSD CE00A400 Virtium VTPM24CEXI080-BM110006 N/A
transceiver N/A N/A N/A
Note
The CPLD archive is built into a .fwpkg package file type. To perform a CPLD upgrade on the NVLink Switch, unpack this file to obtain the required .vme file.
Download the NVIDIA fwpkg-unpack tool via ID 1090243.
Unpack the CPLD .fwpkg via the fwpkg-unpack tool:
./fwpkg-unpack --unpack nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg
Note
A new CPLD file is extracted with a .bin file extension. Rename the file to have a .vme extension.
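The rename step can be done with shell parameter expansion. This is a minimal sketch, assuming the unpacked CPLD image is the only .bin file in the target directory. The helper name `rename_bin_to_vme` and the demo file name `cpld_image.bin` are hypothetical, because the actual name of the extracted file depends on the package.

```shell
# Hypothetical helper: rename every *.bin file in directory $1 to *.vme.
rename_bin_to_vme() {
  local dir="$1" f
  for f in "$dir"/*.bin; do
    [ -e "$f" ] || continue          # glob did not match any file
    mv -- "$f" "${f%.bin}.vme"       # strip the .bin suffix, append .vme
  done
}

# Demonstration in a temporary directory with a placeholder file.
demo_dir="$(mktemp -d)"
touch "$demo_dir/cpld_image.bin"
rename_bin_to_vme "$demo_dir"
ls "$demo_dir"
rm -rf "$demo_dir"
```

Because the helper renames every .bin file it finds, point it only at the directory that holds the unpacked CPLD image.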
BMC firmware update and reboot (BMC + FPGA + ERoT).
nv action fetch platform firmware BMC 'scp://root:nvis1234!@192.168.255.254/var/www/html/nvswitch/images/0.9.03/nvfw_GB200-P4978_0004_250226.1.0_prod-signed.fwpkg'
nv action install platform firmware BMC files nvfw_GB200-P4978_0004_250226.1.0_prod-signed.fwpkg force
Note
A system power cycle must be performed to force the BMC to load the new firmware version.
nv action power-cycle system force
CPLD firmware update and skip-reboot (CPLD1 CPLD2 CPLD3 CPLD4).
nv action fetch platform firmware CPLD1 'scp://root:nvis1234!@192.168.255.254/var/www/html/nvswitch/images/0.9.03/CPLD_Prod_000370_REV0500_000377_REV0600_000373_REV0500_000390_REV0200_4717c08d_image.vme'
nv action install platform firmware CPLD1 files CPLD_Prod_000370_REV0500_000377_REV0600_000373_REV0500_000390_REV0200_4717c08d_image.vme force skip-reboot
BIOS firmware upgrade and skip-reboot (SBIOS + ERoT).
nv action fetch platform firmware BIOS 'scp://root:nvis1234!@192.168.255.254/var/www/html/nvswitch/images/0.9.03/nvfw_GB200-P4978_0006_250205.1.0_prod-signed.fwpkg'
nv action install platform firmware BIOS files nvfw_GB200-P4978_0006_250205.1.0_prod-signed.fwpkg force skip-reboot
NVLink Switch—Updating NVOS#
Outside of BCM ZTP automation, NVOS updates must be performed from NVOS on the NVLink Switch itself.
Get the NVOS version by connecting over SSH as the admin user of the NVLink Switch and then running the
nv show system version
command.
#OS Software
admin@S04-P1-NVSW-01:~$ nv show system version
operational
---------- ----------------------------
kernel 5.10.0-30-2-amd64
build-date Sun Feb 9 18:12:03 UTC 2025
image nvos-25.02.1877
onie 2023.11-5.3.0012-115200
To install a new version of NVOS, get the binary onto the switch:
Use scp to copy the binary to the switch and save the file in
/host/nvos-images/
Or use the fetch command from NVOS to pull the .bin file.
nv action fetch system image 'scp://root:nvis1234!@192.168.255.254/var/www/html/nvswitch/images/0.9.03/nvos-amd64-25.02.1884.bin'
Check system images that are present.
admin@S03-P1-NVSW-07:~$ nv show system image
operational
---------- ---------------
current nvos-25.02.1877
next nvos-25.02.1877
partition1 nvos-25.02.1754
partition2 nvos-25.02.1877
Uninstall old images.
# Remove extra NVOS version image installed if present
nv action uninstall system image
admin@S03-P1-NVSW-07:~$ nv action uninstall system image
Action executing ...
Uninstalling image: nvos-25.02.1754
Action executing ...
Image nvos-25.02.1754 uninstalled successfully
Action succeeded
Install the new image.
After the installation is complete, the switch will automatically reboot into the updated OS.
#nv action install system image files new-nvos-image.bin
admin@S03-P1-NVSW-07:~$ nv action install system image files nvos-amd64-25.02.1879.bin
The operation will install the image and initiate a reboot.
Type [y] to install the image and reboot.
Type [N] to abort.
Do you want to continue? [y/N] y
Action executing ...
Installing image: nvos-amd64-25.02.1879.bin
Action executing ...
Performing reboot ...
Action executing ...
Disconnecting from NVOS, system is offline during reboot
Connection to s03-p1-nvsw-07 closed by remote host.
Connection to s03-p1-nvsw-07 closed.
When the switch OS comes back up after the reboot, check that the new OS version was applied using nv show system image.
admin@S03-P1-NVSW-07:~$ nv show system image
operational
---------- ---------------
current nvos-25.02.1879
next nvos-25.02.1879
partition1 nvos-25.02.1877
partition2 nvos-25.02.1879
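The post-reboot check can be automated by comparing the current row against the expected image name. This is a minimal sketch, assuming the nv show system image output has been captured as text. The helper names `nvos_current_image` and `nvos_image_is` are hypothetical. Note that the image name shown by NVOS (for example nvos-25.02.1879) differs from the installer file name (nvos-amd64-25.02.1879.bin).

```shell
# Hypothetical helper: print the image name from the "current" row on stdin.
nvos_current_image() {
  awk '$1 == "current" { print $2 }'
}

# Hypothetical helper: succeed only if the current image equals $1.
nvos_image_is() {
  local expected="$1" actual
  actual="$(nvos_current_image)"
  [ "$actual" = "$expected" ]
}

# Sample output copied from the example above.
sample_image='operational
---------- ---------------
current nvos-25.02.1879
next nvos-25.02.1879
partition1 nvos-25.02.1877
partition2 nvos-25.02.1879'

if printf '%s\n' "$sample_image" | nvos_image_is nvos-25.02.1879; then
  echo "NVOS image is up to date"
else
  echo "NVOS image mismatch"
fi
```

The same check could be applied to each switch in a rack by feeding it the command output collected from every switch.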
Check that the cluster apps are running on the switch that has been designated as the NMX-C master (typically NVSW-01).
admin@S04-P1-NVSW-01:~$ nv show cluster apps
Name ID Version Capabilities Components Version Status Reason Additional Information Summary
-------------- ------------- ---------------------- --------------------------------------------------- ---------------------------------------------------------------- ------ ------ ------------------------------ -------
nmx-controller nmx-c-nvos 0.9.0_2025-02-11_09-49 sm, gfm, fib, gw-api sm:2025.01.5, gfm:R570.120, fib-fe:0.9.0 ok CONTROL_PLANE_STATE_CONFIGURED
nmx-telemetry nmx-telemetry 0.9.5 nvl telemetry, gnmi aggregation, syslog aggregation nvl-telemetry:1.20.1, gnmi-aggregator:1.0.1, nmx-connector:1.0.1 ok
If this returns No data, and this is not the NMX-C master node, no further action is required. However, if the NVLink Switch is the master, the apps need to be configured within NVOS:
Start cluster apps.
nv set cluster state enabled
nv config apply
nv config save
nv show cluster apps
If the NMX controller (NMX-C) status is not ok and it reports CONTROL_PLANE_STATE_UNCONFIGURED, the fm_config.cfg file may need to be applied, as described in the section where the fm_config.cfg file is generated.
admin@a18-p1-nvsw-01:~$ nv show cluster apps
Name ID Version Capabilities Components Version Status Reason Additional Information Summary
-------------- ------------- ---------------------- --------------------------------------------------- ---------------------------------------------------------------- ------ -------- -------------------------------- -------
nmx-controller nmx-c-nvos 0.9.0_2025-02-25_16-53 sm, gfm, fib, gw-api sm:2025.01.6, gfm:R570.124.02, fib-fe:0.9.0 not ok NMXC: OK CONTROL_PLANE_STATE_UNCONFIGURED
Re-run the litedaemon installation tool within BCM in order for the switch to show “UP”.
Sometimes after a new NVOS installation, the password is reset to the factory default of admin. Log in with admin/admin, set the password to Juliet@1234, and then try again.
Example: NVOS default state, password reset:
NVOS switch
admin@s03-p1-nvsw-04's password:
You are required to change your password immediately (administrator enforced).
███╗ ██╗██╗ ██╗ ██████╗ ███████╗
████╗ ██║██║ ██║██╔═══██╗██╔════╝
██╔██╗ ██║██║ ██║██║ ██║███████╗
██║╚██╗██║╚██╗ ██╔╝██║ ██║╚════██║
██║ ╚████║ ╚████╔╝ ╚██████╔╝███████║
╚═╝ ╚═══╝ ╚═══╝ ╚═════╝ ╚══════╝
Last login: Fri Mar 21 08:58:02 UTC 2025 from 10.78.192.25 on pts/0
Last failed login: Fri Mar 21 10:02:38 UTC 2025 from 10.78.192.25 on ssh:notty
There was 1 failed login attempt since the last successful login.
WARNING: Your password has expired.
You must change your password now!
New password:
Retype new password:
applied [rev_id: 1]
Number of total successful connections since last 1 days: 3
Your password has been changed since last login
Note
A pause is expected after you have reset the password.
Power Shelf Firmware Update Process#
There are several vendors of power shelves for DGX GB200 NVL72 systems. The following instructions are for shelves made by Delta.
Flash the PMC with the latest version.
The response will contain the task number.
curl -k -u admin:password -H "Content-Type: application/octet-stream" -X POST -T <FIRMWARE_FILE> https://<BMC_IP>/redfish/v1/UpdateService/update
Verify that the flash is completed.
curl -k -u admin:password -X GET https://<BMC_IP>/redfish/v1/TaskService/Tasks/<Task_Number>
Check the PMC version.
curl -k -u admin:password -X GET https://<BMC_IP>/redfish/v1/Managers/SMC
Complete a PSU update by flashing the PSU with the latest firmware.
Repeat Steps 1 and 2, but point <FIRMWARE_FILE> to the PSU firmware image.
Run the following command and check the PSU version and health from the FirmwareVersion and Status/Health parameters in the output.
curl -k -u admin:password -X GET https://<BMC_IP>/redfish/v1/Chassis/chassis/PowerSubsystem/PowerSupplies/<PS_NUMBER>
Note
A PSU firmware update will temporarily power off the PSU, so we recommend that the rack is idle during the PSU update process.
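The task check in the PMC and PSU steps can be repeated in a loop until the flash finishes. This is a minimal sketch, assuming the power shelf returns a standard Redfish Task resource that contains a TaskState property. This is an assumption about the shelf firmware and is not confirmed here. The helper names `redfish_task_state` and `wait_for_task`, the 10-second interval, and the default retry count are hypothetical.

```shell
# Hypothetical helper: extract the value of "TaskState" from JSON on stdin.
redfish_task_state() {
  sed -n 's/.*"TaskState"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: poll a task until it completes, fails, or retries run out.
# Arguments: <BMC_IP> <Task_Number> <user:password> [max attempts, default 60]
# Returns 0 on Completed, 1 on a failure state, 2 if attempts are exhausted.
wait_for_task() {
  local bmc_ip="$1" task="$2" creds="$3" tries="${4:-60}" state
  while [ "$tries" -gt 0 ]; do
    state="$(curl -k -s -u "$creds" \
      "https://$bmc_ip/redfish/v1/TaskService/Tasks/$task" | redfish_task_state)"
    echo "task $task: ${state:-unknown}"
    case "$state" in
      Completed) return 0 ;;
      Exception|Killed|Cancelled) return 1 ;;
    esac
    tries=$((tries - 1))
    sleep 10
  done
  return 2
}

# Parser demonstration on a sample (made-up) task body; no network access needed.
printf '%s\n' '{"Id": "3", "TaskState": "Running", "PercentComplete": 40}' | redfish_task_state
```

A call such as `wait_for_task <BMC_IP> <Task_Number> admin:password` would then block until the flash reaches a final state. If the shelf reports progress under a different property, the parser needs to be adjusted accordingly.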
BlueField and CX7 FW Update Process#
Prior to installation, copy the binary to the compute tray host or use a shared directory. The binary naming format should look like:
fw-ConnectX7-rel-28_42_1270-900-24768-0002_Ax-UEFI-14.35.15-FlexBoot-3.7.500.signed.bin
The general steps to install NVIDIA networking firmware are as follows:
Start the MST service.
mst start
Query the devices to find the /dev/mst paths of the devices.
mst status -v
Read the current version of firmware on a given device.
flint -d /dev/mst/mt4129_pciconf0 q full
Flash the firmware on the device.
cd to the directory where the binary is stored.
flint -d /dev/mst/mt4129_pciconf0 -i fw-ConnectX7-rel-28_42_1270-900-24768-0002_Ax-UEFI-14.35.15-FlexBoot-3.7.500.signed.bin b
Repeat this for all four CX7 devices.
Reset the CX7 and reboot the host.
mlxfwreset -d mlx5_0 reset
PCI devices:
------------
DEVICE_TYPE MST PCI RDMA NET NUMA
ConnectX7(rev:0) /dev/mst/mt4129_pciconf0 0000:03:00.0 mlx5_0 net-ibp3s0 0
ConnectX7(rev:0) /dev/mst/mt4129_pciconf1 0002:03:00.0 mlx5_1 net-ibP2p3s0 0
ConnectX7(rev:0) /dev/mst/mt4129_pciconf2 0010:03:00.0 mlx5_4 net-ibP16p3s0 1
ConnectX7(rev:0) /dev/mst/mt4129_pciconf3 0012:03:00.0 mlx5_5 net-ibP18p3s0 1
For BlueField 3, the process is the same, except that the device paths begin with /dev/mst/mt41692 (for example, /dev/mst/mt41692_pciconf0).
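Rather than typing each /dev/mst path by hand, the paths can be pulled from the device table. This is a minimal sketch, assuming the mst status -v table has been captured as text with the device type in the first column and the MST path in the second. The helper name `mst_devices` is hypothetical. The script only prints the flint query commands and does not run them.

```shell
# Hypothetical helper: print the MST path of every row on stdin whose
# DEVICE_TYPE column starts with the string given in $1.
mst_devices() {
  awk -v type="$1" 'index($1, type) == 1 { print $2 }'
}

# Sample table copied from the PCI devices listing above.
sample_mst='DEVICE_TYPE MST PCI RDMA NET NUMA
ConnectX7(rev:0) /dev/mst/mt4129_pciconf0 0000:03:00.0 mlx5_0 net-ibp3s0 0
ConnectX7(rev:0) /dev/mst/mt4129_pciconf1 0002:03:00.0 mlx5_1 net-ibP2p3s0 0
ConnectX7(rev:0) /dev/mst/mt4129_pciconf2 0010:03:00.0 mlx5_4 net-ibP16p3s0 1
ConnectX7(rev:0) /dev/mst/mt4129_pciconf3 0012:03:00.0 mlx5_5 net-ibP18p3s0 1'

# Print one firmware query command per ConnectX-7 device.
printf '%s\n' "$sample_mst" | mst_devices ConnectX7 | while read -r dev; do
  printf 'flint -d %s q full\n' "$dev"
done
```

The same helper could select BlueField-3 devices by passing the device type string that mst status -v prints for them on the system in question.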
Combined CX-7 and BlueField Update#
pdsh -g category=dgx-gb200 '/home/nvis/<dir where the firmware update is>/nicupdate.sh > /home/nvis/<dir where the firmware update is>/$(hostname)_fw_upgrade_$(date +"%Y%m%d-%H%M%S").log'
Reference script: NIC updates (Both BF3 and CX-7) - nicupdate.sh:
#Start the MST service
mst start
#Query the current CX-7 firmware
flint -d /dev/mst/mt4129_pciconf0 q full
flint -d /dev/mst/mt4129_pciconf1 q full
flint -d /dev/mst/mt4129_pciconf2 q full
flint -d /dev/mst/mt4129_pciconf3 q full
#Query the current BlueField 3 firmware
flint -d /dev/mst/mt41692_pciconf0 q full
flint -d /dev/mst/mt41692_pciconf1 q full
basedir=/home/<user>/fw_0.9_releases/mellanox
bf3file=fw-BlueField-3-rel-32_43_2408-900-9D3B6-00CN-P_Ax-NVME-20.4.1-UEFI-21.4.13-UEFI-22.4.14-UEFI-14.36.21-FlexBoot-3.7.500.signed.bin
cx7file=fw-ConnectX7-rel-28_43_2110-900-24768-0002_Ax-UEFI-14.36.21-FlexBoot-3.7.500.signed.bin
#CX-7 Update (yes answers the confirmation prompts)
yes | flint -d /dev/mst/mt4129_pciconf0 -i "$basedir/$cx7file" b
yes | flint -d /dev/mst/mt4129_pciconf1 -i "$basedir/$cx7file" b
yes | flint -d /dev/mst/mt4129_pciconf2 -i "$basedir/$cx7file" b
yes | flint -d /dev/mst/mt4129_pciconf3 -i "$basedir/$cx7file" b
#BlueField 3 Update
yes | flint -d /dev/mst/mt41692_pciconf0 -i "$basedir/$bf3file" b
yes | flint -d /dev/mst/mt41692_pciconf1 -i "$basedir/$bf3file" b