GB200/GB300 Rack Firmware Update#
Rack Firmware Updates#
Firmware updates for a GB200/GB300 NVL72 rack can be performed with Base Command Manager (BCM) 11 once all the GB200/GB300 compute trays, NVLink Switch trays, and power shelves are up in BCM. Follow the latest FW/SW recipe so that the installation succeeds on all devices. Firmware can also be updated with the standalone nvfwupd tool, as described later in this section. This section provides instructions to upgrade the firmware for each major GB200/GB300 rack component (compute tray, NVLink switch, power shelf).
Note
FW packages for DGX SuperPOD are unique and different from the GB200 reference architecture package.
Reference: DGX GB200/GB300 Compute Tray Files Required for Update on DGX SuperPOD
The following table lists the general file names to expect for the DGX GB200 Compute Tray Firmware Update and NVLink Switch Firmware Update. For more information, see the specific DGX GB200/GB300 SW/FW Release Notes on the NVIDIA Enterprise Support Portal. Specific filenames for each release can be found in the section “Multi-Node System Software Stack Package Contents”.
| Component | Filename |
|---|---|
| DGX GB200 SW/FW Release Notes | |
| Compute BMC bundle | nvfw_DGX-GBX00_0023_<date>.*_custom_prod-signed.fwpkg |
| Compute HMC bundle | nvfw_HGX-GBX00_0023_<date>.*_custom_prod-signed.fwpkg |
| BF3 | fw-Bluefield-3-rel-*.bin |
| CX7 | fw-ConnectX7-rel-*.bin |
| Switch NVOS | nvos-amd64-*.bin |
| Switch BMC bundle | nvfw_GB200-P4978_0004.*.fwpkg |
| Switch BIOS bundle | nvfw_GB200-P4978_0006.*.fwpkg |
| Switch CPLD bundle | nvfw_GB200-P4978_0007.*.fwpkg |
| Powershelf PSU | NVIDIA_5500_APP_.*.tar |
| Powershelf PMC | common-pmc-3.*.tar |
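When sorting a downloaded release into the right staging directories, it can help to classify each file by name first. The following is a minimal sketch, not part of BCM or nvfwupd: the function name `classify_fw_file` and the component labels it prints are hypothetical, and the glob patterns are loosened forms of the filename patterns in the table, so they may need adjusting for a specific release.

```shell
# Hypothetical helper: print a component label for a firmware filename.
# The patterns are assumptions derived from the general filenames listed above.
classify_fw_file() {
  local name
  name="$(basename -- "$1")"
  case "$name" in
    nvfw_DGX-GBX00_*prod-signed.fwpkg) echo "compute-bmc" ;;
    nvfw_HGX-GBX00_*prod-signed.fwpkg) echo "compute-hmc" ;;
    fw-Bluefield-3-rel-*.bin)          echo "bf3" ;;
    fw-ConnectX7-rel-*.bin)            echo "cx7" ;;
    nvos-amd64-*.bin)                  echo "switch-nvos" ;;
    nvfw_GB200-P4978_0004*.fwpkg)      echo "switch-bmc" ;;
    nvfw_GB200-P4978_0006*.fwpkg)      echo "switch-bios" ;;
    nvfw_GB200-P4978_0007*.fwpkg)      echo "switch-cpld" ;;
    NVIDIA_5500_APP_*.tar)             echo "powershelf-psu" ;;
    common-pmc-3*.tar)                 echo "powershelf-pmc" ;;
    *)                                 echo "unknown"; return 1 ;;
  esac
}
```

For example, `classify_fw_file ./nvfw_GB200-P4978_0006_250205.1.0_prod-signed.fwpkg` prints `switch-bios`. A name that matches none of the patterns prints `unknown` and returns a non-zero status, so unexpected files are easy to spot before anything is copied.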
Firmware updates for the GB200/GB300 compute trays can be done by:
BCM 11 integrated firmware update tool
Standalone nvfwupd tool
GB200/GB300 compute tray firmware update—general steps
Obtain the compute tray package.
Ensure that the compute tray BMC has the username “admin” enabled and that its credentials are known. If the username “admin” does not exist or is disabled, it must be created and enabled before the compute tray update. BCM and any other rack management system should migrate to “admin” as the default BMC account, because the previously used “root” account will be disabled.
Note
The “root” username is disabled going forward.
If using BCM to do the firmware update (FW):
Place the files in /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200
Confirm in the compute tray bmcsettings (at the node level or category level) that the firmware management mode is set to GB200.
Check the current node’s FW versions against the update packages.
Execute a dry-run to confirm the FW will update to the expected versions.
Update the BMC package first (Compute BMC bundle), then the compute tray package (Compute HMC bundle). AC power-cycle the trays after each component update is complete.
NVLink Switch tray firmware update—General Steps
Obtain the NVLink Switch firmware.
If using BCM to do the firmware update:
Place the files in /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200sw.
Confirm that in the NVLink Switch device bmcsettings, the firmware management mode is set to GB200sw.
Check the current NVLink Switch firmware versions against the update packages.
Execute a dry-run to confirm the firmware will update to the expected version.
Update the tray level firmware first in this order:
BMC+FPGA+ERoT (Switch BMC bundle)
CPLD1 CPLD2 CPLD3 CPLD4 (Switch CPLD bundle)
SBIOS+EROT (Switch BIOS bundle)
Update the Switch NVOS from within the OS or using ZTP.
Reboot the switch trays after each component update is complete, to apply and activate the new firmware.
Compute Tray Firmware Update Process#
The following sections provide instructions to update the firmware for the GB200/GB300 compute trays using the BCM/NVIDIA Mission Control integrated firmware update tool and the standalone nvfwupd tool.
Method 1: BCM/NVIDIA Mission Control integrated firmware update for compute tray#
To use the firmware update tool in BCM 11, an NVIDIA Mission Control enabled license must be registered:
Place firmware update packages in the correct BCM directory: /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200
Copy the prod-signed.fwpkg images up to the BCM head node. The files must be placed in the following directory to be visible to the firmware command.
scp <BINARY_FILES> user@<HEAD_NODE>:/cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200
Reference: BCM file directory structure for firmware updates.
/cm/local/apps/cmd/etc/htdocs/bios/firmware/
  README.md
  b200/
  gb200/
  gb200sw/
  gh200/
  h100/
  ilo/
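The copy step can be wrapped so that the destination directory is always one of the two GB200 folders. This is a rough sketch under stated assumptions: `stage_fw` and the `RUN` switch are hypothetical names introduced here, and by default the function only prints the `scp` command it would run.

```shell
# Directory taken from the text above.
BCM_FW_ROOT=/cm/local/apps/cmd/etc/htdocs/bios/firmware

# stage_fw <gb200|gb200sw> <user@head-node> <file>...
# Prints one scp command per file; set RUN=1 to actually execute them.
stage_fw() {
  local target_dir="$1" head_node="$2" f
  shift 2
  case "$target_dir" in
    gb200|gb200sw) ;;
    *) echo "unsupported firmware directory: $target_dir" >&2; return 1 ;;
  esac
  for f in "$@"; do
    if [ "${RUN:-0}" = "1" ]; then
      scp "$f" "${head_node}:${BCM_FW_ROOT}/${target_dir}/"
    else
      echo "scp $f ${head_node}:${BCM_FW_ROOT}/${target_dir}/"
    fi
  done
}
```

Running `stage_fw gb200 user@<HEAD_NODE> *.fwpkg` first and reading the printed commands is a cheap way to catch a package headed for the wrong folder.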
Note
The gb200 folder is for compute tray firmware, the gb200sw folder is for NVLink Switch firmware.
Use the firmware info command in BCM to gather information on the current firmware levels of the nodes. This command provides details about the files and what their purpose is.
$ cmsh;device;firmware info
[BCM11-HEAD-01->device]% firmware info
Device         Filename                                                   Component      Version                             State      Progress  Result  Size     Date
-------------  ---------------------------------------------------------  -------------  ----------------------------------  ---------  --------  ------  -------  --------------------
BCM11-HEAD-01  nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg    GB200-BMC      DGX-GBX00_0024_250215.1.0_custom    available  N/A       -       64MiB    2025-02-15, 16:39:41
BCM11-HEAD-01  nvfw_GB200-P4978_0004_250213.1.0_prod-signed.fwpkg         GB200-Switch   GB200-P4978_0004_250213.1.0         available  N/A       -       75MiB    2025-02-13, 10:23:28
BCM11-HEAD-01  nvfw_GB200-P4978_0006_250205.1.0_prod-signed.fwpkg         GB200-Switch   GB200-P4978_0006_250205.1.0         available  N/A       -       16.2MiB  2025-02-05, 15:11:49
BCM11-HEAD-01  nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg  GB200-Switch   GB200-P4978_0007_250121.1.2_custom  available  N/A       -       1.64MiB  2025-01-21, 13:55:30
BCM11-HEAD-01  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg    GB200-Compute  HGX-GBX00_0023_250223.1.1_custom    available  N/A       -       114MiB   2025-02-23, 20:20:42
Note
This will display the file names and target (such as GB200 or Switch) of all available firmware binaries. If the files do not show up with this command, they cannot be flashed by the BCM firmware manager. The officially released packages will have a common filename structure starting with nvfw_DGX-GBX00_<IDENTIFIER>_<DATE>.
Confirm GB200 Tray BMC Access/Connectivity.
The BMC of each node needs to be configured in BCM. This should be done at the category level. Ensure that no bmc settings are added at the node level so that the compute trays inherit the settings from the category level.
Enter cmsh and show the current BMC settings for a given node or use the category level for GB200 compute trays since all their default passwords are the same (for DGX).
#category level
category; use <dgx-category>;bmcsettings; show
#device level
device; use <device name>; bmcsettings; show
Only use the device level to confirm that nothing has been set. An asterisk in the prompt, as in the following example, indicates uncommitted changes at that level.
[bcm11-headnode->device*[s03-p1-dgx-01-c06*]->bmcsettings*]%
#Use this command to clear uncommitted changes
refresh
Populate the bmcsettings fields in the dgx-gb200 category if it is not already populated.
$ cmsh;category use dgx-gb200;bmcsettings;
set username root # or admin if username admin is enabled
set password <bmc password>
set userid 1
set firmwaremanagemode gb200
commit
Note
It is critical that the firmware management mode here is set to gb200.
Test that the BMC is configured by reading the current FW component versions.
#At the specific device level
$ cmsh; device use <dgx-node-name>; firmware status
[BCM11-HEAD-01->device[s03-p1-dgx-01-c06]]% firmware status
Device  Filename  Component  Version  State  Progress  Result  Size  Date
---------------------------------------------------------------------------------------------
s03-p1-dgx-01-c06  CX7_0                     28.42.1270                current  N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06  CX7_1                     28.42.1270                current  N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06  CX7_2                     28.42.1270                current  N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06  CX7_3                     28.42.1270                current  N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06  FW_BMC_0                  GB200Nvl-24.12-8          current  N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06  FW_CPLD_0                 0x00 0x0b 0x03 0x04       current  N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06  FW_CPLD_1                 0x00 0x0b 0x03 0x04       current  N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06  FW_CPLD_2                 0x00 0x10 0x01 0x0f       current  N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06  FW_CPLD_3                 0x00 0x10 0x01 0x0f       current  N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06  FW_ERoT_BMC_0             01.03.0262.0000_n04       current  N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06  Full_FW_Image_NIC_Slot_4  32.42.1000                current  N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06  Full_FW_Image_NIC_Slot_7  32.42.1000                current  N/A N/A N/A N/A N/A
s03-p1-dgx-01-c06  UEFI                      buildbrain-gcid-38635631  current  N/A N/A N/A N/A N/A
#Alternatively, at the device prompt look at a specific device
cmsh; device;firmware status -n s03-p1-dgx-01-c06
#At the category level to see all of the compute tray FW in one shot
cmsh; device;firmware status -c dgx-gb200
#At the rack level
cmsh; device;firmware status -r <rack location>
As a validation step prior to executing the flash operation, the dry-run option will show exactly what is changing when the firmware is flashed:
Perform a flash dry-run of the BMC firmware.
cmsh -c 'device; firmware flash nvfw_DGX-GBX00_0023_241223.1.0_custom_prod-signed.fwpkg --dry-run -n <DEVICE_NAME>'
#The <DEVICE_NAME> can contain a bracket range to apply the change to multiple devices simultaneously:
#s03-p1-dgx-01-c0[1-2] runs the command against both s03-p1-dgx-01-c01 and s03-p1-dgx-01-c02
#Device names can also be comma separated to run against multiple individual devices, for example s03-p1-dgx-01-c01,s03-p1-dgx-01-c02
Example: Dry run output
Device             Component           Target              Version              Package version      Up to date  Action   Result  Error
-----------------  ------------------  ------------------  -------------------  -------------------  ----------  -------  ------  -----
s03-p1-dgx-01-c06  HGX_FW_BMC_0        HGX_FW_BMC_0        GB200Nvl-25.01-D     GB200Nvl-25.01-E     no          install  good
s03-p1-dgx-01-c06  HGX_FW_CPU_0        HGX_FW_CPU_0        02.03.19             02.03.20             no          install  good
s03-p1-dgx-01-c06  HGX_FW_CPU_1        HGX_FW_CPU_1        02.03.19             02.03.20             no          install  good
s03-p1-dgx-01-c06  HGX_FW_ERoT_BMC_0   HGX_FW_ERoT_BMC_0   01.04.0008.0000_n04  01.04.0008.0000_n04  yes         skip     good
s03-p1-dgx-01-c06  HGX_FW_ERoT_CPU_0   HGX_FW_ERoT_CPU_0   01.04.0008.0000_n04  01.04.0008.0000_n04  yes         skip     good
s03-p1-dgx-01-c06  HGX_FW_ERoT_CPU_1   HGX_FW_ERoT_CPU_1   01.04.0008.0000_n04  01.04.0008.0000_n04  yes         skip     good
s03-p1-dgx-01-c06  HGX_FW_ERoT_FPGA_0  HGX_FW_ERoT_FPGA_0  01.04.0008.0000_n04  01.04.0008.0000_n04  yes         skip     good
s03-p1-dgx-01-c06  HGX_FW_ERoT_FPGA_1  HGX_FW_ERoT_FPGA_1  01.04.0008.0000_n04  01.04.0008.0000_n04  yes         skip     good
s03-p1-dgx-01-c06  HGX_FW_FPGA_0       HGX_FW_FPGA_0       1.20                 1.20                 yes         skip     good
s03-p1-dgx-01-c06  HGX_FW_FPGA_1       HGX_FW_FPGA_1       1.20                 1.20                 yes         skip     good
s03-p1-dgx-01-c06  HGX_FW_GPU_0        HGX_FW_GPU_0        97.00.82.00.13       97.00.82.00.19       no          install  good
s03-p1-dgx-01-c06  HGX_FW_GPU_1        HGX_FW_GPU_1        97.00.82.00.13       97.00.82.00.19       no          install  good
s03-p1-dgx-01-c06  HGX_FW_GPU_2        HGX_FW_GPU_2        97.00.82.00.13       97.00.82.00.19       no          install  good
s03-p1-dgx-01-c06  HGX_FW_GPU_3        HGX_FW_GPU_3        97.00.82.00.13       97.00.82.00.19       no          install  good
Ensure that the components that are not up to date will be updated to the expected package versions.
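On a rack with many trays, the dry-run table is easier to review when reduced to the rows that will actually change. The filter below is a small sketch: `pending_installs` is a hypothetical name, and it assumes the dry-run rows keep the whitespace-separated column order shown in the example output (Device, Component, Target, Version, Package version, Up to date, Action, Result) with no spaces inside a value.

```shell
# Reads dry-run output on stdin and prints "component: current -> package"
# for every row whose Action column (field 7) is "install".
# Header and separator lines never have "install" in field 7, so they are skipped.
pending_installs() {
  awk 'NF >= 8 && $7 == "install" { print $2 ": " $4 " -> " $5 }'
}

# Example usage (assumes cmsh is available on the head node):
# cmsh -c 'device; firmware flash <PACKAGE> --dry-run -n <DEVICE_NAME>' | pending_installs
```

Comparing the printed package versions against the release notes before starting the real flash keeps the review to a few lines per tray.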
Start the firmware update.
$ cmsh -c 'device; firmware flash nvfw_DGX-GBX00_0023_241223.1.0_custom_prod-signed.fwpkg -n <DEVICE_NAME>'
Once the payload is uploaded to the node, the Result column shows good.
[BCM11-HEAD-01->device]% firmware flash nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg -n s03-p1-dgx-01-c[04-06]
Device             Firmware Package                                         Result
-----------------  -------------------------------------------------------  ------
s03-p1-dgx-01-c04  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  good
s03-p1-dgx-01-c05  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  good
s03-p1-dgx-01-c06  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  good
When the command completes, periodically check the status of the update until it has completed.
This will have a percentage complete while the flashing is ongoing and a “complete” message when the flash has finished.
Example output from firmware status command:
$ cmsh -c 'device; firmware status -n <DEVICE_NAME>'
s03-p1-dgx-01-c06  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  HGX_FW_BMC_0  GB200Nvl-25.01-D  flashing  0.0%  114MiB
s03-p1-dgx-01-c06  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  HGX_FW_CPU_0  02.03.19          flashing  0.0%  114MiB
s03-p1-dgx-01-c06  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  HGX_FW_CPU_1  02.03.19          flashing  0.0%  114MiB
s03-p1-dgx-01-c06  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  HGX_FW_GPU_0  97.00.82.00.13    flashing  0.0%  114MiB
s03-p1-dgx-01-c06  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  HGX_FW_GPU_1  97.00.82.00.13    flashing  0.0%  114MiB
s03-p1-dgx-01-c06  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  HGX_FW_GPU_2  97.00.82.00.13    flashing  0.0%  114MiB
s03-p1-dgx-01-c06  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  HGX_FW_GPU_3  97.00.82.00.13    flashing  0.0%

s03-p1-dgx-01-c06  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  HGX_FW_BMC_0  GB200Nvl-25.01-D -> GB200Nvl-25.01-E  pending  N/A  success: medium-specific reset or dc power cycle or ac power cy+  114MiB
s03-p1-dgx-01-c06  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  HGX_FW_CPU_0  02.03.19 -> 02.03.20                  pending  N/A  success: medium-specific reset or dc power cycle or ac power cy+  114MiB
s03-p1-dgx-01-c06  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  HGX_FW_CPU_1  02.03.19 -> 02.03.20                  pending  N/A  success: medium-specific reset or dc power cycle or ac power cy+  114MiB
s03-p1-dgx-01-c06  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  HGX_FW_GPU_0  97.00.82.00.13 -> 97.00.82.00.19      pending  N/A  success: medium-specific reset or dc power cycle or ac power cy+  114MiB
s03-p1-dgx-01-c06  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  HGX_FW_GPU_1  97.00.82.00.13 -> 97.00.82.00.19      pending  N/A  success: medium-specific reset or dc power cycle or ac power cy+  114MiB
s03-p1-dgx-01-c06  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  HGX_FW_GPU_2  97.00.82.00.13 -> 97.00.82.00.19      pending  N/A  success: medium-specific reset or dc power cycle or ac power cy+  114MiB
s03-p1-dgx-01-c06  nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg  HGX_FW_GPU_3  97.00.82.00.13 -> 97.00.82.00.19      pending  N/A  success: medium-specific reset or dc power cycle or ac power cy+  114MiB
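The periodic status check can be scripted by reducing the status output to a single word. This is an illustrative sketch, not a BCM feature: `flash_state` is a hypothetical name, and it assumes the State column contains the literal words `flashing` and `pending` as in the example output above.

```shell
# Reads `firmware status` output on stdin and prints one word:
#   flashing - at least one row is still being flashed
#   pending  - nothing is flashing, but at least one row awaits a power cycle
#   idle     - neither state was seen
flash_state() {
  awk '
    { for (i = 1; i <= NF; i++) { if ($i == "flashing") f = 1; if ($i == "pending") p = 1 } }
    END { if (f) print "flashing"; else if (p) print "pending"; else print "idle" }
  '
}

# Example polling loop (not executed here; assumes cmsh is on the head node):
# while [ "$(cmsh -c 'device; firmware status -n <DEVICE_NAME>' | flash_state)" = "flashing" ]; do
#   sleep 60
# done
```

Checking for `flashing` first means a tray with mixed rows is still reported as in progress, so the power cycle is not started while any component is being written.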
At the end of the BMC update, the administrator can AC power cycle the compute tray to complete the BMC update, then proceed with updating other components.
Activating firmware using the AC power cycle (AUX_PWR_CYCLE)
Note
The GB200/GB300 compute tray has two levels of power:
Primary (system) power: This is the power supplied to the compute tray CPUs and GPUs. This must be powered off before the aux_cycle process.
Standby (AUX) power: This is the power supplied to the BMC and low-level components. Cycling standby power is an automated process that temporarily removes power from the compute tray, reinitializing all hardware components. The BMC will be unavailable for several minutes during the aux_cycle process. Once completed, the primary power can be toggled on again.
Once both components have completed the firmware update, perform the AC power cycle using either of the two methods below.
Power Cycle Methods#
Two primary methods are available to perform an AC (auxiliary) power cycle of the GB200/GB300 compute tray after firmware updates:
Power Cycle Method 1: AUX_PWR_CYCLE using Redfish#
To perform the power cycle using Redfish API calls directly to the BMC:
From the head node, power down the system:
curl -k -u ${USER}:${PASS} -H "Content-Type: application/json" -X POST -d '{"ResetType": "ForceOff"}' https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset
Perform the AC power cycle (removal of auxiliary power):
curl -k -u ${USER}:${PASS} -H "Content-Type: application/json" -X POST -d '{"ResetType":"AuxPowerCycle"}' https://${BMCIP}/redfish/v1/Chassis/BMC_0/Actions/Oem/NvidiaChassis.AuxPowerReset
After the cycle, power on the system using Redfish:
curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset -d '{"ResetType": "On"}' -X POST
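The three Redfish calls above differ only in path and body, so they can be wrapped in small functions. The following is a sketch under stated assumptions: the function names and the `RUN` switch are hypothetical, the endpoints and bodies are copied from the commands above, and nothing is sent to the BMC unless `RUN=1` is set.

```shell
# redfish_post <path> <json-body>
# Prints the request by default; with RUN=1 it sends it using curl.
# BMCIP, USER, and PASS are expected in the environment, as in the commands above.
redfish_post() {
  local path="$1" body="$2"
  local url="https://${BMCIP}${path}"
  if [ "${RUN:-0}" = "1" ]; then
    curl -k -u "${USER}:${PASS}" -H "Content-Type: application/json" -X POST -d "$body" "$url"
  else
    echo "POST $url $body"
  fi
}

host_force_off() {
  redfish_post /redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset '{"ResetType": "ForceOff"}'
}

aux_power_cycle() {
  redfish_post /redfish/v1/Chassis/BMC_0/Actions/Oem/NvidiaChassis.AuxPowerReset '{"ResetType":"AuxPowerCycle"}'
}

host_power_on() {
  redfish_post /redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset '{"ResetType": "On"}'
}
```

With `BMCIP` exported, calling `host_force_off`, `aux_power_cycle`, and `host_power_on` in that order mirrors the sequence above. The BMC is unreachable for several minutes after the aux cycle, so the power-on call has to wait until it responds again.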
Examples: Powering on nodes using cmsh
While not part of the AC power cycle itself, the following commands can be used to power on nodes after the update process as needed:
Power on a single compute node:
cmsh;device;use <compute node under test>;power on
Power on multiple nodes in a category:
cmsh;device;foreach -c dgx-gb200 (power on)
Power on all nodes in a category:
cmsh;device;power on -c dgx-gb200
Power on specific nodes by name:
cmsh;device;power on -n <specific nodes>
Power Cycle Method 2: BCM “power auxcycle” Command (available in 11.25.08 and later)#
An AC power cycle can also be performed via the BCM command line within the device context.
Ensure the node is powered off:
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
rf0 ...................... [ ON ] dgx-gb200-m06-c1
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power off
rf0 ...................... [ OFF ] dgx-gb200-m06-c1
Note
If the node is still ON when executing the power auxcycle command, an error message will be returned:
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power auxcycle
rf0 ...................... [ FAILED ] dgx-gb200-m06-c1 (System power is not OFF)
After confirming the node is OFF, perform the auxiliary power cycle:
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power auxcycle
rf0 ...................... [AUX CYCLE]
During auxcycle, the BMC will be unavailable for several minutes. “power status” will indicate failure until the process is complete:
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
rf0 ...................... [ FAILED ] dgx-gb200-m06-c1 (Unable to establish session)
When auxcycle completes, the node status will return to OFF:
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power status
rf0 ...................... [ OFF ] dgx-gb200-m06-c1
Power on the node:
[BCM11-HEAD-01->device[dgx-gb200-m06-c1]]% power on
rf0 ...................... [ ON ] dgx-gb200-m06-c1
If issues arise, debug output can help identify the root cause. Run the flash command with the debug options enabled:
$ cmsh -c 'device; firmware flash nvfw_DGX-GBX00_0023_241223.1.0_custom_prod-signed.fwpkg -n <device name> -v --debug'
Method 2: Standalone nvfwupd tool for compute tray#
If the license does not support NVIDIA Mission Control, the built-in cm-nvfwupd command will not work. To use the standalone nvfwupd tool, follow the steps below:
Download the standalone nvfwupd tool from the enterprise support portal. This tool can be used independently of BCM.
Alternatively, install the nvfwupd package from the CUDA apt repository.
Get the correct firmware update packages for the update. To see the full contents of a .fwpkg file, use the show_pkg_content command.
$ ./nvfwupd show_pkg_content -p ./nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg
Get current state of the hardware with show_version.
root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 show_version -p ./nvfw_GB200-P4972_0012_250214.1.0_custom_prod-signed.fwpkg ./nvfw_GB200-P4975_0011_250206.1.1_custom_recovery_prod-signed.fwpkg
System Model: GB200 NVL
Part number: 699-24764-0001-RC1
Serial number: 1334524170073
Packages: ['GB200-P4972_0012_250214.1.0_custom', 'GB200-P4975_0011_250206.1.1_custom_recovery']
Connection Status: Successful
Firmware Devices:
AP Name                   Sys Version               Pkg Version          Up-To-Date
------------------------  ------------------------  -------------------  ----------
CX7_0                     28.43.2108                N/A                  No
CX7_1                     28.43.2108                N/A                  No
CX7_2                     28.43.2108                N/A                  No
CX7_3                     28.43.2108                N/A                  No
FW_BMC_0                  GB200Nvl-25.01-D          GB200Nvl-25.01-E     No
FW_CPLD_0                 0x00 0x0b 0x03 0x04       N/A                  No
FW_CPLD_1                 0x00 0x0b 0x03 0x04       N/A                  No
FW_CPLD_2                 0x00 0x10 0x01 0x0f       N/A                  No
FW_CPLD_3                 0x00 0x10 0x01 0x0f       N/A                  No
FW_ERoT_BMC_0             01.04.0008.0000_n04       01.04.0008.0000_n04  Yes
Full_FW_Image_NIC_Slot_4  32.43.2408                N/A                  No
Full_FW_Image_NIC_Slot_7  32.43.2408                N/A                  No
UEFI                      buildbrain-gcid-39281046  N/A                  No
HGX_FW_BMC_0              GB200Nvl-25.01-D          N/A                  No
HGX_FW_CPLD_0             0.1C                      N/A                  No
HGX_FW_CPU_0              02.03.19                  N/A                  No
HGX_FW_CPU_1              02.03.19                  N/A                  No
HGX_FW_ERoT_BMC_0         01.04.0008.0000_n04       01.03.0196.0001      Yes
HGX_FW_ERoT_CPU_0         01.04.0008.0000_n04       01.03.0196.0001      Yes
HGX_FW_ERoT_CPU_1         01.04.0008.0000_n04       01.03.0196.0001      Yes
HGX_FW_ERoT_FPGA_0        01.04.0008.0000_n04       01.03.0196.0001      Yes
HGX_FW_ERoT_FPGA_1        01.04.0008.0000_n04       01.03.0196.0001      Yes
HGX_FW_FPGA_0             1.20                      N/A                  No
HGX_FW_FPGA_1             1.20                      N/A                  No
HGX_FW_GPU_0              97.00.82.00.13            1.0.61.0             No
HGX_FW_GPU_1              97.00.82.00.13            1.0.61.0             No
HGX_FW_GPU_2              97.00.82.00.13            1.0.61.0             No
HGX_FW_GPU_3              97.00.82.00.13            1.0.61.0             No
HGX_InfoROM_GPU_0         G548.0201.00.06           N/A                  No
HGX_InfoROM_GPU_1         G548.0201.00.06           N/A                  No
HGX_InfoROM_GPU_2         G548.0201.00.06           N/A                  No
HGX_InfoROM_GPU_3         G548.0201.00.06           N/A                  No
HGX_PCIeSwitchConfig_0    01151024                  N/A                  No
-----------------------------------------------------------------------------------------------
Error Code: 0
Create the payload .jsons for the BMC and the compute tray:
Reference: UpdateBMC.json for updating BMC:
{
  "Targets" :[]
}
Reference: UpdateCompute.json for updating HGX:
{
  "Targets" :["/redfish/v1/Chassis/HGX_Chassis_0"]
}
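The two payload files can be generated rather than typed by hand. This is a minimal sketch: `write_update_targets` is a hypothetical helper name, while the file names and JSON contents are the ones given above.

```shell
# write_update_targets [directory]
# Writes UpdateBMC.json and UpdateCompute.json into the given directory
# (default: the current directory).
write_update_targets() {
  local dir="${1:-.}"
  cat > "${dir}/UpdateBMC.json" <<'EOF'
{
  "Targets": []
}
EOF
  cat > "${dir}/UpdateCompute.json" <<'EOF'
{
  "Targets": ["/redfish/v1/Chassis/HGX_Chassis_0"]
}
EOF
}
```

Running `write_update_targets .` next to the nvfwupd binary produces the files that the `-s` option refers to in the following steps.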
Run the BMC update first.
./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 update_fw -s UpdateBMC.json -p ./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg
Power off the system, then do an AUX Power cycle.
./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 activate_fw -c PWR_OFF
#wait 15 seconds
./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 activate_fw -c RF_AUX_PWR_CYCLE
Check if the BMC update was successful.
Reference: Successful BMC update:
root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=<rf0 ip> user=root password=0penBmc servertype=GB200 update_fw -s UpdateBMC.json -p ./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg
Updating ip address: ip=XXXX
FW package: ['./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg']
Ok to proceed with firmware update? <Y/N>
y
{"@odata.id": "/redfish/v1/TaskService/Tasks/3", "@odata.type": "#Task.v1_4_3.Task", "Id": "3", "TaskState": "Running", "TaskStatus": "OK"}
FW update started, Task Id: 3
Wait for Firmware Update to Start...
TaskState: Running
PercentComplete: 20
TaskStatus: OK
TaskState: Running
PercentComplete: 40
TaskStatus: OK
TaskState: Running
PercentComplete: 60
TaskStatus: OK
TaskState: Completed
PercentComplete: 100
TaskStatus: OK
Firmware update successful!
Overall Time Taken: 0:13:01
Refer to 'NVIDIA Firmware Update Document' on activation steps for new firmware to take effect.
----------------------------------------------------------------------
Error Code: 0
Do the full compute tray flash. Ensure that the system is fully up and booted into its OS so that the GPU VBIOS updates can be applied.
./nvfwupd -t ip=<rf0 ip> user=admin password=<bmc password> servertype=GB200 update_fw -s UpdateCompute.json -p ./nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg
As with the BMC update above, power down the system and then do an AUX power cycle.
Power on the machine, let it provision/boot up, then check the firmware level again.
Example output from the show_version command:
root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=10.78.194.13 user=admin password=<bmc password> servertype=GB200 show_version -p ./nvfw_DGX-GBX00_0024_250215.1.0_custom_prod-signed.fwpkg ./nvfw_HGX-GBX00_0023_250223.1.1_custom_prod-signed.fwpkg
System Model: GB200 NVL
Part number: 692-13809-2404-RC1
Serial number: 1330125050101
Packages: ['DGX-GBX00_0024_250215.1.0_custom', 'HGX-GBX00_0023_250223.1.1_custom']
Connection Status: Successful
Firmware Devices:
AP Name                   Sys Version               Pkg Version          Up-To-Date
------------------------  ------------------------  -------------------  ----------
CX7_0                     28.43.2108                N/A                  No
CX7_1                     28.43.2108                N/A                  No
CX7_2                     28.43.2108                N/A                  No
CX7_3                     28.43.2108                N/A                  No
FW_BMC_0                  GB200Nvl-25.01-E          GB200Nvl-25.01-E     Yes
FW_CPLD_0                 0x00 0x0b 0x03 0x04       N/A                  No
FW_CPLD_1                 0x00 0x0b 0x03 0x04       N/A                  No
FW_CPLD_2                 0x00 0x10 0x01 0x0f       N/A                  No
FW_CPLD_3                 0x00 0x10 0x01 0x0f       N/A                  No
FW_ERoT_BMC_0             01.04.0008.0000_n04       01.04.0008.0000_n04  Yes
Full_FW_Image_NIC_Slot_4  32.43.2408                N/A                  No
Full_FW_Image_NIC_Slot_7  32.43.2408                N/A                  No
UEFI                      buildbrain-gcid-39556194  N/A                  No
HGX_FW_BMC_0              GB200Nvl-25.01-E          GB200Nvl-25.01-E     Yes
HGX_FW_CPLD_0             0.1C                      0.1C                 Yes
HGX_FW_CPU_0              02.03.20                  02.03.20             Yes
HGX_FW_CPU_1              02.03.20                  02.03.20             Yes
HGX_FW_ERoT_BMC_0         01.04.0008.0000_n04       01.04.0008.0000_n04  Yes
HGX_FW_ERoT_CPU_0         01.04.0008.0000_n04       01.04.0008.0000_n04  Yes
HGX_FW_ERoT_CPU_1         01.04.0008.0000_n04       01.04.0008.0000_n04  Yes
HGX_FW_ERoT_FPGA_0        01.04.0008.0000_n04       01.04.0008.0000_n04  Yes
HGX_FW_ERoT_FPGA_1        01.04.0008.0000_n04       01.04.0008.0000_n04  Yes
HGX_FW_FPGA_0             1.20                      1.20                 Yes
HGX_FW_FPGA_1             1.20                      1.20                 Yes
HGX_FW_GPU_0              97.00.82.00.19            97.00.82.00.19       Yes
HGX_FW_GPU_1              97.00.82.00.19            97.00.82.00.19       Yes
HGX_FW_GPU_2              97.00.82.00.19            97.00.82.00.19       Yes
HGX_FW_GPU_3              97.00.82.00.19            97.00.82.00.19       Yes
HGX_InfoROM_GPU_0         G548.0201.00.06           N/A                  No
HGX_InfoROM_GPU_1         G548.0201.00.06           N/A                  No
HGX_InfoROM_GPU_2         G548.0201.00.06           N/A                  No
HGX_InfoROM_GPU_3         G548.0201.00.06           N/A                  No
HGX_PCIeSwitchConfig_0    01151024                  N/A                  No
Applying and verifying firmware update success#
After all required firmware is installed, the compute node needs an AC cycle to fully apply the updates. This procedure can be used to bring the nodes down and back up. First connect to the GB200 tray BMC OS, then:
Power off the host.
# Checks that the current status is on
curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0 | jq '."PowerState"'
# Shuts down the OS
# Graceful shutdown:
curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset -d '{"ResetType": "GracefulShutdown"}' -X POST
# Force power off:
curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset -d '{"ResetType": "ForceOff"}' -X POST
AC cycle the node.
curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Chassis/BMC_0/Actions/Oem/NvidiaChassis.AuxPowerReset -d '{"ResetType":"AuxPowerCycleForce"}' -X POST
Wait for the BMC to ping again (should take 2-3 min). Once the BMC pings, bring the host back up.
# Checks that the current status is off (if it is 'on' no further action required)
curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0 | jq '."PowerState"'
#Power On
curl -k -u ${USER}:${PASS} https://${BMCIP}/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset -d '{"ResetType": "On"}' -X POST
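The wait for the BMC to answer again can be automated with a small retry helper. This is a generic sketch: `wait_until` is a hypothetical function, not part of BCM or nvfwupd, and the attempt count and delay in the usage example are illustrative values chosen to cover somewhat more than the 2-3 minutes mentioned above.

```shell
# wait_until <attempts> <delay-seconds> <command> [args...]
# Runs the command until it succeeds or the attempts are used up.
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (not executed here): poll for up to about 5 minutes, then power on.
# ping options vary by platform; -c 1 sends a single probe on Linux.
# wait_until 30 10 ping -c 1 "$BMCIP" && echo "BMC is reachable again"
```

A successful ping only shows that the BMC network stack is up. Checking the Redfish PowerState afterwards, as in the commands above, confirms that the BMC services are also responding.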
When the BMC and host are back up, validate that the firmware install was successful.
$ cmsh -c 'device; firmware status -n <DEVICE_NAME>'
NVLink Switch Firmware Update Process#
For the NVLink Switch, the firmware updates consist of the firmware of the switch itself and the NVOS software.
NVLink Switch tray assumptions#
Non-scale-out design (NVL72x1)—all NVLink ports are connected to MN-NVLink cable cartridge.
All tray interfaces are set to receive IPs through DHCP.
The rack inventory import process or manual entry process must be completed, and all switch entries must appear in the cmsh devices list.
Example: NVLink Switch BCM switch device list
root@BCM11-HEAD-01:~# cmsh -c "device; list -t switch -f hostname:15,mac:20,ip:12,status:11 |grep -i nvsw "
S03-P1-NVSW-01  E0:9D:73:F0:4C:DE  10.78.195.1   [ UP ]+
S03-P1-NVSW-02  E0:9D:73:3F:EB:28  10.78.195.2   [ UP ]+
S03-P1-NVSW-03  E0:9D:73:3F:E7:30  10.78.195.3   [ UP ]+
S03-P1-NVSW-04  E0:9D:73:3F:EA:C8  10.78.195.4   [ UP ]+
S03-P1-NVSW-05  E0:9D:73:3F:E4:F0  10.78.195.5   [ UP ]+
S03-P1-NVSW-06  E0:9D:73:3F:E2:C8  10.78.195.6   [ UP ]+
S03-P1-NVSW-07  E0:9D:73:3F:E2:50  10.78.195.7   [ UP ]+
S03-P1-NVSW-08  E0:9D:73:3F:E5:18  10.78.195.8   [ UP ]+
S03-P1-NVSW-09  E0:9D:73:3F:E4:F8  10.78.195.9   [ UP ]+
S04-P1-NVSW-01  E0:9D:73:F0:41:4E  10.78.195.31  [ UP ]+
S04-P1-NVSW-02  E0:9D:73:F0:59:16  10.78.195.32  [ UP ]+
S04-P1-NVSW-03  E0:9D:73:F0:41:8E  10.78.195.33  [ UP ]+
S04-P1-NVSW-04  E0:9D:73:F0:41:36  10.78.195.34  [ UP ]+
S04-P1-NVSW-05  E0:9D:73:F0:41:A6  10.78.195.35  [ UP ]+
S04-P1-NVSW-06  E0:9D:73:F0:45:36  10.78.195.36  [ UP ]+
S04-P1-NVSW-07  E0:9D:73:F0:4D:7E  10.78.195.37  [ UP ]+
S04-P1-NVSW-08  E0:9D:73:F0:3D:56  10.78.195.38  [ UP ]+
S04-P1-NVSW-09  E0:9D:73:F0:4D:B6  10.78.195.39  [ UP ]+
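Before starting a switch update, the list above can be filtered down to the trays that are not yet UP. This is a small sketch with assumptions: `not_up_switches` is a hypothetical name, and it relies on the status being printed as a bracketed word in the fourth and following fields, as in the example output.

```shell
# Reads the switch list on stdin and prints the hostname of every row
# whose bracketed status word is something other than "UP".
not_up_switches() {
  awk '$4 == "[" && $5 != "UP" { print $1 }'
}

# Example usage (assumes cmsh on the head node):
# cmsh -c "device; list -t switch -f hostname:15,mac:20,ip:12,status:11" | not_up_switches
```

An empty result means every listed switch reported UP, and any hostname that is printed should be investigated before flashing.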
Note
For switches, the cm-lite-daemon needs to be up and running for the switch to appear as [UP].
Example: NVLink Switch BCM switch information
| Parameter | Value |
|---|---|
| Hostname | a05-p1-nvsw-01 |
| IP | 7.241.3.1 |
| Network | ipminet2 |
| Revision | |
| Type | Switch |
| Mac | E0:9D:73:3F:E0:50 |
| Model | |
| Ports | 0 |
| Kind | nvlink |
| Control script | |
| Control script timeout | 5 |
| SNMP Settings | <submode> |
| Lowest port | 1 |
| Uplinks | |
| Disable port detection | yes |
| Disable port mapping | no |
| Activation | Sun, 23 Feb 2025 12:55:30 PST |
| Rack | A05:19 |
| Chassis | < not set > |
| Access Settings | <submode> |
| Priority | 0 |
| VLAN cache time | 5m |
| Has client daemon | yes |
| ZTP Settings | <submode> |
| Subnet manager | no |
| Disable SNMP | yes |
| GUID | 00000000-0000-0000-0000-000000000000 |
| Services | <0 in submode> |
| NV configuration mode | AUTO |
| Members | |
| Management network | ipminet2 |
| Power control | rf0 |
| Custom power script | |
| Custom power script arg | |
| Power distribution units | |
| Default gateway metric | 0 |
| Switch ports | |
| Interfaces | <3 in submode> |
| BMC Settings | <submode> |
| Userdefined1 | |
| Userdefined2 | |
| User defined resources | |
| Supports GNSS | no |
| Custom ping script | |
| Custom ping script arg | |
| Partition base | |
| Part number | |
| Serial number | |
| Notes | <0B> |
| Prometheus metric / forwarders | <0 in submode> |
Example: BCM NVLink Switch interfaces output
[BCM11-HEAD-01->device[B05-P1-NVSW-01]->interfaces]% list
Type      Network device name  IP          Network   Start if
--------  -------------------  ----------  --------  --------
bmc       rf0                  7.241.5.21  ipminet3  always
physical  eth0                 7.241.5.1   ipminet3  always
physical  eth1                 7.241.5.11  ipminet3  always
Every NVLink Switch in the rack is reachable through its BMC and COMe0/COMe1 port IP addresses:
Copper connections confirmed.
Speed/Bandwidth (200G for COMe0 and COMe1).
IP Address assigned by BCM to the COMe0 and COMe1 network (ipminetx).
Logical connectivity (access):
SSH to NVLink switch BMC can be done (default user/pass = root/JulietBmc@123)
SSH to NVOS on each NVLink switch can be done (default user/pass = admin/Juliet1234).
Note
If the NVLink Switch has any issues and the default NVOS password above is not working, try admin/admin.
Method 1: BCM/NVIDIA Mission Control firmware update integrated process for NVLink Switch#
Get a summary of the firmware update files uploaded to BCM from the /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200sw directory. If none exist, upload the flash files to that directory.
Verify the files with this command: cmsh -c 'device; firmware info'. Ensure that all the files show up with the GB200-Switch designation.
Example: Firmware update file list for NVLink Switch devices:
cmsh;device;firmware info \| grep -i GB200-Switch #Or get it from the individual node entry [BCM11-HEAD-01->device[a05-p1-nvsw-09]]% firmware info Device Filename Component Version State Progress Size Date ------------- -------------------------------------------------------- ------------ ------------------------------------ --------- --------- --------- --------------------- BCM11-HEAD-01 nvfw_GB200-P4978_0000_250213.1.0_dbg-signed.fwpkg GB200-Switch GB200-P4978\_0000\_250213.1.0 available N/A 71MiB 2025-02-13, 10:05:51 BCM11-HEAD-01 nvfw_GB200-P4978_0002_250205.1.0_dbg-signed.fwpkg GB200-Switch GB200-P4978\_0002\_250205.1.0 available N/A 16.2MiB 2025-02-05, 15:49:59 BCM11-HEAD-01 nvfw_GB200-P4978_0003_250121.1.2_custom_dbg-signed.fwpkg GB200-Switch GB200-P4978\_0003\_250121.1.2_custom available N/A 1.64MiB 2025-01-21, 13:55:25
Use the firmware status command from the BCM device submenu to find the current firmware levels of the NVLink Switch.
Example: Firmware status command from BCM
#Do for individual node
[BCM11-HEAD-01->device]% firmware status -n a05-p1-nvsw-09
#Do for all nodes
[BCM11-HEAD-01->device]% firmware status -t switch | grep -i nvsw
#Can also pull at the rack level if desired
[BCM11-HEAD-01->device]% firmware status -r <rack location> | grep -i nvsw
Example: Firmware status command output
Device           Filename                         Component        Version              State    Progress Result   Size     Date
---------------- -------------------------------- ---------------- -------------------- -------- -------- -------- -------- --------
a05-p1-nvsw-09                                    ASIC             35.2014.1698         current  N/A               N/A
a05-p1-nvsw-09                                    BIOS             0ACTV_00.01.012      current  N/A               N/A
a05-p1-nvsw-09                                    BMC              88.0002.0956         current  N/A               N/A
a05-p1-nvsw-09                                    CPLD1            CPLD000370_REV0500   current  N/A               N/A
a05-p1-nvsw-09                                    CPLD2            CPLD000377_REV0800   current  N/A               N/A
a05-p1-nvsw-09                                    CPLD3            CPLD000373_REV0800   current  N/A               N/A
a05-p1-nvsw-09                                    CPLD4            CPLD000390_REV0300   current  N/A               N/A
a05-p1-nvsw-09                                    EROT             01.04.0018.0000_n04  current  N/A               N/A
a05-p1-nvsw-09                                    EROT-ASIC1       01.04.0018.0000_n04  current  N/A               N/A
a05-p1-nvsw-09                                    EROT-ASIC2       01.04.0018.0000_n04  current  N/A               N/A
a05-p1-nvsw-09                                    EROT-BMC         01.04.0018.0000_n04  current  N/A               N/A
a05-p1-nvsw-09                                    EROT-CPU         01.04.0018.0000_n04  current  N/A               N/A
a05-p1-nvsw-09                                    EROT-FPGA        01.04.0018.0000_n04  current  N/A               N/A
a05-p1-nvsw-09                                    FPGA             0.1A                 current  N/A               N/A
a05-p1-nvsw-09                                    SSD              CE00A400             current  N/A               N/A
a05-p1-nvsw-09                                    transceiver      N/A                  current  N/A               N/A
Ensure that all NVLink Switch BMCs have their firmware management mode set to gb200sw.
#within CMSH device
foreach -t switch (bmcsettings; get firmwaremanagemode)
#If not set
foreach -n S03-P1-NVSW-[01..09] (bmcsettings; set firmwaremanagemode gb200sw;commit)
To check against the versions in the firmware update file and ascertain if an update is needed, provide the file name in the firmware flash --dry-run command.
#Single Switch
cmsh;device; firmware flash -n s03-p1-nvsw-04 nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg --dry-run
#Multiple Switches
cmsh;device; firmware flash -n S03-P1-NVSW-[01-09] nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg --dry-run
If the changes look correct, then remove the --dry-run switch to apply the updates.
Update the tray level firmware first in this order:
BMC+FPGA+ERoT (Switch BMC bundle).
CPLD1 CPLD2 CPLD3 CPLD4 (Switch CPLD bundle).
SBIOS+EROT (Switch BIOS bundle).
Use the firmware status -n <SWITCH_HOST_NAME> command to check update progress.
Once complete, do a power reset of the NVLink Switch to reboot and activate the new firmware versions.
Example: Firmware status command output
[BCM11-HEAD-01->device]% firmware status -n a18-p1-nvsw-09
Device           Filename                         Component        Version              State      Progress Result              Size     Date
---------------- -------------------------------- ---------------- -------------------- ---------- -------- ------------------- -------- --------
a18-p1-nvsw-09                                    ASIC             35.2015.1686         current    N/A                          N/A
a18-p1-nvsw-09                                    BIOS             0ACTV_00.01.012      current    N/A                          N/A
a18-p1-nvsw-09                                    BMC              88.0002.0956         completed  N/A      success: activated  N/A
a18-p1-nvsw-09                                    CPLD1            CPLD000370_REV0500   current    N/A                          N/A
a18-p1-nvsw-09                                    CPLD2            CPLD000377_REV0800   current    N/A                          N/A
a18-p1-nvsw-09                                    CPLD3            CPLD000373_REV0800   current    N/A                          N/A
a18-p1-nvsw-09                                    CPLD4            CPLD000390_REV0300   current    N/A                          N/A
a18-p1-nvsw-09                                    EROT             01.04.0018.0000_n04  completed  N/A      success: activated  N/A
a18-p1-nvsw-09                                    EROT-ASIC1       01.04.0018.0000_n04  current    N/A                          N/A
a18-p1-nvsw-09                                    EROT-ASIC2       01.04.0018.0000_n04  current    N/A                          N/A
a18-p1-nvsw-09                                    EROT-BMC         01.04.0018.0000_n04  current    N/A                          N/A
a18-p1-nvsw-09                                    EROT-CPU         01.04.0018.0000_n04  current    N/A                          N/A
a18-p1-nvsw-09                                    EROT-FPGA        01.04.0018.0000_n04  current    N/A                          N/A
a18-p1-nvsw-09                                    FPGA             0.1A                 current    N/A                          N/A
a18-p1-nvsw-09                                    SSD              CE00A400             current    N/A                          N/A
a18-p1-nvsw-09                                    transceiver      N/A                  current    N/A                          N/A
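When many switches are updated at once, the firmware status output can be summarized with a small filter. The following is a minimal sketch and not part of BCM: the function name fw_not_settled is hypothetical, and it assumes the whitespace-separated layout of the example output above, where the Filename column is empty so the fields are device, component, version, and state. It prints every component whose State is neither current nor completed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a BCM command): read `firmware status` output on
# stdin and print "device component state" for every row whose State is
# neither "current" nor "completed".
# Assumption: the Filename column is empty, as in the example output, so the
# whitespace-separated fields are: device, component, version, state.
fw_not_settled() {
  awk '
    /^\[/          { next }   # skip the cmsh prompt line
    $1 == "Device" { next }   # skip the header row
    $1 ~ /^-+$/    { next }   # skip the separator row
    NF >= 4 && $4 != "current" && $4 != "completed" { print $1, $2, $4 }
  '
}

# Possible usage from the head node:
#   cmsh -c 'device; firmware status -t switch' | fw_not_settled
```

An empty result suggests that no component is still mid-update. Rows with a populated Filename column would shift the fields, so the filter would need adjusting for that case.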
Method 2: Standalone nvfwupd tool firmware update process for NVLink Switch#
Doing firmware updates with the nvfwupd tool is an alternative method to using the BCM firmware upgrade process. This method is highly manual.
To start, run module load cm-nvfwupd (if the NVIDIA Mission Control enabled license is active); otherwise run the command from the location of the nvfwupd tool.
Assess NVLink Switch firmware levels from the nvfwupd tool.
nvfwupd -t ip=<NVLink Switch COMe0 IP> user=admin password=Juliet@1234 servertype=gb200switch show_version
Compare the NVLink Switch versions found above with the versions in the update package.
# nvfwupd -t ip=<switch IP> user=admin password=Juliet@1234 servertype=gb200switch show_version -p <file to compare version to>
# In this example all three NVLink Switch update files are passed to nvfwupd to compare the versions of all upgradeable components.
root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=<NVLink Switch COMe0 IP> user=admin password=Juliet@1234 servertype=gb200switch show_version -p ~/fw_0.9_releases/switch/nvfw_GB200-P4978_0004_250213.1.0_prod-signed.fwpkg ~/fw_0.9_releases/switch/nvfw_GB200-P4978_0006_250205.1.0_prod-signed.fwpkg ~/fw_0.9_releases/switch/nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg
System Model: N5400_LD
Part number: 920-9K36K-00MV-GS0
Serial number: MT250660041K
Packages: ['GB200-P4978_0004_250213.1.0', 'GB200-P4978_0006_250205.1.0', 'GB200-P4978_0007_250121.1.2_custom']
Connection Status: Successful
Firmware Devices:
AP Name       Sys Version          Pkg Version            Up-To-Date
------------- -------------------- ---------------------- ----------
ASIC          35.2014.1652         N/A                    No
BIOS          0ACTV_00.01.012      00.01.012              Yes
BMC           88.0002.0929         88.0002.0930           No
CPLD1         CPLD000370_REV0500   CPLD000370_REV0500     Yes
CPLD2         CPLD000377_REV0600   CPLD000377_REV0600     Yes
CPLD3         CPLD000373_REV0500   CPLD000373_REV0500     Yes
CPLD4         CPLD000390_REV0200   CPLD000390_REV0200     Yes
EROT          01.04.0008.0000_n04  01.04.0008.0000_n04    Yes
EROT-ASIC1    01.04.0008.0000_n04  01.04.0008.0000_n04    Yes
EROT-ASIC2    01.04.0008.0000_n04  01.04.0008.0000_n04    Yes
EROT-BMC      01.04.0008.0000_n04  01.04.0008.0000_n04    Yes
EROT-CPU      01.04.0008.0000_n04  01.04.0008.0000_n04    Yes
EROT-FPGA     01.04.0008.0000_n04  01.04.0008.0000_n04    Yes
FPGA          0.1A                 0.1A                   Yes
SSD           CE00A400             N/A                    No
transceiver   N/A                  N/A                    No
----------------------------------------------------------------------------
Error Code: 0
Flash the NVLink Switch with the relevant package.
#Replace <NVLink Switch COMe0 IP> with the IP address of the switch
nvfwupd -t ip=<NVLink Switch COMe0 IP> user=admin password=Juliet@1234 servertype=gb200switch update_fw -p /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200sw/nvfw_GB200-P4978_0000_241217.1.0_dbg-signed.fwpkg
Update the tray level firmware first in this order:
BMC+FPGA+ERoT (Switch BMC bundle).
CPLD1 CPLD2 CPLD3 CPLD4 (Switch CPLD bundle).
SBIOS+EROT (Switch BIOS bundle).
After a BMC update, the switch will need an AUX power cycle.
Reference: NVLink Switch AUX Power Cycle using the nvfwupd tool:
root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=10.78.195.1 user=admin password=Juliet@1234 servertype=gb200switch activate_fw -c NVUE_PWR_CYCLE
Power cycle task was created with ID 4
Status for Job Id 4: {'detail': 'File delete successfully', 'http_status': 200, 'issue': [], 'percentage': '', 'state': 'running', 'status': 'File delete successfully', 'timeout': 5, 'type': '', 'warnings': []}
Note
The CPLD and SBIOS versions can be updated sequentially without a power cycle between them. The firmware update command will automatically trigger an AC cycle on the next reboot.
After reboot, check the firmware versions to ensure the update is complete.
Reference: NVLink Switch Successful BMC Update
root@BCM11-HEAD-01:~/nvfwup/release files v2.0.5/aarch64# ./nvfwupd -t ip=<NVLink Switch COMe0 IP> user=admin password=Juliet@1234 servertype=gb200switch show_version -p ~/fw_0.9_releases/switch/nvfw_GB200-P4978_0004_250213.1.0_prod-signed.fwpkg ~/fw_0.9_releases/switch/nvfw_GB200-P4978_0006_250205.1.0_prod-signed.fwpkg ~/fw_0.9_releases/switch/nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg
System Model: N5400_LD
Part number: 920-9K36K-00MV-GS0
Serial number: MT250660041K
Packages: ['GB200-P4978_0004_250213.1.0', 'GB200-P4978_0006_250205.1.0', 'GB200-P4978_0007_250121.1.2_custom']
Connection Status: Successful
Firmware Devices:
AP Name       Sys Version          Pkg Version            Up-To-Date
------------- -------------------- ---------------------- ----------
ASIC          35.2014.1652         N/A                    No
BIOS          0ACTV_00.01.012      00.01.012              Yes
BMC           88.0002.0930         88.0002.0930           Yes
CPLD1         CPLD000370_REV0500   CPLD000370_REV0500     Yes
CPLD2         CPLD000377_REV0600   CPLD000377_REV0600     Yes
CPLD3         CPLD000373_REV0500   CPLD000373_REV0500     Yes
CPLD4         CPLD000390_REV0200   CPLD000390_REV0200     Yes
EROT          01.04.0008.0000_n04  01.04.0008.0000_n04    Yes
EROT-ASIC1    01.04.0008.0000_n04  01.04.0008.0000_n04    Yes
EROT-ASIC2    01.04.0008.0000_n04  01.04.0008.0000_n04    Yes
EROT-BMC      01.04.0008.0000_n04  01.04.0008.0000_n04    Yes
EROT-CPU      01.04.0008.0000_n04  01.04.0008.0000_n04    Yes
EROT-FPGA     01.04.0008.0000_n04  01.04.0008.0000_n04    Yes
FPGA          0.1A                 0.1A                   Yes
SSD           CE00A400             N/A                    No
transceiver   N/A                  N/A                    No
------------------------------------------------------------------------
Error Code: 0
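To decide which bundles actually need flashing, the show_version -p comparison table can be reduced to the rows that matter. This is a minimal sketch under the assumption that the table keeps the four-column layout shown in the examples (AP Name, Sys Version, Pkg Version, Up-To-Date); the function name fw_needs_update is hypothetical and is not an nvfwupd feature. It prints components that are not up to date and for which the supplied packages carry a version, so entries such as ASIC or SSD with a Pkg Version of N/A are ignored.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `nvfwupd ... show_version -p <packages>` output on
# stdin and print "component: installed -> packaged" for each component that
# is not up to date and is covered by one of the supplied packages.
# Assumption: table rows have exactly four whitespace-separated fields.
fw_needs_update() {
  awk 'NF == 4 && $4 == "No" && $3 != "N/A" { print $1 ": " $2 " -> " $3 }'
}

# Possible usage:
#   nvfwupd -t ip=<NVLink Switch COMe0 IP> user=admin password=<password> \
#     servertype=gb200switch show_version -p <packages> | fw_needs_update
```

With the first example output in this method, the filter would report only the BMC row, which matches the bundle that is then flashed.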
Method 3: Firmware updates within NVOS for NVLink Switch#
If the installed license does not support the NVIDIA Mission Control feature but updates are still required, they can be performed from NVOS itself.
Assess NVLink Switch firmware levels from the NVOS.
$ nv show platform firmware
Example: Login to NVLink Switch and get firmware software version info:
#Firmware
admin@S04-P1-NVSW-01:~$ nv show platform firmware
Name          Actual FW            Part Number                     FW Source
------------- -------------------  ------------------------------  ----------
ASIC          35.2014.1652         920-9K36W-00MV-GS0_Ax           default
BIOS          0ACTV_00.01.012      N/A                             N/A
BMC           88.0002.0929         692-13809-1404-000              N/A
CPLD1         CPLD000370_REV0500   0x0172                          N/A
CPLD2         CPLD000377_REV0600   0x0179                          N/A
CPLD3         CPLD000373_REV0500   0x0175                          N/A
CPLD4         CPLD000390_REV0200   0x0186                          N/A
EROT          01.04.0008.0000_n04  N/A                             N/A
EROT-ASIC1    01.04.0008.0000_n04  N/A                             N/A
EROT-ASIC2    01.04.0008.0000_n04  N/A                             N/A
EROT-BMC      01.04.0008.0000_n04  N/A                             N/A
EROT-CPU      01.04.0008.0000_n04  N/A                             N/A
EROT-FPGA     01.04.0008.0000_n04  N/A                             N/A
FPGA          0.1A                 N/A                             N/A
SSD           CE00A400             Virtium VTPM24CEXI080-BM110006  N/A
transceiver   N/A                  N/A                             N/A
Note
The CPLD archive is built into a .fwpkg package file type. To perform a CPLD upgrade on the NVLink Switch, unpack this file to obtain the required .vme file.
Download the NVIDIA fwpkg-unpack tool using the Enterprise Support Portal ID 1090243.
Use the following command to unpack the CPLD .fwpkg using the fwpkg-unpack tool:
$ ./fwpkg-unpack --unpack nvfw_GB200-P4978_0007_250121.1.2_custom_prod-signed.fwpkg
Note
A new CPLD file is extracted with a .bin file extension. Rename the file to have a .vme extension.
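The rename step in the note above can be scripted so that it is not forgotten before the file is fetched by NVOS. This is a minimal sketch: the function name cpld_bin_to_vme is hypothetical, and the actual name of the extracted file depends on the package, so it is passed in as an argument rather than assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rename an unpacked CPLD image from .bin to .vme and
# print the new path. Refuses anything that does not end in .bin.
cpld_bin_to_vme() {
  src="$1"
  case "$src" in
    *.bin) ;;
    *) echo "expected a .bin file: $src" >&2; return 1 ;;
  esac
  dst="${src%.bin}.vme"            # strip the .bin suffix, append .vme
  mv -- "$src" "$dst" && printf '%s\n' "$dst"
}

# Possible usage after running fwpkg-unpack:
#   cpld_bin_to_vme <extracted CPLD file>.bin
```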
BMC firmware update and Reboot (BMC + FPGA + ERoT).
$ nv action fetch platform firmware BMC 'scp://root:nvis1234!@192.168.255.254/var/www/html/nvswitch/images/0.9.03/nvfw_GB200-P4978_0004_250226.1.0_prod-signed.fwpkg'
$ nv action install platform firmware BMC files nvfw_GB200-P4978_0004_250226.1.0_prod-signed.fwpkg force
Note
A system power cycle must be performed to force the BMC to load the new firmware version.
$ nv action power-cycle system force
CPLD firmware update and skip-reboot (CPLD1 CPLD2 CPLD3 CPLD4).
$ nv action fetch platform firmware CPLD1 'scp://root:nvis1234!@192.168.255.254/var/www/html/nvswitch/images/0.9.03/CPLD_Prod_000370_REV0500_000377_REV0600_000373_REV0500_000390_REV0200_4717c08d_image.vme'
$ nv action install platform firmware CPLD1 files CPLD_Prod_000370_REV0500_000377_REV0600_000373_REV0500_000390_REV0200_4717c08d_image.vme force skip-reboot
BIOS firmware upgrade and skip-reboot (SBIOS + ERoT).
$ nv action fetch platform firmware BIOS 'scp://root:nvis1234!@192.168.255.254/var/www/html/nvswitch/images/0.9.03/nvfw_GB200-P4978_0006_250205.1.0_prod-signed.fwpkg'
$ nv action install platform firmware BIOS files nvfw_GB200-P4978_0006_250205.1.0_prod-signed.fwpkg force skip-reboot
NVLink Switch: Updating NVOS#
NVOS updates that are not handled by BCM ZTP automation must be done on the NVLink Switch itself, from NVOS.
Get the NVOS version by connecting over SSH as the admin user of the NVLink Switch and then running the nv show system version command.
#OS Software
admin@S04-P1-NVSW-01:~$ nv show system version
            operational
----------  ----------------------------
kernel      5.10.0-30-2-amd64
build-date  Sun Feb 9 18:12:03 UTC 2025
image       nvos-25.02.1877
onie        2023.11-5.3.0012-115200
To install a new version of the NVOS, get the binary onto the host:
Use scp to get the binary to the switch and save the file in /host/nvos-images/.
Or use the fetch command from NVOS to pull the .bin file.
$ nv action fetch system image 'scp://root:nvis1234!@192.168.255.254/var/www/html/nvswitch/images/0.9.03/nvos-amd64-25.02.1884.bin'
Check system images that are present.
$ nv show system image
            operational
----------  ---------------
current     nvos-25.02.1877
next        nvos-25.02.1877
partition1  nvos-25.02.1754
partition2  nvos-25.02.1877
Uninstall old images.
# Remove extra NVOS version image installed if present
$ nv action uninstall system image
admin@S03-P1-NVSW-07:~$ nv action uninstall system image
Action executing ...
Uninstalling image: nvos-25.02.1754
Action executing ...
Image nvos-25.02.1754 uninstalled successfully
Action succeeded
Install the new image.
After the installation is complete, the switch will automatically reboot into the updated OS.
$ nv action install system image files nvos-amd64-25.02.1879.bin
The operation will install the image and initiate a reboot.
Type [y] to install the image and reboot.
Type [N] to abort.
Do you want to continue? [y/N] y
Action executing ...
Installing image: nvos-amd64-25.02.1879.bin
Action executing ...
Performing reboot ...
Action executing ...
Disconnecting from NVOS, system is offline during reboot
Connection to s03-p1-nvsw-07 closed by remote host.
Connection to s03-p1-nvsw-07 closed.
When the switch OS comes back up after the reboot, check that the new OS version was applied using nv show system image.
$ nv show system image
            operational
----------  ---------------
current     nvos-25.02.1879
next        nvos-25.02.1879
partition1  nvos-25.02.1877
partition2  nvos-25.02.1879
Check that the cluster apps are running on the switch that has been designated as the NMX-C master (typically NVSW-01).
$ nv show cluster apps
Name           ID            Version                Capabilities                                        Components Version                                               Status Reason Additional Information         Summary
-------------- ------------- ---------------------- --------------------------------------------------- ---------------------------------------------------------------- ------ ------ ------------------------------ -------
nmx-controller nmx-c-nvos    0.9.0_2025-02-11_09-49 sm, gfm, fib, gw-api                                sm:2025.01.5, gfm:R570.120, fib-fe:0.9.0                         ok            CONTROL_PLANE_STATE_CONFIGURED
nmx-telemetry  nmx-telemetry 0.9.5                  nvl telemetry, gnmi aggregation, syslog aggregation nvl-telemetry:1.20.1, gnmi-aggregator:1.0.1, nmx-connector:1.0.1 ok
If this returns No data and the switch is not the NMX-C master node, no further action is required. However, if the NVLink Switch is the master, the apps need to be configured within NVOS:
Start cluster apps.
nv set cluster state enabled
nv config apply
nv config save
nv show cluster apps
If the NMX controller (NMX-C) status is not ok and it reports CONTROL_PLANE_STATE_UNCONFIGURED, the fm_config.cfg file may need to be applied per the section where the fm_config.cfg file is generated.
$ nv show cluster apps
Name           ID            Version                Capabilities                                        Components Version                                               Status Reason   Additional Information           Summary
-------------- ------------- ---------------------- --------------------------------------------------- ---------------------------------------------------------------- ------ -------- -------------------------------- -------
nmx-controller nmx-c-nvos    0.9.0_2025-02-25_16-53 sm, gfm, fib, gw-api                                sm:2025.01.6, gfm:R570.124.02, fib-fe:0.9.0                      not ok NMXC: OK CONTROL_PLANE_STATE_UNCONFIGURED
Re-run the litedaemon installation tool within BCM in order for the switch to show “UP”. Sometimes after a new NVOS installation, the default factory password gets reset to admin. Log in with admin/admin, set the password to Juliet@1234, and then try again.
Example: NVOS default state, password reset:
NVOS switch
admin@s03-p1-nvsw-04's password:
You are required to change your password immediately (administrator enforced).
███╗   ██╗██╗   ██╗ ██████╗ ███████╗
████╗  ██║██║   ██║██╔═══██╗██╔════╝
██╔██╗ ██║██║   ██║██║   ██║███████╗
██║╚██╗██║╚██╗ ██╔╝██║   ██║╚════██║
██║ ╚████║ ╚████╔╝ ╚██████╔╝███████║
╚═╝  ╚═══╝  ╚═══╝   ╚═════╝ ╚══════╝
Last login: Fri Mar 21 08:58:02 UTC 2025 from 10.78.192.25 on pts/0
Last failed login: Fri Mar 21 10:02:38 UTC 2025 from 10.78.192.25 on ssh:notty
There was 1 failed login attempt since the last successful login.
WARNING: Your password has expired.
You must change your password now!
New password:
Retype new password:
applied [rev_id: 1]
Number of total successful connections since last 1 days: 3
Your password has been changed since last login
Note
A pause is expected after you have reset the password.
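The post-install check of nv show system image described in this section can be turned into a pass/fail test, which is convenient when several switches are upgraded in a row. This is a minimal sketch: the function name nvos_image_is is hypothetical, and it assumes the two-column output format shown in the examples. Running nv non-interactively over SSH, as in the usage comment, is also an assumption that should be confirmed on the target NVOS release.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `nv show system image` output on stdin and succeed
# only if both the "current" and "next" rows equal the expected image name.
nvos_image_is() {
  expected="$1"
  awk -v want="$expected" '
    $1 == "current" { cur = $2 }
    $1 == "next"    { nxt = $2 }
    END { exit !(cur == want && nxt == want) }   # exit 0 only on a full match
  '
}

# Possible usage (assumes `nv` can be run over a non-interactive SSH session):
#   ssh admin@<switch> nv show system image | nvos_image_is nvos-25.02.1879 \
#     && echo "image OK" || echo "image mismatch"
```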
Power Shelf Firmware Update Process#
There are several vendors for power shelves on DGX GB200/GB300 NVL72 systems. The following instructions are for shelves made by Delta.
Flash the PMC with the latest version.
The response will contain the task number.
$ curl -k -u admin:password -H "Content-Type: application/octet-stream" -X POST -T <FIRMWARE_FILE> https://<BMC_IP>/redfish/v1/UpdateService/update
Verify that the flash is completed.
$ curl -k -u admin:password -X GET https://<BMC_IP>/redfish/v1/TaskService/Tasks/<TASK_NUMBER>
Check the PMC version.
$ curl -k -u admin:password -X GET https://<BMC_IP>/redfish/v1/Managers/PMC_0
Complete a PSU update by flashing the PSU with the latest firmware.
Repeat Steps 1 and 2, but use the PSU firmware image as the <FIRMWARE_FILE>.
Run the following command to check the PSU version and health from the FirmwareVersion, Status, and Health parameters in the output.
$ curl -k -u admin:password -X GET https://<BMC_IP>/redfish/v1/Chassis/chassis/PowerSubsystem/PowerSupplies/<PS_NUMBER>
Note
A PSU firmware update will temporarily power off the PSU, so it is recommended that the rack is idle during the PSU update process.
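The "verify that the flash is completed" step above amounts to polling the task URL until it reports a final state. The following is a minimal sketch under stated assumptions: it assumes the power shelf returns a standard Redfish Task resource with a TaskState property (the response format of the Delta shelf is not shown in this document), and the function names redfish_task_state and wait_for_task, the 10-second interval, and the argument order are all hypothetical.

```shell
#!/usr/bin/env bash
# Extract the "TaskState" value from a Redfish Task JSON document on stdin.
# Assumption: the response is a standard Redfish Task resource. A JSON-aware
# tool such as jq is more robust where it is available.
redfish_task_state() {
  tr -d '\n' |
    sed -n 's/.*"TaskState"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Hypothetical polling loop: wait_for_task <BMC_IP> <TASK_NUMBER> <user:password>
wait_for_task() {
  bmc_ip="$1"; task="$2"; creds="$3"
  while :; do
    state="$(curl -k -s -u "$creds" \
      "https://$bmc_ip/redfish/v1/TaskService/Tasks/$task" | redfish_task_state)"
    case "$state" in
      Completed) echo "task $task: $state"; return 0 ;;
      New|Starting|Running|Pending) sleep 10 ;;      # still in progress
      *) echo "task $task: ${state:-unknown}" >&2; return 1 ;;
    esac
  done
}
```

Any state other than the in-progress ones and Completed is treated as a failure here, so the full task response should be inspected manually in that case.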
BlueField and CX7 FW Update Process#
Prior to installation, copy the binary to the compute tray host or use a shared directory. The binary naming format should look like:
fw-ConnectX7-rel-28_42_1270-900-24768-0002_Ax-UEFI-14.35.15-FlexBoot-3.7.500.signed.bin
The general steps to install NVIDIA networking firmware are as follows:
Start the MST service.
$ mst start
Query the devices to find the /dev/mst paths of the devices.
$ mst status -v
Read the current version of firmware on a given device.
$ flint -d /dev/mst/mt4129_pciconf0 q full
Flash the firmware on the device.
Change to the directory where the firmware binary is stored.
$ flint -d /dev/mst/mt4129_pciconf0 -i fw-ConnectX7-rel-28_42_1270-900-24768-0002_Ax-UEFI-14.35.15-FlexBoot-3.7.500.signed.bin burn
Repeat this for all four CX7 devices.
Reset the CX7 and reboot the host.
$ mlxfwreset -d mlx5_0 reset
Reference: device mapping as listed by mst status -v:
PCI devices:
------------
DEVICE_TYPE       MST                       PCI           RDMA    NET            NUMA
ConnectX7(rev:0)  /dev/mst/mt4129_pciconf0  0000:03:00.0  mlx5_0  net-ibp3s0     0
ConnectX7(rev:0)  /dev/mst/mt4129_pciconf1  0002:03:00.0  mlx5_1  net-ibP2p3s0   0
ConnectX7(rev:0)  /dev/mst/mt4129_pciconf2  0010:03:00.0  mlx5_4  net-ibP16p3s0  1
ConnectX7(rev:0)  /dev/mst/mt4129_pciconf3  0012:03:00.0  mlx5_5  net-ibP18p3s0  1
For BlueField 3, the process is the same with the exception of the device being /dev/mst/mt41692.
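Before flashing, it can help to know which firmware version a binary carries so that it can be compared with what flint reports for the device. This is a minimal sketch: the function name fw_version_from_file is hypothetical, and the mapping from the rel-XX_YY_ZZZZ part of the file name to a dotted version is an assumption based only on the file names used in this document.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive a dotted version string from a firmware file
# name of the form fw-<product>-rel-XX_YY_ZZZZ-<...>.bin.
# Prints nothing if the name does not follow that pattern.
fw_version_from_file() {
  basename "$1" |
    sed -n 's/^fw-.*-rel-\([0-9][0-9]*\)_\([0-9][0-9]*\)_\([0-9][0-9]*\)-.*$/\1.\2.\3/p'
}

# Possible usage:
#   fw_version_from_file "$basedir/$cx7file"
#   flint -d /dev/mst/mt4129_pciconf0 q full     # compare with the reported version
```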
Combined CX-7 and BlueField Update#
$ pdsh -g category=dgx-gb200 '/home/nvis/(dir where the firmware update is)/nicupdate.sh > /home/nvis/(dir where the firmware update is)/$(hostname)_fw_upgrade_$(date +'%Y%m%d-%H%M%S').log'
Reference script: NIC updates (Both BF3 and CX-7) - nicupdate.sh:
# Start the MST service
mst start
# Query the current CX-7 firmware
flint -d /dev/mst/mt4129_pciconf0 q full
flint -d /dev/mst/mt4129_pciconf1 q full
flint -d /dev/mst/mt4129_pciconf2 q full
flint -d /dev/mst/mt4129_pciconf3 q full
# Query the current BlueField 3 firmware
flint -d /dev/mst/mt41692_pciconf0 q full
flint -d /dev/mst/mt41692_pciconf1 q full
# Firmware image locations
basedir=/home/<USERNAME>/fw_0.9_releases/mellanox
bf3file=fw-BlueField-3-rel-32_43_2408-900-9D3B6-00CN-P_Ax-NVME-20.4.1-UEFI-21.4.13-UEFI-22.4.14-UEFI-14.36.21-FlexBoot-3.7.500.signed.bin
cx7file=fw-ConnectX7-rel-28_43_2110-900-24768-0002_Ax-UEFI-14.36.21-FlexBoot-3.7.500.signed.bin
# CX-7 update
yes | flint -d /dev/mst/mt4129_pciconf0 -i "$basedir/$cx7file" b
yes | flint -d /dev/mst/mt4129_pciconf1 -i "$basedir/$cx7file" b
yes | flint -d /dev/mst/mt4129_pciconf2 -i "$basedir/$cx7file" b
yes | flint -d /dev/mst/mt4129_pciconf3 -i "$basedir/$cx7file" b
# BlueField 3 update
yes | flint -d /dev/mst/mt41692_pciconf0 -i "$basedir/$bf3file" b
yes | flint -d /dev/mst/mt41692_pciconf1 -i "$basedir/$bf3file" b
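The reference script hard-codes one flint line per device. A variant that chooses the image from the device name can be previewed before anything is burned. This is a minimal sketch: the function name flint_plan is hypothetical, it only prints commands and does not run them, and it relies on the mt4129 (CX-7) and mt41692 (BlueField 3) device name prefixes used in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) the flint burn command for each MST
# device path given, choosing the image by device name prefix.
#   flint_plan <cx7 image> <bf3 image> <device>...
flint_plan() {
  cx7file="$1"; bf3file="$2"; shift 2
  for dev in "$@"; do
    case "$(basename "$dev")" in
      mt4129_pciconf*)  img="$cx7file" ;;   # ConnectX-7
      mt41692_pciconf*) img="$bf3file" ;;   # BlueField 3
      *) echo "skipping unknown device: $dev" >&2; continue ;;
    esac
    printf 'flint -d %s -i %s b\n' "$dev" "$img"
  done
}

# Possible usage, after `mst start`:
#   flint_plan "$basedir/$cx7file" "$basedir/$bf3file" /dev/mst/mt*_pciconf*
```

Printing the plan first makes it easy to confirm that every device is paired with the intended image before running the burn commands.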
DGX OS Update#
Compatible drivers and software packages need to be installed to align with the new firmware.
Clone OS image in BCM.
Boot one node with new image.
Install MFT, DOCA, NVIDIA driver package.
# make sure the external repo is pointed to for doca packages
$ cat /etc/apt/sources.list.d/doca.source
Types: deb
URIs: https://linux.mellanox.com/public/repo/doca/DGX_GBxx_latest_DOCA/ubuntu24.04/arm64-sbsa/
Suites: /
Signed-By: /usr/share/keyrings/GPG-KEY-Mellanox.gpg
# Install doca package
$ sudo apt-get update
$ sudo apt install doca-all
# Install driver package
$ sudo dpkg -i nvidia-driver-local-repo-ubuntu2404-570.158.01_1.0-1_arm64.deb
$ sudo cp /var/nvidia-driver-local-repo-ubuntu2404-570.158.01/nvidia-driver-local-5778B6CA-keyring.gpg /usr/share/keyrings/
$ sudo mv /etc/apt/sources.list.d/cuda-compute-repo.sources /etc/apt/sources.list.d/cuda-compute-repo.sources.disabled
$ sudo apt update
$ sudo apt install nvidia-driver-570-open
$ sudo apt-get install nvidia-imex-570
$ sudo apt-get install nvidia-fabricmanager-570
$ sudo apt-get install libnvidia-nscq-570
# Check doca packages
$ sudo dpkg -l | grep 2.10.0-093520
# Check driver package
$ sudo dpkg -l | grep 570.158
Save changes into the image.
Reboot compute node into new image with AUTO install.
Set all nodes to boot from new image and reboot.
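Before saving the image and rolling it out to all nodes, the dpkg -l checks used in the steps above can be made stricter than a plain grep. This is a minimal sketch: the function name pkg_version_matches is hypothetical, and it assumes the usual dpkg -l row layout of status, name, version, architecture, and description.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `dpkg -l` output on stdin and succeed only if the
# named package is installed (status "ii") with a version that starts with
# the given prefix.
#   pkg_version_matches <package> <version prefix>
pkg_version_matches() {
  pkg="$1"; prefix="$2"
  awk -v pkg="$pkg" -v prefix="$prefix" '
    $1 == "ii" {
      name = $2
      sub(/:.*/, "", name)                      # drop any ":arch" suffix
      if (name == pkg && index($3, prefix) == 1) found = 1
    }
    END { exit !found }
  '
}

# Possible usage on the node booted from the new image:
#   dpkg -l | pkg_version_matches nvidia-driver-570-open 570.158 || echo "driver mismatch"
```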
Operational Security Requirements#
Factory Reset after Debug Token Usage
After using one or more Debug Tokens on the compute tray, the operator must remove the Debug Tokens and factory reset the non-volatile storage of the BMC, HMC, and CPU. The following Redfish APIs provide the factory reset functionality:
Resetting the HMC R/W filesystem:
curl -k -u $USER:$PASS -X POST https://${TARGET_HOSTNAME}/redfish/v1/Managers/HGX_BMC_0/Actions/Manager.ResetToDefaults -d '{"ResetToDefaultsType": "ResetAll"}'
Erasing the HMC eMMC:
curl -k -u $USER:$PASS -X POST https://${TARGET_HOSTNAME}/redfish/v1/Managers/HGX_BMC_0/Actions/Oem/eMMC.SecureErase
Erasing the Grace “R/W” SPI flashes (Perform on both Grace modules):
curl -k -u $USER:$PASS -X POST https://${TARGET_HOSTNAME}/redfish/v1/Chassis/HGX_ProcessorModule_0/Actions/Oem/NvidiaProcessor.VariableSpiErase
curl -k -u $USER:$PASS -X POST https://${TARGET_HOSTNAME}/redfish/v1/Chassis/HGX_ProcessorModule_1/Actions/Oem/NvidiaProcessor.VariableSpiErase
Resetting the BMC R/W filesystem:
curl -k -u $USER:$PASS -X POST https://${TARGET_HOSTNAME}/redfish/v1/Managers/BMC_0/Actions/Manager.ResetToDefaults -d '{"ResetToDefaultsType": "ResetAll"}'
Erasing the BMC eMMC:
curl -k -u $USER:$PASS -X POST https://${TARGET_HOSTNAME}/redfish/v1/Managers/BMC_0/Actions/Oem/eMMC.SecureErase
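Because the factory reset involves six separate calls, it can be useful to print the full set for a given target and review it before running anything. This is a minimal sketch: the function names factory_reset_targets and factory_reset_plan are hypothetical, the endpoints and request bodies are copied from the commands above, and the order simply follows the order in which they are listed. Nothing is executed; the commands are only printed.

```shell
#!/usr/bin/env bash
# Redfish factory-reset targets from this section, one per line, formatted as
# "<path>|<JSON body or empty>".
factory_reset_targets() {
  cat <<'EOF'
/redfish/v1/Managers/HGX_BMC_0/Actions/Manager.ResetToDefaults|{"ResetToDefaultsType": "ResetAll"}
/redfish/v1/Managers/HGX_BMC_0/Actions/Oem/eMMC.SecureErase|
/redfish/v1/Chassis/HGX_ProcessorModule_0/Actions/Oem/NvidiaProcessor.VariableSpiErase|
/redfish/v1/Chassis/HGX_ProcessorModule_1/Actions/Oem/NvidiaProcessor.VariableSpiErase|
/redfish/v1/Managers/BMC_0/Actions/Manager.ResetToDefaults|{"ResetToDefaultsType": "ResetAll"}
/redfish/v1/Managers/BMC_0/Actions/Oem/eMMC.SecureErase|
EOF
}

# Hypothetical helper: print (do not run) the curl command for each target.
#   factory_reset_plan <TARGET_HOSTNAME>
factory_reset_plan() {
  host="$1"
  factory_reset_targets | while IFS='|' read -r path body; do
    if [ -n "$body" ]; then
      printf "curl -k -u \$USER:\$PASS -X POST https://%s%s -d '%s'\n" "$host" "$path" "$body"
    else
      printf 'curl -k -u $USER:$PASS -X POST https://%s%s\n' "$host" "$path"
    fi
  done
}

# Possible usage:
#   factory_reset_plan "$TARGET_HOSTNAME"
```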