Firmware Update Steps
Before You Begin
Stop all unnecessary system activity before you begin the firmware update.
Stop all GPU activity, including running the nvidia-smi command. GPU activity and running the command can prevent the VBIOS update.
Do not add load to the system, such as user jobs, diagnostics, or monitoring services, while an update is in progress. A high workload can disrupt the firmware update process and result in an unusable component.
When you begin the firmware update, the update software checks the activity state of the DGX system and warns you if it detects that activity levels are above a predetermined threshold. If you encounter the warning, reduce the workload before proceeding with the firmware update.
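The update software performs its own activity check, but a quick look at the load average beforehand can flag an obviously busy system early. The following is a minimal sketch, assuming a Linux host that exposes /proc/loadavg. The function name load_below and the limit passed to it are illustrative choices, not the threshold used by the update software.

```shell
# load_below LIMIT [FILE]
# Succeeds when the 1-minute load average is below LIMIT.
# LIMIT is an operator-chosen value (an assumption, not an NVIDIA threshold).
# FILE defaults to /proc/loadavg, whose first field is the 1-minute average.
load_below() {
    local limit="$1"
    local source_file="${2:-/proc/loadavg}"
    awk -v limit="$limit" 'NR == 1 { ok = ($1 < limit) } END { exit !ok }' "$source_file"
}

# Example:
#   load_below 1.0 && echo "system looks idle" || echo "reduce the workload first"
```

A low load average does not prove that the GPUs are idle, so treat this only as a first filter before the check that the update software runs.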
Fan speeds can increase during the BMC firmware update. This increase in speed is a normal part of the BMC firmware update process.
Update Duration
Updating the firmware on the motherboard tray components and the GPU tray components requires approximately 90 minutes. Updating the firmware on the ConnectX-7 devices requires approximately 30 minutes.
Update Steps
View the installed versions compared with the newly available firmware:
nvfwupd --target ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> \
    show_version -p nvfw_DGXH100_231206.1.0.fwpkg \
    nvfw_HGX_DGXH100_231101.1.0.fwpkg
Update the BMC.
Create a file, such as update_bmc.json, with the following contents:
{"Targets": ["/redfish/v1/UpdateService/FirmwareInventory/HostBMC_0"]}
Run the following command to update the BMC:
nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> update_fw \
    -p nvfw_DGXH100_231206.1.0.fwpkg -y -s update_bmc.json
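The targets files used in this procedure all share the same shape: a JSON object with a single Targets list. As a sketch only, a small helper can generate them so that the Redfish path is typed once. The function name write_targets_file is hypothetical and is not part of nvfwupd.

```shell
# write_targets_file FILE [TARGET...]
# Writes {"Targets":[...]} to FILE. With no TARGET arguments the list is empty.
# Hypothetical helper for illustration; targets must not contain double quotes.
write_targets_file() {
    local file="$1"
    shift
    local list="" target
    for target in "$@"; do
        # Join quoted targets with commas.
        list="${list:+$list,}\"$target\""
    done
    printf '{"Targets":[%s]}\n' "$list" > "$file"
}

# Example for the BMC step:
#   write_targets_file update_bmc.json \
#       /redfish/v1/UpdateService/FirmwareInventory/HostBMC_0
#   nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> update_fw \
#       -p nvfw_DGXH100_231206.1.0.fwpkg -y -s update_bmc.json
```

Calling the helper with only a file name produces the empty Targets list that the motherboard tray step uses.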
Reset the BMC so that the new firmware is used after the next reboot:
# If you have a shell on the system
$ sudo ipmitool mc reset cold
# If you are logged in to a different system
$ ipmitool -H <bmc-ip-address> -I lanplus -U <bmc-username> -P <bmc-password> mc reset cold
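After a cold reset, the BMC is unreachable for a while, so it can help to wait until it answers again before continuing. The following is a minimal sketch of a generic retry loop. The function name retry_until, the attempt count, and the delay are illustrative assumptions, and the example uses the standard ipmitool mc info query as the probe.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARG...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
    local attempts="$1"
    local delay="$2"
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        # Skip the sleep after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example (30 attempts, 10 seconds apart, are arbitrary values):
#   retry_until 30 10 ipmitool -H <bmc-ip-address> -I lanplus \
#       -U <bmc-username> -P <bmc-password> mc info
```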
Reboot the system.
Update the components on the motherboard tray.
For a one-shot firmware update, the BMC updates all components in the provided bundle, for example, nvfw_DGX-H100_0003_231110.1.0_custom_prod-signed.fwpkg, which includes Host BMC, Host BIOS, EROT, PCIe Retimer, PCIe Switch, PSU, Motherboard CPLD, and Midplane CPLD.
Create a file, such as mb_tray.json, with the following contents:
{"Targets": []}
Update the firmware:
nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> update_fw \
    -p nvfw_DGXH100_231206.1.0.fwpkg -y -s mb_tray.json
Tip
Update the BMC and BIOS firmware a second time, specifying the force_update argument. The second update ensures that the primary and backup copies of the firmware in NVRAM are both up to date.
When you specify the force_update argument, the nvfwupd command forces the firmware update without checking the firmware version. Without the argument, if the version of the firmware available for a component is the same as the version currently installed on the component, the BMC skips the update for that component.
Update the components in the GPU tray.
Create a gpu_tray.json file with the following contents:
{"Targets": ["/redfish/v1/UpdateService/FirmwareInventory/HGX_0"]}
Update the firmware:
nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> update_fw \
    -p nvfw_HGX_DGXH100_231101.1.0.fwpkg -y -s gpu_tray.json
This step performs parallel updates on all the components contained in the GPU tray, such as VBIOS, NVSwitch, EROTs, and FPGA.
Update the firmware on the ConnectX-7 controllers.
Update the firmware on the cards that are used for cluster communication:
sudo mstflint -d /sys/bus/pci/devices/0000:5e:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin b
sudo mstflint -d /sys/bus/pci/devices/0000:dc:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin b
sudo mstflint -d /sys/bus/pci/devices/0000:c0:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin b
sudo mstflint -d /sys/bus/pci/devices/0000:18:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin b
sudo mstflint -d /sys/bus/pci/devices/0000:40:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin b
sudo mstflint -d /sys/bus/pci/devices/0000:4f:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin b
sudo mstflint -d /sys/bus/pci/devices/0000:ce:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin b
sudo mstflint -d /sys/bus/pci/devices/0000:9a:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin b
Update the firmware on the cards that are used for storage communication:
sudo mstflint -d /sys/bus/pci/devices/0000:aa:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX755206AS-NEA_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin b
sudo mstflint -d /sys/bus/pci/devices/0000:29:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX755206AS-NEA_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin b
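The ConnectX-7 commands differ only in the PCI address and the firmware image, so they can be generated from a list. The following sketch prints the commands without running them, which allows a review before anything is burned. The function name print_cx7_burn_commands and the two variable names are hypothetical; the addresses and image names are the ones listed in the steps above.

```shell
# print_cx7_burn_commands IMAGE BDF [BDF...]
# Prints one mstflint burn command per PCI address. Nothing is executed.
print_cx7_burn_commands() {
    local image="$1"
    shift
    local bdf
    for bdf in "$@"; do
        printf 'sudo mstflint -d /sys/bus/pci/devices/0000:%s/config -i %s b\n' "$bdf" "$image"
    done
}

# Image names as given in this procedure.
CLUSTER_FW=fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin
STORAGE_FW=fw-ConnectX7-rel-28_39_1002-MCX755206AS-NEA_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin

# Example:
#   print_cx7_burn_commands "$CLUSTER_FW" \
#       5e:00.0 dc:00.0 c0:00.0 18:00.0 40:00.0 4f:00.0 ce:00.0 9a:00.0
#   print_cx7_burn_commands "$STORAGE_FW" aa:00.0 29:00.0
```

Once the printed commands match the ones in this procedure, they can be pasted into a shell or piped to sh.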
Perform an AC power cycle on the system by unplugging all the power supplies and then reconnecting them either manually or through an external PDU device.
Wait for the operating system to boot.
Confirm the firmware update is complete by viewing the installed versions again:
nvfwupd --target ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> \
    show_version -p nvfw_DGXH100_231206.1.0.fwpkg \
    nvfw_HGX_DGXH100_231101.1.0.fwpkg
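Because the same show_version command runs at the start and at the end of the procedure, saving both reports makes the change easy to review. The following is a sketch under stated assumptions: save_versions is a hypothetical wrapper, and BMC_IP, BMC_USER, and BMC_PASS are hypothetical environment variables that hold the BMC connection details. It makes no assumption about the report format and only stores the output as text.

```shell
# save_versions OUTFILE
# Stores the show_version report in OUTFILE.
# BMC_IP, BMC_USER, and BMC_PASS are assumed to be set by the operator.
save_versions() {
    nvfwupd --target ip="$BMC_IP" user="$BMC_USER" password="$BMC_PASS" \
        show_version -p nvfw_DGXH100_231206.1.0.fwpkg \
        nvfw_HGX_DGXH100_231101.1.0.fwpkg > "$1"
}

# Example:
#   save_versions versions_before.txt    # before the first update step
#   (perform the updates and the AC power cycle)
#   save_versions versions_after.txt
#   diff versions_before.txt versions_after.txt
```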