Firmware Update Steps#
Before You Begin#
Stop all unnecessary system activity before you begin the firmware update.
Stop all GPU activity, including running the nvidia-smi command. GPU activity and running the command can prevent the VBIOS update.
Do not add additional loads on the system, such as user jobs, diagnostics, or monitoring services, while an update is in progress. A high workload can disrupt the firmware update process and result in an unusable component.
When you begin the firmware update, the update software assists in determining the activity state of the DGX system and provides a warning if it detects that activity levels are above a predetermined threshold. If you encounter the warning, take action to reduce the workload before proceeding with the firmware update.
Fan speeds can increase during the BMC firmware update. This increase in speed is a normal part of the BMC firmware update process.
If you plan to upgrade from version 1.0.0 (BMC 23.05.11) or 1.1.1 (BMC 23.09.20) to version 24.09.1 (BMC 24.09.17), you must first upgrade to version 1.1.3 (BMC 24.01.05) and then to version 24.09.1 to include all critical security changes. If you upgrade directly from version 1.0.0 or 1.1.1 to version 24.09.1, you are required to perform a factory reset to restore the default settings.
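The version rule above can be expressed as a small lookup. The following is a minimal Python sketch, not part of the update tooling: the function name upgrade_path is hypothetical, and the sketch assumes that any installed version other than 1.0.0 or 1.1.1 can go to the target in one step, which the text does not state explicitly.

```python
from typing import List

# Versions that the text says must pass through 1.1.3 first.
NEEDS_INTERMEDIATE = {"1.0.0", "1.1.1"}
INTERMEDIATE = "1.1.3"
TARGET = "24.09.1"


def upgrade_path(installed: str, target: str = TARGET) -> List[str]:
    """Return the ordered list of versions to install (hypothetical helper)."""
    if installed == target:
        return []  # nothing to do
    if target == TARGET and installed in NEEDS_INTERMEDIATE:
        # Two-step path that avoids the factory reset described above.
        return [INTERMEDIATE, target]
    # Assumption: every other starting version can be updated directly.
    return [target]
```

For example, upgrade_path("1.1.1") yields the two-step path through 1.1.3, while upgrade_path("1.1.3") yields only the final version.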
Update Duration#
Updating the firmware on the motherboard tray components and the GPU tray components requires approximately 90 minutes. Updating the firmware on the ConnectX-7 devices requires approximately 30 minutes.
Update Steps#
View the installed versions compared with the newly available firmware.
nvfwupd --target ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> \
    show_version -p nvfw_DGX_250220.1.0.fwpkg \
    nvfw_HGX_DGXH100-H200x8_250131.1.0.fwpkg
Update the BMC.
Create a file, such as update_bmc.json, with the following contents:
{
    "Targets": ["/redfish/v1/UpdateService/FirmwareInventory/HostBMC_0"]
}
Run the following command to update the BMC:
nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> update_fw \
    -p nvfw_DGX_250220.1.0.fwpkg -y -s update_bmc.json
Reboot the BMC.
Use the shell on the system:
# If you have a shell on the system
$ sudo ipmitool mc reset cold
# If you are logged in to a different system
$ ipmitool -H <bmc-ip-address> -I lanplus -U <bmc-username> -P <bmc-password> mc reset cold
Alternatively, you can use the Web UI through a browser.
Update the components on the motherboard tray.
For a one-shot firmware update, the BMC updates the firmware on all components in the provided bundle, for example, nvfw_DGX_xxxxxx.x.x.fwpkg, which includes the Host BMC (if the force_update option is specified), Host BIOS, EROT, PCIe Retimer, PCIe Switch, PSU, Motherboard CPLD, and Midplane CPLD.
Create a file, such as mb_tray.json, with empty braces:
{}
Update the firmware:
nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> update_fw \
    -p nvfw_DGX_250220.1.0.fwpkg -y -s mb_tray.json
Update the components on the GPU tray.
Create a gpu_tray.json file with the following contents:
{
    "Targets": ["/redfish/v1/UpdateService/FirmwareInventory/HGX_0"]
}
Update the firmware:
nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> update_fw \
    -p nvfw_HGX_DGXH100-H200x8_250131.1.0.fwpkg -y -s gpu_tray.json
This step performs parallel updates on all the components contained in the GPU tray, such as VBIOS, NVSwitch, EROTs, and FPGA.
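The three files passed with -s in the steps above (update_bmc.json, mb_tray.json, and gpu_tray.json) differ only in their Targets list. The following is a minimal Python sketch that writes all three in one pass. The function name write_target_files and the output directory are hypothetical; the file names and Redfish URIs are the ones shown in the steps.

```python
import json
from pathlib import Path

INVENTORY = "/redfish/v1/UpdateService/FirmwareInventory"

# File name -> JSON body, exactly as given in the update steps.
TARGET_FILES = {
    "update_bmc.json": {"Targets": [f"{INVENTORY}/HostBMC_0"]},
    "mb_tray.json": {},  # empty braces: update every component in the bundle
    "gpu_tray.json": {"Targets": [f"{INVENTORY}/HGX_0"]},
}


def write_target_files(directory):
    """Write the three target files into `directory` and return their paths."""
    out = Path(directory)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, body in TARGET_FILES.items():
        path = out / name
        path.write_text(json.dumps(body, indent=4) + "\n")
        paths.append(path)
    return paths
```

Calling write_target_files(".") in the directory that holds the .fwpkg files produces the files in the location where the nvfwupd commands above expect them.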
Verify that the background copy has completed successfully by looking for "BackgroundCopyStatus": "Completed" in the output of the following command:
curl -s -k -u <bmc-user>:<bmc-password> -H content-type:application/json \
    -X GET https://<bmc-ip-address>/redfish/v1/Chassis/HGX_ERoT_BMC_0 | jq
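If you prefer a scripted check to reading the output by eye, the following minimal Python sketch looks for the BackgroundCopyStatus field in the JSON returned by the request above. The text does not say where in the payload the field sits, so the sketch searches the whole document. The function names are hypothetical.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key` anywhere in a decoded JSON tree."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def background_copy_completed(payload_text):
    """Return True only if the field is present and every occurrence is "Completed"."""
    statuses = list(find_key(json.loads(payload_text), "BackgroundCopyStatus"))
    return bool(statuses) and all(status == "Completed" for status in statuses)
```

The function takes the raw response body as a string, so the output of the curl command can be passed to it without jq. A missing field is treated as not completed rather than as success.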
Perform a cold reset to restart the system:
ipmitool chassis power cycle
Confirm the firmware update is complete by viewing the installed versions again.
After the system is operational again, repeat the following command to confirm all firmware has been updated:
nvfwupd --target ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> \
    show_version -p nvfw_DGX_250220.1.0.fwpkg \
    nvfw_HGX_DGXH100-H200x8_250131.1.0.fwpkg
Execute background copy commands for the BMC and the system BIOS.
BMC:
Background copy Redfish API request:
curl -k -u <bmc-user>:<password> --request POST \
    --location 'https://<bmc-ip-address>/redfish/v1/UpdateService/Actions/Oem/NvidiaUpdateService.CommitImage' \
    --header 'Content-Type: application/json' \
    --data '{ "Targets": ["/redfish/v1/UpdateService/FirmwareInventory/HostBMC_0"] }'
Example response:
{
    "@odata.type": "#UpdateService.v1_11_0.UpdateService",
    "Messages": [
        {
            "@odata.type": "#Message.v1_0_8.Message",
            "Message": "A new task /redfish/v1/TaskService/Tasks/1 was created.",
            "MessageArgs": [
                "/redfish/v1/TaskService/Tasks/1"
            ],
            "MessageId": "Task.1.0.New",
            "Resolution": "None",
            "Severity": "OK"
        },
        {
            "@odata.type": "#Message.v1_0_8.Message",
            "Message": "ActivateFirmware Action is initiated.",
            "MessageId": "UpdateService.1.0.StartActivateFirmware",
            "Resolution": "None",
            "Severity": "OK"
        }
    ]
}
Query the update status using the task ID, which is 1, as shown in the output response:
nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> show_update_progress -i 1
When the status indicates 100% complete, proceed with the next step.
SBIOS:
Background copy Redfish API request:
curl -k -u <bmc-user>:<password> --request POST \
    --location 'https://<bmc-ip-address>/redfish/v1/UpdateService/Actions/Oem/NvidiaUpdateService.CommitImage' \
    --header 'Content-Type: application/json' \
    --data '{ "Targets": ["/redfish/v1/UpdateService/FirmwareInventory/HostBIOS_0"] }'
Find the task ID from the response, which is usually 2, to query the update status:
nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> show_update_progress -i 2
When the status indicates 100% complete, proceed with the next step.
Perform an AC power cycle on the system by unplugging all the power supplies and then reconnecting them either manually or through an external PDU device.
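The task ID passed to show_update_progress -i in the background copy steps above can be read from the CommitImage response instead of assumed. The following is a minimal Python sketch that pulls it out of the MessageArgs entries, based only on the response layout shown in the BMC example. The function name task_ids is hypothetical.

```python
import json
import re

# Task URIs look like /redfish/v1/TaskService/Tasks/<id> in the example response.
TASK_URI = re.compile(r"/redfish/v1/TaskService/Tasks/(\d+)")


def task_ids(response_text):
    """Return the task IDs referenced in the MessageArgs of a response body."""
    response = json.loads(response_text)
    ids = []
    for message in response.get("Messages", []):
        for arg in message.get("MessageArgs", []):
            match = TASK_URI.fullmatch(arg)
            if match:
                ids.append(match.group(1))
    return ids
```

With the example response shown for the BMC, the function returns a single ID, which is the value to pass after -i.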
Update the firmware on the network cards and NVMe drives.
Note
During the update, the mlxfwmanager command reports that the ConnectX-7 device identified as /dev/mst/mt4129_pciconf0 cannot be updated, as shown in the following error message:
-E- Failed to query /dev/mst/mt4129_pciconf0 device, error : MFE_ICMD_BAD_PARAM
This behavior is expected because this device is not one of the networking cards used to cluster the system, but a bridge device used internally and updated using a separate process.
To update the ConnectX®-7 cards and NVIDIA® BlueField®-3 cards, navigate to the network directory and run the mlxfwmanager command:
cd network
sudo mlxfwmanager -u -D .
When prompted to update all 10 ConnectX-7 cards and BlueField-3 cards, type Y to confirm.
For a complete output, refer to Firmware Update Output of mlxfwmanager.
To update the firmware on the Intel E810-C Ethernet Network Adapters, refer to Updating the Intel NIC Firmware.
To update the firmware on the NVMe drives, refer to Updating the NVMe Firmware.
Firmware Update Output of mlxfwmanager#
$ sudo mlxfwmanager -u -D .
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX7
Part Number: --
Description:
PSID:
PCI Device Name: /dev/mst/mt4129_pciconf0
Base GUID: N/A
Base MAC: N/A
Versions: Current Available
FW: --
Status: Failed to open device
Device #2:
----------
Device Type: ConnectX7
Part Number: MCX750500B-0D00_Ax_Bx
Description: Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID: MT_0000000891
PCI Device Name: /dev/mst/mt4129_pciconf1
Base GUID: b8e924030081db74
Versions: Current Available
FW: 28.42.1000 28.43.2026
PXE: 3.7.0500 N/A
UEFI: 14.35.0015 N/A
Status: Update required
Device #3:
----------
Device Type: ConnectX7
Part Number: MCX750500B-0D00_Ax_Bx
Description: Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID: MT_0000000891
PCI Device Name: /dev/mst/mt4129_pciconf2
Base GUID: b8e924030081db7c
Versions: Current Available
FW: 28.42.1000 28.43.2026
PXE: 3.7.0500 N/A
UEFI: 14.35.0015 N/A
Status: Update required
Device #4:
----------
Device Type: ConnectX7
Part Number: MCX750500B-0D00_Ax_Bx
Description: Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID: MT_0000000891
PCI Device Name: /dev/mst/mt4129_pciconf3
Base GUID: b8e924030081db78
Versions: Current Available
FW: 28.42.1000 28.43.2026
PXE: 3.7.0500 N/A
UEFI: 14.35.0015 N/A
Status: Update required
Device #5:
----------
Device Type: ConnectX7
Part Number: MCX750500B-0D00_Ax_Bx
Description: Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID: MT_0000000891
PCI Device Name: /dev/mst/mt4129_pciconf4
Base GUID: b8e924030081db70
Versions: Current Available
FW: 28.42.1000 28.43.2026
PXE: 3.7.0500 N/A
UEFI: 14.35.0015 N/A
Status: Update required
Device #6:
----------
Device Type: ConnectX7
Part Number: MCX750500B-0D00_Ax_Bx
Description: Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID: MT_0000000891
PCI Device Name: /dev/mst/mt4129_pciconf5
Base GUID: b8e924030081d954
Versions: Current Available
FW: 28.42.1000 28.43.2026
PXE: 3.7.0500 N/A
UEFI: 14.35.0015 N/A
Status: Update required
Device #7:
----------
Device Type: ConnectX7
Part Number: MCX750500B-0D00_Ax_Bx
Description: Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID: MT_0000000891
PCI Device Name: /dev/mst/mt4129_pciconf6
Base GUID: b8e924030081d95c
Versions: Current Available
FW: 28.42.1000 28.43.2026
PXE: 3.7.0500 N/A
UEFI: 14.35.0015 N/A
Status: Update required
Device #8:
----------
Device Type: ConnectX7
Part Number: MCX750500B-0D00_Ax_Bx
Description: Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID: MT_0000000891
PCI Device Name: /dev/mst/mt4129_pciconf7
Base GUID: b8e924030081d958
Versions: Current Available
FW: 28.42.1000 28.43.2026
PXE: 3.7.0500 N/A
UEFI: 14.35.0015 N/A
Status: Update required
Device #9:
----------
Device Type: ConnectX7
Part Number: MCX750500B-0D00_Ax_Bx
Description: Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID: MT_0000000891
PCI Device Name: /dev/mst/mt4129_pciconf8
Base GUID: b8e924030081d950
Versions: Current Available
FW: 28.42.1000 28.43.2026
PXE: 3.7.0500 N/A
UEFI: 14.35.0015 N/A
Status: Update required
Device #10:
----------
Device Type: BlueField3
Part Number: 900-9D3B6-00CN-A_Ax
Description: NVIDIA BlueField-3 B3240 P-Series Dual-slot FHHL DPU; 400GbE / NDR IB (default mode); Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
PSID: MT_0000000883
PCI Device Name: /dev/mst/mt41692_pciconf0
Base GUID: b8e9240300a65e18
Base MAC: b8e924a65e18
Versions: Current Available
FW: 32.42.1000 32.43.2024
PXE: 3.7.0500 N/A
UEFI: 14.35.0015 N/A
UEFI Virtio blk: 22.4.0013 N/A
UEFI Virtio net: 21.4.0013 N/A
Status: Update required
Device #11:
----------
Device Type: BlueField3
Part Number: 900-9D3B6-00CN-A_Ax
Description: NVIDIA BlueField-3 B3240 P-Series Dual-slot FHHL DPU; 400GbE / NDR IB (default mode); Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
PSID: MT_0000000883
PCI Device Name: /dev/mst/mt41692_pciconf1
Base GUID: b8e9240300a667be
Base MAC: b8e924a667be
Versions: Current Available
FW: 32.42.1000 32.43.2024
PXE: 3.7.0500 N/A
UEFI: 14.35.0015 N/A
UEFI Virtio blk: 22.4.0013 N/A
UEFI Virtio net: 21.4.0013 N/A
Status: Update required
---------
-E- Failed to query /dev/mst/mt4129_pciconf0 device, error : MFE_ICMD_BAD_PARAM
Found 10 device(s) requiring firmware update...
Perform FW update? [y/N]: y
Device #1: Device query failed
Device #2: Updating FW ...
FSMST_INITIALIZE - OK
Writing Boot image component - OK
Done
Device #3: Updating FW ...
FSMST_INITIALIZE - OK
Writing Boot image component - OK
Done
Device #4: Updating FW ...
FSMST_INITIALIZE - OK
Writing Boot image component - OK
Done
Device #5: Updating FW ...
FSMST_INITIALIZE - OK
Writing Boot image component - OK
Done
Device #6: Updating FW ...
FSMST_INITIALIZE - OK
Writing Boot image component - OK
Done
Device #7: Updating FW ...
FSMST_INITIALIZE - OK
Writing Boot image component - OK
Done
Device #8: Updating FW ...
FSMST_INITIALIZE - OK
Writing Boot image component - OK
Done
Device #9: Updating FW ...
FSMST_INITIALIZE - OK
Writing Boot image component - OK
Done
Device #10: Updating FW ...
FSMST_INITIALIZE - OK
Writing Boot image component - OK
Done
Device #11: Updating FW ...
FSMST_INITIALIZE - OK
Writing Boot image component - OK
Done
Restart needed for updates to take effect.
-E- One or more errors were encountered.
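To check a long listing like the one above without reading every device block, the following minimal Python sketch collects the Status line for each device in the query portion of the output. It assumes the layout shown here, with a "Device #N:" header followed later by a "Status:" line. The function name device_statuses is hypothetical, and the sketch is not part of mlxfwmanager.

```python
import re

# Matches only bare headers such as "Device #2:", not progress lines
# such as "Device #2: Updating FW ...".
DEVICE_HEADER = re.compile(r"^Device #(\d+):\s*$")


def device_statuses(output):
    """Map each device number to the text of its Status line."""
    statuses = {}
    current = None
    for raw_line in output.splitlines():
        line = raw_line.strip()
        header = DEVICE_HEADER.match(line)
        if header:
            current = int(header.group(1))
        elif line.startswith("Status:") and current is not None:
            statuses[current] = line[len("Status:"):].strip()
            current = None  # wait for the next device header
    return statuses
```

Counting the entries whose value is "Update required" should agree with the "Found 10 device(s) requiring firmware update..." line, and the remaining entry is the internal bridge device described in the note.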