Firmware Update Steps

Before You Begin

  • Stop all unnecessary system activity before you begin the firmware update.

  • Stop all GPU activity, and do not run the nvidia-smi command. Either GPU activity or running the command can prevent the VBIOS update.

  • Do not place additional load on the system, such as user jobs, diagnostics, or monitoring services, while an update is in progress. A high workload can disrupt the firmware update process and result in an unusable component.

  • When you begin the firmware update, the update software helps determine the activity state of the DGX system and warns you if it detects activity levels above a predetermined threshold. If you see the warning, reduce the workload before proceeding with the firmware update.

  • Fan speeds can increase during the BMC firmware update. This increase in speed is a normal part of the BMC firmware update process.

  • If you plan to upgrade from version 1.0.0 (BMC 23.05.11) or 1.1.1 (BMC 23.09.20) to version 24.09.1 (BMC 24.09.17), you must first upgrade to version 1.1.3 (BMC 24.01.05) and then to version 24.09.1 so that all critical security changes are included. If you upgrade directly from version 1.0.0 or 1.1.1 to version 24.09.1, you must perform a factory reset to restore the default settings.
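
The upgrade-path rule in the last item can be captured in a small shell helper. This is only a sketch: upgrade_path is a hypothetical function name, not part of nvfwupd, and it covers only the versions named above. Any other starting version is reported as not covered rather than guessed.

```shell
#!/bin/sh
# upgrade_path CURRENT_VERSION
# Prints the versions to install, in order, to reach 24.09.1.
# Hypothetical helper for illustration; it only encodes the rule stated above.
upgrade_path() {
    case "$1" in
        1.0.0|1.1.1)
            # Going through 1.1.3 first avoids the factory reset that a
            # direct upgrade to 24.09.1 would require.
            printf '%s\n' "1.1.3 24.09.1"
            ;;
        1.1.3)
            # 1.1.3 is the intermediate version, so the next hop is the target.
            printf '%s\n' "24.09.1"
            ;;
        *)
            # Versions not mentioned in the text are deliberately not handled.
            printf 'upgrade path for version %s is not covered here\n' "$1" >&2
            return 1
            ;;
    esac
}
```

For example, `upgrade_path 1.0.0` prints `1.1.3 24.09.1`, meaning the 1.1.3 package is applied first and the 24.09.1 package second.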

Update Duration

Updating the firmware on the motherboard tray components and the GPU tray components requires approximately 90 minutes. Updating the firmware on the ConnectX-7 devices requires approximately 30 minutes.

Update Steps

  1. Compare the installed firmware versions with the versions in the new firmware packages.

    nvfwupd --target ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> \
      show_version -p nvfw_DGX_250220.1.0.fwpkg \
      nvfw_HGX_DGXH100-H200x8_250131.1.0.fwpkg
    
  2. Update the BMC.

    1. Create a file, such as update_bmc.json, with the following contents:

      {
          "Targets" :["/redfish/v1/UpdateService/FirmwareInventory/HostBMC_0"]
      }
      
    2. Run the following command to update the BMC:

      nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> update_fw \
        -p nvfw_DGX_250220.1.0.fwpkg -y -s update_bmc.json
      
  3. Reboot the BMC.

    • Use ipmitool from a shell, either on the system itself or on a different system:

      # If you have a shell on the system
      $ sudo ipmitool mc reset cold
      
      # If you are logged in to a different system
      $ ipmitool -H <bmc-ip-address> -I lanplus -U <bmc-username> -P <bmc-password> mc reset cold
      
    • Alternatively, you can use the Web UI through a browser.

  4. Update the components on the motherboard tray.

    In a one-shot firmware update, the BMC updates all components in the provided bundle, for example, nvfw_DGX_xxxxxx.x.x.fwpkg. This includes the Host BMC (if the force_update option is specified), Host BIOS, EROT, PCIe Retimer, PCIe Switch, PSU, Motherboard CPLD, and Midplane CPLD.

    1. Create a file, such as mb_tray.json, with empty braces:

      {}
      
    2. Update the firmware:

      nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> update_fw \
        -p nvfw_DGX_250220.1.0.fwpkg -y -s mb_tray.json
      
  5. Update the components on the GPU tray.

    1. Create a gpu_tray.json file with the following contents:

      {
          "Targets" :["/redfish/v1/UpdateService/FirmwareInventory/HGX_0"]
      }
      
    2. Update the firmware:

      nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> update_fw \
        -p nvfw_HGX_DGXH100-H200x8_250131.1.0.fwpkg -y -s gpu_tray.json
      

      This step performs parallel updates on all the components contained in the GPU tray, such as VBIOS, NVSwitch, EROTs, and FPGA.

    3. Verify that the background copy has been completed successfully by looking for "BackgroundCopyStatus": "Completed" in the following command output:

      curl -s -k -u <bmc-username>:<bmc-password> -H content-type:application/json \
           -X GET https://<bmc-ip-address>/redfish/v1/Chassis/HGX_ERoT_BMC_0 | jq
      
    4. Perform a cold reset to restart the system:

      ipmitool chassis power cycle
      
  6. Confirm the firmware update is complete by viewing the installed versions again.

    After the system is operational again, repeat the following command to confirm all firmware has been updated:

    nvfwupd --target ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> \
      show_version -p nvfw_DGX_250220.1.0.fwpkg \
      nvfw_HGX_DGXH100-H200x8_250131.1.0.fwpkg
    
  7. Execute background copy commands for the BMC and the system BIOS.

    1. BMC:

      Background copy Redfish API request:

      curl -k -u <bmc-username>:<bmc-password> --request POST --location 'https://<bmc-ip-address>/redfish/v1/UpdateService/Actions/Oem/NvidiaUpdateService.CommitImage' \
           --header 'Content-Type: application/json' \
           --data '{
                   "Targets": ["/redfish/v1/UpdateService/FirmwareInventory/HostBMC_0"]
                   }'
      

      Example response:

      {
         "@odata.type":"#UpdateService.v1_11_0.UpdateService",
         "Messages":[
            {
               "@odata.type":"#Message.v1_0_8.Message",
               "Message":"A new task /redfish/v1/TaskService/Tasks/1 was created.",
               "MessageArgs":[
                  "/redfish/v1/TaskService/Tasks/1"
               ],
               "MessageId":"Task.1.0.New",
               "Resolution":"None",
               "Severity":"OK"
            },
            {
               "@odata.type":"#Message.v1_0_8.Message",
               "Message":"ActivateFirmware Action is initiated.",
               "MessageId":"UpdateService.1.0.StartActivateFirmware",
               "Resolution":"None",
               "Severity":"OK"
            }
         ]
      }
      

      Query the update status using the task ID from the response, which is 1 in this example:

      nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> show_update_progress -i 1
      

      When the status indicates 100% complete, proceed with the next step.

    2. SBIOS:

      Background copy Redfish API request:

      curl -k -u <bmc-username>:<bmc-password> --request POST --location 'https://<bmc-ip-address>/redfish/v1/UpdateService/Actions/Oem/NvidiaUpdateService.CommitImage' \
           --header 'Content-Type: application/json' \
           --data '{
                   "Targets": ["/redfish/v1/UpdateService/FirmwareInventory/HostBIOS_0"]
                   }'
      

      Find the task ID from the response, which is usually 2, to query the update status:

      nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> show_update_progress -i 2
      

      When the status indicates 100% complete, proceed with the next step.

    3. Perform an AC power cycle on the system by unplugging all the power supplies and then reconnecting them either manually or through an external PDU device.

  8. Update the firmware on the network cards and NVMe drives.

    Note

    During the update, the mlxfwmanager command reports that the ConnectX-7 device identified as /dev/mst/mt4129_pciconf0 cannot be updated, as shown in the following error message:

    -E- Failed to query /dev/mst/mt4129_pciconf0 device, error : MFE_ICMD_BAD_PARAM
    

    This behavior is expected because this device is not one of the networking cards used to cluster the system, but a bridge device used internally and updated using a separate process.

    1. To update the ConnectX®-7 cards and NVIDIA® BlueField®-3 cards, navigate to the network directory and run the mlxfwmanager command:

      cd network
      sudo mlxfwmanager -u -D .
      

      When prompted to update all 10 ConnectX-7 cards and BlueField-3 cards, type Y to confirm.

      For a complete output, refer to Firmware Update Output of mlxfwmanager.

    2. To update the firmware on the Intel E810-C Ethernet Network Adapters, refer to Updating the Intel NIC Firmware.

    3. To update the firmware on the NVMe drives, refer to Updating the NVMe Firmware.
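
The file-creation and status-check parts of the steps above can be scripted. The following is a minimal sketch, not part of nvfwupd: the function names write_targets_file, background_copy_completed, task_id_from_response, and wait_until are hypothetical, and the text checks use grep and sed instead of jq only to avoid an extra dependency. The functions operate on files and on text piped to them, so the nvfwupd and curl commands themselves stay exactly as shown in the steps.

```shell
#!/bin/sh
# write_targets_file FILE [REDFISH_URI]
# Writes the JSON file passed to "update_fw -s FILE". With no URI it writes
# empty braces, as used for the motherboard tray.
write_targets_file() {
    file="$1"
    uri="${2:-}"
    if [ -z "$uri" ]; then
        printf '{}\n' > "$file"
    else
        printf '{\n    "Targets": ["%s"]\n}\n' "$uri" > "$file"
    fi
}

# background_copy_completed
# Reads Redfish JSON on stdin and succeeds only if it contains
# "BackgroundCopyStatus": "Completed".
background_copy_completed() {
    grep -q '"BackgroundCopyStatus" *: *"Completed"'
}

# task_id_from_response
# Reads a CommitImage response on stdin and prints the numeric ID of the
# first /redfish/v1/TaskService/Tasks/<id> reference it finds.
task_id_from_response() {
    sed -n 's|.*/redfish/v1/TaskService/Tasks/\([0-9][0-9]*\).*|\1|p' | head -n 1
}

# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or ATTEMPTS is exhausted.
wait_until() {
    attempts="$1"
    delay="$2"
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

For example, `write_targets_file gpu_tray.json /redfish/v1/UpdateService/FirmwareInventory/HGX_0` produces the GPU tray file, and piping the output of the curl command from the background copy verification step into background_copy_completed gives an exit status that a script can test. The retry count and delay passed to wait_until are a local choice; the steps above do not specify polling intervals.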

Firmware Update Output of mlxfwmanager

$ sudo mlxfwmanager -u -D .
Querying Mellanox devices firmware ...

Device #1:
----------

Device Type:      ConnectX7
Part Number:      --
Description:
PSID:
PCI Device Name:  /dev/mst/mt4129_pciconf0
Base GUID:        N/A
Base MAC:         N/A
Versions:         Current        Available
   FW:            --
Status:           Failed to open device

Device #2:
----------

Device Type:      ConnectX7
Part Number:      MCX750500B-0D00_Ax_Bx
Description:      Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID:             MT_0000000891
PCI Device Name:  /dev/mst/mt4129_pciconf1
Base GUID:        b8e924030081db74
Versions:         Current        Available
   FW:            28.42.1000     28.43.2026
   PXE:           3.7.0500       N/A
   UEFI:          14.35.0015     N/A
Status:           Update required

Device #3:
----------

Device Type:      ConnectX7
Part Number:      MCX750500B-0D00_Ax_Bx
Description:      Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID:             MT_0000000891
PCI Device Name:  /dev/mst/mt4129_pciconf2
Base GUID:        b8e924030081db7c
Versions:         Current        Available
   FW:            28.42.1000     28.43.2026
   PXE:           3.7.0500       N/A
   UEFI:          14.35.0015     N/A
Status:           Update required

Device #4:
----------

Device Type:      ConnectX7
Part Number:      MCX750500B-0D00_Ax_Bx
Description:      Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID:             MT_0000000891
PCI Device Name:  /dev/mst/mt4129_pciconf3
Base GUID:        b8e924030081db78
Versions:         Current        Available
   FW:            28.42.1000     28.43.2026
   PXE:           3.7.0500       N/A
   UEFI:          14.35.0015     N/A
Status:           Update required

Device #5:
----------

Device Type:      ConnectX7
Part Number:      MCX750500B-0D00_Ax_Bx
Description:      Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID:             MT_0000000891
PCI Device Name:  /dev/mst/mt4129_pciconf4
Base GUID:        b8e924030081db70
Versions:         Current        Available
   FW:            28.42.1000     28.43.2026
   PXE:           3.7.0500       N/A
   UEFI:          14.35.0015     N/A
Status:           Update required

Device #6:
----------

Device Type:      ConnectX7
Part Number:      MCX750500B-0D00_Ax_Bx
Description:      Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID:             MT_0000000891
PCI Device Name:  /dev/mst/mt4129_pciconf5
Base GUID:        b8e924030081d954
Versions:         Current        Available
   FW:            28.42.1000     28.43.2026
   PXE:           3.7.0500       N/A
   UEFI:          14.35.0015     N/A
Status:           Update required

Device #7:
----------

Device Type:      ConnectX7
Part Number:      MCX750500B-0D00_Ax_Bx
Description:      Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID:             MT_0000000891
PCI Device Name:  /dev/mst/mt4129_pciconf6
Base GUID:        b8e924030081d95c
Versions:         Current        Available
   FW:            28.42.1000     28.43.2026
   PXE:           3.7.0500       N/A
   UEFI:          14.35.0015     N/A
Status:           Update required

Device #8:
----------

Device Type:      ConnectX7
Part Number:      MCX750500B-0D00_Ax_Bx
Description:      Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID:             MT_0000000891
PCI Device Name:  /dev/mst/mt4129_pciconf7
Base GUID:        b8e924030081d958
Versions:         Current        Available
   FW:            28.42.1000     28.43.2026
   PXE:           3.7.0500       N/A
   UEFI:          14.35.0015     N/A
Status:           Update required

Device #9:
----------

Device Type:      ConnectX7
Part Number:      MCX750500B-0D00_Ax_Bx
Description:      Nvidia adapter card with four ConnectX-7; each up to 400Gb/s IB (default mode) or 400GbE; PCIe 5.0 x32; PCIe switch; crypto disabled; secure boot enabled
PSID:             MT_0000000891
PCI Device Name:  /dev/mst/mt4129_pciconf8
Base GUID:        b8e924030081d950
Versions:         Current        Available
   FW:            28.42.1000     28.43.2026
   PXE:           3.7.0500       N/A
   UEFI:          14.35.0015     N/A
Status:           Update required

Device #10:
----------

Device Type:      BlueField3
Part Number:      900-9D3B6-00CN-A_Ax
Description:      NVIDIA BlueField-3 B3240 P-Series Dual-slot FHHL DPU; 400GbE / NDR IB (default mode); Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
PSID:             MT_0000000883
PCI Device Name:  /dev/mst/mt41692_pciconf0
Base GUID:        b8e9240300a65e18
Base MAC:         b8e924a65e18
Versions:         Current        Available
   FW:            32.42.1000     32.43.2024
   PXE:           3.7.0500       N/A
   UEFI:          14.35.0015     N/A
   UEFI Virtio blk:   22.4.0013      N/A
   UEFI Virtio net:   21.4.0013      N/A
Status:           Update required

Device #11:
----------

Device Type:      BlueField3
Part Number:      900-9D3B6-00CN-A_Ax
Description:      NVIDIA BlueField-3 B3240 P-Series Dual-slot FHHL DPU; 400GbE / NDR IB (default mode); Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
PSID:             MT_0000000883
PCI Device Name:  /dev/mst/mt41692_pciconf1
Base GUID:        b8e9240300a667be
Base MAC:         b8e924a667be
Versions:         Current        Available
   FW:            32.42.1000     32.43.2024
   PXE:           3.7.0500       N/A
   UEFI:          14.35.0015     N/A
   UEFI Virtio blk:   22.4.0013      N/A
   UEFI Virtio net:   21.4.0013      N/A
Status:           Update required


---------
-E- Failed to query /dev/mst/mt4129_pciconf0 device, error : MFE_ICMD_BAD_PARAM
Found 10 device(s) requiring firmware update...

Perform FW update? [y/N]: y
Device #1: Device query failed
Device #2: Updating FW ...
FSMST_INITIALIZE -   OK
Writing Boot image component -   OK
Done
Device #3: Updating FW ...
FSMST_INITIALIZE -   OK
Writing Boot image component -   OK
Done
Device #4: Updating FW ...
FSMST_INITIALIZE -   OK
Writing Boot image component -   OK
Done
Device #5: Updating FW ...
FSMST_INITIALIZE -   OK
Writing Boot image component -   OK
Done
Device #6: Updating FW ...
FSMST_INITIALIZE -   OK
Writing Boot image component -   OK
Done
Device #7: Updating FW ...
FSMST_INITIALIZE -   OK
Writing Boot image component -   OK
Done
Device #8: Updating FW ...
FSMST_INITIALIZE -   OK
Writing Boot image component -   OK
Done
Device #9: Updating FW ...
FSMST_INITIALIZE -   OK
Writing Boot image component -   OK
Done
Device #10: Updating FW ...
FSMST_INITIALIZE -   OK
Writing Boot image component -   OK
Done
Device #11: Updating FW ...
FSMST_INITIALIZE -   OK
Writing Boot image component -   OK
Done

Restart needed for updates to take effect.
-E- One or more errors were encountered.
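
Because the run ends with a generic error line even when only the internal bridge device fails, it can help to summarize a saved copy of the output. The sketch below assumes the output was captured to a file or pipe and that it follows the layout shown above. The function names summarize_mlxfw and unexpected_query_failures are hypothetical and are not part of mlxfwmanager.

```shell
#!/bin/sh
# summarize_mlxfw
# Reads mlxfwmanager output on stdin and prints how many devices required an
# update, how many updates started, how many finished with "Done", and which
# devices could not be queried.
summarize_mlxfw() {
    awk '
        /^Status: +Update required/      { required++ }
        /^Device #[0-9]+: Updating FW/   { started++ }
        /^Done$/                         { finished++ }
        /^-E- Failed to query /          { failed[$5] = 1 }
        END {
            printf "update_required=%d started=%d finished=%d\n", required, started, finished
            for (dev in failed) printf "query_failed=%s\n", dev
        }
    '
}

# unexpected_query_failures
# Prints query failures for any device other than /dev/mst/mt4129_pciconf0,
# the internal bridge whose failure is expected. Prints nothing when the
# bridge is the only device that failed.
unexpected_query_failures() {
    grep '^-E- Failed to query ' | grep -v ' /dev/mst/mt4129_pciconf0 '
}
```

Applied to the output above, the three counts should all be 10 and the only query_failed entry should be /dev/mst/mt4129_pciconf0. Any output from unexpected_query_failures would point to a device that needs attention.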