Firmware Update Steps

Before You Begin

  • Stop all unnecessary system activity before you begin the firmware update.

  • Stop all GPU activity, and do not run the nvidia-smi command. GPU activity, including running nvidia-smi, can prevent the VBIOS update.

  • Do not add load to the system, such as user jobs, diagnostics, or monitoring services, while an update is in progress. A high workload can disrupt the firmware update process and leave a component unusable.

  • When you begin the firmware update, the update software checks the activity state of the DGX system and warns you if activity levels exceed a predetermined threshold. If you see the warning, reduce the workload before you proceed with the firmware update.

  • Fan speeds can increase during the BMC firmware update. This increase in speed is a normal part of the BMC firmware update process.
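
The guidance above about keeping the system idle can be checked with a quick script before you start. The following is a minimal sketch, assuming a Linux host that exposes /proc/loadavg. The function name load_below and the threshold of 1.0 are hypothetical choices for illustration and are unrelated to the threshold that the update software uses. The sketch deliberately does not call nvidia-smi.

```shell
#!/bin/sh
# Hypothetical pre-flight check: compare the 1-minute load average
# against an operator-chosen threshold (not an NVIDIA-defined value).

# load_below THRESHOLD [FILE]
# Returns 0 if the first field on the first line of FILE
# (default: /proc/loadavg) is strictly less than THRESHOLD.
# Returns non-zero otherwise, including when FILE is empty or unreadable.
load_below() {
    threshold=$1
    file=${2:-/proc/loadavg}
    awk -v t="$threshold" \
        'NR == 1 { ok = ($1 + 0 < t + 0) } END { exit !ok }' "$file"
}

if load_below 1.0; then
    echo "Load is low; OK to start the firmware update."
else
    echo "Load is high or unknown; reduce the workload first." >&2
fi
```

A low load average does not prove that the GPUs are idle, so treat this only as a first check alongside stopping jobs and services by hand.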

Update Duration

Updating the firmware on the motherboard tray components and the GPU tray components requires approximately 90 minutes. Updating the firmware on the ConnectX-7 devices requires approximately 30 minutes.

Update Steps

  1. View the installed versions compared with the newly available firmware:

    nvfwupd --target ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> \
      show_version -p nvfw_DGXH100_231206.1.0.fwpkg \
      nvfw_HGX_DGXH100_231101.1.0.fwpkg
    
  2. Update the BMC.

    Create a file, such as update_bmc.json, with the following contents:

    {
        "Targets" :["/redfish/v1/UpdateService/FirmwareInventory/HostBMC_0"]
    }
    

    Run the following command to update the BMC:

    nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> update_fw \
      -p nvfw_DGXH100_231206.1.0.fwpkg -y -s update_bmc.json
    
  3. Reset the BMC so that the updated firmware is used after the next reboot:

    # If you have a shell on the system
    $ sudo ipmitool mc reset cold
    
    # If you are logged in to a different system
    $ ipmitool -H <bmc-ip-address> -I lanplus -U <bmc-username> -P <bmc-password> mc reset cold
    
  4. Reboot the system.

  5. Update the components on the motherboard tray.

    For a one-shot firmware update, the BMC updates all the components in the provided bundle. For example, the nvfw_DGX-H100_0003_231110.1.0_custom_prod-signed.fwpkg bundle includes Host BMC, Host BIOS, EROT, PCIe Retimer, PCIe Switch, PSU, Motherboard CPLD, and Midplane CPLD.

    Create a file, such as mb_tray.json, with the following contents:

    {"Targets":[]}
    

    Update the firmware:

    nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> update_fw \
      -p nvfw_DGXH100_231206.1.0.fwpkg -y -s mb_tray.json
    

    Tip

    Update the BMC and BIOS firmware a second time, specifying the force_update argument. The second update ensures that the primary and backup copies of the firmware in NVRAM are both up to date.

    When you specify the force_update argument, the nvfwupd command forces the firmware update without checking the firmware version. Without the argument, if the firmware version available for a component is the same as the version currently installed on that component, the BMC skips the update for that component.

  6. Update the components in the GPU tray.

    Create a gpu_tray.json file with the following contents:

    {
        "Targets" :["/redfish/v1/UpdateService/FirmwareInventory/HGX_0"]
    }
    

    Update the firmware:

    nvfwupd -t ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> update_fw \
      -p nvfw_HGX_DGXH100_231101.1.0.fwpkg -y -s gpu_tray.json
    

    This step performs parallel updates on all the components contained in the GPU tray, such as VBIOS, NVSwitch, EROTs, and FPGA.

  7. Perform an AC power cycle on the system by unplugging all the power supplies and then reconnecting them either manually or through an external PDU device.

    Wait for the operating system to boot.

  8. Confirm the firmware update is complete by viewing the installed versions again:

    nvfwupd --target ip=<bmc-ip-address> user=<bmc-username> password=<bmc-password> \
      show_version -p nvfw_DGXH100_231206.1.0.fwpkg \
      nvfw_HGX_DGXH100_231101.1.0.fwpkg
    
  9. Update the firmware on the ConnectX-7 controllers.

    Update the firmware on the cards that are used for cluster communication:

    sudo mstflint -d /sys/bus/pci/devices/0000:5e:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin  b
    sudo mstflint -d /sys/bus/pci/devices/0000:dc:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin  b
    sudo mstflint -d /sys/bus/pci/devices/0000:c0:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin  b
    sudo mstflint -d /sys/bus/pci/devices/0000:18:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin  b
    sudo mstflint -d /sys/bus/pci/devices/0000:40:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin  b
    sudo mstflint -d /sys/bus/pci/devices/0000:4f:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin  b
    sudo mstflint -d /sys/bus/pci/devices/0000:ce:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin  b
    sudo mstflint -d /sys/bus/pci/devices/0000:9a:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin  b
    

    Update the firmware on the cards that are used for storage communication:

    sudo mstflint -d /sys/bus/pci/devices/0000:aa:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX755206AS-NEA_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin  b
    sudo mstflint -d /sys/bus/pci/devices/0000:29:00.0/config -i fw-ConnectX7-rel-28_39_1002-MCX755206AS-NEA_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin  b
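
The repeated mstflint commands in the last step can also be written as a loop. The following is a minimal sketch that reuses the PCI addresses and firmware file names shown in that step. The function name burn_cards and the DRY_RUN variable are hypothetical and are not part of mstflint. With DRY_RUN set to 1, which is the default here, the script only prints the commands so that you can review them first.

```shell
#!/bin/sh
# Hypothetical helper: burn one firmware image to a list of PCI devices.
# DRY_RUN=1 (default) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

# burn_cards IMAGE ADDRESS...
# ADDRESS is the bus:device.function part, for example 5e:00.0.
burn_cards() {
    image=$1
    shift
    for addr in "$@"; do
        dev=/sys/bus/pci/devices/0000:$addr/config
        if [ "$DRY_RUN" = 1 ]; then
            echo "sudo mstflint -d $dev -i $image b"
        else
            # Stop at the first failure so later cards are not touched.
            sudo mstflint -d "$dev" -i "$image" b || return 1
        fi
    done
}

# Cards used for cluster communication.
burn_cards fw-ConnectX7-rel-28_39_1002-MCX750500B-0D00_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin \
    5e:00.0 dc:00.0 c0:00.0 18:00.0 40:00.0 4f:00.0 ce:00.0 9a:00.0

# Cards used for storage communication.
burn_cards fw-ConnectX7-rel-28_39_1002-MCX755206AS-NEA_Ax-UEFI-14.32.12-FlexBoot-3.7.201.signed.bin \
    aa:00.0 29:00.0
```

To run the updates for real after you review the output, invoke the script with DRY_RUN=0. Confirm that the PCI addresses match your system before you do so.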