DGX-2 System Firmware Update Container Version 21.06.7

The DGX Firmware Update container version 21.06.7 is available.

  • Package name: nvfw-dgx2_21.06.7_210616.tar.gz
  • Run file name: nvfw-dgx2_21.06.7_210616.run
  • Image name: nvfw-dgx2:21.06.7
Important: NVIDIA strongly advises updating the BMC if the installed BMC is version 01.05.07.

Highlights and Changes in this Release

  • This release is supported with the following DGX OS software:
    • DGX OS 4.7 or later
    • EL7-21.01 or later
    • EL8-20.11 or later
  • The container now checks the Intel Management Engine (ME) firmware version and ME Setting version to determine when to update the ME firmware. The ME version is now included in the show_fw_manifest output.
  • The container now uses the mlnx-fw-updater package to update the ConnectX cards. If the mlnx-fw-updater package is installed, the ConnectX card firmware is updated during update_fw all so that all cards are brought to the same level as the newest installed firmware.
  • The GPU driver is now prevented from inadvertently interfering with VBIOS updates.
  • NVIDIA services no longer need to be manually stopped before updating firmware on systems installed with DGX OS 5 or later.

Contents of the DGX-2 System Firmware Container

This container includes the firmware binaries and update utilities for the firmware listed in the following table.

Component                  Version          Key Changes
BMC                        1.07.02 *        See DGX-2 BMC Changes
SBIOS                      0.29             See DGX-2 SBIOS Changes
M.2 NVMe (Samsung PM963)   CXV8601Q         No change
U.2 SSD (Micron)           101008S0         No change
VBIOS (DGX-2)              88.00.6B.00.01   No change
VBIOS (DGX-2H)             88.00.6B.00.08   No change
PSU                        3.1 *            No change
FPGA                       3.1              No change

* Note: Refer to the instructions in section Special Instructions to determine applicable actions to take.

Note: There are two FPGA images - Image-1:Rescue and Image-2:Primary. The Firmware Update Container updates the Primary FPGA image only.

Change to the Update Process

Previously, only certain firmware components, such as the SBIOS, required power cycling the system after the update. To ensure that all DGX-2 services continue running, you must now power cycle the DGX-2 after updating any firmware component or group of components.

The addition of Intel ME update capability means that you must run update_fw all twice when updating all firmware components. Refer to Instructions for Updating Firmware for detailed instructions.

Updating Components with Secondary Images

Some firmware components provide a secondary image as backup. The following is the policy when updating those components:
  • SBIOS: Only the primary image is updated. To update both images, follow the instructions at Special Instructions for PSU, SBIOS, and BMC Firmware Updates.
  • BMC: Only the primary image is updated. To update the secondary (backup) image, include the --inactive option in the update command.
  • FPGA: Only the primary image is updated.

Enabling SNMP RO/RW Strings

The SNMP RO/RW strings are disabled by default. The following table provides the ipmitool arguments for enabling the strings. After enabling, disabling, or setting the RO/RW strings, either issue the restart SNMP Server command or reset the BMC for the changes to go into effect.

LUN       Cmd   Requested Data
                Offset   Description
3Ch/00h   26h   1        00h: Enable RO string
                         01h: Enable RW string
                         02h: Disable RO string
                         03h: Disable RW string
                         04h: Set RO string
                         05h: Set RW string
                         06h: Start SNMP Server
                         07h: Stop SNMP Server
                         08h: Restart SNMP Server
                2:21     Community string in ASCII code. Maximum string
                         length is 20 characters. If byte 1 is set to 04h
                         or 05h and bytes 2 through 21 are empty, the
                         corresponding community string is cleared.

For example, the following steps enable the RO string, set the community string to "test", and then restart the SNMP service on the BMC:
  1. Enable RO.
    $ sudo ipmitool raw 0x3c 0x26 0x00
  2. Set the RO string to "test".
    $ sudo ipmitool raw 0x3c 0x26 0x04 0x74 0x65 0x73 0x74 
  3. Restart the SNMP service on the BMC.
    $ sudo ipmitool raw 0x3c 0x26 0x08
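
Converting a community string to byte arguments by hand is error-prone. The following is a minimal sketch of a helper that performs the conversion and enforces the 20-character limit. The function name community_to_hex is hypothetical; it is not part of ipmitool or the firmware update container.

```shell
#!/bin/sh
# Hypothetical helper (not part of ipmitool or the container): print an ASCII
# community string as space-separated 0xNN bytes for use with "ipmitool raw".
community_to_hex() {
    cth_str=$1
    # The BMC accepts at most 20 characters (request bytes 2 through 21).
    if [ "${#cth_str}" -gt 20 ]; then
        echo "community string is longer than 20 characters" >&2
        return 1
    fi
    cth_hex=""
    # od prints each input byte as two hex digits; prefix each one with 0x.
    for cth_byte in $(printf '%s' "$cth_str" | od -An -v -tx1); do
        cth_hex="${cth_hex:+$cth_hex }0x$cth_byte"
    done
    printf '%s\n' "$cth_hex"
}
```

With this helper defined, `sudo ipmitool raw 0x3c 0x26 0x04 $(community_to_hex test)` expands to the same command shown in step 2.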

Special Instructions for PSU, SBIOS, and BMC Firmware Updates

  • Before updating the PSU, SBIOS, or the BMC, refer to the following special instructions for guidance to ensure the updates are successful.

PSU Updates

SBIOS Updates

  • If the installed BMC is version 1.05.7, then update the BMC first before updating the SBIOS.
  • To update both primary and secondary SBIOS (after updating the BMC) using the container, do the following (assumes the primary SBIOS is the current, active SBIOS).
    1. Refer to Special Instructions for all Updates to see if services need to be stopped and how to do it.
    2. Update the active SBIOS using the update_fw SBIOS argument from the firmware update container.
    3. Designate booting from the secondary (inactive) SBIOS on the next boot.
      $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:21.06.7 sbios_slot --switch-nextboot-slot
    4. Power cycle the DGX-2 to switch to the secondary SBIOS.
      $ sudo telinit 1
      $ sudo umount /raid
      $ sync
      $ sudo ipmitool chassis power cycle
    5. Update the secondary (now active) SBIOS.
    6. Designate booting from the primary SBIOS on the next boot (to restore the primary SBIOS as the active SBIOS).
      $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:21.06.7 sbios_slot --switch-nextboot-slot
    7. Power cycle the DGX-2 to switch back to the primary SBIOS.
      $ sudo telinit 1
      $ sudo umount /raid
      $ sync
      $ sudo ipmitool chassis power cycle

BMC Updates

Instructions for Updating Firmware

This section provides a simple way to update the firmware on the system using the firmware update container. It includes instructions for performing a transitional update for systems that require it. The commands use the .run file, but you can also use any method described in Using the DGX-2 FW Update Utility.
CAUTION:
  • Stop all unnecessary system activities before attempting to update firmware.
  • Stop all GPU activity, including accessing nvidia-smi, as this can prevent the VBIOS from updating.
  • Do not add additional loads on the system (such as user jobs, diagnostics, or monitoring services) while an update is in progress. A high workload can disrupt the firmware update process and result in an unusable component.
  • When initiating an update, the update software assists in determining the activity state of the DGX system and provides a warning if it detects that activity levels are above a predetermined threshold. If the warning is encountered, you are strongly advised to take action to reduce the workload before proceeding with the update.
  1. Check if updates are needed by checking the installed versions.
    $ sudo ./nvfw-dgx2_21.06.7_210616.run show_version
    • If there is "no" in any up-to-date column for updatable firmware, then continue with the next step.
    • If all up-to-date column entries are "yes", then no updates are needed and no further action is necessary.
  2. Begin the process of updating all the firmware supported by the container.
    $ sudo ./nvfw-dgx2_21.06.7_210616.run update_fw all
    You will be prompted to power cycle the server.
  3. Power cycle the server.
    $ sudo ipmitool chassis power cycle
  4. After power cycling the system, perform another update_fw all to update the Intel ME.
    Note: Since the Intel ME is part of the SBIOS, the container messaging may indicate that the SBIOS is getting updated. This is expected.
  5. Perform another power cycle.
    $ sudo ipmitool chassis power cycle

    See DGX-2 Firmware Update Process for more information about the update process.

You can verify the update by issuing the following.
$ sudo ./nvfw-dgx2_21.06.7_210616.run show_version

Known Issues

EEPROM Checksum Mismatch

Issue

BMC version 1.05.7 introduced an issue that could cause corruption in the BMC EEPROM. This is indicated by an EEPROM checksum mismatch error message when attempting to update any firmware.

You can also verify EEPROM corruption by issuing the following command and then checking the output for the error message.
$ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:21.06.7 show_version
Note: This error may be reported if a corrupt SBIOS produces a watchdog timeout during boot. In this case, the error message is erroneous. See the section Watchdog Timeout Due to Corrupt SBIOS for instructions on confirming and then resolving the SBIOS corruption.

Resolution

The DGX-2 Firmware Update Container version 21.06.7 includes logic to detect and repair the corruption. Perform the following steps to repair the EEPROM corruption.
  1. If the BMC is not already updated, then update the BMC.

    Refer to Special Instructions for all Updates to see if services need to be stopped and how to do it.

  2. Review the "current" and "next" boot SBIOS by issuing the following.
    $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:21.06.7 sbios_slot --get-nextboot-slot
  3. Perform actions based on the NextBoot and Currently Booted From slots.
    • If the NextBoot slot and Currently Booted From slot are different, then reboot the system using ipmitool.
      $ sudo telinit 1
      $ sudo umount /raid
      $ sync
      $ sudo ipmitool chassis power cycle
      
    • If the NextBoot slot and Currently Booted From slot are the same, then switch the NextBoot slot and then reboot as follows.
      $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:21.06.7 sbios_slot --switch-nextboot-slot
      $ sudo telinit 1
      $ sudo umount /raid
      $ sync
      $ sudo ipmitool chassis power cycle
      
  4. Switch the NextBoot slot again and reboot to return to the original SBIOS.
    $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:21.06.7 sbios_slot --switch-nextboot-slot
    $ sudo telinit 1
    $ sudo umount /raid
    $ sync
    $ sudo ipmitool chassis power cycle
    
  5. Verify the version strings in the primary and secondary slots are restored to their correct values.
    $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:21.06.7 show_version
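
The branching in step 3 can be sketched as a small decision helper. This is only an illustration: the function name sbios_step3_action and the slot labels used in the usage note are hypothetical, and the two slot values are passed in by hand because the exact output format of sbios_slot --get-nextboot-slot is not assumed here.

```shell
#!/bin/sh
# Hypothetical helper: choose the step 3 action from the two slot values
# reported by "sbios_slot --get-nextboot-slot". The values are supplied as
# arguments; this sketch does not parse the container output.
sbios_step3_action() {
    current_slot=$1    # slot reported as "Currently Booted From"
    nextboot_slot=$2   # slot reported as "NextBoot"
    if [ -z "$current_slot" ] || [ -z "$nextboot_slot" ]; then
        echo "usage: sbios_step3_action <current-slot> <nextboot-slot>" >&2
        return 2
    fi
    if [ "$current_slot" = "$nextboot_slot" ]; then
        # Same slot: switch the NextBoot slot first, then power cycle.
        echo "switch NextBoot slot, then power cycle"
    else
        # Different slots: a power cycle alone boots the other SBIOS.
        echo "power cycle"
    fi
}
```

For example, `sbios_step3_action primary primary` prints "switch NextBoot slot, then power cycle", while `sbios_step3_action primary secondary` prints "power cycle".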

Watchdog Timeout Due to Corrupt SBIOS

Issue

If an SBIOS image is corrupt, the system cannot boot from it. When the system attempts to boot from the corrupt SBIOS, a watchdog timeout occurs and the system then boots from the alternate SBIOS. On the next reboot, the system again attempts to boot from the original SBIOS, times out, and then boots from the alternate SBIOS.

To confirm that a watchdog timeout has occurred, do the following:
  1. Issue the following.
    $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:21.06.7 show_version
    $ sudo cat /var/log/nvidia-fw.log | grep "EEPROM detection status 1" -n1 
  2. Inspect byte 14 from the last EEPROM struct entry in the output.

    If byte 14 is 01, as in the following example, then a watchdog timeout has occurred.

    {EEPROM struct :00 00 16 00 00 18 00 01 03 01 03 01 22 01 01 a5}
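
Picking one byte out of the struct by eye is easy to get wrong. The following is a minimal sketch of a helper that extracts a byte by position from a line in the format shown above. The function name eeprom_byte is hypothetical, and counting bytes from 1 is an assumption; in the example line, counting from 0 or from 1 both give 01 for byte 14.

```shell
#!/bin/sh
# Hypothetical helper: print byte N of an "{EEPROM struct :.. .. ..}" log
# line. Bytes are counted from 1, which is an assumption of this sketch.
eeprom_byte() {
    eb_line=$1
    eb_index=$2
    eb_bytes=${eb_line#*:}     # drop everything up to the first colon
    eb_bytes=${eb_bytes%\}*}   # drop the closing brace
    # Split the remaining text into one positional parameter per byte.
    set -- $eb_bytes
    if [ "$eb_index" -lt 1 ] || [ "$eb_index" -gt "$#" ]; then
        echo "byte index out of range" >&2
        return 1
    fi
    shift $((eb_index - 1))
    printf '%s\n' "$1"
}
```

To use it, pass the last EEPROM struct line from the grep output as the first argument and 14 as the second; a result of 01 indicates a watchdog timeout.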

Resolution

If the SBIOS is corrupted, you can re-flash the SBIOS from the BMC dashboard. See Updating the SBIOS from the BMC Dashboard for instructions.

VBIOS Not Updated on DGX KVM Host

Issue

On a DGX-2 System that has been converted to a DGX KVM host, the VBIOS will not get updated if the GPU is being used by a guest GPU VM.

Explanation

All guest GPU VMs must be stopped before running the container to update the VBIOS. To stop the VMs, run the following from the KVM host for each guest GPU VM.

$ virsh shutdown <vm-domain>
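
When several guest VMs are running, the shutdown can be looped. The following is a minimal sketch, assuming domain names are supplied one per line on standard input. The function name shutdown_vms and the DRY_RUN variable are hypothetical. With DRY_RUN left at its default of 1, the function only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper: read VM domain names from stdin, one per line, and
# shut each one down. DRY_RUN=1 (the default in this sketch) prints the
# commands instead of running them.
shutdown_vms() {
    while IFS= read -r sv_domain; do
        # Skip blank lines, such as the trailing one from "virsh list --name".
        [ -n "$sv_domain" ] || continue
        if [ "${DRY_RUN:-1}" = "1" ]; then
            printf 'virsh shutdown %s\n' "$sv_domain"
        else
            # Redirect stdin so virsh cannot consume the remaining names.
            virsh shutdown "$sv_domain" </dev/null
        fi
    done
}
```

For example, `virsh list --name | shutdown_vms` previews the shutdown of every running domain. Note that this list includes all running VMs, not only guest GPU VMs, so filter the names first if other VMs must stay up.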