DGX A100 System Firmware Update Container Version 21.05.7

The DGX Firmware Update container version 21.05.7 is available.

  • Package name: nvfw-dgxa100_21.05.7_210519.tar.gz
  • Run file name: nvfw-dgxa100_21.05.7_210519.run
  • Image name: nvfw-dgxa100:21.05.7
  • ISO image: DGXA100_FWUI-21.05.7-2021-05-19-07-26-16.iso
  • PXE netboot: pxeboot-dgxa100-21.05.7.tgz

Highlights and Changes in this Release

  • This release is supported with the following DGX OS software:
    • DGX OS 5.0 or later
      Important: This firmware update container does not support DGX OS 4.99.xx. To use the container on DGX A100 servers, update to DGX OS 5.0 or later.
    • EL7-21.01 or later (See Special Instructions for Red Hat Enterprise Linux)
    • EL8-20.11 or later
  • Eliminated the need for the workaround when updating the CEC 1712 SPI from 3.5 to 3.9.
  • Fixed 0W reporting issue with Delta PSU.

Contents of the DGX A100 System Firmware Container

This container includes the firmware binaries and update utilities for the firmware listed in the following table. The update time for each component is provided for reference. Total update time if all components are updated is approximately 2 hours and 20 minutes.

Component                         Version                                    Key Changes               Update Time
BMC (via CEC)                     00.14.17                                   See DGX A100 BMC Changes  25 minutes
SBIOS                             0.34                                       No change                 7 minutes
Broadcom 88096 PCIe switch board  0.2.0                                      No change                 8 minutes
BMC CEC SPI                       v3.28                                      No change                 8 minutes
PEX88064 Retimer                  1.2f                                       No change                 7 minutes
PEX88080 Retimer                  1.2f                                       No change                 7 minutes
NvSwitch BIOS                     92.10.18.00.01                             No change                 8 minutes
VBIOS (A100 40GB)                 92.00.36.00.04                             No change                 7 minutes
VBIOS (A100 80GB)                 92.00.36.00.01                             No change
U.2 NVMe (Samsung)                EPK99B5Q                                   No change                 6 minutes
U.2 NVMe (Kioxia)                 0105                                       No change
M.2 NVMe (Samsung)                EDA7602Q                                   No change                 3 minutes
FPGA (GPU sled)                   2.A5                                       No change                 40 minutes
CEC1712 SPI (GPU sled)            3.9                                        No change                 7 minutes
PSU (Delta)                       Primary 1.6/ Secondary 1.6/ Community 1.7  Added to container        90 minutes

Updating Components with Secondary Images

Some firmware components provide a secondary image as backup. The following is the policy when updating those components:
  • SBIOS: The two images are referred to as active and inactive, where the active is the currently running image and the inactive is the backup image. When using update_fw all, the update container updates both active and inactive images.
  • BMC: The two images are referred to as active and inactive, where the active is the currently running image and the inactive is the backup image. The update container can only update the inactive image, and will update it only if the active image needs to be updated. After the update is completed, the updated inactive image becomes the active image. Because the active image is now updated, subsequent update_fw all commands will not update the inactive image. To update the inactive image in this case, use update_fw BMC --inactive.
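
The BMC policy above can be checked from a script. The following is a minimal sketch, not part of the firmware container. It assumes the show_version report uses the layout shown in the example output in these notes (a "BMC DGX" section whose rows end in yes or no); the function name bmc_inactive_stale is hypothetical.

```shell
# Hypothetical helper: reads show_version output on stdin and succeeds
# (exit status 0) only if the inactive BMC image is reported as not up to date.
# Assumes the "BMC DGX" section layout shown in the example output.
bmc_inactive_stale() {
    awk '
        /BMC DGX/          { in_bmc = 1; next }   # entering the BMC section
        in_bmc && NF == 0  { in_bmc = 0 }         # a blank line ends the section
        in_bmc && /Inactive/ && $NF == "no" { stale = 1 }
        END { exit !stale }'
}
```

One possible use is to pipe the report into the helper and, when it succeeds, follow up with update_fw BMC --inactive:

    sudo ./nvfw-dgxa100_21.05.7_210519.run show_version | bmc_inactive_stale && echo "inactive BMC image needs update"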

Updating the PSU FW

If one of the PSU firmware slots (primary, secondary, or community) is corrupted, updating the PSU firmware will fail if attempts are made to update other slots.

  • If you know the slot that is corrupted, then update that slot as follows (where <psu> is 0, 1, 2, 3, 4, or 5; <Slot> is Primary, Secondary, or Community):
    $ sudo ./nvfw-dgxa100_21.05.7_210519.run update_fw PSU -s <psu> -S <Slot> -f 
  • If you do not know which slot is corrupted, then use the SKIP_FAIL flag to update all three slots.
    $ sudo ./nvfw-dgxa100_21.05.7_210519.run set_flags SKIP_FAIL=1 update_fw PSU -s <psu> -f

    The FWUC may display a message about the PSU update failing in the non-corrupted slots, but the PSU should actually be recovered because the corrupted slot is successfully updated.
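
The choice between the two commands above can be sketched as a small helper. This is an illustration under assumptions, not a tool shipped with the container: the RUN_FILE variable and the function name psu_update_cmd are hypothetical, and the function only prints the command so that it can be reviewed before anything is run.

```shell
# Path to the .run file named in these release notes (adjust as needed).
RUN_FILE=./nvfw-dgxa100_21.05.7_210519.run

# Hypothetical helper: print the PSU update command for one PSU.
#   $1 = PSU index (0-5)
#   $2 = corrupted slot (Primary, Secondary, or Community); omit if unknown
psu_update_cmd() {
    case "$1" in
        [0-5]) ;;
        *) echo "invalid PSU index: $1" >&2; return 1 ;;
    esac
    case "$2" in
        Primary|Secondary|Community)
            # Known corrupted slot: update only that slot.
            echo "sudo $RUN_FILE update_fw PSU -s $1 -S $2 -f" ;;
        "")
            # Slot unknown: use SKIP_FAIL so all three slots are attempted.
            echo "sudo $RUN_FILE set_flags SKIP_FAIL=1 update_fw PSU -s $1 -f" ;;
        *) echo "invalid slot: $2" >&2; return 1 ;;
    esac
}
```

For example, psu_update_cmd 2 Primary prints the single-slot command for PSU 2, while psu_update_cmd 2 prints the SKIP_FAIL form.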

DO NOT UPDATE DGX A100 CPLD FIRMWARE UNLESS INSTRUCTED

When updating DGX A100 firmware using the Firmware Update Container, do not update the CPLD firmware unless the DGX A100 system is being upgraded from 320GB to 640GB.

The current DGX A100 Firmware Update Container will not automatically update the CPLD firmware (for example, when running update_fw all). It is possible to update the CPLD firmware using “update_fw CPLD”; however, it is strongly recommended that the CPLD firmware not be updated manually unless specifically instructed by NVIDIA Enterprise Support (enterprisesupport@nvidia.com). If the DGX A100 is upgraded from 320GB to 640GB, the CPLD firmware update should be performed as instructed.

Special Instructions for Red Hat Enterprise Linux

This section describes the actions that must be taken before updating firmware on DGX A100 systems installed with Red Hat Enterprise Linux. There are two options for meeting these requirements.

Option 1: Update to EL7-21.01 or later

Refer to the DGX Software for Red Hat Enterprise Linux 7 Release Notes for more information.
Important: Updating the DGX software for Red Hat Enterprise Linux will update the Red Hat Enterprise Linux installation to 7.9 or later. If you do not want to update your Red Hat Enterprise Linux 7 installation, then choose Option 2.

Option 2: Install mpt3sas 31.101.01.00-0

These instructions apply if
  • You do not want to update your Red Hat Enterprise Linux installation, and
  • Your system is currently installed with Red Hat Enterprise Linux 7.7 or later.

    If your system is installed with Red Hat Enterprise Linux 7.6 or earlier, contact NVIDIA Enterprise Support for assistance.
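
The release requirement above can be checked before starting. The following is a minimal sketch, assuming the release string has the usual /etc/redhat-release form (for example, "Red Hat Enterprise Linux Server release 7.9 (Maipo)"); the function name rhel7_minor_ok is hypothetical.

```shell
# Hypothetical helper: succeed (exit status 0) only if the release string
# passed as $1 reports Red Hat Enterprise Linux 7.7 or a later 7.x release.
rhel7_minor_ok() {
    # Extract "major.minor" from text such as
    # "Red Hat Enterprise Linux Server release 7.9 (Maipo)".
    ver=$(printf '%s\n' "$1" |
        sed -n 's/.*release \([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1.\2/p')
    major=${ver%%.*}
    minor=${ver#*.}
    # An unparsable string leaves major empty, so the first test fails.
    [ "$major" = "7" ] && [ "${minor:-0}" -ge 7 ]
}
```

A typical call would be rhel7_minor_ok "$(cat /etc/redhat-release)"; a failing status suggests contacting NVIDIA Enterprise Support, as noted above.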

  1. Perform this step if your system is no longer pointing to the NVIDIA DGX software repository.
    1. On Red Hat Enterprise Linux, run the following commands to enable additional repositories required by the DGX software.
      sudo subscription-manager repos --enable=rhel-7-server-extras-rpms
      sudo subscription-manager repos --enable=rhel-7-server-optional-rpms
    2. Run the following command to install the DGX software installation package and enable the NVIDIA DGX software repository.

      Attention: By running these commands you are confirming that you have read and agree to be bound by the DGX Software License Agreement. You are also confirming that you understand that any pre-release software and materials available that you elect to install in a DGX may not be fully functional, may contain errors or design flaws, and may have reduced or different security, privacy, availability, and reliability standards relative to commercial versions of NVIDIA software and materials, and that you use pre-release versions at your risk.
      sudo yum install -y \
      https://international.download.nvidia.com/dgx/repos/rhel-files/dgx-repo-setup-20.03-1.el7.x86_64.rpm
  2. Install mpt3sas 31.101.01.00-0.
    sudo yum install mpt3sas-dkms
  3. Load the mpt3sas driver into the Red Hat Enterprise Linux kernel.
    sudo modprobe mpt3sas 
    You can verify the correct mpt3sas version is installed by issuing the following.
     yum list installed
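
Because yum list installed prints every installed package, it can help to filter the listing. The sketch below is an assumption-laden illustration: it assumes the package appears as mpt3sas-dkms in the first column with a version beginning with 31.101.01.00-0 in the second column, and the function name has_expected_mpt3sas is hypothetical.

```shell
# Hypothetical helper: read `yum list installed` output on stdin and succeed
# only if an mpt3sas-dkms row whose version starts with 31.101.01.00-0 exists.
has_expected_mpt3sas() {
    awk -v want="31.101.01.00-0" '
        $1 ~ /^mpt3sas-dkms/ && index($2, want) == 1 { ok = 1 }
        END { exit !ok }'
}
```

For example:

    yum list installed | has_expected_mpt3sas && echo "mpt3sas 31.101.01.00-0 is installed"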

Instructions for Updating Firmware

This section provides a simple way to update the firmware on the system using the firmware update container. It includes instructions for performing a transitional update for systems that require it. The commands use the .run file, but you can also use any method described in Using the DGX A100 FW Update Utility.
CAUTION:
  • Stop all unnecessary system activities before attempting to update firmware.
  • Stop all GPU activity, including accessing nvidia-smi, as this can prevent the VBIOS from updating.
  • Do not add additional loads on the system (such as user jobs, diagnostics, or monitoring services) while an update is in progress. A high workload can disrupt the firmware update process and result in an unusable component.
  • When initiating an update, the update software assists in determining the activity state of the DGX system and provides a warning if it detects that activity levels are above a predetermined threshold. If the warning is encountered, you are strongly advised to take action to reduce the workload before proceeding with the update.
  1. Check if updates are needed by checking the installed versions.
    $ sudo ./nvfw-dgxa100_21.05.7_210519.run show_version
    • If there is "no" in any up-to-date column for updatable firmware, then continue with the next step.
    • If all up-to-date column entries are "yes", then no updates are needed and no further action is necessary.
  2. Perform the update for all firmware supported by the container.
    $ sudo ./nvfw-dgxa100_21.05.7_210519.run update_fw all
    Depending on the firmware that is updated, you may be prompted to either reboot the system or power cycle the system.
    • If you are prompted to reboot, issue
      $ sudo reboot
    • If you are prompted to power cycle, you can issue the following two commands (there is no output with the first command).
      $ sudo ipmitool raw 0x3c 0x04
      $ sudo ipmitool chassis power cycle
  3. After rebooting or power cycling the system, you may need to perform another update_fw all to update other firmware.
    • Either repeat Step 1 to check if updates are needed and then perform Step 2 if needed, or
    • Repeat Step 2 just in case updates are needed.
    If you perform another update_fw all, you may be prompted again to either reboot the system or power cycle the system. See DGX A100 Firmware Update Process for more information about the update process.
You can verify the update by issuing the following.
$ sudo ./nvfw-dgxa100_21.05.7_210519.run show_version

Example output for a DGX A100 320GB system

 CEC
============
                                           Onboard Version   Manifest  up-to-date
MB_CEC(enabled)                             3.28              3.28         yes
Delta_CEC(enabled)                          3.09              3.09         yes

 BMC DGX
=========
Image Id              Status    Location    Onboard Version   Manifest  up_to_date
0:Active   Boot       Online    Local       00.14.17          00.14.17     yes
1:Inactive Updatable            Local       00.14.17          00.14.17     yes

 SBIOS
=======
Image Id                                    Onboard Version   Manifest    up_to_date
0:Inactive Updatable                        0.34              0.34          yes
1:Active   Boot Updatable                   0.34              0.34          yes

  Switches
============
PCI Bus#                  Model      Onboard Version Manifest FUB Updated? up-to-date
DGX - 0000:91:00.0(U261)  88064_Retimer  1.2.0        1.2.0      N/A         yes
DGX - 0000:88:00.0(U260)  88064_Retimer  1.2.0        1.2.0      N/A         yes
DGX - 0000:4f:00.0(U262)  88064_Retimer  1.2.0        1.2.0      N/A         yes
DGX - 0000:48:00.0(U225)  88080_Retimer  1.2.0        1.2.0      N/A         yes


DGX - 0000:01:00.0(U1)    PEX88096       2.0         2.0       N/A         yes
DGX - 0000:b1:00.0(U4)    PEX88096       2.0         2.0       N/A         yes
DGX - 0000:41:00.0(U2)    PEX88096       2.0         2.0       N/A         yes
DGX - 0000:81:00.0(U3)    PEX88096       2.0         2.0       N/A         yes


DGX - 0000:c4:00.0        LR10    92.10.18.00.01  92.10.18.00.01  yes      yes
DGX - 0000:c5:00.0        LR10    92.10.18.00.01  92.10.18.00.01  yes      yes
DGX - 0000:c8:00.0        LR10    92.10.18.00.01  92.10.18.00.01  yes      yes
DGX - 0000:c6:00.0        LR10    92.10.18.00.01  92.10.18.00.01  yes      yes
DGX - 0000:c9:00.0        LR10    92.10.18.00.01  92.10.18.00.01  yes      yes
DGX - 0000:c7:00.0        LR10    92.10.18.00.01  92.10.18.00.01  yes      yes



 Mass Storage
==============
Drive Name/Slot         Model Number        Onboard Version   Manifest   up-to-date
nvme0n1          Samsung MZWLJ3T8HBLS-00007  EPK99B5Q         EPK99B5Q     yes
nvme1n1          Samsung MZ1LB1T9HALS-00007  EDA7602Q         EDA7602Q     yes
nvme2n1          Samsung MZ1LB1T9HALS-00007  EDA7602Q         EDA7602Q     yes
nvme3n1          Samsung MZWLJ3T8HBLS-00007  EPK99B5Q         EPK99B5Q     yes
nvme4n1          Samsung MZWLJ3T8HBLS-00007  EPK99B5Q         EPK99B5Q     yes
nvme5n1          Samsung MZWLJ3T8HBLS-00007  EPK99B5Q         EPK99B5Q     yes

 Video BIOS
============
Bus            Model            Onboard Version   Manifest     FUB Updated? up-to-date
0000:07:00.0   A100-SXM4-40GB   92.00.36.00.04   92.00.36.00.04  yes        yes
0000:0f:00.0   A100-SXM4-40GB   92.00.36.00.04   92.00.36.00.04  yes        yes
0000:47:00.0   A100-SXM4-40GB   92.00.36.00.04   92.00.36.00.04  yes        yes
0000:4e:00.0   A100-SXM4-40GB   92.00.36.00.04   92.00.36.00.04  yes        yes
0000:87:00.0   A100-SXM4-40GB   92.00.36.00.04   92.00.36.00.04  yes        yes
0000:90:00.0   A100-SXM4-40GB   92.00.36.00.04   92.00.36.00.04  yes        yes
0000:b7:00.0   A100-SXM4-40GB   92.00.36.00.04   92.00.36.00.04  yes        yes
0000:bd:00.0   A100-SXM4-40GB   92.00.36.00.04   92.00.36.00.04  yes        yes


FPGA
========
Onboard version     Manifest  up-to-date
02.a5               02.a5        yes

Power Supply
==============
                                                           Onboard
ID                  Vendor Model        MFR ID  Status   Version   Manifest  up-to-date
PSU 0: Primary      Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 0: Secondary    Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 0: Community    Delta ECD16010092   Delta     ok       01.07     01.07      yes
PSU 1: Primary      Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 1: Secondary    Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 1: Community    Delta ECD16010092   Delta     ok       01.07     01.07      yes
PSU 2: Primary      Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 2: Secondary    Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 2: Community    Delta ECD16010092   Delta     ok       01.07     01.07      yes
PSU 3: Primary      Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 3: Secondary    Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 3: Community    Delta ECD16010092   Delta     ok       01.07     01.07      yes
PSU 4: Primary      Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 4: Secondary    Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 4: Community    Delta ECD16010092   Delta     ok       01.07     01.07      yes
PSU 5: Primary      Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 5: Secondary    Delta ECD16010092   Delta     ok       01.06     01.06      yes
PSU 5: Community    Delta ECD16010092   Delta     ok       01.07     01.07      yes

  CPLD
============
                                         Onboard Version   Manifest       up-to-date
MID_CPLD                                 1.03              1.03               yes
MB_CPLD                                  1.03              1.03               yes

* CPLD won't be updated by default (`update_fw all`), use `update_fw CPLD` if it's needed
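
The up-to-date check from Step 1 can also be scripted against a report like the one above. This is a minimal sketch, assuming every firmware row ends with yes or no in its last column as in the example output; the function name needs_update is hypothetical.

```shell
# Hypothetical helper: read show_version output on stdin and succeed
# (exit status 0) if any row reports "no" in its last (up-to-date) column.
needs_update() {
    awk '$NF == "no" { found = 1 } END { exit !found }'
}
```

For example:

    sudo ./nvfw-dgxa100_21.05.7_210519.run show_version | needs_update && echo "updates are pending"

Matching only the last field keeps a value in another column, such as FUB Updated?, from being mistaken for the up-to-date status.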

Known Issues

BMC UI May not be Accessible from Mac OS

Issue

When attempting to connect to the DGX A100 BMC from a system with Mac OS, a "Your connection is not private" message appears that prevents access to the BMC.

Explanation

BMC version 0.13.6 provides a self-signed certificate which Mac OS flags in the browser. Most browsers will let you either accept the risk and continue, or add the certificate to the keychain and continue. The Chrome and Opera browsers, however, do not provide these options and so Mac OS users will not be able to access the BMC from the Chrome or Opera browser.

To access the DGX A100 BMC, Mac OS users can use Safari or Firefox, which provide an access path.

Boot Order in the SBIOS Reverts to the Default

Issue

After updating the SBIOS, any changes in the boot order that you have made are not preserved and the boot order specified in the SBIOS reverts to the following.

Boot Option #1  [Hard Disk]
Boot Option #2  [NVME]
Boot Option #3  [USB CD/DVD]
Boot Option #4  [USB Hard Disk]
Boot Option #5  [USB Key]
Boot Option #6  [Network]

Explanation

You will need to prepare for the change when restarting the system, and then enter the SBIOS setup to edit the boot order as needed.

Unable to Launch BMC Dashboard under Firefox

Issue

After updating the BMC to 0.13.16, attempts to access the BMC dashboard fail with a "Secure Connection Fail" message.

Explanation

To work around this issue, update Firefox to version 84.01 or later.

System Starts the POST Process Several Times During Boot After Updating the SBIOS

Issue

After updating the SBIOS and rebooting the system, the NVIDIA splash screen appears and disappears several times before boot is completed.

Explanation

After updating the SBIOS, several component states are cleared and it takes 3-4 reboots to reset all the components. This is expected behavior.