DGX A100 System Firmware Update Container Version 20.11.3

The DGX Firmware Update container version 20.11.3 is available.

  • Package name: nvfw-dgxa100_20.11.3_201124.tar.gz
  • Run file name: nvfw-dgxa100_20.11.3_201124.run
  • Image name: nvfw-dgxa100:20.11.3

Highlights and Changes in this Release

  • This release is supported with the following DGX OS software:
    • DGX OS 4.99.11
    • DGX OS 5.0 or later
  • Includes firmware updates to resolve PCIe advanced error reporting (AER) issues.
  • The BMC update includes software security enhancements.

    See the NVIDIA Security Bulletin 5010 for details.

  • Changed the policy for updating the active/inactive BMC images. See Updating Components with Secondary Images.
  • Removed the need to manually stop certain services before updating on DGX OS 5.0.
  • See DGX A100 System Firmware Changes for the list of changes in individual components.

Contents of the DGX A100 System Firmware Container

This container includes the firmware binaries and update utilities for the firmware listed in the following table. The update time for each component is provided for reference. Total update time if all components are updated is approximately 2 hours and 20 minutes.

Component | Version | Key Changes | Update Time
BMC (via CEC) | 00.13.16 | The BMC update includes software security enhancements. See the NVIDIA Security Bulletin 5010 for details. | 25 minutes
SBIOS | 0.30 | See SBIOS Release Notes | 7 minutes
Broadcom 88096 PCIe switch board | 0.1.8 | Updated preset values to address PCIe advanced error reporting (AER) issues. | 8 minutes
BMC CEC SPI | v3.25 | Improved BMC update time and reliability. | 8 minutes
PEX88064 Retimer | 0.F.0 | Improved error handling of downstream switches. | 7 minutes
PEX88080 Retimer | 0.F.0 | Improved error handling of downstream switches. | 7 minutes
NvSwitch BIOS | 92.10.14.00.01 | Added support for a new out-of-band SMBPBI query to retrieve FUB revocation status. | 8 minutes
VBIOS (A100 40GB) | 92.00.19.00.10 | Improved VBIOS compatibility. | 7 minutes
VBIOS (A100 80GB) | 92.00.36.00.01 | New addition to the container. |
U.2 NVMe (Samsung) | EPK99B5Q | Enabled Relaxed Ordering. | 6 minutes
FPGA (GPU sled) | 2.9c | Added to the container. Implements miscellaneous bug fixes. | 40 minutes
CEC1712 SPI (GPU sled) | 3.5 | Added to the container. Improved update time and reliability. | 7 minutes

Updating Components with Secondary Images

Some firmware components provide a secondary image as backup. The following is the policy when updating those components:
  • SBIOS: The two images are referred to as active and inactive, where the active image is the currently running image and the inactive image is the backup. When you use update_fw all, the update container updates both the active and inactive images.
  • BMC: The two images are referred to as active and inactive, where the active image is the currently running image and the inactive image is the backup. The update container can update only the inactive image, and does so only if the active image needs to be updated. After the update is completed, the updated inactive image becomes the active image. Because the active image is now up to date, subsequent update_fw all commands will not update the inactive image. To update the inactive image in this case, use update_fw BMC --inactive --force.
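
The BMC policy above can be sketched as a small decision helper. This is only an illustration: the function name bmc_update_args and its "yes"/"no" arguments are hypothetical and are not part of the update container. The arguments are assumed to be the up-to-date values reported for the active and inactive BMC images.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the container): map the BMC up-to-date
# states to the update_fw arguments suggested by the policy above.
#   $1 = up-to-date value of the active image   ("yes" or "no")
#   $2 = up-to-date value of the inactive image ("yes" or "no")
bmc_update_args() {
    local active="$1" inactive="$2"
    if [ "$active" = "no" ]; then
        # The container writes the new image to the inactive slot, which
        # then becomes the active image.
        echo "update_fw all"
    elif [ "$inactive" = "no" ]; then
        # The active image is current, so update_fw all would skip the BMC.
        echo "update_fw BMC --inactive --force"
    else
        echo "none"
    fi
}
```

For example, bmc_update_args yes no prints update_fw BMC --inactive --force, which matches the case described above where only the backup image is out of date.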

Instructions for Updating Firmware

This section provides a simple way to update the firmware on the system using the firmware update container. It includes instructions for performing a transitional update for systems that require it. The commands use the .run file, but you can also use any method described in Using the DGX A100 FW Update Utility.
CAUTION:
Stop all unnecessary system activities before attempting to update firmware, and do not add additional loads on the system (such as Kubernetes jobs or other user jobs or diagnostics) while an update is in progress. A high GPU workload can disrupt the firmware update process and result in an unusable component.
  1. Check if updates are needed by checking the installed versions.
    $ sudo nvfw-dgxa100_20.11.3_201124.run show_version
    • If there is "no" in any up-to-date column for updatable firmware, then continue with the next step.
    • If all up-to-date column entries are "yes", then no updates are needed and no further action is necessary.
  2. Perform the update for all firmware supported by the container.
    $ sudo nvfw-dgxa100_20.11.3_201124.run update_fw all
    Depending on the firmware that is updated, you may be prompted to either reboot the system or power cycle the system.
    • If you are prompted to reboot, issue the following command.
      $ sudo reboot
    • If you are prompted to power cycle, you can issue the following two commands (there is no output with the first command).
      $ sudo ipmitool raw 0x3c 0x04
      $ sudo ipmitool chassis power cycle
  3. After rebooting or power cycling the system, you may need to perform another update_fw all to update other firmware.
    • Either repeat Step 1 to check if updates are needed and then perform Step 2 if needed, or
    • Repeat Step 2 just in case updates are needed.
    If you perform another update_fw all, you may be prompted again to either reboot the system or power cycle the system. See DGX A100 Firmware Update Process for more information about the update process.
  4. Rename the firmware update log file that the update generates (/var/log/nvidia-fw.log).

    Example:

    $ sudo mv /var/log/nvidia-fw.log /var/log/nvidia-fw-large.log
    Refer to Firmware Update Log File Size Impacts nvsm dump health for more information.
You can verify the update by issuing the following.
$ sudo nvfw-dgxa100_20.11.3_201124.run show_version

Example output for a DGX A100 320GB system

 BMC DGX 
=========
Image Id             Status  Location  Onboard Version  Manifest   up_to_date
0:Active   Boot      Online   Local     00.13.16        00.13.16         yes
1:Inactive Updatable          Local     00.13.16        00.13.16         yes

  CEC
============
                                     Onboard Version   Manifest    up-to-date      
MB_CEC(enabled)                       3.25              3.25             yes         
 
 SBIOS
=======
Image Id                   Method    Onboard Version   Manifest    up_to_date   
0:Inactive Updatabl        afulnx     0.30              0.30             yes       
1:Active   Boot                       0.30              0.30             yes       
 
 Video BIOS
============
Bus            Model                 Onboard Version    Manifest    up-to-date
0000:07:00.0   A100-SXM4-40GB        92.00.19.00.10     92.00.19.00.10    yes
0000:0f:00.0   A100-SXM4-40GB        92.00.19.00.10     92.00.19.00.10    yes
0000:47:00.0   A100-SXM4-40GB        92.00.19.00.10     92.00.19.00.10    yes
0000:4e:00.0   A100-SXM4-40GB        92.00.19.00.10     92.00.19.00.10    yes
0000:87:00.0   A100-SXM4-40GB        92.00.19.00.10     92.00.19.00.10    yes
0000:90:00.0   A100-SXM4-40GB        92.00.19.00.10     92.00.19.00.10    yes
0000:b7:00.0   A100-SXM4-40GB        92.00.19.00.10     92.00.19.00.10    yes
0000:bd:00.0   A100-SXM4-40GB        92.00.19.00.10     92.00.19.00.10    yes

  Switches
============
PCI Bus#                  Model       Onboard Version   Manifest     up-to-date
DGX - 0000:91:00.0(U261)  88064_Retimer  0.F.0          0.F.0            yes
DGX - 0000:88:00.0(U260)  88064_Retimer  0.F.0          0.F.0            yes
DGX - 0000:4f:00.0(U262)  88064_Retimer  0.F.0          0.F.0            yes

DGX - 0000:48:00.0(U225)  88080_Retimer  0.F.0          0.F.0            yes

DGX - 0000:c4:00.0        LR10        92.10.14.00.01    92.10.14.00.01     yes
DGX - 0000:c5:00.0        LR10        92.10.14.00.01    92.10.14.00.01     yes
DGX - 0000:c2:00.0        LR10        92.10.14.00.01    92.10.14.00.01     yes
DGX - 0000:c6:00.0        LR10        92.10.14.00.01    92.10.14.00.01     yes
DGX - 0000:c3:00.0        LR10        92.10.14.00.01    92.10.14.00.01     yes
DGX - 0000:c7:00.0        LR10        92.10.14.00.01    92.10.14.00.01     yes

DGX - 0000:01:00.0(U1)    PEX88096        1.8               1.8            yes
DGX - 0000:81:00.0(U3)    PEX88096        1.8               1.8            yes
DGX - 0000:41:00.0(U2)    PEX88096        1.8               1.8            yes
DGX - 0000:b1:00.0(U4)    PEX88096        1.8               1.8            yes
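
Scanning the up-to-date column by eye is error-prone on a system with this many components. The following is a minimal sketch of a filter that prints only the rows whose last column is "no". The function name list_outdated is hypothetical, and the sketch assumes the column layout shown in the example output above. It does not distinguish updatable from non-updatable firmware.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the container): read show_version output
# on stdin and print every row whose last column is "no".
# Exit status: 0 if at least one such row exists, 1 if none was found.
list_outdated() {
    awk '$NF == "no" { print; found = 1 } END { exit !found }'
}

# Possible usage, assuming the .run file is invoked as shown in this document:
#   sudo nvfw-dgxa100_20.11.3_201124.run show_version | list_outdated
```

An exit status of 1 corresponds to the case where every up-to-date entry is "yes" and no further action is necessary.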

Known Issues

Firmware Update Log File Size Impacts nvsm dump health

Issue

After running the container, the generated log file (/var/log/nvidia-fw.log) can grow to tens of gigabytes in size, depending on the firmware that gets updated. If you later run nvsm dump health, the command might time out and fail because of the file size.

Explanation

To avoid problems running nvsm dump health, rename the generated firmware update log file after updating the firmware.

Example:

$ sudo mv /var/log/nvidia-fw.log /var/log/nvidia-fw-large.log
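
If you update firmware more than once, a fixed target name overwrites the previous renamed log. The following is a minimal sketch that renames the log with a timestamp suffix instead. The function name archive_fw_log and the naming scheme are hypothetical, and the sketch assumes that any file name other than nvidia-fw.log avoids the nvsm dump health timeout. Run it as root when operating on /var/log.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the container): move the firmware update
# log aside under a timestamped name.
#   $1 = log path (defaults to the path named in this document)
archive_fw_log() {
    local log="${1:-/var/log/nvidia-fw.log}"
    local stamp
    stamp="$(date +%Y%m%d-%H%M%S)"
    # Nothing to do if no update has produced a log yet.
    [ -f "$log" ] || return 0
    # Example result: /var/log/nvidia-fw-<date>-<time>.log
    mv -- "$log" "${log%.log}-${stamp}.log"
}
```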

BMC UI May Not Be Accessible from Mac OS

Issue

When attempting to connect to the DGX A100 BMC from a system with Mac OS, a "Your connection is not private" message appears that prevents access to the BMC.

Explanation

BMC version 0.13.6 provides a self-signed certificate, which Mac OS flags in the browser. Most browsers let you either accept the risk and continue, or add the certificate to the keychain and continue. The Chrome and Opera browsers, however, do not provide these options, so Mac OS users cannot access the BMC from Chrome or Opera.

To access the DGX A100 BMC, Mac OS users can use Safari or Firefox, which provide an access path.

Update Timeout Reported for Motherboard CEC

Issue

The update progress output reports "Update_timeout" for the motherboard CEC (MB_CEC) when using the .run file without Docker installed.

Example

+----------------------------------------------------------------------------------+
|+--------------------------------------------------------------------------------+|
||                   !!!!! Firmware Update In Progress !!!!!                      ||
||                        Status: reflash BMC firmware                            ||
|+--------------------------------------------------------------------------------+|
|                                 Onboard         Manifest          Update Status  |
|MB_CEC                           3.05            3.25              Update timeout |

Explanation

This message can be ignored provided that the MB_CEC update is successful.

Example success message:

Success:
  Installed firmware 3.25 on MB_CEC
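
If you saved the update output to a file, you can check for the success message instead of relying on the progress display. The following is a minimal sketch. The function name mb_cec_installed is hypothetical, and the sketch assumes the success line has exactly the form shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the container): read captured update
# output on stdin and succeed only if it reports that the given firmware
# version was installed on MB_CEC.
#   $1 = expected version, for example 3.25
mb_cec_installed() {
    # -F matches the text literally, so the dots in the version are not
    # treated as regular-expression wildcards.
    grep -Fq "Installed firmware $1 on MB_CEC"
}

# Possible usage, where update-output.txt is a hypothetical capture file:
#   mb_cec_installed 3.25 < update-output.txt && echo "MB_CEC update OK"
```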

Boot Order in the SBIOS Reverts to the Default

Issue

After updating the SBIOS, any changes in the boot order that you have made are not preserved and the boot order specified in the SBIOS reverts to the following.

Boot Option #1  [Hard Disk]
Boot Option #2  [NVME]
Boot Option #3  [USB CD/DVD]
Boot Option #4  [USB Hard Disk]
Boot Option #5  [USB Key]
Boot Option #6  [Network]

Explanation

Expect the default boot order when you restart the system after the update, and then enter the SBIOS setup to edit the boot order as needed.