DGX A100 System Firmware Update Container Version 20.05.12.3

The DGX Firmware Update container version 20.05.12.3 is available.

  • Package name: nvfw-dgxa100_20.05.12.3_200716.tar.gz

  • Run file name: nvfw-dgxa100_20.05.12.3_200716.run

  • Image name: nvfw-dgxa100:20.05.12.3

Highlights and Changes in this Release

  • This release is supported with the following DGX OS software -

  • Fixed an issue where the DGX A100 fans would run at high speed when the optional dual-port network card was not installed.

Contents of the DGX A100 System Firmware Container

This container includes the firmware binaries and update utilities for the firmware listed in the following table.

Component                         Version         Key Changes                Update Time
--------------------------------  --------------  -------------------------  -----------
BMC (via CEC)                     00.12.06        Fixed high fan speed bug.  31 minutes
SBIOS                             0.25            No change                  7 minutes
Broadcom 88096 PCIe switch board  1.3             No change                  8 minutes
BMC CEC SPI                       v3.05           No change                  8 minutes
PEX88064 Retimer                  0.13.0          No change                  7 minutes
PEX88080 Retimer                  0.13.0          No change                  7 minutes
NvSwitch BIOS                     92.10.12.00.01  No change                  8 minutes
VBIOS                             92.00.19.00.01  No change                  7 minutes
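
As a rough planning aid, the per-component update times in the table can be added up. This sketch assumes that each listed component is updated once and that the updates run one after another. Neither assumption is stated in this document, so treat the result as an estimate and not a guaranteed duration.

```shell
#!/usr/bin/env bash
# Sum the per-component update times (in minutes) copied from the table.
# Assumption: one sequential update per listed component.
total=0
for minutes in 31 7 8 8 7 7 8 7; do
    total=$((total + minutes))
done
echo "Estimated total update time: ${total} minutes"
```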

Updating Components with Secondary Images

Some firmware components provide a secondary image as backup. The following is the policy when updating those components:

  • SBIOS: The two images are referred to as active and inactive. The active image is the one currently running, and the inactive image is the backup. The update container can update only the inactive image. After a reboot, the updated image becomes the active image. You can then run the update again to update the image that is now inactive, so that both images are updated.

  • BMC: The two images are referred to as active and inactive. The active image is the one currently running, and the inactive image is the backup. The update container can update only the inactive image. After the update is completed, the updated image becomes the active image. You can then run the update again to update the image that is now inactive, so that both images are updated.

Updating Firmware on DGX Systems Installed with DGX OS Release 5.0 or Later

You need to stop certain NVIDIA services before using the container to update firmware on systems installed with DGX OS 5.0.x or later.

  • If you run the container using either the docker run or .run file method, then stop services first by issuing the following.

    $ sudo systemctl stop nvsm dcgm nvidia-fabricmanager nvidia-persistenced.service
    
  • If you run the container using NVSM CLI, then stop services first by issuing the following (does not include stopping nvsm).

    $ sudo systemctl stop dcgm nvidia-fabricmanager nvidia-persistenced.service
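
The two cases above differ only in whether nvsm is stopped. A minimal sketch of a helper that selects the service list is shown here. The function name and the method labels ("docker", "run", "nvsm") are hypothetical and chosen for illustration; they are not options of any NVIDIA tool.

```shell
#!/usr/bin/env bash
# Print the services to stop before a firmware update.
# Argument: how the container will be run (hypothetical labels:
# "docker" or "run" for docker run / .run file, "nvsm" for NVSM CLI).
services_to_stop() {
    case "$1" in
        docker|run)
            echo "nvsm dcgm nvidia-fabricmanager nvidia-persistenced.service"
            ;;
        nvsm)
            # The NVSM CLI case does not stop nvsm.
            echo "dcgm nvidia-fabricmanager nvidia-persistenced.service"
            ;;
        *)
            echo "unknown method: $1" >&2
            return 1
            ;;
    esac
}

# Example use (not executed here); the expansion is deliberately unquoted
# so that each service name becomes a separate argument:
#   sudo systemctl stop $(services_to_stop run)
```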
    

Instructions for Updating Firmware

This section provides a simple way to update the firmware on the system using the firmware update container. It includes instructions for performing a transitional update for systems that require it. The commands use the .run file, but you can also use any method described in Using the DGX A100 FW Update Utility.

Caution

Stop all unnecessary system activities before attempting to update firmware, and do not add additional loads on the system (such as Kubernetes jobs or other user jobs or diagnostics) while an update is in progress. A high GPU workload can disrupt the firmware update process and result in an unusable component.

  1. Perform a transitional update if needed.

    Depending on the BMC and MB_CEC versions on the system, you may need to perform a transitional update before updating the BMC and SBIOS to the latest versions.

    1. Check if the transitional update is needed.

      $ sudo nvfw-dgxa100_20.05.12.3_200716.run run_script --command "fw_transition.py show_version"
      

      The following message appears if a transitional update is needed.

      BMC/MB_CEC firmware needs update to Active/Inactive, secure boot mode
      This is a one-time update required for DGXA100. All future updates require BMC in this mode
      
      • If the one-time update is required, continue with the next step to perform the transitional update.

      • If the one-time update is not required, then skip to step 2.

    2. Refer to Updating Firmware on DGX Systems Installed with DGX OS Release 5.0 or Later to see if services need to be stopped and how to do it.

    3. Perform the transitional update.

      $ sudo nvfw-dgxa100_20.05.12.3_200716.run run_script --command "fw_transition.py update_fw"
      $ sudo reboot
      
    4. Verify that BMC (both images) and the MB_CEC are up to date.

      $ sudo nvfw-dgxa100_20.05.12.3_200716.run run_script --command "fw_transition.py show_version"
      
  2. Check if other updates are needed by checking the installed versions.

    $ sudo nvfw-dgxa100_20.05.12.3_200716.run show_version
    
    • If there is “no” in any up-to-date column for updatable firmware, then continue with the next step.

    • If all up-to-date column entries are “yes”, then no updates are needed and no further action is necessary.

  3. Perform the final update for all firmware supported by the container and reboot the system.

    1. Refer to Updating Firmware on DGX Systems Installed with DGX OS Release 5.0 or Later to see if services need to be stopped and how to do it.

    2. Perform the update.

      $ sudo nvfw-dgxa100_20.05.12.3_200716.run update_fw all
      
      $ sudo reboot
      

    Note

    The update_fw all command updates the inactive BMC and SBIOS images only. After rebooting the system, the updated images become “active”. You can then update the inactive images using nvfw-dgxa100_20.05.12.3_200716.run update_fw [BMC] [SBIOS] as needed.
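
The check in step 1 can be scripted by searching the command output for the notice quoted there. This is a sketch under two assumptions: that the notice is written to standard output, and that its wording matches the text shown in step 1. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Read text on stdin. Return 0 if it contains the transitional-update
# notice quoted in step 1, and 1 otherwise.
needs_transitional_update() {
    grep -q "BMC/MB_CEC firmware needs update to Active/Inactive"
}

# Example use (not executed here):
#   sudo nvfw-dgxa100_20.05.12.3_200716.run run_script \
#       --command "fw_transition.py show_version" | needs_transitional_update \
#       && echo "Transitional update required"
```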

You can verify the update by issuing the following.

$ sudo nvfw-dgxa100_20.05.12.3_200716.run show_version

Expected output.

 BMC DGX
=========
Image Id             Status  Location  Onboard Version  Manifest   up_to_date
0:Active   Boot      Online   Local     00.12.06        00.12.06         yes
1:Inactive Updatable          Local     00.12.06        00.12.06         yes

  CEC
============
                                     Onboard Version   Manifest    up-to-date
MB_CEC(enabled)                       3.05              3.05             yes

 SBIOS
=======
Image Id                   Method    Onboard Version   Manifest    up_to_date
0:Inactive Updatabl        afulnx     0.25              0.25             yes
1:Active   Boot                       0.25              0.25             yes

 Video BIOS
============
Bus            Model                 Onboard Version    Manifest    up-to-date
0000:07:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes
0000:0f:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes
0000:47:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes
0000:4e:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes
0000:87:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes
0000:90:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes
0000:b7:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes
0000:bd:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes

  Switches
============
PCI Bus#                  Model       Onboard Version   Manifest     up-to-date
DGX - 0000:91:00.0(U261)  88064_Retimer  0.13.0          0.13.0            yes
DGX - 0000:88:00.0(U260)  88064_Retimer  0.13.0          0.13.0            yes
DGX - 0000:4f:00.0(U262)  88064_Retimer  0.13.0          0.13.0            yes

DGX - 0000:48:00.0(U225)  88080_Retimer  0.13.0          0.13.0            yes

DGX - 0000:c4:00.0        LR10        92.10.12.00.01    92.10.12.00.01     yes
DGX - 0000:c5:00.0        LR10        92.10.12.00.01    92.10.12.00.01     yes
DGX - 0000:c2:00.0        LR10        92.10.12.00.01    92.10.12.00.01     yes
DGX - 0000:c6:00.0        LR10        92.10.12.00.01    92.10.12.00.01     yes
DGX - 0000:c3:00.0        LR10        92.10.12.00.01    92.10.12.00.01     yes
DGX - 0000:c7:00.0        LR10        92.10.12.00.01    92.10.12.00.01     yes

DGX - 0000:01:00.0(U1)    PEX88096        1.3               1.3            yes
DGX - 0000:81:00.0(U3)    PEX88096        1.3               1.3            yes
DGX - 0000:41:00.0(U2)    PEX88096        1.3               1.3            yes
DGX - 0000:b1:00.0(U4)    PEX88096        1.3               1.3            yes
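
To avoid reading every row by eye, the output can be filtered for rows that are not up to date. This sketch assumes the status is the last whitespace-separated field of each row, as in the expected output above, and that the output is written to standard output. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Read show_version output on stdin and print every row whose last
# field is "no". Exit status: 0 if no such row exists, 1 otherwise.
check_up_to_date() {
    awk '$NF == "no" { print; found = 1 } END { exit (found ? 1 : 0) }'
}

# Example use (not executed here):
#   sudo nvfw-dgxa100_20.05.12.3_200716.run show_version | check_up_to_date \
#       && echo "All firmware is up to date"
```

Header and separator rows pass through the filter silently because their last field is never "no".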