DGX A100 System Firmware Update Container Release Notes

This document describes the key features, software enhancements and improvements, and known issues for the NVIDIA DGX A100 System Firmware Update Container.

1. DGX A100 System FW Update Container Overview

The NVIDIA DGX™ A100 System Firmware Update container is the preferred method for updating firmware on DGX A100 systems. It provides an easy way to update the firmware to the latest released versions, and uses the standard method for running Docker containers.

This document describes firmware components that can be updated, any known issues, and how to run this container.

Features

  • Automates firmware (FW) update for DGX A100 system firmware, such as the system BIOS and BMC
  • Provides flexibility to update individual or all FW components
  • Embeds the following:
    • Qualified FW binaries for supported components
    • Flash update utilities and supporting dependencies
    • Manifest file, which lists:
      • Target platform and firmware version numbers
      • Sequence in which FW updates should be applied
      • “On-Error” policy for every FW component
  • Supports interactive and non-interactive firmware update
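
To make the manifest concept more concrete, the following Python sketch models a manifest as a list of entries that are applied in sequence, each with its own on-error policy. The field names (`component`, `version`, `on_error`), the policy values (`"abort"`, `"continue"`), the policy assigned to each component, and the `apply_manifest` helper are all hypothetical and chosen for illustration only; the actual manifest format inside the container is not described in this document.

```python
from typing import Callable, Dict, List

# Hypothetical manifest: field names and policy values are assumptions,
# not the container's real manifest format.
MANIFEST: List[Dict[str, str]] = [
    {"component": "BMC", "version": "00.12.05", "on_error": "abort"},
    {"component": "SBIOS", "version": "0.23", "on_error": "abort"},
    {"component": "VBIOS", "version": "92.00.19.00.01", "on_error": "continue"},
]


def apply_manifest(manifest, flash: Callable[[str, str], bool]) -> List[str]:
    """Apply entries in manifest order and return a log of outcomes.

    `flash` is a caller-supplied stand-in for a real flash utility and
    returns True on success.
    """
    log = []
    for entry in manifest:
        ok = flash(entry["component"], entry["version"])
        if ok:
            log.append(f"{entry['component']}: updated")
            continue
        log.append(f"{entry['component']}: failed ({entry['on_error']})")
        if entry["on_error"] == "abort":
            break  # stop the sequence; later components are not attempted
    return log


if __name__ == "__main__":
    # Simulate a flash function in which only the SBIOS update fails.
    print(apply_manifest(MANIFEST, lambda name, _ver: name != "SBIOS"))
```

In this sketch, a failure on a component marked `"abort"` stops the sequence, while a failure on a component marked `"continue"` is logged and the remaining entries are still attempted.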

How to Use

The NVIDIA DGX A100 system software includes Docker software required to run the container. The update container is also available as a run file that does not require a Docker installation.

CAUTION:
Stop all unnecessary system activities before attempting to update firmware, and do not add additional processing loads while an update is in progress. A high workload can disrupt the firmware update process and result in an incapacitated component.

When initiating an update, the update software assists in determining the activity state of the DGX system and provides a warning if it detects that activity levels are above a predetermined threshold. If the warning is encountered, you are strongly advised to take action to reduce the workload before proceeding with the update.
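
The activity check described above is performed by the update software itself. As a rough illustration of the idea only, the Python sketch below compares the 1-minute load average per CPU against a threshold. The `is_system_busy` helper and the 0.5 threshold are assumptions; the metric and threshold that the container actually uses are not documented here.

```python
import os


def is_system_busy(load_1min: float, cpu_count: int, threshold: float = 0.5) -> bool:
    """Return True when the per-CPU 1-minute load exceeds `threshold`.

    The threshold is an illustrative value, not the one the update
    container uses.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return (load_1min / cpu_count) > threshold


if __name__ == "__main__":
    load_1min, _, _ = os.getloadavg()  # available on Unix-like systems
    cpus = os.cpu_count() or 1
    if is_system_busy(load_1min, cpus):
        print("Warning: reduce the workload before updating firmware.")
    else:
        print("System activity is low enough to proceed.")
```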

Fan speeds may increase while updating the BMC firmware. This is a normal part of the BMC firmware update process.

2. DGX A100 System Firmware Update Container Version 20.05

The DGX Firmware Update container version 20.05.12 is available.

  • Package name: nvfw-dgxa100_20.05.12_200603.tar.gz
  • Run file name: nvfw-dgxa100_20.05.12_200603.run
  • Image name: nvfw-dgxa100:20.05.12

Highlights and Changes in this Release

  • This release is supported with the following DGX OS software:
    • DGX OS 4.99.8 or later
  • Enabled BMC Secure Flash
  • Enabled PCI-Compliant DPC and AER error propagation
  • Implemented critical VBIOS fixes

Contents of the DGX A100 System Firmware Container

This container includes the firmware binaries and update utilities for the firmware listed in the following table.

Component                         Version         Update Time  Key Changes
--------------------------------  --------------  -----------  -----------
BMC (via CEC)                     00.12.05        31 minutes   Added to container
                                                               • BMC now recognizes the level of CEC installed, and enforces Secure Flash if the CEC supports it.
                                                               • Removed the ability to update the BMC via the UI.
                                                               • Added micro-controller assist (MCA) SEL, downloadable from the UI.
                                                               • Added Logs & Reports > Debug Log > Download Debug log control to BMC UI.
SBIOS                             0.23            7 minutes    Added to container
                                                               • Removed Hidden Options and made TPM Configuration options visible
                                                               • Fixed NVSM Show Health Errors related to DIMMs and DIMM population
                                                               • Fixed system getting stuck at POST after enabling and then disabling drive encryption
Broadcom 88096 PCIe switch board  1.3             8 minutes    Added to container
                                                               • Disabled hot-plug and hot-plug surprise capability
BMC CEC SPI                       v3.05           8 minutes    Added to container
PEX88064 Retimer                  0.1.13          7 minutes    Updated
PEX88080 Retimer                  0.1.13          7 minutes    Updated
NvSwitch BIOS                     92.10.12.00.01  8 minutes    No change
VBIOS                             92.00.19.00.01  7 minutes    Updated
                                                               • Fixed Xid 64 (Row Remapper Error)
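
When planning a maintenance window, it can help to sum the update times in the table above. The Python sketch below does that arithmetic under two assumptions: components are updated one after another, and each listed time applies once per component. Reboots and any transitional update are not included.

```python
# Update times in minutes, copied from the table above.
UPDATE_MINUTES = {
    "BMC (via CEC)": 31,
    "SBIOS": 7,
    "Broadcom 88096 PCIe switch board": 8,
    "BMC CEC SPI": 8,
    "PEX88064 Retimer": 7,
    "PEX88080 Retimer": 7,
    "NvSwitch BIOS": 8,
    "VBIOS": 7,
}


def total_update_minutes(times):
    """Sum per-component update times, assuming sequential updates."""
    return sum(times.values())


if __name__ == "__main__":
    print(f"Estimated total: {total_update_minutes(UPDATE_MINUTES)} minutes")
```

With the times listed, the sum is 83 minutes. Treat this as a rough estimate rather than a guaranteed duration.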

2.1. Instructions for Updating Firmware

This section provides a simple way to update the firmware on the system using the firmware update container. It includes instructions for performing a transitional update for systems that require it.
  1. Check if updates are needed.
    $ sudo nvfw-dgxa100_20.05.12_200603.run show_version
    • If there is "no" in any up-to-date column, then continue with the next step.
    • If all up-to-date column entries are "yes", then no updates are needed and no further action is necessary.
  2. Check if the transitional update is needed.

    Depending on the BMC and MB_CEC versions on the system, you may need to perform a transitional update before updating the BMC and SBIOS to the latest versions.

    $ nvfw-dgxa100_20.05.12_200603.run run_script --command "fw_transition.py show_version" 

    The following message appears if a transition update is needed.

    BMC/MB_CEC firmware needs update to Active/Inactive, secure boot mode 
    This is a one-time update required for DGXA100. All future updates require BMC in this mode
    • If the one-time update is required, continue with step 3 to perform the transitional update.
    • If the one-time update is not required, skip to step 4.
  3. Perform the transitional update.
    1. Issue the following.
      $ nvfw-dgxa100_20.05.12_200603.run run_script --command "fw_transition.py update_fw" 
      $ sudo reboot
    2. Verify that BMC (both images) and the MB_CEC are up to date.
      $ nvfw-dgxa100_20.05.12_200603.run run_script --command "fw_transition.py show_version"
      If the inactive BMC needs to be updated, then issue the following.
      $ nvfw-dgxa100_20.05.12_200603.run update_fw BMC --inactive
  4. Perform the final update for all firmware supported by the container and reboot the system.
    $ nvfw-dgxa100_20.05.12_200603.run update_fw all
    
    $ sudo reboot
    Note: The update_fw all command updates the active images only.
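
The decision flow in steps 1 through 4 can be summarized as a small planning function. The Python sketch below only builds the list of commands to run, using the command lines shown in the steps above; it does not execute anything. The `plan_update` helper and its boolean inputs are hypothetical: the operator determines them by reading the `show_version` and `fw_transition.py show_version` output, and whether the inactive BMC needs an update is known only after the post-transition reboot.

```python
RUN_FILE = "nvfw-dgxa100_20.05.12_200603.run"


def plan_update(all_up_to_date: bool, needs_transition: bool,
                inactive_bmc_outdated: bool = False) -> list:
    """Return the commands from steps 3 and 4 that apply, in order."""
    if all_up_to_date:
        return []  # step 1: no updates are needed
    commands = []
    if needs_transition:
        # Step 3: one-time transitional update, then reboot.
        commands.append(f'{RUN_FILE} run_script --command "fw_transition.py update_fw"')
        commands.append("sudo reboot")
        if inactive_bmc_outdated:
            commands.append(f"{RUN_FILE} update_fw BMC --inactive")
    # Step 4: final update of all supported firmware, then reboot.
    commands.append(f"{RUN_FILE} update_fw all")
    commands.append("sudo reboot")
    return commands


if __name__ == "__main__":
    for command in plan_update(all_up_to_date=False, needs_transition=True):
        print(command)
```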
You can verify the update by issuing the following.
$ sudo nvfw-dgxa100_20.05.12_200603.run show_version

Note: The onboard version for the inactive SBIOS may not be up to date. Do not attempt to update the inactive SBIOS. See the Known Issues section.

Expected output:
 BMC DGX 
=========
Image Id             Status  Location  Onboard Version  Manifest   up_to_date
0:Active   Boot      Online   Local     00.12.05        00.12.05         yes
1:Inactive Updatable          Local     00.12.05        00.12.05         yes

  CEC
============
                                     Onboard Version   Manifest    up-to-date      
MB_CEC(enabled)                       3.05              3.05             yes         
 
 SBIOS
=======
Image Id                   Method    Onboard Version   Manifest    up_to_date   
0:Inactive Updatabl        afulnx     0.18              0.23             no       
1:Active   Boot                       0.23              0.23             yes       
 
 Video BIOS
============
Bus            Model                 Onboard Version    Manifest    up-to-date
0000:07:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes
0000:0f:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes
0000:47:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes
0000:4e:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes
0000:87:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes
0000:90:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes
0000:b7:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes
0000:bd:00.0   A100-SXM4-40GB        92.00.19.00.01     92.00.19.00.01    yes

  Switches
============
PCI Bus#                  Model       Onboard Version   Manifest     up-to-date
DGX - 0000:91:00.0(U261)  88064_Retimer  0.13.0          0.13.0            yes
DGX - 0000:88:00.0(U260)  88064_Retimer  0.13.0          0.13.0            yes
DGX - 0000:4f:00.0(U262)  88064_Retimer  0.13.0          0.13.0            yes

DGX - 0000:48:00.0(U225)  88080_Retimer  0.13.0          0.13.0            yes

DGX - 0000:c4:00.0        LR10        92.10.12.00.01    92.10.12.00.01     yes
DGX - 0000:c5:00.0        LR10        92.10.12.00.01    92.10.12.00.01     yes
DGX - 0000:c2:00.0        LR10        92.10.12.00.01    92.10.12.00.01     yes
DGX - 0000:c6:00.0        LR10        92.10.12.00.01    92.10.12.00.01     yes
DGX - 0000:c3:00.0        LR10        92.10.12.00.01    92.10.12.00.01     yes
DGX - 0000:c7:00.0        LR10        92.10.12.00.01    92.10.12.00.01     yes

DGX - 0000:01:00.0(U1)    PEX88096        1.3               1.3            yes
DGX - 0000:81:00.0(U3)    PEX88096        1.3               1.3            yes
DGX - 0000:41:00.0(U2)    PEX88096        1.3               1.3            yes
DGX - 0000:b1:00.0(U4)    PEX88096        1.3               1.3            yes
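
When the `show_version` output is long, scanning the last column by eye is error-prone. The Python sketch below flags rows whose final column reads "no". It assumes the text layout matches the expected output shown above, which may differ between container versions; the `find_outdated` helper is hypothetical and is not part of the container.

```python
def find_outdated(show_version_output: str) -> list:
    """Return the output rows whose last column reads "no"."""
    outdated = []
    for line in show_version_output.splitlines():
        fields = line.split()
        if fields and fields[-1].lower() == "no":
            outdated.append(" ".join(fields))
    return outdated


# Sample taken from the SBIOS section of the expected output above.
SAMPLE = """\
 SBIOS
=======
Image Id                   Method    Onboard Version   Manifest    up_to_date
0:Inactive Updatabl        afulnx     0.18              0.23             no
1:Active   Boot                       0.23              0.23             yes
"""

if __name__ == "__main__":
    for row in find_outdated(SAMPLE):
        print(row)
```

In this sample the only flagged row is the inactive SBIOS image, which should be left as is according to the Known Issues section.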

Known Issues

Inactive SBIOS Cannot be Updated

Issue

There are two SBIOS images. When updating the SBIOS, only the active SBIOS image should be updated. Updating the inactive SBIOS (update_fw SBIOS -f --inactive) will result in the system getting stuck at POST during reboot.

Explanation

This is a limitation that will be resolved in a later software release.
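
For sites that wrap the update commands in automation, a simple guard can help prevent this command from being issued by accident. The Python sketch below is a hypothetical check (the `is_unsafe_command` helper is not part of the container) that rejects any argument list that combines `update_fw`, `SBIOS`, and `--inactive`.

```python
def is_unsafe_command(args: list) -> bool:
    """Return True if args request an update of the inactive SBIOS."""
    lowered = [arg.lower() for arg in args]
    return (
        "update_fw" in lowered
        and "sbios" in lowered
        and "--inactive" in lowered
    )


if __name__ == "__main__":
    requested = ["update_fw", "SBIOS", "-f", "--inactive"]
    if is_unsafe_command(requested):
        print("Refusing to update the inactive SBIOS image.")
```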

Notices

Notice

THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION REFERENCED IN THIS GUIDE IS PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the product described in this guide shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.

THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.

NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.

Trademarks

NVIDIA, the NVIDIA logo, DGX, DGX-1, DGX-2, and DGX Station are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.