DGX A100 System Firmware Update Container Version 22.5.5
The DGX Firmware Update container version 22.5.5 is available.
Package name:
nvfw-dgxa100_22.5.5_220518.tar.gz
Run file name:
nvfw-dgxa100_22.5.5_220518.run
Image name:
nvfw-dgxa100:22.5.5
ISO image:
DGXA100_FWUI-22.5.5-2022-05-19-00-23-59.iso
PXE netboot:
pxeboot-DGXA100_FWUI-22.5.5.tgz
Highlights and Changes in this Release
This release is supported with the following DGX OS software:
DGX OS 5.1 or later.
Important
This firmware update container does not support DGX OS 4.99.xx. To use the container on DGX A100 servers, update to DGX OS 5.1 or later.
EL7-21.10 or later (See Special Instructions for Red Hat Enterprise Linux 7)
EL8-21.08 Update 1 or later
Fixed BMC issues
Fixed an issue that prevented certain sensors from being displayed in the BMC Web UI.
Fixed the handling of system power loss so that it is graceful, which prevents the BMC flash file system consistency issue and improves recovery.
Fixed issues that caused the BMC usage to dramatically increase, which resulted in a POST failure with error code 91 or B4.
This fix also improves the error handling in the Redfish interface.
Fixed the BMC Web UI security settings and page refresh during full screen mode.
Fixed the BMC SEL Event page, which returned an error when parsing certain SEL records.
Fixed an issue where the Power/Status LED was flashing continuously after the server was rebooted, and the Power/Status LED stayed on after the server was powered off.
Added Redfish API support.
For more information, see Redfish API support in the DGX A100 User Guide.
For a list of known issues, see Known Issues.
Fixed SBIOS issues
Fixed two issues that prevented boot order settings applied out-of-band from being saved to the BMC, which caused the settings to be lost after a subsequent firmware update.
Added interactive countdown messages during boot that display the Setup Prompt Timeout, which is configurable through the **Boot** > Setup Prompt Timeout configuration menu.
Added reporting of AGESA Version in SMBIOS.
Updated AGESA to version 1.0.0.D.
Contents of the DGX A100 System Firmware Container
This container includes the firmware binaries and update utilities for the firmware listed in the following table.
If you are updating from 21.11.4 to 22.5.5, the total update time is approximately **1 hour and 3 minutes**.
If you are updating from 21.03.6 or earlier to 22.5.5, the total update time is approximately 2 hours and 51 minutes.
The update time for each component is provided in the following table.
Component | Version | Key Changes | Update Time from 21.03.6 or earlier | Update Time from 21.11.4 |
---|---|---|---|---|
BMC (via CEC) | 00.17.07 | Refer to DGX A100 BMC Changes for the list of changes. | 31 minutes | 31 minutes |
SBIOS | 1.13 | Refer to DGX A100 SBIOS Changes for the list of changes. | 6 minutes | 6 minutes |
Broadcom 88096 PCIe switch board | 0.2.0 | No change | 1 minute | 0 minutes |
BMC CEC SPI ( | 3.28 | No change | 7 minutes | 0 minutes |
PEX88064 Retimer | 3.1.0 | New support | 1 minute | 1 minute |
PEX88080 Retimer | 3.1.0 | New support | 1 minute | 1 minute |
NvSwitch BIOS | 92.10.18.00.01 | No change | 2 minutes | 0 minutes |
VBIOS (A100 40GB) | 92.00.45.00.03 | Added security protection to the I2C interface. | 2 minutes | 0 minutes |
VBIOS (A100 80GB) | 92.00.45.00.05 | Added security protection to the I2C interface. | Same as above. | Same as above. |
VBIOS (A100 SystemB 80GB) | 92.00.81.00.06 | New support | N/A | Same as above. |
U.2 NVMe (Samsung) | EPK9CB5Q | Refer to DGX A100 U.2 NVMe Changes for the list of changes. | 5 minutes | 0 minutes |
U.2 NVMe (Kioxia) | 105 | No change | Same as above. | Same as above. |
M.2 NVMe (Samsung version 1) | EDA7602Q | No change | 0 minutes | 0 minutes |
M.2 NVMe (Samsung version 2) | GDC7302Q | New support | Same as above. | Same as above. |
FPGA (GPU sled) | 3.0e | New support | 22 minutes | 21 minutes |
CEC1712 SPI (GPU sled) | 4.0 | New support | 3 minutes | 3 minutes |
PSU (Delta rev04) | Primary 1.7/ Secondary 1.7/ Communication 1.7 | New support | 0 minutes | 0 minutes |
PSU (Delta rev03) | Primary 1.6/ Secondary 1.6/ Communication 1.7 | No change | 90 minutes | Same as above. |
PSU (Delta rev02) | Primary 1.6/ Secondary 1.6/ Communication 1.7 | No change | Same as above. | Same as above. |
PSU (LiteOn) | 908 | No change | 0 minutes | Same as above. |
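As a quick consistency check, the per-component times in the table can be added up and compared with the two totals quoted before it. The following Python sketch does that arithmetic. It assumes that a "Same as above." or "N/A" cell adds no extra time beyond the row it refers to; that reading is an interpretation of the table, not something the release notes state.

```python
# (component, minutes from 21.03.6 or earlier, minutes from 21.11.4)
# None marks a "Same as above." or "N/A" cell; treating those cells as
# adding no extra time is an assumption of this sketch.
ROWS = [
    ("BMC (via CEC)", 31, 31),
    ("SBIOS", 6, 6),
    ("Broadcom 88096 PCIe switch board", 1, 0),
    ("BMC CEC SPI", 7, 0),
    ("PEX88064 Retimer", 1, 1),
    ("PEX88080 Retimer", 1, 1),
    ("NvSwitch BIOS", 2, 0),
    ("VBIOS (A100 40GB)", 2, 0),
    ("VBIOS (A100 80GB)", None, None),
    ("VBIOS (A100 SystemB 80GB)", None, None),
    ("U.2 NVMe (Samsung)", 5, 0),
    ("U.2 NVMe (Kioxia)", None, None),
    ("M.2 NVMe (Samsung version 1)", 0, 0),
    ("M.2 NVMe (Samsung version 2)", None, None),
    ("FPGA (GPU sled)", 22, 21),
    ("CEC1712 SPI (GPU sled)", 3, 3),
    ("PSU (Delta rev04)", 0, 0),
    ("PSU (Delta rev03)", 90, None),
    ("PSU (Delta rev02)", None, None),
    ("PSU (LiteOn)", 0, None),
]


def total_minutes(column):
    """Sum one time column (1 or 2), skipping None cells."""
    return sum(row[column] for row in ROWS if row[column] is not None)


def as_hours_minutes(minutes):
    """Format a minute count as hours and minutes."""
    hours, rest = divmod(minutes, 60)
    return f"{hours} h {rest} min"


print(as_hours_minutes(total_minutes(1)))  # updating from 21.03.6 or earlier
print(as_hours_minutes(total_minutes(2)))  # updating from 21.11.4
```

Under that assumption the columns sum to 171 and 63 minutes, which agrees with the quoted totals of 2 hours 51 minutes and 1 hour 3 minutes.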
Updating Components with Secondary Images
Some firmware components provide a secondary image as backup. The following is the policy when updating those components:
SBIOS: The two images are referred to as active and inactive, where the active image is the currently running image and the inactive image is the backup image.
When using `update_fw all`, the update container updates both the active and inactive images.
BMC: The two images are referred to as active and inactive, where the active image is the currently running image and the inactive image is the backup image.
The update container can only update the inactive image, and will update it only if the active image needs to be updated. After the update is completed, the updated inactive image becomes the active image. Because the active image is now updated, subsequent `update_fw all` commands will not update the inactive image. To update the inactive image in this case, use `update_fw BMC --inactive`. Because the container does not support updating the active image directly, commands such as `update_fw BMC -a -f` will not work.
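The BMC policy is easier to follow as a small state model. The Python sketch below is only an illustration of the behavior described above, not code from the container: the `BmcImages` class, its method names, and the `"old"` version placeholder are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BmcImages:
    """Toy model of the two BMC image slots (hypothetical, for illustration)."""
    versions: list       # firmware version held by slot 0 and slot 1
    active: int = 0      # index of the currently running slot

    @property
    def inactive(self):
        return 1 - self.active

    def update_fw_all(self, target):
        """Model `update_fw all`: write the inactive slot only if the active one is outdated."""
        if self.versions[self.active] == target:
            return False                        # active is current, inactive is left alone
        self.versions[self.inactive] = target   # only the inactive slot can be written
        self.active = self.inactive             # the updated slot becomes the active one
        return True

    def update_fw_bmc_inactive(self, target):
        """Model `update_fw BMC --inactive`: write the inactive slot unconditionally."""
        self.versions[self.inactive] = target


bmc = BmcImages(versions=["old", "old"])   # "old" is a placeholder version
bmc.update_fw_all("00.17.07")
after_first = list(bmc.versions)           # one slot updated, roles swapped
bmc.update_fw_all("00.17.07")              # no effect: the active slot is already current
after_second = list(bmc.versions)
bmc.update_fw_bmc_inactive("00.17.07")
after_inactive = list(bmc.versions)        # both slots now hold the new version
```

The model shows why a second `update_fw all` leaves one slot on the old version and why `update_fw BMC --inactive` is the step that brings both slots to the same version.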
DO NOT UPDATE DGX A100 CPLD FIRMWARE UNLESS INSTRUCTED
When updating DGX A100 firmware using the Firmware Update Container, do not update the CPLD firmware unless the DGX A100 system is being upgraded from 320GB to 640GB.
The current DGX A100 Firmware Update Container will not automatically update the CPLD firmware (for example, when running `update_fw all`). It is possible to update the CPLD firmware using `update_fw CPLD`; however, it is strongly recommended that the CPLD firmware not be updated manually unless specifically instructed by NVIDIA Enterprise Support (email enterprisesupport@nvidia.com). If the DGX A100 is upgraded from 320GB to 640GB, the CPLD firmware update should be performed as instructed.
Special Instructions for Red Hat Enterprise Linux 7
This section describes the actions that must be taken before updating firmware on DGX A100 systems installed with Red Hat Enterprise Linux. There are two options for meeting these requirements.
Option 1: Update to EL7-22.05
Refer to the DGX Software for Red Hat Enterprise Linux 7 Release Notes for more information.
Important
Updating the DGX software for Red Hat Enterprise Linux will update the Red Hat Enterprise Linux installation to 7.9 or later. If you do not want to update your Red Hat Enterprise Linux 7 installation, then choose Option 2.
Option 2: Install mpt3sas 31.101.01.00-0
These instructions apply if:
You do not want to update your Red Hat Enterprise Linux installation, and
Your system is currently installed with Red Hat Enterprise Linux 7.7 or later.
Note
If your system is installed with Red Hat Enterprise Linux 7.6 or earlier, contact NVIDIA Enterprise Support for assistance.
Perform this step if your system is no longer pointing to the NVIDIA DGX software repository.
On Red Hat Enterprise Linux, run the following commands to enable additional repositories required by the DGX software.
$ sudo subscription-manager repos --enable=rhel-7-server-extras-rpms
$ sudo subscription-manager repos --enable=rhel-7-server-optional-rpms
Run the following command to install the DGX software installation package and enable the NVIDIA DGX software repository.
Attention
By running these commands you are confirming that you have read and agree to be bound by the DGX Software License Agreement. You are also confirming that you understand that any pre-release software and materials available that you elect to install in a DGX may not be fully functional, may contain errors or design flaws, and may have reduced or different security, privacy, availability, and reliability standards relative to commercial versions of NVIDIA software and materials, and that you use pre-release versions at your risk.
$ sudo yum install -y \
    https://international.download.nvidia.com/dgx/repos/rhel-files/dgx-repo-setup-20.03-1.el7.x86_64.rpm
Install `mpt3sas` 31.101.01.00-0.
$ sudo yum install mpt3sas-dkms
Load the `mpt3sas` driver into the Red Hat Enterprise Linux kernel.
$ sudo modprobe mpt3sas
You can verify that the correct `mpt3sas` version is installed by issuing the following.
$ yum list installed
Instructions for Updating Firmware
This section provides a simple way to update the firmware on the system using the firmware update container.
The commands use the .run file, but you can also use any method described in Using the DGX A100 FW Update Utility.
Caution
Do not log into the BMC dashboard UI while a firmware update is in progress.
Stop all unnecessary system activities before attempting to update firmware.
Stop all GPU activity, including accessing nvidia-smi, as this can prevent the VBIOS from updating.
When issuing `update_fw all`, stop the following services if they are launched from Docker through the `docker run` command:
dcgm-exporter
nvidia-dcgm
nvidia-fabricmanager
nvidia-persistenced
xorg-setup
lightdm
nvsm-core
kubelet
The container will attempt to stop these services automatically, but will be unable to stop any that are launched from Docker.
Do not add additional loads on the system (such as user jobs, diagnostics, or monitoring services) while an update is in progress. A high workload can disrupt the firmware update process and result in an unusable component.
When initiating an update, the update software assists in determining the activity state of the DGX system and provides a warning if it detects that activity levels are above a predetermined threshold. If the warning is encountered, you are strongly advised to take action to reduce the workload before proceeding with the update.
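Because the container cannot stop services that were started with `docker run`, it can help to check for such containers before updating. The Python sketch below is one possible approach, not part of the firmware tooling: it scans lines of the form `<container name> <image>` (for example, as printed by `docker ps --format '{{.Names}} {{.Image}}'`) for the service names listed above. Matching by substring is a heuristic chosen for this sketch, and the sample container names are hypothetical.

```python
# Service names taken from the caution list above.
SERVICES = (
    "dcgm-exporter", "nvidia-dcgm", "nvidia-fabricmanager",
    "nvidia-persistenced", "xorg-setup", "lightdm", "nvsm-core", "kubelet",
)


def containers_to_stop(docker_ps_lines):
    """Return container names whose name or image mentions a listed service.

    Each input line is expected to look like "<container name> <image>".
    Substring matching is a heuristic and an assumption of this sketch.
    """
    hits = []
    for line in docker_ps_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if any(service in line for service in SERVICES):
            hits.append(parts[0])
    return hits


# Hypothetical sample input, for illustration only.
sample = [
    "exporter nvcr.io/nvidia/k8s/dcgm-exporter:latest",
    "web nginx:latest",
]
print(containers_to_stop(sample))
```

Any container the function reports would then be stopped manually before running `update_fw all`.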
Check if updates are needed by checking the installed versions.
$ sudo ./nvfw-dgxa100_22.5.5_220518.run show_version
If there is “no” in any up-to-date column for updatable firmware, then continue with the next step.
If all up-to-date column entries are “yes”, then no updates are needed and no further action is necessary.
Perform the update for all firmware supported by the container.
$ sudo ./nvfw-dgxa100_22.5.5_220518.run update_fw all
Depending on the firmware that is updated, you may be prompted to either reboot the system or power cycle the system.
If you are prompted to reboot, issue
$ sudo reboot
If you are prompted to power cycle, you can issue the following two commands (there is no output with the first command).
$ sudo ipmitool raw 0x3c 0x04
$ sudo ipmitool chassis power cycle
After rebooting or power cycling the system, you may need to perform another `update_fw all` to update other firmware.
Either repeat Step 1 to check if updates are needed and then perform Step 2 if needed, or
Repeat Step 2 just in case updates are needed.
If you perform another `update_fw all`, you may be prompted again to either reboot the system or power cycle the system.
See DGX A100 Firmware Update Process for more information about the update process.
You can verify the update by issuing the following.
$ sudo ./nvfw-dgxa100_22.5.5_220518.run show_version
Example output for a DGX A100 640GB system
CEC
============
Onboard Version Manifest up-to-date
MB_CEC(enabled) 3.28 3.28 yes
Delta_CEC(enabled) 4.00 4.00 yes
BMC DGX
=========
Image Id Status Location Onboard Version Manifest up-to-date
0:Active Boot Online Local 00.17.07 00.17.07 yes
1:Inactive Updatable Local 00.17.07 00.17.07 yes
SBIOS
=======
Image Id Onboard Version Manifest up-to-date
0:Active Boot Updatable 1.13 1.13 yes
1:Inactive Updatable 1.13 1.13 yes
Switches
============
PCI Bus# Model Onboard Version Manifest FUB Updated? up-to-date
DGX - 0000:91:00.0(U261) 88064_Retimer 3.1.0 3.1.0 N/A yes
DGX - 0000:88:00.0(U260) 88064_Retimer 3.1.0 3.1.0 N/A yes
DGX - 0000:4f:00.0(U262) 88064_Retimer 3.1.0 3.1.0 N/A yes
DGX - 0000:48:00.0(U225) 88080_Retimer 3.1.0 3.1.0 N/A yes
DGX - 0000:01:00.0(U1) PEX88096 2.0 2.0 N/A yes
DGX - 0000:81:00.0(U3) PEX88096 2.0 2.0 N/A yes
DGX - 0000:b1:00.0(U4) PEX88096 2.0 2.0 N/A yes
DGX - 0000:41:00.0(U2) PEX88096 2.0 2.0 N/A yes
DGX - 0000:c4:00.0 LR10 92.10.18.00.01 92.10.18.00.01 N/A yes
DGX - 0000:c5:00.0 LR10 92.10.18.00.01 92.10.18.00.01 N/A yes
DGX - 0000:c6:00.0 LR10 92.10.18.00.01 92.10.18.00.01 N/A yes
DGX - 0000:c7:00.0 LR10 92.10.18.00.01 92.10.18.00.01 N/A yes
DGX - 0000:c8:00.0 LR10 92.10.18.00.01 92.10.18.00.01 N/A yes
DGX - 0000:c9:00.0 LR10 92.10.18.00.01 92.10.18.00.01 N/A yes
Mass Storage
==============
Drive Name/Slot Model Number Onboard Version Manifest up-to-date
nvme0n1 Samsung MZWLJ3T8HBLS-00007 EPK9CB5Q EPK9CB5Q yes
nvme1n1 Samsung MZWLJ3T8HBLS-00007 EPK9CB5Q EPK9CB5Q yes
nvme2n1 Samsung MZ1LB1T9HALS-00007 EDA7602Q EDA7602Q yes
nvme3n1 Samsung MZ1LB1T9HALS-00007 EDA7602Q EDA7602Q yes
nvme4n1 Samsung MZWLJ3T8HBLS-00007 EPK9CB5Q EPK9CB5Q yes
nvme5n1 Samsung MZWLJ3T8HBLS-00007 EPK9CB5Q EPK9CB5Q yes
nvme6n1 Samsung MZWLJ3T8HBLS-00007 EPK9CB5Q EPK9CB5Q yes
nvme7n1 Samsung MZWLJ3T8HBLS-00007 EPK9CB5Q EPK9CB5Q yes
nvme8n1 Samsung MZWLJ3T8HBLS-00007 EPK9CB5Q EPK9CB5Q yes
nvme9n1 Samsung MZWLJ3T8HBLS-00007 EPK9CB5Q EPK9CB5Q yes
Video BIOS
============
Bus Model Onboard Version Manifest FUB Updated? up-to-date
0000:07:00.0 A100-SXM4-80GB 92.00.45.00.05 92.00.45.00.05 yes yes
0000:0f:00.0 A100-SXM4-80GB 92.00.45.00.05 92.00.45.00.05 yes yes
0000:47:00.0 A100-SXM4-80GB 92.00.45.00.05 92.00.45.00.05 yes yes
0000:4e:00.0 A100-SXM4-80GB 92.00.45.00.05 92.00.45.00.05 yes yes
0000:87:00.0 A100-SXM4-80GB 92.00.45.00.05 92.00.45.00.05 yes yes
0000:90:00.0 A100-SXM4-80GB 92.00.45.00.05 92.00.45.00.05 yes yes
0000:b7:00.0 A100-SXM4-80GB 92.00.45.00.05 92.00.45.00.05 yes yes
0000:bd:00.0 A100-SXM4-80GB 92.00.45.00.05 92.00.45.00.05 yes yes
Power Supply
==============
ID Vendor Model MFR ID Revision Status Onboard Version Manifest up-to-date
PSU 0: Communication Delta ECD16010092 Delta 03 ok 01.07 01.07 yes
PSU 0: Secondary Delta ECD16010092 Delta 03 ok 01.06 01.06 yes
PSU 0: Primary Delta ECD16010092 Delta 03 ok 01.06 01.06 yes
PSU 1: Communication Delta ECD16010092 Delta 03 ok 01.07 01.07 yes
PSU 1: Secondary Delta ECD16010092 Delta 03 ok 01.06 01.06 yes
PSU 1: Primary Delta ECD16010092 Delta 03 ok 01.06 01.06 yes
PSU 2: Communication Delta ECD16010092 Delta 03 ok 01.07 01.07 yes
PSU 2: Secondary Delta ECD16010092 Delta 03 ok 01.06 01.06 yes
PSU 2: Primary Delta ECD16010092 Delta 03 ok 01.06 01.06 yes
PSU 3: Communication Delta ECD16010092 Delta 03 ok 01.07 01.07 yes
PSU 3: Secondary Delta ECD16010092 Delta 03 ok 01.06 01.06 yes
PSU 3: Primary Delta ECD16010092 Delta 03 ok 01.06 01.06 yes
PSU 4: Communication Delta ECD16010092 Delta 03 ok 01.07 01.07 yes
PSU 4: Secondary Delta ECD16010092 Delta 03 ok 01.06 01.06 yes
PSU 4: Primary Delta ECD16010092 Delta 03 ok 01.06 01.06 yes
PSU 5: Communication Delta ECD16010092 Delta 03 ok 01.07 01.07 yes
PSU 5: Secondary Delta ECD16010092 Delta 03 ok 01.06 01.06 yes
PSU 5: Primary Delta ECD16010092 Delta 03 ok 01.06 01.06 yes
CPLD
============
Onboard Version Manifest up-to-date
MB_CPLD 1.05 1.05 yes
MID_CPLD 1.03 1.03 yes
* CPLD won't be updated by default (`update_fw all`), use `update_fw CPLD` if it's needed
FPGA
========
Onboard version Manifest up-to-date
03.0e 03.0e yes
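Checking a long `show_version` listing by eye for a "no" in the up-to-date column is error-prone, so the check can be scripted. The Python sketch below assumes the output layout shown in the example above (a section title followed by a line of `=` characters, with the up-to-date value as the last field of each row). The function name is hypothetical, and the sample input with an outdated SBIOS image is made up for illustration.

```python
def outdated_components(show_version_output):
    """Return (section, row) pairs whose last column (up-to-date) is "no"."""
    section = None
    rows = []
    lines = show_version_output.splitlines()
    for index, line in enumerate(lines):
        stripped = line.strip()
        # A section title is the line directly above a row of '=' characters.
        if index + 1 < len(lines) and set(lines[index + 1].strip()) == {"="}:
            section = stripped
            continue
        fields = stripped.split()
        if fields and fields[-1].lower() == "no":
            rows.append((section, stripped))
    return rows


# Hypothetical sample: the active SBIOS image is shown as outdated.
sample = """SBIOS
=======
Image Id Onboard Version Manifest up-to-date
0:Active Boot Updatable 1.12 1.13 no
1:Inactive Updatable 1.13 1.13 yes
FPGA
========
Onboard version Manifest up-to-date
03.0e 03.0e yes
"""

for section, row in outdated_components(sample):
    print(f"{section}: {row}")
```

An empty result corresponds to the case where every up-to-date entry is "yes" and no further action is needed.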
Known Issues
Chassis Power State Remains On
Issue
When the system completes a `GracefulShutdown` and is queried by using Redfish, the Chassis Power State remains `On`, but the system power status is correctly reported by using IPMI and in the BMC Web UI.
Explanation
The system can be powered on by using the IPMI command or by using BMC Web UI.
Incorrect Thermal and Voltage Sensor and Fan RPM Values are Displayed
Issue
Some thermal and voltage sensors and FAN RPMs might show incorrect (Zero reading) values when retrieved by using the Redfish APIs.
Explanation
This issue is currently under investigation.
Processor Power Limit and Power Metrics are Not Supported
Issue
Processor power limit and power metrics are not supported.
Explanation
This issue is currently under investigation.
IndicatorLED Status Might Display an Incorrect State
Issue
The IndicatorLED status in Redfish might display an incorrect state for the system and disk resources.
Explanation
This issue is currently under investigation.
Unable to Update BMC Firmware
Issue
To run Firmware Update Container Version 22.5.5, you must use `MB_CEC` version 3.28.
Explanation
If you are using an `MB_CEC` version that is earlier than 3.28, you must first update to firmware update container version 21.03.6 or later.
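The prerequisite can be checked against the `MB_CEC` onboard version reported by `show_version`. The Python sketch below assumes that CEC versions compare as dot-separated integers, so that 3.05 is earlier than 3.28; the release notes do not spell out the versioning scheme, and the function names are hypothetical.

```python
def version_tuple(text):
    """Split a dotted version such as "3.28" into integers: (3, 28)."""
    return tuple(int(part) for part in text.split("."))


def mb_cec_meets_minimum(onboard, minimum="3.28"):
    """Return True if the onboard MB_CEC version is at least the minimum.

    Comparing the parts as integers is an assumption of this sketch.
    """
    return version_tuple(onboard) >= version_tuple(minimum)


print(mb_cec_meets_minimum("3.28"))
print(mb_cec_meets_minimum("3.05"))
```

If the check fails, the intermediate update described in the explanation is needed before running container version 22.5.5.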
Firmware Update Container Unable to Recover PSU with Corrupted Firmware
Issue
The firmware update container cannot recover the PSU firmware when the container cannot determine the hardware revision of the PSU.
Explanation
To recover a PSU that is revision 00-03, use firmware update container version 21.11.4.
Setting Up Active Directory Settings Might Fail with “Invalid Domain Name” Error
Issue
After logging into the BMC dashboard UI and setting up and enabling Active Directory Authentication, an “Invalid Domain Name” error may occur.
Explanation
If you encounter this error, set up the DNS manually as follows:
Log in to the BMC UI dashboard.
Navigate to Settings > Network Settings > DNS Configuration > “Domain Name Server Setting”.
Find “Domain Name Server Setting” and change “Automatic” to “Manual”.
Set the “DNS Server 1” IP address to 8.8.8.8 (the IP address of dns.google).
Click Save and accept the alert to restart the BMC network.
NVSM Incorrectly Reports the Delta PSU Part Number Instead of the Model Numbers
Issue
When issuing `show_version` or `show_fw_manifest`, the number associated with the Delta PSU is the part number instead of the model number.
Explanation
This will be resolved in a future release.
BMC KVM Screen May Show “No Signal” Under Certain Conditions
Issue
When attempting to view the DGX A100 console from the BMC Web UI KVM, the screen might show “No Signal” if you cold reset the BMC and reboot the server. This is due to a rare condition between the BMC and the SBIOS.
For example, the issue might occur after performing the following:
Issue the command to cold reset the BMC.
$ sudo ipmitool mc reset cold
Wait about 30 seconds and issue the command to reboot the system.
$ sudo reboot
Explanation
You can recover the system by issuing a hard reset from the Web UI.
SBIOS “Bootup NumLock State” not Enforced
Issue
When turning NumLock to OFF after setting “Bootup NumLock State” to ON from the SBIOS setup menu, NumLock remains off after rebooting the server. Similarly, when turning NumLock to ON after setting “Bootup NumLock State” to OFF from the SBIOS setup menu, NumLock remains on after rebooting the server.
Explanation
This feature is currently not implemented in the DGX A100 SBIOS.
NVSM Fails to Run the FWUC show_version Command
Issue
NVSM fails to run the command when NGC is accessed without an email address being entered for authentication.
Explanation
To resolve this issue, upgrade to NVSM version 20.09.37 or later.
NVSM Exits With an Error Message When Updating Firmware by Using NVSM
Issue
When the firmware update container uses NVSM to update the firmware, after a few minutes, NVSM exits with the following message:
('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read)).
According to `nvidia-fw.log`, the update process continues in the background until it has completed.
Explanation
The system can be powered on by using the IPMI command or by using BMC Web UI.
After a Firmware Update, Only the Delta Manifest Version Displays
Issue
After a firmware update, the Delta PSU version and the Delta manifest version should be displayed, but only the Delta manifest version appears.
Explanation
This issue will be fixed in a future release.
PSUs Sometimes Display an Error after an Update
Issue
When you run `show_version`, you might see `ERR:retries` in the container output and components might be listed as `not-supported`.
Explanation
To see the correct firmware versions and component status, run `show_version` again.
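In a script, the rerun can be automated by looking for the two markers in the captured output. The Python sketch below is a minimal illustration: `run_show_version` is a hypothetical callable that returns the text of one `show_version` run, and the three-attempt limit is an arbitrary choice of this sketch rather than a documented value.

```python
def needs_rerun(show_version_output):
    """Return True if the output contains either marker described above."""
    return ("ERR:retries" in show_version_output
            or "not-supported" in show_version_output)


def stable_show_version(run_show_version, max_attempts=3):
    """Call run_show_version until the output has no error markers.

    run_show_version is a caller-supplied function (hypothetical here) that
    runs show_version once and returns its text. The last output is returned
    even if it still contains markers after max_attempts runs.
    """
    output = ""
    for _ in range(max_attempts):
        output = run_show_version()
        if not needs_rerun(output):
            break
    return output
```

Passing the command runner in as a function keeps the retry logic separate from how the container is actually invoked on a given system.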
Force FPGA Update Sometimes Fails
Issue
When you force-update FPGA firmware, the update might fail, and the “Auto updates are not allowed on a busy system” message is displayed.
Explanation
This issue will be fixed in a future release.
Updating to Firmware Container Version 22.5.5 Fails for SSDs
Issue
The firmware container version 22.5.5 fails to update SSDs on systems where NVMe multipathing is enabled.
Explanation
The container fails to update a device that has `c<number>` in its name, for example `nvme0c0n1`, because the device is a multipath device.
How to disable multipathing depends on the type of multipathing and the way it was enabled. As an alternative, you can update each SSD individually without disabling multipathing by running the following command:
update_fw SSD --select-ssd <ssd_name>
For example, here is a system with the following devices in the `show_version` output:
Mass Storage
==============
Drive Name/Slot Model Number Onboard Version Manifest up-to-date
nvme0n1 Samsung MZWLJ3T8HBLS-00007 EPK9CB5Q EPK9CB5Q yes
nvme1n1 Samsung MZ1LB1T9HALS-00007 EDA7602Q EDA7602Q yes
nvme2n1 Samsung MZ1LB1T9HALS-00007 EDA7602Q EDA7602Q yes
To update `nvme0n1`, run the following command:
update_fw SSD --select-ssd nvme0n1
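On a system with many drives, the per-SSD commands can be generated from the device names. The Python sketch below is an illustration only: it recognizes multipath path devices by the `c<number>` pattern described above and builds one `update_fw SSD --select-ssd` command for each ordinary `nvme<X>n<Y>` name. The function names and the regular expressions are assumptions of this sketch.

```python
import re

# nvme0c0n1-style names carry c<number> and denote multipath path devices.
PATH_DEVICE = re.compile(r"nvme\d+c\d+n\d+")
# nvme0n1-style names are the ones listed by show_version in the example above.
NAMESPACE_DEVICE = re.compile(r"nvme\d+n\d+")


def is_multipath_path_device(name):
    """Return True for names such as nvme0c0n1."""
    return PATH_DEVICE.fullmatch(name) is not None


def select_ssd_commands(device_names):
    """Build one update command per nvme<X>n<Y> device, skipping other names."""
    return [
        f"update_fw SSD --select-ssd {name}"
        for name in device_names
        if NAMESPACE_DEVICE.fullmatch(name)
    ]


for command in select_ssd_commands(["nvme0n1", "nvme0c0n1", "nvme1n1"]):
    print(command)
```

Each generated command is then run through the container in the same way as the single-device example above.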