DGX Station A100 Firmware Update Container Version 22.02.1
The DGX Station A100 Firmware Update Container version 22.02.1 is available.
Package name:
nvfw-dgxstationa100_22.2.1_220209.tar.gz
Run file name:
nvfw-dgxstationa100_22.2.1_220209.run
Image name:
nvfw-dgxstationa100:22.2.1
ISO image:
DGXSTATIONA100_FWUI-22.2.1-2022-02-15-10-20-25.iso
PXE netboot:
pxeboot-DGXSTATIONA100_FWUI-22.2.1.tgz
Highlights and Changes in this Release
This release is supported with the following DGX OS software:
DGX OS 5.0.2 or later
EL7-21.04 or later
EL8-20.11 or later
The following issues were fixed in this release:
BMC
Fixed the issue where, after you run the bmc mc cold reset command, the BMC generated two or three SEL entries with pre-initialized timestamps.
Fixed the issue where the severity of all system events, including audit SEL entries, was marked as Critical.
Fixed the issue where the USB and Built-in UEFI Boot could not be detected at the same time.
Fixed the issue where the sensor names, the threshold, and the sensor type were not in sync with the sensor list file.
Fixed the issue where, if your BMC web UI session timed out and you were locked out, you needed to log in twice to enter the web UI again.
Fixed the issue where, after you run the sudo ipmitool sel clear command to clear the system event log, no log entry existed to help you verify that the SEL was cleared.
Fixed the issue where, after you entered an incorrect password five or more times, you were not told that you had been locked out of the web UI.
Fixed the issue in the BMC web UI where no Debug Log button was displayed when you clicked Logs & Reports > Debug log.
SBIOS
Fixed the issue where the ECC Leaky Bucket Threshold help string mentions that the range is 0 to 255, but the default value of this option is actually 1000.
Contents of the DGX Station A100 System Firmware Update Container
This container includes the firmware binaries and update utilities for the firmware in the following table:
Component | Version |
---|---|
BMC | 01.24.00 |
SBIOS | 10.16 |
Retimer | 1.0.125 |
VBIOS | |
M.2 Micron 7300 MTFDHBG1T9TDF SSD | 95420260 |
U.2 KIOXIA CM6 SSD | 0105 |
FPGA | 2.71 |
Storage Backplane | 0.3 |
NVFlash | 5.714.0 |
Important
When you update the Retimer, Backplane, and FPGA components with the firmware update container tarball, you must add the --network host argument. If you do not add this argument, the update will fail.
Here is an example of the command for a successful update:
$ sudo docker run --rm --network host -ti --privileged -v /:/hostfs nvfw-dgxstationa100:22.2.1 update_fw Backplane -f
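As a rough illustration, the three component updates named above could be scripted as a dry run that only prints each command. The helper name build_update_cmd is hypothetical, and Retimer and FPGA as update_fw component names are assumptions modeled on the Backplane example; confirm them against the container's usage output before running anything.

```shell
#!/bin/sh
# Dry-run sketch: print, but do not execute, one update command per component.
# IMAGE is the image name listed for this release.
IMAGE="nvfw-dgxstationa100:22.2.1"

# build_update_cmd is a hypothetical helper, not part of the container.
# Every generated command carries the required --network host argument.
build_update_cmd() {
    printf 'sudo docker run --rm --network host -ti --privileged -v /:/hostfs %s update_fw %s -f\n' \
        "$IMAGE" "$1"
}

# Only "Backplane" appears in the example above; "Retimer" and "FPGA" are assumed names.
for component in Retimer Backplane FPGA; do
    build_update_cmd "$component"
done
```

Printing the commands first makes it easy to confirm that none of them is missing the --network host argument.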
Updating the Firmware to Version 22.02.1
This section explains how to update the firmware on the system by using the firmware update container. It includes instructions to complete a transitional update for systems that require the update.
Stop all unnecessary system activities.
Caution
While an update is in progress, do not add additional loads on the system, such as Kubernetes jobs or other user jobs or diagnostics. A high GPU workload can disrupt the firmware update process and result in an unusable component.
The commands use the .run file, but you can also use any method described in Using the DGX Station A100 FW Update Utility.
Determine whether updates are needed by checking the installed versions.
$ sudo ./nvfw-dgxstationa100_22.2.1_220209.run show_version
If there is a no in any up-to-date column for updatable firmware, proceed to the next step.
If all up-to-date column entries display a yes, no updates are required and no additional action is necessary.
Stop the gdm3 service.
$ sudo systemctl stop gdm3
Complete the update for all firmware that is supported by the container.
$ sudo ./nvfw-dgxstationa100_22.2.1_220209.run update_fw all
Depending on the firmware that is updated, you might be prompted to reboot the system or power cycle the system:
If you are prompted to reboot, issue the following command:
$ sudo reboot
If you are prompted to power cycle, issue the following command:
$ sudo ipmitool chassis power cycle
You can verify the update by issuing the following command:
$ sudo ./nvfw-dgxstationa100_22.2.1_220209.run show_version
Here is an example output for a DGX Station A100 40GB system:
BMC DGX Station A100
======================
Image Id Status Location Onboard Version Manifest up-to-date
N/A Online Local 01.24.00 01.24.00 yes
FPGA
========
Onboard version Manifest up-to-date
2.71 2.71 yes
Storage Backplane
==================
Bus Onboard Version Manifest up-to-date
N/A 0.3 0.3 yes
Retimer Loc.
=============
PCIe Slot# Onboard Version Manifest up-to-date
Retimer@slot4 1.0.125 1.0.125 yes
Retimer@slot5 1.0.125 1.0.125 yes
Retimer@slot6 1.0.125 1.0.125 yes
Retimer@slot7 1.0.125 1.0.125 yes
SBIOS
=======
Image Id Onboard Version Manifest up-to-date
N/A L10.16 L10.16 yes
Video BIOS
============
Bus Model Onboard Version Manifest up-to-date
0000:01:00.0 A100-SXM4-40GB 92.00.48.00.01 92.00.48.00.01 yes
0000:47:00.0 A100-SXM4-40GB 92.00.48.00.01 92.00.48.00.01 yes
0000:81:00.0 A100-SXM4-40GB 92.00.48.00.01 92.00.48.00.01 yes
0000:c2:00.0 A100-SXM4-40GB 92.00.48.00.01 92.00.48.00.01 yes
Mass Storage
==============
Drive Name/Slot Model Number Onboard Version Manifest up-to-date
nvme0n1 Micron 7300_MTFDHBG1T9TDF 95420260 95420260 yes
nvme1n1 Kioxia KCM6DRUL7T68 0105 0105 yes
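The check of the up-to-date column can be sketched as two small filters over the show_version output. Both function names are hypothetical, and both assume the layout of the example output above: a title line, an underline of = characters, a header row, then rows whose last field is yes or no. The real output may differ in spacing or content, so treat this as a starting point.

```shell
#!/bin/sh
# Both helpers read show_version output on stdin (layout assumed, see above).

# Exit status 0 if at least one row reports "no" in the up-to-date column.
needs_update() {
    awk '$NF == "no" { found = 1 } END { exit !found }'
}

# Print "<section>: <first column> -> <status>" for every component row.
summarize_versions() {
    awk '
        # An underline of "=" marks the previous line as a section title.
        /^=+[ \t]*$/ { section = prev; next }
        $NF == "yes" || $NF == "no" { print section ": " $1 " -> " $NF }
        { prev = $0 }
    '
}
```

For example, the output could be saved once with sudo ./nvfw-dgxstationa100_22.2.1_220209.run show_version > versions.txt, and then needs_update < versions.txt decides whether to proceed while summarize_versions < versions.txt lists the status of each row.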
DGX Station A100 Firmware Known Issues
This section provides a list of the known issues in version 22.02.1.
VBIOS update fails
Issue
VBIOS update fails on Red Hat Enterprise Linux 9 because system services or processes are holding the resource to be upgraded.
Explanation
The following services (system processes) must be stopped manually for the firmware update to start:
process nvidia-persiste(pid 5372)
process nv-hostengine(pid 2723)
process cache_mgr_event(pid 5276)
process cache_mgr_main(pid 5278)
process dcgm_ipc(pid 5279)
If Xorg is holding the resources, stop the display manager. To find the name of the display manager, run:
$ cat /etc/X11/default-display-manager
Then stop it, replacing (display manager) with that name:
$ sudo systemctl stop (display manager)
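To review which processes are holding the resource, the process lines can be reduced to name and PID pairs. This is a minimal sketch that assumes the messages follow the format listed above (process NAME(pid N)); the function name is hypothetical, and the sketch only lists processes, because the matching service names are not given here.

```shell
#!/bin/sh
# Read update output on stdin and print "name pid" for each line shaped like:
#   process nv-hostengine(pid 2723)
# Lines that do not match this assumed format are ignored.
list_blocking_processes() {
    sed -n 's/^process \([^(]*\)(pid \([0-9][0-9]*\)).*$/\1 \2/p'
}
```

The resulting list can then be checked by hand before stopping the corresponding services.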
Cannot Use a Forward Slash When Creating LDAP Group Settings in the BMC
Issue
On the BMC dashboard, when you try to create group roles, you cannot include a forward slash (/) in the Group Name or Group Domain fields. For example, “Bay/Ships” will not work.
Explanation
NVIDIA is investigating the issue, and there is no workaround at this time.
Boot Options Do Not Persist After a BIOS Update
Issue
After you update the BIOS from version 9.28c to L10.16, the boot options that you previously set do not persist.
Explanation
NVIDIA is investigating the issue, and there is no workaround at this time.
In Dockerless Mode, the Onboard Version Displays an Unknown Version Status
Issue
In RHEL7-21.07, in dockerless mode, when you run the show_version command, the onboard version displays an unknown version status.
Explanation
nvipmitool is used to query the FPGA, Backplane, PSU, and Retimer firmware versions.
The tool that is bundled in the DGX Station A100 firmware update container works only on Ubuntu and not on RHEL. As a result, in dockerless mode, when the DGX system tries to locate nvipmitool, the unknown version string is displayed.
Workaround
Important
nvipmitool is now bundled only in the rhel7-r470-cuda-11-4 package and is not installed by default on RHEL7-21.10.
To use the tool in RHEL7-21.10 without upgrading to the R470 driver, run the following command:
yum install https://international.download.nvidia.com/dgx/repos/rhel7-r470-cuda11-4/nvipmitool-1.0.60_rhel7_release-1.x86_64.rpm
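Before running show_version in dockerless mode, it can help to confirm that the tool is present. The sketch below assumes that nvipmitool is expected on the PATH under that exact name; have_tool is a hypothetical helper.

```shell
#!/bin/sh
# Return 0 if the named command can be found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool nvipmitool; then
    echo "nvipmitool found"
else
    echo "nvipmitool not found; show_version may report an unknown version"
fi
```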
Cannot Update the FPGA in Dockerless Mode
Issue
When you try to update the FPGA on RHEL7, the update fails.
Workaround
There is no workaround at this time.
Cannot Force a Backplane Update in Dockerless Mode
Issue
When you try to force a Backplane update on RHEL7, the update fails.
Workaround
There is no workaround at this time.
Cannot Force a Retimer Update in Dockerless Mode
Issue
When you try to force a Retimer update on RHEL7, the update fails.
Workaround
There is no workaround at this time.
New FPGA Version Number Does Not Display After an Update
Issue
After you update the FPGA, and the DGX system reboots, the previous FPGA version is still displayed.
Explanation
For the FPGA update to take effect, a DC power cycle option is required, but currently only the Reboot after update option exists.
Workaround
Complete one of the following options:
After you complete the FPGA firmware version update, complete the following steps:
Click BMC WebUI > Power Control.
Power off the system.
Click BMC WebUI > Power Control.
Power on the system.
In a Command Prompt window, run the following command:
$ sudo ipmitool -I lanplus -H ${BMC_IP} -U ${BMC_USER} -P ${BMC_PW} chassis power cycle
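The remote power cycle depends on three shell variables, so a small guard can catch an unset value before the command is issued. This is a sketch only: the function names are hypothetical, and it prints the command with the password masked instead of running it.

```shell
#!/bin/sh
# Fail with a message if any of the BMC connection variables is empty or unset.
check_bmc_env() {
    for name in BMC_IP BMC_USER BMC_PW; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set" >&2
            return 1
        fi
    done
}

# Print the power cycle command with the password masked; nothing is executed.
power_cycle_dry_run() {
    check_bmc_env || return 1
    printf 'sudo ipmitool -I lanplus -H %s -U %s -P %s chassis power cycle\n' \
        "$BMC_IP" "$BMC_USER" '********'
}
```

Once the printed command looks right, the original ipmitool command above can be run with the same variables.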
Help Output Does Not Display Information for the Firmware Update Container Usage
Issue
In the interactive menu, when you click Show update container usage, only the information for the previous few components is displayed instead of the overall firmware update container usage information.
Workaround
There is no workaround at this time.
Incorrect Prompt Message When Upgrading on EL8-21.08
Issue
On EL8-21.08, if the Xorg processes are holding onto the resource that will be upgraded, after you issue cat /etc/X11/default-display-manager, the following incorrect message is displayed:
No such file or directory
Workaround
For a successful upgrade on EL8-21.08, before you run the update_fw all or update_fw VBIOS -f commands, run the following command:
$ sudo systemctl stop gdm
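A script that has to run on several OS releases could pick the display manager service with a fallback. The sketch below rests on two assumptions: that /etc/X11/default-display-manager, when present, holds the path of the display manager binary (for example /usr/sbin/gdm3), and that the service shares the binary's name. When the file is missing, as on EL8-21.08, it falls back to gdm. The function name is hypothetical.

```shell
#!/bin/sh
# Print the display manager service name to stop.
# Optional argument: path of the file to read (defaults to the standard location).
display_manager_service() {
    dm_file="${1:-/etc/X11/default-display-manager}"
    dm_path=""
    if [ -r "$dm_file" ]; then
        # Read only the first line; tolerate a missing trailing newline.
        IFS= read -r dm_path < "$dm_file" || true
    fi
    if [ -n "$dm_path" ]; then
        basename "$dm_path"
    else
        # File missing or empty: use the service named in the workaround above.
        echo "gdm"
    fi
}
```

A possible use is sudo systemctl stop "$(display_manager_service)" before starting the update.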
After Unloading the NVIDIA Driver, ?? is Displayed as the VBIOS Onboard Version
Issue
On RHEL, after you unload the NVIDIA driver and run the show_version command, ?? is displayed as the VBIOS onboard version.
Workaround
There is no workaround at this time.
When Updating the Backplane Firmware, a Corrupted Screen is Displayed
Issue
When you use the firmware update ISO to update the Backplane firmware with the --force argument with SBIOS L9.28C, a corrupted screen appears.
Workaround
Update your SBIOS firmware to L10.16 and then update the Backplane firmware.
Unable to Install SSD Firmware After Multiple Upgrade and Downgrade Attempts
Issue
After multiple attempts to upgrade or downgrade the firmware, the SSD nvme1n1 KCM6DRUL7T68 firmware installation fails. Here is an example of an error message:
Failed to install SSD nvme1n1 KCM6DRUL7T68 0105
Workaround
Run the firmware update container again.
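Rerunning the update can be wrapped in a small retry helper. This is a generic sketch: the retry function is hypothetical, and the attempt limit is an assumption, since the workaround does not say how many reruns may be needed.

```shell
#!/bin/sh
# retry MAX COMMAND [ARGS...]
# Run COMMAND until it succeeds or MAX attempts have been made.
retry() {
    max="$1"
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}
```

For example, retry 3 sudo ./nvfw-dgxstationa100_22.2.1_220209.run update_fw all would rerun the update up to three times and stop at the first success.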
When Updating BMC Firmware, the BMC WebUI/KVM Connection Fails
Issue
When you update the BMC firmware from the BMC KVM by using one of the following options, the BMC shuts down its web service:
Firmware update container
Firmware user interface (UI)
Additional Information
The shutdown causes the BMC web UI/KVM to disconnect, but this connection is established again after the update is complete.
Workaround
Wait about 23 minutes and log in to the BMC web UI again.
Start the BMC KVM and verify that the BMC firmware has successfully updated.