Version 19.04.1#
The DGX-1 Firmware Update container version 19.04.1 is available.
Package name:
nvfw-dgx1_19.04.1.tar.gz
Image name:
nvfw-dgx1:19.04.1
Run file name:
nvfw-dgx1_19.04.1.run
Obtain the files from the NVIDIA Enterprise Support announcement DGX-1 Firmware Update Container Version 19.04.1 (requires login).
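As a rough sketch of what typically follows the download, assuming the package is a gzip-compressed Docker image archive saved in the current directory and that Docker is installed, the image can be loaded and then checked against the image name listed above. The `load_firmware_image` function and the `DRY_RUN` switch are illustrative names only; the DGX-1 User Guide remains the reference for the supported procedure.

```shell
# Names taken from the release notes above.
PACKAGE="nvfw-dgx1_19.04.1.tar.gz"
IMAGE="nvfw-dgx1:19.04.1"

# Hypothetical helper: load the package into Docker, then list the image.
# With DRY_RUN=1 the commands are only printed, not executed.
load_firmware_image() {
    for cmd in "docker load -i $PACKAGE" "docker images $IMAGE"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$cmd"
        else
            # Assumption: Docker requires root privileges on this system.
            sudo $cmd
        fi
    done
}
```

Calling `DRY_RUN=1 load_firmware_image` prints the two commands for review; calling `load_firmware_image` on its own runs them.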
Contents of the DGX-1 Firmware Update Container#
This container includes the firmware binaries and update utilities for the firmware listed in the following table.
Component | Version | Key Changes |
---|---|---|
BMC | 3.30.30 | Note: The BMC update process can take about 50 minutes to complete if updating from a version earlier than 3.27.30. |
SBIOS | 3A08 | |
SSD (Samsung SM863A) | GXM1103Q | Added to the container. |
VBIOS (DGX-1 with V100, 16 GB) | 88.00.18.00.01 | No change from previous release. |
VBIOS (DGX-1 with V100, 32 GB) | 88.00.80.00.04 | Supports all HBM memory sources. |
VBIOS (DGX-1 with P100) | 86.00.41.00.05 | No change from previous release. |
PSU | 00.03.07 | Added to the container. |
Changes in the Container in this Release#
Note
If updating the BMC from any version earlier than 3.27.30, the update can take from 30 to 50 minutes to complete.
Added integration with NVSM (requires DGX OS Server 4.0.5 or later).
This allows firmware to be updated using a .run file that simplifies the steps needed. See the DGX-1 User Guide for instructions on obtaining and using the .run file.
Changed the container naming convention and now provide one file for all DGX-1 configurations.
When updates to the BMC or PSU are initiated:
The BMC is cold reset to put it in a known good state before the update.
Additional logs are gathered for troubleshooting purposes and made available in /var/log/comp_fw_log.txt. The logs are gathered before the update and again when the update completes or fails.
To prevent NVSM services from interfering with BMC and PSU updates, the container stops the following services before applying the update:
nvsm-apis-gpumonitor
nvsm-apis-plugin-storage
nvsm-apis-selwatcher
nvsm-apis-plugin-memory
nvsm-apis-plugin-environment
nvsm-sys-dshm
nvsm-env-dshm
nvsm-storage-dshm
The system health monitor is not available until the firmware update completes.
For the PSU update, the container implements a protective check which requires the system to be fully redundant (all four supplies are installed and in a healthy state) in order for the update to occur.
If you are using only three of the four PSUs, the full power redundancy requirement can be overridden with the Docker run environment variable (DGX_MAX_PSU) as follows.
docker run -e DGX_MAX_PSU=3 --privileged -ti -v /:/hostfs <container_name> update_fw
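The service list and log path described above lend themselves to a quick check before and after an update. The following is a minimal sketch, assuming the NVSM services are managed by systemd under unit names matching the list above. The function names are hypothetical and are not part of the container.

```shell
# Services the container stops before a BMC or PSU update (from the list above).
NVSM_SERVICES="nvsm-apis-gpumonitor nvsm-apis-plugin-storage nvsm-apis-selwatcher
nvsm-apis-plugin-memory nvsm-apis-plugin-environment nvsm-sys-dshm
nvsm-env-dshm nvsm-storage-dshm"

# Log file written by the container (path from the release notes).
FW_LOG="${FW_LOG:-/var/log/comp_fw_log.txt}"

# Hypothetical helper: print the state of each service.
# Assumption: systemd unit names match the service names.
report_nvsm_services() {
    for svc in $NVSM_SERVICES; do
        # Falls back to "unknown" when systemctl is missing or prints nothing.
        state="$(systemctl is-active "$svc" 2>/dev/null)" || state="${state:-unknown}"
        echo "$svc: $state"
    done
}

# Hypothetical helper: show the last lines of the firmware log, if present.
# The optional argument is the number of lines (default 20).
show_fw_log_tail() {
    if [ -r "$FW_LOG" ]; then
        tail -n "${1:-20}" "$FW_LOG"
    else
        echo "log not found: $FW_LOG"
    fi
}
```

Running `report_nvsm_services` after the update gives a rough indication of whether health monitoring is available again, and `show_fw_log_tail 50` shows the most recent log entries gathered by the container.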
Known Issues#
VBIOS Update Status Only Shows One GPU#
Issue#
On a DGX-1 with Tesla P100, when updating the VBIOS for all the GPUs in the system, the “Firmware Update in Progress” output banner shows only the last GPU to be updated instead of each or all GPUs.
Explanation#
The firmware update container does not report which GPU VBIOS is flashed as it occurs, but shows the last GPU to indicate that all GPUs are being updated. In the background, all the GPUs are sequentially flashed with the new VBIOS until the last GPU completes the update successfully.
Recovery for PSU Update Failure#
Issue#
On rare occasions, the recovery mechanism in the container may not be able to recover from a failure in the PSU update process.
Action to Take#
If the container does not recover, contact NVIDIA Enterprise Support for assistance.
Update May Stop with an Unexpected Error#
Issue#
When updating the BMC, the update may fail with the following error.
TypeError: __init__() takes exactly 4 arguments
Recommendation#
Attempt to run the container again for the component that failed. If the component update continues to fail, contact NVIDIA Enterprise Support.
Unexpected Error May Occur Upon Exiting the Container#
Issue#
After successfully completing an update and then exiting the container, the following error message may appear.
Method not supported in this mode
Details and Recommendation#
This can occur if the CPU is under a high load while the container runs. The update is successful and no further action is needed.
To avoid this error, stop all GPU- and CPU-intensive applications before running the container. You can also use the show_version option when running the container to confirm that the firmware is updated to the correct version.
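A minimal sketch of such a version check, assuming `show_version` is passed to the container in the same position as `update_fw` in the `docker run` command shown earlier in these notes, and that the image has been loaded under the name `nvfw-dgx1:19.04.1`. The helper name `fw_container_cmd` is hypothetical; it only prints the command so that it can be reviewed before being run.

```shell
IMAGE="nvfw-dgx1:19.04.1"   # image name from these release notes

# Hypothetical helper: print the docker run command for a container action.
# Assumption: the action (for example show_version) follows the image name,
# as update_fw does in the command shown earlier.
fw_container_cmd() {
    action="$1"
    echo "docker run --privileged -ti -v /:/hostfs $IMAGE $action"
}

fw_container_cmd show_version
```

To run the command rather than print it, the output can be executed directly, for example with `sudo $(fw_container_cmd show_version)`.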