Previous Releases

Version 19.10.7

The DGX-1 Firmware Update container version 19.10.7 is available.

  • Package name: nvfw-dgx1_19.10.7.tar.gz
  • Image name: nvfw-dgx1:19.10.7
  • Run file name: nvfw-dgx1_19.10.7.run

Obtain the files from the NVIDIA Enterprise Support announcement System Firmware Upgrade 19.10.7 for all NVIDIA DGX-1 Server (requires login).

Contents of the DGX-1 Firmware Update Container

This container includes the firmware binaries and update utilities for the firmware listed in the following table.

Component Version Key Changes
BMC 3.36.30
  • Added HTML5 support for the Remote Console
  • Removed Java-based Remote Console
Note: Be sure to clear your browser cache to see the new Remote Console.
SBIOS S2S_3A10
  • Incorporated Intel microcode to mitigate new side-channel attacks (Zombieload v1).
SSD (Samsung SM863A) GXM1103Q No change from previous release.
VBIOS (DGX-1 with V100, 16 GB) 88.00.18.00.01 No change from previous release.
VBIOS (DGX-1 with V100, 32 GB) 88.00.80.00.04 No change from previous release.
VBIOS (DGX-1 with P100) 86.00.41.00.05 No change from previous release.
PSU 00.03.07 No change from previous release.

Changes in the Container in this Release

  • Fixed unexpected error appearing upon exiting the container after successful PSU update.
  • Fixed BMC update failing with an unexpected error.
  • Fixed show_version command reporting "???" for the VBIOS version.
  • Fixed firmware update errors on EL7-19.07.
  • Fixed update output only showing the last VBIOS updated, instead of listing all the VBIOSes updated.

Special Notes

Note: If updating the BMC from any version earlier than 3.27.30, the update can take from 30 to 50 minutes to complete.
  • When an update to the BMC or PSU is initiated:
    • The BMC is first cold reset to put it in a known good state before the update.
    • Additional logs are then gathered for troubleshooting and written to /var/log/comp_fw_log.txt.

      The logs are gathered before updating and upon completion of the update or in the event of an update failure.

  • To prevent NVSM services from interfering with BMC and PSU updates, the container stops the following services before applying the update:
    • nvsm-apis-gpumonitor
    • nvsm-apis-plugin-storage
    • nvsm-apis-selwatcher
    • nvsm-apis-plugin-memory
    • nvsm-apis-plugin-environment
    • nvsm-sys-dshm
    • nvsm-env-dshm
    • nvsm-storage-dshm

    The system health monitor is not available until the firmware update completes.
  • For the PSU update, the container performs a protective check that requires the system to be fully redundant (all four power supplies installed and healthy) before the update can proceed.

    If you are using only three of the four PSUs, you can override the full-redundancy requirement by setting the DGX_MAX_PSU environment variable in the docker run command, as follows.

    docker run -e DGX_MAX_PSU=3 --privileged -ti -v /:/hostfs <container_name> update_fw
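Because the container stops several NVSM services during BMC and PSU updates, it can be useful to confirm their state once the update has finished. The following is a minimal sketch, not part of the container. It assumes each name in the list above is a systemd unit, and that nvsm-sys-dshm and nvsm-env-dshm are separate units. The function name report_nvsm_services is made up for this example.

```shell
#!/bin/sh
# NVSM services that these notes list as stopped during BMC and PSU updates.
# Assumption: nvsm-sys-dshm and nvsm-env-dshm are two separate units.
NVSM_SERVICES="nvsm-apis-gpumonitor
nvsm-apis-plugin-storage
nvsm-apis-selwatcher
nvsm-apis-plugin-memory
nvsm-apis-plugin-environment
nvsm-sys-dshm
nvsm-env-dshm
nvsm-storage-dshm"

# Print "<service>: <state>" for each unit. systemctl is-active prints a
# state such as active, inactive, or failed; "unknown" is printed when
# systemctl returns no state at all.
report_nvsm_services() {
  for svc in $NVSM_SERVICES; do
    state="$(systemctl is-active "$svc" 2>/dev/null || true)"
    echo "${svc}: ${state:-unknown}"
  done
}

# Only query systemd when systemctl is present on this host.
if command -v systemctl >/dev/null 2>&1; then
  report_nvsm_services
fi
```

Any service that still reports a state other than active after the update completes is worth checking before relying on system health monitoring again.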

Version 19.04.1

The DGX-1 Firmware Update container version 19.04.1 is available.

  • Package name: nvfw-dgx1_19.04.1.tar.gz
  • Image name: nvfw-dgx1:19.04.1
  • Run file name: nvfw-dgx1_19.04.1.run

Obtain the files from the NVIDIA Enterprise Support announcement DGX-1 Firmware Update Container Version 19.04.1 (requires login).

Contents of the DGX-1 Firmware Update Container

This container includes the firmware binaries and update utilities for the firmware listed in the following table.

Component Version Key Changes
BMC 3.30.30 Note: The BMC update process can take about 50 minutes to complete if updating from a version earlier than 3.27.30.
  • Added support for sending SNMPv3 Traps.
  • Added GPU Page Retirement tracking.
  • Added ability to configure KVM and VMedia via ipmitool.
  • Added ability to enable/disable SNMPv3 via ipmitool.
  • Implemented IPMI command for OEM debugging.
  • Fixed BMC/SBIOS providing incorrect mapping for memory DIMM errors.
  • Fixed PSU firmware update disruption by implementing mutual exclusion logic in the BMC.
SBIOS 3A08
  • Fixed BMC/SBIOS providing incorrect mapping for memory DIMM errors.
  • USB ports default to USB 3.0.
SSD (Samsung SM863A) GXM1103Q Added to the container.
VBIOS (DGX-1 with V100, 16 GB) 88.00.18.00.01 No change from previous release.
VBIOS (DGX-1 with V100, 32 GB) 88.00.80.00.04 Supports all HBM memory sources.
VBIOS (DGX-1 with P100) 86.00.41.00.05 No change from previous release.
PSU 00.03.07 Added to the container.

Changes in the Container in this Release

Note: If updating the BMC from any version earlier than 3.27.30, the update can take from 30 to 50 minutes to complete.
  • Added integration with NVSM (requires DGX OS Server 4.0.5 or later).

    This allows firmware to be updated using a .run file that simplifies the steps needed. See the DGX-1 User Guide for instructions on obtaining and using the .run file.

  • Changed the container naming convention; a single file now covers all DGX-1 configurations.
  • When an update to the BMC or PSU is initiated:
    • The BMC is first cold reset to put it in a known good state before the update.
    • Additional logs are then gathered for troubleshooting and written to /var/log/comp_fw_log.txt.

      The logs are gathered before updating and upon completion of the update or in the event of an update failure.

  • To prevent NVSM services from interfering with BMC and PSU updates, the container stops the following services before applying the update:
    • nvsm-apis-gpumonitor
    • nvsm-apis-plugin-storage
    • nvsm-apis-selwatcher
    • nvsm-apis-plugin-memory
    • nvsm-apis-plugin-environment
    • nvsm-sys-dshm
    • nvsm-env-dshm
    • nvsm-storage-dshm

    The system health monitor is not available until the firmware update completes.

  • For the PSU update, the container performs a protective check that requires the system to be fully redundant (all four power supplies installed and healthy) before the update can proceed.

    If you are using only three of the four PSUs, you can override the full-redundancy requirement by setting the DGX_MAX_PSU environment variable in the docker run command, as follows.

    docker run -e DGX_MAX_PSU=3 --privileged -ti -v /:/hostfs <container_name> update_fw
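The troubleshooting log mentioned above can be scanned after a BMC or PSU update. This is a rough sketch only: the path comes from these notes, but the log format is not described here, so the search keywords are a guess, and the helper name scan_fw_log is invented for illustration.

```shell
#!/bin/sh
# Log path given in these release notes; override with FW_LOG if needed.
FW_LOG="${FW_LOG:-/var/log/comp_fw_log.txt}"

# Print lines that mention an error or failure, prefixed with line numbers.
# Assumption: the keywords "error" and "fail" are illustrative guesses,
# since the notes do not document the log format.
scan_fw_log() {
  log="$1"
  if [ ! -r "$log" ]; then
    echo "log not readable: $log" >&2
    return 1
  fi
  grep -n -i -E 'error|fail' "$log" || echo "no matching lines in $log"
}

# Do not abort the calling script if the log is missing on this host.
scan_fw_log "$FW_LOG" || true
```

Because logs are gathered both before and after the update, matches may come from either phase; read the surrounding lines before drawing conclusions.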

Known Issues

VBIOS Update Status Only Shows One GPU

Issue

On a DGX-1 with Tesla P100, when updating the VBIOS for all the GPUs in the system, the "Firmware Update in Progress" output banner shows only the last GPU to be updated instead of each or all GPUs.

Explanation

The firmware update container does not report which GPU VBIOS is flashed as it occurs, but shows the last GPU to indicate that all GPUs are being updated. In the background, all the GPUs are sequentially flashed with the new VBIOS until the last GPU completes the update successfully.

Recovery for PSU Update Failure

Issue

On rare occasions, the recovery mechanism in the container may not be able to recover from a failure in the PSU update process.

Action to Take

If the container does not recover, contact NVIDIA Enterprise Support for assistance.

Update May Stop with an Unexpected Error

Issue

When updating the BMC, the update may fail with the following error.

TypeError: __init__() takes exactly 4 arguments

Recommendation

Attempt to run the container again for the component that failed. If the component update continues to fail, contact NVIDIA Enterprise Support.

Unexpected Error May Occur Upon Exiting the Container

Issue

After successfully completing an update and then exiting the container, the following error message may appear.

Method not supported in this mode

Details and Recommendation

This can occur if the CPU is under a high load while the container runs. The update is successful and no further action is needed.

To avoid this error, stop all GPU-intensive and CPU-intensive applications before running the container. You can also use the show_version option when running the container to confirm that the firmware is updated to the correct version.
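The show_version check can be sketched as follows. This assumes show_version replaces update_fw as the container command, with the same docker options as the update command shown earlier in these notes; that invocation is an assumption, not something the notes state. The helper names run_or_print and fw_show_version are hypothetical. The image name is the one listed for this release.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default here); run it otherwise.
run_or_print() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 is the image name, for example nvfw-dgx1:19.04.1.
# Assumption: show_version is passed in place of update_fw, with the
# docker options unchanged.
fw_show_version() {
  run_or_print docker run --privileged -ti -v /:/hostfs "$1" show_version
}

fw_show_version "nvfw-dgx1:19.04.1"
```

With the default setting the script only prints the command so that it can be reviewed first; set DRY_RUN=0 to execute it.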

Version 20181107

The DGX-1 with Tesla V100 Firmware Update container version 20181107 is available.

  • Package name: nvidia-dgx-fw-0102-20181107.tar.gz
  • Image name: nvidia-dgx-fw-0102-20181107

Contents of the DGX-1 System Firmware Container

This container includes the firmware binaries and update utilities for the firmware listed in the following table.

Component Version
SBIOS S2W_3A06
VBIOS 88.00.18.00.01 (16 GB), 88.00.43.00.04 (32 GB)

Changes in this Release

  • Container
    • Removed BMC and PSU firmware updates due to potential issues with the update process.
  • System BIOS
    • Updated to version S2W_3A06

      Updated microcode to address Spectre vulnerability.

Known Issues

Recovery for PSU Update Failure

Issue

On rare occasions, the recovery mechanism in the container may not be able to recover from a failure in the PSU update process.

Action to Take

If the container does not recover, contact NVIDIA Enterprise Support for assistance.

SNMPv3 Traps are not Available for the BMC

Issue

The BMC is not capable of sending SNMPv3 traps at this time.

Workaround

In the BMC dashboard, go to SNMP Community Settings and enable traps for SNMPv2.