Version 19.04.1

The DGX-1 Firmware Update container version 19.04.1 is available.

  • Package name: nvfw-dgx1_19.04.1.tar.gz

  • Image name: nvfw-dgx1:19.04.1

  • Run file name: nvfw-dgx1_19.04.1.run

Obtain the files from the NVIDIA Enterprise Support announcement "DGX-1 Firmware Update Container Version 19.04.1" (requires login).
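The three names above share one pattern built from the release version. As a minimal sketch (the function names are hypothetical and not part of the container), the names can be derived from a version string like this:

```shell
#!/bin/sh
# Derive the artifact names for a release from its version string.
# The patterns mirror the names listed above; the helper names are hypothetical.
fw_package_name()  { printf 'nvfw-dgx1_%s.tar.gz\n' "$1"; }
fw_image_name()    { printf 'nvfw-dgx1:%s\n' "$1"; }
fw_run_file_name() { printf 'nvfw-dgx1_%s.run\n' "$1"; }

version="19.04.1"
echo "package:  $(fw_package_name "$version")"
echo "image:    $(fw_image_name "$version")"
echo "run file: $(fw_run_file_name "$version")"
```

Assuming the package is a standard Docker image archive, it would typically be loaded with docker load -i followed by the package name before the image can be run.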

Contents of the DGX-1 Firmware Update Container

This container includes the firmware binaries and update utilities for the firmware listed in the following table.

Component: BMC
Version: 3.30.30
Key Changes:

  Note: The BMC update process can take about 50 minutes to complete if updating from a version earlier than 3.27.30.

  • Added support for sending SNMPv3 Traps.

  • Added GPU Page Retirement tracking.

  • Added ability to configure KVM and VMedia via ipmitool.

  • Added ability to enable/disable SNMPv3 via ipmitool.

  • Implemented IPMI command for OEM debugging.

  • Fixed BMC/SBIOS providing incorrect mapping for memory DIMM errors.

  • Fixed PSU firmware update disruption by implementing mutual exclusion logic in the BMC.

Component: SBIOS
Version: 3A08
Key Changes:

  • Fixed BMC/SBIOS providing incorrect mapping for memory DIMM errors.

  • USB ports default to USB 3.0.

Component: SSD (Samsung SM863A)
Version: GXM1103Q
Key Changes: Added to the container.

Component: VBIOS (DGX-1 with V100, 16 GB)
Version: 88.00.18.00.01
Key Changes: No change from previous release.

Component: VBIOS (DGX-1 with V100, 32 GB)
Version: 88.00.80.00.04
Key Changes: Supports all HBM memory sources.

Component: VBIOS (DGX-1 with P100)
Version: 86.00.41.00.05
Key Changes: No change from previous release.

Component: PSU
Version: 00.03.07
Key Changes: Added to the container.

Changes in the Container in this Release

Note

If updating the BMC from any version earlier than 3.27.30, the update can take from 30 to 50 minutes to complete.
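To plan for the longer update window, the currently installed BMC version can be compared against 3.27.30. The following is a minimal sketch that assumes GNU sort with the -V (version sort) option is available and that the installed version is supplied by the operator; the helper names and the "extended"/"standard" labels are hypothetical.

```shell
#!/bin/sh
# Succeed (exit 0) when version $1 is strictly earlier than version $2.
# Relies on GNU sort -V (version sort), assumed to be available on the host.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical helper: classify the expected BMC update duration.
bmc_update_estimate() {
    if version_lt "$1" "3.27.30"; then
        echo "extended"   # the 30 to 50 minute case described in the note
    else
        echo "standard"   # 3.27.30 or later; the note does not apply
    fi
}

bmc_update_estimate "3.20.30"
```

The comparison is done on the whole dotted string, so a version such as 3.9.30 is correctly treated as earlier than 3.27.30, which a plain string comparison would get wrong.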

  • Added integration with NVSM (requires DGX OS Server 4.0.5 or later).

    This allows firmware to be updated using a .run file that simplifies the steps needed. See the DGX-1 User Guide for instructions on obtaining and using the .run file.

  • Changed the container naming convention; the release now provides one file for all DGX-1 configurations.

  • When an update to the BMC or PSU is initiated:

    • The BMC is cold reset to put it in a known good state before the update.

    • Additional logs are gathered for troubleshooting purposes and made available in /var/log/comp_fw_log.txt.

      The logs are gathered before the update and again when the update completes or fails.

  • To prevent NVSM services from interfering with BMC and PSU updates, the container stops the following services before applying the update:

    • nvsm-apis-gpumonitor

    • nvsm-apis-plugin-storage

    • nvsm-apis-selwatcher

    • nvsm-apis-plugin-memory

    • nvsm-apis-plugin-environment

    • nvsm-sys-dshm

    • nvsm-env-dshm

    • nvsm-storage-dshm

    The system health monitor will not be available until the firmware update completes.

  • For the PSU update, the container implements a protective check that requires the system to be fully redundant (all four supplies installed and in a healthy state) before the update can occur.

    If you are using only three of the four PSUs, you can override the full power redundancy requirement by setting the DGX_MAX_PSU environment variable in the docker run command, as follows.

    docker run -e DGX_MAX_PSU=3 --privileged -ti -v /:/hostfs <container_name> update_fw
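
The command above can be assembled conditionally so that the override is added only for a three-PSU system. This is a minimal sketch, not part of the container: build_update_cmd is a hypothetical helper, it only prints the command rather than running it, and it assumes the image name (nvfw-dgx1:19.04.1) is what replaces <container_name>.

```shell
#!/bin/sh
# Print (do not run) the update command. The DGX_MAX_PSU override is added
# only for a three-PSU system; other counts are rejected because only the
# three-of-four case is described in these notes.
# build_update_cmd is a hypothetical helper name.
build_update_cmd() {
    buc_container="$1"
    buc_psus="${2:-4}"          # default: all four PSUs installed
    case "$buc_psus" in
        4) buc_env="" ;;
        3) buc_env="-e DGX_MAX_PSU=3 " ;;
        *) echo "unsupported PSU count: $buc_psus" >&2; return 1 ;;
    esac
    printf 'docker run %s--privileged -ti -v /:/hostfs %s update_fw\n' \
        "$buc_env" "$buc_container"
}

build_update_cmd "nvfw-dgx1:19.04.1" 3
```

Printing the command first makes it easy to review exactly what will be run before starting a firmware update.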
    

Known Issues

VBIOS Update Status Only Shows One GPU

Issue

On a DGX-1 with Tesla P100, when updating the VBIOS for all the GPUs in the system, the “Firmware Update in Progress” output banner shows only the last GPU to be updated instead of each or all GPUs.

Explanation

The firmware update container does not report each GPU as its VBIOS is flashed; it shows only the last GPU to indicate that all GPUs are being updated. In the background, the GPUs are flashed sequentially with the new VBIOS until the last GPU completes the update successfully.

Recovery for PSU Update Failure

Issue

On rare occasions, the recovery mechanism in the container may not be able to recover from a failure in the PSU update process.

Action to Take

If the container does not recover, contact NVIDIA Enterprise Support for assistance.

Update May Stop with an Unexpected Error

Issue

When updating the BMC, the update may fail with the following error message.

TypeError: __init__() takes exactly 4 arguments

Recommendation

Run the container again for the component that failed. If the component update continues to fail, contact NVIDIA Enterprise Support.
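
The recommendation can be expressed as a bounded retry. The following is a minimal sketch of a generic wrapper (run_with_retry is a hypothetical helper, not something the container provides); the command to retry is passed as arguments, and how a single component is selected is not shown here.

```shell
#!/bin/sh
# Run a command up to N times, stopping at the first success.
# run_with_retry is a hypothetical helper; usage: run_with_retry N command...
run_with_retry() {
    rr_max="$1"
    shift
    rr_attempt=1
    while [ "$rr_attempt" -le "$rr_max" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $rr_attempt of $rr_max failed" >&2
        rr_attempt=$((rr_attempt + 1))
    done
    return 1    # every attempt failed; time to contact support
}

# Demonstration with a harmless command that always succeeds.
run_with_retry 2 true && echo "succeeded"
```

In practice the wrapped command would be the same docker run invocation used for the original update, with a small attempt limit so that a persistent failure is escalated rather than retried indefinitely.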

Unexpected Error May Occur Upon Exiting the Container

Issue

After successfully completing an update and then exiting the container, the following error message may appear.

Method not supported in this mode

Details and Recommendation

This can occur if the CPU is under a high load while the container runs. The update is successful and no further action is needed.

To avoid this error, stop all GPU-intensive and CPU-intensive applications before running the container. You can also use the show_version option when running the container to confirm that the firmware was updated to the correct version.
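
One way to confirm a version is to search the show_version output for the expected version string. This is a minimal sketch under stated assumptions: the output format of show_version is not described here, so the check is only a fixed-string search; has_version is a hypothetical helper; and the sample text is a stand-in for demonstration, not real container output.

```shell
#!/bin/sh
# Succeed when the expected version string ($1) appears in text on stdin.
# Only a fixed-string search is done because the show_version output format
# is not specified in these notes. has_version is a hypothetical helper.
has_version() {
    grep -qF -- "$1"
}

# Stand-in text for demonstration only; this is NOT real show_version output.
sample="BMC 3.30.30"
if printf '%s\n' "$sample" | has_version "3.30.30"; then
    echo "expected version found"
fi
```

Assuming show_version takes the place of update_fw in the docker run command, its output could be piped into has_version in the same way as the sample text.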