Version 19.12.1

The DGX Firmware Update container version 19.12.1 is available.

  • Package name: nvfw-dgx2_19.12.1_191204.tar.gz

  • Run file name: nvfw-dgx2_19.12.1_191204.run

  • Image name: nvfw-dgx2:19.12.1

Highlights and Changes in this Release

  • This release is supported with the following DGX OS software:

    • DGX OS 4.3 or later

    • EL7-19.11 or later

  • Fixed an issue where the VBIOS was not updated during a combination or forced update.

  • Added “--update-backup-bmc” option for updating the secondary (backup) BMC image.

  • See DGX-2 Firmware Changes for the list of changes in individual components.

  • Removed the Samsung SSD second source firmware.

Contents of the DGX-2 System Firmware Container

This container includes the firmware binaries and update utilities for the firmware listed in the following table.

Component | Version | Key Changes

BMC | 01.05.10 | See BMC Release Notes for the list of changes.

Change to the Update Process

Previously, only certain firmware components, such as the SBIOS, required rebooting the system after performing the update.

To ensure that all DGX-2 services continue running, you must now reboot the DGX-2 after any firmware update, whether you update a single component or a group of components.

Updating Components with Secondary Images

Some firmware components provide a secondary image as backup. The following is the policy when updating those components:

  • SBIOS: Only the primary image is updated. To update both images, follow the instructions at Special Instructions for PSU, SBIOS, and BMC Firmware Updates.

  • BMC: Only the primary image is updated. To update the secondary (backup) image, include the --update-backup-bmc option in the update command.

  • FPGA: Only the primary image is updated.
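
As a minimal sketch of the BMC policy above, the following helper prints the update command with or without the backup option. Only the --update-backup-bmc option is named in these release notes; the "update_fw BMC" subcommand syntax and the helper name bmc_update_cmd are assumptions, so check the container's help output before relying on them.

```shell
# Sketch only: "update_fw BMC" is an assumed subcommand syntax. Only the
# --update-backup-bmc option itself is named in these release notes.
RUN_FILE=./nvfw-dgx2_19.12.1_191204.run

# Print (do not run) the BMC update command.
# Pass "with-backup" to include the secondary (backup) BMC image.
bmc_update_cmd() {
    if [ "$1" = "with-backup" ]; then
        echo "sudo $RUN_FILE update_fw BMC --update-backup-bmc"
    else
        echo "sudo $RUN_FILE update_fw BMC"
    fi
}
```

Printing the command rather than running it lets you review the exact invocation first.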

Special Instructions for PSU, SBIOS, and BMC Firmware Updates

Before updating the PSU, SBIOS, or BMC, review the following special instructions to ensure that the updates succeed.

PSU Updates

SBIOS Updates

  • If the current BMC is version 1.05.7, update the BMC before updating the SBIOS.

  • If the current SBIOS is a version earlier than 0.22 (such as 0.13 or 0.17), then you need to update the SBIOS from the BMC dashboard. See Updating the SBIOS Using the BMC Dashboard for instructions.

  • To update both primary and secondary SBIOS (after updating the BMC) using the container, do the following (assumes the primary SBIOS is the current, active SBIOS).

    1. Update the active SBIOS using the firmware update container.

    2. Designate booting from the secondary (inactive) SBIOS on the next boot.

      sudo ./nvfw-dgx2_19.12.1_191204.run sbios_slot --switch-nextboot-slot
      
    3. Reboot the DGX-2 to switch to the secondary SBIOS.

      telinit 1
      umount /raid
      sync
      ipmitool chassis power cycle
      
    4. Update the secondary (now active) SBIOS.

    5. Designate booting from the primary SBIOS on the next boot (to restore the primary SBIOS as the active SBIOS).

      sudo ./nvfw-dgx2_19.12.1_191204.run sbios_slot --switch-nextboot-slot
      
    6. Reboot the DGX-2 to switch back to the primary SBIOS.

      telinit 1
      umount /raid
      sync
      ipmitool chassis power cycle
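
The slot switch and reboot sequence appears twice in the procedure above, so it may help to wrap both in small helpers. The following is a sketch under assumptions: the DRY_RUN variable and the function names run, switch_nextboot_slot, and reboot_dgx2 are hypothetical, and the commands themselves are copied from the steps above. By default the helpers only print what they would run.

```shell
# Sketch under assumptions: DRY_RUN and the helper names are hypothetical.
RUN_FILE=./nvfw-dgx2_19.12.1_191204.run
DRY_RUN=${DRY_RUN:-1}   # default: print commands instead of running them

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 2 and 5: make the other SBIOS slot the next boot target.
switch_nextboot_slot() {
    run sudo "$RUN_FILE" sbios_slot --switch-nextboot-slot
}

# Steps 3 and 6: the reboot sequence exactly as listed above.
reboot_dgx2() {
    run telinit 1
    run umount /raid
    run sync
    run ipmitool chassis power cycle
}
```

Because telinit 1 switches the system to single-user mode, a remote session may end before the later commands run. With DRY_RUN=0, these helpers are likely best used from the system console.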
      

BMC Updates

Known Issues

EEPROM Checksum Mismatch

Issue

BMC version 1.05.7 introduced an issue that could cause corruption in the BMC EEPROM. This is indicated by an EEPROM checksum mismatch error message when attempting to update any firmware.

You can also verify EEPROM corruption by issuing the following

sudo ./nvfw-dgx2_19.12.1_191204.run show_version

and then viewing the output for the error message.

Note: This error may also be reported if a corrupt SBIOS produces a watchdog timeout during boot. In this case, the error message is erroneous. See Watchdog Timeout Due to Corrupt SBIOS for instructions on confirming and then resolving the SBIOS corruption.

Resolution

The DGX-2 Firmware Update Container version 19.12.1 includes logic to detect and repair the corruption. Perform the following steps to repair the EEPROM corruption.

  1. If the BMC is not already updated, then update the BMC.

  2. Review the “current” and “next” boot SBIOS by issuing the following.

    sudo ./nvfw-dgx2_19.12.1_191204.run sbios_slot --get-nextboot-slot
    
  3. Perform actions based on the NextBoot and Currently Booted From slots.

    • If the NextBoot slot and Currently Booted From slot are different, then reboot the system using ipmitool.

      telinit 1
      umount /raid
      sync
      ipmitool chassis power cycle
      
    • If the NextBoot slot and Currently Booted From slot are the same, then switch the NextBoot slot and then reboot as follows.

      sudo ./nvfw-dgx2_19.12.1_191204.run sbios_slot --switch-nextboot-slot
      telinit 1
      umount /raid
      sync
      ipmitool chassis power cycle
      
  4. Switch the NextBoot slot again and reboot to return to the original SBIOS.

    sudo ./nvfw-dgx2_19.12.1_191204.run sbios_slot --switch-nextboot-slot
    telinit 1
    umount /raid
    sync
    ipmitool chassis power cycle
    
  5. Verify the version strings in the primary and secondary slots are restored to their correct values.

    sudo ./nvfw-dgx2_19.12.1_191204.run show_version
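
The branch in step 3 can be expressed as a small decision helper. This is a minimal sketch: the function name step3_action and the slot labels passed to it are hypothetical. Read the two slot values from the sbios_slot --get-nextboot-slot output yourself, since these notes do not document its exact format.

```shell
# Decide what step 3 requires, given the "Currently Booted From" slot and
# the "NextBoot" slot. The arguments are whatever labels the tool reports;
# this helper only compares them.
step3_action() {
    current_slot=$1
    nextboot_slot=$2
    if [ "$current_slot" != "$nextboot_slot" ]; then
        # Slots differ: a reboot alone moves the system to the other SBIOS.
        echo "reboot"
    else
        # Slots match: switch the NextBoot slot first, then reboot.
        echo "switch-then-reboot"
    fi
}
```

For example, step3_action primary primary prints switch-then-reboot, which corresponds to the second bullet of step 3.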
    

Watchdog Timeout Due to Corrupt SBIOS

Issue

If an SBIOS is corrupt, the system cannot boot from it. When the system attempts to boot from the corrupt SBIOS, a watchdog timeout occurs and the system then boots from the alternate SBIOS. If the system is rebooted again, it attempts to boot from the original SBIOS, times out again, and then boots from the alternate SBIOS.

To confirm that a watchdog timeout has occurred,

  1. Issue the following.

    sudo ./nvfw-dgx2_19.12.1_191204.run show_version
    sudo cat /var/log/nvidia-fw.log | grep "EEPROM detection status 1" -n1
    
  2. Inspect byte 14 from the last EEPROM struct entry in the output.

    If byte 14 is 01, as in the following example, then a watchdog timeout has occurred.

    {EEPROM struct :00 00 16 00 00 18 00 01 03 01 03 01 22 01 01 a5}
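
The byte inspection in step 2 can be scripted. The following is a sketch, assuming that bytes are counted from 1 and that the log line contains the text "EEPROM struct :" followed by space-separated hex bytes, as in the example. The function names eeprom_byte and watchdog_timed_out are hypothetical. In the example line, the value at position 14 is 01 whether you count from 1 or from 0, so the example alone does not confirm the counting convention.

```shell
# Print byte N of an EEPROM struct line, counting from 1 (an assumption).
eeprom_byte() {
    line=$1
    n=$2
    bytes=${line#*EEPROM struct :}   # drop everything up to the byte list
    bytes=${bytes%\}*}               # drop the closing brace
    echo "$bytes" | awk -v n="$n" '{ print $n }'
}

# Succeed (exit status 0) when byte 14 is 01, meaning a watchdog timeout.
watchdog_timed_out() {
    [ "$(eeprom_byte "$1" 14)" = "01" ]
}
```

Pass the last EEPROM struct line from the grep output as the first argument, quoted as a single string.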
    

Resolution

If the SBIOS is corrupted, you can re-flash the SBIOS from the BMC dashboard. See Updating the SBIOS from the BMC Dashboard for instructions.

Network Connection May Get Lost When Connected to Virtual Media

Issue

After connecting to virtual media as follows,

  1. Log in to BMC dashboard.

  2. Click Remote Control > Launch KVM.

  3. Connect to an ISO image and then click Launch Media.

the network connection may be lost while a program is running from the virtual media.

Resolution and Workaround

NVIDIA is currently investigating this issue for resolution in a later software release. To work around the issue, access the software from a bootable USB drive instead of virtual media. Refer to the DGX-2 System User Guide: Creating a Bootable Installation Medium for instructions on creating a bootable USB.

NVSM Erroneously Reports PSUs and Fans as Unhealthy

Issue

After updating the BMC to version 1.05.07, output from nvsm show health reports PSUs and fans as “unhealthy” and that they cannot be detected, even though they are fine as indicated when using ipmitool. This occurs with DGX OS versions 4.1.1 and earlier.

Explanation

The “unhealthy” status is erroneous and does not impact functionality. The issue will be resolved in the next DGX OS release subsequent to patch update 4.1.1.

BMC UI May Stop Responding

Issue

Occasionally, the BMC web interface will stop responding, as indicated by the spinning progress bar and “Processing” text. This can happen at the login screen and also after logging in.

Recovery

The system OS is not affected, and the BMC itself is responsive to ipmitool commands.

To recover, reset the BMC using any of the following methods.

  • Via an SSH connection to the system, with sudo access, enter the following:

    sudo ipmitool mc reset cold
    
  • Via IPMI over a network, enter the following:

    ipmitool -I lan -H <bmc-ip-address> -U <user> -P <password> mc reset cold
    
  • If you have physical access to the system, press the BMC reset button.

    Refer to item 9 in the following image of the back of the DGX-2 system for the location of the BMC reset button.

    _images/bmc-reset-button.png
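
After a reset, you may want to confirm that the BMC is answering again before returning to the web interface. The following is a minimal sketch that polls the standard ipmitool mc info command. The function name wait_for_bmc and the default retry count and delay are hypothetical choices, not values from these notes.

```shell
# Poll until the BMC answers "ipmitool mc info", or give up after N tries.
# Usage: wait_for_bmc [tries] [delay-seconds]
# Run as root (or adapt to "ipmitool -I lan ..." for network access).
wait_for_bmc() {
    tries=${1:-30}
    delay=${2:-10}
    i=0
    while [ "$i" -lt "$tries" ]; do
        if ipmitool mc info >/dev/null 2>&1; then
            return 0    # BMC responded
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1            # BMC did not respond in time
}
```

A typical use would be wait_for_bmc && echo "BMC is responding" after issuing the cold reset.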

VBIOS Not Updated on DGX KVM Host


Issue

On a DGX-2 System that has been converted to a DGX KVM host, the VBIOS will not get updated if the GPU is being used by a guest GPU VM.

Explanation

All guest GPU VMs must be stopped before running the container to update the VBIOS. To stop the VMs, run the following from the KVM host for each guest GPU VM.

virsh shutdown <vm-domain>
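
To avoid typing the command once per VM, the shutdown can be looped over the running domains. This is a sketch, assuming that every running libvirt domain on the host is a guest GPU VM. The function name shutdown_running_guests is hypothetical. Filter the list first if other guests must stay up.

```shell
# Shut down every running libvirt domain on this KVM host.
# Assumption: all running domains are guest GPU VMs.
shutdown_running_guests() {
    virsh list --name | while read -r domain; do
        [ -n "$domain" ] || continue          # skip blank lines in the list
        virsh shutdown "$domain" </dev/null   # keep virsh from reading the list
    done
}
```

virsh shutdown requests a graceful guest shutdown and returns immediately, so confirm with virsh list that no guest GPU VMs are still running before starting the firmware update.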

Backup SBIOS Version at 0.0

Issue

The BMC dashboard incorrectly reports the backup SBIOS version to be 0.0.

Explanation

Due to a limitation in the BMC software, the BMC cannot determine the version of the backup SBIOS because the backup SBIOS has not yet been run.