Version 20.10.7
The DGX Firmware Update container version 20.10.7 is available.
Package name:
nvfw-dgx2_20.10.7_201023.tar.gz
Run file name:
nvfw-dgx2_20.10.7_201023.run
Image name:
nvfw-dgx2:20.10.7
Highlights and Changes in this Release
This release is supported with the following DGX OS software -
DGX OS 4.3 or later
Before using the container to update firmware on DGX OS later than 4.99.x, first stop certain NVIDIA services. See “Special Instructions for all Updates”.
EL7-19.11 or later
Incorporates security updates for the BMC.
See the NVIDIA Security Bulletin 5010 for details.
Incorporates updated component firmware.
See Contents of the DGX-2 System Firmware Container for the list of changes.
Contents of the DGX-2 System Firmware Container
This container includes the firmware binaries and update utilities for the firmware listed in the following table.
Component | Version | Key Changes |
---|---|---|
BMC | 1.06.06 | See BMC Release Notes for the list of changes. |
Change to the Update Process
Originally, only certain firmware components, such as the SBIOS, required rebooting the system after performing the update.
In order to ensure that all DGX-2 services continue running, you must reboot the DGX-2 after any firmware update for any component or group of components.
Updating Components with Secondary Images
Some firmware components provide a secondary image as backup. The following is the policy when updating those components:
SBIOS: Only the primary image is updated. To update both images, follow the instructions at Special Instructions for PSU, SBIOS, and BMC Firmware Updates.
BMC: Only the primary image is updated. To update the secondary (backup) image, include the --update-backup-bmc option in the update command.
FPGA: Only the primary image is updated.
Enabling SNMP RO/RW Strings
The SNMP RO/RW strings are disabled by default. The following table provides the ipmitool arguments for enabling the strings. After enabling, disabling, or setting the RO/RW strings, either issue the restart SNMP Server command or reset the BMC for the changes to go into effect.
LUN | CMD | Requested Data: Offset | Requested Data: Description |
---|---|---|---|
3Ch/00h | 26h | 1 | 00h: Enable RO string; 01h: Enable RW string; 02h: Disable RO string; 03h: Disable RW string; 04h: Set RO string; 05h: Set RW string; 06h: Start SNMP Server; 07h: Stop SNMP Server; 08h: Restart SNMP Server |
 | | 2:21 | Community string in ASCII code. Maximum string length is 20 characters. If request byte is set to 0x4 or 0x5, but empty from byte 2 to byte 21, then the corresponding community string will be cleared. |
For example, to enable the RO string, set the community string to “test”, and then restart the SNMP service on the BMC, do the following:
Enable RO.
sudo ipmitool raw 0x3c 0x26 0x00
Set the RO string to “test”.
sudo ipmitool raw 0x3c 0x26 0x04 0x74 0x65 0x73 0x74
Restart the SNMP service on the BMC.
sudo ipmitool raw 0x3c 0x26 0x08
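The community string is passed to ipmitool as ASCII byte values, so a small helper can save looking up each character by hand. The following is a minimal sketch, not part of the firmware container: the function name `community_to_hex` is hypothetical, and the 20-character limit is taken from the table above.

```shell
# Hypothetical helper: print an ASCII community string as the
# space-separated hex bytes used in an "ipmitool raw" request.
community_to_hex() {
    local s="$1"
    # The table above limits the community string to 20 characters.
    if [ "${#s}" -gt 20 ]; then
        echo "community string exceeds 20 characters" >&2
        return 1
    fi
    # od prints the bytes as two-digit hex; put one byte per line,
    # prefix each with 0x, then join the bytes with single spaces.
    printf '%s' "$s" | od -An -tx1 -v | tr -s ' \n' '\n' \
        | sed -n 's/^\([0-9a-f][0-9a-f]\)$/0x\1/p' | paste -sd' ' -
}
```

For instance, `sudo ipmitool raw 0x3c 0x26 0x04 $(community_to_hex test)` would expand to the same "Set the RO string" command shown in the example.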
Special Instructions for all Updates
Updating Firmware on DGX Systems Installed with DGX OS Release Later than 4.99.x
You need to stop certain NVIDIA services before using the container to update firmware on systems installed with DGX OS later than 4.99.x.
If you run the container using either the docker run or .run file method, then stop services first by issuing the following.
sudo systemctl stop nvsm dcgm nvidia-fabricmanager nvidia-persistenced.service
If you run the container using NVSM CLI, then stop services first by issuing the following (does not include stopping nvsm).
sudo systemctl stop dcgm nvidia-fabricmanager nvidia-persistenced.service
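The two cases above differ only in whether nvsm is included, which is easy to get wrong when scripting updates. The sketch below captures that choice under stated assumptions: both function names are hypothetical, the DGX OS version is passed in by the caller rather than detected, and "later than 4.99.x" is interpreted as a major version of 5 or higher.

```shell
# Hypothetical helper: succeed when the given DGX OS version is later
# than 4.99.x, interpreted here as major version 5 or higher.
needs_service_stop() {
    local major="${1%%.*}"
    [ "$major" -ge 5 ]
}

# Hypothetical helper: print the services to stop for a launch method.
#   docker | run : container started with "docker run" or the .run file
#   nvsm         : container started through NVSM CLI (nvsm keeps running)
services_to_stop() {
    case "$1" in
        docker|run) echo "nvsm dcgm nvidia-fabricmanager nvidia-persistenced.service" ;;
        nvsm)       echo "dcgm nvidia-fabricmanager nvidia-persistenced.service" ;;
        *)          echo "unknown method: $1" >&2; return 1 ;;
    esac
}
```

A caller might then run `needs_service_stop 5.0.0 && sudo systemctl stop $(services_to_stop docker)`, substituting the version installed on the system.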
Special Instructions for PSU, SBIOS, and BMC Firmware Updates
Before updating the PSU, SBIOS, or the BMC, refer to the following special instructions for guidance to ensure the updates are successful.
PSU Updates
If the BMC version is older than 01.00.01, then update the BMC before updating the PSU. See Updating the BMC from Versions older than 01.00.01.
SBIOS Updates
If the current BMC is version 1.05.7, then update the BMC before updating the SBIOS.
To update both primary and secondary SBIOS (after updating the BMC) using the container, do the following (assumes the primary SBIOS is the current, active SBIOS).
Refer to “Special Instructions for all Updates” to see if services need to be stopped and how to do it.
Update the active SBIOS using the update_fw SBIOS argument from the firmware update container.
Designate booting from the secondary (inactive) SBIOS on the next boot.
sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:20.10.7 sbios_slot --switch-nextboot-slot
Reboot the DGX-2 to switch to the secondary SBIOS.
sudo telinit 1
sudo umount /raid
sync
sudo ipmitool chassis power cycle
Update the secondary (now active) SBIOS.
Designate booting from the primary SBIOS on the next boot (to restore the primary SBIOS as the active SBIOS).
sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:20.10.7 sbios_slot --switch-nextboot-slot
Reboot the DGX-2 to switch back to the primary SBIOS.
sudo telinit 1
sudo umount /raid
sync
sudo ipmitool chassis power cycle
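Because the same docker invocation recurs throughout these procedures, keeping the image name in one variable can help avoid mistyped image tags. The following is a sketch only: `NVFW_IMAGE` and `nvfw_cmd` are hypothetical names, the default is the image name listed for this release (nvfw-dgx2:20.10.7), and the function prints the command rather than running it so that it can be reviewed first.

```shell
# Assumption: default to the image name listed at the top of these notes.
NVFW_IMAGE="${NVFW_IMAGE:-nvfw-dgx2:20.10.7}"

# Hypothetical helper: print (do not run) the docker command line for
# the given firmware container arguments.
nvfw_cmd() {
    printf 'sudo docker run --rm --privileged -ti -v /:/hostfs %s' "$NVFW_IMAGE"
    if [ "$#" -gt 0 ]; then
        printf ' %s' "$@"
    fi
    printf '\n'
}
```

For example, `nvfw_cmd sbios_slot --switch-nextboot-slot` prints the slot-switch command used in the steps above, and `nvfw_cmd update_fw SBIOS` prints the SBIOS update command.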
BMC Updates
If the current BMC is older than 01.00.01, then follow the instructions at Updating the BMC from Versions older than 01.00.01.
If the current BMC is 01.00.01, then follow the instructions at Updating the BMC from Version 01.00.01.
Known Issues
SBIOS Intel ME Setting Version Does Not Get Updated
Issue
The Intel ME firmware has changed since SBIOS 0.17, but updating the SBIOS from 0.17 does not update the ME firmware.
Resolution
To update the Intel ME firmware, do not update the SBIOS using the firmware update container. Instead, use the BMC dashboard. See Updating the SBIOS from the BMC Dashboard for instructions.
After updating the SBIOS, verify that the Intel ME setting version has been updated by issuing the following.
# sudo dmidecode --type 11
//output
Getting SMBIOS data from sysfs.
SMBIOS 3.0.0 present.
Handle 0x0055, DMI type 11, 5 bytes
OEM Strings
String 1: 4.0.4.313.1
Verify that the last digit in String 1 is “1” as in the example output.
Note: The Intel ME setting version is stored in the SBIOS, and is available for viewing, only with SBIOS version 0.24 and later.
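The check on String 1 can also be scripted. The sketch below is illustrative only: the function name `me_setting_updated` is hypothetical, it reads `dmidecode --type 11` output on standard input, and it assumes the output has the same layout as the example above.

```shell
# Hypothetical helper: succeed when the last digit of "String 1" in
# "dmidecode --type 11" output (read from stdin) is 1.
me_setting_updated() {
    local s
    # Take the first "String 1:" line and strip the label and whitespace.
    s=$(sed -n '/String 1:/{s/^.*String 1:[[:space:]]*//;s/[[:space:]]*$//;p;q;}')
    case "$s" in
        *1) return 0 ;;
        *)  return 1 ;;
    esac
}
```

A possible invocation is `sudo dmidecode --type 11 | me_setting_updated && echo "ME setting version updated"`.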
EEPROM Checksum Mismatch
Issue
BMC version 1.05.7 introduced an issue that could cause corruption in the BMC EEPROM. This is indicated by an EEPROM checksum mismatch error message when attempting to update any firmware.
You can also verify EEPROM corruption by issuing the following
sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:21.06.7 show_version
and then viewing the output for the error message.
Note: This error may be reported if a corrupt SBIOS produces a watchdog timeout during boot. In this case, the error message is erroneous. See the section Watchdog Timeout Due to Corrupt SBIOS for instructions on confirming and then resolving the SBIOS corruption.
Resolution
The DGX-2 Firmware Update Container version 21.06.7 includes logic to detect and repair the corruption. Perform the following steps to repair the EEPROM corruption.
If the BMC is not already updated, then update the BMC.
Refer to Special Instructions for all Updates to see if services need to be stopped and how to do it.
Review the “current” and “next” boot SBIOS by issuing the following.
sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:21.06.7 sbios_slot --get-nextboot-slot
Perform actions based on the NextBoot and Currently Booted From slots.
If the NextBoot slot and Currently Booted From slot are different, then reboot the system using ipmitool.
telinit 1
umount /raid
sync
ipmitool chassis power cycle
If the NextBoot slot and Currently Booted From slot are the same, then switch the NextBoot slot and then reboot as follows.
sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:21.06.7 sbios_slot --switch-nextboot-slot
telinit 1
umount /raid
sync
ipmitool chassis power cycle
Switch the NextBoot slot again and reboot to return to the original SBIOS.
sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:21.06.7 sbios_slot --switch-nextboot-slot
telinit 1
umount /raid
sync
ipmitool chassis power cycle
Verify the version strings in the primary and secondary slots are restored to their correct values.
sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:21.06.7 show_version
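The branch in the repair steps depends only on whether the NextBoot slot matches the Currently Booted From slot. As a rough sketch of that decision, the hypothetical function below takes the two slot names as arguments (read by the operator from the sbios_slot output, whose exact format is not assumed here) and prints which action the steps above call for first.

```shell
# Hypothetical helper: decide the first repair action.
#   $1: NextBoot slot   $2: Currently Booted From slot
eeprom_repair_first_action() {
    if [ "$1" != "$2" ]; then
        # Slots differ: a reboot alone moves the system to the other SBIOS.
        echo "reboot"
    else
        # Slots match: switch the NextBoot slot before rebooting.
        echo "switch-nextboot-slot then reboot"
    fi
}
```

In either case the procedure then continues with the second switch and reboot to return to the original SBIOS.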
Watchdog Timeout Due to Corrupt SBIOS
DGX-2 Known Issue
Issue
If an SBIOS is corrupt, the system will not be able to boot from it. In this case, when attempting to boot from the corrupt SBIOS, a watchdog timeout occurs and then the system boots from the alternate SBIOS. If the system is then rebooted, the system will attempt to boot from the original SBIOS, timeout again, then boot from the alternate SBIOS.
To confirm that a watchdog timeout has occurred, do the following.
Issue the following.
sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:20.10.7 show_version
sudo cat /var/log/nvidia-fw.log | grep "EEPROM detection status 1" -n1
Inspect byte 14 from the last EEPROM struct entry in the output.
If byte 14 is 01, as in the following example, then a watchdog timeout has occurred.
{EEPROM struct :00 00 16 00 00 18 00 01 03 01 03 01 22 01 01 a5}
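Picking one byte out of the EEPROM struct by eye is error-prone, so the extraction can be sketched as a small filter. The function name `eeprom_byte` is hypothetical, and the sketch assumes that bytes are counted from 1 and that log lines have the same layout as the example entry; in that example, byte 14 is 01 whether counting starts at 0 or 1, so the counting base should be confirmed before relying on this.

```shell
# Hypothetical helper: print byte N of the last "EEPROM struct" entry
# read from stdin. Assumption: N counts from 1.
eeprom_byte() {
    local n="$1"
    grep 'EEPROM struct' | tail -n 1 \
        | sed 's/.*EEPROM struct[[:space:]]*:[[:space:]]*//; s/}.*//' \
        | awk -v n="$n" '{ print $n }'
}
```

One way to use it is `sudo cat /var/log/nvidia-fw.log | eeprom_byte 14`, treating an output of 01 as the watchdog timeout indication described above.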
Resolution
If the SBIOS is corrupted, you can re-flash the SBIOS from the BMC dashboard. See Updating the SBIOS from the BMC Dashboard for instructions.
VBIOS Not Updated on DGX KVM Host
DGX-2 Known Issue
Issue
On a DGX-2 System that has been converted to a DGX KVM host, the VBIOS will not get updated if the GPU is being used by a guest GPU VM.
Explanation
All guest GPU VMs must be stopped before running the container to update the VBIOS. To stop the VMs, run the following from the KVM host for each guest GPU VM.
virsh shutdown <vm-domain>
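When several guest VMs are running, the per-VM command can be generated from a list of domain names. The sketch below is an illustration under assumptions: `shutdown_cmds` is a hypothetical name, it reads one domain name per line from standard input, and it only prints the commands so they can be reviewed before anything is shut down. Feeding it `virsh list --name` would cover every running domain, not only guest GPU VMs, so the list may need filtering first.

```shell
# Hypothetical helper: read VM domain names (one per line) from stdin
# and print a "virsh shutdown" command for each, skipping blank lines.
shutdown_cmds() {
    while IFS= read -r dom; do
        [ -n "$dom" ] || continue
        printf 'virsh shutdown %s\n' "$dom"
    done
}
```

For example, `virsh list --name | shutdown_cmds` prints one shutdown command per running domain; piping that output to `sh` would execute them.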