DGX-2 System Firmware Changes

This chapter contains the list of changes for the following DGX-2 firmware components.

DGX-2 BMC Changes

Changes in 01.08.00

Changes in 01.07.2

  • BMC SEL entries can now be saved to a remote syslog server.
  • Fixed an issue where the backup/restore function did not work.
  • Fixed an issue where the MB CPLD, VBIOS, and PLX switch EEPROM versions are missing from the BMC web UI.
  • Added the ability to get the status of power supply redundancy via SNMP.
  • Fixed an issue where resetting a BMC password (via the BMC web UI) to an 'easy' password results in an error.
  • Fixed an issue where, if the BMC was updated without preserving all the settings, then updating the SBIOS would result in losing all SBIOS settings.

Changes in 01.06.6

  • Fixed an issue that caused the nvsm health service to fail because it could not recognize the platform. Platform details were left as the placeholder 'To be filled by O.E.M.'

  • Added code to report when fans are increased to 80% and identify the sensor that triggers it.

  • Added code to track GPU page retirements.

  • The system now boots from the secondary BMC image if there is a problem booting from the primary image.

  • Added CPLD Dump option to the BMC Web UI Maintenance > Diagnostic Dump Data page.

  • Added CPU CATERR Dump option to the BMC Web UI Maintenance > Diagnostic Dump Data page.

  • Added FAN DIAG Dump option to the BMC Web UI Maintenance > Diagnostic Dump Data page.

  • Fixed BMC errors not getting sent to the syslog facility (Splunk).

  • Improved robustness of BMC update.

  • Implemented RESTful API to detect and restore EEPROM content during update.

  • Fixed issue where a failed BMC update rendered the BMC unreachable.

  • Fixed issue where a failed BMC update rendered the system unable to be powered on.

  • Fixed issue where the BMC stops logging SEL events if it cannot receive a timestamp from the Intel Management Engine (ME).

  • Fixed issue where updates from BMC version 1.05.07 fail, resulting in the system unable to boot.

  • Fixed battery voltage lower thresholds.

  • Fixed inability to change MaxQ/MaxP power mode.

  • Fixed inability to configure the BMC event filter from the Web UI.

  • Removed FQDN for LDAP in the Web UI.

  • Fixed an issue where the BMC was not resetting the SBIOS-fail-safe flag after recovery from a boot failure.

  • Fixed an issue where the system still booted when GPU fans from multiple fan zones are removed.

  • Fixed missing interface name in Web UI Service Configuration page.

  • Fixed missing NVSwitch temperature sensor readings.

  • Fixed BMC occasionally disabling GPU temperature sensors.

  • Fixed issue where SEL entries time-stamped after 9/21 showed "Pre-Init Time-stamp" instead of log info.
  • Fixed an issue where the BMC erroneously reported the backup SBIOS version is 0.0.
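One way to check a system for the "Pre-Init Time-stamp" symptom described above is to scan a saved SEL listing for entries whose timestamp field holds that placeholder. The following is a minimal sketch, not an NVIDIA tool: it assumes pipe-separated text in the style printed by `ipmitool sel list`, the function name `find_pre_init_entries` is hypothetical, and the sample lines are made up for illustration. The exact field layout may differ between ipmitool versions.

```python
from typing import Iterable, List

# Text that appears in the timestamp field when no real time was recorded.
PRE_INIT_MARKER = "Pre-Init"

# Illustrative sample only; these are not records from a real DGX-2.
SAMPLE_LINES = [
    "   1 | 09/20/2019 | 23:58:11 | Fan #0x41 | Lower Critical going low | Asserted",
    "   2 | Pre-Init Time-stamp   | Temperature #0x30 | Upper Critical going high | Asserted",
    "   3 | 09/22/2019 | 00:03:47 | Power Supply #0x51 | Presence detected | Asserted",
]


def find_pre_init_entries(sel_lines: Iterable[str]) -> List[str]:
    """Return the record IDs of SEL lines whose second field is a
    pre-init placeholder instead of a date."""
    flagged = []
    for line in sel_lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2:
            continue  # skip lines that are not SEL records
        if fields[1].startswith(PRE_INIT_MARKER):
            flagged.append(fields[0])
    return flagged


if __name__ == "__main__":
    print(find_pre_init_entries(SAMPLE_LINES))
```

The input could come from a file that holds the output of `ipmitool sel list`; the function only inspects text and never contacts the BMC.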

Changes in 01.05.20

  • SNMP RO/RW string now disabled by default.

Changes in 01.05.12

  • Added LDAPS (secure LDAP) support.
  • Resolved network connection getting lost when connected to virtual media.
  • Resolved an issue where occasionally the BMC UI would stop responding.

Changes in 01.05.10

  • Fixed an issue with BMC 01.05.07 that potentially affected SBIOS stability.
  • Fixed BMC configuration settings not getting applied to both primary and secondary images.
  • Fixed corrupted primary BMC failing to recover when primary and secondary images are different versions.
  • Fixed issue recovering corrupted firmware on Delta PSU.
  • Fixed BMC web UI reporting BIOS information incorrectly.
  • Fixed BMC Web UI reporting backup BMC version incorrectly.
  • Fixed cryptic BMC entries.
  • Added BMC capture logs from CPLD/FPGA during power on.
  • Added IPMI OEM command to GET and SET which image the SBIOS is pointing to (Change the PIN).
  • Fixed MaxP/MaxQ System unable to boot after BMC-initiated shutdown with four or more PSU failures.
  • Fixed SEL logs to indicate that a bad fan (or fan speed of zero) may have caused the system to shut down due to GPU overtemp.
  • Fixed how the BMC responds when it cannot read a temperature sensor.
  • Fixed the IPMI log event decoding through ipmitool to show the same events as the GUI.
  • Fixed the BMC to provide more meaningful and useful SEL logs.
  • Fixed the GPU sensor name on baseboard 2 to match the service label.
  • Changed the naming of U.2 SSDs from "NVME" to "U.2".
  • Resolved BMC SNMP community string limitations.

Changes in 01.04.03

  • Fixed BMC Update Timeout issue.
  • Fixed BMC configuration backup/restore function not working properly.
  • Fixed system not shutting down when all fans in Fan Zone 2 or 3 are not detected.
  • Fixed system fans all running at 80% after hot-unplugging/hot-plugging a PSU.
  • Fixed system fans running at 80% after hot-plugging an NVMe drive.
  • Fixed system shutting down after hot-unplugging one of the fans.
  • Fixed system unable to boot after updating BMC image while one BMC module is removed.
  • Fixed incorrect SEL timestamp after executing ipmitool mc reset cold.
  • Fixed missing firmware information in the BMC dashboard. Information is available on the Maintenance > Firmware Information page.
  • Fixed missing DIMM information in the BMC dashboard.
  • Fixed blinking amber-colored power LED.
  • Fixed BMC update freeze while updating using Yafuflash.
  • Fixed issues responding to 3.3V/5V/12V sensors.
  • Fixed incorrect responses to GPU temperature assertion - Fan Zone 1 goes to 80% and DIMM temperature reports 'device disabled'.
  • The BMC now saves CPU MCA registers when it detects a fatal MCA error.

Changes in 01.00.01

  • Fixed BMC update via dashboard erroneously preserving the configuration.
  • Fixed Network Link Configuration and Network IP Settings pages on the BMC dashboard to reflect changes only when saved.
  • Added dual FPGA image container update support.
  • Added PSU firmware container update support.
  • Enhanced SMBPBI support for GPU sensors, thermal polling and FAN control to avoid anomalous sensor reading for GPU sensors and corresponding thermal actions.
  • Added support for FPGA update of Image #1 to the BMC dashboard.
  • Added VLAN support to the BMC dashboard.

DGX-2 SBIOS Changes

Changes in 0.33

  • Included updated microcode version 0x2006e05, resolving potential processor hangs with CATERR on specific workloads while Hyperthreading was enabled.
  • The SBIOS update includes software security enhancements. See the NVIDIA Security Bulletin for details.

Changes in 0.29

  • Implemented updates to address vulnerabilities in the SSA option, Intel CSME, SPS, and TXE.
  • Fixed an issue where updating the SBIOS via the BMC web UI resets the EFI varstore, which removes any changes made to use a signed bootloader.

Changes in 0.26

  • Fixed issue where the system would not boot if one NVMe drive was bad.
  • Fixed issue where each DGX-2 product UUID is not unique.
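Whether a fleet is affected by the UUID issue above can be checked by collecting the system UUID from each machine (for example with `dmidecode -s system-uuid`) and looking for repeats. The following is a minimal sketch under those assumptions: the function name `find_duplicate_uuids`, the host names, and the UUID values are hypothetical and only illustrate the comparison.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate_uuids(uuid_by_host: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each UUID reported by more than one host to its sorted host list.

    UUIDs are compared case-insensitively with surrounding whitespace removed.
    """
    hosts_by_uuid = defaultdict(list)
    for host, uuid in uuid_by_host.items():
        hosts_by_uuid[uuid.strip().lower()].append(host)
    return {
        uuid: sorted(hosts)
        for uuid, hosts in hosts_by_uuid.items()
        if len(hosts) > 1
    }


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = {
        "dgx2-a": "03000200-0400-0500-0006-000700080009",
        "dgx2-b": "03000200-0400-0500-0006-000700080009\n",
        "dgx2-c": "4C4C4544-0000-1010-8000-B1C04F000001",
    }
    print(find_duplicate_uuids(inventory))
```

An empty result means every collected UUID is unique.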

Changes in 0.24

  • Fixed erroneous events getting logged after system cold reboot.
  • Incorporated Intel microcode to mitigate new side channel attacks (ZombieLoad).
  • Fixed boot failure when BMC Self Test Status is "Failed".
  • Re-enabled Hyperthreading option in SBIOS.
  • Fixed SMBIOS type 9 tables not filled in properly.

Changes in 0.22

  • Fixed system failing to switch to backup SBIOS when initial boot fails.
  • Fixed enp6s0 network disappearing after enabling M.2 module hot plug in the SBIOS settings.
  • Fixed system unable to boot after replacing a DIMM.
  • Updated the boot recovery process when BMC remains unresponsive during boot. If BMC reset fails, then boot to SBIOS setup menu.
  • Fixed the default PCIe Corrected Error Threshold Counter setting to be enabled.

Changes in 0.17

  • Added SBIOS support for recovering degraded PCIe link during system boot.
  • Enhanced debug capability and support for faster resolution of customer cases via fully decoded MCA, Memory, POST and PCIe SEL events.
  • Developed an in-memory PCIe topology in the SBIOS to avoid a full PCIe scan, which in turn eliminates unexpected Unsupported Request errors (PCIe correctable errors).
  • Enabled Error Logging options (enable or disable verbose logging) in the SBIOS setup menu.
  • Added support for changing boot order using standard IPMI interface.
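The standard IPMI interface named in the last item above can be exercised with the `chassis bootdev` subcommand of ipmitool. The following is a minimal sketch, not an NVIDIA-provided tool: the function name `build_bootdev_command`, the BMC address, and the user name are hypothetical. It only assembles the argument list, so the command can be reviewed before it is handed to something like `subprocess.run`.

```python
from typing import List

# Boot devices accepted by `ipmitool chassis bootdev`.
VALID_DEVICES = {"none", "pxe", "disk", "safe", "diag", "cdrom", "bios", "floppy"}


def build_bootdev_command(host: str, user: str, device: str,
                          persistent: bool = False) -> List[str]:
    """Assemble an ipmitool command line that sets the boot device.

    The password is kept off the command line: the -E flag tells ipmitool
    to read it from the IPMI_PASSWORD environment variable.
    """
    if device not in VALID_DEVICES:
        raise ValueError(f"unsupported boot device: {device!r}")
    cmd = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E",
           "chassis", "bootdev", device]
    if persistent:
        # Without this option the setting applies to the next boot only.
        cmd.append("options=persistent")
    return cmd


if __name__ == "__main__":
    # Hypothetical BMC address and user name, for illustration only.
    print(" ".join(build_bootdev_command("192.0.2.10", "admin", "pxe")))
```

Validating the device name up front means a typo is rejected locally instead of being sent to the BMC.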

DGX-2 PSU Changes

Changes in 2.7

  • Fixed power-factor and load-balancing issues.
  • Fixed PSU not getting powered on.

Changes in 2.5

  • Fixed power load balancing issue at light loads.
  • Fixed power factor on the PDU showing low value which affects outlet wattage.
  • Fixed issue in COM firmware that may cause a bootloader failure while updating from older PSU FW.
  • Fixed BMC Update Timeout issue.

DGX-2 U.2 Firmware Release Notes

Changes in 11300DU0

  • Resolved an intermittent I/O timeout issue and an NVMe controller reset event during high disk load.

Changes in 101008S0

  • Increased robustness of host read error handling.
  • Corrected adaptive tracking of NAND read thresholds.
  • Corrected potential invalid media error reported to host during power cycling.
  • Implemented general error handling and stability improvements.