DGX-2 System Firmware Changes

This topic contains the list of changes for the following DGX-2 firmware components.

DGX-2 BMC Changes

Changes in 01.09.00

  • Increased the maximum configurable KVM idle timeout to 10,800 seconds (3 hours). This avoids timeouts while mounting large ISO images over a slow BMC network.

  • The following table lists a potential security vulnerability that has been reported by AMI. It is addressed in DGX-2 BMC version 01.09.00.

    • Affected BMC versions: All BMC versions prior to 01.09.00

    • Updated BMC version: 01.09.00

    • Firmware container version: 24.03.1

      CVE IDs Addressed    Vendor (per NVD)

      CVE-2023-37293       AMI
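To check whether an installed BMC falls in the affected range above, the dotted version strings can be compared numerically. The following is a minimal sketch, not an NVIDIA tool: the helper names `parse_version` and `is_affected` are hypothetical, and it assumes BMC versions always consist of up to three dot-separated integers, as they do throughout these notes.

```python
def parse_version(text):
    """Split a dotted version such as '01.07.2' into a tuple of three integers."""
    parts = [int(part) for part in text.strip().split(".")]
    # Pad short versions with zeros so '01.09' compares equal to '01.09.00'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


# First BMC version that contains the fix, taken from the list above.
FIXED_BMC_VERSION = parse_version("01.09.00")


def is_affected(bmc_version):
    """Return True when bmc_version is earlier than 01.09.00."""
    return parse_version(bmc_version) < FIXED_BMC_VERSION


if __name__ == "__main__":
    for version in ("01.07.2", "01.05.20", "01.09.00"):
        status = "affected" if is_affected(version) else "not affected"
        print(version, status)
```

Comparing integer tuples rather than raw strings matters here because a plain string comparison would mishandle versions with different zero padding, such as 1.05.07 and 01.05.07.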

Changes in 01.08.00

Changes in 01.07.2

  • BMC SEL entries can now be saved to a remote syslog server.

  • Fixed an issue where the backup/restore function did not work.

  • Fixed an issue where the MB CPLD, VBIOS, and PLX switch EEPROM versions were missing from the BMC web UI.

  • Added the ability to get the status of power supply redundancy via SNMP.

  • Fixed an issue where resetting a BMC password (via the BMC web UI) to an ‘easy’ password resulted in an error.

  • Fixed an issue where, if the BMC was updated without preserving all the settings, then updating the SBIOS would result in losing all SBIOS settings.

Changes in 01.06.6

  • Fixed an issue that caused the nvsm health service to fail because it could not recognize the platform. The platform details were left unpopulated and showed the placeholder ‘To be filled by O.E.M.’

  • Added code to report when fans are increased to 80% and identify the sensor that triggers it.

  • Added code to track GPU page retirements.

  • The system now boots from the secondary BMC image if there is a problem booting from the primary image.

  • Added CPLD Dump option to the BMC Web UI Maintenance > Diagnostic Dump Data page.

  • Added CPU CATERR Dump option to the BMC Web UI Maintenance > Diagnostic Dump Data page.

  • Added FAN DIAG Dump option to the BMC Web UI Maintenance > Diagnostic Dump Data page.

  • Fixed BMC errors not getting sent to the syslog facility (Splunk).

  • Improved robustness of BMC update.

  • Implemented RESTful API to detect and restore EEPROM content during update.

  • Fixed issue where a failed BMC update rendered the BMC unreachable.

  • Fixed issue where a failed BMC update rendered the system unable to be powered on.

  • Fixed issue where the BMC stops logging SEL events if it cannot receive a timestamp from the Intel Management Engine (ME).

  • Fixed issue where updates from BMC version 01.05.07 failed, leaving the system unable to boot.

  • Fixed battery voltage lower thresholds.

  • Fixed inability to change MaxQ/MaxP power mode.

  • Fixed inability to configure the BMC event filter from the Web UI.

  • Removed FQDN for LDAP in the Web UI.

  • Fixed an issue where the BMC was not resetting the SBIOS-fail-safe flag after recovery from a boot failure.

  • Fixed an issue where the system still booted when GPU fans from multiple fan zones were removed.

  • Fixed missing interface name in Web UI Service Configuration page.

  • Fixed missing NVSwitch temperature sensor readings.

  • Fixed BMC occasionally disabling GPU temperature sensors.

  • Fixed issue where SEL entries time-stamped after 9/21 showed “Pre-Init Time-stamp” instead of log info.

  • Fixed an issue where the BMC erroneously reported the backup SBIOS version as 0.0.

Changes in 01.05.20

  • SNMP RO/RW string now disabled by default.

Changes in 01.05.12

  • Added LDAPS (secure LDAP) support.

  • Resolved network connection getting lost when connected to virtual media.

  • Resolved an issue where occasionally the BMC UI would stop responding.

Changes in 01.05.10

  • Fixed an issue with BMC 01.05.07 that potentially affected SBIOS stability.

  • Fixed BMC configuration settings not getting applied to both primary and secondary images.

  • Fixed corrupted primary BMC failing to recover when primary and secondary images are different versions.

  • Fixed issue recovering corrupted firmware on Delta PSU.

  • Fixed BMC web UI reporting BIOS information incorrectly.

  • Fixed BMC Web UI reporting backup BMC version incorrectly.

  • Fixed cryptic BMC entries.

  • Added BMC capture of logs from the CPLD/FPGA during power on.

  • Added IPMI OEM command to GET and SET which image the SBIOS is pointing to (Change the PIN).

  • Fixed MaxP/MaxQ system unable to boot after BMC-initiated shutdown with four or more PSU failures.

  • Fixed SEL logs to indicate that a bad fan (or fan speed of zero) may have caused the system to shut down due to GPU overtemp.

  • Fixed how the BMC responds when it cannot read a temperature sensor.

  • Fixed the IPMI log event decoding through ipmitool to show the same events as the GUI.

  • Fixed the BMC to provide more meaningful and useful SEL logs.

  • Fixed the GPU sensor name on baseboard 2 to match the service label.

  • Changed the naming of U.2 SSDs from “NVME” to “U.2”.

  • Resolved BMC SNMP community string limitations.

Changes in 01.04.03

  • Fixed BMC Update Timeout issue.

  • Fixed BMC configuration backup/restore function not working properly.

  • Fixed system not shutting down when all fans in Fan Zone 2 or 3 are not detected.

  • Fixed system fans all running at 80% after hot-unplugging/hot-plugging a PSU.

  • Fixed system fans running at 80% after hot-plugging an NVMe drive.

  • Fixed system shutting down after hot-unplugging one of the fans.

  • Fixed system unable to boot after updating BMC image while one BMC module is removed.

  • Fixed incorrect SEL timestamp after executing ipmitool mc reset cold.

  • Fixed missing firmware information in the BMC dashboard. Information is available on the Maintenance > Firmware Information page.

  • Fixed missing DIMM information in the BMC dashboard.

  • Fixed blinking amber-colored power LED.

  • Fixed BMC update freeze while updating using Yafuflash.

  • Fixed issues responding to 3.3V/5V/12V sensors.

  • Fixed incorrect responses to GPU temperature assertion - Fan Zone 1 goes to 80% and DIMM temperature reports ‘device disabled’.

  • The BMC now saves CPU MCA registers when it detects a fatal MCA error.

Changes in 01.00.01

  • Fixed BMC update via dashboard erroneously preserving the configuration.

  • Fixed Network Link Configuration and Network IP Settings pages on the BMC dashboard to reflect changes only when saved.

  • Added dual FPGA image container update support.

  • Added PSU firmware container update support.

  • Enhanced SMBPBI support for GPU sensors, thermal polling, and fan control to avoid anomalous GPU sensor readings and the thermal actions they trigger.

  • Added support for FPGA update of Image #1 to the BMC dashboard.

  • Added VLAN support to the BMC dashboard.

DGX-2 SBIOS Changes

Changes in 0.33

  • Included updated microcode version 0x2006e05, which resolves potential processor hangs with CATERR on specific workloads while Hyperthreading is enabled.

  • The SBIOS update includes software security enhancements. See the NVIDIA Security Bulletin for details.

Changes in 0.29

  • Implemented updates to address vulnerabilities in the SSA option, Intel CSME, SPS, and TXE.

  • Fixed an issue where updating the SBIOS via the BMC web UI resets the EFI varstore, which removes any changes made to use a signed bootloader.

Changes in 0.26

  • Fixed issue where the system would not boot if one NVMe drive was bad.

  • Fixed issue where each DGX-2 product UUID is not unique.

Changes in 0.24

  • Fixed erroneous events getting logged after system cold reboot.

  • Incorporated Intel microcode to mitigate new side-channel attacks (ZombieLoad).

  • Fixed boot failure when BMC Self Test Status is “Failed”.

  • Re-enabled Hyperthreading option in SBIOS.

  • Fixed SMBIOS type 9 tables not filled in properly.

Changes in 0.22

  • Fixed system failing to switch to backup SBIOS when initial boot fails.

  • Fixed enp6s0 network disappearing after enabling M.2 module hot plug in the SBIOS settings.

  • Fixed system unable to boot after replacing a DIMM.

  • Updated the boot recovery process for when the BMC remains unresponsive during boot. If the BMC reset fails, the system boots to the SBIOS setup menu.

  • Fixed the default PCIe Corrected Error Threshold Counter setting to be enabled.

Changes in 0.17

  • Added SBIOS support for recovering degraded PCIe link during system boot.

  • Enhanced debug capability and support for faster resolution of customer cases via fully decoded MCA, Memory, POST and PCIe SEL events.

  • Developed an in-memory PCIe topology in the SBIOS to avoid a full PCIe scan, which in turn eliminates unexpected Unsupported Request errors (PCIe correctable errors).

  • Enabled Error Logging options (to enable or disable verbose logging) in the SBIOS setup menu.

  • Added support for changing boot order using standard IPMI interface.
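As an illustration of the last item, a boot device override is commonly requested over IPMI with the `ipmitool chassis bootdev` command. The sketch below only builds the command line and does not run it. It assumes ipmitool as the client (these notes mention ipmitool elsewhere but do not name a tool for this feature); the function name `bootdev_command` and the host and user values are hypothetical, and the source does not confirm which boot devices the DGX-2 honors.

```python
import shlex

# Device names accepted by `ipmitool chassis bootdev`.
BOOT_DEVICES = {"none", "pxe", "disk", "safe", "diag", "cdrom", "bios", "floppy"}


def bootdev_command(device, persistent=False, host=None, user=None):
    """Build (but do not run) an ipmitool command that sets the boot device."""
    if device not in BOOT_DEVICES:
        raise ValueError(f"unsupported boot device: {device!r}")
    command = ["ipmitool"]
    if host is not None and user is not None:
        # Remote access over the LAN; with no password option given,
        # ipmitool prompts for the password interactively.
        command += ["-I", "lanplus", "-H", host, "-U", user]
    command += ["chassis", "bootdev", device]
    if persistent:
        # Without this option the override applies to the next boot only.
        command.append("options=persistent")
    return command


if __name__ == "__main__":
    # Hypothetical BMC address and account, for illustration only.
    print(shlex.join(bootdev_command("pxe", host="bmc.example.com", user="admin")))
    print(shlex.join(bootdev_command("disk", persistent=True)))
```

Returning the argument list instead of executing it keeps the sketch safe to run anywhere; the list can be passed to `subprocess.run` once the target and credentials are confirmed.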

DGX-2 PSU Changes

Changes in 2.7

  • Fixed power-factor and load-balancing issues.

  • Fixed PSU not getting powered on.

Changes in 2.5

  • Fixed power load balancing issue at light loads.

  • Fixed power factor on the PDU showing low value which affects outlet wattage.

  • Fixed an issue in the COM firmware that could cause a bootloader failure while updating from older PSU firmware.

  • Fixed BMC Update Timeout issue.

DGX-2 U.2 Firmware Release Notes

Changes in 11300DU0

  • Resolved an intermittent I/O timeout issue and NVMe controller reset events during high disk load.

Changes in 101008S0

  • Increased robustness of host read error handling.

  • Corrected adaptive tracking of NAND read thresholds.

  • Corrected potential invalid media error reported to host during power cycling.

  • Implemented general error handling and stability improvements.