DGX-2 System Firmware Changes
This topic contains the list of changes for the following DGX-2 firmware components.
BMC: See DGX-2 BMC Changes.
System BIOS: See DGX-2 SBIOS Changes.
Power Supply Units: See DGX-2 PSU Changes.
U.2 SSDs: See DGX-2 U.2 Firmware Release Notes.
DGX-2 BMC Changes
Changes in 01.09.00
Increased the configurable maximum limit of KVM idle timeout to 10,800 seconds. This avoids timeouts while mounting large ISO images over a slow BMC network.
The following table lists a potential security vulnerability that has been reported by AMI. It is addressed in DGX-2 BMC version 01.09.00.
Affected BMC versions: All BMC versions prior to 01.09.00
Updated BMC version: 01.09.00
Firmware container version: 24.03.1
CVE IDs Addressed | Vendor (per NVD)
CVE-2023-37293 | AMI
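The "prior to 01.09.00" rule above can be checked programmatically. The following is a minimal sketch, assuming that BMC version strings are dotted numeric fields that compare field by field as integers (so that 01.05.20 is newer than 01.05.7). The function names are hypothetical and not part of any NVIDIA tool.

```python
from typing import Tuple

# Version that contains the fix, taken from the notes above.
FIXED_BMC_VERSION = "01.09.00"


def parse_version(version: str) -> Tuple[int, ...]:
    """Split a dotted version string such as '01.07.2' into integer fields."""
    return tuple(int(part) for part in version.strip().split("."))


def is_affected(installed: str, fixed: str = FIXED_BMC_VERSION) -> bool:
    """Return True if the installed version is older than the fixed version."""
    return parse_version(installed) < parse_version(fixed)


if __name__ == "__main__":
    for candidate in ("01.05.20", "01.08.00", "01.09.00"):
        status = "affected" if is_affected(candidate) else "not affected"
        print(candidate, status)
```

Comparing integer tuples rather than raw strings avoids mis-ordering versions whose fields have different widths, such as 01.07.2 and 01.06.6.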
Changes in 01.08.00
The BMC update includes software security enhancements. See the NVIDIA Security Bulletin for details.
Changes in 01.07.2
BMC SEL entries can now be saved to a remote syslog server.
Fixed an issue where the backup/restore function did not work.
Fixed an issue where the MB CPLD, VBIOS, and PLX switch EEPROM versions are missing from the BMC web UI.
Added the ability to get the status of power supply redundancy via SNMP.
Fixed an issue where resetting a BMC password (via the BMC web UI) to an ‘easy’ password results in an error.
Fixed an issue where, if the BMC was updated without preserving all the settings, then updating the SBIOS would result in losing all SBIOS settings.
Changes in 01.06.6
Fixed an issue that caused the nvsm health service to fail because it could not recognize the platform. Platform details were left unpopulated and showed ‘To be filled by O.E.M.’
Added code to report when fan speeds are increased to 80% and to identify the sensor that triggered the increase.
Added code to track GPU page retirements.
The system now boots from the secondary BMC image if there is a problem booting from the primary image.
Added CPLD Dump option to the BMC Web UI Maintenance > Diagnostic Dump Data page.
Added CPU CATERR Dump option to the BMC Web UI Maintenance > Diagnostic Dump Data page.
Added FAN DIAG Dump option to the BMC Web UI Maintenance > Diagnostic Dump Data page.
Fixed BMC errors not getting sent to the syslog facility (Splunk).
Improved robustness of BMC update.
Implemented RESTful API to detect and restore EEPROM content during update.
Fixed issue where a failed BMC update rendered the BMC unreachable.
Fixed issue where a failed BMC update rendered the system unable to be powered on.
Fixed issue where the BMC stops logging SEL events if it cannot receive a timestamp from the Intel Management Engine (ME).
Fixed issue where updates from BMC version 01.05.07 failed, leaving the system unable to boot.
Fixed battery voltage lower thresholds.
Fixed inability to change MaxQ/MaxP power mode.
Fixed inability to configure the BMC event filter from the Web UI.
Removed FQDN for LDAP in the Web UI.
Fixed an issue where the BMC was not resetting the SBIOS-fail-safe flag after recovery from a boot failure.
Fixed an issue where the system still booted when GPU fans from multiple fan zones were removed.
Fixed missing interface name in Web UI Service Configuration page.
Fixed missing NVSwitch temperature sensor readings.
Fixed BMC occasionally disabling GPU temperature sensors.
Fixed issue where SEL entries time-stamped after 9/21 showed “Pre-Init Time-stamp” instead of log information.
Fixed an issue where the BMC erroneously reported the backup SBIOS version as 0.0.
Changes in 01.05.20
The SNMP RO/RW community strings are now disabled by default.
Changes in 01.05.12
Added LDAPS (secure LDAP) support.
Resolved network connection getting lost when connected to virtual media.
Resolved an issue where occasionally the BMC UI would stop responding.
Changes in 01.05.10
Fixed an issue with BMC 01.05.07 that potentially affected SBIOS stability.
Fixed BMC configuration settings not getting applied to both primary and secondary images.
Fixed corrupted primary BMC failing to recover when primary and secondary images are different versions.
Fixed issue recovering corrupted firmware on Delta PSU.
Fixed BMC web UI reporting BIOS information incorrectly.
Fixed BMC Web UI reporting backup BMC version incorrectly.
Fixed cryptic BMC entries.
Added BMC capture of logs from the CPLD/FPGA during power-on.
Added IPMI OEM command to GET and SET which image the SBIOS is pointing to (Change the PIN).
Fixed MaxP/MaxQ system unable to boot after BMC-initiated shutdown with four or more PSU failures.
Fixed SEL logs to indicate that a bad fan (or fan speed of zero) may have caused the system to shut down due to GPU overtemp.
Fixed how the BMC responds when it cannot read a temperature sensor.
Fixed the IPMI log event decoding through ipmitool to show the same events as the GUI.
Fixed the BMC to provide more meaningful and useful SEL logs.
Fixed the GPU sensor name on baseboard 2 to match the service label.
Changed the naming of U.2 SSDs from “NVME” to “U.2”.
Resolved BMC SNMP community string limitations.
Changes in 01.04.03
Fixed BMC Update Timeout issue.
Fixed BMC configuration backup/restore function not working properly.
Fixed system not shutting down when all fans in Fan Zone 2 or 3 are not detected.
Fixed system fans all running at 80% after hot-unplugging/hot-plugging a PSU.
Fixed system fans running at 80% after hot-plugging an NVMe drive.
Fixed system shutting down after hot-unplugging one of the fans.
Fixed system unable to boot after updating BMC image while one BMC module is removed.
Fixed incorrect SEL timestamp after executing ipmitool mc reset cold.
Fixed missing firmware information in the BMC dashboard. Information is available on the Maintenance > Firmware Information page.
Fixed missing DIMM information in the BMC dashboard.
Fixed blinking amber-colored power LED.
Fixed BMC update freeze while updating using Yafuflash.
Fixed issues responding to 3.3V/5V/12V sensors.
Fixed incorrect responses to GPU temperature assertion - Fan Zone 1 goes to 80% and DIMM temperature reports ‘device disabled’.
The BMC now saves CPU MCA registers when it detects a fatal MCA error.
Changes in 01.00.01
Fixed BMC update via dashboard erroneously preserving the configuration.
Fixed Network Link Configuration and Network IP Settings pages on the BMC dashboard to reflect changes only when saved.
Added dual FPGA image container update support.
Added PSU firmware container update support.
Enhanced SMBPBI support for GPU sensors, thermal polling, and fan control to avoid anomalous GPU sensor readings and the thermal actions they would trigger.
Added support for FPGA update of Image #1 to the BMC dashboard.
Added VLAN support to the BMC dashboard.
DGX-2 SBIOS Changes
Changes in 0.33
Included updated microcode version 0x2006e05, which resolves potential processor hangs with CATERR on specific workloads when Hyperthreading is enabled.
The SBIOS update includes software security enhancements. See the NVIDIA Security Bulletin for details.
Changes in 0.29
Implemented updates to address vulnerabilities in the SSA option, Intel CSME, SPS, and TXE.
Fixed an issue where updating the SBIOS via the BMC web UI resets the EFI varstore, which removes any changes made to use a signed bootloader.
Changes in 0.26
Fixed issue where the system would not boot if one NVMe drive was bad.
Fixed issue where the product UUID was not unique to each DGX-2 system.
Changes in 0.24
Fixed erroneous events getting logged after system cold reboot.
Incorporated Intel microcode to mitigate new side channel attacks (ZombieLoad).
Fixed boot failure when BMC Self Test Status is “Failed”.
Re-enabled Hyperthreading option in SBIOS.
Fixed SMBIOS type 9 tables not filled in properly.
Changes in 0.22
Fixed system failing to switch to backup SBIOS when initial boot fails.
Fixed enp6s0 network disappearing after enabling M.2 module hot plug in the SBIOS settings.
Fixed system unable to boot after replacing a DIMM.
Updated the boot recovery process for when the BMC remains unresponsive during boot. If the BMC reset fails, the system boots to the SBIOS setup menu.
Fixed the default PCIe Corrected Error Threshold Counter setting to be enabled.
Changes in 0.17
Added SBIOS support for recovering degraded PCIe link during system boot.
Enhanced debug capability and support for faster resolution of customer cases via fully decoded MCA, Memory, POST and PCIe SEL events.
Developed an in-memory PCIe topology in the SBIOS to avoid a full PCIe scan, which in turn eliminates unexpected Unsupported Request errors (PCIe correctable errors).
Enabled Error Logging options (enable or disable verbose logging) in the SBIOS setup menu.
Added support for changing boot order using standard IPMI interface.
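As an illustration of the boot-order item above, the following is a minimal sketch that builds an ipmitool chassis bootdev command line. It assumes ipmitool is the IPMI client in use; the helper name, host address, and user name are hypothetical placeholders. The sketch only constructs the argument list and does not contact a BMC.

```python
from typing import List, Optional

# Boot devices accepted by "ipmitool chassis bootdev".
VALID_DEVICES = {"none", "pxe", "disk", "safe", "diag", "cdrom", "bios", "floppy"}


def build_bootdev_command(
    device: str,
    host: Optional[str] = None,
    user: Optional[str] = None,
    persistent: bool = False,
    efi: bool = False,
) -> List[str]:
    """Return the ipmitool argument list that selects the next boot device."""
    if device not in VALID_DEVICES:
        raise ValueError("unsupported boot device: " + device)

    cmd = ["ipmitool"]
    if host is not None:
        # Remote access over the LAN interface; ipmitool prompts for the
        # password when none is supplied on the command line.
        cmd += ["-I", "lanplus", "-H", host]
        if user is not None:
            cmd += ["-U", user]

    cmd += ["chassis", "bootdev", device]

    options = []
    if persistent:
        options.append("persistent")  # keep the setting across reboots
    if efi:
        options.append("efiboot")  # request a UEFI boot
    if options:
        cmd.append("options=" + ",".join(options))
    return cmd


if __name__ == "__main__":
    # Hypothetical BMC address and user name, for illustration only.
    print(" ".join(build_bootdev_command("pxe", host="192.0.2.10", user="admin", efi=True)))
```

The returned list can be passed to subprocess.run once the target BMC and credentials have been confirmed for the system in question.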
DGX-2 PSU Changes
Changes in 2.7
Fixed power-factor and load-balancing issues.
Fixed PSU not getting powered on.
Changes in 2.5
Fixed power load balancing issue at light loads.
Fixed the power factor on the PDU showing a low value, which affects outlet wattage.
Fixed an issue in the COM firmware that may cause a bootloader failure while updating from older PSU firmware.
Fixed BMC Update Timeout issue.
DGX-2 U.2 Firmware Release Notes
Changes in 11300DU0
Resolved an intermittent I/O timeout issue and NVMe controller reset events during high disk load.
Changes in 101008S0
Increased robustness of host read error handling.
Corrected adaptive tracking of NAND read thresholds.
Corrected potential invalid media error reported to host during power cycling.
Implemented general error handling and stability improvements.