DGX A100 Firmware Changes

This chapter contains the list of changes for the following DGX A100 firmware components.

DGX A100 BMC Changes

Changes in 00.22.05

  • Improved the security of the BMC Redfish Host Interface and KVM interfaces.

  • Improved the correctness and accuracy of Data Center Manageability Interface (DCMI) power sensor value reporting.

  • Implemented a mechanism to block user-initiated BMC resets while a firmware update action is in progress.

  • Fixed the username validation with respect to certain special characters for LDAP Authentication to avoid security vulnerabilities.

  • Resolved a security vulnerability issue in the Service Location Protocol (SLP) feature.

  • The following table lists potential security vulnerabilities that have been reported by AMI or third-party vendors. They are addressed in DGX A100 BMC version 00.22.05.

    • Affected BMC versions: All BMC versions prior to 00.22.05

    • Updated BMC version: 00.22.05

    • Firmware container version: 23.12.1

      CVE IDs Addressed    Vendor (per NVD)
      -----------------    --------------------
      CVE-2023-34472       AMI
      CVE-2023-34330       AMI
      CVE-2023-34329       AMI
      CVE-2023-28863       AMI
      CVE-2021-44769       Nozomi Networks Inc.
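
The affected-version rule above (all BMC versions prior to 00.22.05) amounts to a numeric comparison of the dotted version fields. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes the installed version string has already been obtained in the same dotted-decimal form used in these notes.

```python
def parse_version(text):
    """Split a dotted version such as '00.22.05' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def is_affected(installed, fixed="00.22.05"):
    """Return True when the installed version is older than the fixed version."""
    return parse_version(installed) < parse_version(fixed)


if __name__ == "__main__":
    # Example versions taken from the headings in this chapter.
    for version in ("00.21.01", "00.22.05"):
        state = "affected" if is_affected(version) else "not affected"
        print(version, state)
```

Comparing integer tuples rather than raw strings avoids mistakes when fields differ in width or leading zeros.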

Changes in 00.21.01

  • Improved how sensor values are read to avoid intermittent errors. Previously, the sensor read errors could cause the system to power off unexpectedly.

Changes in 00.20.04

  • Fixed a BMC web interface login failure that reported a Session Expired message.

  • Updated the validation of LDAP configuration settings made through the BMC web UI to match the LDAP specification.

  • Enhanced the BMC to detect BIOS hangs before POST starts.

  • Improved the validation of new firewall rules added using ipmitool.

  • The BMC update includes software security enhancements. See the NVIDIA Security Bulletin DGX - June 2023 for details.

Changes in 00.19.07

  • Added support for a new version of the GPU baseboard.

  • Improved SNMP trap handling and updated SNMP MIB with additional description for better trap information.

  • Fixed a rare issue with NTP server configuration settings made from the BMC Web UI.

  • The BMC update includes software security enhancements. See the NVIDIA Security Bulletin DGX - December 2022 for details.

Changes in 00.18.03

  • Added support for a new version of the GPU baseboard.

Changes in 00.17.07

  • Fixed an issue so that certain sensors are now displayed in the BMC Web UI.

  • Fixed the graceful handling of system power loss, which prevents the BMC Flash file system consistency issue and improves recovery.

  • Fixed issues that caused the BMC usage to dramatically increase, which resulted in a POST failure with error code 91 or B4.

  • Improved Redfish interface error handling.

  • Fixed the BMC Web UI security settings and page refresh during full screen mode.

  • Fixed the BMC SEL Event page, which was causing an error when parsing certain SEL records.

  • Fixed an issue where the Power/Status LED was flashing continuously after the server was rebooted, and the Power/Status LED stayed on after the server was powered off.

Changes in 00.16.09

  • Fixed incorrect temperatures reported for sensors on the NVIDIA Networking ConnectX-6 single-port and dual-port VPI cards.

  • Fixed a bug to ensure that the BMC will boot to the latest version updated on the system.

  • Fixed SEL log not showing the correct BMC or SBIOS version after an update.

  • Added ability to set the BMC to local time instead of default UTC.

  • Added ability to sync local time to NTP servers (enable NTP time sync).

  • Removed unnecessary SEL log messages pointing to high CPU power consumption.

  • Fixed “/” character not allowed in BMC web UI LDAP Role Group settings.

  • Added authentication capabilities to the RESTful API.

  • Added new capabilities to identify firmware updates in the System Event Log (SEL) via the “NVIDIA-firmware” event.

    Adds SEL information for BMC (end), BIOS, CPLD, and PSU.

Changes in 00.14.17

  • Added support for second source SPI ROM.

Changes in 00.14.16

  • Fixed an issue where a cold boot might put the BMC in a non-bootable state.

  • Fixed BMC update failing with “Error flashing Inactive image 2: rc = 0x-9”.

  • Fixed occasionally needing to log into the BMC WebUI twice.

  • Fixed the BMC dashboard system event filter not working.

  • Added ability to monitor Mellanox card transceiver temperatures and increase fan speeds.

  • Fixed inability to update the BMC after unexpected interruption.

  • Fixed missing memory, NIC and storage drive information.

Changes in 00.13.16

Changes in 00.13.04

  • Resolved increased fan speed that occurred when optional components were not installed, even when the system was idle.

DGX A100 SBIOS Changes

Changes in 1.25

  • The SBIOS update includes software security enhancements. Refer to the NVIDIA Security Bulletin DGX - December 2023 for details.

  • Added error reporting when the system is booted with an incorrectly inserted Trusted Platform Module (TPM).

  • The following table lists potential security vulnerabilities that have been reported by AMI or third-party vendors. They are addressed in DGX A100 SBIOS version 1.25.

    • Affected SBIOS versions: All SBIOS versions prior to 1.25

    • Updated SBIOS version: 1.25

    • Firmware container version: 23.12.1

      CVE IDs Addressed    Vendor (per NVD)
      -----------------    ----------------
      CVE-2023-0465        AMI
      CVE-2021-38578       AMI
      CVE-2021-38576       AMI
      CVE-2021-38575       AMI
      CVE-2021-33164       AMI
      CVE-2019-14587       AMI
      CVE-2019-14586       AMI
      CVE-2019-14584       AMI
      CVE-2019-14563       AMI
      CVE-2019-14559       AMI
      CVE-2017-5715        AMI
      CVE-2014-4860        AMI
      CVE-2014-4859        AMI
      CVE-2023-1018        CERT/CC
      CVE-2023-1017        CERT/CC

Changes in 1.21

Changes in 1.18

  • Added support for a new version of the GPU baseboard.

  • Fixed issues relating to Redfish reporting of PCIe device types and speeds.

  • Removed unimplemented setup menu options for User Defaults and Boot NumLock State.

  • Updated AGESA to version 1.0.0.E.

  • The SBIOS update includes software security enhancements. See the NVIDIA Security Bulletin DGX - December 2022 for details.

Changes in 1.13

  • Fixed two issues that were causing boot order settings to not be saved to the BMC if applied out-of-band, causing settings to be lost after a subsequent firmware update.

  • Added interactive countdown messages during boot to display the Setup Prompt Timeout, which is configurable through the Boot > Setup Prompt Timeout configuration menu.

  • Added reporting of AGESA Version in SMBIOS.

  • Updated AGESA to version 1.0.0.D.

Changes in 1.09

  • Fixed an issue where changes in the boot order are not preserved after updating the SBIOS.

  • Fixed inability to enter the SBIOS Admin/User password from the Serial Over LAN (SOL) console.

  • Fixed PXE boot configuration not persisting; helpful for multiple DGX A100 nodes.

  • Added Memory correctable ECC Error leaky bucket; prevents unnecessary replacement of working system DIMMs.

  • Fixed SBIOS Setup > Main page showing incorrect Admin/User Access level.
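
The leaky bucket added in 1.09 for correctable memory ECC errors follows a general pattern: each error raises a counter that drains at a fixed rate, so only a sustained error rate crosses the threshold. The Python sketch below illustrates that general technique. It is not the SBIOS implementation, and the class name, threshold, and leak interval are hypothetical values chosen for illustration.

```python
class LeakyBucket:
    """Counts events and drains one count per leak interval (hypothetical values)."""

    def __init__(self, threshold=10, leak_interval_s=3600.0):
        self.threshold = threshold
        self.leak_interval_s = leak_interval_s
        self.level = 0
        self.last_leak = None  # time of the most recent drain step

    def _drain(self, now):
        # Remove one count for every full leak interval that has elapsed.
        if self.last_leak is None:
            self.last_leak = now
            return
        leaked = int((now - self.last_leak) // self.leak_interval_s)
        if leaked > 0:
            self.level = max(0, self.level - leaked)
            self.last_leak += leaked * self.leak_interval_s

    def record_error(self, now):
        """Record one correctable error at time `now` (in seconds).

        Returns True when the bucket level reaches the threshold.
        """
        self._drain(now)
        self.level += 1
        return self.level >= self.threshold
```

With this design, isolated errors spread over a long period drain away before they accumulate, while a burst of errors within a short window reaches the threshold. That matches the stated goal of not replacing DIMMs that are still working.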

Changes in 0.34

  • Removed warning message that occurred when the system contained DIMMs from different vendors.

Changes in 0.33

  • Fixed mishandling of correctable PCIe errors.

Changes in 0.30

  • Added support for HTTP boot.

  • Updated DSP/USP preset values to address PCIe advanced error reporting (AER) issues.

  • Changed the following default settings.

    • Determinism Control > [Manual]

    • Determinism Slider > [Power]

    • cTDP Control > [Manual]

    • cTDP > [240]

    • Package Power Limit Control > [Manual]

    • Package Power Limit > [240]

    • DF Cstates > [Disabled]
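
The changed defaults listed above can be captured as a small lookup table, which may help when auditing a system against them. This Python sketch is an illustration under stated assumptions: the setting names and values come from the list above, but the dictionary layout and the `find_deviations` helper are hypothetical, and reading the current settings from a system is out of scope here.

```python
# Default SBIOS settings changed in 0.30, taken from the list above.
SBIOS_0_30_DEFAULTS = {
    "Determinism Control": "Manual",
    "Determinism Slider": "Power",
    "cTDP Control": "Manual",
    "cTDP": "240",
    "Package Power Limit Control": "Manual",
    "Package Power Limit": "240",
    "DF Cstates": "Disabled",
}


def find_deviations(current):
    """Return {setting: (current_value, default_value)} for mismatches.

    `current` maps setting names to values; settings missing from it
    are reported with a current value of None.
    """
    return {
        name: (current.get(name), default)
        for name, default in SBIOS_0_30_DEFAULTS.items()
        if current.get(name) != default
    }
```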

DGX A100 U.2 NVMe Changes

Changes in EPK9CB5Q

  • Fixed drive going into read-only mode if there is a sudden power cycle while performing live firmware update.

  • Improved write performance while performing drive wear-leveling; shortens wear-leveling process time.

  • Fixed drive going into failed mode when a high number of uncorrectable ECC errors occurred.

DGX A100 Broadcom 88096 PCIe Switchboard Changes

Changes in 0.2.0

  • Fixed the incorrect setting of the switch’s Upstream Port Number as Port 0.

Changes in 1.8

  • Implemented tuning to address PCIe advanced error reporting (AER) issues.

Changes in 1.3

  • Disabled hot-plug and hot-plug surprise capability.

DGX A100 Broadcom 880xx Retimer Changes

Changes in 4.1.0

  • Updated configuration to support Delta baseboard D01.

Changes in 3.1.0

  • Fixed the issue that was reported in Broadcom v3.0 firmware.

Changes in 1.2f

  • Fixed an issue that caused NVQual to hang while loading the MODS driver.

Changes in 0.F.0

  • Improved error handling of downstream switches.

    This change modifies the PCIe topology and mapping. Refer to the DGX A100 User Guide for PCIe mapping details.

Changes in 0.13.0

  • Fixed DPC Notification behavior for Firmware First Platform.

DGX A100 VBIOS Changes

Changes in 92.00.81.00.01

  • Added support for the PG510 SXM module.

Changes in 92.00.45.00.03/05

  • Added security protection to the I2C interface.

Changes in 92.00.36.00.04

  • Fixed an issue allocating the BAR1 size across resets.

  • Fixed MIG capability not being reported correctly if the driver is not loaded; for example, if accessed out-of-band.

Changes in 92.00.19.00.10

  • Expanded support for potential alternate HBM sources.

Changes in 92.00.19.00.01

  • Fixed Xid 64 (Row Remapper Error).

DGX A100 BMC CEC Changes

Changes in 3.28

  • Fixed the update progress output reporting “Update_timeout” for the motherboard CEC (MB_CEC) when using the .run file without Docker installed.

  • Fixed the user’s configuration getting lost if the BMC update failed.

DGX A100 BMC CEC SPI Changes

Changes in 01.05.12

  • Added LDAPS (secure LDAP) support.

  • Resolved network connection getting lost when connected to virtual media.

  • Resolved an issue where occasionally the BMC UI would stop responding.

Changes in 01.05.10

  • Fixed an issue with BMC 01.05.07 that potentially affected SBIOS stability.

  • Fixed BMC configuration settings not getting applied to both primary and secondary images.

  • Fixed corrupted primary BMC failing to recover when primary and secondary images are different versions.

  • Fixed issue recovering corrupted firmware on Delta PSU.

  • Fixed BMC web UI reporting BIOS information incorrectly.

  • Fixed BMC Web UI reporting backup BMC version incorrectly.

  • Fixed cryptic BMC entries.

  • Added BMC capture logs from CPLD/FPGA during power on.

  • Added IPMI OEM command to GET and SET which image the SBIOS is pointing to (Change the PIN).

  • Fixed MaxP/MaxQ System unable to boot after BMC-initiated shutdown with four or more PSU failures.

  • Fixed SEL logs to indicate that a bad fan (or fan speed of zero) may have caused the system to shut down due to GPU overtemp.

  • Fixed how the BMC responds when it cannot read a temperature sensor.

  • Fixed the IPMI log event decoding through ipmitool to show the same events as the GUI.

  • Fixed the BMC to provide more meaningful and useful SEL logs.

  • Fixed the GPU sensor name on baseboard 2 to match the service label.

  • Changed the naming of U.2 SSDs from “NVME” to “U.2”.

  • Resolved BMC SNMP community string limitations.

Changes in 01.04.03

  • Fixed BMC Update Timeout issue.

  • Fixed BMC configuration backup/restore function not working properly.

  • Fixed system not shutting down when all fans in Fan Zone 2 or 3 are not detected.

  • Fixed system fans all running at 80% after hot-unplugging/hot-plugging a PSU.

  • Fixed system fans running at 80% after hot-plugging an NVMe drive.

  • Fixed system shutting down after hot-unplugging one of the fans.

  • Fixed system unable to boot after updating BMC image while one BMC module is removed.

  • Fixed incorrect SEL timestamp after executing ipmi mc reset cold.

  • Fixed missing firmware information in the BMC dashboard. Information is available on the Maintenance > Firmware Information page.

  • Fixed missing DIMM information in the BMC dashboard.

  • Fixed blinking amber-colored power LED.

  • Fixed BMC update freeze while updating using Yafuflash.

  • Fixed issues responding to 3.3V/5V/12V sensors.

  • Fixed incorrect responses to GPU temperature assertion: Fan Zone 1 went to 80% and DIMM temperature reported ‘device disabled’.

  • The BMC now saves CPU MCA registers when it detects a fatal MCA error.

Changes in 01.00.01

  • Fixed BMC update via dashboard erroneously preserving the configuration.

  • Fixed Network Link Configuration and Network IP Settings pages on the BMC dashboard to reflect changes only when saved.

  • Added dual FPGA image container update support.

  • Added PSU firmware container update support.

  • Enhanced SMBPBI support for GPU sensors, thermal polling, and fan control to avoid anomalous GPU sensor readings and the thermal actions they trigger.

  • Added support for FPGA update of Image #1 to the BMC dashboard.

  • Added VLAN support to the BMC dashboard.

DGX A100 FPGA Release Notes

Features

  • Changes in 4.02

    • Added support to enable alternate HSC components on new Delta board revision D01.

    • Enabled rollback protection against versions 0, 1, and 2.

  • Changes in 03.14

    • FPGA (GPU sled) 3.0e: This version of the FPGA fixes a GPU tray failing to power on during a system power cycle. When this failure occurs, LLC failures are reported in the SEL logs, and the system fails to boot or comes up with no GPUs. The new versions of the FPGA and CEC (GB sled) address this issue, and we recommend that every customer upgrade these components.

  • Changes in 2.A5

    • Fixed all reset domain crossing errors and warnings in SMBPBI controllers for better I2C stability.

    • Eliminated all the combo loops on all flops’ async resets.

    • Fixed certain timeout triggers in the clk buf module.

    • Fixed dropped or malformed requests that occurred as the result of a premature I2C interface release by the FPGA state machine.

    • Fixed the DC cycling issue that caused a PCIe enumeration failure due to clock buffer configuration.

    • Fixed the SMBPBI command to clear LLC C D Alerts.

    • Fixed the SMBPBI commands to control the individual GPU power brake.

    • Added proven de-hang logic for all state machines and I2C buses.

    • Added the metastability synchronizer for all GPIOs, including I2C buses, at the top level.

    • Updated the FRU Comparator for the chassis area write boundary for RTL.

    • Updated the I2C access policies for the register table to prevent unintended access.

    • Updated the reset policies for LR10.

    • Updated the initial 0x5C read command status to STATUS_READY.

    • Aligned all the return STATUSes to be compliant with the NonGPU SMBPBI document.

    • Updated inlet#2, inlet#1, and PEX8725 to configure the thermal parameter and interrupts to align with the spec.

    • Updated the PCIe buffer configuration logic so that it will not miss an ACK or go into a livelock while GPU base power enable is toggling.

    • Updated the GPU forward Controller to match the GPU spec for the execute command.

    • Fixed the SMBPBI commands that were ineffective for PCIe switch resets.

    • Updated the I2C Contention Mutex Logic for better resiliency and stability.

    • Improved the I2C timing from the FPGA I2C master modules.

DGX A100 Delta PSU Release Notes

Changes in 1.6/1.6/1.7

  • Fixed 0W reporting issue.