DGX A100 Firmware Changes

This chapter contains the list of changes for the following DGX A100 firmware components.

DGX A100 BMC Changes

Changes in 00.17.06

  • Fixed an issue that prevented certain sensors from being displayed in the BMC Web UI.
  • Fixed the handling of system power loss so that it is graceful, which prevents BMC flash file system consistency issues and improves recovery.
  • Fixed issues that caused the BMC usage to dramatically increase, which resulted in a POST failure with error code 91 or B4.

    This fix also improves the error handling in the Redfish interface.

  • Fixed the BMC Web UI security settings and page refresh during full screen mode.
  • Fixed the BMC SEL Event page, which was causing an error when parsing certain SEL records.
  • Fixed an issue where the Power/Status LED was flashing continuously after the server was rebooted, and the Power/Status LED stayed on after the server was powered off.

Changes in 00.16.09

  • Fixed incorrect temperatures reported for sensors on the NVIDIA Networking ConnectX-6 single-port and dual-port VPI cards.
  • Fixed a bug to ensure that the BMC will boot to the latest version updated on the system.

  • Fixed SEL log not showing the correct BMC or SBIOS version after an update.

  • Added ability to set the BMC to local time instead of default UTC.

  • Added ability to sync local time to NTP servers (enable NTP time sync).

  • Removed unnecessary SEL log messages pointing to high CPU power consumption.

  • Fixed "/" character not allowed in BMC web UI LDAP Role Group settings.

  • Added authentication capabilities to the RESTful API.

  • Added new capabilities to identify firmware updates in the System Event Log (SEL) via "NVIDIA-firmware" event.

    Adds SEL information for BMC (end), BIOS, CPLD, and PSU.

Changes in 00.14.17

  • Added support for second source SPI ROM.

Changes in 00.14.16

  • Fixed an issue where a cold boot might put the BMC in a non-bootable state.
  • Fixed BMC update failing with "Error flashing Inactive image 2: rc = 0x-9".
  • Fixed occasionally needing to log into the BMC WebUI twice.
  • Fixed the BMC dashboard system event filter not working.
  • Added ability to monitor Mellanox card transceiver temperatures and increase fan speeds.
  • Fixed inability to update the BMC after unexpected interruption.
  • Fixed missing memory, NIC and storage drive information.

Changes in 00.13.16

Changes in 00.13.04

  • Resolved increased fan speed that occurred when optional components were not installed, even when the system was idle.

DGX A100 SBIOS Changes

Changes in 1.13

  • Fixed two issues that were causing boot order settings to not be saved to the BMC if applied out-of-band, causing settings to be lost after a subsequent firmware update.
  • Added interactive countdown messages during boot, to display the Setup Prompt Timeout configurable through the Boot > Setup Prompt Timeout configuration menu.
  • Added reporting of AGESA Version in SMBIOS.
  • Updated AGESA to version 1.0.0.D.

Changes in 1.09

  • Fixed an issue where changes in the boot order are not preserved after updating the SBIOS.
  • Fixed inability to enter the SBIOS Admin/User password from the Serial Over LAN (SOL) console.

  • Fixed PXE boot configuration not persisting; helpful for multiple DGX A100 nodes.

  • Added Memory correctable ECC Error leaky bucket; prevents unnecessary replacement of working system DIMMs.

  • Fixed SBIOS Setup > Main page showing incorrect Admin/User Access level.
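
The leaky bucket mentioned above can be illustrated with a minimal sketch. The notes do not describe how the SBIOS implements it, so the class name, the threshold, and the leak interval below are hypothetical values chosen only for illustration. The idea is that each correctable error adds to a counter that drains over time, so isolated errors never accumulate while a sustained burst does.

```python
from dataclasses import dataclass


@dataclass
class LeakyBucket:
    """Counts correctable errors, draining the count at a fixed rate."""

    threshold: int        # hypothetical: level at which the DIMM is flagged
    leak_interval: float  # hypothetical: seconds for one error to drain away
    level: float = 0.0
    last_time: float = 0.0

    def record_error(self, now: float) -> bool:
        """Register one correctable error at time `now` (in seconds).

        Returns True if the bucket has reached the threshold.
        """
        elapsed = now - self.last_time
        # Drain the bucket for the time elapsed since the previous error.
        self.level = max(0.0, self.level - elapsed / self.leak_interval)
        self.last_time = now
        self.level += 1.0
        return self.level >= self.threshold


# Errors spaced far apart drain away before they can accumulate.
sparse = LeakyBucket(threshold=3, leak_interval=60.0)
sparse_flags = [sparse.record_error(t) for t in (0.0, 120.0, 240.0, 360.0)]

# A burst of errors fills the bucket faster than it leaks.
burst = LeakyBucket(threshold=3, leak_interval=60.0)
burst_flags = [burst.record_error(t) for t in (0.0, 1.0, 2.0, 3.0)]
```

With these assumed parameters, the widely spaced errors never flag the DIMM, while the fourth error of the burst does.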

Changes in 0.34

  • Removed warning message that occurred when the system contained DIMMs from different vendors.

Changes in 0.33

  • Fixed mishandling of correctable PCIe errors.

Changes in 0.30

  • Added support for HTTP boot.
  • Updated DSP/USP preset values to address PCIe advanced error reporting (AER) issues.
  • Changed the following default settings.
    • Determinism Control > [Manual]
    • Determinism Slider > [Power]
    • cTDP Control > [Manual]
    • cTDP > [240]
    • Package Power Limit Control > [Manual]
    • Package Power Limit > [240]
    • DF Cstates > [Disabled]
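
As a rough aid for auditing a system against the defaults listed above, the values can be held in a mapping and compared with a settings export. This is only a sketch: the helper `diff_from_defaults`, the idea of a settings export as a dictionary, and the non-default value `"Auto"` are assumptions for illustration and do not come from the SBIOS or any NVIDIA tool.

```python
from typing import Dict, Optional, Tuple

# Default values listed for SBIOS 0.30, keyed by setting name.
SBIOS_0_30_DEFAULTS: Dict[str, str] = {
    "Determinism Control": "Manual",
    "Determinism Slider": "Power",
    "cTDP Control": "Manual",
    "cTDP": "240",
    "Package Power Limit Control": "Manual",
    "Package Power Limit": "240",
    "DF Cstates": "Disabled",
}


def diff_from_defaults(
    current: Dict[str, str],
) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {setting: (current_value, default_value)} for every mismatch.

    A setting missing from `current` is reported with a current value of None.
    """
    return {
        name: (current.get(name), default)
        for name, default in SBIOS_0_30_DEFAULTS.items()
        if current.get(name) != default
    }


# Hypothetical settings export with one value changed from the default.
example = dict(SBIOS_0_30_DEFAULTS)
example["DF Cstates"] = "Auto"  # "Auto" is an assumed, illustrative value
mismatches = diff_from_defaults(example)
```

Here `mismatches` contains only the `DF Cstates` entry, paired with the documented default.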

DGX A100 U.2 NVMe Changes

Changes in EPK9CB5Q

  • Fixed drive going into read-only mode if there is a sudden power cycle while performing a live firmware update.

  • Improved write performance while performing drive wear-leveling; shortens wear-leveling process time.

  • Fixed drive going into failed mode when a high number of uncorrectable ECC errors occurred.

DGX A100 Broadcom 88096 PCIe Switchboard Changes

Changes in 0.2.0

  • Fixed the incorrect setting of the switch's Upstream Port Number as Port 0.

Changes in 1.8

  • Implemented tuning to address PCIe advanced error reporting (AER) issues.

Changes in 1.3

  • Disabled hot-plug and hot-plug surprise capability.

DGX A100 Broadcom 880xx Retimer Changes

Release notes for the DGX A100 Broadcom 88080 and 88064 retimers.

Changes in 1.2f

  • Fixed an issue that caused NVQual to hang while loading the MODS driver.

Changes in 0.F.0

  • Improved error handling of downstream switches.

    This change modifies the PCIe topology and mapping. Refer to the DGX A100 User Guide for PCIe mapping details.

Changes in 0.13.0

  • Fixed DPC Notification behavior for Firmware First Platform.

A100 VBIOS Changes

Changes in 92.00.45.00.03/05

  • Added security protection to the I2C interface.

Changes in 92.00.36.00.04

  • Fixed an issue allocating the BAR1 size across resets.
  • Fixed MIG capability not being reported correctly if the driver is not loaded; for example, if accessed out-of-band.

Changes in 92.00.19.00.10

  • Expanded support for potential alternate HBM sources.

Changes in 92.00.19.00.01

  • Fixed Xid 64 (Row Remapper Error).

DGX A100 BMC CEC Changes

Changes in 3.28

  • Fixed the update progress output reporting "Update_timeout" for the motherboard CEC (MB_CEC) when using the .run file without Docker installed.
  • Fixed the user's configuration getting lost if the BMC update failed.

DGX A100 BMC CEC SPI Changes

Changes in 01.05.12

  • Added LDAPS (secure LDAP) support.
  • Resolved network connection getting lost when connected to virtual media.
  • Resolved an issue where occasionally the BMC UI would stop responding.

Changes in 01.05.10

  • Fixed an issue with BMC 01.05.07 that potentially affected SBIOS stability.
  • Fixed BMC configuration settings not getting applied to both primary and secondary images.
  • Fixed corrupted primary BMC failing to recover when primary and secondary images are different versions.
  • Fixed issue recovering corrupted firmware on Delta PSU.
  • Fixed BMC web UI reporting BIOS information incorrectly.
  • Fixed BMC Web UI reporting backup BMC version incorrectly.
  • Fixed cryptic BMC entries.
  • Added BMC capture logs from CPLD/FPGA during power on.
  • Added IPMI OEM command to GET and SET which image the SBIOS is pointing to (Change the PIN).
  • Fixed MaxP/MaxQ System unable to boot after BMC-initiated shutdown with four or more PSU failures.
  • Fixed SEL logs to indicate that a bad fan (or fan speed of zero) may have caused the system to shut down due to GPU overtemp.
  • Fixed how the BMC responds when it cannot read a temperature sensor.
  • Fixed the IPMI log event decoding through ipmitool to show the same events as the GUI.
  • Fixed the BMC to provide more meaningful and useful SEL logs.
  • Fixed the GPU sensor name on baseboard 2 to match the service label.
  • Changed the naming of U.2 SSDs from "NVME" to "U.2".
  • Resolved BMC SNMP community string limitations.

Changes in 01.04.03

  • Fixed BMC Update Timeout issue.
  • Fixed BMC configuration backup/restore function not working properly.
  • Fixed system not shutting down when all fans in Fan Zone 2 or 3 are not detected.
  • Fixed system fans all running at 80% after hot-unplugging/hot-plugging a PSU.
  • Fixed system fans running at 80% after hot-plugging an NVMe drive.
  • Fixed system shutting down after hot-unplugging one of the fans.
  • Fixed system unable to boot after updating BMC image while one BMC module is removed.
  • Fixed incorrect SEL timestamp after executing ipmi mc reset cold.
  • Fixed missing firmware information in the BMC dashboard. Information is available on the Maintenance->Firmware Information page.
  • Fixed missing DIMM information in the BMC dashboard.
  • Fixed blinking amber-colored power LED.
  • Fixed BMC update freeze while updating using Yafuflash.
  • Fixed issues responding to 3.3V/5V/12V sensors.
  • Fixed incorrect responses to GPU temperature assertion - Fan Zone 1 goes to 80% and DIMM temperature reports 'device disabled'.
  • The BMC now saves CPU MCA registers when it detects a fatal MCA error.

Changes in 01.00.01

  • Fixed BMC update via dashboard erroneously preserving the configuration.
  • Fixed Network Link Configuration and Network IP Settings pages on the BMC dashboard to reflect changes only when saved.
  • Added dual FPGA image container update support.
  • Added PSU firmware container update support.
  • Enhanced SMBPBI support for GPU sensors, thermal polling and FAN control to avoid anomalous sensor reading for GPU sensors and corresponding thermal actions.
  • Added support for FPGA update of Image #1 to the BMC dashboard.
  • Added VLAN support to the BMC dashboard.

DGX A100 BMC Changes

Changes in 01.05.12

  • Added LDAPS (secure LDAP) support.
  • Resolved network connection getting lost when connected to virtual media.
  • Resolved an issue where occasionally the BMC UI would stop responding.

Changes in 01.05.10

  • Fixed an issue with BMC 01.05.07 that potentially affected SBIOS stability.
  • Fixed BMC configuration settings not getting applied to both primary and secondary images.
  • Fixed corrupted primary BMC failing to recover when primary and secondary images are different versions.
  • Fixed issue recovering corrupted firmware on Delta PSU.
  • Fixed BMC web UI reporting BIOS information incorrectly.
  • Fixed BMC Web UI reporting backup BMC version incorrectly.
  • Fixed cryptic BMC entries.
  • Added BMC capture logs from CPLD/FPGA during power on.
  • Added IPMI OEM command to GET and SET which image the SBIOS is pointing to (Change the PIN).
  • Fixed MaxP/MaxQ System unable to boot after BMC-initiated shutdown with four or more PSU failures.
  • Fixed SEL logs to indicate that a bad fan (or fan speed of zero) may have caused the system to shut down due to GPU overtemp.
  • Fixed how the BMC responds when it cannot read a temperature sensor.
  • Fixed the IPMI log event decoding through ipmitool to show the same events as the GUI.
  • Fixed the BMC to provide more meaningful and useful SEL logs.
  • Fixed the GPU sensor name on baseboard 2 to match the service label.
  • Changed the naming of U.2 SSDs from "NVME" to "U.2".
  • Resolved BMC SNMP community string limitations.

Changes in 01.04.03

  • Fixed BMC Update Timeout issue.
  • Fixed BMC configuration backup/restore function not working properly.
  • Fixed system not shutting down when all fans in Fan Zone 2 or 3 are not detected.
  • Fixed system fans all running at 80% after hot-unplugging/hot-plugging a PSU.
  • Fixed system fans running at 80% after hot-plugging an NVMe drive.
  • Fixed system shutting down after hot-unplugging one of the fans.
  • Fixed system unable to boot after updating BMC image while one BMC module is removed.
  • Fixed incorrect SEL timestamp after executing ipmi mc reset cold.
  • Fixed missing firmware information in the BMC dashboard. Information is available on the Maintenance->Firmware Information page.
  • Fixed missing DIMM information in the BMC dashboard.
  • Fixed blinking amber-colored power LED.
  • Fixed BMC update freeze while updating using Yafuflash.
  • Fixed issues responding to 3.3V/5V/12V sensors.
  • Fixed incorrect responses to GPU temperature assertion - Fan Zone 1 goes to 80% and DIMM temperature reports 'device disabled'.
  • The BMC now saves CPU MCA registers when it detects a fatal MCA error.

Changes in 01.00.01

  • Fixed BMC update via dashboard erroneously preserving the configuration.
  • Fixed Network Link Configuration and Network IP Settings pages on the BMC dashboard to reflect changes only when saved.
  • Added dual FPGA image container update support.
  • Added PSU firmware container update support.
  • Enhanced SMBPBI support for GPU sensors, thermal polling and FAN control to avoid anomalous sensor reading for GPU sensors and corresponding thermal actions.
  • Added support for FPGA update of Image #1 to the BMC dashboard.
  • Added VLAN support to the BMC dashboard.

DGX A100 FPGA Release Notes

Features

  • Changes in 01.05.07
    • Fixed BMC configuration settings not getting applied to both primary and secondary images.
    • Fixed corrupted primary BMC failing to recover when primary and secondary images are different versions.
    • Fixed issue recovering corrupted firmware on Delta PSU.
    • Fixed BMC web UI reporting BIOS information incorrectly.
    • Fixed BMC Web UI reporting backup BMC version incorrectly.
    • Fixed cryptic BMC entries.
    • Added BMC capture logs from CPLD/FPGA during power on.
    • Added IPMI OEM command to GET and SET which image the SBIOS is pointing to (Change the PIN).
    • Fixed MaxP/MaxQ System unable to boot after BMC-initiated shutdown with four or more PSU failures.
    • Fixed SEL logs to indicate that a bad fan (or fan speed of zero) may have caused the system to shut down due to GPU overtemp.
    • Fixed how the BMC responds when it cannot read a temperature sensor.
    • Fixed the IPMI log event decoding through ipmitool to show the same events as the GUI.
    • Fixed the BMC to provide more meaningful and useful SEL logs.
    • Fixed the GPU sensor name on baseboard 2 to match the service label.
    • Changed the naming of U.2 SSDs from "NVME" to "U.2".
    • BMC SNMP Support on DGX-2
  • Changes in 01.04.03
    • Fixed BMC Update Timeout issue.
    • Fixed BMC configuration backup/restore function not working properly.
    • Fixed system not shutting down when all fans in Fan Zone 2 or 3 are not detected.
    • Fixed system fans all running at 80% after hot-unplugging/hot-plugging a PSU.
    • Fixed system fans running at 80% after hot-plugging an NVMe drive.
    • Fixed system shutting down after hot-unplugging one of the fans.
    • Fixed system unable to boot after updating BMC image while one BMC module is removed.
    • Fixed incorrect SEL timestamp after executing ipmi mc reset cold.
    • Fixed missing firmware information in the BMC dashboard. Information is available on the Maintenance->Firmware Information page.
    • Fixed missing DIMM information in the BMC dashboard.
    • Fixed blinking amber-colored power LED.
    • Fixed BMC update freeze while updating using Yafuflash.
    • Fixed issues responding to 3.3V/5V/12V sensors.
    • Fixed incorrect responses to GPU temperature assertion - Fan Zone 1 goes to 80% and DIMM temperature reports 'device disabled'.
    • The BMC now saves CPU MCA registers when it detects a fatal MCA error.
  • Changes in 01.00.01
    • Fixed BMC update via dashboard erroneously preserving the configuration.
    • Fixed Network Link Configuration and Network IP Settings pages on the BMC dashboard to reflect changes only when saved.
    • Added dual FPGA image container update support.
    • Added PSU firmware container update support.
    • Enhanced SMBPBI support for GPU sensors, thermal polling and FAN control to avoid anomalous sensor reading for GPU sensors and corresponding thermal actions.
    • Added support for FPGA update of Image #1 to the BMC dashboard.
    • Added VLAN support to the BMC dashboard.

DGX A100 Delta PSU Release Notes

Changes in 1.6/1.6/1.7

  • Fixed 0W reporting issue.