Firmware Changes for NVIDIA DGX B200 Systems#

BMC Changes for DGX B200 Systems#

Changes in 25.12.11#

  • Updated OpenSSH to version openssh-10.0p1-2.

  • Incorporated security fixes in the bootloader to address SquashFS vulnerabilities.

  • Resolved an issue where fans were running at maximum speed.

  • Resolved an issue where differing time zone settings between the BMC and HMC caused continuous communication recovery attempts.

  • Enhanced BMC stability.

Changes in 25.06.27#

  • Resolved the issue of high fan speed in zone 1 during system idle.

  • Fixed an issue that intermittently fetched an invalid HMC configuration path.

  • Resolved intermittent BMC Redfish unresponsiveness.

Changes in 25.02.12#

  • The initial BMC firmware version.

SBIOS Changes for DGX B200 Systems#

Changes in 1.6.7#

  • Fixed an issue where a boot order setting via Redfish API did not take effect.
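A boot order change of the kind described above can be sketched as follows. This is a minimal illustration that assumes the standard DMTF Redfish `Boot.BootOrder` property on a ComputerSystem resource. The system ID and boot option names are hypothetical placeholders, and the helper names are not part of any NVIDIA tool. The second helper shows one way to confirm that the setting took effect by comparing the order reported by a later GET with the requested order.

```python
import json


def build_boot_order_patch(system_id, boot_order):
    """Return (path, body) for a PATCH request that sets Boot.BootOrder.

    system_id and the boot option names are placeholders (assumptions);
    read the real values from the target system before using them.
    """
    if not boot_order:
        raise ValueError("boot_order must list at least one boot option")
    path = f"/redfish/v1/Systems/{system_id}"
    body = json.dumps({"Boot": {"BootOrder": list(boot_order)}})
    return path, body


def boot_order_applied(system_resource, expected):
    """Return True if a parsed ComputerSystem resource reports the expected order."""
    actual = system_resource.get("Boot", {}).get("BootOrder", [])
    return list(actual) == list(expected)
```

Comparing the order read back after the next boot with the order that was requested is a simple way to detect the symptom that this release addresses.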

Changes in 1.6.6#

  • The initial SBIOS firmware version.

GPU Tray Firmware Fixes#

Issues Fixed in 1.3.10#

  • In rare cases, the system may encounter an XID 149 error.

    To mitigate this issue, the default configuration of the NVLink low power state (L1) has been changed to disabled. In inference workloads, the performance impact is negligible. In training workloads, a small impact may be observed.

    The issue is fixed in NVSwitch version 35.2015.4718.

Issues Fixed in 1.3.00#

  • All NVLinks become inactive during 1N inference workload.

    When a burst of Team Multicast Setup requests is sent by an app, intermittent failures may be reported while allocating the Team Multicast groups and cause the app to hang. The failure occurs due to buffer corruption and credit exhaustion issues when a burst of setup requests is processed by the SW/FW pipeline.

  • RM Initialization failures following GPU reset.

  • GPU ECC Unlock NSM RAW Commands may return unexpected completion codes.

    GPU ECC error injection flows that enable injection by sending NSM Type 5 commands may receive an unexpected non-zero completion code in response.

  • Real thermal faults may not generate event logs, leading to undetected thermal events that could seriously impact system operation.

  • GPUs missing from nvidia-smi console

    This release includes a fix for a corner case in handling reset when the driver is loaded. This issue manifests as the GPU missing from nvidia-smi console across VM reboots.

  • Long delay in getting the VM’s IP after rebooting the VM

    Recent kernel changes in 5.14.0-570.26.1.el9_6 modified the memory mapping process for passthrough devices. This can result in long delays in getting the VM’s IP after rebooting the VM.

  • Thermal testing on B200 SKU 220 systems

    We have observed that NVQual Test 1 (Thermal Qualification) may fail to run in some instances on B200 SKU 220 systems, resulting in an “NVRM generic error” (Error 000000000074) during GPU initialization. This issue affects the thermal validation process and may prevent completion of thermal diagnostics.

  • Part number missing on /redfish/v1/Chassis/HGX_GPU_SXM_X URIs

    A recent change in the HMC’s behavior causes it to report SXM PN instead of the previously reported GPU Board assembly SN over /redfish/v1/Chassis/HGX_GPU_SXM_X URIs. The SXM PN field is not populated in the infoROM on existing platforms, such as B200, B300, GB200, and GB300, causing the Part Number to be missing.
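The missing Part Number described in the last item above can be detected with a small check on the chassis resources. The sketch below assumes the standard Redfish Chassis `PartNumber` property and takes already-parsed JSON as input. The GPU indices, sample values, and helper names are hypothetical and only illustrate the idea.

```python
def chassis_uri(gpu_index):
    """Build the chassis URI for one GPU (the index range is an assumption)."""
    return f"/redfish/v1/Chassis/HGX_GPU_SXM_{gpu_index}"


def part_number_missing(chassis):
    """Return True if PartNumber is absent, not a string, or blank."""
    value = chassis.get("PartNumber")
    return not isinstance(value, str) or not value.strip()


def find_missing_part_numbers(resources):
    """Return the sorted URIs whose chassis resource lacks a Part Number.

    resources maps each URI to its parsed JSON chassis resource.
    """
    return sorted(uri for uri, res in resources.items() if part_number_missing(res))
```

Running this check before and after the firmware update shows whether the affected URIs report a Part Number again.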

Issues Fixed in 1.2.10#

  • After installing firmware 1.2 on B200 hosts, nvidia_gpu_tools.py crashes due to a race condition in FSP firmware.

    FSP firmware v1.2.00 has a race condition between the tasks that service out-of-band SPDM requests from the HMC and a staging buffer used in the boot process. As a result, the staging buffer contents fail authentication and the boot fails. This could also manifest as InfoROM corruption.

Issues Fixed in 1.1.00#

  • PCI Express Capabilities exposure and its impact on the NVIDIA driver behavior.

    Ideally, the OS should expose all config space registers, including those in the extended config space. When the OS has valid reasons to hide certain capabilities, the NVIDIA driver still expects a specific set of registers to be exposed so that it can boot more reliably than when those capabilities are hidden.

    This issue has been fixed, and the RM driver no longer logs error-level prints for non-fatal config space read failures.

  • The Redfish URI takes longer to respond (10-12 times longer than average) when a firmware update is triggered.

    During firmware updates, Redfish URI response times might increase substantially. The observed latency can reach 15 seconds and affects various URIs.

  • HGX_Chassis_0_HSC_[0-9]_Power_0 ReadingTime property is invalid because the FPGA timestamp cannot be converted to a Redfish timestamp.

    The difference between the value of the ReadingTime property of an HSC power sensor for two consecutive Redfish requests is not the same as the time delay between these Redfish requests.

    Here is a list of the affected URIs and properties:

    /redfish/v1/Chassis/HGX_Chassis_0/Sensors/HGX_Chassis_0_HSC_[0-9]_Power_0 ReadingTime
    
  • The Redfish schema for histograms has been restructured.

    In the HMC for the 0.9 release, the following OEM Redfish URIs were introduced for Power Histogram data:

    • /redfish/v1/Fabrics/HGX_NVLinkFabric_0/Switches/NVSwitch_[0-1]/Oem/Nvidia/Histograms/<str>

    • /redfish/v1/Fabrics/HGX_NVLinkFabric_0/Switches/NVSwitch_[0-1]/Oem/Nvidia/Histograms/<str>/Buckets

    • /redfish/v1/Fabrics/HGX_NVLinkFabric_0/Switches/NVSwitch_[0-1]/Oem/Nvidia/Histograms/<str>/Buckets/<id>

  • DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION reports the value in joules (J) instead of millijoules (mJ).

    A discrepancy has been identified in the energy measurement reporting for GPU sensors in the Redfish API, and this issue affects the /redfish/v1/Chassis/HGX_GPU_SXM_{InstanceId}/Sensors/HGX_GPU_SXM_{InstanceId}_Energy_0 sensor endpoint.

  • The FDR Service has high CPU usage at boot up for about three minutes.

    The FDR service consumes a high amount of CPU for about three minutes after the HMC is ready. This high consumption can impact the Redfish APIs for up to five seconds within those first three minutes.

  • BMC internal resource events show up under the System Events URI.

    The BMC health monitor application pushes resource consumption events when consumption exceeds the predefined limits. These events are not a sign of any faults or errors in the system and are treated as information messages.

  • The HMC is not showing a correct Power Reading for GPU DRAM/HBM and is using GPU instead of HBM Power.

    The HBM power and energy readings reported on the /redfish/v1/Systems/HGX_Baseboard_0/Memory/GPU_SXM_X_DRAM_0/EnvironmentMetrics Redfish resource are incorrect.

  • [openBMC] [NVBMC-Security] - U-Boot Multiple Vulnerabilities - March 2025.

    NVIDIA became aware of two new vulnerabilities present in the U-Boot code used within the NVIDIA openBMC firmware for Data Center products (CVE-2024-57258 and CVE-2024-57256). These vulnerabilities can be exploited if a privileged user gains access to the ext4 or SquashFS file systems. NVIDIA has provided a patched firmware version to address these vulnerabilities.

    Visit the NVIDIA Product Security page to learn more about the vulnerability management process followed by the NVIDIA Product Security Incident Response Team (PSIRT).
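For the ReadingTime item in the list above, the symptom can be checked by comparing the difference between two consecutive ReadingTime values with the delay between the two requests. The sketch below is one possible check, not an NVIDIA tool. It assumes that the property is an ISO 8601 timestamp string, as is usual for Redfish date-time values, and the one-second tolerance is an arbitrary illustrative choice.

```python
from datetime import datetime


def parse_redfish_time(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def reading_time_drift(first, second, request_gap_s):
    """Return the reported time difference minus the actual request gap, in seconds."""
    reported = (parse_redfish_time(second) - parse_redfish_time(first)).total_seconds()
    return reported - request_gap_s


def reading_time_consistent(first, second, request_gap_s, tolerance_s=1.0):
    """Return True if the two ReadingTime values advance in step with the requests."""
    return abs(reading_time_drift(first, second, request_gap_s)) <= tolerance_s
```

Here `request_gap_s` is the delay that the client measured between its two requests. A drift far outside the tolerance matches the behavior described for the affected HSC power sensors.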

nvfwupd Command Changes#

Changes in 2.0.9#

  • An activation command (RF_PWR_STATUS) has been introduced to provide reliable insight into system power, making firmware updates, maintenance, and monitoring much safer and more efficient.

  • Introduced flint support for updating firmware directly on the host system.

  • A delay mechanism (Update_Delay feature) has been introduced to avoid memory overload and keep simultaneous BMC/HMC updates running smoothly.

  • Added support for OEM parameters in firmware updates through the --oem_parameters option.

  • Tarfile support for PowerShelf update packages has been restored.

Changes in 2.0.8#

  • Extended show_version to support parallel queries across multiple systems and provide multi-package output.

  • Added Ctrl-C handler for clean termination of multithreaded operations.

  • Deprecated support for tarfile update packages.

  • Upgraded the Python runtime from 3.7 to 3.12.

Changes in 2.0.5#

  • Added support for parallel firmware updates through the YAML configuration file.

  • Added the --json option to the update_fw, show_update_progress, and force_update commands.

  • Added IPv6 support.

  • Deprecated the targets sub-option for multi-target input. Use config.yaml input instead.