Firmware Changes for NVIDIA DGX B200 Systems#
BMC Changes for DGX B200 Systems#
Changes in 26.03.03#
Fixed occasional IPMI hang with NVSM.
Fixed debug log collection.
Improved BMC performance.
Enhanced SEL logging performance and minimized thermal and fan speed monitoring messages.
Improved BMC stability.
Changes in 25.12.11#
Updated OpenSSH to version openssh-10.0p1-2.
Incorporated security fixes in bootloader to address squashfs vulnerabilities.
Resolved an issue where fans were running at maximum speed.
Resolved an issue where differing time zone settings between the BMC and HMC caused continuous communication recovery attempts.
Enhanced BMC stability.
Changes in 25.06.27#
Resolved the issue of high fan speed in zone 1 during system idle.
Fixed an issue that intermittently fetched an invalid HMC configuration path.
Resolved intermittent BMC Redfish unresponsiveness.
Changes in 25.02.12#
The initial BMC firmware version.
SBIOS Changes for DGX B200 Systems#
Changes in 1.6.7#
Fixed an issue where a boot order setting via Redfish API did not take effect.
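Boot-order overrides of this kind are set through the standard DMTF Redfish `Boot` object on the system resource. The sketch below builds such a PATCH payload; the system ID `System_0` and credentials in the comment are assumptions and vary by platform.

```python
import json

# Assumption: the exact system resource ID differs per platform.
SYSTEM_URI = "/redfish/v1/Systems/System_0"

# Standard DMTF Redfish Boot object requesting a one-time PXE boot override.
payload = {
    "Boot": {
        "BootSourceOverrideEnabled": "Once",
        "BootSourceOverrideTarget": "Pxe",
    }
}

body = json.dumps(payload)
print(body)
# Typically sent with something like:
#   curl -k -u <user>:<password> -X PATCH -H "Content-Type: application/json" \
#        -d "$body" https://<bmc-ip>/redfish/v1/Systems/System_0
```

After a fixed SBIOS applies the change, re-reading the system resource should show the updated `Boot` settings.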
Changes in 1.6.6#
The initial SBIOS firmware version.
GPU Tray Firmware Fixes#
Issues Fixed in 1.4.30#
NVLink task scheduling behavior causing GPU driver hangs.
After extended uptime (approximately 60+ days), systems can exhibit NVLink task scheduling behavior that causes GPU driver hangs and, when using R580 driver or later, XID 150 errors in the kernel log. The root issue is a bug in the vBIOS firmware in the GPU’s NVLink management microcode, where a counter monitors total execution time of NVLink message processing tasks.
Having reproduced the issue and clearly identified the faulty code, we fixed the coding error and confirmed that this issue no longer occurs.
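Affected systems on R580 or later surface this as XID 150 entries in the kernel log. A minimal sketch for scanning a log for a given Xid code follows; the sample log line is illustrative only, and the exact NVRM message format can vary by driver version.

```python
import re

# Illustrative sample; real input would come from `dmesg` or the journal.
SAMPLE_LOG = """\
[5184000.123] NVRM: Xid (PCI:0000:1b:00): 150, pid=1234, NVLink task scheduling timeout
[5184001.456] audit: type=1400 apparmor="STATUS"
"""

# NVRM Xid lines generally look like "NVRM: Xid (PCI:<bdf>): <code>, ...".
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:.]+)\): (\d+)")

def find_xids(log: str, code: int):
    """Return (device, xid) pairs whose Xid code matches `code`."""
    return [(m.group(1), int(m.group(2)))
            for m in XID_RE.finditer(log) if int(m.group(2)) == code]

hits = find_xids(SAMPLE_LOG, 150)
print(hits)
```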
Issues Fixed in 1.4.00#
GPU BER counters not updating after PTER-based error injection.
Starting with VBIOS 97.00.93.00.00, the GPU port counter access transitioned from PCI_CR to NETIR, which breaks NVRASTool validation during BER error injection using PTER. This is because NETIR-based counter access intermittently maps to incorrect GPU instances across reboots, leading to validation failures.
Prior to this VBIOS change, all iterations of NVSWITCH_NVLINK_UNCORRECTABLE_Event and NVSWITCH_NVLINK_CORRECTABLE_Error consistently passed using PCI_CR-based validation.
Workaround:
Reboot the system when port mapping mismatches are observed.
This issue has been fixed.
GPU temperature and Tlimit reported as NaN.
GPU telemetry may remain unavailable indefinitely if PERST is asserted again shortly after reboot.
This issue is fixed. The HMC has been updated to increase telemetry retry handling to prevent the controller from giving up in this scenario. With this change, telemetry recovers automatically once the GPU becomes responsive without user action required.
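The retry behavior the HMC now implements can be illustrated in spirit with a small polling loop: keep re-reading the sensor instead of giving up while the GPU is still coming out of reset. This is a standalone sketch, not the HMC's actual code.

```python
import time

def read_with_retry(read_fn, retries=5, delay=0.01, is_valid=lambda v: v == v):
    """Poll a telemetry source until it returns a valid (non-NaN) value.

    NaN compares unequal to itself, so the default is_valid rejects NaN.
    Returns NaN only if every retry is exhausted.
    """
    for _ in range(retries):
        value = read_fn()
        if is_valid(value):
            return value
        time.sleep(delay)
    return float("nan")

# Simulated sensor that is unresponsive (NaN) for the first two reads.
readings = iter([float("nan"), float("nan"), 61.0])
temp = read_with_retry(lambda: next(readings))
print(temp)  # 61.0
```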
NCCL performance is slow.
NCCL ALL-REDUCE performance may be slower than expected due to incorrect default NVSWITCH power profile.
This issue is fixed in NVSwitch version 35.2014.4770 by using the correct default profile in B200.
Issues Fixed in 1.3.10#
In rare cases, the system may encounter an XID 149 error.
To mitigate this issue, we have changed the default configuration of the NVLink low power state (L1) to disabled. In inference workloads, the performance impact is negligible. In training workloads, a small impact may be observed.
The issue is fixed in NVSwitch version 35.2015.4718.
Issues Fixed in 1.3.00#
All NVLinks become inactive during 1N inference workload.
When a burst of Team Multicast Setup requests is sent by an app, intermittent failures may be reported while allocating the Team Multicast groups, causing the app to hang. The failure occurs due to buffer corruption and credit exhaustion when a burst of setup requests is processed by the SW/FW pipeline.
RM Initialization failures following GPU reset.
GPU ECC Unlock NSM RAW Commands may return unexpected completion codes.
GPU ECC error injections that enable error injection by sending NSM Type 5 Commands may receive an unexpected non-zero Completion code as the response.
Real thermal faults may not generate event logs, leading to undetected thermal events that could seriously impact system operation.
GPUs missing from nvidia-smi console
This release includes a fix for a corner case in handling reset when the driver is loaded. This issue manifests as the GPU missing from nvidia-smi console across VM reboots.
Long delay in getting the VM’s IP after rebooting the VM
Recent kernel changes in 5.14.0-570.26.1.el9_6 modified the memory mapping process for passthrough devices. This can result in long delays in getting the VM’s IP after rebooting the VM.
Thermal testing on B200 SKU 220 systems
We have observed that NVQual Test 1 (Thermal Qualification) may fail to run in some instances on B200 SKU 220 systems, resulting in an “NVRM generic error” (Error 000000000074) during GPU initialization. This issue affects the thermal validation process and may prevent completion of thermal diagnostics.
Part number missing on /redfish/v1/Chassis/HGX_GPU_SXM_X URIs.
A recent change in the HMC's behavior causes it to report the SXM PN instead of the previously reported GPU Board assembly SN over the /redfish/v1/Chassis/HGX_GPU_SXM_X URIs. The SXM PN field is not populated in the InfoROM on existing platforms, such as B200, B300, GB200, and GB300, causing the Part Number to be missing.
Issues Fixed in 1.2.10#
After installing firmware 1.2 on B200 hosts, nvidia_gpu_tools.py crashes due to a race condition in FSP firmware.
FSP firmware v1.2.00 has a race condition between tasks servicing out-of-band SPDM from the HMC and a staging buffer used in the boot process. This results in the staging buffer contents failing authentication, and therefore boot failure. This could also manifest as InfoROM corruption.
Issues Fixed in 1.1.00#
PCI Express Capabilities exposure and its impact on the NVIDIA driver behavior.
Ideally, the OS should expose all config space registers, including those in the extended config space. When the OS has valid reasons to hide certain capabilities, the NVIDIA driver still expects the following registers to be exposed by the OS so that boot-up behaves better than when those capabilities are hidden.
This issue has been fixed, and the RM driver will not log error-level prints for non-fatal config space read failures.
The Redfish URI takes longer to respond (10-12 times longer than average) when a firmware update is triggered.
During firmware updates, users might experience a substantial increase in URI (Uniform Resource Identifier) response times. The maximum observed latency can spike up to 15 seconds, which affects various URIs.
HGX_Chassis_0_HSC_[0-9]_Power_0 ReadingTime property is invalid because the FPGA timestamp cannot be converted to a Redfish timestamp.
The difference between the value of the ReadingTime property of an HSC power sensor for two consecutive Redfish requests is not the same as the time delay between these Redfish requests.
Here is a list of the affected URIs and properties:
/redfish/v1/Chassis/HGX_Chassis_0/Sensors/HGX_Chassis_0_HSC_[0-9]_Power_0 ReadingTime
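The symptom can be checked client-side by comparing the drift of `ReadingTime` between two requests against the actual wall-clock gap between them. A sketch, with hypothetical sample timestamps:

```python
from datetime import datetime, timedelta

def reading_time_consistent(t1: str, t2: str, request_gap: timedelta,
                            tolerance: timedelta = timedelta(seconds=1)) -> bool:
    """Check that two consecutive ReadingTime values advanced by roughly
    the wall-clock gap between the two Redfish requests."""
    delta = datetime.fromisoformat(t2) - datetime.fromisoformat(t1)
    return abs(delta - request_gap) <= tolerance

# Hypothetical values: the timestamp did not advance between requests
# made 10 seconds apart, which is the symptom described above.
ok = reading_time_consistent("2025-01-01T00:00:00+00:00",
                             "2025-01-01T00:00:00+00:00",
                             request_gap=timedelta(seconds=10))
print(ok)  # False
```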
The Redfish schema for histograms has been restructured.
In the HMC for 0.9 release, the following OEM Redfish URIs were introduced for Power Histogram data:
/redfish/v1/Fabrics/HGX_NVLinkFabric_0/Switches/NVSwitch_[0-1]/Oem/Nvidia/Histograms/<str>
/redfish/v1/Fabrics/HGX_NVLinkFabric_0/Switches/NVSwitch_[0-1]/Oem/Nvidia/Histograms/<str>/Buckets
/redfish/v1/Fabrics/HGX_NVLinkFabric_0/Switches/NVSwitch_[0-1]/Oem/Nvidia/Histograms/<str>/Buckets/<id>
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION is reporting the value in joules instead of mJ.
A discrepancy has been identified in the energy measurement reporting for GPU sensors in the Redfish API. This issue affects the /redfish/v1/Chassis/HGX_GPU_SXM_{InstanceId}/Sensors/HGX_GPU_SXM_{InstanceId}_Energy_0 sensor endpoint.
The FDR service has high CPU usage at boot up for about three minutes.
There is high CPU consumption by the FDR service for about three minutes after the HMC is ready. This high consumption can slow Redfish API responses by up to five seconds within the first three minutes after the HMC is ready.
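For the energy-unit discrepancy on the `HGX_GPU_SXM_{InstanceId}_Energy_0` sensor described above, clients that know which firmware they are talking to can normalize readings explicitly. A minimal sketch, assuming the caller determines the unit out of band:

```python
def normalize_energy_mj(reading: float, assume_joules: bool) -> float:
    """Normalize an energy reading to millijoules.

    On affected firmware the sensor reports joules where millijoules
    are expected; assume_joules tells the caller's knowledge of which
    unit the firmware actually emitted.
    """
    return reading * 1000.0 if assume_joules else float(reading)

print(normalize_energy_mj(12.5, assume_joules=True))      # -> 12500.0
print(normalize_energy_mj(12500.0, assume_joules=False))  # -> 12500.0
```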
BMC internal resource events show up under the System Events URI.
The BMC health monitor application pushes resource consumption events when consumption exceeds the predefined limits. These events are not a sign of any faults or errors in the system and are treated as information messages.
The HMC is not showing a correct Power Reading for GPU DRAM/HBM and is using GPU instead of HBM Power.
The HBM power and energy readings as reported on the /redfish/v1/Systems/HGX_Baseboard_0/Memory/GPU_SXM_X_DRAM_0/EnvironmentMetrics Redfish resource are incorrect.
[openBMC] [NVBMC-Security] - U-Boot Multiple Vulnerabilities - March 2025.
NVIDIA became aware of two new vulnerabilities presented in u-boot code used within the NVIDIA openBMC firmware for Data Center products (CVE-2024-57258 and CVE-2024-57256). These vulnerabilities can be exploited if a privileged user gains access to the ext4 or SquashFS file systems. NVIDIA has provided a patched firmware version to handle these vulnerabilities.
Visit the NVIDIA Product Security page to learn more about the vulnerability management process followed by the NVIDIA Product Security Incident Response Team (PSIRT).