
Firmware Changes for NVIDIA DGX B300 Systems

BMC Changes for DGX B300 Systems

Changes in 01.01.44

Enhanced handling of sudden temperature spikes to reduce unexpected shutdowns.
Filtered out corrupt temperature readings that fall outside valid silicon boundaries.
Fixed an issue that could prevent the system from powering on after shutdown.
Improved thresholding to reduce, but not eliminate, duplicate event generation.

Changes in 01.01.35

Fixed a minor memory leak in the BMC.
Resolved an issue where fan speeds unexpectedly dropped.

Changes in 01.01.25

Improved handling of system fans.
Improved telemetry stability and responsiveness.
Fixed power button responsiveness.
Fixed drive presence detection.
Improved sensor accuracy.
Resolved unexpected system shutdowns caused by incorrect CPU temperature threshold settings.
Improved overall power handling.
Improved overall ConnectX firmware reliability.
Improved SEL logging.
Fixed debug log collection.

Changes in 01.00.03

Implemented improvements to ensure services restart reliably.

Changes in 01.00.00

The initial BMC firmware version.

SBIOS Changes for DGX B300 Systems

Changes in 01.00.08

Incorporated security updates.

Changes in 01.00.06

Resolved an issue where BIOS boot order changes made through the Redfish API were not reflected after reboot.

Changes in 01.00.05

The initial SBIOS firmware version.

GPU Tray Firmware Fixes

Issues Fixed in 1.5.00

Certain URIs are not present after AC or DC cycling.

During AC or DC cycling, certain URIs are not present when queried. For example, FirmwareInventory does not show for all devices.
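
One way to spot this condition is to compare the members of the FirmwareInventory collection with the devices you expect to see. The following is a minimal sketch that works on an already-parsed collection response. It assumes the standard Redfish collection layout (`Members` entries with `@odata.id`); the device IDs in the sample are hypothetical placeholders, not actual DGX B300 inventory names.

```python
def missing_inventory_ids(collection, expected_ids):
    """Return the expected device IDs that are absent from a FirmwareInventory collection.

    `collection` is the parsed JSON body of
    /redfish/v1/UpdateService/FirmwareInventory.
    """
    present = {
        # The last path segment of each member URI is treated as the device ID.
        member.get("@odata.id", "").rstrip("/").rsplit("/", 1)[-1]
        for member in collection.get("Members", [])
    }
    return sorted(set(expected_ids) - present)


# Hypothetical response and device IDs, for illustration only.
sample = {
    "Members": [
        {"@odata.id": "/redfish/v1/UpdateService/FirmwareInventory/DEVICE_A"},
        {"@odata.id": "/redfish/v1/UpdateService/FirmwareInventory/DEVICE_B"},
    ]
}
print(missing_inventory_ids(sample, ["DEVICE_A", "DEVICE_B", "DEVICE_C"]))
```

A non-empty result after a power cycle would suggest that some URIs have not been populated.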

v1.4.30 Vulnerabilities

The following vulnerabilities, which were identified in the v1.4.30 release, are addressed by this release:
- CVE-2019-11690
- CVE-2021-36647
- CVE-2021-45451

I2C Rising Time Failure from CX8 to OSFP Transceiver.

In certain configurations, marginal signal timing on the internal control bus can cause intermittent communication failures between the network controller and OSFP optical modules.

CX8 SPDM measurement is not working for CoRIM test.

Index 51 is not included in the overall SPDM measurements report. However, it can be requested with an additional SPDM measurement request by index (51).

Issues Fixed in 1.4.30

NVLink task scheduling behavior causing GPU driver hangs.

After extended uptime (approximately 60 days or more), systems can exhibit NVLink task scheduling behavior that causes GPU driver hangs and, with the R580 driver or later, XID 150 errors in the kernel log. The root cause is a bug in the vBIOS firmware, in the GPU’s NVLink management microcode, where a counter monitors the total execution time of NVLink message processing tasks.

The defect has been identified and corrected, and the fix has been verified.
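
To check whether a system running older firmware has hit this condition, the kernel log can be scanned for XID 150 entries. The sketch below assumes the commonly seen `NVRM: Xid (<device>): <code>, ...` line format; the sample log lines are made up for illustration, and the function name is hypothetical.

```python
import re

# Assumed kernel log format: "NVRM: Xid (PCI:0000:1b:00): 150, <details>"
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")


def find_xid_events(log_text, code=150):
    """Return (device, line) pairs for kernel log lines that report the given XID."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match and int(match.group("code")) == code:
            events.append((match.group("device"), line.strip()))
    return events


# Made-up sample lines, for illustration only.
sample_log = """\
[100.0] NVRM: Xid (PCI:0000:1b:00): 150, sample detail text
[101.0] NVRM: Xid (PCI:0000:1b:00): 79, sample detail text
[102.0] usb 1-1: new high-speed USB device
"""
for device, line in find_xid_events(sample_log):
    print(device, "->", line)
```

In practice, `log_text` could be the captured kernel log output from the affected host.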

Issues Fixed in 1.4.00

Critical message when recovery update times out

If all 8 GPUs are in recovery mode, the total recovery time may exceed 20 minutes, causing the task status to appear as “Critical.” This status can be safely ignored. Instead, refer to the Redfish task output to check the recovery status of each individual GPU.
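
As a rough illustration of reading the task output instead of relying on the overall status, the sketch below summarizes an already-parsed Redfish Task resource. `TaskState`, `TaskStatus`, and `Messages` are standard Redfish Task properties; that the per-GPU results appear in `Messages`, along with the sample values and message text, is an assumption made for this example.

```python
def summarize_task(task):
    """Collect the overall state and the individual messages of a Redfish task."""
    return {
        "state": task.get("TaskState"),
        "status": task.get("TaskStatus"),
        "messages": [
            # Newer schemas use MessageSeverity; older ones use Severity.
            (msg.get("MessageSeverity", msg.get("Severity")), msg.get("Message"))
            for msg in task.get("Messages", [])
        ],
    }


# Hypothetical task body, for illustration only.
sample_task = {
    "TaskStatus": "Critical",
    "Messages": [
        {"MessageSeverity": "OK", "Message": "hypothetical: GPU 0 recovery completed"},
        {"MessageSeverity": "OK", "Message": "hypothetical: GPU 1 recovery completed"},
    ],
}
summary = summarize_task(sample_task)
print(summary["status"])
for severity, text in summary["messages"]:
    print(severity, "-", text)
```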

WriteProtected missing from /redfish/v1/UpdateService/FirmwareInventory/<device>

This issue can happen while the firmware update is in progress. As a result, some of the fields may be missing from the response.
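
A client that reads this field can treat its absence as "unknown" rather than as an error. The following is a minimal sketch under that assumption; the function name is hypothetical, and the input is the parsed JSON of a single FirmwareInventory entry.

```python
def write_protected_state(inventory_item):
    """Return True or False when WriteProtected is reported, or None when it is absent."""
    value = inventory_item.get("WriteProtected")
    return value if isinstance(value, bool) else None


print(write_protected_state({"Id": "hypothetical-device", "WriteProtected": False}))
print(write_protected_state({"Id": "hypothetical-device"}))
```

A `None` result can then be handled by querying the entry again after the firmware update has finished.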

Multiple CX-8 failed to initialize after warm reboot

An abrupt warm reset of the system may result in intermittent issues in which some CX8 devices do not enable the PCIe receiver, which the retimer needs in order to detect the endpoint receiver. As a result, the system boots without initializing the CX8.

I2C rising time failure from CX8 to OSFP transceiver

In certain configurations, marginal signal timing on the internal control bus can cause intermittent communication failures between the network controller and OSFP optical modules.

Downstream device detection failure

In certain cases, downstream devices are not detected when the CX8 PCI switch is enabled. Users might also see the GPU drop off the bus, with GPU sensor readings and properties reported as nan. This occurs when the device fails to receive the PERST# assertion required for initialization.

Refer to the following recovery steps.

This issue occurs only when the PCI configuration has been manually modified using mlxconfig, for example with settings that restrict the speed, width, or power states of specific PCI buses:
- BUS00_RESTRICT_WIDTH (used for bus00, bus10, bus11, bus12, etc.)
- PCI_BUS00_RESTRICT_ASPM
- PCI_BUS0_RESTRICT_SPEED
Settings such as these can cause this issue. If the PCI settings have not been changed through mlxconfig, no action is required.
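
To see whether any such settings are present on a device, the output of `mlxconfig -d <device> query` can be filtered for the relevant names. The sketch below assumes the query output lists one setting per line as a name followed by its value; the sample output and its values are made up for illustration. It only lists the matching settings and does not judge whether a value differs from the default.

```python
import re

# Assumed line format: "<SETTING_NAME>   <value>", for example "PCI_BUS00_RESTRICT_ASPM   False(0)"
SETTING_LINE = re.compile(r"^\s*(?P<name>[A-Z0-9_]+)\s+(?P<value>\S+)\s*$")


def restrict_settings(query_output):
    """Return {name: value} for every setting whose name contains RESTRICT."""
    found = {}
    for line in query_output.splitlines():
        match = SETTING_LINE.match(line)
        if match and "RESTRICT" in match.group("name"):
            found[match.group("name")] = match.group("value")
    return found


# Made-up sample output, for illustration only.
sample_output = """\
Configurations:                              Next Boot
         PCI_BUS00_RESTRICT_ASPM             False(0)
         PCI_BUS10_CONTROL_EN                True(1)
"""
print(restrict_settings(sample_output))
```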

Recovery Steps for Legacy Firmware

If you cannot update the firmware immediately, you can restore detection using one of the following options:
- Option 1: Reset configuration
  
  Run the following command to reset the device configuration:
```
mlxconfig -d <device> -y reset
```
- Option 2: Manual PERST configuration (B300)
  
  Manually set the PERST parameters using these commands:
```
mlxconfig -d <device> set PCI_BUS10_CONTROL_EN=1
mlxconfig -d <device> set PCI_BUS10_PERST_SOURCE=2
mlxconfig -d <device> set PCI_BUS10_PERST_GPIO=8
```

NCCL performance is slow.

NCCL ALL-REDUCE performance may be lower than expected due to incorrect default NVSWITCH power profile.

This issue is fixed in NVSwitch version 35_2014_4770 by using the correct default profile in B200.

Issues Fixed in 1.0.10

In rare cases, the system might encounter an XID error.

To mitigate this issue, the default configuration of the NVLink low-power state (L1) has been changed to disabled. For inference workloads, the performance impact is negligible. For training workloads, a small impact may be observed.

This issue is fixed in NVSwitch version 35_2015_4718.