Firmware Changes for NVIDIA DGX B300 Systems

BMC Changes for DGX B300 Systems

Changes in 01.01.25

  • Improved handling of system fans.

  • Improved telemetry stability and responsiveness.

  • Fixed power button responsiveness.

  • Fixed drive presence detection.

  • Improved sensor accuracy.

  • Resolved unexpected system shutdowns caused by incorrect CPU temperature threshold settings.

  • Improved overall power handling.

  • Improved overall ConnectX firmware reliability.

  • Improved SEL logging.

  • Fixed debug log collection.

Changes in 01.00.03

  • Implemented improvements to ensure services restart reliably.

Changes in 01.00.00

  • The initial BMC firmware version.

SBIOS Changes for DGX B300 Systems

Changes in 01.00.06

  • Resolved an issue where BIOS boot order changes made through the Redfish API were not reflected after a reboot.

Changes in 01.00.05

  • The initial SBIOS firmware version.

GPU Tray Firmware Fixes

Issues Fixed in 1.4.30

  • NVLink task scheduling behavior causing GPU driver hangs.

    After extended uptime (approximately 60 days or more), systems can exhibit NVLink task scheduling behavior that causes GPU driver hangs and, with the R580 driver or later, XID 150 errors in the kernel log. The root cause is a bug in the vBIOS firmware, in the GPU’s NVLink management microcode, where a counter monitors the total execution time of NVLink message processing tasks.

    The defect was identified and corrected, and the fix was verified.
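To check whether a system running older firmware may have hit the NVLink issue described above, the kernel log can be scanned for XID 150 events and the uptime compared with the roughly 60-day window. The following is a minimal sketch, not an official diagnostic. It assumes the usual NVIDIA driver log format `NVRM: Xid (PCI:<address>): <code>, ...`, and the function names `count_xid150` and `uptime_days` are hypothetical helpers introduced for illustration.

```shell
#!/bin/sh
# Sketch only: helpers for spotting the symptoms described above.

# Count XID 150 events in kernel log text read from stdin.
# Assumption: Xid lines look like "NVRM: Xid (PCI:<address>): 150, ...".
count_xid150() {
    # grep -c prints the match count; "|| true" keeps a zero count from
    # being treated as a failure by callers using "set -e".
    grep -cE 'NVRM: Xid \(PCI:[^)]*\): 150,' || true
}

# Convert an uptime in seconds (first field of /proc/uptime on Linux)
# to whole days.
uptime_days() {
    awk '{ print int($1 / 86400) }'
}

# Example usage on a live Linux system (dmesg may require root):
#   dmesg | count_xid150
#   uptime_days < /proc/uptime
```

A nonzero count together with an uptime of about 60 days or more is consistent with the issue, but it is not proof of it. Updating the GPU tray firmware remains the fix.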

Issues Fixed in 1.4.00

  • Critical message when recovery update times out

    If all 8 GPUs are in recovery mode, the total recovery time may exceed 20 minutes, causing the task status to appear as “Critical.” This status can be safely ignored. Instead, refer to the Redfish task output to check the recovery status of each individual GPU.

  • WriteProtected missing from /redfish/v1/UpdateService/FirmwareInventory/<device>

    This issue can occur while a firmware update is in progress. As a result, some fields may be missing from the response.

  • Multiple CX8 devices failed to initialize after warm reboot

    An abrupt warm reset of the system may cause an intermittent issue in which some CX8 devices do not enable the PCIe receiver, so the retimer cannot detect the endpoint receiver. As a result, the system boots without initializing those CX8 devices.

  • I2C rise time failure from CX8 to OSFP transceiver

    In certain configurations, marginal signal timing on the internal control bus can cause intermittent communication failures between the network controller and OSFP optical modules.

  • Downstream device detection failure

    In certain cases, downstream devices are not detected when the CX8 PCI switch is enabled. Users may also see the GPU drop off the bus, with GPU sensor readings and properties reported as nan. This occurs when the device fails to receive the PERST# assertion required for initialization.

    Refer to the following recovery steps.

    This issue occurs only when users have manually modified PCI configurations using mlxconfig. For example, the following mlxconfig settings restrict the speed, width, or power states of specific PCI buses:

    • BUS00_RESTRICT_WIDTH (used for bus00, bus10, bus11, bus12, etc.)

    • PCI_BUS00_RESTRICT_ASPM

    • PCI_BUS0_RESTRICT_SPEED

    Settings such as these can cause this issue. If PCI settings have not been changed through mlxconfig, no action is required.

    Recovery Steps for Legacy Firmware

    If you cannot update the firmware immediately, you can restore detection using one of the following options:

    • Option 1: Reset configuration

      Run the following command to reset the device configuration:

      mlxconfig -d <device> -y reset
      
    • Option 2: Manual PERST configuration (B300)

      Manually set the PERST parameters using these commands:

      mlxconfig -d <device> set PCI_BUS10_CONTROL_EN=1
      mlxconfig -d <device> set PCI_BUS10_PERST_SOURCE=2
      mlxconfig -d <device> set PCI_BUS10_PERST_GPIO=8
      
  • NCCL performance is slow.

    NCCL ALL-REDUCE performance may be lower than expected due to an incorrect default NVSwitch power profile.

    This issue is fixed in NVSwitch version 35_2014_4770 by using the correct default profile in B200.
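For the downstream device detection item in this release, it can help to review whether any PCI restriction parameters are present in a device's mlxconfig configuration before deciding on a recovery option. The following is a minimal sketch. It assumes that the output of `mlxconfig -d <device> query` lists one parameter name per line alongside its value, and the function name `list_pci_restrictions` is a hypothetical helper introduced for illustration.

```shell
#!/bin/sh
# Sketch only: filter mlxconfig query output for PCI restriction parameters.

# Read `mlxconfig ... query` text from stdin and print every line whose
# parameter name contains RESTRICT_WIDTH, RESTRICT_ASPM, or RESTRICT_SPEED.
# Assumption: each parameter appears on its own line next to its value.
list_pci_restrictions() {
    # "|| true" keeps "no matches" from being treated as a failure.
    grep -E 'RESTRICT_(WIDTH|ASPM|SPEED)' || true
}

# Example usage (replace <device> with your device):
#   mlxconfig -d <device> query | list_pci_restrictions
```

The query output can include parameters that are still at their default values, so treat the result as a list to review rather than as evidence that a setting was changed.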

Issues Fixed in 1.0.10

  • In rare cases, the system might encounter an XID error.

    To mitigate this issue, the default configuration of the NVLink low-power state (L1) has been changed to disabled. For inference workloads, the performance impact is negligible. For training workloads, a small impact may be observed.

    This issue is fixed in NVSwitch version 35_2015_4718.