Known Issues#
Functional Issues#
You cannot update the firmware of individual components of the DGX B200 GPU tray. For example, you cannot update the GPU firmware on its own. You must update the firmware by flashing the entire DGX B200 GPU tray.
Firmware download is not automatic. You must download the firmware manually from the NVIDIA Enterprise Support Portal.
BMC Reports 1848 RPM for Removed Fan Modules#
Issue#
When a fan module is physically removed from the chassis, the BMC reports a constant value of 1848 RPM instead of indicating that the fan is missing or returning 0 RPM. For example:
$ ipmitool sdr elist | grep -i fan
...
SPD_FAN_4_3_R | AAh | lcr | 29.10 | 1848 RPM
SPD_FAN_4_3_F | ABh | lcr | 29.10 | 1848 RPM
...
Root Cause
When no fan is connected, the BMC’s sensor reads a specific voltage state from the empty fan slot. The firmware translates this into 1848 RPM rather than interpreting it as a “missing” state.
Verification
To distinguish this artifact from a genuine fan failure:
Check sensor list:
ipmitool sdr elist | grep -i fan
If a fan reports 1848 RPM, the fan module is likely physically missing or unseated.
Verify by physical inspection of the chassis.
Workaround#
This is a cosmetic reporting issue only and does not affect system functionality. Verify fan presence through physical inspection when 1848 RPM is reported.
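The check above can be sketched as a small filter that flags fan sensors reporting exactly 1848 RPM. This is only an illustration: the function name `suspect_missing_fans` is hypothetical, and it assumes the pipe-separated layout that `ipmitool sdr elist` normally prints, with the reading in the last column.

```shell
# Hypothetical helper: reads `ipmitool sdr elist` output on stdin and prints
# the names of fan sensors whose reading is exactly "1848 RPM".
suspect_missing_fans() {
    awk -F'|' '
        {
            # The reading is assumed to be the last pipe-separated column.
            reading = $NF
            gsub(/^[ \t]+|[ \t]+$/, "", reading)
            # Use the first whitespace-delimited token of column 1 as the name.
            split($1, tok, " ")
            name = tok[1]
            if (toupper(name) ~ /FAN/ && reading == "1848 RPM")
                print name
        }'
}

# Example usage on a live system:
#   ipmitool sdr elist | suspect_missing_fans
```

Any sensor this prints is a candidate for physical inspection; the filter does not replace that inspection.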
VBIOS Incompatibility Issue#
Issue#
Updating directly from a version earlier than 97.00.5E.00.XX to a version later than 97.00.7C.00.XX might fail. When using the Redfish method, you might see an error similar to the following:
{
"@odata.type": "#MessageRegistry.v1_4_1.MessageRegistry",
"Message": "Verification of image '97.00.7C.00.05' at 'HGX_FW_GPU_SXM_4' failed.",
"MessageArgs": [
"97.00.7C.00.05",
"HGX_FW_GPU_SXM_4"
],
"MessageID": "Update.1.0.VerificationFailed",
"Resolution": "None.",
"Severity": "Critical"
}
Explanation#
VBIOS firmware data structures in versions earlier than 97.00.5E.00.XX and versions later than 97.00.7C.00.XX are incompatible.
Workaround#
If your current VBIOS version is:
97.00.5E.00.XX or later:
Update to the latest VBIOS version directly.
Earlier than 97.00.5E.00.XX:
Follow these steps:
Update to a version between 97.00.5E.00.XX and 97.00.7C.00.XX.
Then, update to the latest version.
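The decision above can be sketched as a shell function. This is a minimal sketch, not an NVIDIA tool: the name `vbios_update_path` is hypothetical, and it assumes that VBIOS versions order by their first four dot-separated fields when each field is read as two hexadecimal digits (the fifth field, shown as XX above, is ignored).

```shell
# Hypothetical helper: prints "direct" or "two-step" for a VBIOS version
# string such as 97.00.5D.00.01.
vbios_update_path() {
    # Split the version on dots into the positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    if [ "$#" -lt 4 ]; then
        echo "unrecognized version format" >&2
        return 2
    fi
    # Assumption: each field is two hex digits, so the first four fields
    # concatenate into one comparable hexadecimal number.
    key=$((0x$1$2$3$4))
    if [ "$key" -lt $((0x97005E00)) ]; then
        echo "two-step"   # go through an intermediate version first
    else
        echo "direct"     # update straight to the latest version
    fi
}

# Example usage:
#   vbios_update_path 97.00.5D.00.01
```

The current version can typically be read with `nvidia-smi --query-gpu=vbios_version --format=csv,noheader` and passed to the function.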
Misleading Messages During Firmware Update#
Issue#
When a ConnectX-7 firmware update finishes applying, the following messages indicate that a reboot is required: "To load new FW, run mlxfwreset or reboot machine." and "Please reboot machine to load new configurations." However, for ConnectX-7 firmware versions 28.36.1010 and later, rebooting the system does not properly load the updated firmware or the new configurations.
Workaround#
For the firmware update and new configurations to load successfully, perform an AC power cycle on the system instead of rebooting.
Firmware Inventory Can Be Invalid During Boot#
Issue#
In rare instances, polling the firmware inventory endpoint of the BMC Redfish API while the system is booting can report inaccurate firmware versions for the HGX_0 component.
Workaround#
Query the firmware inventory after the system completes the boot sequence to retrieve the current firmware inventory.
BMC Slow Startup After AC Power Cycle#
Issue#
After an AC power cycle, the BMC can require approximately 10 minutes before it is available for communication. The BMC is typically available within three minutes.
Workaround#
No workaround is available.
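Although the delay itself cannot be avoided, automation that depends on the BMC can wait for it rather than fail. The following is a minimal sketch of such a wait loop. The helper name `wait_for`, the variables `BMC_HOST`, `BMC_USER`, and `BMC_PASS`, and the use of `ipmitool mc info` as the availability probe are all assumptions made for illustration.

```shell
# Hypothetical helper: runs a command up to ATTEMPTS times, sleeping INTERVAL
# seconds after each failure. Returns 0 on the first success, 1 otherwise.
# Usage: wait_for ATTEMPTS INTERVAL COMMAND [ARGS...]
wait_for() {
    attempts=$1
    interval=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Discard the probe's output; only its exit status matters.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}

# Example usage: 40 attempts x 15 seconds covers roughly 10 minutes.
#   wait_for 40 15 ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" \
#       -P "$BMC_PASS" mc info
```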
Temperature Sensors Can Report No Reading#
Issue#
The following sensors can report No Reading rather than a sensor value:
TEMP_PSU4
TEMP_PSU5
PWR_PSU5
SPD_FAN_PSU5_R
STATUS_PSU0
STATUS_PSU1
STATUS_PSU2
STATUS_PSU3
STATUS_PSU4
STATUS_PSU5
STATUS_HMC
TEMP_PCIE_SW_1
TEMP_Cedar_OSFP0
TEMP_Cedar_OSFP1
TEMP_Cedar_OSFP2
TEMP_Cedar_OSFP3
TEMP_PCIE_CX7_1
TEMP_PCIE_CX7_2
TEMP_CX7_QSFP0
TEMP_CX7_QSFP1
TEMP_CX7_QSFP2
TEMP_CX7_QSFP3
TEMP_Intel_NIC
TEMP_NIC_QSFP0
TEMP_NIC_QSFP1
Workaround#
Polling the sensors again can resolve the issue.
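To see which sensors need another poll, the output of `ipmitool sdr elist` can be filtered for No Reading entries. This is a minimal sketch: the function name `no_reading_sensors` is hypothetical, and it assumes the pipe-separated layout that `ipmitool sdr elist` normally prints, with the reading in the last column.

```shell
# Hypothetical helper: reads `ipmitool sdr elist` output on stdin and prints
# the names of sensors whose reading column is "No Reading".
no_reading_sensors() {
    awk -F'|' '
        {
            # The reading is assumed to be the last pipe-separated column.
            reading = $NF
            gsub(/^[ \t]+|[ \t]+$/, "", reading)
            if (reading == "No Reading") {
                # Use the first whitespace-delimited token of column 1 as the name.
                split($1, tok, " ")
                print tok[1]
            }
        }'
}

# Example usage on a live system:
#   ipmitool sdr elist | no_reading_sensors
```

If the function prints any names, running the same pipeline again after a short delay shows whether those sensors have recovered.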