Known Issues

This section provides a list of the known issues.

XID 150/154 immediately after GPU reset (MSE communication initialization).

In very rare conditions, immediately after a GPU reset, XID 150 followed by XID 154 may occur due to a failure to initialize communications with the MSE (Management Services Engine) microcode.
RDMI NCCL failures observed.

In the rare event of a communication loss between the RDM and GFM, a memory leak may occur in NVLSM.
On certain GB200 NVSwitch platforms, the switch reboot history may display multiple unexpected reboots with the reason shown as Platform reset (retrieved using the NVOS command: nv show sys reboot -o json | jq -r .reason). NVIDIA plans to include a fix in a future NVOS release.
NVIDIA compute firmware recovery bundle fails integrity check.

In release 1.3.2, updating GPU firmware using the recovery bundle may fail with the Integrity check failed for FW package message in the Firmware Update Service. The update is rejected before the image is applied, and the GPU remains on its previous firmware version.

Workaround: Use the standard compute-tray firmware bundle for GB200 NVL72 v1.3.2 (non-recovery …_prod-signed.fwpkg) to update or restore GPU firmware. If a recovery bundle is required, contact NVIDIA to obtain a corrected package and retry the update with that bundle.
NETIR dumps missing, libnvidia-ml.so error in NV-Bug-Report.

Workaround: Follow these steps as a temporary measure. The symbolic link will be restored in a future release.
1. Use one of the following commands to locate the libnvidia-ml.so.1 file:
```
find /usr -name "libnvidia-ml.so.1"
locate libnvidia-ml.so.1
```
2. Use one of the following commands to create a symbolic link named libnvidia-ml.so pointing to the located file in one of the standard library installation paths:
```
ln -sf <path-to-libnvidia-ml.so.1> /usr/lib64/libnvidia-ml.so
ln -sf <path-to-libnvidia-ml.so.1> /usr/lib/libnvidia-ml.so
```
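The two steps above can also be scripted. The following is a minimal Python sketch, not an NVIDIA-provided tool: the function names `find_library` and `link_library` are hypothetical, and the sketch assumes it runs with enough privileges to write to the chosen library directory.

```python
import os
from pathlib import Path
from typing import Optional


def find_library(search_root: str, name: str = "libnvidia-ml.so.1") -> Optional[Path]:
    """Return the first entry named `name` found under search_root, or None."""
    for dirpath, _dirnames, filenames in os.walk(search_root):
        if name in filenames:
            return Path(dirpath) / name
    return None


def link_library(target: Path, link_dir: str, link_name: str = "libnvidia-ml.so") -> Path:
    """Create or replace link_dir/link_name as a symlink to target (like `ln -sf`)."""
    link = Path(link_dir) / link_name
    # Remove an existing file or (possibly dangling) symlink before relinking.
    if link.is_symlink() or link.exists():
        link.unlink()
    link.symlink_to(target)
    return link
```

For example, `link_library(find_library("/usr"), "/usr/lib64")` mirrors the first `ln -sf` command, provided `find_library` did not return `None`.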
NCCL All-to-All testing across two or more domains may result in system hangs or deadlocks with no error codes.

Workaround: There is currently no workaround, and this issue will be fixed in a future release. Future versions of NCCL will add atomic locks to the affected code path to prevent the race condition.
A GPU that is part of a multicast group cannot be removed from the partition; multicast can be reused prematurely.

In GB200/GB300, if a GPU is participating in an active Multicast Team, it cannot be removed from a partition.

Workaround:
- Reset or power cycle all partition GPUs — On startup, the probe request will release any previously allocated multicast groups. Once the probe request completes, the GPU can then be removed from the partition.
- Delete and recreate the partition — Delete the current partition and create a new one that excludes the GPUs intended for removal. NVIDIA recommends using a different partition ID for the new partition.
If a subset of NVSwitch trays is rebooted, either unexpectedly or through an orderly reboot, while an NVLink Sharp workload is actively running, GPU reset operations may fail with kernel Xid errors (for example, GSP Timeout) and require a rack-level power cycle to recover.

Workaround: If the GPU encounters this issue, a system power cycle is required to restore functionality. Either an AC or DC power cycle will bring the GPU back online.
Occasionally, after a GPU reset, the Fabric status in nvidia-smi/NVML may report Insufficient Resources, even though all GPU NVLinks are active.

Workaround: Reset the affected GPUs again. If no GPU-resident services (such as nvidia-persistenced) are running, allow a sufficient amount of time (~10 seconds) between the GPU reset and checking the Fabric status.
NVLink NVLS traffic fails with Xid 145 after partial switch tray reboot.

Workaround: If only a subset of switch trays is rebooted, perform an additional NMX-C restart after all switch trays are back online. Alternatively, avoid rebooting only a subset of switches; instead, reboot or power cycle all switch trays.
Micron NVMe drives may become unavailable when connected behind the Mellanox CX8 due to a PCIe Max Payload issue. This issue does not impact GB200 with CX7, but it does impact GB200 with CX8.

Workaround: Set the PCIe Max Payload setting to 128 bytes in the SBIOS setup. An SBIOS firmware patch is also available, but it is not currently included in the SBIOS in the 1.3.0 bundle.
Type 3 Command 0x4A GPM Query per-instance GPM metrics: Querying some metrics returns 0x7e ERR_BUSY.

Workaround: There is currently no workaround, and this issue will be fixed in a future firmware release.
Type 3 Command 0x02 Read Thermal Parameter: unknown Target GPU TLIMIT temperature.

Workaround: There is currently no workaround. The target T.Limit threshold does not apply to this product, and the value can be safely ignored.
Type 3 Command 0x05 Clear Max Observed Power: Got ERR_INVALID_DATA when sensor ID=255.

Workaround: Specify the sensors to be cleared one at a time instead of clearing them all at once using the aggregate sensor ID (255).
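As an illustration of the per-sensor approach, the sketch below loops over individual sensor IDs and refuses the aggregate ID. It is only a sketch under stated assumptions: `clear_one` is a hypothetical callable standing in for whatever transport actually issues Type 3 Command 0x05, and the sensor IDs in the usage note are placeholders.

```python
from typing import Callable, Dict, Iterable

AGGREGATE_SENSOR_ID = 255  # aggregate ID that currently returns ERR_INVALID_DATA


def clear_max_observed_power(
    sensor_ids: Iterable[int],
    clear_one: Callable[[int], bool],
) -> Dict[int, bool]:
    """Clear each sensor individually and report per-sensor success.

    clear_one is a hypothetical function that sends the clear command
    for a single sensor ID and returns True on success.
    """
    results: Dict[int, bool] = {}
    for sensor_id in sensor_ids:
        if sensor_id == AGGREGATE_SENSOR_ID:
            raise ValueError("aggregate sensor ID 255 is affected by this issue")
        results[sensor_id] = clear_one(sensor_id)
    return results
```

A caller would pass its own list of valid sensor IDs, for example `clear_max_observed_power([0, 1, 2], send_clear)`, where `send_clear` is the caller's implementation.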
The Grace RW SPI read functionality needs to be improved to meet the RMA criteria.

Based on internal testing, a full SPI read takes one hour.

Workaround: There is currently no workaround, and this issue will be fixed in a future release.
The NVSwitch ERoT recovery is not working.

Workaround: Customers can still recover a failed ERoT and can use the normal firmware update mechanisms to update ERoTs.
BMC PCIe link reset causes SBIOS exception.

Workaround: Customers should avoid ungraceful PCIe link resets to the BMC while systems are operational. Partners that do not route PCIe to the BMC are not impacted by this issue.
coRIM should reference the value of FWID tcg-dice-TcbInfo in AliasKeyCert.

The FWID[0] field of DiceTcbInfo in the GPU iRoT DICE certificate contains the FSP FMC measurement. This value is written to an RTS hardware register (MSR 2), with hardware recording additional state information to compute the value to be stored in the register. The final register value is reported in SPDM measurement block Index 4.

The following Python code shows how to generate the measurement block 4 value in the GPU CoMID based on the FWID value contained in the DICE certificate.
Firmware Update task completed 100% but says critical.

A full log file causes the task status update to fail.

Workaround

The firmware update was successful, so no workaround is needed. To see the task status change to successful, clear the eMMC and run the update again:

https://${bmcip}/redfish/v1/Managers/HGX_BMC_0/Actions/Oem/eMMC.SecureErase
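The URI above is a Redfish action target. The sketch below only builds the request object so it can be inspected before anything is sent. It rests on assumptions: Redfish actions are invoked with POST, but the empty JSON body and the omission of authentication and TLS settings are simplifications, and `build_secure_erase_request` is a hypothetical helper name.

```python
import urllib.request


def build_secure_erase_request(bmc_ip: str) -> urllib.request.Request:
    """Build (but do not send) the eMMC SecureErase action request."""
    url = (
        f"https://{bmc_ip}/redfish/v1/Managers/HGX_BMC_0"
        "/Actions/Oem/eMMC.SecureErase"
    )
    return urllib.request.Request(
        url,
        data=b"{}",  # assumption: the action takes no parameters
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Sending the request erases the eMMC, so add credentials and call `urllib.request.urlopen` only once the target BMC address has been verified.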
GPU firmware cannot be updated until after SBIOS boots into UEFI.

The Grace PCIe PERST signal causes the GPU firmware’s MCTP stack to not work reliably until the UEFI stage of boot (PERST de-asserted).

Workaround

Wait for the SBIOS to boot into UEFI to run a firmware update.
The SBIOS will not boot after an L2 reset.

If Boot Chain 0 is corrupt or does not boot, the Grace/ERoT tries to fall back to Boot Chain 1. On an earlier version of the SBIOS, Socket 0 successfully resets, but Socket 1 fails due to SPI contention.

Workaround

Ensure that Boot Chain 0 contains a proper image.
After an Aux or PDU power cycle, the BMC may fail to restore the host to previous power state.

If PowerRestorePolicy is set to LastState, the BMC may occasionally fail to restore to the previous host power state after an Aux or PDU power cycle. For example, if the previous host state was On, the BMC may not turn on the host automatically after power is restored.

Workaround

If there is a fixed desired power state for the host after power is restored, the user can set the PowerRestorePolicy to either AlwaysOn or AlwaysOff. Otherwise, the user can send an additional power on or power off command after power is restored to achieve the desired host power state.
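A small sketch of the request body for changing the policy follows. `PowerRestorePolicy` and its values come from the standard Redfish ComputerSystem schema. The resource path to PATCH is not given here, so the sketch produces only the body, and `power_restore_patch_body` is a hypothetical helper name.

```python
import json

# Values defined for PowerRestorePolicy in the Redfish ComputerSystem schema.
ALLOWED_POLICIES = ("AlwaysOn", "AlwaysOff", "LastState")


def power_restore_patch_body(policy: str) -> bytes:
    """Return the JSON body for a PATCH that sets PowerRestorePolicy."""
    if policy not in ALLOWED_POLICIES:
        raise ValueError(f"unsupported PowerRestorePolicy: {policy!r}")
    return json.dumps({"PowerRestorePolicy": policy}).encode("utf-8")
```

Validating the value locally avoids sending a request the BMC would reject.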
Leak Detection Voltage Sensors do not show the Degraded state.

When the BMC gets an out-of-range reading from the voltage-based leak detection sensor, it marks the Leak Detector’s Status:State as Degraded. The corresponding Voltage Sensor does not currently align its Status:State property with the Leak Detector to reflect the Degraded value.

Workaround

Use the Leak Detector Redfish URIs to get the most accurate state of the leak detection system. The voltage sensor provides only the raw voltage value for diagnostic purposes.

Leak Detector URIs:
- /redfish/v1/Chassis/Chassis_0/ThermalSubsystem/LeakDetection/LeakDetectors/Chassis_0_LeakDetector_0_ColdPlate
- /redfish/v1/Chassis/Chassis_0/ThermalSubsystem/LeakDetection/LeakDetectors/Chassis_0_LeakDetector_0_Manifold
- /redfish/v1/Chassis/Chassis_0/ThermalSubsystem/LeakDetection/LeakDetectors/Chassis_0_LeakDetector_1_ColdPlate
- /redfish/v1/Chassis/Chassis_0/ThermalSubsystem/LeakDetection/LeakDetectors/Chassis_0_LeakDetector_1_Manifold
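The four URIs above follow a regular pattern, so a monitoring script can generate them and read `Status.State` from each response. This is a minimal sketch: the helper names are hypothetical, and it assumes each response has already been fetched and parsed into a dictionary.

```python
from typing import Any, Dict, List

LEAK_DETECTORS_BASE = (
    "/redfish/v1/Chassis/Chassis_0/ThermalSubsystem"
    "/LeakDetection/LeakDetectors"
)


def leak_detector_uris() -> List[str]:
    """Return the four leak detector URIs listed above, in the same order."""
    return [
        f"{LEAK_DETECTORS_BASE}/Chassis_0_LeakDetector_{index}_{location}"
        for index in (0, 1)
        for location in ("ColdPlate", "Manifold")
    ]


def detector_state(resource: Dict[str, Any]) -> str:
    """Extract Status.State from a parsed leak detector resource."""
    return resource.get("Status", {}).get("State", "Unknown")
```

A detector whose `detector_state` is `"Degraded"` corresponds to the out-of-range condition described above.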
No mechanism to disable host IPMI interface.

This firmware release does not provide a mechanism to disable the host IPMI interface, so any privileged user on a compute tray can send IPMI commands to the BMC on the same compute tray.

Workaround

A future BMC firmware release will provide an interface to restrict host IPMI commands. In the meantime, necessary access controls need to be implemented on the compute trays to limit host privileged access by users not authorized to access the BMC.
BMC SKU ID is sometimes not available through Redfish.

The BMC’s SKU ID should be available through the following Redfish URI under the “SKU” property: /redfish/v1/Chassis/BMC_0

Due to a current issue on the BMC, this property may not be available on all systems.

Workaround

There is currently no known workaround. The plan is to resolve this issue for the next release.
Cannot set IPv6 address through Redfish.

Because of a current BMC bug, setting the IPv6 address through the following Redfish URI fails: /redfish/v1/Managers/BMC_0/NetworkProtocol

The command will return the following message:
```
"Message": "The property 'Address or Port' with the requested value of
'\"2620:10d:c0a3:103::9:7554\"' could not be written because the value does not
meet the constraints of the implementation."
```
This is a known issue and will be addressed in the next release.

Workaround

There is currently no known workaround. The plan is to resolve this issue for the next release.
No Redfish error message when credentials are not provided.

If no credentials are provided in a Redfish command, the BMC currently does not return an explicit message indicating the lack of credentials. A 401 Unauthorized response is still returned, and access is prevented. This only applies if no credentials are provided. If credentials are provided but are incorrect, then the BMC will return a message indicating invalid credentials.

Workaround

The return code can be checked to confirm the status of the command. The 401 will indicate to the user that access was unauthorized.
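Because the response body may carry no explanatory message, a client can key off the HTTP status code alone. The sketch below shows one way to do that. The function name and the three result labels are illustrative choices, not part of any Redfish API.

```python
from http import HTTPStatus


def describe_redfish_status(status_code: int) -> str:
    """Classify a Redfish response using only its HTTP status code."""
    if status_code == HTTPStatus.UNAUTHORIZED:  # 401: missing or bad credentials
        return "unauthorized"
    if 200 <= status_code < 300:
        return "ok"
    return "error"
```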
Redfish GPU_0/Ports/NVLink_0: the “LinkState” property value does not change after a DC cycle.

NVLinks that were previously disabled through Redfish continue to report the LinkState as Enabled and LinkStatus as LinkUp.

Workaround

Use host reporting tools (nvidia-smi/NVML/DCGM) to fetch the NVLink status for disabled links.
No option in the web GUI to power on the system.

Currently, there is no option to power on the system from the BMC web GUI.

Workaround

Use Redfish or ipmitool to power on the system.
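For the ipmitool route, the sketch below assembles the command line for `chassis power on` over the `lanplus` interface. It assumes the BMC accepts IPMI over LAN, which is not stated here, and `ipmitool_power_on_argv` is a hypothetical helper name. The command is built but not executed.

```python
from typing import List


def ipmitool_power_on_argv(bmc_host: str, user: str, password: str) -> List[str]:
    """Return the ipmitool argument list that powers on the host."""
    return [
        "ipmitool",
        "-I", "lanplus",   # assumption: IPMI over LAN is enabled on the BMC
        "-H", bmc_host,
        "-U", user,
        "-P", password,
        "chassis", "power", "on",
    ]
```

The list can be passed to `subprocess.run(argv, check=True)` on a machine that has ipmitool installed and network access to the BMC.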
On certain DGX GB200 NVSwitch platforms, the switch reboot history may display multiple unexpected reboots with the reason shown as Platform reset (retrieved using the NVOS command: nv show sys reboot -o json | jq -r .reason). NVIDIA plans to include a fix in a future NVOS release.
[5770595] Leak detector voltage values read through the compute tray BMC via IPMI are clamped to the range 0.165 to 1.815.

Workaround

To read the correct voltage value when the sensor is outside that range, read the sensor value through Redfish under Chassis_0/Sensors.

Customers are always advised to use the standard Leak Detection methods under /redfish/v1/Chassis/Chassis_0/ThermalSubsystem/LeakDetection/LeakDetectors for determining if a leak is present rather than reading the voltage value of the sensor directly.
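A script that still consumes the IPMI readings can flag values that sit at either clamp bound, since those may hide an out-of-range voltage and should be re-read through Redfish. This is a minimal sketch. The helper name and the `tolerance` parameter are assumptions added to absorb rounding in the reported value.

```python
IPMI_CLAMP_LOW = 0.165   # lower clamp bound reported through IPMI
IPMI_CLAMP_HIGH = 1.815  # upper clamp bound reported through IPMI


def ipmi_reading_may_be_clamped(value: float, tolerance: float = 1e-3) -> bool:
    """Return True when an IPMI voltage reading is at or beyond a clamp bound."""
    return (
        value <= IPMI_CLAMP_LOW + tolerance
        or value >= IPMI_CLAMP_HIGH - tolerance
    )
```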