Known Issues

Functional Issues

  • You cannot update the firmware of individual components of the DGX H100 GPU tray. For example, you cannot update the GPU firmware on its own. You must update the firmware by flashing the entire DGX H100 GPU tray.

  • Firmware download is not automatic. You must download the firmware manually from the NVIDIA Enterprise Support Portal.

  • For systems running DGX OS 6.0, the nvfwupd command-line utility that is shown in sample commands is not automatically installed. You must download the utility from the NVIDIA Enterprise Support Portal. For systems running DGX OS 6.1, the nvfwupd command-line utility is included with the operating system.

The ipmitool dcmi power reading Command Returns 0 Power Reading Value

Issue

When you use the ipmitool dcmi power reading command to report power consumption data, the command reports 0 Watts for the instantaneous power reading, as shown in the following example:

$ sudo ipmitool -I lanplus -H IPaddress -U user -P password dcmi power reading
Instantaneous power reading:                             0 Watts
Minimum during sampling period:                          0 Watts
Maximum during sampling period:                       7852 Watts
Average power reading over sample period:             1885 Watts
IPMI timestamp:                             Jan 12 09:20:45 2024
Sampling period:                                00000005 Seconds
Power reading state is:                                activated

Workaround

No workaround is available as of the firmware update 1.1.3 release. This issue is expected to be resolved in a later release.
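Monitoring scripts that consume this output may want to detect the condition rather than record a 0 Watt sample. The following is a minimal Python sketch, not part of the NVIDIA tooling: the function names are hypothetical, and the rule "instantaneous is 0 while the average is above 0" is an assumption about how to recognize the issue, based only on the sample output above.

```python
import re

# Sample text copied from the example output in this section.
SAMPLE = """\
Instantaneous power reading:                             0 Watts
Minimum during sampling period:                          0 Watts
Maximum during sampling period:                       7852 Watts
Average power reading over sample period:             1885 Watts
IPMI timestamp:                             Jan 12 09:20:45 2024
Sampling period:                                00000005 Seconds
Power reading state is:                                activated
"""


def parse_dcmi_power_reading(text):
    """Map each 'label: N Watts' line to an integer number of Watts."""
    readings = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?):\s+(\d+)\s+Watts\s*$", line)
        if match:
            readings[match.group(1).strip()] = int(match.group(2))
    return readings


def instantaneous_reading_is_suspect(readings):
    """Assumed heuristic: 0 W instantaneous with a nonzero average."""
    instantaneous = readings.get("Instantaneous power reading")
    average = readings.get("Average power reading over sample period")
    return instantaneous == 0 and average is not None and average > 0


readings = parse_dcmi_power_reading(SAMPLE)
if instantaneous_reading_is_suspect(readings):
    print("Instantaneous reading is 0 W; treat this sample as unreliable.")
```

Lines that do not end in Watts, such as the timestamp and the sampling period, are ignored by the parser.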

Misleading Messages During Firmware Update

Issue

When the ConnectX-7 firmware update finishes applying, the following messages indicate that a reboot is required: "To load new FW, run mlxfwreset or reboot machine." and "Please reboot machine to load new configurations." However, for ConnectX-7 firmware versions 28.36.1010 and later, rebooting the system does not properly load the firmware update or the new configurations.

Workaround

For the firmware update and new configurations to load successfully, perform an AC power cycle on the system instead of rebooting.

Sensors Endpoint for the Redfish API Does Not Support $expand

Issue

An HTTP GET request to the sensors endpoint with an $expand argument, such as the following, fails:

/redfish/v1/Chassis/DGX/Sensors?$expand=.($levels=3)

Workaround

Request sensor data from the Redfish API one sensor at a time. Alternatively, use the IPMI tool to request sensor data.
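The one-sensor-at-a-time approach can be sketched in Python as follows. This is an illustration under assumptions, not a supported client: get_json is a hypothetical callable that performs an authenticated GET against the BMC and returns the decoded JSON body, and the function names are made up for this example. The sketch relies only on the standard Redfish collection layout, in which each entry of Members carries an @odata.id link.

```python
SENSORS_COLLECTION = "/redfish/v1/Chassis/DGX/Sensors"


def sensor_paths(collection):
    """Return the @odata.id link of every member in a Redfish collection."""
    members = collection.get("Members", [])
    return [member["@odata.id"] for member in members if "@odata.id" in member]


def fetch_sensors_individually(get_json, collection_path=SENSORS_COLLECTION):
    """Fetch the collection without $expand, then each sensor separately.

    get_json is a hypothetical helper supplied by the caller: it takes a
    resource path and returns the parsed JSON document for that path.
    """
    collection = get_json(collection_path)
    return {path: get_json(path) for path in sensor_paths(collection)}
```

Passing the transport in as a callable keeps authentication and TLS handling out of the sketch and lets the logic be exercised against canned responses.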

GPUs Show Exclamation Mark in BMC Web Interface

Issue

When you view the GPUs from the BMC web interface, the GPUs are shown with an exclamation mark.

Explanation

The icon is a false positive. You can view the results of the nvsm show health command to confirm that the GPU status is healthy.

Fixed Release

This issue is fixed with the v1.1.1 firmware update.

Firmware Upgrade or Downgrade Can Fail

Issue

When you perform a firmware upgrade or downgrade, the change can fail with a message like the following example:

...
[Sat 19 Aug 2023 08:20:50 AM CST] Firmware update task ended with state Exception, percentComplete: [98]
[Sat 19 Aug 2023 08:20:50 AM CST] Update RC: 1
[Sat 19 Aug 2023 08:20:50 AM CST] Collect RF task
[Sat 19 Aug 2023 08:21:01 AM CST] Update failed with [nvfw_DGX-H100_0005_230615.1.0_dbg-signed.fwpkg]:[/redfish/v1/UpdateService/FirmwareInventory/EROT_BMC_0]

Workaround

Retry the firmware upgrade or downgrade.
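If the update is driven from a script, the retry can be automated. The following Python sketch is one possible shape, with several assumptions: run_update is a hypothetical callable that starts the update and returns True on success, and the default of three attempts with a 60 second pause is an arbitrary choice, not an NVIDIA recommendation.

```python
import time


def retry_update(run_update, attempts=3, delay_seconds=60.0, sleep=time.sleep):
    """Call run_update() until it succeeds or the attempts run out.

    Returns the 1-based number of the successful attempt, or None if
    every attempt failed. run_update is a hypothetical caller-supplied
    function that returns True when the firmware update succeeded.
    """
    for attempt in range(1, attempts + 1):
        if run_update():
            return attempt
        if attempt < attempts:
            # Pause before the next attempt; skipped after the last one.
            sleep(delay_seconds)
    return None
```

Bounding the number of attempts avoids looping forever on a failure that a retry does not resolve.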

BMC LDAP Fields Do Not Support Space or Slash Characters

Issue

The BMC LDAP settings do not support the space or slash characters as part of the bind DN or search base. The following DN results in a failure:

DC=Echo Studios,DC=com

Workaround

No workaround is available.
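A provisioning script can at least detect an affected value before it is submitted to the BMC. The following Python sketch is an illustration only: the function name is hypothetical, and it assumes that "slash" means the forward slash character.

```python
# Characters that the BMC LDAP fields do not accept.
# Assumption: "slash" refers to the forward slash.
UNSUPPORTED_CHARS = (" ", "/")


def unsupported_dn_chars(dn):
    """Return the sorted unsupported characters that appear in the DN."""
    return sorted({char for char in dn if char in UNSUPPORTED_CHARS})


found = unsupported_dn_chars("DC=Echo Studios,DC=com")
if found:
    print(f"DN contains unsupported characters: {found!r}")
```

An empty list means that the value contains neither character.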

Firmware Inventory Can Be Invalid During Boot

Issue

In rare instances, polling the firmware inventory endpoint of the BMC Redfish API during boot can report inaccurate firmware versions for the HGX_0 component.

Workaround

Query the firmware inventory after the system completes the boot sequence to retrieve the current firmware inventory.

BMC Slow Startup After AC Power Cycle

Issue

After an AC power cycle, the BMC can take approximately 10 minutes to become available for communication. Typically, the BMC is available within three minutes.

Workaround

No workaround is available.
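Automation that runs after an AC power cycle can allow for the longer startup by polling with a deadline instead of failing on the first attempt. The following Python sketch assumes a hypothetical is_reachable callable that returns True once the BMC responds. The 15 minute timeout and 15 second interval are arbitrary values chosen to leave margin over the 10 minutes described above.

```python
import time


def wait_for_bmc(is_reachable, timeout_seconds=900.0, interval_seconds=15.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll is_reachable() until it returns True or the timeout expires.

    is_reachable is a hypothetical caller-supplied probe, for example a
    function that attempts a request to the BMC and reports success.
    Returns True if the BMC responded in time, otherwise False.
    """
    deadline = clock() + timeout_seconds
    while True:
        if is_reachable():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)
```

The clock and sleep parameters default to the standard library functions and exist so that the loop can be exercised without real delays.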

Temperature Sensors Can Report No Reading

Issue

The following sensors can report No Reading rather than a temperature value:

  • TEMP_Cedar_OSFP0

  • TEMP_Cedar_OSFP1

  • TEMP_Cedar_OSFP2

  • TEMP_Cedar_OSFP3

  • TEMP_PCIE_CX7_1

  • TEMP_PCIE_CX7_2

  • TEMP_CX7_QSFP0

  • TEMP_CX7_QSFP1

  • TEMP_CX7_QSFP2

  • TEMP_CX7_QSFP3

  • TEMP_Intel_NIC

  • TEMP_NIC_QSFP0

  • TEMP_NIC_QSFP1

Workaround

No workaround is available.
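Scripts that collect these sensors may need to handle the No Reading value so that it is not parsed as a number. The following Python sketch shows one way to do so. It assumes that the readings have already been gathered into a dictionary of sensor name to raw text and that a valid reading is a plain numeric string; both the input shape and the function name are hypothetical.

```python
def split_readings(raw_readings):
    """Separate numeric temperatures from sensors that report No Reading.

    raw_readings maps a sensor name to its raw text value (assumed shape).
    Returns (values, missing): a dict of name to float, and a list of the
    sensor names that reported No Reading.
    """
    values = {}
    missing = []
    for name, raw in raw_readings.items():
        text = raw.strip()
        if text.lower() == "no reading":
            missing.append(name)
        else:
            values[name] = float(text)
    return values, missing
```

Keeping the missing sensors in a separate list lets a caller report them without treating them as a temperature of zero.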

NVMe Information Not Visible in BMC Web Interface

Issue

In some cases, the NVMe information is not visible in the BMC web interface.

Workaround

No workaround is available.