Known Issues#
Functional Issues#
You cannot update the firmware of individual components of the DGX H100/H200 GPU tray. For example, you cannot update the GPU firmware by itself. You must update the firmware by flashing the entire DGX H100/H200 GPU tray.
Firmware download is not automatic. You must download the firmware manually from the NVIDIA Enterprise Support Portal.
For systems running DGX OS 6.0, the nvfwupd command-line utility that is shown in sample commands is not automatically installed. You must download the utility from the NVIDIA Enterprise Support Portal. For systems running DGX OS version 6.1 or later, the nvfwupd command-line utility is included with the operating system.
Redfish Service on BMC Might Experience Intermittent Unresponsiveness#
Issue#
On several nodes, the BMC Redfish service might intermittently become unresponsive. When this happens, any attempt to query Redfish returns the following error:
"code": "Base.1.12.ServiceInUnknownState",
"message": "The operation failed because the service is in an unknown state and can no longer take incoming requests."
Workaround#
For systems that have exhibited the "code": "Base.1.12.ServiceInUnknownState" failure, complete one of the following options as part of updating the BMC firmware:
Option 1: Reinitialize the Redis database using the Redfish API
Perform a Redis database reset by invoking the BMC/Actions/Oem/AMIManager.RedfishDBReset action.
curl -k -u <username>:<password> --request POST --location "https://$BMCIP/redfish/v1/Managers/BMC/Actions/Oem/AMIManager.RedfishDBReset" --header 'Content-Type: application/json' --data '{"RedfishDBResetType":"ResetAll"}' | jq
Example response:
{
  "@odata.context": "/redfish/v1/$metadata#Task.Task",
  "@odata.id": "/redfish/v1/TaskService/Tasks/1",
  "@odata.type": "#Task.v1_4_2.Task",
  "Description": "Task for RedfishDBReset Task",
  "Id": "1",
  "Name": "RedfishDBReset Task",
  "TaskState": "New"
}
Wait approximately 3-4 minutes for the Redfish service to recover and stabilize.
Run the following curl command to reboot the BMC.
curl -k -u <bmc-user>:<password> --request POST --location 'https://<bmc-ip-address>/redfish/v1/Managers/BMC/Actions/Manager.Reset' --header 'Content-Type: application/json' --data '{"ResetType": "ForceRestart"}'
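The three steps of Option 1 can be chained in a small script. The following is a minimal sketch, not a supported tool: the function name redfish_db_reset_and_reboot and its argument order are hypothetical, and the default wait of 240 seconds is an assumption taken from the upper end of the 3-4 minute estimate above.

```shell
# Run the Option 1 sequence: reset the Redfish database, wait, reboot the BMC.
# Arguments: BMC address, user name, password, seconds to wait (default 240).
# The function name and argument order are illustrative assumptions.
redfish_db_reset_and_reboot() {
    rdr_base="https://$1/redfish/v1/Managers/BMC/Actions"
    rdr_wait=${4:-240}

    # Step 1: reset the Redfish database.
    curl -k -s -u "$2:$3" --request POST \
        --header 'Content-Type: application/json' \
        --data '{"RedfishDBResetType":"ResetAll"}' \
        "$rdr_base/Oem/AMIManager.RedfishDBReset" || return 1

    # Step 2: give the Redfish service time to recover and stabilize.
    sleep "$rdr_wait"

    # Step 3: reboot the BMC.
    curl -k -s -u "$2:$3" --request POST \
        --header 'Content-Type: application/json' \
        --data '{"ResetType": "ForceRestart"}' \
        "$rdr_base/Manager.Reset"
}

# Example (not run here):
#   redfish_db_reset_and_reboot <bmc-ip-address> <username> <password>
```

The function stops before the wait if the first request cannot be sent, so the BMC is not rebooted when the database reset never reached it.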
Option 2: Restore to defaults using IPMItool while preserving all configuration settings except Redfish
To preserve all configuration settings except the Redfish configuration using IPMItool:
Get the current preserve configuration settings.
$ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBB
00 00 00 ff 77
Set all preserve configuration settings except Redfish (byte 2, bit 6).
$ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBA 0xff 0x37
Verify that the preserve configuration settings reflect the change.
$ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBB
00 ff 37 ff 77
Initiate a restore to defaults, which will cause the BMC to reboot.
$ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0x66
After the BMC finishes rebooting, restore the preserve configuration settings to their initial state.
$ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBA 0x00 0x00
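Confirm that the preserve configuration settings match their initial values.
$ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBA 0x00 0x00
$ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBB
00 00 00 ff 77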
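The value 0x37 in the set command can be checked with a little bit arithmetic. The sketch below assumes that the trailing bytes ff 77 shown with the 0xBB get command describe the settings that can be preserved, and that bits are counted from 0. Neither assumption is confirmed by this document. The helper name clear_bit is hypothetical.

```shell
# Clear a single bit in a one-byte mask and print the result as hex.
# Arguments: mask (for example 0x77), bit position counted from 0.
# Assumption: 0x77 is the full preserve mask for the second byte.
clear_bit() {
    printf '0x%02x\n' $(( $1 & ~(1 << $2) ))
}

# Clearing bit 6 (the Redfish setting) of 0x77:
clear_bit 0x77 6   # prints 0x37
```

Under these assumptions, the result matches the second byte that the set command (0xBA) sends.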
USB1 Port Missing on Occasion after a BMC Cold Reset#
Issue#
On a DGX system with BMC version 24.09.17 and HMC version rc67, the USB1 port occasionally becomes unreachable, as shown on the BMC console, after you run a BMC cold reset:
ipmitool -H <bmc-ip-address> -I lanplus -U <bmc-username> -P <bmc-password> mc reset cold
Explanation#
On a cold reset, the HMC might reset as well, resulting in a short delay in baseboard telemetry.
Workaround#
Periodically issue the following command to determine if the HMC is up. When the command returns a response, the HMC is operating.
curl -k -u <bmc-user>:<password> --request GET 'https://<bmc-ip-address>/redfish/v1/Chassis/HGX_BMC_0'
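The periodic check can be automated with a polling loop. This is a minimal sketch under stated assumptions: the function name wait_for_hmc, the default of 30 attempts, and the 10-second interval are hypothetical, and the loop treats an HTTP 200 status from a read-only GET request as the sign that the HMC is operating.

```shell
# Poll the HGX_BMC_0 chassis resource until it answers with HTTP 200.
# Arguments: BMC address, user name, password,
#            maximum attempts (default 30), seconds between attempts (default 10).
# The defaults and the HTTP 200 success criterion are assumptions.
wait_for_hmc() {
    wfh_url="https://$1/redfish/v1/Chassis/HGX_BMC_0"
    wfh_tries=${4:-30}
    wfh_delay=${5:-10}
    wfh_n=0
    while [ "$wfh_n" -lt "$wfh_tries" ]; do
        # -w '%{http_code}' prints only the status code; 000 means no response.
        wfh_code=$(curl -k -s -o /dev/null -w '%{http_code}' \
            -u "$2:$3" "$wfh_url") || wfh_code=000
        if [ "$wfh_code" = "200" ]; then
            return 0
        fi
        wfh_n=$((wfh_n + 1))
        sleep "$wfh_delay"
    done
    return 1
}

# Example (not run here):
#   wait_for_hmc <bmc-ip-address> <bmc-user> <password> && echo "HMC is up"
```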
Misleading Messages During Firmware Update#
Issue#
After the ConnectX-7 firmware update is applied, the following messages suggest that a reboot is required: "To load new FW, run mlxfwreset or reboot machine." and "Please reboot machine to load new configurations." However, rebooting the system does not properly load the firmware update or the new configurations for ConnectX-7 firmware versions 28.36.1010 and later.
Workaround#
For the firmware update and new configurations to load successfully, perform an AC power cycle on the system instead of rebooting.
Sensors Endpoint for the Redfish API Does Not Support $expand#
Issue#
An HTTP GET request to the sensors endpoint with an $expand argument like the following fails.
/redfish/v1/Chassis/DGX/Sensors?$expand=.($levels=3)
Workaround#
Request sensor data from the Redfish API one sensor at a time. Alternatively, use IPMItool to request sensor data.
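Requesting one sensor at a time can be scripted by reading the sensor collection without $expand and then fetching each member. The following sketch assumes that the collection lists its members as URIs of the form /redfish/v1/Chassis/DGX/Sensors/<sensor-id>. The function names sensor_uris and fetch_sensors_one_by_one are hypothetical.

```shell
# Print the URI of each sensor found in a Redfish sensor collection on stdin.
# Assumption: member URIs look like /redfish/v1/Chassis/DGX/Sensors/<sensor-id>.
sensor_uris() {
    grep -o '"/redfish/v1/Chassis/DGX/Sensors/[^"]*"' | tr -d '"'
}

# Fetch the collection, then request each sensor individually.
# Arguments: BMC address, user name, password.
fetch_sensors_one_by_one() {
    curl -k -s -u "$2:$3" "https://$1/redfish/v1/Chassis/DGX/Sensors" \
        | sensor_uris \
        | while read -r fs_uri; do
              curl -k -s -u "$2:$3" "https://$1$fs_uri"
              echo
          done
}

# Example (not run here):
#   fetch_sensors_one_by_one <bmc-ip-address> <bmc-user> <password>
```

The pattern skips the collection's own URI because it requires a sensor name after the final slash.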
Firmware Upgrade or Downgrade Can Fail#
Issue#
When you perform a firmware upgrade or downgrade, the change can fail with a message like the following example:
...
[Sat 19 Aug 2023 08:20:50 AM CST] Firmware update task ended with state Exception, percentComplete: [98]
[Sat 19 Aug 2023 08:20:50 AM CST] Update RC: 1
[Sat 19 Aug 2023 08:20:50 AM CST] Collect RF task
[Sat 19 Aug 2023 08:21:01 AM CST] Update failed with [nvfw_DGX-H100_0005_230615.1.0_dbg-signed.fwpkg]:[/redfish/v1/UpdateService/FirmwareInventory/EROT_BMC_0]
Workaround#
Retry the firmware upgrade or downgrade.
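If you script firmware changes, a generic retry wrapper can repeat the update command for you. This is a minimal sketch: the function name retry is hypothetical, and the attempt limit and delay are values you choose. The wrapper assumes that the update command exits with a nonzero status on failure.

```shell
# Run a command until it succeeds or the attempt limit is reached.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Assumption: COMMAND signals failure with a nonzero exit status.
retry() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_n=1
    while ! "$@"; do
        if [ "$retry_n" -ge "$retry_max" ]; then
            echo "giving up after $retry_n attempts" >&2
            return 1
        fi
        retry_n=$((retry_n + 1))
        sleep "$retry_delay"
    done
    return 0
}

# Example (not run here), where the command is the same firmware update
# command that you ran originally:
#   retry 3 60 <your-firmware-update-command>
```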
Firmware Inventory Can Be Invalid During Boot#
Issue#
In rare instances, polling the firmware inventory endpoint of the BMC Redfish API during boot can report inaccurate firmware versions for the HGX_0 component.
Workaround#
Query the firmware inventory after the system completes the boot sequence to retrieve the current firmware inventory.
BMC Slow Startup After AC Power Cycle#
Issue#
After an AC power cycle, the BMC can require approximately 10 minutes before it is available for communication. The BMC is typically available within three minutes.
Workaround#
No workaround is available.
Temperature Sensors Can Report No Reading#
Issue#
The following sensors can report No Reading rather than a value:
TEMP_PSU4
TEMP_PSU5
PWR_PSU5
SPD_FAN_PSU5_R
STATUS_PSU0
STATUS_PSU1
STATUS_PSU2
STATUS_PSU3
STATUS_PSU4
STATUS_PSU5
STATUS_HMC
TEMP_PCIE_SW_1
TEMP_Cedar_OSFP0
TEMP_Cedar_OSFP1
TEMP_Cedar_OSFP2
TEMP_Cedar_OSFP3
TEMP_PCIE_CX7_1
TEMP_PCIE_CX7_2
TEMP_CX7_QSFP0
TEMP_CX7_QSFP1
TEMP_CX7_QSFP2
TEMP_CX7_QSFP3
TEMP_Intel_NIC
TEMP_NIC_QSFP0
TEMP_NIC_QSFP1
Workaround#
Polling the sensors again can resolve the issue.
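To see which sensors still need another poll, you can filter the sensor listing. The sketch below assumes pipe-separated output in the form name | reading | status, such as the output of the ipmitool sdr command, and matches the reading column without regard to case. The function name sensors_without_reading is hypothetical.

```shell
# Print the names of sensors whose reading column says "no reading"
# (case-insensitive) in pipe-separated sensor output read from stdin.
# Assumption: each input line has the form "name | reading | status".
sensors_without_reading() {
    awk -F'|' 'tolower($2) ~ /no reading/ {
        gsub(/^ +| +$/, "", $1)   # trim padding around the sensor name
        print $1
    }'
}

# Example (not run here):
#   ipmitool -H <bmc-ip-address> -I lanplus -U <bmc-username> -P <bmc-password> sdr \
#       | sensors_without_reading
```

An empty result means that every listed sensor returned a reading on that poll.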