Resolved Issues#
The following issues that were previously identified as known issues have been resolved.
Redfish Service on BMC Might Experience Intermittent Unresponsiveness#
Issue#
On several nodes, the BMC Redfish service might intermittently become unresponsive. When this happens, any attempt to query Redfish returns the following error:
"code": "Base.1.12.ServiceInUnknownState",
"message": "The operation failed because the service is in an unknown state and can no longer take incoming requests."
Workaround#
For systems that have exhibited the "code": "Base.1.12.ServiceInUnknownState" failure,
use one of the following options as part of updating to the new BMC firmware:
Option 1: Reinitialize the Redis database using the Redfish API
Perform a Redis database reset by invoking the BMC/Actions/Oem/AMIManager.RedfishDBReset action.
curl -k -u <username>:<password> --request POST --location "https://$BMCIP/redfish/v1/Managers/BMC/Actions/Oem/AMIManager.RedfishDBReset" --header 'Content-Type: application/json' --data '{"RedfishDBResetType":"ResetAll"}' | jq
Example response:
{
  "@odata.context": "/redfish/v1/$metadata#Task.Task",
  "@odata.id": "/redfish/v1/TaskService/Tasks/1",
  "@odata.type": "#Task.v1_4_2.Task",
  "Description": "Task for RedfishDBReset Task",
  "Id": "1",
  "Name": "RedfishDBReset Task",
  "TaskState": "New"
}
Wait approximately 3-4 minutes for the Redfish service to recover and stabilize.
Run the following curl command to reboot the BMC.
curl -k -u <bmc-user>:<password> --request POST --location 'https://<bmc-ip-address>/redfish/v1/Managers/BMC/Actions/Manager.Reset' --header 'Content-Type: application/json' --data '{"ResetType": "ForceRestart"}'
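A script that applies Option 1 might first confirm that a node is showing the failure, and then poll the service instead of sleeping for a fixed time. The following is a minimal sketch, not an NVIDIA-provided tool. The function names and the `BMCIP`, `BMC_USER`, and `BMC_PASS` variables are hypothetical, and the sketch assumes that the error body contains the `Base.1.12.ServiceInUnknownState` code quoted in this section. The default of 24 attempts at 10-second intervals is chosen only to match the 3 to 4 minute wait described above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and variables are illustrative assumptions.

# Return 0 if the Redfish response on stdin carries the unknown-state error code.
redfish_in_unknown_state() {
    grep -q 'Base\.1\.12\.ServiceInUnknownState'
}

# Poll the Redfish service root until it answers without the error code.
# Assumes BMCIP, BMC_USER, and BMC_PASS are set in the environment.
# Arguments: number of attempts (default 24), delay in seconds (default 10).
wait_for_redfish() {
    local tries=${1:-24} delay=${2:-10} body i
    for i in $(seq "$tries"); do
        body=$(curl -k -s -u "$BMC_USER:$BMC_PASS" "https://$BMCIP/redfish/v1/" || true)
        if [ -n "$body" ] && ! printf '%s' "$body" | redfish_in_unknown_state; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}

# Offline check against the error text quoted in this section.
sample='{"code": "Base.1.12.ServiceInUnknownState"}'
if printf '%s' "$sample" | redfish_in_unknown_state; then
    echo "unknown state detected"
fi
```

In practice, `wait_for_redfish` would be called after the database reset and before the BMC reboot. It is defined here but not invoked, so the block runs without network access.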
Option 2: Restore to defaults using IPMItool while preserving all configuration settings except Redfish
To preserve all configuration settings except the Redfish configuration using IPMItool:
Get the current preserve configuration settings.
$ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBB
00 00 00 ff 77
Set all preserve configuration settings except Redfish (byte 2, bit 6).
$ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBA 0xff 0x37
Get the current preserve configuration settings again to confirm the change.
$ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBB
00 ff 37 ff 77
Initiate a restore to defaults, which will cause the BMC to reboot.
$ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0x66
After the BMC finishes rebooting, restore all settings to their initial state.
$ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBA 0x00 0x00
Get the current preserve configuration settings again to confirm that they are back to their initial state.
$ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBB
00 00 00 ff 77
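The second byte passed to the set command (0x37) follows from clearing bit 6 in that byte. As a small worked example, the sketch below assumes that 0x77 is the full mask for byte 2, which is an inference from the `ff 77` values shown with the get command above and not a documented definition. The variable names are illustrative.

```shell
#!/usr/bin/env bash
# Worked example: derive the second preserve-configuration byte.
# Assumption: 0x77 is the full mask for byte 2 and bit 6 is the Redfish bit.

full_mask=0x77
redfish_bit=6

# Clear the Redfish bit so that the Redfish configuration is not preserved.
preserve_byte=$(( full_mask & ~(1 << redfish_bit) ))

printf '0x%02x\n' "$preserve_byte"   # prints 0x37
```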
Status#
This issue was resolved in versions 25.06.4 and later.
Issues with ConnectX-7 Network (Cluster) Card Firmware#
Issue#
If the NVIDIA® ConnectX®-7 Network (Cluster) Card firmware version 28.39.3560 is currently installed on your DGX H100/H200 system, you might encounter the following issues:
After a long runtime on a DGX H100/H200 system, one or more GPUs might fall off the bus, and the nvidia-smi command fails to run. After a power cycle, the system recovers, all GPUs are operational, and the system again runs without issues for a long time.
After a reboot or power cycle, one or more OSFP ports on the DGX system might remain in the Down state.
Resolution#
To prevent these issues, NVIDIA recommends updating the firmware of the following ConnectX-7 network cards to version 28.42.1000:
| NVIDIA ConnectX-7 Card | Version for the 24.09.1 Release | Recommended Version |
|---|---|---|
| Network (cluster) card | 28.39.3560 | 28.42.1000 |
| Network (storage) card | 28.39.3560 | 28.42.1000 |
For more information, refer to DGX H100/H200 - Update for ConnectX-7 Networking Cards Available.
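To decide whether a card needs the update, the installed version can be compared with the recommended one. The following is a hedged sketch: `fw_is_older` is a hypothetical helper, it relies on GNU `sort -V` for version ordering, and how the installed version string is obtained is left out. The sample values come from the table above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if version $1 is strictly older than version $2.
# Relies on GNU sort -V (version sort) from coreutils.
fw_is_older() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

recommended="28.42.1000"
installed="28.39.3560"   # sample value taken from the table above

if fw_is_older "$installed" "$recommended"; then
    echo "ConnectX-7 firmware $installed is older than $recommended; update recommended"
fi
```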
Platform DGX H200 Not Supported#
Issue#
On DGX H200 systems with nvfwupd version 2.0.1 installed, the following error
message might appear when you update the firmware using the nvfwupd command.
Platform dgxh200 not supported.
Explanation#
Starting with nvfwupd version 2.0.1, the server type is required to update the firmware
on new DGX platforms. An enhanced solution to automatically detect the server type for DGX platforms
will be available in a future release.
Status#
Resolved in nvfwupd version 2.0.4.
The ipmitool dcmi power reading Command Returns 0 Power Reading Value#
Issue#
When you use the ipmitool dcmi power reading command to report the power consumption data,
the command reports 0 Watts for the power reading value as shown in the following example:
$ sudo ipmitool -I lanplus -H IPaddress -U user -P password dcmi power reading
Instantaneous power reading: 0 Watts
Minimum during sampling period: 0 Watts
Maximum during sampling period: 7852 Watts
Average power reading over sample period: 1885 Watts
IPMI timestamp: Jan 12 09:20:45 2024
Sampling period: 00000005 Seconds
Power reading state is: activated
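A monitoring script could flag this symptom by extracting the instantaneous value from the command output. The sketch below is illustrative only: `instantaneous_watts` is a hypothetical helper, and it assumes the output keeps the "Instantaneous power reading: N Watts" layout shown in the example above. It parses saved text, so it does not contact a BMC.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the instantaneous reading (in watts) from
# `ipmitool dcmi power reading` output supplied on stdin.
instantaneous_watts() {
    awk -F': *' '/Instantaneous power reading/ { split($2, f, " "); print f[1]; exit }'
}

# Sample lines copied from the example output in this section.
sample='Instantaneous power reading: 0 Watts
Minimum during sampling period: 0 Watts
Maximum during sampling period: 7852 Watts
Average power reading over sample period: 1885 Watts'

watts=$(printf '%s\n' "$sample" | instantaneous_watts)
if [ "$watts" -eq 0 ]; then
    echo "instantaneous reading is 0 W; firmware may be affected"
fi
```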
Status#
Resolved in version 24.09.1.
GPUs Show Exclamation Mark in BMC Web Interface#
Issue#
When you view the GPUs from the BMC web interface, the GPUs are shown
with an exclamation mark icon.
Explanation#
The icon is a false positive.
You can view the results of the nvsm show health command to confirm that the GPU status is healthy.
Status#
Resolved in version 1.1.3.
BMC LDAP Fields Do Not Support Space or Slash Characters#
Issue#
The BMC LDAP settings do not support the space or slash characters as part of the bind DN or search base. The following DN results in a failure:
DC=Echo Studios,DC=com
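On affected firmware, a DN can be checked before it is entered in the BMC LDAP settings. This is a minimal sketch with a hypothetical function name. It assumes that "slash" means the forward slash character, which the source does not state explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the DN contains a space or a forward slash.
# Assumption: "slash" in the issue description refers to "/".
dn_has_unsupported_chars() {
    case "$1" in
        *' '*|*/*) return 0 ;;
        *) return 1 ;;
    esac
}

if dn_has_unsupported_chars 'DC=Echo Studios,DC=com'; then
    echo "DN contains a space or slash"
fi
```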
Status#
Resolved in version 24.09.1.
NVMe Information Not Visible in BMC Web Interface#
Issue#
In some cases, the NVMe information is not visible in the BMC web interface.
Status#
Resolved in version 24.09.1.