Resolved Issues#

The following issues that were previously identified as known issues have been resolved.

Redfish Service on BMC Might Experience Intermittent Unresponsiveness#

Issue#

An intermittent issue affects several nodes, where the BMC Redfish service might become unresponsive. When this happens, any attempt to query Redfish returns the following error:

"code": "Base.1.12.ServiceInUnknownState",
"message": "The operation failed because the service is in an unknown state and can no longer take incoming requests."
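To identify affected systems from a script, the response body can be searched for this error code. The following is a minimal sketch, not an official tool: the function name `redfish_in_unknown_state` is hypothetical, and the pattern assumes the code appears in the response body in the form shown above (any `Base.<version>` prefix is accepted).

```shell
# Hypothetical helper: succeed (exit 0) when the Redfish response on stdin
# carries the ServiceInUnknownState error code shown above.
redfish_in_unknown_state() {
    grep -q '"code"[[:space:]]*:[[:space:]]*"Base\.[0-9.]*ServiceInUnknownState"'
}

# Possible usage against a live BMC (placeholders must be filled in first):
#   curl -k -s -u <username>:<password> "https://$BMCIP/redfish/v1/Managers" \
#       | redfish_in_unknown_state && echo "BMC at $BMCIP is affected"
```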

Workaround#

For systems that have exhibited the "code": "Base.1.12.ServiceInUnknownState" failure, use one of the following options as part of updating to the new BMC firmware:

Option 1: Reinitialize the Redis database using the Redfish API

  1. Perform a Redis database reset by invoking the BMC/Actions/Oem/AMIManager.RedfishDBReset action.

    curl -k -u <username>:<password> --request POST --location "https://$BMCIP/redfish/v1/Managers/BMC/Actions/Oem/AMIManager.RedfishDBReset" --header 'Content-Type: application/json' --data '{"RedfishDBResetType":"ResetAll"}' | jq
    

    Example response:

    {
      "@odata.context": "/redfish/v1/$metadata#Task.Task",
      "@odata.id": "/redfish/v1/TaskService/Tasks/1",
      "@odata.type": "#Task.v1_4_2.Task",
      "Description": "Task for RedfishDBReset Task",
      "Id": "1",
      "Name": "RedfishDBReset Task",
      "TaskState": "New"
    }
    
  2. Wait approximately 3-4 minutes for the Redfish service to recover and stabilize.

  3. Run the following curl command to reboot the BMC.

    curl -k -u <username>:<password> --request POST --location "https://$BMCIP/redfish/v1/Managers/BMC/Actions/Manager.Reset" --header 'Content-Type: application/json' --data '{"ResetType": "ForceRestart"}'
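The wait in step 2 can be scripted instead of timed by hand. The sketch below is one possible approach, not part of the documented procedure: `task_state` and `retry_until` are hypothetical helper names, and polling the Redfish service root to detect recovery is an assumption, since the service may not answer task queries while the database is being reset.

```shell
# Hypothetical helper: print the TaskState value from a Redfish Task
# document on stdin (for example, the response shown in step 1).
task_state() {
    sed -n 's/.*"TaskState"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Hypothetical helper: retry_until <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry_until() {
    ru_attempts=$1
    ru_delay=$2
    shift 2
    ru_i=1
    while :; do
        "$@" && return 0
        [ "$ru_i" -ge "$ru_attempts" ] && return 1
        ru_i=$((ru_i + 1))
        sleep "$ru_delay"
    done
}

# Possible usage: poll for up to 4 minutes (24 attempts, 10 seconds apart)
# before rebooting the BMC in step 3. Placeholders must be filled in first.
#   retry_until 24 10 curl -k -s -f -o /dev/null \
#       -u <username>:<password> "https://$BMCIP/redfish/v1/"
```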
    

Option 2: Restore to defaults using IPMItool while preserving all configuration settings except Redfish

To preserve all configuration settings except the Redfish configuration using IPMItool:

  1. Get the current preserve configuration settings.

    $ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBB
    00 00 00 ff 77
    
  2. Set all preserve configuration settings except Redfish (byte 2, bit 6).

    $ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBA 0xff 0x37
    
  3. Get the preserve configuration settings again to confirm the change.

    $ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBB
    00 ff 37 ff 77
    
  4. Initiate a restore to defaults, which will cause the BMC to reboot.

    $ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0x66
    
  5. After the BMC finishes rebooting, restore the preserve configuration settings to their initial state.

    $ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBA 0x00 0x00
    
  6. Get the preserve configuration settings again to confirm that they match the initial values.

    $ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBB
    00 00 00 ff 77
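The value 0x37 used in step 2 is 0x77 with bit 6 cleared, which is how the Redfish setting is excluded from preservation. The arithmetic can be checked with the sketch below. The helper names `clear_bit` and `bit_is_set` are hypothetical, and the sketch assumes bits are numbered from 0 at the least significant bit.

```shell
# Hypothetical helper: clear_bit <byte> <bit-index>
# Prints the byte with the given bit cleared (bit 0 = least significant).
clear_bit() {
    printf '0x%02x\n' $(( $1 & ~(1 << $2) ))
}

# Hypothetical helper: bit_is_set <byte> <bit-index>
# Succeeds when the given bit is set in the byte.
bit_is_set() {
    [ $(( ($1 >> $2) & 1 )) -eq 1 ]
}

clear_bit 0x77 6    # prints 0x37
```

After step 3, `bit_is_set 0x37 6` failing is consistent with the Redfish bit being cleared in the second settings byte.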
    

Status#

This issue was resolved in version 25.06.4 and later.

Issues with ConnectX-7 Network (Cluster) Card Firmware#

Issue#

If the NVIDIA® ConnectX®-7 Network (Cluster) Card firmware version 28.39.3560 is currently installed on your DGX H100/H200 system, you might encounter the following issues:

  • After a long runtime on a DGX H100/H200 system, one or more GPUs might fall off the bus, and the nvidia-smi command fails to run. After a power cycle, the system recovers and all GPUs are operational. The system then runs again without issues for a long time.

  • After a reboot or power cycle, one or more OSFP ports on the DGX system might remain in the Down state.

Resolution#

To prevent these issues, NVIDIA recommends updating the firmware of the following ConnectX-7 network cards to version 28.42.1000:

NVIDIA ConnectX-7 Card    Version for the 24.09.1 Release    Recommended Version
Network (cluster) card    28.39.3560                         28.42.1000
Network (storage) card    28.39.3560                         28.42.1000

For more information, refer to DGX H100/H200 - Update for ConnectX-7 Networking Cards Available.
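Whether a card still needs the update comes down to a version comparison against 28.42.1000. The following is a minimal sketch of that comparison. The function name `fw_is_older` is hypothetical, the sketch assumes a `sort` that supports version ordering with `-V` (as GNU sort does), and how the installed version is read from the card is outside its scope.

```shell
# Hypothetical helper: fw_is_older <installed> <recommended>
# Succeeds when <installed> sorts strictly before <recommended> in version order.
fw_is_older() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

if fw_is_older 28.39.3560 28.42.1000; then
    echo "update recommended"
fi
```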

Platform DGX H200 Not Supported#

Issue#

On DGX H200 systems with nvfwupd version 2.0.1 installed, the following error message might appear when you update the firmware using the nvfwupd command.

Platform dgxh200 not supported.

Explanation#

Starting with nvfwupd version 2.0.1, the server type is required to update the firmware on new DGX platforms. An enhanced solution to automatically detect the server type for DGX platforms will be available in a future release.

Status#

Resolved in nvfwupd version 2.0.4.

The ipmitool dcmi power reading Command Returns 0 Power Reading Value#

Issue#

When you use the ipmitool dcmi power reading command to report the power consumption data, the command reports 0 Watts for the power reading value as shown in the following example:

$ sudo ipmitool -I lanplus -H IPaddress -U user -P password dcmi power reading
Instantaneous power reading:                             0 Watts
Minimum during sampling period:                          0 Watts
Maximum during sampling period:                       7852 Watts
Average power reading over sample period:             1885 Watts
IPMI timestamp:                             Jan 12 09:20:45 2024
Sampling period:                                00000005 Seconds
Power reading state is:                                activated
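To check for this symptom from a script, the instantaneous reading can be extracted from the command output. The sketch below assumes the output layout shown in the example above. The function name `instantaneous_watts` is hypothetical.

```shell
# Hypothetical helper: print the instantaneous power reading in watts from
# `ipmitool dcmi power reading` output supplied on stdin.
instantaneous_watts() {
    awk -F: '/Instantaneous power reading/ { gsub(/[^0-9]/, "", $2); print $2 }'
}

# Possible usage (placeholders as in the example above):
#   sudo ipmitool -I lanplus -H IPaddress -U user -P password dcmi power reading \
#       | instantaneous_watts
```

A result of 0 on a running system matches the behavior described in this issue.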

Status#

Resolved in version 24.09.1.

GPUs Show Exclamation Mark in BMC Web Interface#

Issue#

When you view the GPUs from the BMC web interface, the GPUs are shown with an exclamation mark.

Explanation#

The icon is a false positive. You can view the results of the nvsm show health command to confirm that the GPU status is healthy.

Status#

Resolved in version 1.1.3.

BMC LDAP Fields Do Not Support Space or Slash Characters#

Issue#

The BMC LDAP settings do not support the space or slash characters as part of the bind DN or search base. The following DN results in a failure:

DC=Echo Studios,DC=com
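On firmware that still has this limitation, a bind DN or search base can be screened before it is entered. This is a minimal sketch. The function name `dn_has_space_or_slash` is hypothetical, and the check covers only the two characters named above.

```shell
# Hypothetical helper: succeed when the given DN contains a space or a slash.
dn_has_space_or_slash() {
    case $1 in
        *" "*|*/*) return 0 ;;
        *) return 1 ;;
    esac
}

if dn_has_space_or_slash 'DC=Echo Studios,DC=com'; then
    echo "DN contains a space or slash"
fi
```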

Status#

Resolved in version 24.09.1.

NVMe Information Not Visible in BMC Web Interface#

Issue#

In some cases, the NVMe information is not visible in the BMC web interface.

Status#

Resolved in version 24.09.1.