Known Issues#

Functional Issues#

  • You cannot update the firmware of individual components of the DGX B300 GPU tray. For example, you cannot update the GPU firmware on its own. You must update the firmware by flashing the entire DGX B300 GPU tray.

  • Firmware download is not automatic. You must download the firmware manually from the NVIDIA Enterprise Support Portal.

Redfish BIOS Boot Order Update Not Retained After Reboot#

Issue#

Updating the BIOS boot order via Redfish might fail to take effect after a system reboot.

When initially checking the current BIOS boot order using Redfish, the reported order correctly matches the output from the local OS utility, efibootmgr. After successfully submitting a new boot order via the Redfish API and performing a system reboot, the actual boot sequence used by the system does not reflect the order set via Redfish. Specifically, the system does not boot from PXE as intended (for example, PXE/Boot0004), and the boot order reported by Redfish and efibootmgr becomes inconsistent.
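
When diagnosing this issue, it can help to compare the two views of the boot order mechanically. The following is a minimal sketch, not an NVIDIA-provided tool: the hypothetical helper efi_boot_order converts the BootOrder line of efibootmgr output into Boot#### names such as Boot0004. It assumes that efibootmgr prints a line of the form "BootOrder: 0004,0001,...".

```shell
# efi_boot_order (hypothetical helper): reads efibootmgr output on stdin and
# prints the boot order as space-separated Boot#### names, for example
# "Boot0004 Boot0001".
# Assumption: the output contains a line such as "BootOrder: 0004,0001,0000".
efi_boot_order() {
  sed -n 's/^BootOrder:[[:space:]]*//p' \
    | tr ',' '\n' \
    | sed 's/^/Boot/' \
    | tr '\n' ' ' \
    | sed 's/ $//'
}
```

Running efibootmgr | efi_boot_order on the host produces a list that can be compared with the Boot.BootOrder property that Redfish reports for the system resource. The path of that resource is not given here and depends on the platform.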

Workaround#

To configure the boot order:

  1. Enter the SBIOS setup.

  2. Navigate to the Boot menu (or screen).

  3. Specify your preferred boot sequence.

Instantaneous Power Reading Persists#

Issue#

The instantaneous power consumption reported by the system does not change when the system power is OFF while the chassis power remains ON.

A resolution to this issue is under investigation and will be provided in an upcoming release.

Custom Domain Policy Creation Issue Encountered#

Issue#

When transitioning from DGX B200 to DGX B300, an attempt to create a custom domain policy using a JSON file results in an error similar to:

"message": " /redfish/v1/Managers/BMC/NodeManager/Domains is failed, Server provided invalid data.
Please try again after sometime."

Custom domain policies are not supported in the current release. This functionality is slated for an upcoming release.

Intermittent Exceptions During Complete Firmware Update#

Issue#

Executing the complete firmware update (“Update All” operation) might intermittently yield exceptions.

Workaround#

When encountering this issue, retry the operation. A subsequent attempt is typically successful.

BlueField-3 DPU Firmware Version Not Displayed in Inventory After Upgrade#

Issue#

When a BlueField-3 DPU is upgraded to a firmware version greater than 32.43.2024, the system’s inventory might not accurately display the new firmware version. The inventory might continue to show the previously installed version or no version information for the DPU.

Workaround#

No workaround is currently available to make the inventory display firmware versions later than 32.43.2024. A resolution for this issue is under investigation and will be provided in a future release.

Redfish Service on BMC Might Experience Intermittent Unresponsiveness#

Issue#

On several nodes, the BMC Redfish service might intermittently become unresponsive. When this happens, any attempt to query Redfish returns the following error:

"code": "Base.1.12.ServiceInUnknownState",
"message": "The operation failed because the service is in an unknown state and can no longer take incoming requests."
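
A script can recognize this condition by looking for the error code in the response body. The following is a minimal sketch: the helper name redfish_in_unknown_state is hypothetical, and it matches only the ServiceInUnknownState suffix on the assumption that the Base registry version prefix could differ between firmware releases.

```shell
# redfish_in_unknown_state (hypothetical helper): reads a Redfish response
# body on stdin and exits 0 if it contains the ServiceInUnknownState error
# code, or non-zero otherwise.
redfish_in_unknown_state() {
  grep -q 'ServiceInUnknownState'
}
```

For example, the output of a curl query against "https://$BMCIP/redfish/v1/" can be piped into redfish_in_unknown_state, and the workaround applied only when the helper succeeds.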

Workaround#

For systems that have exhibited the Base.1.12.ServiceInUnknownState failure, use one of the following options as part of updating to the new BMC firmware:

Option 1: Reinitialize the Redis database using the Redfish API

  1. Perform a Redis database reset by invoking the BMC/Actions/Oem/AMIManager.RedfishDBReset action.

    curl -k -u <username>:<password> --request POST --location "https://$BMCIP/redfish/v1/Managers/BMC/Actions/Oem/AMIManager.RedfishDBReset" --header 'Content-Type: application/json' --data '{"RedfishDBResetType":"ResetAll"}' | jq
    

    Example response:

    {
      "@odata.context": "/redfish/v1/$metadata#Task.Task",
      "@odata.id": "/redfish/v1/TaskService/Tasks/1",
      "@odata.type": "#Task.v1_4_2.Task",
      "Description": "Task for RedfishDBReset Task",
      "Id": "1",
      "Name": "RedfishDBReset Task",
      "TaskState": "New"
    }
    
  2. Wait approximately 3-4 minutes for the Redfish service to recover and stabilize.

  3. Run the following curl command to reboot the BMC.

    curl -k -u <username>:<password> --request POST --location "https://$BMCIP/redfish/v1/Managers/BMC/Actions/Manager.Reset" --header 'Content-Type: application/json' --data '{"ResetType": "ForceRestart"}'
    

Option 2: Restore to defaults using IPMItool while preserving all configuration settings except Redfish

To preserve all configuration settings except the Redfish configuration using IPMItool:

  1. Get the current preserve configuration settings.

    $ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBB
    00 00 00 ff 77
    
  2. Set all preserve configuration settings except Redfish (byte 2, bit 6).

    $ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBA 0xff 0x37
    
  3. Verify that the new preserve configuration settings are in effect.

    $ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBB
    00 ff 37 ff 77
    
  4. Initiate a restore to defaults, which will cause the BMC to reboot.

    $ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0x66
    
  5. After the BMC finishes rebooting, restore the preserve configuration settings to their initial values.

    $ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBA 0x00 0x00
    
  6. Confirm that the preserve configuration settings match the initial values from step 1.

    $ ipmitool -H $BMCIP -I lanplus -U <username> -P <password> raw 0x32 0xBB
    00 00 00 ff 77
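
The value 0x37 passed in step 2 is consistent with clearing bit 6 of 0x77, the last byte in the step 1 output. The sketch below reproduces that arithmetic. It assumes that bits are numbered from 0 at the least significant end, and the helper name clear_bit is hypothetical. The mapping of individual bits to settings is taken from the step text only.

```shell
# clear_bit VALUE BIT (hypothetical helper): prints VALUE with bit number BIT
# cleared, formatted as a two-digit hexadecimal byte.
# Assumption: bit 0 is the least significant bit.
clear_bit() {
  printf '0x%02x\n' $(( $1 & ~(1 << $2) ))
}

# Clearing bit 6 of 0x77 yields the value used in step 2.
clear_bit 0x77 6   # prints 0x37
```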
    

Misleading Messages During Firmware Update#

Issue#

After a ConnectX-7 firmware update is applied, the following messages indicate that a reboot is required: "To load new FW, run mlxfwreset or reboot machine." and "Please reboot machine to load new configurations." However, for ConnectX-7 firmware versions 28.36.1010 and later, rebooting the system does not properly load the updated firmware or the new configurations.

Workaround#

For the firmware update and new configurations to load successfully, perform an AC power cycle on the system instead of rebooting.

Firmware Inventory Can Be Invalid During Boot#

Issue#

In rare instances, polling the firmware inventory endpoint of the BMC Redfish API while the system is booting can report an inaccurate firmware version for the HGX_0 component.

Workaround#

Query the firmware inventory after the system completes the boot sequence to retrieve the current firmware inventory.

BMC Slow Startup After AC Power Cycle#

Issue#

The BMC is typically available within three minutes after an AC power cycle. In some cases, it can require approximately 10 minutes before it is available for communication.
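
Automation that contacts the BMC after an AC power cycle might need to allow for the longer delay. The following is a minimal sketch of a bounded wait: the helper name wait_for and the probe command in the comment are illustrative assumptions, not part of any NVIDIA tooling.

```shell
# wait_for LIMIT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...] (hypothetical helper)
# Runs COMMAND repeatedly until it exits 0. Returns 1 once the time spent
# sleeping between attempts reaches LIMIT_SECONDS. The run time of COMMAND
# itself is not counted.
wait_for() {
  limit=$1
  interval=$2
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$limit" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Example (assumed probe): allow up to 10 minutes, checking every 15 seconds.
#   wait_for 600 15 curl -k -s -f -o /dev/null "https://$BMCIP/redfish/v1/"
```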

Workaround#

No workaround is available.

Temperature Sensors Can Report No Reading#

Issue#

The following sensors, which include power, fan speed, and status sensors as well as temperature sensors, can report No Reading rather than a value:

  • TEMP_PSU4

  • TEMP_PSU5

  • PWR_PSU5

  • SPD_FAN_PSU5_R

  • STATUS_PSU0

  • STATUS_PSU1

  • STATUS_PSU2

  • STATUS_PSU3

  • STATUS_PSU4

  • STATUS_PSU5

  • STATUS_HMC

  • TEMP_PCIE_SW_1

  • TEMP_Cedar_OSFP0

  • TEMP_Cedar_OSFP1

  • TEMP_Cedar_OSFP2

  • TEMP_Cedar_OSFP3

  • TEMP_PCIE_CX7_1

  • TEMP_PCIE_CX7_2

  • TEMP_CX7_QSFP0

  • TEMP_CX7_QSFP1

  • TEMP_CX7_QSFP2

  • TEMP_CX7_QSFP3

  • TEMP_Intel_NIC

  • TEMP_NIC_QSFP0

  • TEMP_NIC_QSFP1

Workaround#

Polling the sensors again can resolve the issue.
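
To see which sensors still lack a value after a poll, the sensor listing can be filtered. The following is a minimal sketch, assuming that each listing line has the form "NAME | reading | status" and that a missing value appears as "no reading" in any letter case. The helper name sensors_without_reading is hypothetical.

```shell
# sensors_without_reading (hypothetical helper): reads sensor listing lines on
# stdin and prints the name of each sensor whose reading column says
# "no reading" (case-insensitive).
# Assumption: columns are separated by "|" as in "NAME | reading | status".
sensors_without_reading() {
  awk -F'|' 'tolower($2) ~ /no reading/ {
    gsub(/^[ \t]+|[ \t]+$/, "", $1)   # trim padding around the sensor name
    print $1
  }'
}
```

For example, the output of ipmitool -H $BMCIP -I lanplus -U <username> -P <password> sdr list can be piped into sensors_without_reading. An empty result on a later poll indicates that every listed sensor returned a value.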