Known Issues#

Functional Issues#

  • You cannot update the firmware of individual components of the DGX B300 GPU tray. For example, you cannot update the GPU firmware on its own. You must update the firmware by flashing the entire DGX B300 GPU tray.

  • Firmware download is not automatic. You must download the firmware manually from the NVIDIA Enterprise Support Portal.

PWR_SYSTEM Reports 0W Power Reading in Busbar System#

Issue#

In a busbar system, the PWR_SYSTEM reading is pulled directly from IPMI sensors rather than from the telemetry service. Consequently, the reported value remains at 0 W.

Workaround#

Reconfigure PWR_SYSTEM to pull from the telemetry service. Total system power is now calculated by summing the following metrics:

  • PWR_50V_PDB_HSC

  • PWR_PDB_MB_HSC

  • PWR_PDB_FAN_HSC

  • PWR_GB1_TOT_HSC
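
The summation above can be sketched as follows. This is a minimal illustration, not the BMC's implementation: the four metric names come from the list above, while the `readings` dictionary and the `total_system_power` function are hypothetical stand-ins for however your tooling obtains values from the telemetry service.

```python
# Metric names are taken from the list above. The shape of `readings`
# (metric name -> watts) is an assumption made for illustration.
POWER_METRICS = (
    "PWR_50V_PDB_HSC",
    "PWR_PDB_MB_HSC",
    "PWR_PDB_FAN_HSC",
    "PWR_GB1_TOT_HSC",
)


def total_system_power(readings):
    """Return the sum, in watts, of the four component power metrics."""
    missing = [name for name in POWER_METRICS if name not in readings]
    if missing:
        # Refuse to report a partial total that would look like a valid value.
        raise KeyError(f"missing power metrics: {', '.join(missing)}")
    return sum(float(readings[name]) for name in POWER_METRICS)
```

Raising on a missing metric, rather than treating it as 0, avoids reproducing the misleading low reading that this issue describes.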

Custom Domain Policy Creation Issue Encountered#

Issue#

When transitioning from DGX B200 to DGX B300, an attempt to create a custom domain policy using a JSON file results in an error similar to:

"message": " /redfish/v1/Managers/BMC/NodeManager/Domains is failed, Server provided invalid data.
Please try again after sometime."

Custom domain policies are not supported in the current release. This functionality is slated for an upcoming release.

Intermittent Timeout During Update All Firmware Process#

Issue#

Executing the complete firmware update (using the update all operation) might intermittently result in exceptions or timeout errors. A timeout error occurs when the system fails to receive a response from the firmware task status service (#/FwTaskStatus) within the 120-second window.

Error Message:

"Message":"The timeout duration (120s) was exceeded before the operation on #/FwTaskStatus responded."

Workaround#

If you encounter this issue, retry the operation. A subsequent attempt typically succeeds without further intervention. If the error message is returned for a TaskID, query the firmware task status service (TaskStatus) again.
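
The retry described above can be sketched as a small wrapper. This is an illustration under stated assumptions: `operation` is a hypothetical zero-argument callable supplied by your own tooling (for example, one that starts the update or queries the task status), and it is assumed to raise `TimeoutError` when the 120-second window is exceeded. The attempt count and delay are arbitrary defaults, not documented values.

```python
import time


def retry_on_timeout(operation, attempts=3, delay_s=5.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    `operation` is a hypothetical callable that raises TimeoutError
    when the firmware task status service does not respond in time.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts:
                raise  # Out of retries: surface the last timeout.
            sleep(delay_s)  # Brief pause before the next attempt.
```

Only timeouts are retried. Any other exception propagates immediately so that unrelated failures are not masked.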

Firmware Inventory Can Be Invalid During Boot#

Issue#

In rare instances, polling the firmware inventory endpoint of the BMC Redfish API while the system is booting can report inaccurate firmware versions for the HGX_0 component.

Workaround#

Query the firmware inventory after the system completes the boot sequence to retrieve the current firmware inventory.

BMC Slow Startup After AC Power Cycle#

Issue#

After an AC power cycle, the BMC can require approximately 10 minutes before it is available for communication. Under normal conditions, the BMC is available within 3 minutes.

Workaround#

No workaround is available.
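
Although there is no workaround, automation that contacts the BMC after an AC power cycle can allow for the longer startup. The following is a minimal sketch, not a documented procedure: `is_reachable` is a hypothetical callable that you supply (for example, one that attempts a connection to the BMC and returns True on success), and the 600-second default is an assumption derived from the approximate 10-minute figure above.

```python
import time


def wait_for_bmc(is_reachable, timeout_s=600.0, interval_s=15.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll is_reachable() until it returns True or timeout_s elapses.

    Returns the number of seconds waited. Raises TimeoutError if the
    BMC is still unreachable once the timeout has passed.
    """
    start = clock()
    while True:
        if is_reachable():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"BMC not reachable after {timeout_s:.0f} s")
        sleep(interval_s)
```

The `clock` and `sleep` parameters exist only so that the loop can be exercised without real delays. In normal use the defaults apply.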

Temperature Sensors Can Report No Reading#

Issue#

The following sensors can report No Reading rather than a value:

  • TEMP_PSU4

  • TEMP_PSU5

  • PWR_PSU5

  • SPD_FAN_PSU5_R

  • STATUS_PSU0

  • STATUS_PSU1

  • STATUS_PSU2

  • STATUS_PSU3

  • STATUS_PSU4

  • STATUS_PSU5

  • STATUS_HMC

  • TEMP_PCIE_SW_1

  • TEMP_Cedar_OSFP0

  • TEMP_Cedar_OSFP1

  • TEMP_Cedar_OSFP2

  • TEMP_Cedar_OSFP3

  • TEMP_PCIE_CX7_1

  • TEMP_PCIE_CX7_2

  • TEMP_CX7_QSFP0

  • TEMP_CX7_QSFP1

  • TEMP_CX7_QSFP2

  • TEMP_CX7_QSFP3

  • TEMP_Intel_NIC

  • TEMP_NIC_QSFP0

  • TEMP_NIC_QSFP1

Workaround#

Polling the sensors again can resolve the issue.
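
The re-polling described above can be sketched as follows. This is an illustration under stated assumptions: `read_sensor` is a hypothetical callable that takes a sensor name and returns either a value or the string "No Reading", and the limit of three polls is an arbitrary default.

```python
NO_READING = "No Reading"


def poll_until_read(read_sensor, sensor_names, max_polls=3):
    """Poll each sensor, re-polling only those that report No Reading.

    Returns (readings, unresolved): a dict of sensor name -> value, and
    a list of sensors that still reported No Reading after max_polls.
    """
    readings = {}
    pending = list(sensor_names)
    for _ in range(max_polls):
        still_pending = []
        for name in pending:
            value = read_sensor(name)
            if value == NO_READING:
                still_pending.append(name)  # Try this sensor again.
            else:
                readings[name] = value
        pending = still_pending
        if not pending:
            break
    return readings, pending
```

Sensors that already returned a value are not polled again, so a later No Reading cannot overwrite a good value.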

Status#

While not fully resolved, this issue now appears less frequently.