Troubleshooting
DCMI and SDR Data Not Updating After Heavy Telemetry Workload
Issue:
After an extended period of heavy telemetry workload, DCMI power readings and SDR sensor data stop changing. Static values indicate that data collection has stalled and that automatic recovery has failed.
Solution:
Follow these steps to verify the issue and restore real-time updates for DCMI and SDR sensor data.
1. Verify the readings.
   Run the following command several times and check whether the power readings change:
   ipmitool -H <bmc-ip-address> -U <user> -P <password> -I lanplus dcmi power reading
2. Reset the BMC.
   If the values remain static, perform a cold reset of the BMC:
   ipmitool -H <bmc-ip-address> -U <user> -P <password> -I lanplus mc reset cold
   The cold reset restarts the telemetry service, which clears the stall and restores real-time updates for both DCMI and SDR sensor data.
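The verification step can be scripted. The following is a minimal sketch, not an official tool: it assumes the command output contains a line of the form "Instantaneous power reading: <n> Watts", and the function names parse_instantaneous_watts, readings_are_static, and read_power are made up for illustration.

```python
import re
import subprocess
from typing import Sequence

# Assumed output line format: "Instantaneous power reading:    220 Watts"
_POWER_RE = re.compile(r"Instantaneous power reading:\s*(\d+)\s*Watts")


def parse_instantaneous_watts(output: str) -> int:
    """Extract the instantaneous power value (in watts) from the command output."""
    match = _POWER_RE.search(output)
    if match is None:
        raise ValueError("no instantaneous power reading found in output")
    return int(match.group(1))


def readings_are_static(samples: Sequence[int]) -> bool:
    """Return True when there are at least two samples and all are identical."""
    return len(samples) >= 2 and len(set(samples)) == 1


def read_power(host: str, user: str, password: str) -> int:
    """Run the DCMI power reading command shown above and parse its output."""
    cmd = ["ipmitool", "-H", host, "-U", user, "-P", password,
           "-I", "lanplus", "dcmi", "power", "reading"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_instantaneous_watts(result.stdout)
```

Collect several samples a few seconds apart with read_power and pass them to readings_are_static. A True result can also come from a genuinely steady load, so treat it as a reason to look closer, not as proof of a stall.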
Troubleshooting an Unsuccessful Firmware Update
Firmware Update Terminates due to Component Not Found
When performing a firmware update of the GPU tray with the motherboard firmware package, the firmware update stops with the following output message:
...
{
"@odata.type": "#Message.v1_0_8.Message",
"Message": "Given PLDMBundle Status Message : Requested component was not found in the firmware bundle.",
"MessageArgs": [
"Requested component was not found in the firmware bundle."
],
"MessageId": "UpdateService.1.0.FwUpdateStatusMessage",
"Resolution": "None",
"Severity": "Warning"
},
...
The message indicates that the firmware file specified by the -p argument of the nvfwupd command does not contain the component being updated.
Retry the update and specify the firmware file that matches the component.
For example, use the GPU firmware file, which contains the HGX string, for the GPU tray update.
Refer to the DGX B300 System Firmware Update Guide Version 25.11.2 for the firmware file names and components.
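A quick check before starting the update can catch this mismatch early. The sketch below assumes that "contains the HGX string" refers to the package file name, which the text above does not state explicitly. The helper name and the "gpu_tray" target label are hypothetical.

```python
import os


def check_package_for_target(path: str, target: str) -> None:
    """Raise ValueError when a GPU tray update is given a non-HGX package.

    Assumption: the HGX string appears in the file name. Only the
    GPU tray rule from the text above is encoded; other targets pass through.
    """
    name = os.path.basename(path)
    if target == "gpu_tray" and "HGX" not in name:
        raise ValueError(
            f"{name} does not look like a GPU firmware package (no HGX in name)"
        )
```

The check does not replace the file names listed in the firmware update guide. It only encodes the single rule described in this section.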
No Devices Where Detected for Handle ID 0
When performing a firmware update with the Redfish API, the following output message indicates that
the firmware file specified in the -F UpdateFile= argument is not the correct file for the component
specified in the JSON file.
...
{
"@odata.type": "#Message.v1_0_8.Message",
"Message": "Given PLDMBundle Status Message : No devices where detected for handle id 0.",
"MessageArgs": [
"No devices where detected for handle id 0"
],
"MessageId": "UpdateService.1.0.FwUpdateStatusMessage",
"Resolution": "None",
"Severity": "Warning"
},
...
Retry the update and specify the firmware file that matches the component. Refer to Redfish APIs Support in the NVIDIA DGX B300 System User Guide for information about using the Redfish API.
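Warnings like the ones shown in this section can be picked out of a task response programmatically. This is a rough sketch, assuming the response body is JSON with a top-level "Messages" array shaped like the excerpts above. The function names are illustrative and not part of any NVIDIA tool.

```python
import json
from typing import Any, Dict, List

# MessageId taken from the output excerpts in this section.
STATUS_MESSAGE_ID = "UpdateService.1.0.FwUpdateStatusMessage"


def find_update_warnings(messages: List[Dict[str, Any]]) -> List[str]:
    """Return the text of each firmware update status message with Warning severity."""
    return [
        m.get("Message", "")
        for m in messages
        if m.get("MessageId") == STATUS_MESSAGE_ID and m.get("Severity") == "Warning"
    ]


def warnings_from_task(task_json: str) -> List[str]:
    """Parse a task response body and collect warnings from its "Messages" list."""
    task = json.loads(task_json)
    return find_update_warnings(task.get("Messages", []))
```

An empty result only means that no status warnings were reported. It does not confirm that the update succeeded.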
Wait for Firmware Update Started ID
The output for an unsuccessful firmware update using the nvfwupd command can
look like the following example:
FW recipe: ['nvfw_DGXB300_xxxx_xxxxxx.x.x.fwpkg']
{"@odata.type": "#UpdateService.v1_6_0.UpdateService", "Messages": [{"@odata.type": "#Message.v1_0_8.Message", "Message": "A new task /redfish/v1/TaskService/Tasks/4 was created.", "MessageArgs": ["/redfish/v1/TaskService/Tasks/4"], "MessageId": "Task.1.0.New", "Resolution": "None", "Severity": "OK"}, {"@odata.type": "#Message.v1_0_8.Message", "Message": "The action UpdateService.MultipartPush was submitted to do firmware update.", "MessageArgs": ["UpdateService.MultipartPush"], "MessageId": "UpdateService.1.0.StartFirmwareUpdate", "Resolution": "None", "Severity": "OK"}]}
FW update started, Task Id: 4
Wait for FirmwareUpdateStarted Id in Messages
Wait for FirmwareUpdateStarted Id in Messages
Task Message: Task /redfish/v1/UpdateService/upload has stopped due to an exception condition.
Firmware update failed, retry the firmware update
Retry the firmware update, as indicated in the command output.
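When updates are scripted, the retry can be wrapped in a small loop. The following is a sketch under stated assumptions: run_update is a hypothetical callable that performs one update attempt and returns the command output as text, and failure is detected only by the "Firmware update failed" line shown in the output above.

```python
from typing import Callable

# Failure line copied from the example output above.
FAILURE_MARKER = "Firmware update failed"


def update_failed(output: str) -> bool:
    """Return True when the output contains the failure line."""
    return FAILURE_MARKER in output


def run_with_retries(run_update: Callable[[], str], max_attempts: int = 3) -> int:
    """Call run_update until its output no longer contains the failure line.

    Returns the number of attempts used. Raises RuntimeError if every
    attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        output = run_update()
        if not update_failed(output):
            return attempt
    raise RuntimeError(f"firmware update still failing after {max_attempts} attempts")
```

The limit of three attempts is arbitrary. The absence of the failure line is also only a weak signal, so confirm the result by checking the task status.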