NVIDIA BlueField Platform Software Troubleshooting Guide

Power and Thermal

This section lists the messages and errors reported by the power and thermal modules. Some of these messages are printed to the console and others to the RShim log.

Sensors

Arm thermal sensor data can be accessed using the sensors command. All Arm thermal sensors and the DDR temperature are listed under the acpitz-acpi section.

# sensors
mlx5-pci-0300
Adapter: PCI adapter
asic:         +45.0C  (crit = +91.0C, highest = +57.0C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +41.4C  (crit = +115.0C)
temp2:        +41.2C  (crit = +115.0C)
temp3:        +39.4C  (crit = +115.0C)
temp4:        +41.9C  (crit = +115.0C)
temp5:        +42.0C  (crit = +115.0C)
temp6:        +42.4C  (crit = +115.0C)
temp7:        +42.4C  (crit = +115.0C)
temp8:        +42.1C  (crit = +115.0C)
temp9:        +44.4C  (crit = +115.0C)
temp10:       +80.0C  (crit = +105.0C)

nvme-pci-0600
Adapter: PCI adapter
Composite:    +40.9C  (low  = -40.1C, high = +114.8C)
                      (crit = +122.8C)
Sensor 1:     +38.9C  (low  = -273.1C, high = +65261.8C)
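As a rough illustration of how such readings can be checked against their critical thresholds, the following Python sketch parses text in the format shown above and reports the margin between each temperature and its crit value. The helper names (parse_sensor_lines, crit_headroom) and the regular expression are assumptions made for this example and are not part of any NVIDIA or lm-sensors tooling. It only handles readings whose crit value appears on the same line, and it keys results by label alone, so identical labels from different chips would overwrite each other.

```python
import re

# Matches lines such as "temp1: +41.4C (crit = +115.0C)". The degree sign is
# optional because some terminals print "°C" and others plain "C".
READING_RE = re.compile(
    r"^(?P<label>[^:]+):\s+(?P<temp>[+-]?\d+(?:\.\d+)?)\s*°?C"
    r"(?:.*?crit\s*=\s*(?P<crit>[+-]?\d+(?:\.\d+)?)\s*°?C)?"
)


def parse_sensor_lines(text):
    """Return (label, temperature, crit_or_None) for each reading line."""
    readings = []
    for line in text.splitlines():
        match = READING_RE.match(line.strip())
        if match is None:
            continue  # chip names, "Adapter:" lines, continuation lines
        crit = match.group("crit")
        readings.append((
            match.group("label"),
            float(match.group("temp")),
            float(crit) if crit is not None else None,
        ))
    return readings


def crit_headroom(readings):
    """Map each label that has a crit value to (crit - temperature)."""
    return {
        label: round(crit - temp, 1)
        for label, temp, crit in readings
        if crit is not None
    }


# A few lines copied from the example output above.
SAMPLE = """\
asic: +45.0C (crit = +91.0C, highest = +57.0C)
temp1: +41.4C (crit = +115.0C)
temp10: +80.0C (crit = +105.0C)
"""

if __name__ == "__main__":
    for label, margin in crit_headroom(parse_sensor_lines(SAMPLE)).items():
        print(f"{label}: {margin:.1f} C below crit")
```

On a live system, the text could instead be obtained by capturing the output of the sensors command, for example with subprocess.run.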

To find the sensor name that corresponds to a tempX entry, read the following node: /sys/bus/acpi/devices/LNXTHERM:(X-1)/description.

For example, the temperature value displayed next to temp1 corresponds to:

# cat /sys/bus/acpi/devices/LNXTHERM\:00/description
center
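The same lookup can be repeated for every tempX entry. The sketch below is one possible way to do this in Python. The function names are hypothetical, and the two-digit zero-padded suffix is an assumption inferred from the LNXTHERM:00 example for temp1.

```python
from pathlib import Path

ACPI_DEVICES = Path("/sys/bus/acpi/devices")


def description_path(temp_index, base=ACPI_DEVICES):
    """Return the description node for tempX, where X is temp_index."""
    if temp_index < 1:
        raise ValueError("tempX numbering starts at 1")
    # Assumption: the suffix is X-1 written with two digits, as in the
    # LNXTHERM:00 example that corresponds to temp1.
    return base / f"LNXTHERM:{temp_index - 1:02d}" / "description"


def sensor_names(base=ACPI_DEVICES):
    """Map "tempX" to its sensor name until a description node is missing."""
    names = {}
    index = 1
    while True:
        node = description_path(index, base)
        if not node.is_file():
            break
        names[f"temp{index}"] = node.read_text().strip()
        index += 1
    return names


if __name__ == "__main__":
    for temp, name in sensor_names().items():
        print(f"{temp}: {name}")
```

The base directory is a parameter so that the functions can be exercised against a copy of the sysfs tree as well as on the device itself.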


RShim Error Messages

The following messages appear in the RShim log. To view the RShim log, run the following commands:

# echo "DISPLAY_LEVEL 2" > /dev/rshimX/misc
# cat /dev/rshimX/misc
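For scripted collection, the two commands can be mirrored in Python. This is a minimal sketch: read_rshim_log is a hypothetical helper name, and the path is passed in by the caller because the X in rshimX depends on the host.

```python
import sys
from pathlib import Path


def read_rshim_log(misc_path):
    """Raise the display level, then return the contents of the misc file."""
    misc = Path(misc_path)
    # Equivalent of: echo "DISPLAY_LEVEL 2" > /dev/rshimX/misc
    misc.write_text("DISPLAY_LEVEL 2\n")
    # Equivalent of: cat /dev/rshimX/misc
    # errors="replace" keeps the read from failing on unexpected bytes.
    return misc.read_text(errors="replace")


if __name__ == "__main__":
    # Usage: pass the misc file path of the target RShim device as argument 1.
    print(read_rshim_log(sys.argv[1]), end="")
```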

| Message | Description |
| --- | --- |
| cannot access vr0 | VR0 is not responding. This indicates a hardware issue on the device. |
| cannot access vr1 | VR1 is not responding. This indicates a hardware issue on the device. |
| set_page err:X | VR is not responding. This indicates a hardware issue on the device. |
| mfr_vr_mc err:X | Access to VR is inconsistent. This indicates a hardware issue on the device, resulting in an unstable connection to the VR. |
| pmbus_lsb err:X | Access to VR is inconsistent. This indicates a hardware issue on the device, resulting in an unstable connection to the VR. |
| read_vout err:X | Access to VR is inconsistent. This indicates a hardware issue on the device, resulting in an unstable connection to the VR. |
| set_vout err:X | Unable to set the requested v-out value. This indicates that either the requested v-out is out of bounds or the connection to the VRs is unstable. |
| PTMERROR: VR access error | VR is not responding. This indicates a hardware issue on the device. |
| PTMERROR: Unknown OPN | Power capping is disabled on the device because VRs are not detected and the OPN is not known. |
| CRITICAL ERROR: ATX power not detected! Halting system!! | This critical error message indicates that ATX power is not detected on the device and the system is halted. To recover, connect the ATX power cable and restart. |
| power capping disabled | This indicates that power capping is disabled on the device. |
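When scanning a captured log for these messages, a small lookup can help. The following Python sketch matches each log line against the messages listed above and returns a condensed description. The exact layout of RShim log lines is not specified here, so the patterns use substring matching, and the X in err:X is assumed to be any run of non-space characters. The function and table names are hypothetical.

```python
import re

_UNSTABLE = "Access to VR is inconsistent; unstable connection to the VR."

# (pattern, condensed description) pairs based on the messages listed above.
RSHIM_MESSAGES = [
    (r"cannot access vr0", "VR0 is not responding; hardware issue."),
    (r"cannot access vr1", "VR1 is not responding; hardware issue."),
    (r"set_page err:\s*\S+", "VR is not responding; hardware issue."),
    (r"mfr_vr_mc err:\s*\S+", _UNSTABLE),
    (r"pmbus_lsb err:\s*\S+", _UNSTABLE),
    (r"read_vout err:\s*\S+", _UNSTABLE),
    (r"set_vout err:\s*\S+",
     "Unable to set v-out; value out of bounds or unstable VR connection."),
    (r"PTMERROR: VR access error", "VR is not responding; hardware issue."),
    (r"PTMERROR: Unknown OPN",
     "Power capping disabled; VRs not detected and OPN not known."),
    (r"CRITICAL ERROR: ATX power not detected",
     "ATX power not detected; system halted. Connect ATX power and restart."),
    (r"power capping disabled", "Power capping is disabled on the device."),
]
_COMPILED = [(re.compile(p), d) for p, d in RSHIM_MESSAGES]


def explain_rshim_log(log_text):
    """Return (line, description) for every line containing a known message."""
    findings = []
    for line in log_text.splitlines():
        for pattern, description in _COMPILED:
            if pattern.search(line):
                findings.append((line.strip(), description))
                break  # report only the first matching message per line
    return findings


if __name__ == "__main__":
    # Hypothetical log excerpt, for illustration only.
    sample = "cannot access vr1\nset_vout err:5\n"
    for line, description in explain_rshim_log(sample):
        print(f"{line} -> {description}")
```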


Console Error Messages

Runtime messages related to power and thermal capping are logged to the console. These messages are in the following format:

PTM:<timestamp>:<event_type>:<throttle_action>:<event_details>

| Element | Value | Description |
| --- | --- | --- |
| timestamp | | Current CPU cycle value since boot. This is counted at the speed of the RShim clock. |
| event_type | 1 | Thermal event |
| | 2 | Power event |
| throttle_action | 0 | No change |
| | 1 | Switched to P0 (100%) |
| | 2 | Switched to P1 (80%) |
| | 3 | Switched to P2 (50%) |
| event_details | 0 | None |
| | 1 | Device in LiveFish mode |
| | 3 | DDR reported error when reading temperature |
| | 4 | VR read error |
| | 6 | Power capping disabled |
| | 7 | Power capping enabled |
| | 8 | Thermal state is normal |
| | 9 | Thermal state is in Alarm-P1 state (temperature over threshold) |
| | 10 | Thermal state is in Alarm-P2 state (temperature consistently over threshold) |
| | 11 | DDR temperature over threshold |
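To make the numeric fields easier to read, a console line can be decoded with a small lookup. The Python sketch below is illustrative only. It assumes that the fields are separated by single colons, that the codes are printed as decimal integers, and that the timestamp itself contains no colon. The timestamp is kept as raw text because its notation is not specified here. The names decode_ptm and PtmEvent are hypothetical.

```python
from typing import NamedTuple

EVENT_TYPES = {1: "Thermal event", 2: "Power event"}
THROTTLE_ACTIONS = {
    0: "No change",
    1: "Switched to P0 (100%)",
    2: "Switched to P1 (80%)",
    3: "Switched to P2 (50%)",
}
EVENT_DETAILS = {
    0: "None",
    1: "Device in LiveFish mode",
    3: "DDR reported error when reading temperature",
    4: "VR read error",
    6: "Power capping disabled",
    7: "Power capping enabled",
    8: "Thermal state is normal",
    9: "Thermal state is in Alarm-P1 state (temperature over threshold)",
    10: "Thermal state is in Alarm-P2 state "
        "(temperature consistently over threshold)",
    11: "DDR temperature over threshold",
}


class PtmEvent(NamedTuple):
    timestamp: str
    event_type: str
    throttle_action: str
    event_details: str


def _lookup(table, field):
    """Translate a numeric field, flagging codes that are not listed."""
    try:
        code = int(field)
    except ValueError:
        return f"unparsed ({field})"
    return table.get(code, f"undocumented code {code}")


def decode_ptm(line):
    """Decode one PTM:<timestamp>:<event_type>:<action>:<details> line."""
    parts = line.strip().split(":")
    if len(parts) != 5 or parts[0] != "PTM":
        raise ValueError(f"not a PTM message: {line!r}")
    _, timestamp, event_type, action, details = parts
    return PtmEvent(
        timestamp,
        _lookup(EVENT_TYPES, event_type),
        _lookup(THROTTLE_ACTIONS, action),
        _lookup(EVENT_DETAILS, details),
    )


if __name__ == "__main__":
    # Hypothetical message, made up for illustration.
    print(decode_ptm("PTM:123456:1:2:9"))
```

Codes that are not listed, such as event_details values 2 and 5, are reported as undocumented rather than guessed.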


Abrupt System Halt in 150W BlueField-3 Platforms

On 150W BlueField-3 platforms, the power capping code halts the system when the ATX cable is not connected or is removed during operation. In this case, the following message is printed to the RShim log and the console:

CRITICAL ERROR: ATX power not detected! Halting system!!

To resume normal operation, connect the ATX cable and power cycle the device.

To connect the ATX cable, use the following harness for BlueField-3:

[Image: ATX harness for BlueField-3]

Tip

Good connectors for BlueField-3 are all black on the clipper side and should be easy to connect without using force.

Warning

Avoid common mistakes when matching an ATX harness to BlueField-3!

  • Although it has 8 pins, the following connector is an ATX harness for GPUs and does not fit BlueField-3:

    [Image: ATX harness for GPUs]

  • It can be forced into the BlueField-3 ATX socket, but this must be avoided!

  • Note the all-yellow wires on the clipper side. The polarity is wrong and will prevent the server from powering on.


© Copyright 2024, NVIDIA. Last updated on Nov 12, 2024.