Intelligent Platform Management Interface

1.0

The NVIDIA® BlueField® DPU provides management interfaces to the BMC and the BlueField device.

The BMC, which is based on the Intelligent Platform Management Interface (IPMI) standard, supports both dedicated out-of-band (OOB) interfaces and a serial port for accessing the BMC CLI.

The BMC is connected to an external host server via LAN. IPMItool commands may be issued from the external server to retrieve information from the BMC as follows:

ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN <ipmitool_arguments>
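
For scripted use, the connection options can be wrapped once so that later calls pass only the IPMItool arguments. The following is a minimal sketch: the function name bmc_ipmi and the variables BMC_IP, BMC_USER, and BMC_PASS are illustrative and not part of the BMC software.

```shell
# Hypothetical wrapper: runs ipmitool against the BMC over LAN.
# BMC_IP must be set; BMC_USER and BMC_PASS default to the credentials shown above.
bmc_ipmi() {
    ipmitool -C 17 -I lanplus \
        -H "${BMC_IP:?BMC_IP is not set}" \
        -U "${BMC_USER:-ADMIN}" \
        -P "${BMC_PASS:-ADMIN}" \
        "$@"
}

# Example: BMC_IP=<bmc_ip> bmc_ipmi fru print
```

Keeping the options in one place means a change of credentials or cipher suite only needs to be made once.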

The sections below provide more details about the IPMItool commands which are supported.

FRU Reading

To retrieve FRU info, run:

ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN fru print <fru_id>

The <fru_id> argument is optional. The FRU ID of the BMC FRU EEPROM can be found by running the fru print command without an ID.

It is possible to dump the binary FRU data into a file. Run:

ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN fru read <fru_id> <filename>

Warning

The parameter <filename> is the absolute path to the file.
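
As a sketch of how the absolute-path requirement could be enforced in a script, the helper below rejects relative paths before calling fru read. The name fru_dump and the BMC_IP variable are illustrative assumptions.

```shell
# Hypothetical helper: dump the binary FRU data of <fru_id> to an absolute path.
# BMC_IP is an illustrative variable holding the BMC address.
fru_dump() {
    fru_id=$1
    out_file=$2
    case "$out_file" in
        /*) ;;  # absolute path, as the warning above requires
        *)  echo "fru_dump: '$out_file' is not an absolute path" >&2
            return 1 ;;
    esac
    ipmitool -C 17 -I lanplus -H "$BMC_IP" -U ADMIN -P ADMIN \
        fru read "$fru_id" "$out_file"
}

# Example: fru_dump <fru_id> /tmp/fru.bin
```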


System Event Log

The system event log (SEL) is a non-volatile repository for system events and certain system configuration information. Each SEL entry has a unique "record ID" field, which is used for retrieving log entries from the SEL. Record IDs are not required to be sequential or consecutive, so applications should not assume that SEL record IDs follow any particular numeric ordering.
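
Because record IDs cannot be predicted, a script should read them from the log rather than count upwards. The sketch below extracts the IDs from sel list output. It assumes the ID is the first '|'-separated column, which is how IPMItool usually prints it. The function name sel_record_ids is illustrative.

```shell
# Print the record IDs found in `sel list`-style output read from stdin.
# Assumption: the record ID is the first '|'-separated column of each line.
sel_record_ids() {
    awk -F'|' 'NF > 1 { gsub(/[ \t]/, "", $1); if ($1 != "") print $1 }'
}

# Example: ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sel list | sel_record_ids
```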

Event logs are chassis events recorded by the BMC software. They can be read using IPMI commands.

If the SEL is full and a new event is raised, the oldest record is removed and the new one is placed at the end of the SEL.
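
The overwrite policy can be pictured with a toy model. This is only an illustration of the behavior described above, not the BMC implementation; SEL_CAPACITY, SEL_RECORDS, and sel_append are made-up names, and the capacity of 3 is arbitrary.

```shell
# Toy model of the SEL overwrite policy (illustrative names and capacity).
SEL_CAPACITY=3
SEL_RECORDS=""

sel_append() {
    # Place the new record at the end of the log.
    SEL_RECORDS="${SEL_RECORDS:+$SEL_RECORDS }$1"
    # Drop the oldest records while the log is over capacity.
    set -- $SEL_RECORDS
    while [ "$#" -gt "$SEL_CAPACITY" ]; do
        shift
    done
    SEL_RECORDS="$*"
}
```

Appending a fourth record to a full three-record log drops the first record and keeps the remaining three in arrival order.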

The SEL can be accessed from the server over IPMI LAN, even after a BlueField failure.

The following table lists the commands used to view event logs:

| Command | Description |
|---|---|
| ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sel | Displays information about the SEL |
| ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sel list | Displays the list of events |
| ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sel elist | Displays the list of events with extended information |
| ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sel save <filename> | Saves SEL events to a file |
| ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sel clear | Clears the SEL |
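
Since sel clear discards the log, a script may want to save the events first and clear only if the save succeeded. The following is a minimal sketch under that assumption; sel_archive and BMC_IP are illustrative names.

```shell
# Hypothetical helper: save the SEL to a file, then clear it only if the save succeeded.
sel_archive() {
    out_file=$1
    ipmitool -C 17 -I lanplus -H "$BMC_IP" -U ADMIN -P ADMIN sel save "$out_file" || {
        echo "sel_archive: save failed, SEL left untouched" >&2
        return 1
    }
    ipmitool -C 17 -I lanplus -H "$BMC_IP" -U ADMIN -P ADMIN sel clear
}

# Example: sel_archive /tmp/sel_backup.txt
```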

SEL Messages

The following subsections detail the messages which are added to the BMC SEL and the scenarios that trigger them.

UEFI Boot

While the DPU UEFI is booting, messages describing the status of the UEFI boot are added to the BMC SEL.

SEL messages:

  • SMBus initialization

  • PCI resource configuration

  • System boot initiated

Example:

SEL Record ID : 0037
Record Type : 02
Timestamp : 06:36:06 UTC 06:36:06 UTC
Generator ID : 0001
EvM Revision : 04
Sensor Type : System Firmwares
Sensor Number : 06
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : c207ff
Description : PCI resource configuration


IPMB Sensors

QSFP Sensors

Messages are added to the SEL in case of a change in the status of the QSFP cables. The messages describe the event and status of the sensor.

List of QSFP sensors:

  • p0_link – the QSFP 0 cable status

  • p1_link – the QSFP 1 cable status

SEL messages:

  • Config Error – the QSFP cable is down

  • Connected – the QSFP cable is up

Example:

SEL Record ID : 003e
Record Type : 02
Timestamp : 07:08:28 UTC 07:08:28 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : Cable / Interconnect
Sensor Number : 00
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data (RAW) : 010f0f
Event Interpretation : Missing
Description : Config Error

Sensor ID : p0_link (0x0)
Entity ID : 31.1
Sensor Type (Discrete): Cable / Interconnect
States Asserted : Cable / Interconnect [Config Error]


Temperature Sensors

Messages are added to the SEL if a temperature sensor detects a value that crosses one of the sensor's thresholds. The messages include a description of the event, the DPU FRU device description, the DPU BMC device description, and the status of the sensor.

List of temperature sensors:

  • bluefield_temp – BlueField temperature

  • p0_temp – QSFP 0 cable temperature

  • p1_temp – QSFP 1 cable temperature

SEL messages:

  • Upper Critical going high – crossing an upper critical threshold

  • Upper Non-critical going high – crossing an upper non-critical threshold

  • Lower Critical going low – crossing a lower critical threshold

  • Lower Non-critical going low – crossing a lower non-critical threshold

Example:

SEL Record ID : 003c
Record Type : 02
Timestamp : 07:01:06 UTC 07:01:06 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : Temperature
Sensor Number : 03
Event Type : Threshold
Event Direction : Assertion Event
Event Data (RAW) : 592802
Trigger Reading : 40.000degrees C
Trigger Threshold : 2.000degrees C
Description : Upper Critical going high

Sensor ID : p0_temp (0x3)
Entity ID : 0.1
Sensor Type (Threshold) : Temperature
Sensor Reading : 40 (+/- 0) degrees C
Status : ok
Lower Non-Recoverable : na
Lower Critical : -5.000
Lower Non-Critical : 0.000
Upper Non-Critical : 70.000
Upper Critical : 75.000
Upper Non-Recoverable : na
Positive Hysteresis : Unspecified
Negative Hysteresis : Unspecified
Assertion Events :
Event Enable : Event Messages Disabled
Assertions Enabled : lnc- lcr- unc+ ucr+
Deassertions Enabled : lnc+ lcr+ unc- ucr-

FRU Device Description : Nvidia-BMCMezz (ID 169)
Board Mfg Date : Tue Jan 3 23:16:00 2023 UTC
Board Mfg : Nvidia
Board Product : Nvidia-BMCMezz
Board Serial : MT2251XZ02W5
Board Part Number : 900-9D3B6-00CV-AAA

FRU Device Description : BlueField-3 Smar (ID 250)
Board Mfg Date : Tue Jan 3 23:16:00 2023 UTC
Board Mfg : Nvidia
Board Product : BlueField-3 SmartNIC Main Card
Board Serial : MT2251XZ02W5
Board Part Number : 900-9D3B6-00CV-AAA
Product Manufacturer : Nvidia
Product Name : BlueField-3 SmartNIC Main Card
Product Part Number : 900-9D3B6-00CV-AAA
Product Version : A3
Product Serial : MT2251XZ02W5
Product Asset Tag : 900-9D3B6-00CV-AAA
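
To illustrate how a reading relates to the four thresholds, the sketch below reports which threshold region a value falls in, using the message names from the list above. It is an illustration only: it assumes a threshold counts as crossed once the reading reaches it, and the function name threshold_state is made up.

```shell
# Illustrative only: report which threshold region a reading falls in.
# Arguments: reading, lower critical, lower non-critical, upper non-critical, upper critical.
# Assumption: a threshold is treated as crossed when the reading reaches it (>= or <=).
threshold_state() {
    awk -v r="$1" -v lcr="$2" -v lnc="$3" -v unc="$4" -v ucr="$5" 'BEGIN {
        if      (r + 0 >= ucr + 0) print "Upper Critical going high"
        else if (r + 0 >= unc + 0) print "Upper Non-critical going high"
        else if (r + 0 <= lcr + 0) print "Lower Critical going low"
        else if (r + 0 <= lnc + 0) print "Lower Non-critical going low"
        else                       print "ok"
    }'
}

# Example with the p0_temp thresholds shown above (-5, 0, 70, 75):
# threshold_state 72 -5 0 70 75
```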

ADC Sensors

Messages are added to the SEL if the sensor voltage crosses the sensor's thresholds. The messages include a description of the event, DPU FRU device description, DPU BMC device description, and the status of the sensor.

List of ADC sensors:

  • 1V_BMC

  • 1_2V_BMC

  • 1_8V

  • 1_8V_BMC

  • 2_5V

  • 3_3V

  • 3_3V_RGM

  • 5V

  • 12V_ATX

  • 12V_PCIe

  • DVDD

  • HVDD

  • VDD

  • VDDQ

  • VDD_CPU_L

  • VDD_CPU_R

SEL messages:

  • Upper Non-critical going high – crossing an upper non-critical threshold

  • Lower Non-critical going low – crossing a lower non-critical threshold

Example:

SEL Record ID : 0042
Record Type : 02
Timestamp : 09:20:50 UTC 09:20:50 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : Voltage
Sensor Number : 06
Event Type : Threshold
Event Direction : Assertion Event
Event Data (RAW) : 50a9ff
Trigger Reading : 1.200Volts
Trigger Threshold : 1.810Volts
Description : Lower Non-critical going low

Sensor ID : 1_2V_BMC (0x6)
Entity ID : 0.1
Sensor Type (Threshold) : Voltage
Sensor Reading : 1.200 (+/- 0) Volts
Status : ok
Lower Non-Recoverable : na
Lower Critical : na
Lower Non-Critical : 1.143
Upper Non-Critical : 1.257
Upper Critical : na
Upper Non-Recoverable : na
Positive Hysteresis : Unspecified
Negative Hysteresis : Unspecified
Assertion Events :
Event Enable : Event Messages Disabled
Assertions Enabled : lnc- unc+
Deassertions Enabled : lnc+ unc-

FRU Device Description : Nvidia-BMCMezz (ID 169)
Board Mfg Date : Tue Jan 3 23:16:00 2023 UTC
Board Mfg : Nvidia
Board Product : Nvidia-BMCMezz
Board Serial : MT2251XZ02W5
Board Part Number : 900-9D3B6-00CV-AAA

FRU Device Description : BlueField-3 Smar (ID 250)
Board Mfg Date : Tue Jan 3 23:16:00 2023 UTC
Board Mfg : Nvidia
Board Product : BlueField-3 SmartNIC Main Card
Board Serial : MT2251XZ02W5
Board Part Number : 900-9D3B6-00CV-AAA
Product Manufacturer : Nvidia
Product Name : BlueField-3 SmartNIC Main Card
Product Part Number : 900-9D3B6-00CV-AAA
Product Version : A3
Product Serial : MT2251XZ02W5
Product Asset Tag : 900-9D3B6-00CV-AAA

Sensor Data Record (SDR) Repository

Supported SDR Commands

The BMC software supports reading chassis sensor information using IPMItool.

The following table lists commands which allow reading SDR data:

| Command | Description |
|---|---|
| ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sdr list | Displays sensor data repository entry readings and their status |
| ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sdr elist | Displays extended sensor information |
| ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sensor list | Displays sensors and thresholds in a wide table format |
| ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sdr get <name> | Displays information for sensor data records specified by sensor ID |
| ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sdr type <type> | Displays all records from the SDR repository of a specific type |
| ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sensor get <sensor_name> | Displays information for sensors specified by name |
| ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sensor reading <name>…<name> | Displays readings for sensors specified by name (only for numeric sensors) |
| ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sensor thresh <sensor_name> upper <non_critical_value> <critical_value> 0 | Sets the upper thresholds of a sensor |
| ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sensor thresh <sensor_name> lower 0 <critical_value> <non_critical_value> | Sets the lower thresholds of a sensor |

The following limits apply when setting thresholds:

  • If the original threshold value is >0, the new threshold values must be between 0-255

  • If the original threshold value is <0, the new threshold values must be between 0-127

If a threshold is crossed, a message is added to the Redfish event log, SEL, and journal.
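
A script that changes thresholds can check the requested values before sending them. The sketch below covers only the case where the original threshold is positive (allowed range 0-255, as stated above) and passes 0 as the final argument, following the command form shown in this section. The function name set_upper_thresholds and the BMC_IP variable are illustrative.

```shell
# Hypothetical helper: set the upper thresholds of a sensor after a range check.
# Only the 0-255 case (original threshold > 0) is handled here.
set_upper_thresholds() {
    sensor=$1 non_critical=$2 critical=$3
    for value in "$non_critical" "$critical"; do
        # Accept only non-negative decimal numbers that do not exceed 255.
        awk -v v="$value" 'BEGIN { exit !(v ~ /^[0-9]+(\.[0-9]+)?$/ && v + 0 <= 255) }' || {
            echo "set_upper_thresholds: $value is outside 0-255" >&2
            return 1
        }
    done
    ipmitool -C 17 -I lanplus -H "$BMC_IP" -U ADMIN -P ADMIN \
        sensor thresh "$sensor" upper "$non_critical" "$critical" 0
}

# Example: set_upper_thresholds p0_temp 70 75
```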


SDR Entry List

SDR contains information about the type and number of sensors. The following is a list of the available SDR information:

| Managed Entity | ID | Sensor Name |
|---|---|---|
| SFP link status | 0x0, 0x1 | p0_link, p1_link |
| NIC thermal sensors | 0x2 | bluefield_temp |
| SFP temperature sensors | 0x3, 0x4 | p0_temp, p1_temp |
| NIC voltage sensors | 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14 | ADC voltage sensors: 1V_BMC, 1_2V_BMC, 1_8V, 1_8V_BMC, 2_5V, 3_3V, 3_3V_RGM, 5V, 12V_ATX, 12V_PCIe, DVDD, HVDD, VDD, VDDQ, VDD_CPU_L, VDD_CPU_R |

Rebooting BlueField with BMC

BMC software enables resetting the BlueField.

To reset the main CPU, run:

ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN chassis power reset


The BMC can retrieve information about the BlueField's sensors and FRUs using IPMI over the IPMB protocol. IPMItool commands can be issued from the BMC using the following format:

ipmitool -I ipmb <ipmitool_arguments>

List of IPMI Supported Sensors

| Sensor | Sensor ID | Description |
|---|---|---|
| bluefield_temp | 0 | Support NIC monitoring of BlueField's temperature |
| ddr0_0_temp | 1 | Support monitoring of DDR0 temp (on memory controller 0) |
| ddr0_1_temp | 2 | Support monitoring of DDR1 temp (on memory controller 0) |
| ddr1_0_temp | 3 | Support monitoring of DDR0 temp (on memory controller 1) |
| ddr1_1_temp | 4 | Support monitoring of DDR1 temp (on memory controller 1) |
| p0_temp | 5 | Port 0 temperature |
| p1_temp | 6 | Port 1 temperature |
| p0_link | 7 | Port 0 link status: 0x100 – connection OK; 0x200 – connection error |
| p1_link | 8 | Port 1 link status: 0x100 – connection OK; 0x200 – connection error |
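
When processing link readings in a script, the two documented values can be mapped to their meanings. This is a small sketch; the function name link_status_text is illustrative, and any value other than the two listed above is reported as unknown.

```shell
# Map a p0_link/p1_link reading to the meaning listed in the table above.
link_status_text() {
    case "$1" in
        0x100) echo "connection OK" ;;
        0x200) echo "connection error" ;;
        *)     echo "unknown ($1)" ;;
    esac
}
```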


List of IPMI Supported FRUs

| FRU | ID | Description |
|---|---|---|
| update_timer | 0 | set_emu_param.service is responsible for collecting data on sensors and FRUs every 3 seconds. This regular update is required for sensors but not for FRUs whose content is less susceptible to change. update_timer is used to sample the FRUs every hour instead. Users may need this timer if they are issuing several raw IPMItool FRU read commands. This helps assess how many times users must retrieve large FRU data before the next FRU update. update_timer is a hexadecimal number. |
| fw_info | 1 | ConnectX firmware information, Arm firmware version, and MLNX_OFED version. The fw_info is in ASCII format. |
| nic_pci_dev_info | 2 | NIC vendor ID, device ID, subsystem vendor ID, and subsystem device ID. The nic_pci_dev_info is in ASCII format. |
| cpuinfo | 3 | CPU information reported in lscpu and /proc/cpuinfo. The cpuinfo is in ASCII format. |
| emmc_info | 8 | eMMC size, list of its partitions, and partitions usage (in ASCII format). eMMC CID, CSD, and extended CSD registers (in binary format). The ASCII data is separated from the binary data with StartBinary marker. |
| qsfp0_eeprom | 9 | FRU for QSFP 0 EEPROM page 0 content (256 bytes in binary format) |
| qsfp1_eeprom | 10 | FRU for QSFP 1 EEPROM page 0 content (256 bytes in binary format). Note: Applicable for dual-port devices only. |
| ip_addresses | 11 | This FRU is empty at start time. It can be used to write the BMC port 0 and port 1 IP addresses to the BlueField. They follow these formats: BMC: XXX.XXX.XXX.XXX P0: XXX.XXX.XXX.XXX P1: XXX.XXX.XXX.XXX. The size of the written file should be 61 bytes exactly. |
| eth0 | 13 | Network interface 0 information. Updated once every minute. |
| eth1 | 14 | Network interface 1 information. Updated once every minute. Note: Applicable for dual-port devices only. |
| bf_uid | 15 | BlueField device UUID |
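
The 61-byte requirement for ip_addresses can be checked before the file is written to the FRU. The three entries take 20 + 19 + 19 = 58 characters, which leaves 3 bytes for separators. The sketch below assumes each entry is terminated by a newline, which is one layout that gives exactly 61 bytes, and that each address is supplied as a full 15-character string. The function name write_ip_addresses is illustrative, and the sketch only prepares the file.

```shell
# Hypothetical helper: write the ip_addresses payload to a file and check its size.
# Assumption: each entry ends with a newline, so 20 + 19 + 19 characters plus
# 3 newlines give the required 61 bytes; each address must be 15 characters long.
write_ip_addresses() {
    out_file=$1
    printf 'BMC: %s\nP0: %s\nP1: %s\n' "$2" "$3" "$4" > "$out_file"
    size=$(wc -c < "$out_file" | tr -d ' ')
    if [ "$size" -ne 61 ]; then
        echo "write_ip_addresses: file is $size bytes, expected 61" >&2
        return 1
    fi
}

# Example: write_ip_addresses /tmp/ip_addresses <bmc_ip> <p0_ip> <p1_ip>
```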


Supported IPMI Commands

All of the following commands are prefixed with ipmitool on the command line.

| Command | IPMItool Command | Relevant IPMI 2.0 Rev1.1 Spec Section |
|---|---|---|
| Get Device ID | mc info | 20.1 |
| Broadcast "Get Device ID" | Part of "mc info" | 20.9 |
| Get BMC Global Enables | mc getenables | 22.2 |
| Get Device SDR Info | sdr info | 35.2 |
| Get Device SDR | "sdr get", "sdr list" or "sdr elist" | 35.3 |
| Get Sensor Hysteresis | sdr get <sensor_id> | 35.7 |
| Set Sensor Threshold | sensor thresh <sensor-id> <threshold> <setting> | 35.8 |
| Get Sensor Threshold | sdr get <sensor_id> | 35.9 |
| Get Sensor Event Enable | sdr get <sensor_id> | 35.11 |
| Get Sensor Reading | sensor reading <sensor_id> | 35.14 |
| Get Sensor Type | sdr type <type> | 35.16 |
| Read FRU Data | fru read <fru_number> <file_to_write_to> – provides FRU data | 34.2 |
| Get SDR Repository Info | sdr info | 33.9 |
| Get SEL Info | "sel" or "sel info" | 40.2 |
| Get SEL Allocation Info | "sel" or "sel info" | 40.3 |
| Get SEL Entry | "sel list" or "sel elist" | 40.5 |
| Delete SEL Entry | sel delete <id> | 40.8 |
| Clear SEL | sel clear | 40.9 |


The BMC has two IPMB modes: it can act as a requester or as a responder.

  • Requester Mode
    When used as a requester, the BMC sends IPMB request messages to the BlueField via SMBus 0. The BlueField then processes the request and sends a message back to the BMC.

  • Responder Mode
    When used as a responder, the BMC receives IPMB request messages from the BlueField on SMBus 0. It then processes the message and sends a response back to the BlueField.

Both modes are enabled automatically at boot time.

For more information on how to use IPMI, please refer to the IPMI 2.0 standard.

© Copyright 2023, NVIDIA. Last updated on Nov 13, 2023.