Intelligent Platform Management Interface
The NVIDIA® BlueField® DPU provides management interfaces to the BMC and the BlueField device.
The BMC, which is based on the Intelligent Platform Management Interface (IPMI) standard, supports both dedicated out-of-band (OOB) interfaces and a serial port for accessing the BMC CLI.
The BMC is connected to an external host server via LAN. IPMItool commands may be issued from the external server to retrieve information from the BMC as follows:
ipmitool -C 17 -I lanplus -H <bmc_ip_addr> -U ADMIN -P ADMIN <ipmitool_arguments>
The sections below provide more details about the supported IPMItool commands.
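Because every LAN command shares the same connection options, a small wrapper can reduce repetition. The sketch below is illustrative only: the function name bmc_ipmi and the BMC_IP, BMC_USER, BMC_PASS, and IPMITOOL variables are assumptions and are not part of the BMC software.

```shell
# Hypothetical wrapper around the common LAN invocation shown above.
# BMC_IP, BMC_USER, BMC_PASS, and IPMITOOL are assumed variable names.
bmc_ipmi() {
    if [ -z "${BMC_IP:-}" ]; then
        echo "bmc_ipmi: BMC_IP is not set" >&2
        return 2
    fi
    # IPMITOOL may point at a different binary (or a stub for dry runs).
    "${IPMITOOL:-ipmitool}" -C 17 -I lanplus -H "$BMC_IP" \
        -U "${BMC_USER:-ADMIN}" -P "${BMC_PASS:-ADMIN}" "$@"
}
```

With BMC_IP set to the BMC address, bmc_ipmi followed by any supported arguments runs the same command line as the template above.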
FRU Reading
To retrieve FRU info, run:
ipmitool -C 17 -I lanplus -H <bmc_ip_addr> -U ADMIN -P ADMIN fru print <fru-id>
The <fru-id> argument is optional. The FRU ID of the BMC FRU EEPROM can be found by running the fru print command without an ID.
It is possible to dump the binary FRU data into a file. Run:
ipmitool -C 17 -I lanplus -H <bmc_ip_addr> -U ADMIN -P ADMIN fru read <fru-id> <filename>
The parameter <filename> is the absolute path to the file.
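Since the filename is expected to be an absolute path, a script that accepts relative names may want to normalize them first. A minimal sketch; the helper name abs_path is hypothetical:

```shell
# Hypothetical helper: turn a possibly relative filename into an absolute
# path before passing it to "fru read".
abs_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;            # already absolute
        *)  printf '%s/%s\n' "$PWD" "$1" ;;  # resolve against the current directory
    esac
}
```

The result can then be used as the <filename> argument, for example "$(abs_path fru.bin)".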
System Event Log
The system event log (SEL) is a non-volatile repository for system events and certain system configuration information. SEL entries have a unique "record ID" field. This field is used for retrieving log entries from the SEL. Record IDs are not required to be sequential or consecutive. Applications should not assume that the SEL record ID follows any particular numeric ordering.
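Because record IDs carry no ordering guarantee, a client that polls the SEL is safer tracking the IDs it has already seen than comparing IDs numerically. A minimal sketch, assuming a listing in which the record ID is the first "|"-separated field of each line; the function name and the seen-IDs file are hypothetical:

```shell
# Print only SEL lines whose record ID is not yet listed in the seen-IDs file.
# Assumes the record ID is the first "|"-separated field of each input line.
# Usage: <SEL listing> | sel_new_records seen_ids.txt
sel_new_records() {
    awk -F'|' -v seen_file="$1" '
        BEGIN {
            # Load previously seen IDs (one per line); a missing file means none.
            while ((getline line < seen_file) > 0) seen[line] = 1
        }
        {
            id = $1
            gsub(/[ \t]/, "", id)          # strip padding around the ID
            if (id != "" && !(id in seen)) print
        }
    '
}
```

The lookup is by set membership, so it works whatever order the BMC assigns IDs in.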
Event logs record chassis events in the BMC software and can be read using IPMI commands.
If the SEL is full and a new event is raised, the oldest record is removed and the new one is placed at the end of the SEL.
The SEL remains accessible from the server through IPMI LAN access, even after a BlueField failure.
The following table lists the commands used to view event logs:
| Command | Description |
| | Displays information about SEL |
| | Displays list of events |
| | Displays extended info list of events |
| | Saves SEL events to a file |
| | Clears SEL |
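When clearing the SEL, it is usually prudent to save its contents first. The sketch below uses the standard ipmitool subcommands sel save and sel clear; that these are the exact commands intended by the table rows is an assumption, as are the function name and the BMC_IP and IPMITOOL variables.

```shell
# Hypothetical helper: save the SEL to a file, then clear it only if the
# save succeeded. BMC_IP and IPMITOOL are assumed variable names.
sel_archive_then_clear() {
    sel_out=$1
    "${IPMITOOL:-ipmitool}" -C 17 -I lanplus -H "$BMC_IP" -U ADMIN -P ADMIN \
        sel save "$sel_out" || return 1
    "${IPMITOOL:-ipmitool}" -C 17 -I lanplus -H "$BMC_IP" -U ADMIN -P ADMIN \
        sel clear
}
```

If the save step fails, the function returns without touching the log.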
SEL Messages
The following subsections detail the messages which are added to the BMC SEL and the scenarios that trigger them.
UEFI Boot
Messages are added to the BMC SEL while the DPU UEFI is booting which describe the status of the UEFI boot.
SEL messages:
SMBus initialization
PCI resource configuration
System boot initiated
Example:
SEL Record ID : 0037
Record Type : 02
Timestamp : 06:36:06 UTC 06:36:06 UTC
Generator ID : 0001
EvM Revision : 04
Sensor Type : System Firmwares
Sensor Number : 06
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : c207ff
Description : PCI resource configuration
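Records in this "Key : Value" layout are easy to post-process. The following sketch extracts a single field from a record read on standard input; the function name sel_field is hypothetical, and the layout is assumed to match the example above.

```shell
# Print the value of one "Key : Value" field from an SEL record on stdin.
sel_field() {
    awk -v key="$1" '
        {
            i = index($0, ":")             # split at the first colon only
            if (i == 0) next
            k = substr($0, 1, i - 1)
            v = substr($0, i + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k) # trim key
            gsub(/^[ \t]+|[ \t]+$/, "", v) # trim value
            if (k == key) { print v; exit }
        }
    '
}
```

Splitting at the first colon keeps values that themselves contain colons, such as timestamps, intact.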
IPMB Sensors
QSFP Sensors
Messages are added to the SEL when the status of a QSFP cable changes. The messages describe the event and the status of the sensor.
List of QSFP sensors:
P0_link – the QSFP 0 cable status
P1_link – the QSFP 1 cable status
SEL messages:
Config Error – the QSFP cable is down
Connected – the QSFP cable is up
Example:
SEL Record ID : 003e
Record Type : 02
Timestamp : 07:08:28 UTC 07:08:28 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : Cable / Interconnect
Sensor Number : 00
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data (RAW) : 010f0f
Event Interpretation : Missing
Description : Config Error
Sensor ID : p0_link (0x0)
Entity ID : 31.1
Sensor Type (Discrete): Cable / Interconnect
States Asserted : Cable / Interconnect
[Config Error]
Temperature Sensors
Messages are added to the SEL if a temperature sensor reading crosses one of the sensor's thresholds. The messages include a description of the event, DPU FRU device description, DPU BMC device description, and the status of the sensor.
List of temperature sensors:
bluefield_temp – BlueField temperature
p0_temp – QSFP 0 cable temperature
p1_temp – QSFP 1 cable temperature
SEL messages:
Upper Critical going high – crossing an upper critical threshold
Upper Non-critical going high – crossing an upper non-critical threshold
Lower Critical going low – crossing a lower critical threshold
Lower Non-critical going low – crossing a lower non-critical threshold
Example:
SEL Record ID : 003c
Record Type : 02
Timestamp : 07:01:06 UTC 07:01:06 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : Temperature
Sensor Number : 03
Event Type : Threshold
Event Direction : Assertion Event
Event Data (RAW) : 592802
Trigger Reading : 40.000degrees C
Trigger Threshold : 2.000degrees C
Description : Upper Critical going high
Sensor ID : p0_temp (0x3)
Entity ID : 0.1
Sensor Type (Threshold) : Temperature
Sensor Reading : 40 (+/- 0) degrees C
Status : ok
Lower Non-Recoverable : na
Lower Critical : -5.000
Lower Non-Critical : 0.000
Upper Non-Critical : 70.000
Upper Critical : 75.000
Upper Non-Recoverable : na
Positive Hysteresis : Unspecified
Negative Hysteresis : Unspecified
Assertion Events :
Event Enable : Event Messages Disabled
Assertions Enabled : lnc- lcr- unc+ ucr+
Deassertions Enabled : lnc+ lcr+ unc- ucr-
FRU Device Description : Nvidia-BMCMezz (ID 169)
Board Mfg Date : Tue Jan 3 23:16:00 2023 UTC
Board Mfg : Nvidia
Board Product : Nvidia-BMCMezz
Board Serial : MT2251XZ02W5
Board Part Number : 900-9D3B6-00CV-AAA
FRU Device Description : BlueField-3 Smar (ID 250)
Board Mfg Date : Tue Jan 3 23:16:00 2023 UTC
Board Mfg : Nvidia
Board Product : BlueField-3 SmartNIC Main Card
Board Serial : MT2251XZ02W5
Board Part Number : 900-9D3B6-00CV-AAA
Product Manufacturer : Nvidia
Product Name : BlueField-3 SmartNIC Main Card
Product Part Number : 900-9D3B6-00CV-AAA
Product Version : A3
Product Serial : MT2251XZ02W5
Product Asset Tag : 900-9D3B6-00CV-AAA
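The relationship between a reading, its thresholds, and the four threshold messages can be sketched as a simple comparison. This is an illustration rather than the BMC's implementation: the function name is hypothetical, and treating a reading equal to a threshold as a crossing is an assumption. A threshold reported as na is skipped.

```shell
# Classify a reading against its thresholds.
# Arguments: reading lower_critical lower_noncritical upper_noncritical upper_critical
# A threshold given as "na" is ignored. Equality counts as crossing (assumption).
threshold_event() {
    awk -v r="$1" -v lcr="$2" -v lnc="$3" -v unc="$4" -v ucr="$5" 'BEGIN {
        if      (ucr != "na" && r + 0 >= ucr + 0) print "Upper Critical going high"
        else if (unc != "na" && r + 0 >= unc + 0) print "Upper Non-critical going high"
        else if (lcr != "na" && r + 0 <= lcr + 0) print "Lower Critical going low"
        else if (lnc != "na" && r + 0 <= lnc + 0) print "Lower Non-critical going low"
        else print "none"
    }'
}
```

For example, with the thresholds from the sensor status above (-5.000, 0.000, 70.000, 75.000), a reading of 40 yields no event, while a reading of 72 is classified as an upper non-critical crossing.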
ADC Sensors
Messages are added to the SEL if the sensor voltage crosses the sensor's thresholds. The messages include a description of the event, DPU FRU device description, DPU BMC device description, and the status of the sensor.
List of ADC sensors:
1V_BMC
1_2V_BMC
1_8V
1_8V_BMC
2_5V
3_3V
3_3V_RGM
5V
12V_ATX
12V_PCIe
DVDD
HVDD
VDD
VDDQ
VDD_CPU_L
VDD_CPU_R
SEL messages:
Upper Non-critical going high – crossing an upper non-critical threshold
Lower Non-critical going low – crossing a lower non-critical threshold
Example:
SEL Record ID : 0042
Record Type : 02
Timestamp : 09:20:50 UTC 09:20:50 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : Voltage
Sensor Number : 06
Event Type : Threshold
Event Direction : Assertion Event
Event Data (RAW) : 50a9ff
Trigger Reading : 1.200Volts
Trigger Threshold : 1.810Volts
Description : Lower Non-critical going low
Sensor ID : 1_2V_BMC (0x6)
Entity ID : 0.1
Sensor Type (Threshold) : Voltage
Sensor Reading : 1.200 (+/- 0) Volts
Status : ok
Lower Non-Recoverable : na
Lower Critical : na
Lower Non-Critical : 1.143
Upper Non-Critical : 1.257
Upper Critical : na
Upper Non-Recoverable : na
Positive Hysteresis : Unspecified
Negative Hysteresis : Unspecified
Assertion Events :
Event Enable : Event Messages Disabled
Assertions Enabled : lnc- unc+
Deassertions Enabled : lnc+ unc-
FRU Device Description : Nvidia-BMCMezz (ID 169)
Board Mfg Date : Tue Jan 3 23:16:00 2023 UTC
Board Mfg : Nvidia
Board Product : Nvidia-BMCMezz
Board Serial : MT2251XZ02W5
Board Part Number : 900-9D3B6-00CV-AAA
FRU Device Description : BlueField-3 Smar (ID 250)
Board Mfg Date : Tue Jan 3 23:16:00 2023 UTC
Board Mfg : Nvidia
Board Product : BlueField-3 SmartNIC Main Card
Board Serial : MT2251XZ02W5
Board Part Number : 900-9D3B6-00CV-AAA
Product Manufacturer : Nvidia
Product Name : BlueField-3 SmartNIC Main Card
Product Part Number : 900-9D3B6-00CV-AAA
Product Version : A3
Product Serial : MT2251XZ02W5
Product Asset Tag : 900-9D3B6-00CV-AAA
Sensor Data Record (SDR) Repository
Supported SDR Commands
BMC software supports reading chassis sensor information using IPMItool.
The following table lists commands which allow reading SDR data:
| Command | Description |
| | Displays sensor data repository entry readings and their status |
| | Displays extended sensor information |
| | Displays sensors and thresholds in a wide table format |
| | Displays information for sensor data records specified by sensor ID |
| | Displays all records from the SDR repository of a specific type |
| | Displays information for sensors specified by name |
| | Displays readings for sensors specified by name (only for numeric sensors) |
If a threshold is crossed, a message is added to the Redfish event log, SEL, and journal.
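A quick health check can filter a sensor listing for entries whose status is not ok. The sketch below assumes sdr elist output of the form "name | number | status | entity | reading", with the status in the third "|"-separated field; that layout and the function name are assumptions.

```shell
# Read an extended SDR listing on stdin and print sensors whose status is
# not "ok". Assumes the status is the third "|"-separated field.
sdr_not_ok() {
    awk -F'|' 'NF >= 3 {
        name = $1;   gsub(/^[ \t]+|[ \t]+$/, "", name)
        status = $3; gsub(/[ \t]/, "", status)
        if (status != "ok") print name ": " status
    }'
}
```

It is meant to be fed from a pipe, for example the output of the sdr elist command.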
SDR Entry List
SDR contains information about the type and number of sensors. The following is a list of the available SDR information:
| Managed Entity | ID | Sensor Name |
| SFP link status | 0x0 0x1 | |
| NIC thermal sensors | 0x2 | bluefield_temp |
| SFP temperature sensors | 0x3 0x4 | |
| NIC voltage sensors | 0x5 0x6 0x7 0x8 0x9 0xa 0xb 0xc 0xd 0xe 0xf 0x10 0x11 0x12 0x13 0x14 | ADC voltage sensors: |
Rebooting BlueField with BMC
BMC software enables resetting the BlueField.
To reset the main CPU, run:
ipmitool -C 17 -I lanplus -H <bmc_ip_addr> -U ADMIN -P ADMIN chassis power reset
The BMC can retrieve information on BlueField's sensors and FRUs via IPMI over IPMB protocol. IPMItool commands can be issued from the BMC using the following format:
ipmitool -I ipmb <ipmitool_arguments>
List of IPMI Supported Sensors
| Sensor | Sensor ID | Description |
| bluefield_temp | 0 | Support NIC monitoring of BlueField’s temperature |
| ddr0_0_temp | 1 | Support monitoring of DDR0 temp (on memory controller 0) |
| ddr0_1_temp | 2 | Support monitoring of DDR1 temp (on memory controller 0) |
| ddr1_0_temp | 3 | Support monitoring of DDR0 temp (on memory controller 1) |
| ddr1_1_temp | 4 | Support monitoring of DDR1 temp (on memory controller 1) |
| p0_temp | 5 | Port 0 temperature |
| p1_temp | 6 | Port 1 temperature |
| p0_link | 7 | Port 0 link status |
| p1_link | 8 | Port 1 link status |
List of IPMI Supported FRUs
| FRU | ID | Description |
| update_timer | 0 | set_emu_param.service is responsible for collecting data on sensors and FRUs every 3 seconds. This regular update is required for sensors but not for FRUs whose content is less susceptible to change. update_timer is used to sample the FRUs every hour instead. Users may need this timer in the case where they are issuing several raw IPMItool FRU read commands. This helps in assessing how much time users have to retrieve large FRU data before the next FRU update. update_timer is a hexadecimal number. |
| fw_info | 1 | ConnectX firmware information, Arm firmware version, and MLNX_OFED version. The fw_info is in ASCII format. |
| nic_pci_dev_info | 2 | NIC vendor ID, device ID, subsystem vendor ID, and subsystem device ID. The nic_pci_dev_info is in ASCII format. |
| cpuinfo | 3 | CPU information reported in lscpu and /proc/cpuinfo. The cpuinfo is in ASCII format. |
| ddr0_0_spd | 4 | FRU for SPD MC0 DIMM0 (MC = memory controller). The ddr0_0_spd is in binary format. |
| ddr0_1_spd | 5 | FRU for SPD MC0 DIMM1. The ddr0_1_spd is in binary format. |
| ddr1_0_spd | 6 | FRU for SPD MC1 DIMM0. The ddr1_0_spd is in binary format. |
| ddr1_1_spd | 7 | FRU for SPD MC1 DIMM1. The ddr1_1_spd is in binary format. |
| emmc_info | 8 | eMMC size, list of its partitions, and partitions usage (in ASCII format). eMMC CID, CSD, and extended CSD registers (in binary format). The ASCII data is separated from the binary data with ‘StartBinary’ marker. |
| qsfp0_eeprom | 9 | FRU for QSFP 0 EEPROM page 0 content (256 bytes in binary format) |
| qsfp1_eeprom | 10 | FRU for QSFP 1 EEPROM page 0 content (256 bytes in binary format) |
| ip_addresses | 11 | This FRU is empty at start time. It can be used to write the BMC port 0 and port 1 IP addresses to the BlueField. They follow these formats: The size of the written file should be 61 bytes exactly. |
| dimms_ce_ue | 12 | FRU reporting the number of correctable and uncorrectable errors in the DIMMs. This FRU is updated once every 3 seconds. |
| eth0 | 13 | Network interface 0 information. Updated once every minute. |
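Because the file written to the ip_addresses FRU must be exactly 61 bytes, it is worth checking its size before writing. A minimal sketch; the function name is hypothetical, and the file contents themselves are outside the scope of this check:

```shell
# Return success only if the given file exists and is exactly 61 bytes,
# the size required for the ip_addresses FRU.
ip_fru_size_ok() {
    [ -f "$1" ] || return 2
    fru_size=$(wc -c < "$1" | tr -d ' ')   # tr strips padding some wc versions add
    [ "$fru_size" -eq 61 ]
}
```

A script can call this as a guard and refuse to proceed when it returns non-zero.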
Supported IPMI Commands
All of the following commands are prepended with ipmitool on the command line.
| Commands | IPMItool Command | Relevant IPMI 2.0 Rev1.1 Spec Section |
| Get Device ID | mc info | 20.1 |
| Broadcast "Get Device ID" | Part of "mc info" | 20.9 |
| Get BMC Global Enables | mc getenables | 22.2 |
| Get Device SDR Info | sdr info | 35.2 |
| Get Device SDR | "sdr get", "sdr list" or "sdr elist" | 35.3 |
| Get Sensor Hysteresis | sdr get <sensor-id> | 35.7 |
| Set Sensor Threshold | sensor thresh <sensor-id> <threshold> <setting> | 35.8 |
| Get Sensor Threshold | sdr get <sensor-id> | 35.9 |
| Get Sensor Event Enable | sdr get <sensor-id> | 35.11 |
| Get Sensor Reading | sensor reading <sensor-id> | 35.14 |
| Get Sensor Type | sdr type <type> | 35.16 |
| Read FRU Data | fru read <fru-number> <file-to-write-to> – provides FRU data | 34.2 |
| Get SDR Repository Info | sdr info | 33.9 |
| Get SEL Info | "sel" or "sel info" | 40.2 |
| Get SEL Allocation Info | "sel" or "sel info" | 40.3 |
| Get SEL Entry | "sel list" or "sel elist" | 40.5 |
| Delete SEL Entry | sel delete <id> | 40.8 |
| Clear SEL | sel clear | 40.9 |
The BMC has 2 IPMB modes. It can be used as a requester or responder.
Requester Mode
When used as a requester, the BMC sends IPMB request messages to the BlueField via SMBus 0. The BlueField then processes the request and sends a message back to the BMC.
Responder Mode
When used as a responder, the BMC receives IPMB request messages from the BlueField on SMBus 0. It then processes the message and sends a response back to the BlueField.
Both modes are enabled automatically at boot time.
For more information on how to use IPMI, please refer to the IPMI 2.0 standard.
BMC supports IPMI boot option selection commands. UEFI on BlueField-2 can query for the boot options through an IPMI command over IPMB. Currently, the UEFI on BlueField-2 supports only changing the boot device selector flag, with two options: PXE boot or the default boot device as selected in the boot menu on BlueField-2.
Get current setting – ipmitool chassis bootparam get 5
Force PXE boot – ipmitool chassis bootparam set bootflag force_pxe
Default boot device – ipmitool chassis bootparam set bootflag none
The DPU boot override setting from BMC is persistent until it is set to none or the BFB image is updated again.
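The two boot override settings can be wrapped in a small helper so that scripts do not pass arbitrary bootflag values. This is a sketch only: the function name, its pxe and default keywords, and the IPMITOOL variable are assumptions; the underlying commands are the ones listed above.

```shell
# Hypothetical helper: select PXE boot or the default boot device.
# IPMITOOL may point at a different binary (or a stub for dry runs).
set_boot_override() {
    case "$1" in
        pxe)     boot_flag=force_pxe ;;
        default) boot_flag=none ;;
        *)
            echo "usage: set_boot_override pxe|default" >&2
            return 2
            ;;
    esac
    "${IPMITOOL:-ipmitool}" chassis bootparam set bootflag "$boot_flag"
}
```

Any other keyword is rejected before ipmitool is invoked.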
BMC supports reset control of BlueField-2 through the GPIOs connected to the BMC.
Issue the following command from the BMC to get the power status of the DPU:
ipmitool chassis power status
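Scripts often need the power state as a simple token. The sketch below reads the command's output on standard input and assumes it contains the usual ipmitool text "Chassis Power is on" or "Chassis Power is off"; the function name is hypothetical.

```shell
# Reduce "chassis power status" output (read from stdin) to on/off/unknown.
dpu_power_state() {
    case "$(cat)" in
        *"Chassis Power is on"*)  echo on ;;
        *"Chassis Power is off"*) echo off ;;
        *) echo unknown; return 1 ;;
    esac
}
```

It can be placed at the end of a pipe that starts with the power status command.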
To perform a reset of the DPU, use the following commands:
| Description | Command |
| Hard reset of BlueField DPU (Arm cores and NIC) | |
| Hard reset of BlueField Arm cores | |
Hard reset of the BlueField DPU is allowed only when the host asserts:
PERST signal on BlueField-2
All_STANDBY signal on BlueField-3
OEM command 0xA1 is defined for additional non-standard reset controls of BlueField-2 from BMC under the OEM NetFn group 0x30.
NVIDIA OEM command to reset BlueField DPU:
Request |
Response |
Reset Option |
|
Completion code:
|
|