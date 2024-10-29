The System Event Log (SEL) and Event Log in OpenBMC provide robust mechanisms for monitoring, diagnosing, and troubleshooting hardware and system issues.

SEL:

Functionality – The SEL captures and records significant system events related to hardware and firmware, such as hardware failures, temperature threshold crossings, power anomalies, and other critical system changes.

Access – The SEL can be accessed via IPMI or Redfish commands, allowing administrators to query and retrieve logs for analysis.

Management – Administrators can clear, save, and manage SEL entries to maintain system health and ensure critical events are recorded accurately.

Event Log:

Functionality – The Event Log records both hardware and software events, including firmware updates, configuration changes, and security alerts, giving detailed insight into system operations and potential issues.

Access – The Event Log is accessible via the Redfish interface, enabling retrieval and management of event data.

Management – Users can filter, sort, and analyze events to identify patterns, diagnose problems, and improve system reliability. The Event Log supports exporting logs for offline analysis and archiving.

Key features:

Scalability – Both the SEL and Event Log are designed to handle a high volume of events, ensuring no critical information is lost.

Integration – These logs integrate with existing management tools, providing a unified view of system health and events.

Usability – User-friendly interfaces and command-line tools make it easy to access and manage logs, so administrators can respond quickly to issues.



Overall, the SEL and Event Log in OpenBMC are essential tools for maintaining system integrity, improving reliability, and ensuring swift resolution of any issues that arise.

curl -k -u root:'<password>' -H 'Content-Type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/LogServices/EventLog/

Output example:

{
  "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog",
  "@odata.type": "#LogService.v1_1_0.LogService",
  "Actions": {
    "#LogService.ClearLog": {
      "target": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Actions/LogService.ClearLog"
    }
  },
  "DateTime": "2023-09-27T14:28:50+00:00",
  "DateTimeLocalOffset": "+00:00",
  "Description": "System Event Log Service",
  "Entries": {
    "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries"
  },
  "Id": "EventLog",
  "Name": "Event Log Service",
  "Oem": {
    "Nvidia": {
      "@odata.type": "#NvidiaLogService.v1_0_0.NvidiaLogService",
      "LatestEntryID": "4",
      "LatestEntryTimeStamp": "2023-09-27T14:19:30+00:00"
    }
  },
  "OverWritePolicy": "WrapsWhenFull"
}
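For scripted access, the curl call above could be mirrored in Python. The following is a minimal sketch using only the standard library. The function name `build_log_service_request`, the address `192.0.2.10`, and the password are hypothetical placeholders, and the request is only built, not sent.

```python
import base64
import urllib.request


def build_log_service_request(bmc_ip: str, user: str, password: str,
                              service: str = "EventLog") -> urllib.request.Request:
    """Build (but do not send) a GET request for a Redfish log service."""
    url = f"https://{bmc_ip}/redfish/v1/Systems/Bluefield/LogServices/{service}/"
    # HTTP Basic authentication, the equivalent of curl's -u option.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="GET",
    )


# Placeholder address and credentials, for illustration only.
req = build_log_service_request("192.0.2.10", "root", "example-password")
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen` against a BMC that uses a self-signed certificate would also need an `ssl.SSLContext` with verification disabled, which is what curl's `-k` option does.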





curl -k -u root:'<password>' -H 'Content-Type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries

Output example:

{
  "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries",
  "@odata.type": "#LogEntryCollection.LogEntryCollection",
  "Description": "Collection of System Event Log Entries",
  "Members": [
    {
      "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/1",
      "@odata.type": "#LogEntry.v1_9_0.LogEntry",
      "AdditionalDataURI": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/1/attachment",
      "Created": "2023-09-27T14:18:39+00:00",
      "EntryType": "Event",
      "Id": "1",
      "Message": "12V_ATX sensor crossed a warning low threshold going low. Reading=6.048000 Threshold=10.400000.",
      "MessageArgs": [
        "12V_ATX",
        "6.048000",
        "10.400000"
      ],
      "MessageId": "OpenBMC.0.1.SensorThresholdWarningLowGoingLow",
      "Name": "System Event Log Entry",
      "Resolution": "",
      "Resolved": false,
      "Severity": "OK"
    }
    …
  ],
  "Members@odata.count": 1,
  "Name": "System Event Log Entries"
}
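Once the entries collection has been retrieved, it can be reduced to one line per event. The sketch below assumes the response has the shape shown in the output example. `SAMPLE` is a trimmed copy of that example, and `summarize_entries` is a hypothetical helper name, not part of any BMC tooling.

```python
import json

# Trimmed copy of the collection shown in the output example (one member, fewer fields).
SAMPLE = """
{
  "Members": [
    {
      "Id": "1",
      "Created": "2023-09-27T14:18:39+00:00",
      "MessageId": "OpenBMC.0.1.SensorThresholdWarningLowGoingLow",
      "MessageArgs": ["12V_ATX", "6.048000", "10.400000"],
      "Resolved": false,
      "Severity": "OK"
    }
  ],
  "Members@odata.count": 1
}
"""


def summarize_entries(collection: dict) -> list[str]:
    """Return one summary line per log entry in a LogEntryCollection."""
    lines = []
    for entry in collection.get("Members", []):
        # The last dot-separated component of MessageId names the event.
        event = entry["MessageId"].rsplit(".", 1)[-1]
        args = ", ".join(entry.get("MessageArgs", []))
        lines.append(f'{entry["Id"]} {entry["Created"]} {event} ({args})')
    return lines


summary = summarize_entries(json.loads(SAMPLE))
for line in summary:
    print(line)
```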





curl -k -u root:'<password>' -H 'Content-Type: application/json' -X POST https://<bmc_ip>/redfish/v1/Systems/Bluefield/LogServices/EventLog/Actions/LogService.ClearLog

curl -k -u root:'<password>' -H 'Content-Type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/LogServices/SEL/

Output example:

{
  "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/SEL",
  "@odata.type": "#LogService.v1_1_0.LogService",
  "Actions": {
    "#LogService.ClearLog": {
      "target": "/redfish/v1/Systems/Bluefield/LogServices/SEL/Actions/LogService.ClearLog"
    }
  },
  "DateTime": "2024-07-18T10:54:52+00:00",
  "DateTimeLocalOffset": "+00:00",
  "Description": "IPMI SEL Service",
  "Entries": {
    "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/SEL/Entries"
  },
  "Id": "SEL",
  "Name": "SEL Log Service",
  "OverWritePolicy": "WrapsWhenFull"
}





curl -k -u root:'<password>' -H 'Content-Type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/LogServices/SEL/Entries

Output example:

{
  "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/SEL/Entries",
  "@odata.type": "#LogEntryCollection.LogEntryCollection",
  "Description": "Collection of System Event Log Entries",
  "Members": [
    {
      "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/SEL/Entries/1",
      "@odata.type": "#LogEntry.v1_13_0.LogEntry",
      "Created": "2024-07-16T15:34:32+00:00",
      "EntryType": "SEL",
      "Id": "1",
      "Message": "12V_ATX sensor crossed a warning low threshold going low. Reading=6.048000 Threshold=10.400000.",
      "MessageArgs": [
        "12V_ATX",
        "6.048000",
        "10.400000"
      ],
      "MessageId": "OpenBMC.0.1.SensorThresholdWarningLowGoingLow",
      "Name": "System Event Log Entry",
      "Resolution": "Check the sensor or subsystem for errors.",
      "Resolved": false,
      "Severity": "OK"
    },
    …
  ],
  "Members@odata.count": 22,
  "Name": "System Event Log Entries"
}





curl -k -u root:'<password>' -H 'Content-Type: application/json' -X POST https://<bmc_ip>/redfish/v1/Systems/Bluefield/LogServices/SEL/Actions/LogService.ClearLog





curl -k -u root:'<password>' -H 'Content-Type: application/json' -X POST https://<bmc_ip>/redfish/v1/Managers/Bluefield_BMC/Actions/Oem/Nvidia/SelCapacity -d '{"ErrorInfoCap": 300}'





curl -k -u root:'<password>' -H 'Content-Type: application/json' -X GET https://<bmc_ip>/redfish/v1/Managers/Bluefield_BMC/Oem/Nvidia/SelCapacity

Example output:

{
  "ErrorInfoCap": 300
}

The following table lists the commands used to view and manage event logs:

ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sel





ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sel list





ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sel elist





ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sel save <filename>





ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN sel clear





The capacity is a 4-byte value passed least-significant byte first (little-endian), as shown in the command example.

To set the capacity to 300 lines (0x0000012c), the value should be 0x2c 0x01 0x00 0x00:

ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN raw 0x0a 0x4a <capacity[0:7]> <capacity[8:15]> <capacity[16:23]> <capacity[24:31]>
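The byte ordering described above can be checked with a short script. This is a minimal sketch: `capacity_to_raw_args` and `raw_bytes_to_capacity` are hypothetical helper names, and the inverse helper assumes the capacity is read back as four hex bytes in the same low-to-high order, which the text above does not state.

```python
import struct


def capacity_to_raw_args(capacity: int) -> str:
    """Return the 4 capacity bytes, low byte first, as ipmitool raw arguments."""
    # "<I" packs an unsigned 32-bit integer in little-endian order.
    return " ".join(f"0x{b:02x}" for b in struct.pack("<I", capacity))


def raw_bytes_to_capacity(hex_bytes: str) -> int:
    """Assumed inverse: decode 4 space-separated hex bytes, low byte first."""
    data = bytes(int(token, 16) for token in hex_bytes.split())
    return struct.unpack("<I", data)[0]


args = capacity_to_raw_args(300)
print(args)  # 0x2c 0x01 0x00 0x00
```

The printed string can be appended after `raw 0x0a 0x4a` in the command above.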





ipmitool -C 17 -I lanplus -H <bmc_ip> -U ADMIN -P ADMIN raw 0x0a 0x4b

The following subsections detail the messages which are added to the BMC SEL and the scenarios that trigger them.

While the BlueField UEFI is booting, messages describing the status of the UEFI boot are added to the BMC SEL.

SEL messages:

SMBus initialization

PCI resource configuration

System boot initiated

Example:

SEL Record ID : 0037
Record Type : 02
Timestamp : 06:36:06 UTC 06:36:06 UTC
Generator ID : 0001
EvM Revision : 04
Sensor Type : System Firmware
Sensor Number : 06
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : c207ff
Description : PCI resource configuration
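A record like the one above can be turned into a dictionary for scripted checks. The sketch below assumes the tool prints one `name : value` field per line. `SAMPLE` is a shortened copy of the example, and `parse_sel_record` is a hypothetical helper name.

```python
# Shortened copy of the example record, one field per line (assumed layout).
SAMPLE = """\
SEL Record ID : 0037
Record Type : 02
Timestamp : 06:36:06 UTC 06:36:06 UTC
Sensor Type : System Firmware
Event Data : c207ff
Description : PCI resource configuration
"""


def parse_sel_record(text: str) -> dict[str, str]:
    """Map each field name to its value for one SEL record."""
    record = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            record[key.strip()] = value.strip()
    return record


record = parse_sel_record(SAMPLE)
print(record["Description"])
```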





Messages are added to the SEL when the status of a QSFP cable changes. The messages describe the event and the status of the sensor.

List of QSFP sensors:

p0_link – the QSFP 0 cable status

p1_link – the QSFP 1 cable status

SEL messages:

Config Error – the QSFP cable is down

Connected – the QSFP cable is up

Example:

SEL Record ID : 003e
Record Type : 02
Timestamp : 07:08:28 UTC 07:08:28 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : Cable / Interconnect
Sensor Number : 00
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data (RAW) : 010f0f
Event Interpretation : Missing
Description : Config Error

Sensor ID : p0_link (0x0)
Entity ID : 31.1
Sensor Type (Discrete): Cable / Interconnect
States Asserted : Cable / Interconnect
                  [Config Error]





Messages are added to the SEL when a temperature sensor reading crosses one of the sensor's thresholds. The messages include a description of the event, the BlueField FRU device description, the BlueField BMC device description, and the status of the sensor.

List of temperature sensors:

bluefield_temp – BlueField temperature

p0_temp – QSFP 0 cable temperature

p1_temp – QSFP 1 cable temperature

SEL messages:

Upper Critical going high – crossing an upper critical threshold

Upper Non-critical going high – crossing an upper non-critical threshold

Lower Critical going low – crossing a lower critical threshold

Lower Non-critical going low – crossing a lower non-critical threshold

Example:

SEL Record ID : 003c
Record Type : 02
Timestamp : 07:01:06 UTC 07:01:06 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : Temperature
Sensor Number : 03
Event Type : Threshold
Event Direction : Assertion Event
Event Data (RAW) : 592802
Trigger Reading : 40.000degrees C
Trigger Threshold : 2.000degrees C
Description : Upper Critical going high

Sensor ID : p0_temp (0x3)
Entity ID : 0.1
Sensor Type (Threshold) : Temperature
Sensor Reading : 40 (+/- 0) degrees C
Status : ok
Lower Non-Recoverable : na
Lower Critical : -5.000
Lower Non-Critical : 0.000
Upper Non-Critical : 70.000
Upper Critical : 75.000
Upper Non-Recoverable : na
Positive Hysteresis : Unspecified
Negative Hysteresis : Unspecified
Assertion Events :
Event Enable : Event Messages Disabled
Assertions Enabled : lnc- lcr- unc+ ucr+
Deassertions Enabled : lnc+ lcr+ unc- ucr-

FRU Device Description : Nvidia-BMCMezz (ID 169)
Board Mfg Date : Tue Jan 3 23:16:00 2023 UTC
Board Mfg : Nvidia
Board Product : Nvidia-BMCMezz
Board Serial : MT2251XZ02W5
Board Part Number : 900-9D3B6-00CV-AAA

FRU Device Description : BlueField-3 Smar (ID 250)
Board Mfg Date : Tue Jan 3 23:16:00 2023 UTC
Board Mfg : Nvidia
Board Product : BlueField-3 SmartNIC Main Card
Board Serial : MT2251XZ02W5
Board Part Number : 900-9D3B6-00CV-AAA
Product Manufacturer : Nvidia
Product Name : BlueField-3 SmartNIC Main Card
Product Part Number : 900-9D3B6-00CV-AAA
Product Version : A3
Product Serial : MT2251XZ02W5
Product Asset Tag : 900-9D3B6-00CV-AAA
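The raw event data in the example (592802) can be related to the decoded fields. The sketch below assumes the standard IPMI layout for threshold events: the low nibble of byte 1 is the event offset, bits 7:6 and 5:4 flag whether bytes 2 and 3 carry the trigger reading and trigger threshold. Only the four offsets listed above are mapped, and `decode_threshold_event` is a hypothetical helper name.

```python
# Event offsets for the four messages listed above (assumed IPMI threshold offsets).
THRESHOLD_OFFSETS = {
    0x0: "Lower Non-critical going low",
    0x2: "Lower Critical going low",
    0x7: "Upper Non-critical going high",
    0x9: "Upper Critical going high",
}


def decode_threshold_event(event_data_hex: str) -> dict:
    """Split 3 bytes of threshold event data into offset, reading and threshold."""
    data = bytes.fromhex(event_data_hex)
    if len(data) != 3:
        raise ValueError("expected exactly 3 bytes of event data")
    byte1, byte2, byte3 = data
    return {
        "description": THRESHOLD_OFFSETS.get(byte1 & 0x0F, "unmapped offset"),
        # Value 0b01 in each field means the following byte is meaningful.
        "has_reading": ((byte1 >> 6) & 0x3) == 0x1,
        "has_threshold": ((byte1 >> 4) & 0x3) == 0x1,
        "raw_reading": byte2,
        "raw_threshold": byte3,
    }


event = decode_threshold_event("592802")
print(event["description"], event["raw_reading"], event["raw_threshold"])
```

For this temperature sensor the raw bytes (0x28 = 40, 0x02 = 2) match the displayed trigger reading and threshold. In general the raw values must be converted with the sensor's own conversion factors before they can be compared with displayed values.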

Messages are added to the SEL if the sensor voltage crosses the sensor's thresholds. The messages include a description of the event, BlueField FRU device description, BlueField BMC device description, and the status of the sensor.

List of ADC sensors:

1V_BMC

1_2V_BMC

1_8V

1_8V_BMC

2_5V

3_3V

3_3V_RGM

5V

12V_ATX

12V_PCIe

DVDD

HVDD

VDD

VDDQ

VDD_CPU_L

VDD_CPU_R

SEL messages:

Upper Non-critical going high – crossing an upper non-critical threshold

Lower Non-critical going low – crossing a lower non-critical threshold

Example:

SEL Record ID : 0042
Record Type : 02
Timestamp : 09:20:50 UTC 09:20:50 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : Voltage
Sensor Number : 06
Event Type : Threshold
Event Direction : Assertion Event
Event Data (RAW) : 50a9ff
Trigger Reading : 1.200Volts
Trigger Threshold : 1.810Volts
Description : Lower Non-critical going low

Sensor ID : 1_2V_BMC (0x6)
Entity ID : 0.1
Sensor Type (Threshold) : Voltage
Sensor Reading : 1.200 (+/- 0) Volts
Status : ok
Lower Non-Recoverable : na
Lower Critical : na
Lower Non-Critical : 1.143
Upper Non-Critical : 1.257
Upper Critical : na
Upper Non-Recoverable : na
Positive Hysteresis : Unspecified
Negative Hysteresis : Unspecified
Assertion Events :
Event Enable : Event Messages Disabled
Assertions Enabled : lnc- unc+
Deassertions Enabled : lnc+ unc-

FRU Device Description : Nvidia-BMCMezz (ID 169)
Board Mfg Date : Tue Jan 3 23:16:00 2023 UTC
Board Mfg : Nvidia
Board Product : Nvidia-BMCMezz
Board Serial : MT2251XZ02W5
Board Part Number : 900-9D3B6-00CV-AAA

FRU Device Description : BlueField-3 Smar (ID 250)
Board Mfg Date : Tue Jan 3 23:16:00 2023 UTC
Board Mfg : Nvidia
Board Product : BlueField-3 SmartNIC Main Card
Board Serial : MT2251XZ02W5
Board Part Number : 900-9D3B6-00CV-AAA
Product Manufacturer : Nvidia
Product Name : BlueField-3 SmartNIC Main Card
Product Part Number : 900-9D3B6-00CV-AAA
Product Version : A3
Product Serial : MT2251XZ02W5
Product Asset Tag : 900-9D3B6-00CV-AAA





SEL messages:

System boot initiated Initiated by warm reset

Example:

SEL Record ID : 0001
Record Type : 02
Timestamp : 01/10/24 14:25:07 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : System Boot Initiated
Sensor Number : 17
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 020000
Description : Initiated by warm reset





SEL messages:

System boot initiated Initiated by hard reset

Example:

SEL Record ID : 0008
Record Type : 02
Timestamp : 01/10/24 14:33:01 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : System Boot Initiated
Sensor Number : 17
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 010000
Description : Initiated by hard reset
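The warm reset and hard reset records above differ only in the first event data byte (02 versus 01). A minimal sketch of telling them apart in a script follows. It assumes the low nibble of that byte is the event offset, maps only the two offsets seen in these examples, and uses the hypothetical helper name `describe_boot_event`.

```python
# Offsets taken from the two example records above; others are left unmapped.
BOOT_INITIATED_OFFSETS = {
    0x1: "Initiated by hard reset",
    0x2: "Initiated by warm reset",
}


def describe_boot_event(event_data_hex: str) -> str:
    """Describe a System Boot Initiated event from its 3-byte event data."""
    first_byte = bytes.fromhex(event_data_hex)[0]
    offset = first_byte & 0x0F  # low nibble holds the event offset (assumed)
    return BOOT_INITIATED_OFFSETS.get(offset, f"unmapped offset 0x{offset:x}")


print(describe_boot_event("020000"))
print(describe_boot_event("010000"))
```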

If the host does not assert the PERST / ALL_STANDBY signal, causing the reset to fail, the following SEL messages can be observed:

Power Unit Failure detected

Example:

SEL Record ID : 0004
Record Type : 02
Timestamp : 07/25/24 13:32:18 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : Power Unit
Sensor Number : 1b
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 060000
Description : Failure detected





SEL messages:

OS Critical Stop OS graceful shutdown

Example:

SEL Record ID : 000a
Record Type : 02
Timestamp : 01/10/24 14:34:45 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : OS Critical Stop
Sensor Number : 18
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 030000
Description : OS graceful shutdown





SEL messages:

Firmware or software change success

Example:

SEL Record ID : 0007
Record Type : 02
Timestamp : 06/11/24 14:03:02 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : Version Change
Sensor Number : 18
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : c70000
Description : Firmware or software change success





SEL messages:

Firmware or software change success, Mngmt SW agent change

Example:

SEL Record ID : 0010
Record Type : 02
Timestamp : 01/10/24 15:48:01 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : Version Change
Sensor Number : 19
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : c70e00
Description : Firmware or software change success, Mngmt SW agent change

SEL messages:

Uncorrectable ECC

Example:

SEL Record ID : 024a
Record Type : 02
Timestamp : 06/20/24 15:54:58 UTC
Generator ID : 0020
EvM Revision : 04
Sensor Type : Memory
Sensor Number : 17
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 010000
Description : Uncorrectable ECC





SEL messages:

Correctable ECC

Example: