RAS
RAS is supported on NVIDIA® BlueField®-3 and higher only.
Reliability, availability and serviceability (RAS) reduces and avoids unplanned outages because:
Errors can be detected and corrected by the hardware before they cause system failures
Errors can be detected by hardware and reported to software. Software can take action by identifying and replacing failing components.
Errors can be predicted ahead of time to allow replacement of defective hardware
SDEI NMI watchdog is supported on Linux OS openEuler and Anolis only.
The software-delegated exception interface (SDEI) non-maskable interrupt (NMI) watchdog offers a way to detect a Linux kernel hard lockup (e.g., a hang in an interrupt handler). It uses a high-priority per-core secure timer as a keepalive mechanism and prints the call stack information when the Linux kernel is stuck.
The SDEI NMI watchdog is an implementation for the "Software watchdog timer" as described in section 2.1.1 "Typical use cases" of the Software Delegated Exception Interface (SDEI) Platform Design Document.
The boot error record table (BERT) is used to report errors that occurred and were unhandled in a previous boot. On the subsequent boot, the OS reports the error using the common platform error record (CPER) format.
For more information on BERT, refer to section 18.3 of the ACPI specification.
For more information on CPER, refer to appendix N of the UEFI specification.
In BlueField-3, BERT reports all RShim log messages. For example, ASSERTs which occurred in the UEFI generate a Linux error report in the subsequent boot.
Example of a boot where ASSERTs occurred:
INFO[BL2]: start
INFO[BL2]: DDR POST passed
INFO[BL2]: UEFI loaded
INFO[BL31]: start
INFO[BL31]: lifecycle GA Secured
INFO[BL31]: runtime
INFO[UEFI]: UPVS valid
WARN[UEFI]: UPVS full
WARN[UEFI]: UPVS reclaim
WARN[UEFI]: Var reclaim
WARN[UEFI]: Var reclaim done
INFO[UEFI]: eMMC init
INFO[UEFI]: eMMC probed
INFO[UEFI]: PMI: updates started
INFO[UEFI]: PMI: total updates: 1
INFO[UEFI]: PMI: updates completed, status 0
ASSERT[UEFI]: MdePkg/Library/BaseLib/String.c: 173
ASSERT[UEFI]: PC=0xF56706B8
ASSERT[UEFI]: PC=0xF5668A2C
ASSERT[UEFI]: PC=0xF567E414
ASSERT[UEFI]: PC=0xFDE33DA4
ASSERT[UEFI]: PC=0x88009338
ASSERT[UEFI]: PC=0x880094D4
INFO[UEFI]: PCIe enum start
INFO[UEFI]: PCIe enum end
INFO[UEFI]: exit Boot Service
INFO[MISC]: Found bf.cfg
INFO[MISC]: Ubuntu installation started
INFO[MISC]: Installing OS image
INFO[MISC]: Changing the default password for user ubuntu
INFO[MISC]: Running bfb_modify_os from bf.cfg
INFO[MISC]: ===================== bfb_modify_os =====================
INFO[MISC]: Installation finished
Example printout from the next boot:
[ 0.736968] [Hardware Error]: event severity: fatal
[ 0.736971] [Hardware Error]: Error 0, type: fatal
[ 0.736976] [Hardware Error]: section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[ 0.736978] [Hardware Error]: section length: 0x33
[ 0.736983] [Hardware Error]: 00000000: 53534120 5b545245 49464555 4d203a5d ASSERT[UEFI]: M
[ 0.736986] [Hardware Error]: 00000010: 6b506564 694c2f67 72617262 61422f79 dePkg/Library/Ba
[ 0.736989] [Hardware Error]: 00000020: 694c6573 74532f62 676e6972 203a632e seLib/String.c:
[ 0.736991] [Hardware Error]: 00000030: 31 37 33 173
[ 0.736993] [Hardware Error]: Error 1, type: fatal
[ 0.736995] [Hardware Error]: section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[ 0.736996] [Hardware Error]: section length: 0x1c
[ 0.737000] [Hardware Error]: 00000000: 53534120 5b545245 49464555 50203a5d ASSERT[UEFI]: P
[ 0.737002] [Hardware Error]: 00000010: 78303d43 37363546 38423630 C=0xF56706B8
[ 0.737004] [Hardware Error]: Error 2, type: fatal
[ 0.737005] [Hardware Error]: section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[ 0.737007] [Hardware Error]: section length: 0x1c
[ 0.737009] [Hardware Error]: 00000000: 53534120 5b545245 49464555 50203a5d ASSERT[UEFI]: P
[ 0.737012] [Hardware Error]: 00000010: 78303d43 36363546 43324138 C=0xF5668A2C
[ 0.737013] [Hardware Error]: Error 3, type: fatal
[ 0.737015] [Hardware Error]: section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[ 0.737016] [Hardware Error]: section length: 0x1c
[ 0.737019] [Hardware Error]: 00000000: 53534120 5b545245 49464555 50203a5d ASSERT[UEFI]: P
[ 0.737021] [Hardware Error]: 00000010: 78303d43 37363546 34313445 C=0xF567E414
[ 0.737022] [Hardware Error]: Error 4, type: fatal
[ 0.737024] [Hardware Error]: section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[ 0.737025] [Hardware Error]: section length: 0x1c
[ 0.737028] [Hardware Error]: 00000000: 53534120 5b545245 49464555 50203a5d ASSERT[UEFI]: P
[ 0.737030] [Hardware Error]: 00000010: 78303d43 33454446 34414433 C=0xFDE33DA4
[ 0.737031] [Hardware Error]: Error 5, type: fatal
[ 0.737033] [Hardware Error]: section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[ 0.737035] [Hardware Error]: section length: 0x1c
[ 0.737037] [Hardware Error]: 00000000: 53534120 5b545245 49464555 50203a5d ASSERT[UEFI]: P
[ 0.737040] [Hardware Error]: 00000010: 78303d43 30303838 38333339 C=0x88009338
[ 0.737041] [Hardware Error]: Error 6, type: fatal
[ 0.737043] [Hardware Error]: section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[ 0.737044] [Hardware Error]: section length: 0x1c
[ 0.737060] [Hardware Error]: 00000000: 53534120 5b545245 49464555 50203a5d ASSERT[UEFI]: P
[ 0.737062] [Hardware Error]: 00000010: 78303d43 30303838 34443439 C=0x880094D4
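The payload of each error section above is the RShim log message itself. As a minimal sketch (the helper name `decode_hex_dump` is hypothetical), the hex groups can be turned back into text by reversing the byte order of each group. This assumes the dump prints little-endian 32-bit words followed by any trailing single bytes, which is what the ASCII column in the output above suggests:

```python
def decode_hex_dump(rows):
    """Rebuild the ASCII payload from the hex groups of a kernel hex dump.

    Each row holds only the hex groups (no offset, no ASCII column).
    """
    payload = bytearray()
    for row in rows:
        for group in row.split():
            # Groups are printed as little-endian words, so reverse the bytes.
            # A single-byte group is unchanged by the reversal.
            payload += bytes.fromhex(group)[::-1]
    return payload.decode("ascii")


# Hex groups copied from "Error 1" in the output above.
rows = [
    "53534120 5b545245 49464555 50203a5d",
    "78303d43 37363546 38423630",
]
print(repr(decode_hex_dump(rows)))
```

The decoded text starts with a space, which accounts for the reported section length of 0x1c (28 bytes).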
On BlueField-3, the hardware error source table (HEST) is used for reporting errors to the OS. The error injection (EINJ) table is used for injecting errors.
For more information on HEST, refer to section 18.3.2 of the ACPI specification.
By default, Disable HEST is FALSE, delegating error handling to the OS. Setting Disable HEST to TRUE, using Redfish or the UEFI menu, shifts error handling to the BlueField Arm firmware, which limits the chance for the error to propagate through the system.
Disabling HEST via UEFI Menu
Open two SSH terminal windows to the host:
From Terminal 1, open the OOB console via the RShim console device file. Example:
$ sudo minicom --color=on -D /dev/rshim0/console
115200
From Terminal 2, reset Arm from the host:
$ sudo sh -c "echo 'SW_RESET 1' > /dev/rshim0/misc"
To enter the UEFI menu, press ESC in the OOB terminal when the UEFI boot screen appears.
Navigate the UEFI menu to Device Manager > System Configuration, which contains the Disable HEST setting.
Navigate down to the Disable HEST setting and press Space to toggle its value and enable it.
Press F10 then Y to save the new setting.
UEFI reboots to apply the new setting.
Disabling HEST via Redfish
DisableHEST can be configured via the Redfish REST interface, with the REST server running on the BMC. The process involves using curl commands from the host to send REST commands to the BMC's Redfish server.
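The Redfish PATCH request used in this procedure can also be prepared from Python. The following is a minimal sketch using only the standard library. The helper name `build_bios_patch`, the address `192.0.2.10`, and the password are placeholders for illustration, and the request is only built, not sent:

```python
import base64
import json
import urllib.request


def build_bios_patch(bmc_ip, password, attributes, user="root"):
    """Build (but do not send) a PATCH request for the pending BIOS settings."""
    url = f"https://{bmc_ip}/redfish/v1/Systems/Bluefield/Bios/Settings"
    body = json.dumps({"Attributes": attributes}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="PATCH")


# Placeholder address and password, for illustration only.
request = build_bios_patch("192.0.2.10", "example-password", {"DisableHEST": True})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request would additionally need `urllib.request.urlopen` with an SSL context that matches the BMC certificate setup (the curl commands use `-k` to skip verification). That part is left out of the sketch.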
Run the following steps on the host terminal:
Install the prerequisite software:
$ sudo yum install curl # or sudo apt install curl
$ sudo yum install jq # or sudo apt install jq
Check the current DisableHEST value:
$ curl -sku root:'<password>' -H 'content-type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios | jq '.Attributes|{DisableHEST}'
Example output:
{
  "DisableHEST": false
}
Set DisableHEST to true:
$ curl -sku root:'<password>' -H 'content-type: application/json' -d '{ "Attributes": { "DisableHEST": true } }' -X PATCH https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios/Settings | jq '."@Message.ExtendedInfo"[0]|{Message}'
Restart BlueField-3 to apply the pending configuration changes:
$ sudo sh -c "echo 'SW_RESET 1' > /dev/rshim0/misc"
Memory Errors
BlueField-3 has RAS support for memory errors which can be one of these types:
Single-bit errors – Correctable errors (CE) which are non-fatal and are corrected by hardware
Double-bit errors which could be:
Non-fatal errors – Uncorrectable recoverable errors which do not interrupt services
Fatal errors – Uncorrectable errors which cause the system to abort or reboot
Processor Errors
BlueField-3 supports RAS for processor errors which can be either:
Correctable errors
Uncorrectable non-fatal errors
Uncorrectable fatal errors
Error Injection
BlueField-3 error handling may be tested by injecting errors using the following methods:
Error injection (EINJ) ACPI table (for memory and processor errors)
ras-tools (for memory and processor errors)
Note: ras-tools has been verified to run with Anolis only. Use it with other OSs at your own discretion.
EINJ Table
The EINJ table is the standard way of injecting errors on BlueField-3 Linux distributions.
For more information on EINJ, refer to section 18.6.1 of the ACPI specification.
Error Injection Commands
In Linux, the einj driver creates a debugfs interface for injecting errors. To load einj on BlueField-3:
# modprobe einj
The EINJ directory is:
# cd /sys/kernel/debug/apei/einj
The directory contains the following files:
[root@localhost ~]# cd /sys/kernel/debug/apei/einj/
[root@localhost einj]# ls
available_error_type  error_type  notrigger  param2  param4
error_inject          flags       param1     param3
Where:
Parameter | Definition |
available_error_type | Contains the available error types |
error_type | The type of error to inject |
param1 | Physical address to inject an error to. Info: Relevant to memory error injection only. |
param2 | Physical address mask. Info: Relevant to memory error injection only. |
param3 | The number of the core to inject an error to. Info: Relevant to processor error injection only. |
| Set to |
| Is set to |
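As a minimal sketch of how these files are used together, the helper below writes the parameters first, then `error_type`, and finally `error_inject`, mirroring the order of the shell commands in this section. The helper name `inject_error` is hypothetical, and the demo writes into a temporary directory as a stand-in so that it runs without root access or BlueField hardware:

```python
import os
import tempfile


def inject_error(einj_dir, error_type, **params):
    """Write EINJ parameters, then the error type, then trigger the injection."""
    writes = {name: f"{value:#x}" for name, value in params.items()}
    writes["error_type"] = f"{error_type:#x}"
    writes["error_inject"] = "1"  # written last: this triggers the injection
    for name, text in writes.items():
        with open(os.path.join(einj_dir, name), "w") as handle:
            handle.write(text + "\n")
    return list(writes)


# Stand-in directory; on a real system this would be /sys/kernel/debug/apei/einj.
with tempfile.TemporaryDirectory() as scratch:
    order = inject_error(scratch, 0x8, param1=0x400000000, param2=0xFFFFFFFFFFFFF000)
    with open(os.path.join(scratch, "param1")) as handle:
        written_param1 = handle.read().strip()

print(order, written_param1)
```

On a real system, the same call would need root privileges and a loaded einj driver, and it would actually inject the error.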
Injecting Memory Errors
To inject memory errors:
Specify the physical address you want to inject the error to in param1 and its address mask in param2:
# echo 0x400000000 > param1
# echo 0xfffffffffffff000 > param2
Info: This depends on the size of memory available.
Specify the error type to inject. To inject a memory correctable error:
# echo 0x8 > error_type
# echo 1 > error_inject
Expected Memory Error Report Format
Correctable Memory Error
When a correctable memory error is injected, the BlueField console displays output similar to the following:
# dmesg
ERROR: BL31: MSS1 C0 Single bit ECC error detected. IRQ 91
[ 234.638586] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 234.638588] {1}[Hardware Error]: event severity: corrected
[ 234.638590] {1}[Hardware Error]: Error 0, type: corrected
[ 234.638591] {1}[Hardware Error]: section_type: memory error
[ 234.638592] {1}[Hardware Error]: error_status: 0x0000000000010400
[ 234.638594] {1}[Hardware Error]: physical_address: 0x0000000000000080
[ 234.638595] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
[ 234.638598] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 480
[ 234.638599] {1}[Hardware Error]: error_type: 2, single-bit ECC
[ 234.638617] EDAC MC0: 1 CE Single-bit ECC on unknown memory (module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 page:0x0 offset:0x80 grain:-2814749776710655 syndrome:0x0 - APEI location: module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 status(0x0000000000010400): Storage error in DRAM memory)
The RShim log would show a concise message such as the following:
ERR[BL31]: [75512]mss1: C0 single-bit ecc, IRQ[91]
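When several reports need to be compared, the key fields can be pulled out of the `[Hardware Error]` lines with a small parser. This is a sketch only: the function name `summarize_report` and the choice of fields are assumptions, and the sample lines are copied from the correctable memory error output above:

```python
import re

# Matches a few selected fields; "physical_address_mask" is deliberately not matched.
FIELD = re.compile(
    r"\[Hardware Error\]:\s*"
    r"(event severity|section_type|physical_address|error_type):\s*(.+?)\s*$"
)


def summarize_report(lines):
    """Return the first value seen for each selected field."""
    summary = {}
    for line in lines:
        match = FIELD.search(line)
        if match:
            summary.setdefault(match.group(1), match.group(2))
    return summary


sample = [
    "[ 234.638588] {1}[Hardware Error]: event severity: corrected",
    "[ 234.638591] {1}[Hardware Error]: section_type: memory error",
    "[ 234.638594] {1}[Hardware Error]: physical_address: 0x0000000000000080",
    "[ 234.638595] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff",
    "[ 234.638599] {1}[Hardware Error]: error_type: 2, single-bit ECC",
]
print(summarize_report(sample))
```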
Uncorrectable Non-fatal Memory Error
When an uncorrectable, non-fatal memory error is injected, the BlueField console displays output similar to the following:
ERROR: BL31: MSS1 C0 Double bit ECC error detected. IR
[ 219.784317] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 219.795253] {1}[Hardware Error]: event severity: recoverable
[ 219.800897] {1}[Hardware Error]: Error 0, type: recoverable
[ 219.806540] {1}[Hardware Error]: section_type: memory error
[ 219.812269] {1}[Hardware Error]: error_status: 0x0000000000010400
[ 219.818518] {1}[Hardware Error]: physical_address: 0x0000000000000080
[ 219.825114] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
[ 219.832146] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0
[ 219.841952] {1}[Hardware Error]: error_type: 3, multi-bit ECC
[ 219.847870] EDAC MC0: 1 UE Multi-bit ECC on unknown memory (module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:0 page:0x0 offset:0x80 grain:-281474976710655 - APEI location: module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:0 status(0x0000000000010400): Storage error in DRAM memory)
[ 219.847875] [Firmware Warn]: GHES: Invalid address in generic error data: 0x80
If the non-fatal error targets userspace memory, it is recovered by the OS. If the non-fatal error targets kernel memory, it is marked as fatal by the OS.
Uncorrectable Fatal Memory Error
When an uncorrectable, fatal memory error is injected, the BlueField console displays output similar to the following:
root@localhost:~[ 79.351190] EINJ: Error INJection is initialized.
ERROR: BL31: MSS1 C0 Double bit ECC error detected. IRQ 93
[ 79.636470] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 79.636476] {1}[Hardware Error]: event severity: fatal
[ 79.636478] {1}[Hardware Error]: Error 0, type: fatal
[ 79.636481] {1}[Hardware Error]: section_type: memory error
[ 79.636482] {1}[Hardware Error]: error_status: 0x0000000000010400
[ 79.636483] {1}[Hardware Error]: physical_address: 0x000000016cb48000
[ 79.636484] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
[ 79.636489] {1}[Hardware Error]: module: 0 rank: 0 bank: 1539 bank_group: 6 row: 15149 column: 2 bit_position: 0
[ 79.636491] {1}[Hardware Error]: error_type: 14, scrub uncorrected error
[ 79.636495] Kernel panic - not syncing: Fatal hardware error!
[ 79.636499] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE 5.15.0-1053.55.24.g9cc17fe-bluefield #g9cc17fe
[ 79.636502] Hardware name: https://www.mellanox.com BlueField-3 SmartNIC Main Card/BlueField-3 SmartNIC Main Card, BIOS 4.9.1.13411 Nov 18 2024
[ 79.636505] Call trace:
[ 79.636505] dump_backtrace+0x0/0x200
[ 79.636515] show_stack+0x20/0x2c
[ 79.636517] dump_stack_lvl+0x68/0x84
[ 79.636524] dump_stack+0x18/0x34
[ 79.636527] panic+0x1b0/0x3a8
[ 79.636529] __raw_spin_lock_irqsave.constprop.0+0x0/0xcc
[ 79.636533] ghes_in_nmi_queue_one_entry+0x204/0x300
[ 79.636535] ghes_sdei_critical_callback+0x58/0xd0
[ 79.636536] sdei_event_handler+0x28/0x90
[ 79.636541] do_sdei_event+0xa4/0x180
[ 79.636544] __sdei_handler+0x5c/0xa0
[ 79.636548] __sdei_asm_handler+0xe8/0x19c
[ 79.636551] cpu_do_idle+0x14/0x74
[ 79.636553] default_idle_call+0x44/0x150
[ 79.636555] cpuidle_idle_call+0x174/0x200
[ 79.636559] do_idle+0xac/0x100
[ 79.636561] cpu_startup_entry+0x30/0x70
[ 79.636563] rest_init+0xec/0x120
[ 79.636565] arch_call_rest_init+0x18/0x24
[ 79.636569] start_kernel+0x4b4/0x4ec
[ 79.636570] __primary_switched+0xbc/0xc4
[ 79.636573] SMP: stopping secondary CPUs
[ 79.636590] Kernel Offset: 0x20fb357f0000 from 0xffff800008000000
[ 79.636592] PHYS_OFFSET: 0x80000000
[ 79.636592] CPU features: 0x0,000005c1,a3332e5a
[ 79.636595] Memory Limit: none
[ 80.752101] pstore: crypto_comp_compress failed, ret = -22!
Nvidia BlueField-3 rev1 BL1 V1.0
INFO: psc supervisor init.
INFO: psc_irq_init...
INFO: force_crs_enable=0 pcr.lock0 = 1, time = 60105
INFO: enter idle task.
NOTICE: Running as 9009D3D400ENHAA system
A double-bit error should abort or reboot the system, depending on the OS. In this output, Ubuntu (kernel 5.15) reboots BlueField. Anolis aborts without rebooting.
The RShim log should also indicate that a double-bit ECC error has occurred.
Injecting Processor Errors
To inject processor errors:
Specify the core you would like to inject your error to in param3:
# echo 7 > param3
To list all core IDs available, use cat /proc/cpuinfo.
Specify the error type to inject. To inject a processor correctable error:
# echo 0x1 > error_type
# echo 1 > error_inject
Expected Processor Error Report Format
Correctable Processor Error
When a correctable processor error is injected, the BlueField console displays output similar to the following:
# dmesg
[ 4098.340991] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 4098.340995] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 4098.340996] {1}[Hardware Error]: event severity: corrected
[ 4098.340998] {1}[Hardware Error]: Error 0, type: corrected
[ 4098.340999] {1}[Hardware Error]: section_type: ARM processor error
[ 4098.341001] {1}[Hardware Error]: MIDR: 0x00000000410fd421
[ 4098.341002] {1}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081010000
[ 4098.341003] {1}[Hardware Error]: running state: 0x1
[ 4098.341004] {1}[Hardware Error]: Power State Coordination Interface state: 0
[ 4098.341005] {1}[Hardware Error]: Error info structure 0:
[ 4098.341005] {1}[Hardware Error]: num errors: 2
[ 4098.341006] {1}[Hardware Error]: error_type: 0, cache error
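The MPIDR value in the report identifies the core that logged the error. The sketch below splits it into affinity fields using the bit layout of the Arm architecture's MPIDR_EL1 register. The function name `decode_mpidr` is hypothetical, and how the affinity levels map to the core number written to param3 is not stated in this document, so treat that mapping as something to confirm on the target system:

```python
def decode_mpidr(mpidr):
    """Split an MPIDR_EL1 value into its affinity fields and the MT bit."""
    return {
        "aff0": mpidr & 0xFF,          # bits [7:0]
        "aff1": (mpidr >> 8) & 0xFF,   # bits [15:8]
        "aff2": (mpidr >> 16) & 0xFF,  # bits [23:16]
        "aff3": (mpidr >> 32) & 0xFF,  # bits [39:32]
        "mt": (mpidr >> 24) & 1,       # bit 24
    }


# MPIDR value copied from the correctable processor error output above.
fields = decode_mpidr(0x0000000081010000)
print(fields)
```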
Uncorrectable Non-fatal Processor Error
When an uncorrectable, non-fatal processor error is injected, the BlueField console displays output similar to the following:
[ 5008.509118] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 5008.520195] {2}[Hardware Error]: event severity: recoverable
[ 5008.525837] {2}[Hardware Error]: Error 0, type: recoverable
[ 5008.531478] {2}[Hardware Error]: section_type: ARM processor error
[ 5008.537813] {2}[Hardware Error]: MIDR: 0x00000000410fd421
[ 5008.543366] {2}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081070100
[ 5008.552042] {2}[Hardware Error]: running state: 0x1
[ 5008.557074] {2}[Hardware Error]: Power State Coordination Interface state: 0
[ 5008.564275] {2}[Hardware Error]: Error info structure 0:
[ 5008.569741] {2}[Hardware Error]: num errors: 2
[ 5008.574340] {2}[Hardware Error]: error_type: 0, cache error
Uncorrectable Fatal Processor Error
When an uncorrectable, fatal processor error is injected, the BlueField console displays output similar to the following:
[ 5008.509118] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 5008.520195] {2}[Hardware Error]: event severity: fatal
[ 5008.525837] {2}[Hardware Error]: Error 0, type: fatal
[ 5008.531478] {2}[Hardware Error]: section_type: ARM processor error
[ 5008.537813] {2}[Hardware Error]: MIDR: 0x00000000410fd421
[ 5008.543366] {2}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081070100
[ 5008.552042] {2}[Hardware Error]: running state: 0x1
[ 5008.557074] {2}[Hardware Error]: Power State Coordination Interface state: 0
[ 5008.564275] {2}[Hardware Error]: Error info structure 0:
[ 5008.569741] {2}[Hardware Error]: num errors: 2
[ 5008.574340] {2}[Hardware Error]: error_type: 0, cache error
A double bit error should abort and reboot the system. In Anolis, the system aborts and reboots after a few seconds.
Ras-tools
ras-tools has been verified for Anolis OS only. Use it with other OSs at your own discretion.
Load the einj driver:
# modprobe einj
Install ras-tools:
# yum install ras-tools
Setting Correctable Errors Thresholds
By default, the correctable errors threshold (CeThreshold) value is 5000 (i.e., a CE is only reported to the user at every 5000th error). This means that the user sees no CE indication in the RShim log or on the Arm console unless the number of errors injected reaches a multiple of 5000.
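The effect of the threshold can be illustrated with a short calculation. This is only a model of the behavior described above, with a hypothetical function name. It lists the injection counts at which a report would be expected:

```python
def reported_error_counts(threshold, injected):
    """Return the injection counts (multiples of the threshold) that get reported."""
    if threshold <= 0:
        raise ValueError("threshold must be a positive integer")
    return list(range(threshold, injected + 1, threshold))


print(reported_error_counts(5000, 4999))  # below the default threshold
print(reported_error_counts(5, 12))       # with a lowered threshold
```

With the default of 5000, injecting 4999 errors yields no report, while a threshold of 5 yields reports at the 5th and 10th injected errors out of 12.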
The following procedure demonstrates how CeThreshold can be modified using Redfish:
Check current BIOS settings:
curl -k -u root:'<password>' -H 'content-type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios
Example output:
"Attributes": {
  "Boot Partition Protection": false,
  "CeThreshold": 3,
  "CurrentUefiPassword": "",
  "DateTime": "2024-06-26T15:13:45Z",
  "DefaultPasswordPolicy": true,
  "Disable PCIe": false,
  "Disable SPMI": false,
  "Disable TMFF": false,
  "EmmcWipe": false,
  "Enable 2nd eMMC": false,
  "Enable OP-TEE": false,
  "Enable SMMU": true,
  "Field Mode": false,
  "Host Privilege Level": "Privileged",
  "Internal CPU Model": "Embedded",
  "LegacyPasswordEnable": false,
  "NicMode": "DpuMode",
  "NvmeWipe": false,
  "OsArgs": "",
  "ResetEfiVars": false,
  "SPCR UART": "Disabled",
  "UefiArgs": "",
  "UefiPassword": ""
},
Change CeThreshold to 5:
curl -k -u root:'<password>' -H 'content-type: application/json' -d '{ "Attributes": { "CeThreshold":5 } }' -X PATCH https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios/Settings
Check the new attribute settings:
curl -k -u root:'<password>' -H 'content-type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios
Example output:
{
  "@odata.id": "/redfish/v1/Systems/Bluefield/Bios/Settings",
  "@odata.type": "#Bios.v1_2_0.Bios",
  "Attributes": {
    "CeThreshold": 5
  },
  "Description": "BIOS Settings",
  "Id": "BIOS_Settings",
  "Name": "BIOS Configuration"
}
Reboot Arm. For example:
dpu> echo "SW_RESET 1" > /dev/rshim0/misc
Check that CeThreshold has been updated in the BIOS:
curl -k -u root:'<password>' -H 'content-type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios
"Attributes": {
  "Boot Partition Protection": false,
  "CeThreshold": 5,
  "CurrentUefiPassword": "",
  "DateTime": "2024-06-26T15:13:45Z",
  "DefaultPasswordPolicy": true,
  "Disable PCIe": false,
  "Disable SPMI": false,
  "Disable TMFF": false,
  "EmmcWipe": false,
  "Enable 2nd eMMC": false,
  "Enable OP-TEE": false,
  "Enable SMMU": true,
  "Field Mode": false,
  "Host Privilege Level": "Privileged",
  "Internal CPU Model": "Embedded",
  "LegacyPasswordEnable": false,
  "NicMode": "DpuMode",
  "NvmeWipe": false,
  "OsArgs": "",
  "ResetEfiVars": false,
  "SPCR UART": "Disabled",
  "UefiArgs": "",
  "UefiPassword": ""
},
BlueField Arm only supports memory and CPU error injection and handling.
BlueField-3 supports the handling of the following NIC firmware errors:
PCIe errors
Note: Currently, error injection is not supported.
NIC RAM errors
Note: Currently, error injection is not supported.
Network errors
NIC Network Errors
CRC errors can be forced during traffic using the mlxreg command to write the PTER (Port Transmit Errors Register) access register.
See section 30.4.11, "PTER - Port Transmit Errors Register" in the NVIDIA Adapters Programmer's Reference Manual.
{
  "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/17",
  "@odata.type": "#LogEntry.v1_15_0.LogEntry",
  "AdditionalDataURI": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/17/attachment",
  "CPER": {
    "NotificationType": "00000000-0000-0000-0000-000000000000",
    "Oem": {
      "Nvidia": {
        "@odata.type": "#NvidiaCPER.v1_0_0.NvidiaCPER",
        "Nvidia": {
          "ErrorInstance": 0,
          "ErrorType": 2,
          "InstanceBase": 0,
          "RegisterCount": 3,
          "Registers": [
            { "Address": 1, "Value": <icrc errors> },
            { "Address": 2, "Value": <vcrc errors> },
            { "Address": 3, "Value": <fc/eth crc errors> }
          ],
          "Severity": {
            "Code": 0,
            "Name": <severity string>
          },
          "Signature": "NBU",
          "Socket": 0
        }
      }
    },
    "SectionType": "6d5244f2-2712-11ec-bea7-cb3fdb95c786"
  },
  "Created": "2024-11-19T18:37:28+00:00",
  "DiagnosticDataType": "CPERSection",
  "EntryType": "Event",
  "Id": "17",
  "Message": "A platform error occurred.",
  "MessageArgs": [],
  "MessageId": "Platform.1.0.PlatformError",
  "Name": "System Event Log Entry",
  "Resolution": "Check additional diagnostic data if available.",
  "Resolved": false,
  "Severity": "Critical"
},
Expected NIC Error Report Format
If a NIC hardware error occurs, the report is only sent to the BMC. See section "BMC SEL" for more details.
The report would look similar to the following:
{
  "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/17",
  "@odata.type": "#LogEntry.v1_15_0.LogEntry",
  "AdditionalDataURI": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/17/attachment",
  "CPER": {
    "NotificationType": "00000000-0000-0000-0000-000000000000",
    "Oem": {
      "Nvidia": {
        "@odata.type": "#NvidiaCPER.v1_0_0.NvidiaCPER",
        "Nvidia": {
          "ErrorInstance": 0,
          "ErrorType": 1,
          "InstanceBase": 0,
          "RegisterCount": 2,
          "Registers": [
            { "Address": 1, "Value": <reg address> },
            { "Address": 2, "Value": <value> }
          ],
          "Severity": {
            "Code": <severity value>,
            "Name": <severity string>
          },
          "Signature": "NBU",
          "Socket": 0
        }
      }
    },
    "SectionType": "6d5244f2-2712-11ec-bea7-cb3fdb95c786"
  },
  "Created": "2024-11-19T18:37:28+00:00",
  "DiagnosticDataType": "CPERSection",
  "EntryType": "Event",
  "Id": "17",
  "Message": "A platform error occurred.",
  "MessageArgs": [],
  "MessageId": "Platform.1.0.PlatformError",
  "Name": "System Event Log Entry",
  "Resolution": "Check additional diagnostic data if available.",
  "Resolved": false,
  "Severity": "Critical"
},
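Once such an entry has been fetched and parsed as JSON, the register list can be read from the nested Nvidia CPER object. The following is a minimal sketch: the function name `cper_registers` is hypothetical, and the register values and severity name in the sample are made up for illustration, since the report above only shows placeholders:

```python
def cper_registers(entry):
    """Map register address to value for an Nvidia CPER log entry."""
    nvidia = entry["CPER"]["Oem"]["Nvidia"]["Nvidia"]
    return {reg["Address"]: reg["Value"] for reg in nvidia["Registers"]}


# Trimmed-down entry with illustrative (made-up) register values.
sample_entry = {
    "Id": "17",
    "CPER": {
        "Oem": {
            "Nvidia": {
                "Nvidia": {
                    "ErrorType": 1,
                    "RegisterCount": 2,
                    "Registers": [
                        {"Address": 1, "Value": 16},
                        {"Address": 2, "Value": 255},
                    ],
                    "Severity": {"Code": 0, "Name": "example"},
                }
            }
        }
    },
}
print(cper_registers(sample_entry))
```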
Querying the SEL log can be done using IPMI or Redfish:
Redfish
curl -k -u root:'<password>' -X GET https://<bmc-ip>/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries
in "Members":
{
  "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/694",
  "@odata.type": "#LogEntry.v1_15_0.LogEntry",
  "AdditionalDataURI": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/694/attachment",
  "Created": "2024-11-20T16:36:41+00:00",
  "EntryType": "Event",
  "Id": "694",
  "Message": "SEL event for single bit ECC",
  "Modified": "2024-11-20T16:36:41+00:00",
  "Name": "System Event Log Entry",
  "Resolved": false,
  "Severity": "OK"
},
{
  "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/696",
  "@odata.type": "#LogEntry.v1_15_0.LogEntry",
  "AdditionalDataURI": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/696/attachment",
  "Created": "2024-11-20T16:41:54+00:00",
  "EntryType": "Event",
  "Id": "696",
  "Message": "SEL event for multi bit ECC",
  "Modified": "2024-11-20T16:41:54+00:00",
  "Name": "System Event Log Entry",
  "Resolved": false,
  "Severity": "OK"
},
IPMI
# ipmitool sel list
Example output:
2b6 | 11/20/24 | 16:36:41 UTC | Memory #0x19 | Correctable ECC | Asserted
2b8 | 11/20/24 | 16:41:54 UTC | Memory #0x19 | Uncorrectable ECC | Asserted
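The pipe-separated SEL lines are easy to split into fields for scripting. The sketch below uses a hypothetical helper name and field names chosen for illustration, with sample lines taken from the example output above. It also converts the hexadecimal record ID to decimal, which in this example matches the Redfish entry IDs 694 and 696:

```python
def parse_sel_line(line):
    """Split one `ipmitool sel list` line into named fields."""
    fields = [field.strip() for field in line.split("|")]
    keys = ("id", "date", "time", "sensor", "event", "state")
    record = dict(zip(keys, fields))
    record["record_number"] = int(record["id"], 16)  # record ID is hexadecimal
    return record


sel_lines = [
    "2b6 | 11/20/24 | 16:36:41 UTC | Memory #0x19 | Correctable ECC | Asserted",
    "2b8 | 11/20/24 | 16:41:54 UTC | Memory #0x19 | Uncorrectable ECC | Asserted",
]
records = [parse_sel_line(line) for line in sel_lines]
for record in records:
    print(record["record_number"], record["event"], record["state"])
```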
The bftraining_results script prints the training parameter results, reported separately for each memory channel.
Example output:
dpu# bftraining_results
Memory controller 0 Channel 0
Read Data worst timing margin 23
Write Data worst timing margin 21
Read Data worst Vref margin 43
Write Data worst Vref margin 39
TX CS worst timing margin 58
TX CA worst timing margin 55
Memory controller 0 Channel 1
Unpopulated
Memory controller 1 Channel 0
Read Data worst timing margin 23
Write Data worst timing margin 22
Read Data worst Vref margin 41
Write Data worst Vref margin 40
TX CS worst timing margin 59
TX CA worst timing margin 55
Memory controller 1 Channel 1
Unpopulated
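To compare margins across channels, the output can be collected into a dictionary keyed by channel. This is a minimal sketch with a hypothetical function name. It assumes every populated channel lists lines of the form "<name> <integer>", as in the example output above, from which the sample text is taken:

```python
def parse_training_results(text):
    """Group margin values by memory channel; unpopulated channels stay empty."""
    results = {}
    channel = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Memory controller"):
            channel = line
            results[channel] = {}
        elif channel and line and line != "Unpopulated":
            name, _, value = line.rpartition(" ")
            results[channel][name] = int(value)
    return results


sample_output = """dpu# bftraining_results
Memory controller 0 Channel 0
Read Data worst timing margin 23
Write Data worst timing margin 21
Memory controller 0 Channel 1
Unpopulated
"""
training = parse_training_results(sample_output)
print(training)
```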