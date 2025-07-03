BlueField-3 has RAS support for memory errors which can be one of these types:

Single-bit errors – Correctable errors (CE) which are non-fatal and are corrected by hardware

Double-bit errors, which can be either:

  Non-fatal errors – Uncorrectable recoverable errors which do not interrupt services

  Fatal errors – Uncorrectable errors which cause the system to abort or reboot



BlueField-3 supports RAS for processor errors which can be either:

Correctable errors

Uncorrectable non-fatal errors

Uncorrectable fatal errors

BlueField-3 error handling may be tested by injecting errors using the following methods:

Error injection (EINJ) ACPI table (for memory and processor errors)

ras-tools (for memory and processor errors)

Note: ras-tools has been verified to run with Anolis only. Use it with other OSs at your own discretion.

The EINJ table is the standard way of injecting errors on BlueField-3 Linux distributions.

Info: For more information on EINJ, refer to section 18.6.1 of the ACPI specification.

In Linux, the einj driver exposes a debugfs interface for injecting errors. To load einj on BlueField-3:

    # modprobe einj

The EINJ interface is located at:

    # cd /sys/kernel/debug/apei/einj

The directory contains the following files:

    [root@localhost ~]# cd /sys/kernel/debug/apei/einj/
    [root@localhost einj]# ls
    available_error_type  error_type  notrigger  param2  param4
    error_inject          flags       param1     param3

Where:

Parameter – Definition

available_error_type – Contains the available error types:

  0x00000001 – Processor correctable error

  0x00000002 – Processor uncorrectable non-fatal error

  0x00000004 – Processor uncorrectable fatal error

  0x00000008 – Memory correctable error

  0x00000010 – Memory uncorrectable non-fatal error

  0x00000020 – Memory uncorrectable fatal error

error_type – The type of error to inject (according to available_error_type)

param1 – Physical address to inject an error to. Info: Relevant to memory error injection only.

param2 – Physical address mask of param1. Info: Relevant to memory error injection only.

param3 – The number of the core to inject an error to. Info: Relevant to processor error injection only.

error_inject – Set to 1 to inject the error

notrigger – Set to 0 by default. Changing this to 1 stops the error from triggering automatically. The user must then trigger the error using a simple access to the CPU, memory location, or device which is the target of the error injection.
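When scripting injections, it can be convenient to turn an error type value back into its description. The following is a minimal sketch based only on the table above; the function name einj_type_name is hypothetical and is not part of the einj driver or of any BlueField tool.

```shell
# einj_type_name VALUE
# Print the description of one EINJ error type value, following the
# available_error_type table above. VALUE may be hex (0x8) or decimal (8).
# The function name is an illustrative assumption.
einj_type_name() {
    case $(( $1 )) in
        1)  echo "Processor correctable error" ;;
        2)  echo "Processor uncorrectable non-fatal error" ;;
        4)  echo "Processor uncorrectable fatal error" ;;
        8)  echo "Memory correctable error" ;;
        16) echo "Memory uncorrectable non-fatal error" ;;
        32) echo "Memory uncorrectable fatal error" ;;
        *)  echo "unknown error type: $1" >&2; return 1 ;;
    esac
}
```

For example, einj_type_name 0x10 prints "Memory uncorrectable non-fatal error", and a value that is not in the table returns a non-zero status.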

To inject memory errors:

1. Specify the physical address you want to inject the error to in param1 and its address mask in param2:

    # echo 0x400000000 > param1
    # echo 0xfffffffffffff000 > param2

   Info: This depends on the size of memory available.

2. Specify the error type to inject. To inject a memory correctable error:

    # echo 0x8 > error_type
    # echo 1 > error_inject
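The memory injection steps above can be wrapped in a small helper. This is a sketch under stated assumptions rather than an official tool: the function name einj_inject_mem and its optional directory argument are hypothetical, the latter added only so the write sequence can be rehearsed against a scratch directory before touching real hardware.

```shell
# einj_inject_mem ADDR MASK TYPE [DIR]
# Write the values in the same order as the manual steps: param1, param2,
# error_type, then error_inject. DIR defaults to the EINJ directory; the
# optional argument is an assumption made for dry runs.
einj_inject_mem() {
    dir="${4:-/sys/kernel/debug/apei/einj}"
    case $(( $3 )) in
        8|16|32) ;;   # memory error types listed in available_error_type
        *) echo "not a memory error type: $3" >&2; return 1 ;;
    esac
    echo "$1" > "$dir/param1" &&
    echo "$2" > "$dir/param2" &&
    echo "$3" > "$dir/error_type" &&
    echo 1 > "$dir/error_inject"
}
```

Run as root on the BlueField, einj_inject_mem 0x400000000 0xfffffffffffff000 0x8 would repeat the correctable-error example. Note that type 0x20 is a fatal error, so the system is expected to abort or reboot.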

Correctable Memory Error

When a correctable memory error is injected, the BlueField console displays output similar to the following:

    # dmesg
    ERROR: BL31: MSS1 C0 Single bit ECC error detected. IRQ 91
    [ 234.638586] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    [ 234.638588] {1}[Hardware Error]: event severity: corrected
    [ 234.638590] {1}[Hardware Error]: Error 0, type: corrected
    [ 234.638591] {1}[Hardware Error]: section_type: memory error
    [ 234.638592] {1}[Hardware Error]: error_status: 0x0000000000010400
    [ 234.638594] {1}[Hardware Error]: physical_address: 0x0000000000000080
    [ 234.638595] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
    [ 234.638598] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 480
    [ 234.638599] {1}[Hardware Error]: error_type: 2, single-bit ECC
    [ 234.638617] EDAC MC0: 1 CE Single-bit ECC on unknown memory (module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 page:0x0 offset:0x80 grain:-2814749776710655 syndrome:0x0 - APEI location: module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 status(0x0000000000010400): Storage error in DRAM memory)

The RShim log shows a concise message such as the following:

    ERR[BL31]: [75512]mss1: C0 single-bit ecc, IRQ[91]
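To check quickly which kind of error the kernel recorded, the "event severity" lines of the console output can be tallied. The filter below is an illustrative sketch: the function name ghes_severities is hypothetical, and it assumes the log lines follow the "[Hardware Error]: event severity: ..." form shown in the outputs in this section.

```shell
# ghes_severities
# Read kernel log text on stdin and print one "severity: count" line for
# each severity found in "[Hardware Error]: event severity:" records.
# The function name is an illustrative assumption.
ghes_severities() {
    sed -n 's/.*\[Hardware Error]: event severity: \([a-z]*\).*/\1/p' |
        sort | uniq -c | awk '{ print $2 ": " $1 }'
}
```

A typical invocation would be dmesg | ghes_severities, which prints nothing if no hardware error records are present.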





Uncorrectable Non-fatal Memory Error

When an uncorrectable, non-fatal memory error is injected, the BlueField console displays output similar to the following:

    ERROR: BL31: MSS1 C0 Double bit ECC error detected. IR
    [ 219.784317] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
    [ 219.795253] {1}[Hardware Error]: event severity: recoverable
    [ 219.800897] {1}[Hardware Error]: Error 0, type: recoverable
    [ 219.806540] {1}[Hardware Error]: section_type: memory error
    [ 219.812269] {1}[Hardware Error]: error_status: 0x0000000000010400
    [ 219.818518] {1}[Hardware Error]: physical_address: 0x0000000000000080
    [ 219.825114] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
    [ 219.832146] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0
    [ 219.841952] {1}[Hardware Error]: error_type: 3, multi-bit ECC
    [ 219.847870] EDAC MC0: 1 UE Multi-bit ECC on unknown memory (module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:0 page:0x0 offset:0x80 grain:-281474976710655 - APEI location: module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:0 status(0x0000000000010400): Storage error in DRAM memory)
    [ 219.847875] [Firmware Warn]: GHES: Invalid address in generic error data: 0x80

If the non-fatal error targets userspace memory, it is recovered by the OS. If the non-fatal error targets kernel memory, it is marked as fatal by the OS.

Uncorrectable Fatal Memory Error

When an uncorrectable, fatal memory error is injected, the BlueField console displays output similar to the following:

    root@localhost:~[ 79.351190] EINJ: Error INJection is initialized.
    ERROR: BL31: MSS1 C0 Double bit ECC error detected. IRQ 93
    [ 79.636470] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
    [ 79.636476] {1}[Hardware Error]: event severity: fatal
    [ 79.636478] {1}[Hardware Error]: Error 0, type: fatal
    [ 79.636481] {1}[Hardware Error]: section_type: memory error
    [ 79.636482] {1}[Hardware Error]: error_status: 0x0000000000010400
    [ 79.636483] {1}[Hardware Error]: physical_address: 0x000000016cb48000
    [ 79.636484] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
    [ 79.636489] {1}[Hardware Error]: module: 0 rank: 0 bank: 1539 bank_group: 6 row: 15149 column: 2 bit_position: 0
    [ 79.636491] {1}[Hardware Error]: error_type: 14, scrub uncorrected error
    [ 79.636495] Kernel panic - not syncing: Fatal hardware error!
    [ 79.636499] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE 5.15.0-1053.55.24.g9cc17fe-bluefield #g9cc17fe
    [ 79.636502] Hardware name: https://www.mellanox.com BlueField-3 SmartNIC Main Card/BlueField-3 SmartNIC Main Card, BIOS 4.9.1.13411 Nov 18 2024
    [ 79.636505] Call trace:
    [ 79.636505] dump_backtrace+0x0/0x200
    [ 79.636515] show_stack+0x20/0x2c
    [ 79.636517] dump_stack_lvl+0x68/0x84
    [ 79.636524] dump_stack+0x18/0x34
    [ 79.636527] panic+0x1b0/0x3a8
    [ 79.636529] __raw_spin_lock_irqsave.constprop.0+0x0/0xcc
    [ 79.636533] ghes_in_nmi_queue_one_entry+0x204/0x300
    [ 79.636535] ghes_sdei_critical_callback+0x58/0xd0
    [ 79.636536] sdei_event_handler+0x28/0x90
    [ 79.636541] do_sdei_event+0xa4/0x180
    [ 79.636544] __sdei_handler+0x5c/0xa0
    [ 79.636548] __sdei_asm_handler+0xe8/0x19c
    [ 79.636551] cpu_do_idle+0x14/0x74
    [ 79.636553] default_idle_call+0x44/0x150
    [ 79.636555] cpuidle_idle_call+0x174/0x200
    [ 79.636559] do_idle+0xac/0x100
    [ 79.636561] cpu_startup_entry+0x30/0x70
    [ 79.636563] rest_init+0xec/0x120
    [ 79.636565] arch_call_rest_init+0x18/0x24
    [ 79.636569] start_kernel+0x4b4/0x4ec
    [ 79.636570] __primary_switched+0xbc/0xc4
    [ 79.636573] SMP: stopping secondary CPUs
    [ 79.636590] Kernel Offset: 0x20fb357f0000 from 0xffff800008000000
    [ 79.636592] PHYS_OFFSET: 0x80000000
    [ 79.636592] CPU features: 0x0,000005c1,a3332e5a
    [ 79.636595] Memory Limit: none
    [ 80.752101] pstore: crypto_comp_compress failed, ret = -22!
    Nvidia BlueField-3 rev1 BL1 V1.0
    INFO: psc supervisor init.
    INFO: psc_irq_init...
    INFO: force_crs_enable=0 pcr.lock0 = 1, time = 60105
    INFO: enter idle task.
    NOTICE: Running as 9009D3D400ENHAA system

A double-bit error should abort or reboot the system, depending on the OS. In the output above, Ubuntu (kernel 5.15) reboots BlueField. Anolis aborts without rebooting.

The RShim log should also indicate that a double-bit ECC error has occurred.

To inject processor errors:

1. Specify the core you would like to inject your error to in param3:

    # echo 7 > param3

   To list all core IDs available, use cat /proc/cpuinfo.

2. Specify the error type to inject. To inject a processor correctable error:

    # echo 0x1 > error_type
    # echo 1 > error_inject
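The processor injection steps above can be scripted in the same spirit. The helpers below are a sketch under stated assumptions: list_core_ids and einj_inject_cpu are hypothetical names, list_core_ids assumes the usual "processor : N" lines in /proc/cpuinfo, and the optional file and directory arguments exist only so the logic can be tried without touching the real interface.

```shell
# list_core_ids [CPUINFO]
# Print the core IDs found in "processor : N" lines. CPUINFO defaults to
# /proc/cpuinfo; the argument is an assumption made for testing.
list_core_ids() {
    awk -F: '/^processor[ \t]*:/ { gsub(/[ \t]/, "", $2); print $2 }' \
        "${1:-/proc/cpuinfo}"
}

# einj_inject_cpu CORE TYPE [DIR]
# Write param3, error_type and error_inject in the order of the manual
# steps. DIR defaults to the EINJ directory (optional, for dry runs).
einj_inject_cpu() {
    dir="${3:-/sys/kernel/debug/apei/einj}"
    case $(( $2 )) in
        1|2|4) ;;   # processor error types listed in available_error_type
        *) echo "not a processor error type: $2" >&2; return 1 ;;
    esac
    echo "$1" > "$dir/param3" &&
    echo "$2" > "$dir/error_type" &&
    echo 1 > "$dir/error_inject"
}
```

Run as root on the BlueField, einj_inject_cpu 7 0x1 would repeat the correctable-error example on core 7. Type 0x4 is a fatal error, so the system is expected to abort or reboot.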

Correctable Processor Error

When a correctable processor error is injected, the BlueField console displays output similar to the following:

    # dmesg
    [ 4098.340991] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
    [ 4098.340995] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    [ 4098.340996] {1}[Hardware Error]: event severity: corrected
    [ 4098.340998] {1}[Hardware Error]: Error 0, type: corrected
    [ 4098.340999] {1}[Hardware Error]: section_type: ARM processor error
    [ 4098.341001] {1}[Hardware Error]: MIDR: 0x00000000410fd421
    [ 4098.341002] {1}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081010000
    [ 4098.341003] {1}[Hardware Error]: running state: 0x1
    [ 4098.341004] {1}[Hardware Error]: Power State Coordination Interface state: 0
    [ 4098.341005] {1}[Hardware Error]: Error info structure 0:
    [ 4098.341005] {1}[Hardware Error]: num errors: 2
    [ 4098.341006] {1}[Hardware Error]: error_type: 0, cache error
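The MPIDR value in the output identifies the core that reported the error through its affinity fields. As a minimal sketch, the hypothetical helper below splits an MPIDR value into Aff2 (bits 23:16), Aff1 (bits 15:8), and Aff0 (bits 7:0), following the Arm MPIDR_EL1 layout. How these fields map to the core number written to param3 is not stated in this document, so treat any such mapping as something to verify on your system.

```shell
# mpidr_affinity VALUE
# Print the Aff2, Aff1 and Aff0 fields of an MPIDR value
# (bits 23:16, 15:8 and 7:0). The function name is an illustrative
# assumption; Aff3 (bits 39:32) is not printed.
mpidr_affinity() {
    v=$(( $1 ))
    echo "aff2=$(( (v >> 16) & 0xff )) aff1=$(( (v >> 8) & 0xff )) aff0=$(( v & 0xff ))"
}
```

For the value in the output above, mpidr_affinity 0x0000000081010000 prints "aff2=1 aff1=0 aff0=0".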





Uncorrectable Non-fatal Processor Error

When an uncorrectable, non-fatal processor error is injected, the BlueField console displays output similar to the following:

    [ 5008.509118] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
    [ 5008.520195] {2}[Hardware Error]: event severity: recoverable
    [ 5008.525837] {2}[Hardware Error]: Error 0, type: recoverable
    [ 5008.531478] {2}[Hardware Error]: section_type: ARM processor error
    [ 5008.537813] {2}[Hardware Error]: MIDR: 0x00000000410fd421
    [ 5008.543366] {2}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081070100
    [ 5008.552042] {2}[Hardware Error]: running state: 0x1
    [ 5008.557074] {2}[Hardware Error]: Power State Coordination Interface state: 0
    [ 5008.564275] {2}[Hardware Error]: Error info structure 0:
    [ 5008.569741] {2}[Hardware Error]: num errors: 2
    [ 5008.574340] {2}[Hardware Error]: error_type: 0, cache error





Uncorrectable Fatal Processor Error

When an uncorrectable, fatal processor error is injected, the BlueField console displays output similar to the following:

    [ 5008.509118] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
    [ 5008.520195] {2}[Hardware Error]: event severity: fatal
    [ 5008.525837] {2}[Hardware Error]: Error 0, type: fatal
    [ 5008.531478] {2}[Hardware Error]: section_type: ARM processor error
    [ 5008.537813] {2}[Hardware Error]: MIDR: 0x00000000410fd421
    [ 5008.543366] {2}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081070100
    [ 5008.552042] {2}[Hardware Error]: running state: 0x1
    [ 5008.557074] {2}[Hardware Error]: Power State Coordination Interface state: 0
    [ 5008.564275] {2}[Hardware Error]: Error info structure 0:
    [ 5008.569741] {2}[Hardware Error]: num errors: 2
    [ 5008.574340] {2}[Hardware Error]: error_type: 0, cache error

A double-bit error should abort and reboot the system. In Anolis, the system aborts and reboots after a few seconds.

Note: ras-tools has been verified for Anolis OS only. Use it with other OSs at your own discretion.

1. Load the einj driver:

    # modprobe einj

2. Install ras-tools:

    # yum install ras-tools

By default, the correctable error threshold (CeThreshold) value is 5000 (i.e., a CE is only reported to the user at every 5000th error). This means that the user does not see any CE indication in the RShim log or on the Arm console as long as the number of errors injected is not a multiple of 5000.
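The reporting rule above is simple modular arithmetic, sketched below. Both function names are hypothetical, and the sketch assumes the CE counter simply keeps counting and is never reset between reports, which this document does not confirm.

```shell
# ce_reported COUNT [THRESHOLD]
# Succeed if the COUNT-th correctable error would be reported, i.e. COUNT
# is a positive multiple of THRESHOLD (default 5000, the CeThreshold default).
ce_reported() {
    [ "$1" -gt 0 ] && [ $(( $1 % ${2:-5000} )) -eq 0 ]
}

# ce_until_report COUNT [THRESHOLD]
# Print how many more correctable errors must occur after COUNT errors
# before the next report.
ce_until_report() {
    t=${2:-5000}
    echo $(( t - ($1 % t) ))
}
```

For example, after 12000 injected errors with the default threshold, ce_until_report 12000 prints 3000, since the next report is expected at the 15000th error.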

The following procedure demonstrates how CeThreshold can be modified using Redfish: