DDR Error Reporting and Handling

Preface

RAS stands for reliability, availability, and serviceability.

Reliability = the continuity of correct service
Availability = the readiness for correct service
Serviceability = the ability to undergo modifications and repairs

RAS reduces and avoids unplanned outages because:

Errors can be detected and corrected by the hardware before they cause system failures.
Errors can be detected by hardware and reported to software. Software can take action by identifying and replacing failing components.
Errors can be predicted ahead-of-time to allow replacement.

BlueField-3 provides RAS support for DRAM errors, which can be either:

Single-bit errors, also known as correctable errors; or
Double-bit errors, which can be:
- Non-fatal errors (recoverable)
- Fatal errors (non-recoverable)

In DPU mode, the OS generates an error report and prints it to the console.

Command Cheat Sheet

Command	Description
`dmesg`	Displays the kernel message buffer, which contains the full DRAM error report
`modprobe einj`	Loads the error injection driver
`cat /sys/kernel/debug/apei/einj/available_error_type`	Lists the error types available for injection
`echo <address> > /sys/kernel/debug/apei/einj/param1`	Sets the physical address `<address>` at which to inject the error
`echo <mask> > /sys/kernel/debug/apei/einj/param2`	Sets the physical address mask
`echo <error_type> > /sys/kernel/debug/apei/einj/error_type`	Sets the error type, chosen from the `available_error_type` list
`echo 1 > /sys/kernel/debug/apei/einj/error_inject`	Triggers the error after all parameters above are configured
`echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc` `cat /dev/rshim0/misc`	Enables the debug level in the RShim log, then dumps the RShim log

DRAM-related Issues

Although rare, DRAM errors can occur. These are handled according to their severity.

What to Do If a DRAM Error Occurs

Correctable Errors

Correctable errors (CE), also known as single-bit ECC errors, are non-fatal and are automatically corrected by hardware. No user action is required.

Expected OS report for a CE event:

ERROR:   BL31: MSS1 C0 Single bit ECC error detected. IRQ 91
 
[ 234.638586] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 234.638588] {1}[Hardware Error]: event severity: corrected
[ 234.638590] {1}[Hardware Error]:  Error 0, type: corrected
[ 234.638591] {1}[Hardware Error]:  section_type: memory error
[ 234.638592] {1}[Hardware Error]:  error_status: 0x0000000000010400
[ 234.638594] {1}[Hardware Error]:  physical_address: 0x0000000000000080
[ 234.638595] {1}[Hardware Error]:  physical_address_mask: 0x0000ffffffffffff
[ 234.638598] {1}[Hardware Error]:  module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 480
[ 234.638599] {1}[Hardware Error]:  error_type: 2, single-bit ECC
[ 234.638617] EDAC MC0: 1 CE Single-bit ECC on unknown memory (module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 page:0x0 offset:0x80 grain:-2814749776710655 syndrome:0x0 - APEI location: module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 status(0x0000000000010400): Storage error in DRAM memory)
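
When many records accumulate, it can help to condense each one to a single line. The following is a minimal sketch, assuming the kernel log keeps the `[Hardware Error]:` key/value layout shown above; the function name `summarize_hw_errors` is illustrative and not part of any NVIDIA tool.

```shell
# Condense APEI memory-error records read from stdin to one line each.
# Assumes the "{N}[Hardware Error]:  key: value" layout shown in this section.
summarize_hw_errors() {
    awk '
        /\[Hardware Error\]:/ {
            if ($0 ~ /event severity:/) {
                sub(/.*event severity: */, ""); sev = $0
            } else if ($0 ~ /physical_address:/) {
                # Does not match physical_address_mask, which has no colon here.
                sub(/.*physical_address: */, ""); addr = $0
            } else if ($0 ~ /error_type:/) {
                # error_type is the last field of interest, so emit the summary.
                sub(/.*error_type: */, "")
                printf "severity=%s address=%s type=%s\n", sev, addr, $0
                sev = ""; addr = ""
            }
        }
    '
}
```

On a live system this could be fed from the kernel log, for example `dmesg | summarize_hw_errors`.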

Uncorrectable Fatal Errors

Uncorrectable errors (UE) that are fatal are double-bit ECC errors that may result in a system abort or reboot. Some Linux distributions will reboot automatically, while others may panic. In the latter case, manually reboot or power-cycle the system.

Expected Ubuntu report (PSB) for a UE fatal event:

ERROR:   BL31: MSS1 C0 Double bit ECC error detected. IRQ 93
 
[ 313.874148] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 313.885102] {1}[Hardware Error]: event severity: fatal
[ 313.890229] {1}[Hardware Error]:  Error 0, type: fatal
[ 313.895352] {1}[Hardware Error]:  section_type: memory error
[ 313.901080] {1}[Hardware Error]:  error_status: 0x0000000000010400
[ 313.907330] {1}[Hardware Error]:  physical_address: 0x0000000000000080
[ 313.913925] {1}[Hardware Error]:  physical_address_mask: 0x0000ffffffffffff
[ 313.920956] {1}[Hardware Error]:  module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0
[ 313.930761] {1}[Hardware Error]:  error_type: 3, multi-bit ECC
[ 313.936667] Kernel panic - not syncing: Fatal hardware error!
...
[ 315.904541] Rebooting in 10 seconds..

Uncorrectable Non-fatal Errors

Uncorrectable non-fatal errors, also known as uncorrectable recoverable errors, are double-bit ECC errors that do not interrupt services. The OS will handle the error by retiring the faulty page, and no user action is required.

Expected Linux report for a UE non-fatal event:

ERROR:   BL31: MSS1 C0 Double bit ECC error detected. IRQ XX
 
[ 219.784317] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 219.795253] {1}[Hardware Error]: event severity: recoverable
[ 219.800897] {1}[Hardware Error]:  Error 0, type: recoverable
[ 219.806540] {1}[Hardware Error]:  section_type: memory error
[ 219.812269] {1}[Hardware Error]:  error_status: 0x0000000000010400
[ 219.818518] {1}[Hardware Error]:  physical_address: 0x0000000000000080
[ 219.825114] {1}[Hardware Error]:  physical_address_mask: 0x0000ffffffffffff
[ 219.832146] {1}[Hardware Error]:  module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0
[ 219.841952] {1}[Hardware Error]:  error_type: 3, multi-bit ECC
[ 219.847870] EDAC MC0: 1 UE Multi-bit ECC on unknown memory ...
[ 219.847875] [Firmware Warn]: GHES: Invalid address in generic error data: 0x80
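
The three severities and the handling described in this section can be summarized as a small triage helper. This is a sketch only: it assumes the `event severity:` line format shown in the reports above, and the function names are hypothetical.

```shell
# Map an "event severity" value to the handling described in this section.
recommended_action() {
    case "$1" in
        corrected)   echo "no action: corrected by hardware" ;;
        recoverable) echo "no action: the OS retires the faulty page" ;;
        fatal)       echo "reboot or power-cycle if the system does not reboot on its own" ;;
        *)           echo "unknown severity"; return 1 ;;
    esac
}

# Read a kernel log on stdin and print one triage line per error record.
triage_hw_errors() {
    sed -n 's/.*\[Hardware Error\]: event severity: *//p' |
    while read -r sev; do
        printf '%s: %s\n' "$sev" "$(recommended_action "$sev")"
    done
}
```

For example, `dmesg | triage_hw_errors` would print one line for each record in the kernel message buffer.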

Issues with Error Injection

Error Injection Does Not Work

Ensure you follow these steps when injecting errors:

Load the einj driver:

modprobe einj
cd /sys/kernel/debug/apei/einj

Verify the available parameters:

ls

Example output:

available_error_type  error_type  notrigger  param2  param4
error_inject          flags       param1     param3

Configure parameters:

Target physical address (preferably near the end of available memory):

echo 0x400000000 > param1

Address mask:

echo 0xfffffffffffff000 > param2

Error type:

echo 0x8 > error_type

Available error types:

[root@localhost einj]# cat available_error_type
0x00000008      Memory Correctable
0x00000010      Memory Uncorrectable non-fatal
0x00000020      Memory Uncorrectable fatal

Trigger the error:

echo 1 > error_inject

Note

If /sys/kernel/debug/apei/einj/notrigger is set to 1, you must trigger the error manually (e.g., memory access).
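
The injection steps above can be wrapped in one function so that no parameter is forgotten. The following is a minimal sketch, not an official tool: the function name `inject_mem_error` and the `EINJ_DIR` override are assumptions introduced for illustration, and the default mask is the value used in the steps above. On a real system it must run as root with the einj driver loaded.

```shell
# Directory holding the EINJ debugfs files. EINJ_DIR is a hypothetical
# override that allows a dry run against an ordinary directory.
EINJ_DIR="${EINJ_DIR:-/sys/kernel/debug/apei/einj}"

# Usage: inject_mem_error ADDRESS ERROR_TYPE [MASK]
inject_mem_error() {
    einj_addr="$1"
    einj_type="$2"
    einj_mask="${3:-0xfffffffffffff000}"

    # available_error_type lists zero-padded codes (for example 0x00000008),
    # so normalize the requested type to the same width before matching.
    einj_want=$(printf '0x%08x' "$einj_type") || return 2
    if ! grep -q "^$einj_want" "$EINJ_DIR/available_error_type"; then
        echo "error type $einj_type is not listed in available_error_type" >&2
        return 1
    fi

    # Same order as the manual steps: address, mask, type, then trigger.
    echo "$einj_addr" > "$EINJ_DIR/param1" &&
    echo "$einj_mask" > "$EINJ_DIR/param2" &&
    echo "$einj_type" > "$EINJ_DIR/error_type" &&
    echo 1 > "$EINJ_DIR/error_inject"
}
```

For example, `inject_mem_error 0x400000000 0x8` would request a correctable memory error at the address used in the steps above.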

To test manually using user-space memory:

git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/aegl/ras-tools
cd ras-tools
make -j 8
./victim -d
# Use printed physical address in your injection steps

CE Error Injection Does Not Work

By default, the CE threshold is set to 5000. You must inject 5000 errors before they are reported by the OS unless you change the threshold.

Change the threshold to 0 via UEFI or Redfish to report CE errors immediately.
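
If the threshold cannot be changed, the trigger can be repeated in a loop instead. This is a sketch under the assumption that each write to `error_inject` produces one correctable error that counts toward the threshold; the function name `inject_repeated` and the `EINJ_DIR` override are hypothetical. The address, mask, and error type must already be configured.

```shell
# Usage: inject_repeated COUNT
# Writes 1 to error_inject COUNT times, reusing the values already set in
# param1, param2 and error_type. Prints the number of triggers performed.
inject_repeated() {
    rep_count="$1"
    rep_done=0
    while [ "$rep_done" -lt "$rep_count" ]; do
        echo 1 > "${EINJ_DIR:-/sys/kernel/debug/apei/einj}/error_inject" || return 1
        rep_done=$((rep_done + 1))
    done
    echo "$rep_done"
}
```

With the default threshold described above, `inject_repeated 5000` would be the matching call.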

BMC Does Not Receive Error Report

If the BMC does not report the injected error:

Verify BMC firmware is up to date.

Check that Arm reported the error:

Review the BlueField console logs.
Check for messages like ERROR: BL31: MSS1 C0 Double bit.

Run:

dmesg | tail

Enable and check RShim logs:

echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
cat /dev/rshim0/misc

If no errors appear, update the BFB and enable BMC software updates.
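
To confirm that Arm reported the error, a saved console capture can be scanned for the BL31 messages shown earlier. The sketch below assumes the message wording matches the examples in this document; the function name `count_bl31_ecc` and the capture file name in the usage note are hypothetical.

```shell
# Count BL31 ECC messages in a log read from stdin.
count_bl31_ecc() {
    awk '
        /BL31: .*Single bit ECC error detected/ { single++ }
        /BL31: .*Double bit ECC error detected/ { double++ }
        END { printf "single=%d double=%d\n", single, double }
    '
}
```

For example, `count_bl31_ecc < console.log` prints both counts for a capture saved as `console.log`; if both are zero, the error did not reach the Arm firmware log in that capture.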
