DDR Error Reporting and Handling
RAS stands for reliability, availability, and serviceability.
Reliability = the continuity of correct service
Availability = the readiness for correct service
Serviceability = the ability to undergo modifications and repairs
RAS reduces and avoids unplanned outages because:
Errors can be detected and corrected by the hardware before they cause system failures.
Errors can be detected by hardware and reported to software. Software can take action by identifying and replacing failing components.
Errors can be predicted ahead of time to allow replacement.
BlueField-3 provides RAS support for DRAM errors, which can be either:
Single-bit errors, also known as correctable errors; or
Double-bit errors, which can be:
Non-fatal errors (recoverable)
Fatal errors (non-recoverable)
In DPU mode, the OS generates an error report and prints it to the console.
| Command | Description |
|  | The dmesg command in Linux displays the kernel message buffer and will show the full DRAM error report |
|  | Load the error injection driver |
|  | Lists available errors to inject |
|  | Inject error at physical address |
|  | Physical address mask |
|  | From the list of |
|  | Trigger error after configuring all parameters above |
|  | Enable debug level in the RShim log and dump the RShim log |
DRAM-related Issues
Although rare, DRAM errors can occur. These are handled according to their severity.
What to Do If a DRAM Error Occurs
Correctable Errors
Correctable errors (CE), also known as single-bit ECC errors, are non-fatal and are automatically corrected by hardware. No user action is required.
Expected OS report for a CE event:
ERROR: BL31: MSS1 C0 Single bit ECC error detected. IRQ 91
[ 234.638586] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 234.638588] {1}[Hardware Error]: event severity: corrected
[ 234.638590] {1}[Hardware Error]: Error 0, type: corrected
[ 234.638591] {1}[Hardware Error]: section_type: memory error
[ 234.638592] {1}[Hardware Error]: error_status: 0x0000000000010400
[ 234.638594] {1}[Hardware Error]: physical_address: 0x0000000000000080
[ 234.638595] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
[ 234.638598] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 480
[ 234.638599] {1}[Hardware Error]: error_type: 2, single-bit ECC
[ 234.638617] EDAC MC0: 1 CE Single-bit ECC on unknown memory (module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 page:0x0 offset:0x80 grain:-2814749776710655 syndrome:0x0 - APEI location: module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 status(0x0000000000010400): Storage error in DRAM memory)
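When a report like the one above has to be tied to a memory location, the physical_address field is usually the first value to pull out. The following is a minimal sketch, assuming the kernel log uses the [Hardware Error] line format shown in this section. The function name hw_error_addresses is illustrative and is not part of any BlueField tool.

```shell
# hw_error_addresses: read kernel log text on stdin and print the
# physical_address value of every [Hardware Error] record, one per line.
# The function name is hypothetical; only standard sed is used.
hw_error_addresses() {
    # The pattern requires ":" directly after "physical_address", so
    # physical_address_mask lines are skipped.
    sed -n 's/.*\[Hardware Error]: *physical_address: *\(0x[0-9a-fA-F]*\).*/\1/p'
}
```

A typical invocation would be: dmesg | hw_error_addresses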
Uncorrectable Fatal Errors
Uncorrectable fatal errors are double-bit ECC errors that may result in a system abort or reboot. Some Linux distributions will reboot automatically, while others may panic. In the latter case, manually reboot or power-cycle the system.
Expected Ubuntu report for a UE fatal event:
ERROR: BL31: MSS1 C0 Double bit ECC error detected. IRQ 93
[ 313.874148] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 313.885102] {1}[Hardware Error]: event severity: fatal
[ 313.890229] {1}[Hardware Error]: Error 0, type: fatal
[ 313.895352] {1}[Hardware Error]: section_type: memory error
[ 313.901080] {1}[Hardware Error]: error_status: 0x0000000000010400
[ 313.907330] {1}[Hardware Error]: physical_address: 0x0000000000000080
[ 313.913925] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
[ 313.920956] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0
[ 313.930761] {1}[Hardware Error]: error_type: 3, multi-bit ECC
[ 313.936667] Kernel panic - not syncing: Fatal hardware error!
...
[ 315.904541] Rebooting in 10 seconds..
Uncorrectable Non-fatal Errors
Uncorrectable non-fatal errors, also known as uncorrectable recoverable errors, are double-bit ECC errors that do not interrupt services. The OS will handle the error by retiring the faulty page, and no user action is required.
Expected Linux report for a UE non-fatal event:
ERROR: BL31: MSS1 C0 Double bit ECC error detected. IRQ XX
[ 219.784317] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 219.795253] {1}[Hardware Error]: event severity: recoverable
[ 219.800897] {1}[Hardware Error]: Error 0, type: recoverable
[ 219.806540] {1}[Hardware Error]: section_type: memory error
[ 219.812269] {1}[Hardware Error]: error_status: 0x0000000000010400
[ 219.818518] {1}[Hardware Error]: physical_address: 0x0000000000000080
[ 219.825114] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
[ 219.832146] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0
[ 219.841952] {1}[Hardware Error]: error_type: 3, multi-bit ECC
[ 219.847870] EDAC MC0: 1 UE Multi-bit ECC on unknown memory ...
[ 219.847875] [Firmware Warn]: GHES: Invalid address in generic error data: 0x80
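The three reports in this section differ mainly in their "event severity" line (corrected, fatal, or recoverable). As a rough triage aid, the sketch below tallies those lines from a kernel log. It assumes the [Hardware Error] line format shown in the reports above; the function name count_hw_error_severities is hypothetical.

```shell
# count_hw_error_severities: read kernel log text on stdin and print how
# many [Hardware Error] records were seen for each severity.
# The function name is illustrative; only standard awk is used.
count_hw_error_severities() {
    awk '
        # Match only the "event severity" line of each record.
        /\[Hardware Error\]: *event severity:/ {
            count[$NF]++    # last field is the severity word
        }
        END {
            printf "corrected=%d recoverable=%d fatal=%d\n",
                   count["corrected"], count["recoverable"], count["fatal"]
        }
    '
}
```

For example: dmesg | count_hw_error_severities. A fatal count is unlikely to show up on a live system, since the kernel panics on fatal events, but it can be useful on a saved console log.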
Issues with Error Injection
Error Injection Does Not Work
Ensure you follow these steps when injecting errors:
Load the einj driver:
modprobe einj
cd /sys/kernel/debug/apei/einj
Verify the available parameters:
ls
Example output:
available_error_type error_type notrigger param2 param4 error_inject flags param1 param3
Configure parameters:
Target physical address (preferably near the end of available memory):
echo 0x400000000 > param1
Address mask:
echo 0xfffffffffffff000 > param2
Error type:
echo 0x8 > error_type
Available error types:
[root@localhost einj]# cat available_error_type
0x00000008 Memory Correctable
0x00000010 Memory Uncorrectable non-fatal
0x00000020 Memory Uncorrectable fatal
Trigger the error:
echo 1 > error_inject
Note: If /sys/kernel/debug/apei/einj/notrigger is set to 1, you must trigger the error manually (e.g., memory access).
To test manually using user-space memory:
git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/aegl/ras-tools
cd ras-tools
make -j8
./victim -d # Use printed physical address in your injection steps
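The injection steps above can be collected into a single helper so that no parameter is forgotten. This is a sketch under stated assumptions, not an official tool: the function name einj_inject and the EINJ_DIR override are hypothetical (EINJ_DIR exists only so the helper can be pointed at another directory for a dry run), and a real injection still requires root, a mounted debugfs, and a loaded einj driver.

```shell
# einj_inject ADDRESS MASK TYPE
# Writes the EINJ parameters in the order used above and triggers the error.
# EINJ_DIR is a hypothetical override; it defaults to the debugfs path.
einj_inject() {
    addr="$1"; mask="$2"; type="$3"
    dir="${EINJ_DIR:-/sys/kernel/debug/apei/einj}"

    # Only numeric error types (for example 0x8) are accepted.
    case "$type" in
        0x*|0X*|[0-9]*) ;;
        *) echo "einj_inject: error type must be numeric: $type" >&2; return 2 ;;
    esac

    if [ ! -r "$dir/available_error_type" ]; then
        echo "einj_inject: cannot read $dir/available_error_type" >&2
        return 1
    fi

    # Refuse types the platform does not list in available_error_type.
    supported=no
    while read -r code _name; do
        case "$code" in
            0x*|0X*)
                if [ "$(($code))" -eq "$(($type))" ]; then supported=yes; fi
                ;;
        esac
    done < "$dir/available_error_type"
    if [ "$supported" != yes ]; then
        echo "einj_inject: error type $type is not listed as available" >&2
        return 1
    fi

    printf '%s\n' "$addr" > "$dir/param1"     || return 1   # physical address
    printf '%s\n' "$mask" > "$dir/param2"     || return 1   # address mask
    printf '%s\n' "$type" > "$dir/error_type" || return 1   # error type
    printf '1\n' > "$dir/error_inject"        || return 1   # trigger
}
```

With the values used in the steps above, a correctable error would be injected with: einj_inject 0x400000000 0xfffffffffffff000 0x8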
CE Error Injection Does Not Work
By default, the CE threshold is set to 5000. You must inject 5000 errors before the OS reports them, unless you change the threshold.
Change the threshold to 0 via UEFI or Redfish to report CE errors immediately.
BMC Does Not Receive Error Report
If the BMC does not report the injected error:
Verify BMC firmware is up to date.
Check that Arm reported the error:
Review the BlueField console logs.
Check for messages like ERROR: BL31: MSS1 C0 Double bit.
Run:
dmesg | tail
Enable and check RShim logs:
echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
cat /dev/rshim0/misc
If no errors appear, update the BFB and enable BMC software updates.
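When reviewing a saved console log for the BL31 messages mentioned above, a small filter can save scrolling. The sketch below is illustrative only: the function name bl31_ecc_reports is hypothetical, and the pattern generalizes the "MSS1 C0" text from the examples in this document to any MSS and channel number, which is an assumption about the message format.

```shell
# bl31_ecc_reports: read console log text on stdin and print only the
# BL31 ECC error lines. Exits with status 1 if none are found (grep's
# usual behavior), so it can be used directly in an "if".
# Assumption: other controllers and channels log the same message with
# different MSS<n> / C<n> numbers.
bl31_ecc_reports() {
    grep -E 'ERROR: BL31: MSS[0-9]+ C[0-9]+ (Single|Double) bit ECC error detected'
}
```

For example, with a console capture saved to a file of your choosing: bl31_ecc_reports < console.log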