NVIDIA BlueField Platform Software Troubleshooting Guide

DDR Error Reporting and Handling

RAS stands for reliability, availability, and serviceability.

  • Reliability = the continuity of correct service

  • Availability = the readiness for correct service

  • Serviceability = the ability to undergo modifications and repairs

RAS reduces and avoids unplanned outages because:

  • Errors can be detected and corrected by the hardware before they cause system failures.

  • Errors can be detected by hardware and reported to software. Software can take action by identifying and replacing failing components.

  • Errors can be predicted ahead of time, so that components can be replaced before they fail.

BlueField-3 has RAS support for DRAM errors, which can be either:

  • Single-bit errors, also known as correctable errors

  • Double-bit errors, which can be:

    • Non-fatal errors (recoverable)

    • Fatal errors (non-recoverable)

In DPU mode, the OS generates an error report and prints it to the console.

1.1 Correctable Errors

Correctable errors (CE), also known as single-bit errors, are non-fatal and are corrected by hardware, so no user action is required.

Expected OS report when a CE occurs:

ERROR: BL31: MSS1 C0 Single bit ECC error detected. IRQ 91

[ 234.638586] {1}[Hardware Error]: It has been corrected by h/w and requires no further action

[ 234.638588] {1}[Hardware Error]: event severity: corrected

[ 234.638590] {1}[Hardware Error]: Error 0, type: corrected

[ 234.638591] {1}[Hardware Error]: section_type: memory error

[ 234.638592] {1}[Hardware Error]: error_status: 0x0000000000010400

[ 234.638594] {1}[Hardware Error]: physical_address: 0x0000000000000080

[ 234.638595] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff

[ 234.638598] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 480

[ 234.638599] {1}[Hardware Error]: error_type: 2, single-bit ECC

[ 234.638617] EDAC MC0: 1 CE Single-bit ECC on unknown memory (module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 page:0x0 offset:0x80 grain:-281474976710655 syndrome:0x0 - APEI location: module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 status(0x0000000000010400): Storage error in DRAM memory)


1.2 Uncorrectable Fatal Errors

Uncorrectable (UE) fatal errors are double-bit errors that cause the system to abort or reboot. Some Linux distributions reboot the system automatically, while others only panic. In the latter case, reboot or power cycle the system to recover.

The expected report on Ubuntu when a fatal UE occurs is shown below. On Anolis, Linux does not reboot; it only panics.

root@localhost:~# ERROR: BL31: MSS1 C0 Double bit ECC error detected. IRQ 93

[ 313.874148] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0

[ 313.885102] {1}[Hardware Error]: event severity: fatal

[ 313.890229] {1}[Hardware Error]: Error 0, type: fatal

[ 313.895352] {1}[Hardware Error]: section_type: memory error

[ 313.901080] {1}[Hardware Error]: error_status: 0x0000000000010400

[ 313.907330] {1}[Hardware Error]: physical_address: 0x0000000000000080

[ 313.913925] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff

[ 313.920956] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0

[ 313.930761] {1}[Hardware Error]: error_type: 3, multi-bit ECC

[ 313.936667] Kernel panic - not syncing: Fatal hardware error!

[ 313.942399] CPU: 0 PID: 186 Comm: kworker/0:3 Tainted: G O 5.15.0-1042-bluefield #44-Ubuntu

[ 313.952032] Hardware name: https://www.mellanox.com BlueField SoC/BlueField SoC, BIOS 0.0.0.0 Mar 20 2024

[ 313.961577] Workqueue: kacpi_notify acpi_os_execute_deferred

[ 313.967229] Call trace:

[ 313.969661] dump_backtrace+0x0/0x200

[ 313.973310] show_stack+0x20/0x2c

[ 313.976610] dump_stack_lvl+0x68/0x84

[ 313.980259] dump_stack+0x18/0x34

[ 313.983558] panic+0x1b0/0x3a0

[ 313.986600] __raw_spin_lock_irqsave.constprop.0+0x0/0xcc

[ 313.991984] ghes_proc+0x148/0x200

[ 313.995371] ghes_notify_hed+0x58/0xd4

[ 313.999105] blocking_notifier_call_chain+0x74/0xac

[ 314.003968] acpi_hed_notify+0x28/0x3c

[ 314.007701] acpi_notify_device+0x24/0x30

[ 314.011696] acpi_ev_notify_dispatch+0x50/0x80

[ 314.016125] acpi_os_execute_deferred+0x24/0x3c

[ 314.020639] process_one_work+0x210/0x4f0

[ 314.024634] worker_thread+0x170/0x5a0

[ 314.028366] kthread+0x110/0x114

[ 314.031580] ret_from_fork+0x10/0x20

[ 314.035142] SMP: stopping secondary CPUs

[ 314.039076] Kernel Offset: 0x52ace63f0000 from 0xffff800008000000

[ 314.045152] PHYS_OFFSET: 0x80000000

[ 314.048624] CPU features: 0x0,000005c1,a3332e5a

[ 314.053139] Memory Limit: none

[ 315.904541] Rebooting in 10 seconds..

Nvidia BlueField-3 rev1 BL1 V1.0

INFO: psc supervisor init.

INFO: psc_irq_init...

INFO: force_crs_enable=0 pcr.lock0 = 0, time = 110628

INFO: enter idle task.

NOTICE: Running as 9009D3B400ENEA system

NOTICE: BL2: v2.2(release):4.7.0-25-g5569834

NOTICE: BL2: Built : 22:07:10, Apr 26 2024

NOTICE: BL2 built for hw (ver 2)


1.3 Uncorrectable Non-Fatal Errors

Uncorrectable non-fatal errors, also known as uncorrectable recoverable errors, are double-bit errors that do not interrupt services. The OS handles the error by retiring the faulty page, which avoids a service interruption, so no user action is required.

Expected Linux report when a non-fatal UE occurs:

ERROR: BL31: MSS1 C0 Double bit ECC error detected. IR

[ 219.784317] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0

[ 219.795253] {1}[Hardware Error]: event severity: recoverable

[ 219.800897] {1}[Hardware Error]: Error 0, type: recoverable

[ 219.806540] {1}[Hardware Error]: section_type: memory error

[ 219.812269] {1}[Hardware Error]: error_status: 0x0000000000010400

[ 219.818518] {1}[Hardware Error]: physical_address: 0x0000000000000080

[ 219.825114] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff

[ 219.832146] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0

[ 219.841952] {1}[Hardware Error]: error_type: 3, multi-bit ECC

[ 219.847870] EDAC MC0: 1 UE Multi-bit ECC on unknown memory (module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:0 page:0x0 offset:0x80 grain:-281474976710655 - APEI location: module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:0 status(0x0000000000010400): Storage error in DRAM memory)

[ 219.847875] [Firmware Warn]: GHES: Invalid address in generic error data: 0x80
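
The three report types in this section differ most visibly in their "event severity:" line (corrected, recoverable, or fatal). As a minimal sketch, the helper below counts records per severity from kernel log text on standard input. The function name summarize_severity is hypothetical and not part of any BlueField tool; the match pattern is taken only from the sample reports in this section.

```shell
#!/bin/sh
# Summarize APEI/GHES hardware error records by severity.
# Reads kernel log text on stdin and prints "<severity> <count>" per line.
# Assumption: records carry an "event severity:" line formatted like the
# sample reports in this section.
summarize_severity() {
    awk '
        /\[Hardware Error\]: event severity:/ {
            sub(/.*event severity:[ \t]*/, "")   # keep only the severity word
            count[$0]++
        }
        END {
            for (sev in count) print sev, count[sev]
        }
    ' | sort
}

# Example use on a live system (needs permission to read the kernel log):
#   dmesg | summarize_severity
```

A "corrected" or "recoverable" count requires no action according to the descriptions above, while a "fatal" record is followed by a panic or reboot.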

1.1 Error injection doesn't work

Follow these steps to inject errors:

In Linux, the einj driver creates a debugfs interface (under /sys/kernel/debug) for injecting errors, as demonstrated below.

  • Load the einj driver on the BlueField-3:

# modprobe einj

  • Change to the EINJ directory:

# cd /sys/kernel/debug/apei/einj

The directory contains the following files:

[root@localhost ~]# cd /sys/kernel/debug/apei/einj/

[root@localhost einj]# ls

available_error_type error_type notrigger param2 param4

error_inject flags param1 param3

  • Specify the physical address at which to inject the error. The valid range depends on the amount of memory installed. Addresses near the end of memory are preferable.

# echo 0x400000000 > param1

  • Specify the address mask to use:

# echo 0xfffffffffffff000 > param2

  • Specify the error type to inject:

# echo 0x8 > error_type

  • The available error types:

[root@localhost einj]# cat available_error_type

0x00000008 Memory Correctable

0x00000010 Memory Uncorrectable non-fatal

0x00000020 Memory Uncorrectable fatal

Error type definitions:

    • 0x8 - Memory Correctable = single-bit error. All single-bit errors are non-fatal and are corrected by hardware.

    • 0x10 - Memory Uncorrectable non-fatal = uncorrectable recoverable error = double-bit error that does not interrupt services, so it does not necessarily cause an abort or panic.

    • 0x20 - Memory Uncorrectable fatal = double-bit error that causes the system to abort or reboot.

  • Trigger the error:

# echo 1 > error_inject
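
The steps above can be collected into one function. The following is a sketch only, not an NVIDIA-provided script: the function name inject_mem_error and the EINJ_DIR override are hypothetical, added so the logic can be tried against a scratch directory. It assumes the einj driver is already loaded and that available_error_type lists one hexadecimal code per line, as in the listing above.

```shell
#!/bin/sh
# Sketch of the EINJ injection steps as a single function.
# EINJ_DIR is a hypothetical override, not something the einj driver reads.
EINJ_DIR="${EINJ_DIR:-/sys/kernel/debug/apei/einj}"

# usage: inject_mem_error <phys_addr> <addr_mask> <error_type>
# example: inject_mem_error 0x400000000 0xfffffffffffff000 0x8
inject_mem_error() {
    if [ "$#" -ne 3 ]; then
        echo "usage: inject_mem_error <phys_addr> <addr_mask> <error_type>" >&2
        return 2
    fi
    addr="$1"; mask="$2"; etype="$3"
    want=$(($etype))          # shell arithmetic accepts 0x.. hex literals

    # Refuse error types that the platform does not advertise.
    found=0
    while read -r code _rest; do
        case "$code" in
            0x*) [ "$(($code))" -eq "$want" ] && found=1 ;;
        esac
    done < "$EINJ_DIR/available_error_type"
    if [ "$found" -ne 1 ]; then
        echo "error type $etype is not listed in available_error_type" >&2
        return 1
    fi

    # Same order as the manual steps: address, mask, type, then trigger.
    echo "$addr"  > "$EINJ_DIR/param1"
    echo "$mask"  > "$EINJ_DIR/param2"
    echo "$etype" > "$EINJ_DIR/error_type"
    echo 1        > "$EINJ_DIR/error_inject"
}
```

Checking the requested type against available_error_type first helps separate an unsupported error type from other causes of a failed injection.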

By default, /sys/kernel/debug/apei/einj/notrigger is set to 0, which means the patrol scrubber triggers the error.

Setting "notrigger" to 1 skips the trigger via the patrol scrubber, which lets the user cause the error in some other context by a simple access to the CPU, memory location, or device that is the target of the error injection.

In this case, you may use the ras-tools package as follows:

ras-tools includes a program called “victim” that allocates a page and prints its physical address (PA). That address can be used as the target for error injection, which guarantees that the target is memory allocated to a user-space process. This is useful when you want to ensure that the error is injected within the Linux address range.

# ./ras-tools/victim -d

physical address of (0xffffa0f4b000) = 0x1750e2000

Hit any key to trigger error:

Then inject the error:

# modprobe einj

# cd /sys/kernel/debug/apei/einj/

# echo 0x1750e2000 > param1

# echo 0xfffffffffffff000 > param2

# echo 0x08 > error_type

# echo 1 > notrigger

# echo 1 > error_inject

1.2 CE error injection doesn't work

By default, the CE threshold is set to 5000. Either inject 5000 errors before expecting a report in Linux, or change the threshold to 0 to see the error report right away. The CE threshold can be modified from the UEFI menu or via Redfish.

Make sure that the BMC software version is up to date.

Make sure that the Arm cores reported the error:

  • Check the Arm BlueField console. You should see at least a message such as: ERROR: BL31: MSS1 C0 Double bit

  • Run "dmesg | tail" to see whether any of the reports described in section 1 appear.

  • Use the rshim log to verify that Arm reported the error:

    • echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc

    • cat /dev/rshim0/misc

If nothing was logged in the rshim log or in dmesg, upgrade to the latest BFB and make sure that the BMC software update is enabled as well.
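
The checks above can be scripted as a simple text match. This is a sketch under assumptions: the function name ras_error_logged is hypothetical, and the patterns are generalized from the sample messages in this guide (the BL31 "Single bit"/"Double bit" line and the "[Hardware Error]" prefix), so other firmware versions may word them differently.

```shell
#!/bin/sh
# Return 0 if the text on stdin contains evidence of a reported DRAM error.
# Assumption: messages look like the samples in this guide, for example
#   ERROR: BL31: MSS1 C0 Single bit ECC error detected. IRQ 91
#   [ 234.638588] {1}[Hardware Error]: event severity: corrected
ras_error_logged() {
    grep -Eq 'BL31: MSS[0-9]+ C[0-9]+ (Single|Double) bit|\[Hardware Error\]'
}

# Example use:
#   dmesg | ras_error_logged && echo "error reported in dmesg"
#   cat /dev/rshim0/misc | ras_error_logged && echo "error reported in rshim log"
```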

© Copyright 2024, NVIDIA. Last updated on Nov 12, 2024.