NVIDIA BlueField Platform Software Troubleshooting Guide

DDR Error Reporting and Handling

RAS stands for reliability, availability, and serviceability.

  • Reliability = the continuity of correct service

  • Availability = the readiness for correct service

  • Serviceability = the ability to undergo modifications and repairs

RAS reduces and avoids unplanned outages because:

  • Errors can be detected and corrected by the hardware before they cause system failures.

  • Errors can be detected by hardware and reported to software. Software can take action by identifying and replacing failing components.

  • Errors can be predicted ahead-of-time to allow replacement.

BlueField-3 provides RAS support for DRAM errors, which can be either:

  • Single-bit errors, also known as correctable errors; or

  • Double-bit errors which could be:

    • Non-fatal errors (recoverable)

    • Fatal errors (non-recoverable)

In DPU mode, the OS generates an error report and prints it to the console.

| Command | Description |
|---------|-------------|
| dmesg | Displays the kernel message buffer, including the full DRAM error report |
| modprobe einj | Loads the error injection driver |
| cat /sys/kernel/debug/apei/einj/available_error_type | Lists the error types available for injection |
| echo <address> > /sys/kernel/debug/apei/einj/param1 | Sets the physical address at which to inject the error |
| echo <mask> > /sys/kernel/debug/apei/einj/param2 | Sets the physical address mask |
| echo <error_type> > /sys/kernel/debug/apei/einj/error_type | Sets the error type, chosen from the available_error_type list |
| echo 1 > /sys/kernel/debug/apei/einj/error_inject | Triggers the error after all of the parameters above are configured |
| echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc | Enables the debug level in the RShim log |
| cat /dev/rshim0/misc | Dumps the RShim log |

DRAM-related Issues

Although rare, DRAM errors can occur. These are handled according to their severity.

What to Do If a DRAM Error Occurs

Correctable Errors

Correctable errors (CE), also known as single-bit ECC errors, are non-fatal and are automatically corrected by hardware. No user action is required.

Expected OS report for a CE event:

ERROR: BL31: MSS1 C0 Single bit ECC error detected. IRQ 91
[ 234.638586] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 234.638588] {1}[Hardware Error]: event severity: corrected
[ 234.638590] {1}[Hardware Error]: Error 0, type: corrected
[ 234.638591] {1}[Hardware Error]: section_type: memory error
[ 234.638592] {1}[Hardware Error]: error_status: 0x0000000000010400
[ 234.638594] {1}[Hardware Error]: physical_address: 0x0000000000000080
[ 234.638595] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
[ 234.638598] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 480
[ 234.638599] {1}[Hardware Error]: error_type: 2, single-bit ECC
[ 234.638617] EDAC MC0: 1 CE Single-bit ECC on unknown memory (module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 page:0x0 offset:0x80 grain:-2814749776710655 syndrome:0x0 - APEI location: module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 status(0x0000000000010400): Storage error in DRAM memory)


Uncorrectable Fatal Errors

Uncorrectable fatal errors are double-bit ECC errors that may result in a system abort or reboot. Some Linux distributions will reboot automatically, while others may panic. In the latter case, manually reboot or power-cycle the system.

Expected Ubuntu report (PSB) for a UE fatal event:

ERROR: BL31: MSS1 C0 Double bit ECC error detected. IRQ 93
[ 313.874148] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 313.885102] {1}[Hardware Error]: event severity: fatal
[ 313.890229] {1}[Hardware Error]: Error 0, type: fatal
[ 313.895352] {1}[Hardware Error]: section_type: memory error
[ 313.901080] {1}[Hardware Error]: error_status: 0x0000000000010400
[ 313.907330] {1}[Hardware Error]: physical_address: 0x0000000000000080
[ 313.913925] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
[ 313.920956] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0
[ 313.930761] {1}[Hardware Error]: error_type: 3, multi-bit ECC
[ 313.936667] Kernel panic - not syncing: Fatal hardware error!
...
[ 315.904541] Rebooting in 10 seconds..


Uncorrectable Non-fatal Errors

Uncorrectable non-fatal errors, also known as uncorrectable recoverable errors, are double-bit ECC errors that do not interrupt services. The OS will handle the error by retiring the faulty page, and no user action is required.
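One way to confirm that a page was retired is to look at the kernel's count of hardware-corrupted memory. The sketch below assumes a Linux kernel built with memory-failure handling, which exposes a `HardwareCorrupted:` line in `/proc/meminfo`; the function name `hardware_corrupted_kb` is hypothetical and not part of any BlueField tool.

```shell
# Print the amount of memory (in kB) the kernel has marked as hardware-corrupted.
#   $1 (optional) path to a meminfo-format file; defaults to /proc/meminfo
# Prints nothing if the kernel does not expose a HardwareCorrupted line.
hardware_corrupted_kb() {
    awk '/^HardwareCorrupted:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Typical use: run once before and once after the event and compare the values.
#   hardware_corrupted_kb
```

A value that grows after a UE non-fatal event suggests that the affected page was taken out of use.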

Expected Linux report for a UE non-fatal event:

ERROR: BL31: MSS1 C0 Double bit ECC error detected. IRQ XX
[ 219.784317] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 219.795253] {1}[Hardware Error]: event severity: recoverable
[ 219.800897] {1}[Hardware Error]: Error 0, type: recoverable
[ 219.806540] {1}[Hardware Error]: section_type: memory error
[ 219.812269] {1}[Hardware Error]: error_status: 0x0000000000010400
[ 219.818518] {1}[Hardware Error]: physical_address: 0x0000000000000080
[ 219.825114] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
[ 219.832146] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0
[ 219.841952] {1}[Hardware Error]: error_type: 3, multi-bit ECC
[ 219.847870] EDAC MC0: 1 UE Multi-bit ECC on unknown memory ...
[ 219.847875] [Firmware Warn]: GHES: Invalid address in generic error data: 0x80
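
All three reports carry an `event severity:` line with the value `corrected`, `recoverable`, or `fatal`, so a quick tally of those lines shows which class of DRAM error a system has seen. The following is a minimal sketch that assumes the kernel log uses exactly the wording shown in the reports above and prints one message per line; the function name `summarize_ras_events` is hypothetical.

```shell
# Count APEI hardware-error events by severity in kernel log text read from stdin.
# Matches the "[Hardware Error]: event severity: <level>" lines shown above.
summarize_ras_events() {
    awk '
        /\[Hardware Error\]: event severity: / {
            level = $NF            # last field is the severity word
            count[level]++
        }
        END {
            printf "corrected=%d recoverable=%d fatal=%d\n",
                count["corrected"], count["recoverable"], count["fatal"]
        }
    '
}

# Typical use on a live system (requires permission to read the kernel log):
#   dmesg | summarize_ras_events
```

A fatal event usually ends in a panic, so its lines are more likely to be found in a saved console capture than in the `dmesg` output after the reboot.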

Issues with Error Injection

Error Injection Does Not Work

Ensure you follow these steps when injecting errors:

  1. Load the einj driver:

    modprobe einj
    cd /sys/kernel/debug/apei/einj

  2. Verify the available parameters:

    ls

    Example output:

    available_error_type  error_type  notrigger  param2  param4
    error_inject          flags       param1     param3

  3. Configure parameters:

    1. Target physical address (preferably near the end of available memory):

      echo 0x400000000 > param1

    2. Address mask:

      echo 0xfffffffffffff000 > param2

    3. Error type:

      echo 0x8 > error_type

  4. Check the available error types to confirm that the value written to error_type is supported:

    [root@localhost einj]# cat available_error_type
    0x00000008      Memory Correctable
    0x00000010      Memory Uncorrectable non-fatal
    0x00000020      Memory Uncorrectable fatal

  5. Trigger the error:

    echo 1 > error_inject

    Note

    If /sys/kernel/debug/apei/einj/notrigger is set to 1, the error is injected but not triggered automatically; you must trigger it manually, for example by accessing the injected memory address.

    To test manually using user-space memory:

    git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/aegl/ras-tools
    cd ras-tools
    make -j 8
    ./victim -d
    # Use the printed physical address in your injection steps
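
The injection steps above can be collected into one helper so that no parameter is skipped. This is a sketch rather than an NVIDIA-provided tool: the function name `inject_dram_error` and its argument order are hypothetical, it must run as root with the einj driver loaded, and it assumes `available_error_type` lists one hexadecimal code per line as in the example output.

```shell
# Sketch: write one EINJ injection request using the files described above.
#   $1 physical address   (written to param1)
#   $2 address mask       (written to param2)
#   $3 error type         (written to error_type)
#   $4 (optional) EINJ directory; defaults to /sys/kernel/debug/apei/einj
inject_dram_error() {
    addr=$1
    mask=$2
    etype=$3
    einj_dir=${4:-/sys/kernel/debug/apei/einj}

    # Refuse error types that the platform does not advertise.
    listed=no
    while read -r value rest; do
        case $value in
            0x*|0X*) ;;            # only consider lines that start with a hex code
            *) continue ;;
        esac
        [ "$(($value))" -eq "$(($etype))" ] && listed=yes
    done < "$einj_dir/available_error_type"

    if [ "$listed" != yes ]; then
        echo "error type $etype is not listed in available_error_type" >&2
        return 1
    fi

    # Configure the parameters first, then trigger the injection last.
    echo "$addr"  > "$einj_dir/param1"     || return 1
    echo "$mask"  > "$einj_dir/param2"     || return 1
    echo "$etype" > "$einj_dir/error_type" || return 1
    echo 1        > "$einj_dir/error_inject"
}

# Example with the values used in the steps above (correctable error):
#   inject_dram_error 0x400000000 0xfffffffffffff000 0x8
```

Checking the type before writing anything means a rejected request leaves the previous EINJ configuration untouched.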

CE Error Injection Does Not Work

By default, the CE threshold is set to 5000. You must inject 5000 errors before they are reported by the OS unless you change the threshold.

  • Change the threshold to 0 via UEFI or Redfish to report CE errors immediately.
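
If the threshold is left at its default, reaching it by hand is impractical, so a loop can repeat the trigger instead. The sketch below assumes that param1, param2, and error_type are already configured for a correctable error and that each write to error_inject injects one more error with those settings; the function name `inject_repeatedly` is hypothetical.

```shell
# Sketch: re-trigger the currently configured injection a given number of times.
#   $1 number of injections to perform
#   $2 (optional) EINJ directory; defaults to /sys/kernel/debug/apei/einj
# Prints the number of writes to error_inject that succeeded.
inject_repeatedly() {
    count=$1
    einj_dir=${2:-/sys/kernel/debug/apei/einj}
    done_count=0
    while [ "$done_count" -lt "$count" ]; do
        # Stop at the first failed write instead of counting it.
        echo 1 > "$einj_dir/error_inject" || break
        done_count=$((done_count + 1))
    done
    echo "$done_count"
}

# Example: reach the default threshold described above.
#   inject_repeatedly 5000
```

Lowering the threshold is usually the faster route; the loop is mainly useful when the default threshold itself is what needs to be exercised.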

BMC Does Not Receive Error Report

If the BMC does not report the injected error:

  1. Verify BMC firmware is up to date.

  2. Check that Arm reported the error:

    • Review the BlueField console logs.

    • Check for messages like ERROR: BL31: MSS1 C0 Double bit.

    • Run:

      dmesg | tail

    • Enable and check RShim logs:

      echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
      cat /dev/rshim0/misc

  3. If no errors appear, update the BFB and enable BMC software updates.
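
When checking whether Arm reported the error, a saved console capture or RShim log dump can be searched for the BL31 message instead of being read by eye. This is a minimal sketch: the function name `find_bl31_ecc_messages` is hypothetical, and the pattern generalizes the two message forms shown in this guide ("MSS1 C0 Single bit" and "MSS1 C0 Double bit") to any MSS and channel number, which is an assumption.

```shell
# Print BL31 ECC error messages found in log text read from stdin.
# Exit status is non-zero when no such message is present.
find_bl31_ecc_messages() {
    grep -E 'BL31: MSS[0-9]+ C[0-9]+ (Single|Double) bit ECC error detected'
}

# Typical use with a saved console capture (console.log is a placeholder name):
#   find_bl31_ecc_messages < console.log
```

An empty result with a non-zero exit status indicates that no error was reported on the Arm side, which points to step 3.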

© Copyright 2025, NVIDIA. Last updated on Jul 27, 2025.