DRAM-ECC

Error correction code (ECC) protection provides a means to detect and correct errors within DRAM. When DRAM-ECC is enabled, 2 bytes of ECC are allocated for every 32 bytes of data. On each write, the 2 ECC bytes are calculated and written alongside the 32 data bytes. On each read, the stored ECC bytes are read back and compared against the ECC calculated from the data to verify integrity. Any discrepancy is flagged by the Memory Subsystem (MSS) hardware.

Note

Activating ECC protection reduces the memory available for regular use to 7/8 of the total, because the remainder is allocated for ECC data.
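
As a rough illustration of the note above, the sketch below computes how much memory is left for regular use once ECC is enabled. The 7/8 ratio comes from the note; the function name and the 64 GiB example size are assumptions made for illustration.

```python
GIB = 1 << 30

def usable_bytes_with_ecc(total_bytes: int) -> int:
    """Return the memory left for regular use when DRAM-ECC is enabled.

    Uses the 7/8 ratio from the note; the remaining 1/8 holds ECC data.
    """
    return total_bytes * 7 // 8

total = 64 * GIB  # assumed example DRAM size
usable = usable_bytes_with_ecc(total)
print(f"{usable // GIB} GiB usable of {total // GIB} GiB")  # 56 GiB usable of 64 GiB
```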

Components

Hardware

The hardware components involved in the DRAM-ECC feature are listed below:

Hardware Safety Manager (HSM): The HSM captures errors triggered by different hardware components and forwards them to the Functional Safety Island (FSI) firmware.
Memory Controller (MC): For every write to the DRAM, the memory controller hardware calculates the ECC bits and stores them in the reserved portion of DRAM. For every read from the DRAM, it recalculates the ECC bits and compares them to the stored ECC bits. On a discrepancy, it raises an interrupt to the HSM hardware block, which the FSI firmware then handles. The memory controller hardware corrects all single bit errors and detects double bit errors.
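
The correct-one, detect-two behavior described above can be illustrated with a toy code. This document does not specify the code the memory controller actually uses, so the sketch below is only an assumption-laden illustration: an extended Hamming(8,4) code over 4 data bits, not the 32-byte/2-byte scheme of the hardware. The function names `encode` and `decode` are hypothetical.

```python
from __future__ import annotations

def encode(nibble: int) -> list[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indexes 1..7 form a Hamming(7,4)
    codeword with parity bits at positions 1, 2, and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code

def decode(code: list[int]) -> tuple[str, int | None]:
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 0 when the overall parity still matches
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one, located at index `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with matching overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble

word = encode(0b1011)
single = list(word); single[5] ^= 1
double = list(word); double[5] ^= 1; double[6] ^= 1
print(decode(word))    # ('ok', 11)
print(decode(single))  # ('corrected', 11)
print(decode(double))  # ('uncorrectable', None)
```

In this toy model, the 'corrected' case corresponds to a single bit error fixed in transit, and the 'uncorrectable' case corresponds to a double bit error that can only be reported.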

Software

The software components involved in the DRAM-ECC feature are listed below:

Bad Page Partition: On detection of a double bit error, the FSI firmware reboots the system and informs MB1 to exclude the bad page from available memory. MB1 stores this bad page address in the bad page partition. A bad page partition is a dedicated partition in QSPI that stores the list of bad pages. It is not part of any boot chain. During cold boot, MB1 uses entries from this partition to skip these pages while allocating carveouts. UEFI also uses these entries to skip these pages in the kernel’s physical memory handling.
Functional Safety Island (FSI): The FSI firmware handles all the interrupts collected by the HSM. The FSI firmware only has the required firewall settings to access the ECC status registers. In case of a single bit error, the FSI firmware takes no action as the memory controller hardware corrects the error while it is in transit. In case of a double bit error, the FSI firmware calculates the DRAM page address where the error is reported, caches it, and reboots for the bootloader components to retire the bad page.
Bootloader (MB1/MB2): The bootloader components MB1 and MB2 detect the bad pages identified by the FSI firmware and add them to the list of bad pages in the bad page partition. During memory allocation, MB1 excludes the pages listed in the bad page partition from the available DRAM. The resulting available memory is then propagated to subsequent boot components (MB2, UEFI, kernel) for their respective usage.
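
The interaction between these software components can be sketched as follows. This is a simplified model, not the firmware's implementation. The 64 KiB page granularity is an assumption inferred from the example logs in Verification Steps, and the top-down allocator, the function names, and the `top` address are hypothetical.

```python
PAGE_SIZE = 0x10000  # assumed 64 KiB retirement granularity

def page_base(error_addr: int) -> int:
    """Round an error address down to the start of its page."""
    return error_addr & ~(PAGE_SIZE - 1)

def record_bad_page(bad_pages: list, error_addr: int) -> None:
    """Add the page containing error_addr to the bad page list, once."""
    page = page_base(error_addr)
    if page not in bad_pages:
        bad_pages.append(page)

def allocate_top_down(top: int, size: int, align: int, bad_pages: list) -> int:
    """Return the highest aligned base below `top` whose range avoids bad pages."""
    base = (top - size) & ~(align - 1)
    while any(base < p + PAGE_SIZE and p < base + size for p in bad_pages):
        base -= align  # step down until the range no longer overlaps a bad page
    return base

bad_pages = []
top = 0xe2cd80000  # hypothetical upper bound for the carveout

print(hex(allocate_top_down(top, 0x10000, 0x10000, bad_pages)))  # 0xe2cd70000

# A double bit error is reported at this address; its page is retired.
record_bad_page(bad_pages, 0xe2cd78000)
print([hex(p) for p in bad_pages])                                # ['0xe2cd70000']

# On the next boot the same allocation moves below the retired page.
print(hex(allocate_top_down(top, 0x10000, 0x10000, bad_pages)))  # 0xe2cd60000
```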

Stages

The DRAM-ECC feature consists of the following stages:

DRAM Init Scrub: During the ECC workflow in boot stages MB1 and MB2, the entire DRAM (excluding the MB1 and MB2 carveouts) is written. This process populates the ECC bits for the hardware to function properly.
Boot and Power Management Processor (BPMP) Patrol Scrub: The BPMP R5 scans the DRAM continuously at runtime and fixes any single bit errors it encounters. This guards against the unlikely scenario in which a second single bit error hits the same word and causes an uncorrectable double bit error.
Single Bit Error (SBE): This occurs when a single bit flips anywhere in the data or ECC. The memory controller hardware corrects the data in transit without any software intervention. Note that the bit in DRAM remains flipped in this case: every read is corrected in transit until either new data is written to that location or the BPMP Patrol Scrub fixes the error in DRAM.
Double Bit Error (DBE): This occurs when two bits flip anywhere in the data or ECC. Double bit errors are uncorrectable. The FSI firmware captures the location of this error, caches it for retirement in MB1/MB2, and then resets the device.
Page Retirement Flow: The page retirement flow can be activated by uncorrectable ECC errors or during the first boot after flashing.
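
To show why the patrol scrub matters, the toy model below tracks only how many bits of each word are flipped. It is a sketch under simplifying assumptions: the `Word` class and `patrol_scrub` function are hypothetical and do not reflect how the BPMP firmware is written.

```python
class Word:
    """Toy DRAM word that only tracks how many of its bits are flipped."""

    def __init__(self) -> None:
        self.flipped = 0

    def read(self) -> str:
        """Model a read through the memory controller."""
        if self.flipped == 0:
            return "ok"
        if self.flipped == 1:
            return "corrected"    # fixed in transit; the stored bit stays flipped
        return "uncorrectable"    # double bit error

def patrol_scrub(memory: list) -> int:
    """Rewrite every word that has a single flipped bit; return how many were fixed."""
    fixed = 0
    for word in memory:
        if word.read() == "corrected":
            word.flipped = 0      # write the corrected data back to DRAM
            fixed += 1
    return fixed

memory = [Word() for _ in range(4)]
memory[2].flipped += 1            # first bit flip
print(memory[2].read())           # corrected
print(patrol_scrub(memory))       # 1
memory[2].flipped += 1            # a later, second flip in the same word
print(memory[2].read())           # corrected
```

Without the scrub pass, the second flip would leave two flipped bits in the same word and the final read would report "uncorrectable".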

Verification Steps

This section describes how to verify the DRAM-ECC feature. The bootloader injects errors into known memory locations, which are then read from the Linux command line.

Common Steps

The following steps are required for both SBE and DBE testing.

Enable error injection and flash the build:

# Enable the ECC injection flag
cd <Linux_for_Tegra>
vim bootloader/tegra234-mb1-bct-dram-ecc-l4t.dtsi
# In the file, enable the injection configuration:
enable_dram_error_injection = <1>;
# Flash the board

MB1 reads the bad-page binary. All memory is assigned to the ECC region.

[0000.401] I> Task: Load Page retirement list
[0000.405] I> Slot: 0
[0000.407] I> Binary[4] block-125952 (partition size: 0x80000)
[0000.413] I> Binary name: DRAM bad page list (P)
[0000.417] I> Size of crypto header is 8192
[0000.421] I> Size of crypto header is 8192
[0000.425] I> strt_pg_num(125952) num_of_pgs(16) read_buf(0x40050000)
[0000.431] I> BCH of DRAM bad page list (P) read from storage
[0000.437] I> BCH address is : 0x40050000
[0000.441] I> component binary type is 4
[0000.444] I> DRAM bad page list (P) header integrity check is success
[0000.451] I> Binary magic in BCH component 0 is BINF
[0000.456] I> component binary type is 4
[0000.459] I> component binary type is 4
[0000.463] I> Size of crypto header is 8192
[0000.467] I> component binary type is 4
[0000.471] I> strt_pg_num(125968) num_of_pgs(8) read_buf(0x40040000)
[0000.477] I> DRAM bad page list (P) binary is read from storage
[0000.483] I> DRAM bad page list (P) binary integrity check is success
[0000.489] I> Binary DRAM bad page list (P) loaded successfully at 0x40040000 (0x1000)
[0000.499] I> Task: SDRAM params override
[0000.503] I> Task: Save mem-bct info
[0000.506] I> Task: Carveout allocate
[0000.510] I> RCM blob carveout will not be allocated
[0000.515] I> Update CCPLEX IST carveout from MB1-BCT
[0000.519] I> ECC region[0]: Start:0x80000000, End:0xe80000000
[0000.525] I> ECC region[1]: Start:0x0, End:0x0
[0000.529] I> ECC region[2]: Start:0x0, End:0x0
[0000.534] I> ECC region[3]: Start:0x0, End:0x0
[0000.538] I> ECC region[4]: Start:0x0, End:0x0
[0000.542] I> Non-ECC region[0]: Start:0x0, End:0x0
[0000.547] I> Non-ECC region[1]: Start:0x0, End:0x0
[0000.551] I> Non-ECC region[2]: Start:0x0, End:0x0
[0000.556] I> Non-ECC region[3]: Start:0x0, End:0x0
[0000.561] I> Non-ECC region[4]: Start:0x0, End:0x0
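
When checking this step across several boots, it can help to total the region sizes automatically. The sketch below parses the region lines with a regular expression. It assumes the End value is exclusive, and the function name is hypothetical. Only three lines of the log above are reproduced as sample input.

```python
import re

REGION = re.compile(
    r"I> (Non-ECC|ECC) region\[(\d+)\]: Start:(0x[0-9a-f]+), End:(0x[0-9a-f]+)"
)

def region_sizes(log: str) -> dict:
    """Sum the size in bytes of the ECC and Non-ECC regions found in an MB1 log."""
    totals = {"ECC": 0, "Non-ECC": 0}
    for kind, _index, start, end in REGION.findall(log):
        totals[kind] += int(end, 16) - int(start, 16)  # End assumed exclusive
    return totals

LOG = """\
[0000.519] I> ECC region[0]: Start:0x80000000, End:0xe80000000
[0000.525] I> ECC region[1]: Start:0x0, End:0x0
[0000.542] I> Non-ECC region[0]: Start:0x0, End:0x0
"""

sizes = region_sizes(LOG)
print(sizes["ECC"] // (1 << 30), "GiB ECC,", sizes["Non-ECC"], "bytes non-ECC")
# 56 GiB ECC, 0 bytes non-ECC
```

A Non-ECC total of zero confirms that all memory is assigned to the ECC region. The 56 GiB figure would match 7/8 of a 64 GiB configuration, although the log excerpt does not state the total DRAM size.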

Record the carveout 49 base address.

[0000.795] I> allocated(CO:50) base:0xe2c600000 size:0x200000 align: 0x100000
[0000.802] I> allocated(CO:52) base:0xe2cdc0000 size:0x30000 align: 0x10000
[0000.808] I> allocated(CO:48) base:0xe2cda0000 size:0x20000 align: 0x10000
[0000.815] I> allocated(CO:69) base:0xe2cd80000 size:0x20000 align: 0x10000
[0000.822] I> allocated(CO:49) base:0xe2cd70000 size:0x10000 align: 0x10000
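
The base address can also be pulled from a saved boot log instead of being copied by hand. The following is a minimal sketch. The function name is hypothetical, only two lines of the log above are reproduced as sample input, and the 0x1000 and 0x8000 offsets are the ones used by the SBE and DBE tests in this section.

```python
import re

ALLOC = re.compile(r"allocated\(CO:(\d+)\) base:(0x[0-9a-f]+) size:(0x[0-9a-f]+)")

def carveout_table(log: str) -> dict:
    """Map carveout ID to (base, size) from MB1 'allocated(CO:..)' lines."""
    return {int(co): (int(base, 16), int(size, 16))
            for co, base, size in ALLOC.findall(log)}

LOG = """\
[0000.815] I> allocated(CO:69) base:0xe2cd80000 size:0x20000 align: 0x10000
[0000.822] I> allocated(CO:49) base:0xe2cd70000 size:0x10000 align: 0x10000
"""

base, size = carveout_table(LOG)[49]
sbe_addr = base + 0x1000  # address read in the SBE test
dbe_addr = base + 0x8000  # address read in the DBE test
print(hex(base), hex(sbe_addr), hex(dbe_addr))
# 0xe2cd70000 0xe2cd71000 0xe2cd78000
```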

Single Bit Error (SBE) Testing

A single bit error is injected at address <Carveout 49 base> + 0x1000. Read that data from the kernel console.
ubuntu@jetson:~$ sudo ./devmem2 0xe2cd71000 w

Read the external memory controller channel status registers one by one from the FSI console to find the channel whose ECC SBE count has increased. Bits 15:8 hold the SBE count.

FSI-SHELL>readmemory 0x02c70ac4
20000000
FSI-SHELL>readmemory 0x02c80ac4
20000000
FSI-SHELL>readmemory 0x02c90ac4
20000000
FSI-SHELL>readmemory 0x02ca0ac4
20000000
FSI-SHELL>readmemory 0x02cb0ac4
20000000
FSI-SHELL>readmemory 0x02cc0ac4
20000000
FSI-SHELL>readmemory 0x02cd0ac4
20000000
FSI-SHELL>readmemory 0x02ce0ac4
20000000
FSI-SHELL>readmemory 0x01780ac4
20010100 ← Example only
FSI-SHELL>readmemory 0x01790ac4
20000000
FSI-SHELL>readmemory 0x017a0ac4
20000000
FSI-SHELL>readmemory 0x017b0ac4
20000000
FSI-SHELL>readmemory 0x017c0ac4
20000000
FSI-SHELL>readmemory 0x017d0ac4
20000000
FSI-SHELL>readmemory 0x017e0ac4
20000000
FSI-SHELL>readmemory 0x017f0ac4
20000000
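
Decoding the count field by eye is error prone, so the readings can be checked with a short script. The sketch below extracts bits 15:8 and compares readings taken before and after the test. The register addresses and the example value come from the console session above. The baseline readings and the function names are assumptions for illustration.

```python
def sbe_count(status: int) -> int:
    """Extract the SBE count field (bits 15:8) from a channel status value."""
    return (status >> 8) & 0xFF

# The 16 register addresses read in the console session above.
CHANNEL_REGS = ([0x02c70ac4 + i * 0x10000 for i in range(8)]
                + [0x01780ac4 + i * 0x10000 for i in range(8)])

def channels_with_new_sbe(before: dict, after: dict) -> list:
    """Return the addresses whose SBE count grew between two sets of readings."""
    return [addr for addr in after
            if sbe_count(after[addr]) > sbe_count(before.get(addr, 0))]

# Hypothetical baseline: every channel reads 0x20000000 before the test.
before = {addr: 0x20000000 for addr in CHANNEL_REGS}
after = dict(before)
after[0x01780ac4] = 0x20010100  # example value from the session above

for addr in channels_with_new_sbe(before, after):
    print(f"{addr:#010x}: SBE count = {sbe_count(after[addr])}")
# 0x01780ac4: SBE count = 1
```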

Double Bit Error (DBE) Testing

A double bit error is injected at address <Carveout 49 base> + 0x8000. Read that data from the kernel console.
ubuntu@jetson:~$ sudo ./devmem2 0xe2cd78000 w

The FSI triggers an L1 cold boot, and MB2 enters DRAM-ECC mode. Refer to the above diagram for the detailed flow of DRAM-ECC. Once the DRAM-ECC bad page update completes, the corresponding messages appear on the console.

I> MB2 (version: 0.0.0.0-t234-54845784-c6a05a9f)
I> t234-A01-1-Silicon (0x12347)
I> Boot-mode : DRAM ECC
...
...
I> Read back verify success for primary and secondary blocks
I> Task: DRAM ECC Mode PMC Reset
I> Triggering PMC_RESET

In the next boot cycle, MB1 reads the bad page partition and retires the bad page.

[0000.480] I> DRAM bad page list (P) binary is read from storage
[0000.485] I> DRAM bad page list (P) binary integrity check is success
[0000.492] I> Binary DRAM bad page list (P) loaded successfully at 0x40040000 (0x1000)
[0000.502] I> bad page addr 0: 0xe2cd70000
[0000.506] I> Task: SDRAM params override
[0000.510] I> Task: Save mem-bct info
[0000.513] I> Task: Carveout allocate
[0000.517] I> RCM blob carveout will not be allocated
[0000.521] I> Update CCPLEX IST carveout from MB1-BCT
[0000.526] I> ECC region[0]: Start:0x80000000, End:0xe80000000
[0000.532] I> ECC region[1]: Start:0x0, End:0x0
[0000.536] I> ECC region[2]: Start:0x0, End:0x0
[0000.540] I> ECC region[3]: Start:0x0, End:0x0
[0000.545] I> ECC region[4]: Start:0x0, End:0x0
[0000.549] I> Non-ECC region[0]: Start:0x0, End:0x0
[0000.553] I> Non-ECC region[1]: Start:0x0, End:0x0
[0000.558] I> Non-ECC region[2]: Start:0x0, End:0x0
[0000.563] I> Non-ECC region[3]: Start:0x0, End:0x0
[0000.567] I> Non-ECC region[4]: Start:0x0, End:0x0

The carveout 49 address shifts to avoid the previously recorded bad page.

[0000.808] I> allocated(CO:52) base:0xe2cdc0000 size:0x30000 align: 0x10000
[0000.815] I> allocated(CO:48) base:0xe2cda0000 size:0x20000 align: 0x10000
[0000.822] I> allocated(CO:69) base:0xe2cd80000 size:0x20000 align: 0x10000
[0000.829] I> allocated(CO:49) base:0xe2cd60000 size:0x10000 align: 0x10000 ← Was previously 0xe2cd70000

The kernel memory map correspondingly excludes the bad page.

[ 0.000000] node 0: [mem 0x0000000080000000-0x00000000fffdffff]
[ 0.000000] node 0: [mem 0x00000000fffe0000-0x00000000ffffffff]
[ 0.000000] node 0: [mem 0x0000000100000000-0x0000000e18f95fff]
[ 0.000000] node 0: [mem 0x0000000e18f96000-0x0000000e1922bfff]
[ 0.000000] node 0: [mem 0x0000000e1922c000-0x0000000e2670ffff]
[ 0.000000] node 0: [mem 0x0000000e26710000-0x0000000e2864ffff]
[ 0.000000] node 0: [mem 0x0000000e28650000-0x0000000e2c5fffff]
[ 0.000000] node 0: [mem 0x0000000e2c600000-0x0000000e2c7fffff]
[ 0.000000] node 0: [mem 0x0000000e2c800000-0x0000000e2cd5ffff] ← Does not contain 0xe2cd60000 and 0xe2cd70000 pages
[ 0.000000] node 0: [mem 0x0000000e32000000-0x0000000e33ffffff]
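
This check can be scripted against a saved kernel log. The sketch below parses the `[mem start-end]` ranges, treating the end address as inclusive as the kernel prints it, and tests whether a given page address falls inside any of them. The function name is hypothetical, and only three lines of the map above are reproduced as sample input.

```python
import re

MEM = re.compile(r"\[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\]")

def covered(log: str, addr: int) -> bool:
    """Return True if addr falls inside any '[mem start-end]' range (end inclusive)."""
    return any(int(lo, 16) <= addr <= int(hi, 16) for lo, hi in MEM.findall(log))

LOG = """\
[ 0.000000] node 0: [mem 0x0000000e2c600000-0x0000000e2c7fffff]
[ 0.000000] node 0: [mem 0x0000000e2c800000-0x0000000e2cd5ffff]
[ 0.000000] node 0: [mem 0x0000000e32000000-0x0000000e33ffffff]
"""

for page in (0xe2cd60000, 0xe2cd70000):
    print(hex(page), "in kernel map:", covered(LOG, page))
# 0xe2cd60000 in kernel map: False
# 0xe2cd70000 in kernel map: False
```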