DRAM-ECC

Error correction code (ECC) protection detects and corrects errors in DRAM. With DRAM-ECC enabled, 2 bytes of ECC are allocated for every 32 bytes of data. The 2 ECC bytes are calculated and written alongside each data write. On each read, the stored ECC bytes are read back and compared with the ECC recalculated from the data. Any discrepancy is flagged by the Memory Subsystem (MSS) hardware.

Note

Activating ECC protection reduces the memory available for regular use to 7/8 of the total, because part of the DRAM is reserved for ECC data.
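
The 7/8 ratio in the note can be turned into a quick capacity estimate. The sketch below is plain arithmetic on that ratio; the function name is hypothetical, and the result ignores any other carveouts that further reduce the memory visible to the kernel.

```python
GIB = 1 << 30


def usable_bytes_with_ecc(total_bytes: int) -> int:
    """Memory left for regular use when 1/8 of DRAM is reserved for ECC."""
    return total_bytes * 7 // 8


for total_gib in (8, 32, 64):
    usable = usable_bytes_with_ecc(total_gib * GIB)
    print(f"{total_gib} GiB total -> {usable // GIB} GiB usable")
```

For example, a 64 GiB module would leave 56 GiB for regular use under this ratio.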

Components

Hardware

The hardware components involved in the DRAM-ECC feature are listed below:

  • Hardware Safety Manager (HSM): The HSM captures errors triggered by different hardware components and forwards them to the FSI firmware.

  • Memory Controller (MC): For every write to the DRAM, the memory controller hardware calculates the ECC bits and stores them in the reserved portion of DRAM. For every read from the DRAM, the memory controller hardware recalculates the ECC bits and compares them to the stored ECC bits. On a discrepancy, an interrupt is raised to the HSM hardware block and handled by the FSI firmware. The memory controller hardware corrects all single bit errors and detects double bit errors.
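
The "correct single bit errors, detect double bit errors" behavior can be illustrated with a toy code. The sketch below uses an extended Hamming (8,4) code, which is an assumption chosen for illustration only: the actual code used by the memory controller is not described here, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity, indices 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None when the error is uncorrectable."""
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    parity_odd = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd parity means one flipped bit; the syndrome is its position
        # (0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


word = encode(0b1011)
single = list(word)
single[5] ^= 1
double = list(single)
double[6] ^= 1
for label, candidate in (("clean", word), ("1 flip", single), ("2 flips", double)):
    print(label, decode(candidate))
```

The single flip is repaired on decode without touching the stored word, while the double flip can only be reported.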

Software

The software components involved in the DRAM-ECC feature are listed below:

  • Bad Page Partition: On detection of a double bit error, the FSI firmware reboots the system and informs MB1 to exclude the bad page from available memory. MB1 stores this bad page address in the bad page partition. A bad page partition is a dedicated partition in QSPI that stores the list of bad pages. It is not part of any boot chain. During cold boot, MB1 uses entries from this partition to skip these pages while allocating carveouts. UEFI also uses these entries to skip these pages in the kernel’s physical memory handling.

  • Functional Safety Island (FSI): The FSI firmware handles all the interrupts collected by the HSM. The FSI firmware only has the required firewall settings to access the ECC status registers. In case of a single bit error, the FSI firmware takes no action as the memory controller hardware corrects the error while it is in transit. In case of a double bit error, the FSI firmware calculates the DRAM page address where the error is reported, caches it, and reboots for the bootloader components to retire the bad page.

  • Bootloader (MB1/MB2): The bootloader components MB1 and MB2 detect the bad pages identified by the FSI firmware and add them to the list of bad pages in the bad page partition. During memory allocation, MB1 removes the pages listed in the bad page partition from the available DRAM. The remaining memory is then propagated to subsequent boot components (MB2, UEFI, kernel) for their respective usage.
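
Removing bad pages from the available DRAM amounts to splitting memory ranges around each listed page. The following is a minimal sketch of that idea, not the bootloader's implementation: the function name is hypothetical, ranges are treated as half-open, and the 64 KiB page size is an assumption inferred from the addresses in the verification logs.

```python
PAGE_SIZE = 0x10000  # assumption: 64 KiB retirement granularity


def exclude_bad_pages(regions, bad_pages, page_size=PAGE_SIZE):
    """Split [start, end) regions so that no returned range covers a bad page."""
    result = []
    for start, end in regions:
        cursor = start
        for page in sorted(bad_pages):
            page_end = page + page_size
            if page_end <= cursor or page >= end:
                continue  # page lies outside the part still being processed
            if page > cursor:
                result.append((cursor, page))
            cursor = max(cursor, page_end)
        if cursor < end:
            result.append((cursor, end))
    return result


usable = exclude_bad_pages([(0x80000000, 0xe80000000)], [0xe2cd70000])
for start, end in usable:
    print(f"{start:#x} - {end:#x}")
```

Later allocations (carveouts, kernel memory) would then be made only from the returned ranges.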

Stages

The DRAM-ECC feature consists of the following stages:

  • DRAM Init Scrub: During the ECC workflow in boot stages MB1 and MB2, the entire DRAM (excluding the MB1 and MB2 carveouts) is written. This process populates the ECC bits for the hardware to function properly.

  • Boot and Power Management Processor (BPMP) Patrol Scrub: The BPMP R5 scans the DRAM continuously at runtime, fixing any encountered single bit errors. This measure prevents the unlikely scenario of a second single bit error from impacting the same word and causing an uncorrectable double bit error.

  • Single Bit Error (SBE): This occurs when a single bit flips anywhere in the data or ECC. The memory controller hardware has the ability to correct the data while in transit without any software intervention. Note that the bit in the DRAM will remain flipped in this case. It would be corrected in transit on every read until either new data has been written to that location or BPMP Patrol Scrubbing has fixed the error in the DRAM.

  • Double Bit Error (DBE): This occurs when two bits flip anywhere in the data or ECC. Double bit errors are uncorrectable. The FSI firmware captures the location of this error, caches it for retirement in MB1/MB2, and then resets the device.

  • Page Retirement Flow: The page retirement flow can be activated by uncorrectable ECC errors or during the first boot after flashing.

[Figure: Page retirement flow (Page-Retirement-Flow.png)]
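
The reason patrol scrubbing matters can be shown with a very small model. The sketch below is illustrative only and rests on two assumptions: every flip hits a different bit of the word, and a full scrub pass completes between consecutive flips. All names are hypothetical.

```python
def classify(flipped_bits: int) -> str:
    """Map the number of flipped bits stored in one word to the ECC outcome."""
    if flipped_bits == 0:
        return "clean"
    if flipped_bits == 1:
        return "SBE"  # corrected in transit on every read
    return "DBE"      # uncorrectable


def simulate(flip_events, patrol_scrub: bool):
    """flip_events: word addresses that each take one bit flip, in time order."""
    stored_flips = {}
    for word in flip_events:
        if patrol_scrub:
            # Scrub pass: words holding a single flipped bit are rewritten
            # with corrected data, so the stored flip disappears.
            stored_flips = {w: n for w, n in stored_flips.items() if n != 1}
        stored_flips[word] = stored_flips.get(word, 0) + 1
    return {w: classify(n) for w, n in stored_flips.items()}


events = [0x40, 0x40]  # the same word is hit twice
print("without scrub:", simulate(events, patrol_scrub=False))
print("with scrub:   ", simulate(events, patrol_scrub=True))
```

Without scrubbing, the second flip turns a correctable word into an uncorrectable one; with scrubbing, each flip is seen as an isolated SBE.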

Verification Steps

This section describes how to verify the DRAM-ECC feature. The bootloader injects errors into memory locations, which are then read from the Linux command line.

Common Steps

The following steps are required for both SBE and DBE testing.

  1. Enable error injection and flash the build:

    # Enable ECC Injection flag
    cd <Linux_for_Tegra>
    vim bootloader/tegra234-mb1-bct-dram-ecc-l4t.dtsi
    # Enable the injection configuration
    enable_dram_error_injection = <1>;
    # flash the board
    
  2. MB1 reads the bad-page binary. All memory is assigned to the ECC region.

    [0000.401] I> Task: Load Page retirement list
    [0000.405] I> Slot: 0
    [0000.407] I> Binary[4] block-125952 (partition size: 0x80000)
    [0000.413] I> Binary name: DRAM bad page list (P)
    [0000.417] I> Size of crypto header is 8192
    [0000.421] I> Size of crypto header is 8192
    [0000.425] I> strt_pg_num(125952) num_of_pgs(16) read_buf(0x40050000)
    [0000.431] I> BCH of DRAM bad page list (P) read from storage
    [0000.437] I> BCH address is : 0x40050000
    [0000.441] I> component binary type is 4
    [0000.444] I> DRAM bad page list (P) header integrity check is success
    [0000.451] I> Binary magic in BCH component 0 is BINF
    [0000.456] I> component binary type is 4
    [0000.459] I> component binary type is 4
    [0000.463] I> Size of crypto header is 8192
    [0000.467] I> component binary type is 4
    [0000.471] I> strt_pg_num(125968) num_of_pgs(8) read_buf(0x40040000)
    [0000.477] I> DRAM bad page list (P) binary is read from storage
    [0000.483] I> DRAM bad page list (P) binary integrity check is success
    [0000.489] I> Binary DRAM bad page list (P) loaded successfully at 0x40040000 (0x1000)
    [0000.499] I> Task: SDRAM params override
    [0000.503] I> Task: Save mem-bct info
    [0000.506] I> Task: Carveout allocate
    [0000.510] I> RCM blob carveout will not be allocated
    [0000.515] I> Update CCPLEX IST carveout from MB1-BCT
    [0000.519] I> ECC region[0]: Start:0x80000000, End:0xe80000000
    [0000.525] I> ECC region[1]: Start:0x0, End:0x0
    [0000.529] I> ECC region[2]: Start:0x0, End:0x0
    [0000.534] I> ECC region[3]: Start:0x0, End:0x0
    [0000.538] I> ECC region[4]: Start:0x0, End:0x0
    [0000.542] I> Non-ECC region[0]: Start:0x0, End:0x0
    [0000.547] I> Non-ECC region[1]: Start:0x0, End:0x0
    [0000.551] I> Non-ECC region[2]: Start:0x0, End:0x0
    [0000.556] I> Non-ECC region[3]: Start:0x0, End:0x0
    [0000.561] I> Non-ECC region[4]: Start:0x0, End:0x0
    
  3. Record the carveout 49 base address.

    [0000.795] I> allocated(CO:50) base:0xe2c600000 size:0x200000 align: 0x100000
    [0000.802] I> allocated(CO:52) base:0xe2cdc0000 size:0x30000 align: 0x10000
    [0000.808] I> allocated(CO:48) base:0xe2cda0000 size:0x20000 align: 0x10000
    [0000.815] I> allocated(CO:69) base:0xe2cd80000 size:0x20000 align: 0x10000
    [0000.822] I> allocated(CO:49) base:0xe2cd70000 size:0x10000 align: 0x10000
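
Recording the carveout 49 base by hand is error prone, so the lookup can be scripted. The sketch below parses the MB1 log format shown in this step and derives the two test addresses from the offsets used in the SBE and DBE tests (0x1000 and 0x8000). The function name and regular expression are illustrative assumptions; the log excerpt is copied from the output above.

```python
import re

MB1_LOG = """\
[0000.795] I> allocated(CO:50) base:0xe2c600000 size:0x200000 align: 0x100000
[0000.815] I> allocated(CO:69) base:0xe2cd80000 size:0x20000 align: 0x10000
[0000.822] I> allocated(CO:49) base:0xe2cd70000 size:0x10000 align: 0x10000
"""

CARVEOUT_RE = re.compile(r"allocated\(CO:(\d+)\) base:(0x[0-9a-fA-F]+)")


def carveout_base(log: str, carveout_id: int) -> int:
    """Return the base address logged for the given carveout ID."""
    for match in CARVEOUT_RE.finditer(log):
        if int(match.group(1)) == carveout_id:
            return int(match.group(2), 16)
    raise ValueError(f"carveout {carveout_id} not found in log")


base = carveout_base(MB1_LOG, 49)
sbe_addr = base + 0x1000  # single bit error injection offset
dbe_addr = base + 0x8000  # double bit error injection offset
print(f"SBE test address: {sbe_addr:#x}")
print(f"DBE test address: {dbe_addr:#x}")
```

With the log above, the results match the addresses passed to devmem2 in the SBE and DBE tests.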
    

Single Bit Error (SBE) Testing

  1. A single bit error is injected at address <Carveout 49 base> + 0x1000. Read that address from the kernel console.

    ubuntu@jetson:~$ sudo ./devmem2 0xe2cd71000 w
    
  2. Read the external memory controller channel status registers one by one from the FSI console to find the channel whose ECC SBE count has increased. Bits 15:8 hold the count of SBE occurrences.

    FSI-SHELL>readmemory 0x02c70ac4
    20000000
    FSI-SHELL>readmemory 0x02c80ac4
    20000000
    FSI-SHELL>readmemory 0x02c90ac4
    20000000
    FSI-SHELL>readmemory 0x02ca0ac4
    20000000
    FSI-SHELL>readmemory 0x02cb0ac4
    20000000
    FSI-SHELL>readmemory 0x02cc0ac4
    20000000
    FSI-SHELL>readmemory 0x02cd0ac4
    20000000
    FSI-SHELL>readmemory 0x02ce0ac4
    20000000
    FSI-SHELL>readmemory 0x01780ac4
    20010100 ← Example only
    FSI-SHELL>readmemory 0x01790ac4
    20000000
    FSI-SHELL>readmemory 0x017a0ac4
    20000000
    FSI-SHELL>readmemory 0x017b0ac4
    20000000
    FSI-SHELL>readmemory 0x017c0ac4
    20000000
    FSI-SHELL>readmemory 0x017d0ac4
    20000000
    FSI-SHELL>readmemory 0x017e0ac4
    20000000
    FSI-SHELL>readmemory 0x017f0ac4
    20000000
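
Decoding the SBE count from the raw register values can be sketched as follows. The register addresses are the sixteen read in the FSI console session above (two groups of eight, 0x10000 apart), and the field position follows the bit range given in this step. The helper name and the sample readings dictionary are assumptions for illustration; the script does not talk to the FSI console.

```python
# Channel status register addresses, as read in the console session above.
STATUS_REGS = (
    [0x02c70ac4 + i * 0x10000 for i in range(8)]
    + [0x01780ac4 + i * 0x10000 for i in range(8)]
)


def sbe_count(status: int) -> int:
    """Extract the SBE occurrence count held in bits 15:8."""
    return (status >> 8) & 0xFF


# Sample readings copied from the session above (values are hexadecimal).
readings = {addr: 0x20000000 for addr in STATUS_REGS}
readings[0x01780ac4] = 0x20010100

for addr, value in readings.items():
    if sbe_count(value) > 0:
        print(f"{addr:#010x}: SBE count = {sbe_count(value)}")
```

Comparing the counts before and after the devmem2 read identifies the channel that saw the injected error.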
    

Double Bit Error (DBE) Testing

  1. A double bit error is injected at address <Carveout 49 base> + 0x8000. Read that address from the kernel console.

    ubuntu@jetson:~$ sudo ./devmem2 0xe2cd78000 w
    
  2. The FSI triggers an L1 cold boot and MB2 enters DRAM-ECC mode. Refer to the page retirement flow diagram above for the detailed flow. Once the DRAM-ECC bad page update completes, the following messages appear on the console.

    I> MB2 (version: 0.0.0.0-t234-54845784-c6a05a9f)
    I> t234-A01-1-Silicon (0x12347)
    I> Boot-mode : DRAM ECC
    ...
    ...
    I> Read back verify success for primary and secondary blocks
    I> Task: DRAM ECC Mode PMC Reset
    I> Triggering PMC_RESET
    
  3. In the next boot cycle, MB1 reads the bad page partition and retires the bad page.

    [0000.480] I> DRAM bad page list (P) binary is read from storage
    [0000.485] I> DRAM bad page list (P) binary integrity check is success
    [0000.492] I> Binary DRAM bad page list (P) loaded successfully at 0x40040000 (0x1000)
    [0000.502] I> bad page addr 0: 0xe2cd70000
    [0000.506] I> Task: SDRAM params override
    [0000.510] I> Task: Save mem-bct info
    [0000.513] I> Task: Carveout allocate
    [0000.517] I> RCM blob carveout will not be allocated
    [0000.521] I> Update CCPLEX IST carveout from MB1-BCT
    [0000.526] I> ECC region[0]: Start:0x80000000, End:0xe80000000
    [0000.532] I> ECC region[1]: Start:0x0, End:0x0
    [0000.536] I> ECC region[2]: Start:0x0, End:0x0
    [0000.540] I> ECC region[3]: Start:0x0, End:0x0
    [0000.545] I> ECC region[4]: Start:0x0, End:0x0
    [0000.549] I> Non-ECC region[0]: Start:0x0, End:0x0
    [0000.553] I> Non-ECC region[1]: Start:0x0, End:0x0
    [0000.558] I> Non-ECC region[2]: Start:0x0, End:0x0
    [0000.563] I> Non-ECC region[3]: Start:0x0, End:0x0
    [0000.567] I> Non-ECC region[4]: Start:0x0, End:0x0
    
  4. The carveout 49 address shifts to avoid the previously recorded bad page.

    [0000.808] I> allocated(CO:52) base:0xe2cdc0000 size:0x30000 align: 0x10000
    [0000.815] I> allocated(CO:48) base:0xe2cda0000 size:0x20000 align: 0x10000
    [0000.822] I> allocated(CO:69) base:0xe2cd80000 size:0x20000 align: 0x10000
    [0000.829] I> allocated(CO:49) base:0xe2cd60000 size:0x10000 align: 0x10000 ← Was previously 0xe2cd70000
    
  5. The kernel memory map correspondingly excludes the bad page.

    [ 0.000000] node 0: [mem 0x0000000080000000-0x00000000fffdffff]
    [ 0.000000] node 0: [mem 0x00000000fffe0000-0x00000000ffffffff]
    [ 0.000000] node 0: [mem 0x0000000100000000-0x0000000e18f95fff]
    [ 0.000000] node 0: [mem 0x0000000e18f96000-0x0000000e1922bfff]
    [ 0.000000] node 0: [mem 0x0000000e1922c000-0x0000000e2670ffff]
    [ 0.000000] node 0: [mem 0x0000000e26710000-0x0000000e2864ffff]
    [ 0.000000] node 0: [mem 0x0000000e28650000-0x0000000e2c5fffff]
    [ 0.000000] node 0: [mem 0x0000000e2c600000-0x0000000e2c7fffff]
    [ 0.000000] node 0: [mem 0x0000000e2c800000-0x0000000e2cd5ffff] ← Does not contain 0xe2cd60000 and 0xe2cd70000 pages
    [ 0.000000] node 0: [mem 0x0000000e32000000-0x0000000e33ffffff]
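
The final check can also be scripted. The sketch below rounds the DBE test address down to its page base and confirms that no kernel memory range from the log covers it. The 64 KiB page size is an assumption inferred from the logged "bad page addr" value, the helper names are hypothetical, and the log excerpt holds only the last three ranges from the output above.

```python
import re

PAGE_SIZE = 0x10000  # assumption: 64 KiB pages, inferred from the logs

KERNEL_LOG = """\
[ 0.000000] node 0: [mem 0x0000000e2c600000-0x0000000e2c7fffff]
[ 0.000000] node 0: [mem 0x0000000e2c800000-0x0000000e2cd5ffff]
[ 0.000000] node 0: [mem 0x0000000e32000000-0x0000000e33ffffff]
"""

RANGE_RE = re.compile(r"\[mem (0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+)\]")


def page_base(addr: int, page_size: int = PAGE_SIZE) -> int:
    """Round an address down to the start of its page."""
    return addr & ~(page_size - 1)


def parse_ranges(log: str):
    """Return the inclusive (start, end) ranges listed in the kernel log."""
    return [(int(a, 16), int(b, 16)) for a, b in RANGE_RE.findall(log)]


def is_mapped(addr: int, ranges) -> bool:
    return any(start <= addr <= end for start, end in ranges)


bad_page = page_base(0xe2cd78000)  # address used in the DBE test
ranges = parse_ranges(KERNEL_LOG)
state = "mapped" if is_mapped(bad_page, ranges) else "excluded"
print(f"{bad_page:#x} is {state}")
```

Absence from the kernel map alone does not distinguish a retired page from carveout memory, so this check is best read together with the "bad page addr" line in the MB1 log.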