DRAM-ECC (Thor)

Overview

DRAM-ECC (Error Correction Code) is a critical feature designed to detect and correct memory errors, ensuring system reliability and data integrity over the product’s lifetime. This documentation provides a comprehensive overview of the IGX Thor DRAM-ECC feature as implemented on the IGX Thor Developer Kit Mini board, including architecture, supported modes, configuration, error handling, validation methodology, and key differences from IGX Orin platforms.

What is DRAM-ECC?

  • Purpose: DRAM-ECC protects against data corruption in DRAM by detecting and correcting single-bit errors (SBE) and detecting double-bit errors (DBE).

  • Mechanism: For every 32 bytes of data, 2 bytes are reserved for ECC. On every write, ECC is calculated and stored; on every read, ECC is checked. If a mismatch is detected, hardware and software mechanisms are triggered to handle the error.
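The detect-and-correct behavior described above can be illustrated with a toy SEC-DED (single-error-correct, double-error-detect) code. The sketch below uses an extended Hamming(8,4) code over 4 data bits purely for illustration; it is an assumption chosen for brevity, not the code the Thor memory controller uses over its 32-byte data words, and the function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SEC-DED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1         # makes the total number of set bits even
    return sum(bit << i for i, bit in enumerate(code))


def decode(word):
    """Return (status, data); data is None when the error is uncorrectable."""
    code = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid word
    overall = sum(code) & 1             # 1 means an odd number of bits flipped
    if overall == 1:
        code[syndrome] ^= 1             # single-bit error: syndrome is its position
        status = "sbe_corrected"
    elif syndrome != 0:
        return "dbe_detected", None     # even number of flips, nonzero syndrome
    else:
        status = "ok"
    data = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, data


word = encode(0b1011)
print(decode(word))                     # clean read
print(decode(word ^ 0b00100000))        # one flipped bit
print(decode(word ^ 0b00100100))        # two flipped bits
```

A single flipped bit is located by the syndrome and corrected, while two flipped bits leave the overall parity even but the syndrome nonzero, so the word can only be flagged as bad.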

Supported ECC Modes on T264

  • No ECC: ECC is disabled; all DRAM is available for software.

  • Inline ECC: Supported on all DRAM chips. 1/16th of DRAM is reserved for ECC storage, reducing available memory. This is the legacy mode.

  • Alt-Link ECC: NVIDIA proprietary mode, supported by select Micron DRAMs. All DRAM is available for software; ECC is stored in a dedicated area within the DRAM device. Offers better bandwidth than inline ECC due to a dedicated data path.

  • On L4T SKU8 and other automotive platforms, only Alt-Link ECC is supported.
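As a rough worked example of the inline ECC cost, the sketch below applies the 1/16 reservation to a hypothetical 64 GiB configuration (the capacity is an assumption for illustration, as are the function names) and checks that the reserved area is large enough to hold 2 ECC bytes per 32 data bytes of the remaining memory.

```python
GIB = 1 << 30
DATA_BYTES_PER_ECC_WORD = 32   # from the mechanism description above
ECC_BYTES_PER_ECC_WORD = 2


def inline_ecc_split(total_bytes):
    """Return (usable, reserved) when 1/16 of DRAM is set aside for ECC."""
    reserved = total_bytes // 16
    return total_bytes - reserved, reserved


def ecc_bytes_needed(data_bytes):
    """ECC storage required to cover data_bytes at 2 bytes per 32 bytes."""
    words = -(-data_bytes // DATA_BYTES_PER_ECC_WORD)  # ceiling division
    return words * ECC_BYTES_PER_ECC_WORD


usable, reserved = inline_ecc_split(64 * GIB)   # hypothetical capacity
print(usable // GIB, reserved // GIB)           # 60 4
print(ecc_bytes_needed(usable) <= reserved)     # True
```

With Alt-Link ECC or with ECC disabled, the same capacity would be fully available to software, as the mode list states.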

Architecture & Design

Hardware Components

  • Memory Controller (MC): Calculates and stores ECC bits on every write; checks ECC on every read.

  • Hardware Safety Manager (HSM): Captures ECC errors and forwards them to the Functional Safety Island (FSI) firmware.

  • FSI: Handles error interrupts, manages error reporting, and coordinates page retirement.

Software Components

  • Bootloader (MB1/MB2): Manages bad page lists, memory carveouts, and ECC configuration.

  • Bad Page Partition (PRL, Page Retirement List): Stores addresses of retired (bad) DRAM pages in QSPI flash. Two copies (primary and secondary) are maintained for redundancy.

  • Patrol Scrubbing: FSI firmware uses the HSS engine to continuously patrol DRAM, detecting and correcting SBEs and triggering page retirement for DBEs.

PRL (Page Retirement List) and Bad Page Management

  • PRL Storage: Two copies (primary and secondary) are stored in QSPI flash.

  • PRL Update Flow: On DBE, FSI triggers a reset; MB1/MB2 update the PRL with the new bad page address, encrypt and sign the updated list, and write it back to both PRL partitions.

  • Bootloader Handling: On boot, MB1 reads the PRL and excludes bad pages from memory allocation. MB2 passes PRL status to FSI through scratch registers.
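The effect of excluding retired pages from allocation can be sketched as splitting a DRAM range around each bad page. This is a simplified model, not the MB1 implementation: the function name is hypothetical, and the 64 KiB page granule is an assumption.

```python
PAGE_SIZE = 0x10000  # assumption: 64 KiB retirement granule


def exclude_bad_pages(base, size, bad_pages, page_size=PAGE_SIZE):
    """Split [base, base + size) into usable (start, length) ranges.

    bad_pages holds page-aligned start addresses taken from the PRL.
    """
    ranges = []
    cursor = base
    end = base + size
    for page in sorted(set(bad_pages)):
        page_end = page + page_size
        if page_end <= cursor or page >= end:
            continue                      # page lies outside the remaining range
        if page > cursor:
            ranges.append((cursor, page - cursor))
        cursor = max(cursor, page_end)    # skip over the retired page
    if cursor < end:
        ranges.append((cursor, end - cursor))
    return ranges


for start, length in exclude_bad_pages(0x0, 0x40000, [0x10000]):
    print(hex(start), hex(length))
```

Each retired page removes one granule from the usable map, so a longer PRL means more fragmented and slightly smaller usable memory.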

Error Injection (for Development/Validation)

  • Purpose: Allows controlled injection of SBE/DBE for validation.

  • Mechanism: Errors are injected into a dedicated carveout (CARVEOUT_DRAM_ECC_TEST) at specific offsets (0x1000 for SBE, 0x8000 for DBE).

  • Enablement: Controlled using the ENABLE_DRAM_ECC_SBE and ENABLE_DRAM_ECC_DBE flags.

  • FSI Shell: The ecc_test dbe and ecc_test sbe commands can be used to trigger DBE or SBE injection (requires FSI support and correct carveout access).
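To make the offsets concrete, the sketch below computes the injection addresses from a carveout base. The base address is the one printed in the validation log later in this document; the helper names and the 64 KiB page granule are assumptions for illustration.

```python
SBE_OFFSET = 0x1000   # SBE injection offset within CARVEOUT_DRAM_ECC_TEST
DBE_OFFSET = 0x8000   # DBE injection offset within CARVEOUT_DRAM_ECC_TEST
PAGE_SIZE = 0x10000   # assumption: 64 KiB retirement granule


def injection_addresses(carveout_base):
    """Return the absolute addresses at which SBE and DBE are injected."""
    return {
        "sbe": carveout_base + SBE_OFFSET,
        "dbe": carveout_base + DBE_OFFSET,
    }


def page_base(addr, page_size=PAGE_SIZE):
    """Round an address down to the start of its page."""
    return addr & ~(page_size - 1)


addrs = injection_addresses(0x1fbeca0000)
print(hex(addrs["sbe"]), hex(addrs["dbe"]))
print(hex(page_base(addrs["dbe"])))
```

Under the 64 KiB assumption, the DBE injection address rounds down to the carveout base, which matches the bad page address reported in the validation log.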

Feature Flow

Initialization

  • DRAM Scrubbing: During boot, MB1 and MB2 perform staged or full scrubbing of DRAM to initialize ECC bits.

  • ECC Enablement: ECC is enabled using configuration flags in the BCT (Boot Configuration Table). For Alt-Link ECC, the relevant register is EMC_FBIO_CFG9.ALT_LINK_ECC_EN.

Error Handling

  • Single Bit Error (SBE): Corrected in hardware. FSI logs the event, but no system action is required.

  • Double Bit Error (DBE): Uncorrectable. FSI captures the error, triggers a system reset, and records the bad page address for retirement.

  • Page Retirement: On reboot, MB1/MB2 update the PRL with the new bad page, ensuring it is excluded from future memory allocations.
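The three cases above can be modeled as a small state machine. This is a conceptual sketch only: the class and method names are hypothetical, and an in-memory list stands in for the PRL partitions in QSPI flash and for whatever mechanism carries the bad page address across the reset.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EccFlowModel:
    prl: List[int] = field(default_factory=list)      # retired page addresses
    sbe_log: List[int] = field(default_factory=list)  # corrected-error records
    pending_bad_page: Optional[int] = None            # survives the reset
    resets: int = 0

    def on_error(self, kind, page_addr):
        """Model the FSI reaction to an ECC error."""
        if kind == "sbe":
            self.sbe_log.append(page_addr)            # log only, keep running
            return "logged"
        if kind == "dbe":
            self.pending_bad_page = page_addr         # record page for retirement
            self.resets += 1                          # then trigger a system reset
            return "reset"
        raise ValueError(f"unknown error kind: {kind}")

    def on_boot(self):
        """Model the bootloader folding a pending bad page into the PRL."""
        page = self.pending_bad_page
        if page is not None and page not in self.prl:
            self.prl.append(page)
        self.pending_bad_page = None
        return list(self.prl)


model = EccFlowModel()
print(model.on_error("sbe", 0x1fbeca0000))
print(model.on_error("dbe", 0x1fbeca0000))
print([hex(p) for p in model.on_boot()])
```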

Patrol Scrubbing

(available in GA)

  • Continuous Scrubbing: FSI uses the HSS engine to perform background read-modify-write operations across DRAM, proactively detecting errors before they impact applications.

  • SC7/Low Power Handling: Scrubbing is paused during deep sleep (SC7) and resumes on wake, with FSI tracking progress to avoid redundant operations.
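The pause-and-resume behavior can be pictured as a cursor that walks DRAM regions and is preserved across SC7. The model below is a hedged illustration with hypothetical names; it does not reflect the HSS engine interface or the actual granularity FSI uses.

```python
class PatrolScrubModel:
    """Walks region indices in order, wrapping around at the end of DRAM."""

    def __init__(self, num_regions):
        self.num_regions = num_regions
        self.cursor = 0          # next region to scrub; kept across SC7
        self.paused = False

    def step(self):
        """Scrub one region, or do nothing while paused."""
        if self.paused:
            return None
        region = self.cursor
        # A real scrubber would read, check, and rewrite the region here.
        self.cursor = (self.cursor + 1) % self.num_regions
        return region

    def enter_sc7(self):
        self.paused = True

    def exit_sc7(self):
        self.paused = False      # cursor is untouched, so no region is redone


scrubber = PatrolScrubModel(num_regions=8)
print([scrubber.step() for _ in range(3)])
scrubber.enter_sc7()
print(scrubber.step())
scrubber.exit_sc7()
print(scrubber.step())
```

Because only the pause flag changes across SC7, scrubbing picks up at the first region that has not yet been visited in the current pass.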

Validation

Test Setup

  • Enable the injection flags in the conf file so that MB1 injects the SBE and DBE faults.

  • Example: COLDBOOT_BCT_DEFINES="ENABLE_DRAM_ECC_PRL;ENABLE_DRAM_ECC_SBE;ENABLE_DRAM_ECC_DBE".

Validation Logs and Flow

To validate the DRAM-ECC feature, perform the following checks:

  • MB1 should print the injection success logs:

[0001.855] I> Task: Dram ecc test
[0001.858] I> Test Inject ecc err passed for type 0
[0001.863] I> Test Inject ecc err passed for type 1
[0001.867] I> DRAM ECC Error Inject: SUCCESS

  • FSI should enable the PRL flow and initialize the ECC task:

PRL is enabled
DramECC Initialization success
NvHsm_Init : module initialized

  • Run the ECC test in the FSI console:

] ecc_test dbe

=== ECC Test Carveout Info ===
Combined Carveout Base Address: 0x1fbeca0000

*lPtr:0x6c3b3cf3
] Bad Page Address: 0x1fbeca
Trigger PMC Reset.

  • On reboot, MB1 should detect the bad page and record it in the QSPI bad page blob:

[0000.425] I> Task: Load Page retirement list
[0000.428] I> PRL is enabled
[0000.431] I> Slot: 0
[0000.433] I> Binary[8] block-128640 (partition size: 0x80000)
[0000.439] I> Binary name: DRAM bad page list (P)
[0000.443] I> Size of crypto header is 8192
[0000.447] I> Size of crypto header is 8192
[0000.451] I> BCH of DRAM bad page list (P) read from storage
[0000.456] I> component binary type is 8
[0000.461] I> DRAM bad page list (P) header integrity check is success
[0000.486] I> DRAM bad page list (P) binary is read from storage
[0000.492] I> DRAM bad page list (P) binary integrity check is success
[0000.501] I> Num of PRL entries: 0
[0000.501] I> PRL Bad Page Count = 1
[0000.505] I> bad page addr 0: 0x1fbeca0000
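
The final check can be automated by scanning a captured MB1 log for the PRL lines shown above. The sketch below is one possible approach using regular expressions that match the log format in this document; the function name is hypothetical, and how the log is captured from the serial console is left out.

```python
import re


def parse_prl_log(text):
    """Return (bad_page_count, [bad_page_addresses]) from an MB1 boot log."""
    count_match = re.search(r"PRL Bad Page Count = (\d+)", text)
    count = int(count_match.group(1)) if count_match else None
    addrs = [
        int(addr, 16)
        for addr in re.findall(r"bad page addr \d+: (0x[0-9a-fA-F]+)", text)
    ]
    return count, addrs


sample = """\
[0000.428] I> PRL is enabled
[0000.501] I> PRL Bad Page Count = 1
[0000.505] I> bad page addr 0: 0x1fbeca0000
"""

count, addrs = parse_prl_log(sample)
expected = 0x1fbeca0000   # carveout base reported by ecc_test in the FSI console
print(count == len(addrs) and expected in addrs)
```

Comparing the parsed address against the carveout base printed by the FSI console confirms that the page retired after the injected DBE is the one recorded in the PRL.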