NVIDIA Tegra
NVIDIA DRIVE OS 5.1 Linux SDK

Developer Guide
5.1.6.0 Release


 
DRAM Error-Correcting Code Memory on the Platform
 
SoC external memory (DRAM) ECC is a safety-related feature implemented in the SoC.
DRAM is protected from bit-flip errors using ECC. The ECC code uses 10 bits for every 256 bits of data. It corrects single-bit errors and detects double-bit errors.
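The 10-bit figure matches what a Hamming-style SEC-DED code would need for a 256-bit data word, although this guide does not state which code the hardware actually uses. A minimal Python sketch of that arithmetic, assuming an extended Hamming construction (the function name is illustrative only):

```python
def secded_check_bits(data_bits: int) -> int:
    """Minimum check bits for an extended Hamming SEC-DED code.

    Assumption: the hardware code is Hamming-like; the guide does not say so.
    """
    r = 0
    # Single-error correction needs 2**r >= data_bits + r + 1, so that
    # "no error" and every single-bit error position get distinct syndromes.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One additional overall-parity bit adds double-error detection.
    return r + 1


print(secded_check_bits(256))  # 10
```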
ECC Bits Storage
DRAM width is not extended to cover the ECC bits. Instead, they are stored in-line: for every seven 512B regions of data, there is one 512B region that stores the ECC bits for those seven regions. When ECC is enabled, extra address computation logic determines the actual physical location of the data and the ECC bits.
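The kind of address computation involved can be sketched as follows. The exact placement of the ECC region is not described in this guide, so the sketch assumes, purely for illustration, that each group of eight 512B regions holds seven data regions followed by one ECC region. The name `translate` is hypothetical.

```python
REGION = 512          # bytes per region (from the text)
DATA_PER_GROUP = 7    # data regions per ECC region (from the text)
GROUP = 8             # seven data regions plus one ECC region


def translate(logical_addr: int) -> tuple[int, int]:
    """Map a logical byte address to (physical data address, ECC region base).

    Assumption: the ECC region is the last region of each 8-region group.
    The real hardware layout may differ.
    """
    region, offset = divmod(logical_addr, REGION)
    group, slot = divmod(region, DATA_PER_GROUP)
    group_base = group * GROUP * REGION
    data_addr = group_base + slot * REGION + offset
    ecc_base = group_base + DATA_PER_GROUP * REGION
    return data_addr, ecc_base
```

Under this assumed layout, logical region 7 (the eighth data region) lands at the start of the second group, because the eighth physical region of the first group is taken by ECC bits.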
Error Detection
For every write to the DRAM, the corresponding ECC bits are also updated. For every read from the DRAM, the ECC computed from the read data is compared with the stored ECC, and any mismatch results in an error. The SoC DRAM ECC implementation can correct single-bit errors in data and ECC bits, detect double-bit errors in data and ECC bits, and detect errors in address bits.
Error reports also include the exact “address” at which the error was detected.
DRAM Size Impact
Because the ECC bits are stored in-line with normal data, 1/8 of the actual DRAM size is reserved for them. For example, on DRIVE Development Platform Pegasus the usable DRAM size is reduced from 32GB to 28GB whenever DRAM ECC is enabled.
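The arithmetic behind the example is straightforward: seven eighths of the DRAM remain available for data. A short sketch (the function name is illustrative):

```python
def usable_dram_gb(total_gb: float) -> float:
    """DRAM left for data when in-line ECC reserves 1/8 of the total."""
    return total_gb * 7 / 8


print(usable_dram_gb(32))  # 28.0
```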
DRAM Bandwidth Impact
When DRAM ECC is enabled, every access incurs the overhead of reading or writing ECC bytes along with the data bytes, which reduces the available DRAM bandwidth. The impact is proportional to bandwidth consumption: a DRAM bandwidth impact of 10%-12% is visible only when bandwidth consumption is greater than 100 GBps (gigabytes per second).
DRAM ECC Enabled Boot on DRIVE Development Platform Pegasus
 
DRAM ECC is disabled by default. To boot with DRAM ECC enabled, the -E option must be passed during flashing.
Flash with the -E option. For example:
./bootburn.sh -B qspi -b e3550b01-t194a -E
Confirming the DRAM ECC enabled boot
Read the register 0x02c11880 and confirm that its value is 0xd, which indicates that DRAM ECC is enabled. A value of 0xc at 0x02c11880 indicates that DRAM ECC is disabled.
Example read from BPMP console:
] dw 0x02c11880 4
0x02c11880: 0000000d
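If the console output is captured in a log, the check can be automated. The sketch below is a hypothetical helper, not part of the SDK; it assumes the output line has the `address: value` form shown above and uses only the register address and values given in this section.

```python
ECC_STATUS_REG = 0x02C11880   # register address from the text
ECC_ENABLED = 0xD             # value indicating DRAM ECC is enabled
ECC_DISABLED = 0xC            # value indicating DRAM ECC is disabled


def ecc_state(dw_line: str) -> str:
    """Interpret one output line of the BPMP console `dw` command."""
    addr_text, value_text = dw_line.split(":", 1)
    addr = int(addr_text.strip(), 16)
    value = int(value_text.split()[0], 16)
    if addr != ECC_STATUS_REG:
        raise ValueError(f"unexpected register address {addr:#010x}")
    if value == ECC_ENABLED:
        return "enabled"
    if value == ECC_DISABLED:
        return "disabled"
    return "unknown"              # any other value is not described in the text


print(ecc_state("0x02c11880: 0000000d"))  # enabled
```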
Software Features
 
The following sections describe DRAM ECC software features.
Staged Scrubbing
By default, DRAM does not contain valid ECC after power-on. The whole DRAM must be populated with valid ECC (init scrubbing) at every boot, and any DRAM location must be init scrubbed before it is read. Scrubbing the whole DRAM at an early stage of boot would have a significant impact on boot KPIs, so the DRAM is instead scrubbed in a "staged" fashion: each boot stage scrubs the DRAM range needed by the next stage.
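The staging idea can be modeled in a few lines. This is a toy simulation under assumed block ranges, not the bootloader's implementation; `EccDram`, `staged_boot`, and the ranges in the usage example are all hypothetical.

```python
class EccDram:
    """Toy DRAM that tracks which blocks hold valid ECC."""

    def __init__(self, size: int):
        self.scrubbed = [False] * size

    def scrub(self, start: int, end: int) -> None:
        for block in range(start, end):
            self.scrubbed[block] = True      # init scrub: write data with valid ECC

    def read(self, block: int) -> int:
        if not self.scrubbed[block]:
            raise RuntimeError(f"read of unscrubbed block {block}")
        return 0


def staged_boot(dram: EccDram, ranges: list[tuple[int, int]]) -> list[int]:
    """ranges[i] is the block range used by boot stage i (assumed values).

    Returns the number of scrubbed blocks after each stage.
    """
    dram.scrub(*ranges[0])                   # done before the first stage runs
    progress = []
    for i, (start, _end) in enumerate(ranges):
        dram.read(start)                     # stage i only touches its own range
        if i + 1 < len(ranges):
            dram.scrub(*ranges[i + 1])       # prepare memory for the next stage
        progress.append(sum(dram.scrubbed))
    return progress


print(staged_boot(EccDram(64), [(0, 4), (4, 16), (16, 64)]))  # [16, 64, 64]
```

The model shows the trade-off: no stage waits for the whole DRAM to be scrubbed, yet no stage ever reads a location that lacks valid ECC.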
Error Handling
DRAM ECC error detection and reporting at the SoC is referred to as single-bit error correct (SEC) and double-bit error detect (DED).
Correctable Errors
Single-bit errors in data and ECC bits are correctable. The hardware corrects them automatically and returns the corrected data to the requestor (whoever made the read request to the physical page). However, the corrupted data stored in the DRAM itself must be corrected by software. Correctable errors are signaled as normal interrupts to the SCE (safety cluster), and L2SS running on the SCE performs "demand scrubbing" to correct the error at the reported physical page.
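The distinction between the corrected value returned to the requestor and the still-corrupted value held in DRAM can be sketched with a small behavioral model. The class and function names are hypothetical, and the model abstracts the ECC hardware as an oracle rather than implementing the actual code or the L2SS handler.

```python
class EccPage:
    """Toy model of one DRAM location protected by SEC-DED hardware."""

    def __init__(self, value: int):
        self.value = value        # data as originally written
        self.flip_mask = 0        # bits currently corrupted in the DRAM cells

    def inject_bit_flip(self, bit: int) -> None:
        self.flip_mask ^= 1 << bit

    def raw(self) -> int:
        """Contents physically stored in DRAM, including any corruption."""
        return self.value ^ self.flip_mask

    def read(self) -> tuple[int, bool]:
        """Return (data, correctable_error_seen) as the requestor sees it."""
        errors = bin(self.flip_mask).count("1")
        if errors > 1:
            raise RuntimeError("uncorrectable error")
        return self.value, errors == 1

    def write(self, value: int) -> None:
        self.value = value
        self.flip_mask = 0        # a write refreshes both data and ECC bits


def demand_scrub(page: EccPage) -> bool:
    """Write the corrected data back if the read reported a correctable error."""
    data, correctable = page.read()
    if correctable:
        page.write(data)
    return correctable


page = EccPage(0xA5)
page.inject_bit_flip(3)
print(page.read())         # (165, True): requestor gets corrected data
print(hex(page.raw()))     # 0xad: DRAM still holds the flipped bit
demand_scrub(page)
print(hex(page.raw()))     # 0xa5: DRAM contents repaired
```

In this model, a second bit flip in the same location before the write-back would make the read fail as uncorrectable, which is the motivation for repairing correctable errors promptly.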
Uncorrectable Errors
Double-bit errors in data and ECC bits are uncorrectable. Other uncorrectable errors are poison bit errors and address bit errors. Uncorrectable errors are routed to the hardware safety manager (HSM) and handled as part of 3LSS error handling.
Diagnostics
The SoC DRAM ECC hardware has built-in error injection capabilities for testing the detection of single-bit and double-bit errors in data and ECC bits. DRAM ECC diagnostics work by injecting errors at the bootloader using these hardware-provided capabilities and then confirming at L2SS boot that the injected errors were detected. DRAM ECC diagnostics are performed during every boot to confirm that the DRAM ECC hardware is operational.