Dynamic Page Retirement

The memory used in Tesla boards undergoes a rigorous screening and testing process during manufacturing. Once deployed in the field, however, some memory cells may still degrade over time. Beginning with Release 319, the NVIDIA Professional Driver can detect weak memory cells and retire the associated page, helping to ensure the longevity and reliability of Tesla products.

When a memory cell experiences a double-bit error (DBE) or two consecutive single-bit errors (SBEs), the driver retires the associated page. This feature is called Dynamic Page Retirement. To avoid runtime data corruption, the retired page is no longer available to the driver or to applications.
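
The retirement rule above can be sketched as a small toy model. This is an illustration only, not the driver's implementation: the class name, the 4096-byte page size, and the choice to count SBEs per page (rather than per cell) are all assumptions made for the example.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size in bytes, for illustration only


class PageRetirementTracker:
    """Toy model of the retirement rule; not the driver's implementation."""

    def __init__(self):
        self.sbe_counts = defaultdict(int)  # page number -> SBEs seen so far
        self.retired = set()                # page numbers taken out of service

    def record_error(self, address, double_bit):
        """Record one ECC error; return True if the page is now retired."""
        page = address // PAGE_SIZE
        if page in self.retired:
            return True
        if double_bit:
            # A single double-bit error retires the page immediately.
            self.retired.add(page)
        else:
            # Simplification: a second SBE on the same page retires it.
            self.sbe_counts[page] += 1
            if self.sbe_counts[page] >= 2:
                self.retired.add(page)
        return page in self.retired


tracker = PageRetirementTracker()
first_sbe = tracker.record_error(0x1000, double_bit=False)   # page 1, first SBE
second_sbe = tracker.record_error(0x1008, double_bit=False)  # page 1, second SBE
dbe = tracker.record_error(0x5000, double_bit=True)          # page 5, one DBE
```

In this sketch the first SBE leaves the page in service, the second SBE on the same page retires it, and a single DBE retires its page at once.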

Dynamic page retirement is built upon ECC and therefore works only when the customer enables ECC on Tesla products.
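
Because the feature depends on ECC, it can be useful to confirm the ECC mode before relying on it. The following is a minimal sketch that assumes nvidia-smi is on the PATH and that its ecc.mode.current query prints one line per GPU reading "Enabled" or "Disabled". The helper names are hypothetical, and the exact output strings should be checked against the installed driver.

```python
import subprocess

# One line of output per GPU; field name as listed by `nvidia-smi --help-query-gpu`.
ECC_QUERY = ["nvidia-smi", "--query-gpu=ecc.mode.current", "--format=csv,noheader"]


def all_gpus_have_ecc(query_output):
    """Return True if every non-empty line of the query output reads 'Enabled'."""
    lines = [line.strip() for line in query_output.splitlines() if line.strip()]
    return bool(lines) and all(line == "Enabled" for line in lines)


def check_ecc():
    """Run the query on the local machine (requires nvidia-smi on PATH)."""
    result = subprocess.run(ECC_QUERY, capture_output=True, text=True, check=True)
    return all_gpus_have_ecc(result.stdout)
```

Parsing is kept separate from the subprocess call so that the logic can be exercised on captured output without a GPU present.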

The amount of memory that can be retired over a product's lifetime (~256 KB) is insignificant compared to the total on-board memory (> 4 GB). The feature has no impact on performance and improves the data reliability of Tesla products. It is an important resiliency feature, especially in HPC and enterprise environments.
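
To put the size comparison in numbers, here is a short calculation. It assumes the retirement ceiling is about 256 kilobytes in binary units and takes 4 GB as the on-board memory, which is the lower bound quoted above. The function name is illustrative.

```python
KIB = 1024
GIB = 1024 ** 3


def retired_fraction(retired_bytes, total_bytes):
    """Fraction of on-board memory lost to retirement in the worst case."""
    return retired_bytes / total_bytes


# Approximate lifetime ceiling versus a 4 GB board (both figures from the text).
fraction = retired_fraction(256 * KIB, 4 * GIB)
print(f"{fraction:.6%}")  # prints 0.006104%
```

Under these assumptions the worst case is 1 part in 16,384 of the board's memory, and the share is smaller still on boards with more than 4 GB.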