Dynamic Page Offlining
Dynamic Page Offlining improves the resiliency and availability of NVIDIA GPUs in the presence of uncorrectable ECC errors. When the NVIDIA driver identifies the location of an uncorrectable error in frame buffer memory, it marks the page containing the error as unusable. From that point on, the page is not allocated to any currently executing or newly launched workload.
Dynamic Page Offlining is available on NVIDIA GPUs starting with the NVIDIA Ampere architecture. It is not available on previous generations of NVIDIA GPUs that do not support error containment.
GPUs that support dynamic page offlining do not require a GPU reset to recover from most uncorrectable ECC errors.
After the page is marked as unusable, it will not be mapped to the address space of any currently running or newly launched CUDA kernels.
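The behavior described above can be illustrated with a small toy model. This is only a conceptual sketch, not the NVIDIA driver's implementation: the `FrameBuffer` class, its method names, and the `PAGE_SIZE` value are all hypothetical and chosen purely for illustration. It shows a page being marked unusable once an uncorrectable error is located in it, and an allocator that skips that page for later requests.

```python
from dataclasses import dataclass, field

# Hypothetical page size in bytes, chosen only for this illustration.
PAGE_SIZE = 64 * 1024


@dataclass
class FrameBuffer:
    """Toy model of frame buffer memory split into fixed-size pages."""

    num_pages: int
    offlined: set = field(default_factory=set)     # pages marked unusable
    allocated: dict = field(default_factory=dict)  # page index -> owner

    def report_uncorrectable_error(self, address: int) -> int:
        """Mark the page containing `address` as unusable; return its index."""
        page = address // PAGE_SIZE
        if not 0 <= page < self.num_pages:
            raise ValueError("address is outside the frame buffer")
        self.offlined.add(page)
        # Simplification: this model just drops any existing mapping.
        self.allocated.pop(page, None)
        return page

    def allocate(self, owner: str) -> int:
        """Hand out the first page that is neither offlined nor in use."""
        for page in range(self.num_pages):
            if page not in self.offlined and page not in self.allocated:
                self.allocated[page] = owner
                return page
        raise MemoryError("no usable pages left")


if __name__ == "__main__":
    fb = FrameBuffer(num_pages=4)
    first = fb.allocate("workload-a")
    bad = fb.report_uncorrectable_error(PAGE_SIZE + 128)
    second = fb.allocate("workload-b")  # skips the offlined page
    print(first, bad, second)
```

In this model no reset step is needed after the error is reported: the remaining pages stay allocatable, which mirrors the point that most uncorrectable ECC errors do not require a GPU reset on supporting hardware.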