RAS Repair

GPU Memory Repair

GPU memory repair is another resiliency feature available on select Blackwell products. It works by swapping in a spare DRAM channel or L2 cache slice to replace one that is failing.

The L2 cache in an NVIDIA GPU is made up of multiple L2 slices.

For example, if spare channels are available and a DRAM bank is trending towards failure, the channel in which the bank resides can be swapped out for a spare. After two row remappings in the same bank, the next uncorrectable ECC error in that bank triggers a repair attempt if a spare is available. The same concept applies to a failing L2 slice.
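The escalation rule above can be pictured with a small Python model. This is only an illustrative sketch, not driver code: the class name, method name, and return strings are hypothetical, and it assumes that each of the first two uncorrectable errors in a bank is handled by a row remapping, which the text does not state explicitly.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# From the text: repair is attempted on the next uncorrectable ECC error
# after two row remappings in the same bank.
REMAP_THRESHOLD = 2


@dataclass
class ChannelRepairModel:
    """Toy model of the remap-then-repair escalation (hypothetical names)."""

    spare_channels: int
    remaps_per_bank: dict = field(default_factory=lambda: defaultdict(int))
    channels_marked: set = field(default_factory=set)

    def on_uncorrectable_error(self, channel: int, bank: int) -> str:
        key = (channel, bank)
        # Assumption: the first two errors in a bank are absorbed by row remapping.
        if self.remaps_per_bank[key] < REMAP_THRESHOLD:
            self.remaps_per_bank[key] += 1
            return "row_remap"
        # The channel is already queued for repair; nothing more to do.
        if channel in self.channels_marked:
            return "already_marked"
        # Threshold reached: attempt repair only if a spare channel exists.
        if self.spare_channels > 0:
            self.spare_channels -= 1
            self.channels_marked.add(channel)
            return "channel_marked_for_repair"
        return "no_spare_available"


model = ChannelRepairModel(spare_channels=1)
outcomes = [model.on_uncorrectable_error(channel=3, bank=7) for _ in range(3)]
```

In this model, the third error in the same bank is the one that marks the channel for repair; a bank in a different channel that later reaches the threshold finds no spare left.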

XID 160 is printed if a DRAM channel or L2 slice is successfully marked for repair. Channel repair requires a reboot to take effect.
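As a rough sketch of how an operator might detect this event, the following Python snippet scans kernel log lines for XID 160. It assumes the log lines follow the commonly seen `NVRM: Xid (<device>): <code>, ...` layout; the function name `find_xid_events` and the sample lines are hypothetical and shown only for illustration, so the pattern should be checked against real logs from the system in question.

```python
import re

# Assumed log layout: "NVRM: Xid (<device>): <code>, <details>".
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")


def find_xid_events(lines, code=160):
    """Return the device field of every log line reporting the given XID code."""
    devices = []
    for line in lines:
        match = XID_PATTERN.search(line)
        if match and int(match.group("code")) == code:
            devices.append(match.group("device"))
    return devices


# Hypothetical sample lines, not captured from a real system.
sample_log = [
    "NVRM: Xid (PCI:0000:3b:00): 160, example message text",
    "NVRM: Xid (PCI:0000:5e:00): 79, example message text",
    "unrelated kernel message",
]

pending_repair = find_xid_events(sample_log)
```

Any device returned here has a channel or slice marked for repair, and the host would need a reboot before the repair takes effect.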