RMA Policy

GPU DRAM Memory RMA Policy

The NVIDIA Field Diagnostic tool determines whether a GPU qualifies for RMA. For row-remapping failures, the RMA criteria are met when the row-remapping failure flag is set and validated by the field diagnostic.

For Blackwell GPUs, a third remapping attempt for an uncorrectable memory error on a bank invokes HBM channel repair if a spare channel is available. If the channel repair succeeds, the hardware is recovered and ready to use. Otherwise, the GPU continues remapping until the row-remapping failure flag is set.

Any of the following events will set the row-remapping failure flag:

  • A remapping attempt for an uncorrectable memory error on a bank that already has eight uncorrectable error rows remapped.

  • A remapping attempt for an uncorrectable memory error on a row that was already remapped. This can occur with fewer than eight total remaps to the same bank.

  • A total of 512 remappings for uncorrectable memory errors have occurred.
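The three conditions above can be sketched as a small bookkeeping model. This is only an illustration of the policy as stated here, not NVIDIA's implementation: the class `RowRemapModel`, its fields, and the `remap` method are hypothetical names, rows are identified by their original address, and the 512 limit is read as "the flag is set once the running total reaches 512". Blackwell HBM channel repair is not modeled.

```python
from dataclasses import dataclass, field

MAX_REMAPS_PER_BANK = 8   # first condition: bank already has eight remapped rows
MAX_TOTAL_REMAPS = 512    # third condition: total remappings across the GPU


@dataclass
class RowRemapModel:
    """Toy model of the row-remapping failure conditions (hypothetical)."""
    remapped: dict = field(default_factory=dict)  # bank -> set of remapped rows
    total: int = 0
    failure_flag: bool = False

    def remap(self, bank: int, row: int) -> bool:
        """Attempt a remap for an uncorrectable error; True if it succeeds."""
        rows = self.remapped.setdefault(bank, set())
        # Second condition: the row was already remapped.
        # First condition: the bank already holds eight remapped rows.
        if row in rows or len(rows) >= MAX_REMAPS_PER_BANK:
            self.failure_flag = True
            return False
        rows.add(row)
        self.total += 1
        # Third condition (assumed reading): flag once 512 remaps have occurred.
        if self.total >= MAX_TOTAL_REMAPS:
            self.failure_flag = True
        return True
```

For example, eight successful `remap(0, r)` calls on bank 0 leave `failure_flag` clear, and a ninth attempt on the same bank sets it.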

The row-remapping failure flag is available through in-band (NVML/nvidia-smi) and out-of-band (SMBPBI) tools.
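As an in-band illustration, the sketch below reads the flag by parsing CSV output from `nvidia-smi`. It assumes the `--query-remapped-rows` option and the field names shown in `QUERY_FIELDS`, and it assumes the failure field is reported as `1`/`0` or `yes`/`no`; check both against the `nvidia-smi` help for your driver version. The function names are hypothetical.

```python
import csv
import io
import subprocess

# Assumed field names for nvidia-smi's remapped-rows query.
QUERY_FIELDS = (
    "gpu_bus_id,remapped_rows.correctable,remapped_rows.uncorrectable,"
    "remapped_rows.pending,remapped_rows.failure"
)


def parse_remapped_rows(csv_text):
    """Turn nvidia-smi CSV text into a list of {field: value} dicts."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [r for r in reader if r]
    if not rows:
        return []
    header = [h.strip() for h in rows[0]]
    return [dict(zip(header, (v.strip() for v in r))) for r in rows[1:]]


def failure_flag_set(record):
    """True if the row-remapping failure field reads as set."""
    return record["remapped_rows.failure"].lower() in {"1", "yes", "true"}


def query_remapped_rows():
    """Run nvidia-smi (requires an NVIDIA driver) and parse its output."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-remapped-rows={QUERY_FIELDS}", "--format=csv"],
        capture_output=True, text=True, check=True,
    )
    return parse_remapped_rows(result.stdout)
```

On a host with the driver installed, `[r["gpu_bus_id"] for r in query_remapped_rows() if failure_flag_set(r)]` would list the GPUs reporting the flag. Per the policy above, the flag must still be validated by the field diagnostic before a GPU qualifies for RMA.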

GPU L2 Memory RMA Policy

The NVIDIA Field Diagnostic tool determines whether a GPU qualifies for RMA. For SRAM uncorrectable errors, the RMA criteria are met for the events outlined below.

Any of the following events will trigger the SRAM Threshold Exceeded flag:

  • More than 4 UCE Unique Count events within an address bank for parity-protected SRAMs.

  • More than 2 UCE Unique Count events within an address bank for SECDED ECC-protected SRAMs.
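The two thresholds above amount to a strict "greater than" comparison per address bank. A minimal sketch, assuming the per-bank UCE Unique Count has already been obtained by some other means; the function name and the `"parity"`/`"secded"` keys are hypothetical:

```python
# Per-bank UCE Unique Count limits from the policy above.
UCE_UNIQUE_COUNT_LIMIT = {"parity": 4, "secded": 2}


def sram_threshold_exceeded(protection, uce_unique_count):
    """True if a bank's count is strictly above the limit for its protection."""
    limit = UCE_UNIQUE_COUNT_LIMIT[protection.lower()]
    return uce_unique_count > limit
```

Because the policy says "more than", a parity-protected bank with exactly 4 events, or a SECDED-protected bank with exactly 2, does not exceed the threshold.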