RMA Policy
GPU DRAM Memory RMA Policy
The NVIDIA Field Diagnostic tool determines whether a GPU qualifies for RMA. For row-remapping failures, the RMA criteria are met when the row-remapping failure flag is set and validated by the field diagnostic.
For Blackwell GPUs, the third remapping attempt for an uncorrectable memory error on a bank invokes HBM channel repair if a spare channel is available. If the channel repair succeeds, the hardware is recovered and ready to use. Otherwise, the GPU continues remapping until the row-remapping failure flag is set.
Any of the following events sets the row-remapping failure flag:
A remapping attempt for an uncorrectable memory error on a bank that already has eight uncorrectable error rows remapped.
A remapping attempt for an uncorrectable memory error on a row that was already remapped. This can occur with fewer than eight total remaps to the same bank.
A total of 512 remappings for uncorrectable memory errors has occurred.
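The three conditions above can be sketched as a small bookkeeping model. This is an illustration only, not the driver's or firmware's implementation: the class name `RowRemapModel`, its method, and the bank/row identifiers are hypothetical, and the model assumes the flag is raised as soon as the 512th remap has been recorded.

```python
from dataclasses import dataclass, field

MAX_UCE_REMAPS_PER_BANK = 8    # limit stated in the policy
MAX_TOTAL_UCE_REMAPS = 512     # limit stated in the policy


@dataclass
class RowRemapModel:
    """Hypothetical model of the row-remapping failure conditions."""
    remapped_rows: dict = field(default_factory=dict)  # bank -> set of remapped rows
    total_remaps: int = 0
    failure_flag: bool = False

    def remap_uncorrectable(self, bank, row):
        """Record a remap attempt; return True if the remap succeeded."""
        rows = self.remapped_rows.setdefault(bank, set())
        if len(rows) >= MAX_UCE_REMAPS_PER_BANK:
            # The bank already has eight uncorrectable-error rows remapped.
            self.failure_flag = True
            return False
        if row in rows:
            # The row was already remapped, regardless of the bank's count.
            self.failure_flag = True
            return False
        rows.add(row)
        self.total_remaps += 1
        if self.total_remaps >= MAX_TOTAL_UCE_REMAPS:
            # Assumption: the flag is raised once 512 remaps have occurred.
            self.failure_flag = True
        return True
```

Note that the second condition can raise the flag after only two attempts on the same row, which is why a GPU with few total remaps may still report a failure.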
The row-remapping failure flag is available through in-band (NVML/nvidia-smi) and out-of-band (SMBPBI) tools.
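As a rough in-band example, the flag can be read by asking `nvidia-smi` for its remapped-rows query in CSV form and parsing the result. The sketch below makes several assumptions: the field names should be confirmed with `nvidia-smi --help-query-remapped-rows` for the installed driver, the exact formatting of the boolean columns is not guaranteed (the parser accepts yes/no, true/false, or 1/0), and the helper names are hypothetical.

```python
import csv
import io
import subprocess

# Query fields, in the order the parser expects them (verify for your driver).
FIELDS = [
    "gpu_bus_id",
    "remapped_rows.correctable",
    "remapped_rows.uncorrectable",
    "remapped_rows.pending",
    "remapped_rows.failure",
]


def _as_bool(value):
    # Assumption: booleans may be printed as yes/no, true/false, or 1/0.
    return value.strip().lower() in {"yes", "true", "1"}


def _as_int(value):
    # Unsupported GPUs may report a non-numeric placeholder; map it to None.
    value = value.strip()
    return int(value) if value.isdigit() else None


def parse_remapped_rows(csv_text):
    """Parse headerless CSV rows in FIELDS order into dictionaries."""
    gpus = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank or unexpected lines
        bus_id, corr, unc, pending, failure = record
        gpus.append({
            "bus_id": bus_id.strip(),
            "correctable": _as_int(corr),
            "uncorrectable": _as_int(unc),
            "pending": _as_bool(pending),
            "failure": _as_bool(failure),
        })
    return gpus


def query_remapped_rows():
    """Run nvidia-smi and return one dictionary per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-remapped-rows=" + ",".join(FIELDS),
        "--format=csv,noheader",
    ]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return parse_remapped_rows(out)


def gpus_with_failure_flag(gpus):
    """Return the bus IDs of GPUs whose row-remapping failure flag is set."""
    return [gpu["bus_id"] for gpu in gpus if gpu["failure"]]
```

Calling `gpus_with_failure_flag(query_remapped_rows())` on a host with the NVIDIA driver installed would list the GPUs to run the field diagnostic on. The flag alone does not qualify a GPU for RMA until the diagnostic validates it.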
GPU L2 Memory RMA Policy
The NVIDIA Field Diagnostic tool determines whether a GPU qualifies for RMA. For SRAM uncorrectable errors, the RMA criteria are met for the events outlined below.
Any of the following events will trigger the SRAM Threshold Exceeded flag:
More than 4 UCE Unique Count events within an address bank for parity-protected SRAMs.
More than 2 UCE Unique Count events within an address bank for SECDED ECC-protected SRAMs.
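The two thresholds can be illustrated with a small counter. This is a sketch under stated assumptions rather than the actual accounting: it treats a "unique" event as a distinct error address within one bank, and the class name `SramUceModel`, the protection labels, and the bank/address identifiers are hypothetical.

```python
from collections import defaultdict

# Limits stated in the policy: the flag is set when the count exceeds the limit.
UCE_UNIQUE_LIMIT = {"parity": 4, "secded": 2}


class SramUceModel:
    """Hypothetical model of the SRAM Threshold Exceeded conditions."""

    def __init__(self):
        # (protection, bank) -> set of distinct addresses with an uncorrectable error
        self._unique = defaultdict(set)
        self.threshold_exceeded = False

    def record_uce(self, protection, bank, address):
        """Record one uncorrectable error; return the current flag state."""
        if protection not in UCE_UNIQUE_LIMIT:
            raise ValueError(f"unknown protection type: {protection!r}")
        seen = self._unique[(protection, bank)]
        seen.add(address)  # assumption: repeats at the same address are not unique
        if len(seen) > UCE_UNIQUE_LIMIT[protection]:
            self.threshold_exceeded = True
        return self.threshold_exceeded
```

Under this model, the fifth distinct error in a parity-protected bank, or the third in a SECDED ECC-protected bank, sets the flag.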