User Visible Statistics
Previously, end-users or sysadmins could use the page retirement count to monitor the health of the GPU (and possible RMA conditions) and determine whether to reset the GPU or reload the module for page offlining to take effect. For the same purpose, row-remapping statistics will be exposed to users to give an indication of the health of the GPU memory. This section describes the row-remapping statistics that are available via in-band and out-of-band reporting mechanisms.
In-band reporting
XID error log (see Table 2 for a list of XID log examples for uncorrectable ECC errors)
XID 48: This XID indicates an uncorrectable ECC error has occurred. XID 171 indicates that the error occurred in DRAM, and XID 172 indicates that it occurred in SRAM
XID 94: This XID indicates a contained ECC error has occurred
XID 95: This XID indicates an uncontained ECC error has occurred
XID 63: This XID indicates successful recording of a row-remapping entry to the InfoROM
XID 64: This XID indicates a failure in recording a row-remapping entry to the InfoROM
XID 160: This XID indicates that GPU memory is successfully marked for repair.
NVML/nvidia-smi
Number of remapped rows (correctable and uncorrectable). This is the number of entries recorded in the InfoROM, not the ones remapped in hardware.
Row remapping pending Boolean
Row remapping failure Boolean
Bucketized counts
Refer to https://docs.nvidia.com/deploy/nvml-api/index.html for more information about NVML.
Refer to https://developer.nvidia.com/nvidia-system-management-interface for more information about nvidia-smi.
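As a rough illustration of reading these values in-band, the sketch below assumes the nvidia-ml-py Python bindings (imported as `pynvml`) and their `nvmlDeviceGetRemappedRows` call, which is expected to return the correctable count, uncorrectable count, pending flag, and failure flag as a four-tuple. The helper names `remapped_rows_to_dict` and `query_remapped_rows` are hypothetical, and the behavior should be checked against the NVML reference for the installed driver.

```python
from typing import Dict, Tuple, Union


def remapped_rows_to_dict(
    values: Tuple[int, int, int, int],
) -> Dict[str, Union[int, bool]]:
    """Turn the (correctable, uncorrectable, pending, failure) tuple into a dict.

    The tuple layout is an assumption based on the NVML call
    nvmlDeviceGetRemappedRows; the counts reflect InfoROM entries,
    not rows remapped in hardware.
    """
    correctable, uncorrectable, pending, failure = values
    return {
        "correctable": int(correctable),
        "uncorrectable": int(uncorrectable),
        "total": int(correctable) + int(uncorrectable),
        "pending": bool(pending),   # row remapping pending Boolean
        "failure": bool(failure),   # row remapping failure Boolean
    }


def query_remapped_rows(gpu_index: int = 0) -> Dict[str, Union[int, bool]]:
    """Query one GPU through NVML (requires a GPU and the pynvml module)."""
    import pynvml  # imported lazily so the helper above works without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        return remapped_rows_to_dict(pynvml.nvmlDeviceGetRemappedRows(handle))
    finally:
        pynvml.nvmlShutdown()
```

On a system with a supported GPU, `query_remapped_rows(0)` would return the four statistics for the first device; the conversion helper can be exercised on its own with a hand-written tuple.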
Out-of-band reporting (SMBPBI)
Number of remapped rows (correctable and uncorrectable). This is the number of entries recorded in the InfoROM, not the ones remapped in hardware.
Row remapping pending Boolean
Row remapping failure Boolean
Bucketized counts
Table 3 lists the new SMBPBI APIs for error reporting. For additional details, refer to the NVIDIA SMBus Post-Box Interface (SMBPBI) Software Design Guide (DG-06034-002).
Error Type | XID Log |
---|---|
Contained error with MIG enabled | NVRM: Xid (PCI:0000:01:00 GPU-I:05): 94, pid=7194, Contained: CE User Channel (0x9). RST: No, D-RST: No |
Contained error with MIG disabled | NVRM: Xid (PCI:0000:01:00): 94, pid=7062, Contained: CE User Channel (0x9). RST: No, D-RST: No |
Uncontained error | NVRM: Xid (PCI:0000:01:00): 95, pid=7062, Uncontained: LTC TAG (0x2,0x0). RST: Yes, D-RST: No |
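The XID log lines in Table 2 follow a regular shape, so a monitoring script could extract the XID number with a regular expression. The following is a minimal sketch under that assumption: the pattern is derived only from the three example lines above, the names `XidEvent`, `ECC_XIDS`, and `parse_xid_line` are hypothetical, and treating the `GPU-I` field as a GPU instance identifier is an assumption.

```python
import re
from typing import NamedTuple, Optional


class XidEvent(NamedTuple):
    pci: str                     # PCI address as printed in the log
    gpu_instance: Optional[str]  # GPU-I field, present in the MIG-enabled example
    xid: int
    message: str


# Descriptions taken from the in-band XID list in this section.
ECC_XIDS = {
    48: "uncorrectable ECC error",
    63: "row-remapping entry recorded to the InfoROM",
    64: "failure recording a row-remapping entry to the InfoROM",
    94: "contained ECC error",
    95: "uncontained ECC error",
    160: "GPU memory marked for repair",
}

# Pattern inferred from the Table 2 examples; real logs may differ.
_XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)"
    r"(?: GPU-I:(?P<gi>\d+))?\): "
    r"(?P<xid>\d+), (?P<msg>.*)"
)


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Return the parsed XID event, or None if the line does not match."""
    match = _XID_RE.search(line)
    if match is None:
        return None
    return XidEvent(
        pci=match["pci"],
        gpu_instance=match["gi"],
        xid=int(match["xid"]),
        message=match["msg"].strip(),
    )
```

A caller could then look up `ECC_XIDS.get(event.xid)` to decide whether a log line is relevant to ECC or row-remapping monitoring.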
Opcode | Description |
---|---|
0x1E | Request ECC statistics (format V6) |
0x20 | Request row-remapping related statistics |
Table 5 shows an example of bucketized count where all remapping resources are available.
Bucket Name | Row Remap Availability (shown as reference only) | Number of Banks per Bucket |
---|---|---|
Max | 8 | 640 |
High | 7 | 0 |
Partial | 2 to 6 | 0 |
Low | 1 | 0 |
None | 0 | 0 |
Note
The Row Remap Availability column is shown only for reference purposes. The information is not available through APIs or nvidia-smi.
Table 6 shows an example of bucketized counts where 635 banks have eight remapping resources available, 3 banks have seven, and 2 banks have one.
Bucket Name | Row Remap Availability (shown as reference only) | Number of Banks per Bucket |
---|---|---|
Max | 8 | 635 |
High | 7 | 3 |
Partial | 2 to 6 | 0 |
Low | 1 | 2 |
None | 0 | 0 |
Note
The Row Remap Availability column is shown only for reference purposes. The information is not available through APIs or nvidia-smi.
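To make the bucket boundaries in Tables 5 and 6 concrete, the sketch below maps a per-bank count of remaining remapping resources to a bucket name and tallies the result. It is illustrative only: per-bank availability is not exposed through APIs or nvidia-smi, the maximum of eight resources per bank is taken from the example tables, and the names `bucket_for` and `bucketize` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

BUCKETS = ("Max", "High", "Partial", "Low", "None")
MAX_RESOURCES = 8  # assumed from the example tables


def bucket_for(available: int) -> str:
    """Map remaining remapping resources in one bank to its bucket name."""
    if not 0 <= available <= MAX_RESOURCES:
        raise ValueError(f"availability out of range: {available}")
    if available == MAX_RESOURCES:
        return "Max"
    if available == MAX_RESOURCES - 1:
        return "High"
    if available >= 2:
        return "Partial"  # 2 to 6 remaining
    if available == 1:
        return "Low"
    return "None"


def bucketize(availability_per_bank: Iterable[int]) -> Dict[str, int]:
    """Count how many banks fall into each bucket, including empty buckets."""
    counts = Counter(bucket_for(a) for a in availability_per_bank)
    return {name: counts.get(name, 0) for name in BUCKETS}
```

Feeding in 635 banks with eight resources, 3 banks with seven, and 2 banks with one reproduces the counts shown in Table 6.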