User Visible Statistics

Previously, end users and system administrators could use the page retirement count to monitor the health of the GPU (and possible RMA conditions) and to determine whether to reset the GPU or reload the module for page offlining to take effect. For the same purpose, row-remapping statistics are exposed to users to give an indication of the health of the GPU memory. This section describes the row-remapping statistics that are available through in-band and out-of-band reporting mechanisms.

  • In-band reporting

    • XID error log (see Table 3 for a list of XID log examples for uncorrectable ECC errors)

      • XID 48: This XID indicates that an uncorrectable ECC error has occurred. XID 171 indicates that the error is in DRAM, and XID 172 indicates that it is in SRAM

      • XID 94: This XID indicates a contained ECC error has occurred

      • XID 95: This XID indicates an uncontained ECC error has occurred

      • XID 63: This XID indicates successful recording of a row-remapping entry to the InfoROM

      • XID 64: This XID indicates a failure in recording a row-remapping entry to the InfoROM

      • XID 160: This XID indicates that GPU memory has been successfully marked for repair

    • NVML/nvidia-smi

  • Out-of-band reporting (SMBPBI)

    • Number of remapped rows (correctable and uncorrectable). This is the number of entries recorded in the InfoROM, not the number of rows remapped in hardware.

    • Row remapping pending Boolean

    • Row remapping failure Boolean

    • Bucketized counts

    • Table 4 lists the new SMBPBI APIs for error reporting. For additional details, refer to the NVIDIA SMBus Post-Box Interface (SMBPBI) Software Design Guide (DG-06034-002).
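
The in-band NVML/nvidia-smi path listed above can be polled from a script. The following is a minimal sketch, not an official tool: it assumes the nvidia-smi `--query-remapped-rows` option with the `remapped_rows.*` fields, and it assumes that the pending and failure columns may be printed either as 0/1 or as No/Yes, so both are accepted. The helper names (`parse_remapped_rows`, `query_remapped_rows`) are hypothetical.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
FIELDS = [
    "gpu_bus_id",
    "remapped_rows.correctable",
    "remapped_rows.uncorrectable",
    "remapped_rows.pending",
    "remapped_rows.failure",
]


def _count(text):
    """Return an integer count, or None when the field is not numeric (for example "[N/A]")."""
    text = text.strip()
    return int(text) if text.isdigit() else None


def _flag(text):
    """Interpret a Boolean column; assumed to be 0/1 or No/Yes."""
    return text.strip().lower() in ("1", "yes", "true")


def parse_remapped_rows(csv_text):
    """Parse headerless CSV text into one dict per GPU."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(rec) != len(FIELDS):
            continue  # skip blank or unexpected lines
        rows.append({
            "bus_id": rec[0].strip(),
            "correctable": _count(rec[1]),
            "uncorrectable": _count(rec[2]),
            "pending": _flag(rec[3]),
            "failure": _flag(rec[4]),
        })
    return rows


def query_remapped_rows():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver and GPU)."""
    cmd = [
        "nvidia-smi",
        "--query-remapped-rows=" + ",".join(FIELDS),
        "--format=csv,noheader",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_remapped_rows(out)
```

Keeping the parser separate from the subprocess call means the parsing logic can be exercised on saved output from a machine without a GPU. The correctable and uncorrectable counts correspond to the InfoROM entries described above, and the pending and failure values correspond to the two Booleans.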

Table 3 Uncorrectable ECC Errors XID Log Examples

| Error Type | XID Log |
| --- | --- |
| Contained error with MIG enabled | NVRM: Xid (PCI:0000:01:00 GPU-I:05): 94, pid=7194, Contained: CE User Channel (0x9). RST: No, D-RST: No |
| Contained error with MIG disabled | NVRM: Xid (PCI:0000:01:00): 94, pid=7062, Contained: CE User Channel (0x9). RST: No, D-RST: No |
| Uncontained error | NVRM: Xid (PCI:0000:01:00): 95, pid=7062, Uncontained: LTC TAG (0x2,0x0). RST: Yes, D-RST: No |
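
Log lines of the kind shown in the XID log examples can be picked apart with a regular expression. The sketch below is inferred only from the three example lines in this section, so the pattern is an assumption about the general layout and may need adjusting for other XIDs or driver versions. The names `XID_RE`, `XID_MEANING`, and `parse_xid_line` are hypothetical.

```python
import re

# Pattern inferred from the example log lines; not an official grammar.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)"
    r"(?: GPU-I:(?P<gpu_instance>\d+))?\): "   # GPU-I appears only with MIG enabled
    r"(?P<xid>\d+), pid=(?P<pid>\d+), "
    r"(?P<detail>.*?)\. "
    r"RST: (?P<rst>Yes|No), D-RST: (?P<drst>Yes|No)"
)

# Short descriptions taken from the XID list in this section.
XID_MEANING = {
    48: "uncorrectable ECC error",
    63: "row-remapping entry recorded to the InfoROM",
    64: "failure recording a row-remapping entry to the InfoROM",
    94: "contained ECC error",
    95: "uncontained ECC error",
    160: "GPU memory marked for repair",
}


def parse_xid_line(line):
    """Return a dict describing the XID event, or None if the line does not match."""
    m = XID_RE.search(line)
    if m is None:
        return None
    xid = int(m.group("xid"))
    return {
        "pci": m.group("pci"),
        "gpu_instance": m.group("gpu_instance"),  # None when MIG is disabled
        "xid": xid,
        "pid": int(m.group("pid")),
        "detail": m.group("detail"),
        "rst": m.group("rst") == "Yes",
        "d_rst": m.group("drst") == "Yes",
        "meaning": XID_MEANING.get(xid, "unknown"),
    }
```

Applied to the uncontained-error example, `parse_xid_line` yields XID 95 with the RST field set and the D-RST field clear, which a monitoring script could use to flag the GPU for attention.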

Table 4 SMBPBI APIs for NVIDIA A100 Memory Error Reporting

| Opcode | Description |
| --- | --- |
| 0x1E | Request ECC statistics (format V6) |
| 0x20 | Request row-remapping related statistics |

Table 5 shows an example of bucketized count where all remapping resources are available.

Table 5 Bucketized Count Example #1

| Bucket Name | Row Remap Availability (shown as reference only) | Number of Banks per Bucket |
| --- | --- | --- |
| Max | 8 | 640 |
| High | 7 | 0 |
| Partial | 2 to 6 | 0 |
| Low | 1 | 0 |
| None | 0 | 0 |

Note

The Row Remap Availability column is shown only for reference purposes. The information is not available through APIs or nvidia-smi.

Table 6 shows an example of bucketized count where 635 banks have 8 remapping resources available, 3 banks have 7, and 2 banks have only 1.

Table 6 Bucketized Count Example #2

| Bucket Name | Row Remap Availability (shown as reference only) | Number of Banks per Bucket |
| --- | --- | --- |
| Max | 8 | 635 |
| High | 7 | 3 |
| Partial | 2 to 6 | 0 |
| Low | 1 | 2 |
| None | 0 | 0 |

Note

The Row Remap Availability column is shown only for reference purposes. The information is not available through APIs or nvidia-smi.
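
To make the bucket boundaries concrete, the sketch below reproduces the two bucketized count examples from a list of per-bank remaining remap resources. This is purely illustrative: as the notes state, the per-bank availability is not exposed through APIs or nvidia-smi, so the input list here is hypothetical, and the function names (`bucket_for`, `bucketize`) are not part of any NVIDIA interface.

```python
from collections import Counter

# Bucket names in the order used by the bucketized count tables.
BUCKETS = ("Max", "High", "Partial", "Low", "None")


def bucket_for(remaining):
    """Map the number of remaining remap resources in one bank (0 to 8) to a bucket name."""
    if remaining == 8:
        return "Max"
    if remaining == 7:
        return "High"
    if 2 <= remaining <= 6:
        return "Partial"
    if remaining == 1:
        return "Low"
    if remaining == 0:
        return "None"
    raise ValueError(f"remaining resources out of range: {remaining}")


def bucketize(remaining_per_bank):
    """Count how many banks fall into each bucket."""
    counts = Counter(bucket_for(r) for r in remaining_per_bank)
    return {name: counts.get(name, 0) for name in BUCKETS}


# Example #1: all 640 banks still have all 8 resources.
example_1 = bucketize([8] * 640)

# Example #2: 635 banks with 8, 3 banks with 7, and 2 banks with 1.
example_2 = bucketize([8] * 635 + [7] * 3 + [1] * 2)
```

Here `example_1` places all 640 banks in the Max bucket, and `example_2` gives 635 in Max, 3 in High, and 2 in Low, matching the two tables.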