GPU Memory Plugin
Overview
The GPU Memory Plugin is a hardware diagnostic test that validates GPU memory integrity and functionality. It performs comprehensive memory testing to detect hardware faults, ECC errors, and memory corruption issues on NVIDIA GPUs.
Test Description
The memory plugin consists of two main test components:
Main Memory Test
Purpose: Tests GPU memory allocation, writing, and reading operations
Process:
Allocates a significant portion of GPU memory (default 75% of total memory)
Writes specific test patterns to memory using CUDA kernels
Reads back the data and verifies integrity
Uses five test patterns: 0x00, 0xAA, 0x55, 0xFF, 0x00 (0x00 appears twice, as the first and last entries)
Detects memory mismatches and ECC errors
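The write, read-back, and verify steps above can be sketched on the host as follows. This is a CPU-side illustration only, not the plugin's CUDA implementation: a `bytearray` stands in for the GPU allocation, and the `fault` argument is a hypothetical hook that simulates a bit stuck at 1.

```python
# CPU-side illustration only: a bytearray stands in for the GPU allocation.
PATTERNS = [0x00, 0xAA, 0x55, 0xFF, 0x00]  # pattern order as listed above


def run_pattern_test(buffer, patterns=PATTERNS, fault=None):
    """Write each pattern, read it back, and collect mismatches.

    `fault` is a hypothetical (offset, bit_mask) pair that forces the masked
    bits of one byte to 1 after every write, simulating a stuck-at-1 cell.
    """
    mismatches = []
    for pattern in patterns:
        # Write phase: fill the whole buffer with the pattern byte.
        buffer[:] = bytes([pattern]) * len(buffer)
        if fault is not None:
            offset, mask = fault
            buffer[offset] |= mask
        # Verify phase: compare every byte against the expected pattern.
        for offset, value in enumerate(buffer):
            if value != pattern:
                mismatches.append((offset, pattern, value))
    return mismatches


if __name__ == "__main__":
    healthy = run_pattern_test(bytearray(1024))
    faulty = run_pattern_test(bytearray(1024), fault=(100, 0x01))
    print(len(healthy), len(faulty))
```

In this sketch a stuck bit goes unnoticed under any pattern that already sets that bit to 1, which is one reason a sequence of complementary patterns such as 0xAA and 0x55 is useful.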
L1 Cache Test
Purpose: Tests L1 cache functionality and detects cache-related issues
Requirements:
Compute capability 7.0 or higher
L1 cache size ≤ 256KB per SM
Controlled by the l1_is_allowed parameter
Process: Performs cache operations and validates data integrity
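The L1 cache test requirements above can be read as a simple gate. The sketch below is an illustration under that reading, not DCGM code; the function name and its arguments are hypothetical, and the caller is assumed to have already obtained the device properties.

```python
def l1_test_eligible(compute_major, compute_minor, l1_kb_per_sm, l1_is_allowed):
    """Return True when all three documented requirements hold.

    Hypothetical helper: the arguments are assumed to come from the caller's
    own device query and from the plugin parameters.
    """
    # Compare capability as a (major, minor) tuple so 10.0 sorts above 7.0.
    capable = (compute_major, compute_minor) >= (7, 0)
    # The documented limit is an L1 cache size of at most 256 KB per SM.
    small_enough = l1_kb_per_sm <= 256
    return bool(l1_is_allowed and capable and small_enough)


if __name__ == "__main__":
    print(l1_test_eligible(8, 0, 128, True))
    print(l1_test_eligible(6, 1, 48, True))
```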
Supported Parameters
The following table lists the global parameters for the memory plugin.
| Parameter Name | Type | Default | Description |
|---|---|---|---|
| is_allowed | Bool | When unspecified, DCGM will search configuration files for an appropriate default for the specified GPU; if no entry is found, the parameter defaults to false. | Specifies whether or not this test is allowed to run. |
The following table lists the parameters for the main memory test.
| Parameter Name | Type | Default | Description |
|---|---|---|---|
| minimum_allocation_percentage | Double | 75.0 | Minimum percentage of GPU memory to allocate (0.0-100.0). |
| max_free_memory_mb | Double | 0.0 | Maximum amount of GPU memory, in megabytes (MB), to allocate for memory testing. Supersedes minimum_allocation_percentage. If not specified, the test allocates the percentage of GPU memory given by minimum_allocation_percentage. |
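One way to read the interaction between the two allocation parameters is sketched below. It is an interpretation rather than the plugin's actual logic, and it rests on three assumptions: the default of 0.0 for `max_free_memory_mb` means "not specified", one MB is 1024 * 1024 bytes, and the result is capped at total memory. The function name is hypothetical.

```python
MIB = 1024 * 1024  # assumption: "MB" in the parameter name means 2**20 bytes


def target_allocation_bytes(total_bytes,
                            minimum_allocation_percentage=75.0,
                            max_free_memory_mb=0.0):
    """Pick an allocation size from the two documented parameters.

    Assumption: max_free_memory_mb == 0.0 (the default) means "not specified".
    """
    if not 0.0 <= minimum_allocation_percentage <= 100.0:
        raise ValueError("minimum_allocation_percentage must be in [0.0, 100.0]")
    if max_free_memory_mb > 0.0:
        # The explicit megabyte value supersedes the percentage.
        return min(int(max_free_memory_mb * MIB), total_bytes)
    # Otherwise fall back to the percentage of total memory.
    return int(total_bytes * minimum_allocation_percentage / 100.0)


if __name__ == "__main__":
    sixteen_gib = 16 * 1024 * MIB
    print(target_allocation_bytes(sixteen_gib))
    print(target_allocation_bytes(sixteen_gib, max_free_memory_mb=512))
```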
The following table lists the parameters for the L1 cache test.
| Parameter Name | Type | Default | Description |
|---|---|---|---|
| l1_is_allowed | Bool | When unspecified, DCGM will search configuration files for an appropriate default for the specified GPU; if no entry is found, the parameter defaults to false. | Enables/disables the L1 cache subtest. |
| test_loops | Double | 0 | Number of test loops. |
| test_duration | Double | 1.0 | Duration of L1 cache test in seconds (0 = run for test_loops). |
| inner_iterations | Double | 1024 | Number of inner iterations per test. |
| log_len | Double | 8192 | Length of error log for L1 cache test. |
| dump_miscompares | Bool | True | Whether to dump miscompare details. |
| l1cache_size_kb_per_sm | Double | 0 | L1 cache size per SM in KB. |
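The relationship between `test_duration` and `test_loops` can be illustrated with a small loop driver. This is a sketch under assumptions, not the plugin's scheduler: it assumes a non-zero `test_duration` takes precedence over `test_loops`, and `run_once` and `clock` are hypothetical stand-ins for one L1 test pass and a time source.

```python
import time


def run_l1_loops(run_once, test_duration=1.0, test_loops=0, clock=time.monotonic):
    """Call run_once() until the time budget or the loop count is used up.

    Assumption: test_duration > 0 selects time-based mode; test_duration == 0
    selects the fixed loop count, as the parameter description suggests.
    """
    completed = 0
    if test_duration > 0:
        deadline = clock() + test_duration
        while clock() < deadline:
            run_once()
            completed += 1
    else:
        for _ in range(int(test_loops)):
            run_once()
            completed += 1
    return completed


if __name__ == "__main__":
    # Fixed-count mode: duration 0 hands control to test_loops.
    print(run_l1_loops(lambda: None, test_duration=0, test_loops=5))
```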
Sample Commands
Basic Memory Test
$ dcgmi diag -r memory -p "memory.is_allowed=True"
With L1 Cache Test Enabled
$ dcgmi diag -r memory -p "memory.is_allowed=True;memory.l1_is_allowed=True"
Controlling Memory Allocation with max_free_memory_mb
$ dcgmi diag -r memory -p "memory.is_allowed=True;memory.max_free_memory_mb=0.1"
$ dcgmi diag -r memory -p "memory.is_allowed=True;memory.max_free_memory_mb=512"
$ dcgmi diag -r memory -p "memory.is_allowed=True;memory.max_free_memory_mb=100"
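When these commands are generated from a script, the parameter string can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of DCGM; it only reproduces the `plugin.name=value` pairs joined by semicolons that the sample commands use, and it builds the argument list without running anything.

```python
import shlex


def build_diag_command(params, plugin="memory"):
    """Return an argv list mirroring the sample commands above.

    Hypothetical helper: `params` maps parameter names to values, and each
    entry becomes "<plugin>.<name>=<value>", joined with ';'.
    """
    joined = ";".join(f"{plugin}.{name}={value}" for name, value in params.items())
    return ["dcgmi", "diag", "-r", plugin, "-p", joined]


if __name__ == "__main__":
    cmd = build_diag_command({"is_allowed": True, "max_free_memory_mb": 512})
    # shlex.join quotes the parameter string so it can be pasted into a shell.
    print(shlex.join(cmd))
```

On a host where `dcgmi` is installed, the returned list could be passed to `subprocess.run`.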
Failure Conditions
Memory Allocation Failure (DCGM_FR_MEMORY_ALLOC): Cannot allocate the minimum required percentage of GPU memory.
Memory Mismatch (DCGM_FR_MEMORY_MISMATCH): Data written to memory does not match the data read back.
CUDA Double-Bit Error (DCGM_FR_CUDA_DBE): CUDA detects an uncorrectable double-bit ECC error.
L1 Cache Miscompare (DCGM_FR_L1TAG_MISCOMPARE): The L1 cache test detects data corruption.