GPU Memory Plugin
Overview
The GPU Memory Plugin is a hardware diagnostic test that validates GPU memory integrity and functionality. It performs comprehensive memory testing to detect hardware faults, ECC errors, and memory corruption issues on NVIDIA GPUs.
Test Description
The memory plugin consists of two main test components:
Main Memory Test
Purpose: Tests GPU memory allocation, writing, and reading operations
Process:
Allocates a significant portion of GPU memory (default 75% of total memory)
Writes specific test patterns to memory using CUDA kernels
Reads back the data and verifies integrity
Runs a sequence of 5 test patterns: 0x00, 0xAA, 0x55, 0xFF, 0x00 (0x00 is used for both the first and the last pass)
Detects memory mismatches and ECC errors
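The write, read-back, and verify cycle above can be illustrated on the host with a plain byte buffer. This is a minimal sketch, not the plugin's implementation: the real test runs CUDA kernels against GPU memory, and `run_pattern_pass`, `find_mismatches`, and the `corrupt_at` hook are hypothetical names introduced only for illustration.

```python
# Host-side illustration only; the plugin itself uses CUDA kernels on GPU memory.
PATTERNS = [0x00, 0xAA, 0x55, 0xFF, 0x00]  # pattern sequence listed above


def find_mismatches(buffer, pattern):
    """Return the offsets whose byte differs from the expected pattern."""
    return [i for i, b in enumerate(buffer) if b != pattern]


def run_pattern_pass(buffer, pattern, corrupt_at=None):
    """Fill the buffer with one pattern, then read it back and verify.

    corrupt_at is a hypothetical hook that flips one bit after the write
    to simulate a faulty memory cell.
    """
    for i in range(len(buffer)):
        buffer[i] = pattern
    if corrupt_at is not None:
        buffer[corrupt_at] ^= 0x01
    return find_mismatches(buffer, pattern)


buf = bytearray(1024)
clean = [run_pattern_pass(buf, p) for p in PATTERNS]
faulty = run_pattern_pass(buf, 0xAA, corrupt_at=10)
print(clean)   # one list of mismatch offsets per pattern pass
print(faulty)  # offsets that failed verification in the corrupted pass
```

0xAA (10101010) and 0x55 (01010101) are bitwise complements, so between those two passes every bit is written as both 0 and 1.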
L1 Cache Test
Purpose: Tests L1 cache functionality and detects cache-related issues
Requirements:
Compute capability 7.0 or higher
L1 cache size ≤ 256KB per SM
Controlled by parameter l1_is_allowed
Process: Performs cache operations and validates data integrity
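The requirements listed for the L1 cache test can be read as a single eligibility check. The sketch below is an assumption about how such a gate could be written; `GpuInfo` and `l1_test_eligible` are hypothetical names, not DCGM types or functions.

```python
from dataclasses import dataclass

MIN_COMPUTE_CAPABILITY = (7, 0)  # "Compute capability 7.0 or higher"
MAX_L1_KB_PER_SM = 256           # "L1 cache size <= 256KB per SM"


@dataclass
class GpuInfo:
    """Hypothetical container for the two properties the check needs."""
    compute_capability: tuple  # (major, minor)
    l1_cache_kb_per_sm: int


def l1_test_eligible(gpu, l1_is_allowed):
    """Return True only if all three listed requirements hold."""
    return bool(
        l1_is_allowed
        and gpu.compute_capability >= MIN_COMPUTE_CAPABILITY
        and gpu.l1_cache_kb_per_sm <= MAX_L1_KB_PER_SM
    )


print(l1_test_eligible(GpuInfo((8, 0), 192), l1_is_allowed=True))
print(l1_test_eligible(GpuInfo((6, 1), 48), l1_is_allowed=True))
```

Comparing the compute capability as a `(major, minor)` tuple avoids floating-point comparisons such as treating 7.5 and 7.50 differently.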
Supported Parameters
The following table lists the global parameters for the memory plugin.
| Parameter Name | Type | Default | Description |
|---|---|---|---|
| is_allowed | Bool | When unspecified, DCGM will search configuration files for an appropriate default for the specified GPU; if no entry is found, the parameter defaults to false. | Specifies whether or not this test is allowed to run. |
The following table lists the parameters for the main memory test.
| Parameter Name | Type | Default | Description |
|---|---|---|---|
| minimum_allocation_percentage | Double | 75.0 | Minimum percentage of GPU memory to allocate (0.0-100.0). |
| max_free_memory_mb | Double | 0.0 | Maximum amount of GPU memory that will be allocated for memory testing, specified in megabytes (MB). Supersedes minimum_allocation_percentage. If not specified, the test allocates the percentage of GPU memory given by minimum_allocation_percentage. |
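The interaction between the two main memory test parameters can be sketched as follows. This is an illustration of the documented precedence only: the function name `allocation_target_mb`, treating the default of 0.0 as "not specified", and capping the result at the total memory are all assumptions, not confirmed plugin behavior.

```python
def allocation_target_mb(total_mb,
                         minimum_allocation_percentage=75.0,
                         max_free_memory_mb=0.0):
    """Return the amount of memory (in MB) the test would try to allocate.

    Assumption: max_free_memory_mb == 0.0 means "not specified".
    """
    if not 0.0 <= minimum_allocation_percentage <= 100.0:
        raise ValueError("minimum_allocation_percentage must be in 0.0-100.0")
    if max_free_memory_mb > 0.0:
        # max_free_memory_mb supersedes the percentage when it is set.
        return min(max_free_memory_mb, total_mb)
    return total_mb * minimum_allocation_percentage / 100.0


print(allocation_target_mb(16384))                          # defaults only
print(allocation_target_mb(16384, max_free_memory_mb=512))  # explicit cap
```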
The following table lists the parameters for the L1 cache test.
| Parameter Name | Type | Default | Description |
|---|---|---|---|
| l1_is_allowed | Bool | When unspecified, DCGM will search configuration files for an appropriate default for the specified GPU; if no entry is found, the parameter defaults to false. | Enables/disables the L1 cache subtest. |
| test_loops | Double | 0 | Number of test loops. |
| test_duration | Double | 1.0 | Duration of L1 cache test in seconds (0 = run for test_loops). |
| inner_iterations | Double | 1024 | Number of inner iterations per test. |
| log_len | Double | 8192 | Length of error log for L1 cache test. |
| dump_miscompares | Bool | True | Whether to dump miscompare details. |
| l1cache_size_kb_per_sm | Double | 0 | L1 cache size per SM in KB. |
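The relationship between test_duration and test_loops (a duration of 0 means "run for test_loops") can be sketched as a small loop controller. This is a hedged illustration: `run_l1_loops` and `run_one_loop` are hypothetical names, and giving test_duration precedence whenever it is nonzero is an assumption drawn from the table's wording, not from the plugin source.

```python
import time


def run_l1_loops(run_one_loop, test_duration=1.0, test_loops=0,
                 clock=time.monotonic):
    """Run loops until the duration elapses, or a fixed count if duration is 0.

    Returns the number of loops completed. The clock is injectable so the
    timing logic can be exercised without waiting.
    """
    completed = 0
    if test_duration > 0:
        deadline = clock() + test_duration
        while clock() < deadline:
            run_one_loop()
            completed += 1
    else:
        # test_duration == 0: fall back to a fixed number of loops.
        for _ in range(int(test_loops)):
            run_one_loop()
            completed += 1
    return completed


# Count-based run: duration disabled, three loops requested.
fixed_count = run_l1_loops(lambda: None, test_duration=0, test_loops=3)
print(fixed_count)
```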
Sample Commands
Basic Memory Test
$ dcgmi diag -r memory -p "memory.is_allowed=True"
With L1 Cache Test Enabled
$ dcgmi diag -r memory -p "memory.is_allowed=True;memory.l1_is_allowed=True"
Controlling Memory Allocation with max_free_memory_mb
$ dcgmi diag -r memory -p "memory.is_allowed=True;memory.max_free_memory_mb=0.1"
$ dcgmi diag -r memory -p "memory.is_allowed=True;memory.max_free_memory_mb=512"
$ dcgmi diag -r memory -p "memory.is_allowed=True;memory.max_free_memory_mb=100"
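When these commands are generated from a script, the `-p` argument can be assembled from a dictionary of parameters. The helper below is a minimal sketch: `build_memory_diag_command` is a hypothetical name, and it only reproduces the `memory.<name>=<value>` pairs joined by `;` that appear in the sample commands above. It builds the argument list and does not run `dcgmi`.

```python
import shlex


def build_memory_diag_command(**params):
    """Build the argv list for a memory plugin run with the given parameters."""
    rendered = [f"memory.{name}={value}" for name, value in params.items()]
    return ["dcgmi", "diag", "-r", "memory", "-p", ";".join(rendered)]


cmd = build_memory_diag_command(is_allowed=True, max_free_memory_mb=512)
# shlex.join quotes the parameter string so the ';' is not treated
# as a command separator when the line is pasted into a shell.
print(shlex.join(cmd))
```

Passing the list form directly to a process launcher avoids shell quoting issues with the `;` separator.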
Failure Conditions
Memory Allocation Failure (DCGM_FR_MEMORY_ALLOC): Cannot allocate the minimum required percentage of GPU memory.
Memory Mismatch (DCGM_FR_MEMORY_MISMATCH): Data written to memory doesn’t match data read back.
CUDA Double-Bit Error (DCGM_FR_CUDA_DBE): CUDA detects an uncorrectable double-bit ECC error.
L1 Cache Miscompare (DCGM_FR_L1TAG_MISCOMPARE): The L1 cache test detects data corruption.