GPU Memory Plugin

Overview

The GPU Memory Plugin is a hardware diagnostic test that validates GPU memory integrity and functionality. It performs comprehensive memory testing to detect hardware faults, ECC errors, and memory corruption issues on NVIDIA GPUs.

Test Description

The memory plugin consists of two main test components:

  1. Main Memory Test

    • Purpose: Tests GPU memory allocation, writing, and reading operations

    • Process:

      • Allocates a significant portion of GPU memory (default 75% of total memory)

      • Writes specific test patterns to memory using CUDA kernels

      • Reads back the data and verifies integrity

      • Uses a sequence of 5 test patterns: 0x00, 0xAA, 0x55, 0xFF, 0x00 (four distinct values; the sequence begins and ends with 0x00)

      • Detects memory mismatches and ECC errors

  2. L1 Cache Test

    • Purpose: Tests L1 cache functionality and detects cache-related issues

    • Requirements:

      • Compute capability 7.0 or higher

      • L1 cache size ≤ 256KB per SM

    • Controlled by the l1_is_allowed parameter

    • Process: Performs cache operations and validates data integrity
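The write/read-back cycle of the main memory test can be illustrated with a small host-side sketch. This is not the plugin's implementation: it uses a Python bytearray as a stand-in for GPU memory, and the function name run_pattern_test and the stuck_at_one fault hook are hypothetical names introduced only for illustration. The pattern sequence is the one listed in the test description.

```python
# Host-memory stand-in for the write/read-back cycle.
# Assumption: the real test writes and verifies GPU memory with CUDA kernels;
# this sketch only mirrors the logic on a bytearray.
PATTERNS = (0x00, 0xAA, 0x55, 0xFF, 0x00)  # sequence from the test description


def run_pattern_test(buffer, patterns=PATTERNS, stuck_at_one=None):
    """Write each pattern to every byte, read it back, and collect mismatches.

    stuck_at_one is a hypothetical (offset, mask) hook that forces bits to 1
    after each write, simulating a faulty memory cell.
    Returns a list of (pattern, offset, value_read) tuples.
    """
    mismatches = []
    size = len(buffer)
    for pattern in patterns:
        # Write phase: fill the whole buffer with the current pattern.
        buffer[:] = bytes([pattern]) * size
        # Optional fault injection: the chosen bits always read back as 1.
        if stuck_at_one is not None:
            offset, mask = stuck_at_one
            buffer[offset] |= mask
        # Read-back phase: every byte must still equal the pattern.
        for offset, value in enumerate(buffer):
            if value != pattern:
                mismatches.append((pattern, offset, value))
    return mismatches


if __name__ == "__main__":
    print(run_pattern_test(bytearray(16)))
    print(run_pattern_test(bytearray(16), stuck_at_one=(3, 0x01)))
```

In this sketch a bit stuck at 1 is caught by 0x00 and 0xAA but not by 0x55 or 0xFF, since those patterns already set that bit. This is one reason a pattern test alternates complementary values.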

Supported Parameters

The following table lists the global parameters for the memory plugin.

| Parameter Name | Type | Default | Description |
|---|---|---|---|
| is_allowed | Bool | When unspecified, DCGM will search configuration files for an appropriate default for the specified GPU; if no entry is found, the parameter defaults to false. | Specifies whether or not this test is allowed to run. |

The following table lists the parameters for the main memory test.

| Parameter Name | Type | Default | Description |
|---|---|---|---|
| minimum_allocation_percentage | Double | 75.0 | Minimum percentage of GPU memory to allocate (0.0-100.0). |
| max_free_memory_mb | Double | 0.0 | Maximum amount of GPU memory to allocate for memory testing, specified in megabytes (MB). Supersedes minimum_allocation_percentage. If not specified, the test allocates the share of GPU memory given by minimum_allocation_percentage. |
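How the two allocation parameters interact can be sketched as follows. This is an illustration under stated assumptions, not the plugin's code: the function name target_allocation_bytes is hypothetical, a value of 0.0 for max_free_memory_mb is treated as "not specified" (matching its default), one MB is taken as 2**20 bytes, and the result is capped at the total memory size.

```python
MIB = 1024 * 1024  # assumption: "MB" is interpreted as 2**20 bytes


def target_allocation_bytes(total_bytes,
                            minimum_allocation_percentage=75.0,
                            max_free_memory_mb=0.0):
    """Return how many bytes a test run would try to allocate (sketch)."""
    if not 0.0 <= minimum_allocation_percentage <= 100.0:
        raise ValueError("minimum_allocation_percentage must be in 0.0-100.0")
    if max_free_memory_mb < 0.0:
        raise ValueError("max_free_memory_mb must not be negative")

    if max_free_memory_mb > 0.0:
        # max_free_memory_mb supersedes the percentage when it is set.
        # Assumption: the request is capped at the total memory size.
        return min(int(max_free_memory_mb * MIB), total_bytes)

    # Otherwise fall back to the percentage of total memory.
    return int(total_bytes * minimum_allocation_percentage / 100.0)


if __name__ == "__main__":
    total = 16 * 1024 * MIB  # a hypothetical 16 GiB GPU
    print(target_allocation_bytes(total))
    print(target_allocation_bytes(total, max_free_memory_mb=512))
```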

The following table lists the parameters for the L1 cache test.

| Parameter Name | Type | Default | Description |
|---|---|---|---|
| l1_is_allowed | Bool | When unspecified, DCGM will search configuration files for an appropriate default for the specified GPU; if no entry is found, the parameter defaults to false. | Enables or disables the L1 cache subtest. |
| test_loops | Double | 0 | Number of test loops. |
| test_duration | Double | 1.0 | Duration of the L1 cache test in seconds (0 = run for test_loops). |
| inner_iterations | Double | 1024 | Number of inner iterations per test. |
| log_len | Double | 8192 | Length of the error log for the L1 cache test. |
| dump_miscompares | Bool | True | Whether to dump miscompare details. |
| l1cache_size_kb_per_sm | Double | 0 | L1 cache size per SM in KB. |
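Combined with the requirements listed for the L1 Cache Test (compute capability 7.0 or higher, L1 cache size of at most 256 KB per SM, and l1_is_allowed enabled), the gating decision might look like the sketch below. The function name l1_test_should_run and its argument shapes are hypothetical; how the plugin actually obtains the compute capability and the cache size is not covered here.

```python
MIN_COMPUTE_CAPABILITY = (7, 0)  # from the listed requirements
MAX_L1_SIZE_KB_PER_SM = 256      # from the listed requirements


def l1_test_should_run(compute_capability, l1_size_kb_per_sm, l1_is_allowed):
    """Return True when all documented preconditions for the L1 subtest hold.

    compute_capability is a (major, minor) tuple, e.g. (8, 0).
    """
    if not l1_is_allowed:
        return False
    # Tuple comparison orders by major version first, then minor.
    if tuple(compute_capability) < MIN_COMPUTE_CAPABILITY:
        return False
    return l1_size_kb_per_sm <= MAX_L1_SIZE_KB_PER_SM


if __name__ == "__main__":
    print(l1_test_should_run((8, 0), 192, True))
    print(l1_test_should_run((6, 1), 48, True))
```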

Sample Commands

Basic Memory Test

$ dcgmi diag -r memory -p "memory.is_allowed=True"

With L1 Cache Test Enabled

$ dcgmi diag -r memory -p "memory.is_allowed=True;memory.l1_is_allowed=True"

Controlling Memory Allocation with max_free_memory_mb

$ dcgmi diag -r memory -p "memory.is_allowed=True;memory.max_free_memory_mb=0.1"

$ dcgmi diag -r memory -p "memory.is_allowed=True;memory.max_free_memory_mb=512"

$ dcgmi diag -r memory -p "memory.is_allowed=True;memory.max_free_memory_mb=100"
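The commands above all follow the same pattern: each parameter is written as plugin.name=value, and multiple parameters are joined with semicolons. When such commands are generated from a script, a small helper along these lines may be convenient. The helper names build_param_string and build_diag_command are hypothetical; only the dcgmi diag -r and -p usage shown in the samples is relied upon.

```python
def build_param_string(plugin, params):
    """Join parameters as 'plugin.key=value' pairs separated by ';'."""
    return ";".join(f"{plugin}.{key}={value}" for key, value in params.items())


def build_diag_command(plugin, params):
    """Return an argument list mirroring the sample dcgmi commands."""
    return ["dcgmi", "diag", "-r", plugin, "-p", build_param_string(plugin, params)]


if __name__ == "__main__":
    cmd = build_diag_command("memory", {"is_allowed": True, "max_free_memory_mb": 512})
    # Prints the argument list; on a host with DCGM installed it could be
    # passed to subprocess.run(cmd, check=False).
    print(cmd)
```

Passing the arguments as a list avoids shell quoting issues with the semicolon-separated parameter string.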

Failure Conditions

  1. Memory Allocation Failure (DCGM_FR_MEMORY_ALLOC)

    • Cannot allocate the minimum required percentage of GPU memory.

  2. Memory Mismatch (DCGM_FR_MEMORY_MISMATCH)

    • Data written to memory does not match the data read back.

  3. CUDA Double-Bit Error (DCGM_FR_CUDA_DBE)

    • CUDA detects an uncorrectable double-bit ECC error.

  4. L1 Cache Miscompare (DCGM_FR_L1TAG_MISCOMPARE)

    • The L1 cache test detects data corruption.