Memtest Diagnostic

Overview

Beginning with version 2.4.0, DCGM diagnostics support an additional diagnostic level, level 4 (-r 4). The first of these additional diagnostics is memtest. Similar to memtest86, the DCGM memtest exercises GPU memory with various test patterns. Each pattern is implemented as a separate test that administrators can enable or disable.

Test Descriptions

Note

Test runtimes are the average number of seconds per iteration on a single A100 40 GB GPU.

Test 0 [Walking 1 bit] - This test changes one bit at a time in memory and checks whether the change shows up at a different memory location. It is designed to test the address wires. Runtime: ~3 seconds.
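As a rough illustration of the walking-bit idea, the sketch below runs it against a simulated memory in Python. It is not the DCGM GPU kernel: the `walking_one_test` function, the list-backed memory, the word width, and the `StuckBitMemory` fault model are all assumptions made for the example.

```python
def walking_one_test(memory, word_bits=32):
    """Write each one-hot pattern to every word and read it back.

    Returns a list of (address, expected, actual) mismatches.
    """
    errors = []
    for bit in range(word_bits):
        pattern = 1 << bit
        # Write pass: fill every word with the current one-hot pattern.
        for addr in range(len(memory)):
            memory[addr] = pattern
        # Read pass: any other value means a bit was lost or disturbed.
        for addr in range(len(memory)):
            if memory[addr] != pattern:
                errors.append((addr, pattern, memory[addr]))
    return errors


class StuckBitMemory(list):
    """Hypothetical fault: bit 2 of word 3 can never be set."""

    def __setitem__(self, addr, value):
        if addr == 3:
            value &= ~0b100
        super().__setitem__(addr, value)


print(walking_one_test([0] * 8, word_bits=8))                  # prints []
print(walking_one_test(StuckBitMemory([0] * 8), word_bits=8))  # prints [(3, 4, 0)]
```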

Test 1 [Address check] - Each memory location is filled with its own address, then checked to confirm that the stored value still matches the address. Runtime: < 1 second.
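The address check can be sketched on a simulated memory as follows. This is only an illustration under stated assumptions: `address_check` and the `AliasedMemory` fault model (one ignored address bit) are hypothetical names, and list indices stand in for real GPU addresses.

```python
def address_check(memory):
    """Store each location's own index, then return the addresses that no longer match."""
    # Write pass: every location holds its own address.
    for addr in range(len(memory)):
        memory[addr] = addr
    # Read pass: a mismatch means another write landed on this location.
    return [addr for addr in range(len(memory)) if memory[addr] != addr]


class AliasedMemory(list):
    """Hypothetical fault: address bit 2 is ignored, so 4..7 alias onto 0..3."""

    def __getitem__(self, addr):
        return super().__getitem__(addr & ~0b100)

    def __setitem__(self, addr, value):
        super().__setitem__(addr & ~0b100, value)


print(address_check([0] * 8))                 # prints []
print(address_check(AliasedMemory([0] * 8)))  # prints [0, 1, 2, 3]
```

In the faulty case the later writes to addresses 4 through 7 overwrite locations 0 through 3, which is why the low addresses are the ones reported.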

Test 2 [Moving inversions, ones&zeros] - This test uses the moving inversions algorithm from memtest86 with patterns of all ones and all zeros. Runtime: ~4 seconds.
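For readers unfamiliar with moving inversions, here is a minimal sketch of the algorithm as memtest86 describes it: fill with a pattern, walk upward checking and inverting, then walk downward checking and restoring. The simulated list memory, the 32-bit word width, and the function name are assumptions, and this is not the DCGM kernel code.

```python
MASK = 0xFFFFFFFF  # 32-bit words, an assumption for this sketch


def moving_inversions(memory, pattern):
    """Return (address, expected, actual) tuples for every mismatch found."""
    errors = []
    complement = ~pattern & MASK
    n = len(memory)
    # Step 1: fill all of memory with the pattern.
    for addr in range(n):
        memory[addr] = pattern
    # Step 2: from the lowest address up, check the pattern, then write its complement.
    for addr in range(n):
        if memory[addr] != pattern:
            errors.append((addr, pattern, memory[addr]))
        memory[addr] = complement
    # Step 3: from the highest address down, check the complement, then restore the pattern.
    for addr in reversed(range(n)):
        if memory[addr] != complement:
            errors.append((addr, complement, memory[addr]))
        memory[addr] = pattern
    return errors


memory = [0] * 64  # simulated memory, hypothetical size
for pattern in (0x00000000, 0xFFFFFFFF):  # all zeros, then all ones
    print(hex(pattern), moving_inversions(memory, pattern))
```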

Test 3 [Moving inversions, 8 bit pat] - Same as Test 2 but uses an 8-bit wide pattern of “walking” ones and zeros. Runtime: ~4 seconds.
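The "walking" 8-bit patterns can be enumerated with a few lines of Python. This sketch only shows one plausible way to generate them; the ordering, the replication of the byte across a 32-bit word, and both function names are assumptions rather than documented DCGM behavior.

```python
def walking_8bit_patterns():
    """Return walking-one and walking-zero bytes for each of the 8 bit positions."""
    patterns = []
    for bit in range(8):
        one = 1 << bit       # walking one, e.g. 0b00000100
        zero = ~one & 0xFF   # walking zero, e.g. 0b11111011
        patterns.append(one)
        patterns.append(zero)
    return patterns


def replicate(byte, word_bytes=4):
    """Repeat an 8-bit pattern across a wider word (assumed 4 bytes here)."""
    return int.from_bytes(bytes([byte]) * word_bytes, "little")


print([f"{p:08b}" for p in walking_8bit_patterns()[:4]])
# prints ['00000001', '11111110', '00000010', '11111101']
print(hex(replicate(0x80)))  # prints 0x80808080
```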

Test 4 [Moving inversions, random pattern] - Same algorithm as Test 2, but the data pattern is a random number and its complement. A total of 60 patterns are used. The random number sequence is different with each pass, so multiple passes can increase effectiveness. Runtime: ~2 seconds.

Test 5 [Block move, 64 moves] - This test moves blocks of memory. Memory is initialized with shifting patterns that are inverted every 8 bytes. Then these blocks of memory are moved around. After the moves are completed the data patterns are checked. Runtime: ~1 second.

Test 6 [Moving inversions, 32 bit pat] - This is a variation of the moving inversions algorithm that shifts the data pattern left one bit for each successive address. To use all possible data patterns 32 passes are made during the test. Runtime: ~155 seconds.
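To make the "shift left one bit per address, 32 passes" description concrete, the sketch below generates only the data pattern and leaves out the inversion steps. The starting pattern of 1, the rotation at bit 31, and the `pattern_for` helper are assumptions chosen for illustration.

```python
WORD_BITS = 32


def pattern_for(addr, pass_index, word_bits=WORD_BITS):
    """One-hot pattern shifted left once per address, offset by the pass number."""
    return 1 << ((addr + pass_index) % word_bits)


# Pass 0: each successive address holds the previous pattern shifted left by one bit.
print([hex(pattern_for(addr, 0)) for addr in range(4)])
# prints ['0x1', '0x2', '0x4', '0x8']

# Across 32 passes, a given address sees every one of the 32 one-hot patterns.
seen = {pattern_for(5, p) for p in range(WORD_BITS)}
print(len(seen))  # prints 32
```

Under these assumptions, the pattern count explains why this test takes far longer than the others: the whole fill-and-check cycle is repeated once per bit position.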

Test 7 [Random number sequence] - A 1 MB block of memory is initialized with random patterns. These patterns and their complements are used in moving inversion tests with the rest of memory. Runtime: ~2 seconds.

Test 8 [Modulo 20, random pattern] - A random pattern is generated and written to every 20th memory location. All other memory locations are set to the complement of the pattern. This is repeated 20 times, and each time the locations that receive the pattern are shifted right by one. Runtime: ~10 seconds.
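The modulo-20 scheme can be sketched on a simulated memory like this. The function name, the 32-bit word width, and the seeded `random.Random` generator are assumptions for the example; the real test runs as GPU kernels.

```python
import random

MASK = 0xFFFFFFFF  # 32-bit words, an assumption for this sketch
STRIDE = 20


def modulo_20(memory, rng):
    """Return (offset, address, expected, actual) tuples for every mismatch."""
    errors = []
    pattern = rng.getrandbits(32)
    complement = ~pattern & MASK
    for offset in range(STRIDE):
        # Write pass: every 20th location (starting at `offset`) gets the pattern,
        # all other locations get its complement.
        for addr in range(len(memory)):
            memory[addr] = pattern if addr % STRIDE == offset else complement
        # Read pass: verify the same layout.
        for addr in range(len(memory)):
            expected = pattern if addr % STRIDE == offset else complement
            if memory[addr] != expected:
                errors.append((offset, addr, expected, memory[addr]))
    return errors


print(modulo_20([0] * 100, random.Random(0)))  # prints []
```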

Test 9 [Bit fade test, 2 patterns] - The bit fade test initializes all memory with a pattern and then sleeps for 1 minute. Then memory is examined to see if any memory bits have changed. All ones and all zero patterns are used. Runtime: ~244 seconds.
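A bit fade pass reduces to write, wait, verify. The following sketch shows that shape on a simulated memory; `bit_fade`, its `hold_seconds` parameter, and the word mask are hypothetical, and the demo call uses a very short hold so that it finishes quickly.

```python
import time


def bit_fade(memory, hold_seconds=60.0, word_mask=0xFFFFFFFF):
    """Fill memory, wait, and report any word that changed while idle."""
    errors = []
    for pattern in (word_mask, 0):  # all ones, then all zeros
        for addr in range(len(memory)):
            memory[addr] = pattern
        time.sleep(hold_seconds)  # leave the data untouched for the hold period
        for addr in range(len(memory)):
            if memory[addr] != pattern:
                errors.append((addr, pattern, memory[addr]))
    return errors


print(bit_fade([0] * 16, hold_seconds=0.01))  # prints []
```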

Test 10 [Memory stress] - A random pattern is generated and a large kernel is launched to set all memory to the pattern. A new read-and-write kernel is launched immediately after the previous write kernel to check for memory errors and set the memory to the complement. This process is repeated 1000 times for one pattern. The kernel is written to achieve the maximum bandwidth between global memory and the GPU. Runtime: ~6 seconds.

Note

By default, Test 7 and Test 10 alternate for a period of 10 minutes. If any errors are detected, the diagnostic fails.
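Using the per-iteration runtimes listed above (~2 seconds for Test 7 and ~6 seconds for Test 10), a back-of-the-envelope estimate of how many alternations fit in the default window can be worked out as below. The estimate assumes the tests run back to back with no overhead and that the A100 40 GB averages apply; actual counts on other hardware may differ.

```python
TEST7_SECONDS = 2        # approximate runtime listed for Test 7
TEST10_SECONDS = 6       # approximate runtime listed for Test 10
DEFAULT_DURATION = 600   # default test_duration, in seconds


def estimated_cycles(duration, *runtimes):
    """Whole cycles of the given tests that fit in `duration` seconds."""
    return duration // sum(runtimes)


print(estimated_cycles(DEFAULT_DURATION, TEST7_SECONDS, TEST10_SECONDS))  # prints 75
```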

Supported Parameters

Parameter      Syntax   Default
-------------  -------  -------
test0          boolean  false
test1          boolean  false
test2          boolean  false
test3          boolean  false
test4          boolean  false
test5          boolean  false
test6          boolean  false
test7          boolean  true
test8          boolean  false
test9          boolean  false
test10         boolean  true
test_duration  seconds  600
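When these parameters are set from a script, a small helper can validate names against the table and format the value for the -p option. The sketch below is a hypothetical convenience wrapper, not part of DCGM: `DEFAULTS` mirrors the table above, and `memtest_params` is an invented name.

```python
# Defaults copied from the parameter table; test7 and test10 are enabled by default.
DEFAULTS = {f"test{i}": False for i in range(11)}
DEFAULTS["test7"] = True
DEFAULTS["test10"] = True
DEFAULTS["test_duration"] = 600


def memtest_params(**overrides):
    """Build the 'memtest.name=value;...' string for the -p option."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown memtest parameters: {sorted(unknown)}")
    parts = []
    for name, value in overrides.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # lowercase booleans, as in the table
        parts.append(f"memtest.{name}={value}")
    return ";".join(parts)


print(memtest_params(test0=True, test7=False, test10=False, test_duration=60))
# prints memtest.test0=true;memtest.test7=false;memtest.test10=false;memtest.test_duration=60
```

The returned string is unescaped. When it is typed into a shell, each semicolon has to be escaped or the whole value quoted so the shell does not treat it as a command separator.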

Sample Commands

Run test7 and test10 for 10 minutes (this is the default):

dcgmi diag -r 4

Run each test serially for 1 hour, then display the results:

dcgmi diag -r 4 -p memtest.test0=true\;memtest.test1=true\;memtest.test2=true\;memtest.test3=true\;memtest.test4=true\;memtest.test5=true\;memtest.test6=true\;memtest.test7=true\;memtest.test8=true\;memtest.test9=true\;memtest.test10=true\;memtest.test_duration=3600
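Typing the long parameter list by hand is error-prone, so it may be easier to generate it. The following sketch builds the same argument list in Python and prints a shell-safe form; `all_tests_command` is a hypothetical helper, and `shlex.join` quotes the parameter string instead of backslash-escaping each semicolon, which should be equivalent for the shell.

```python
import shlex


def all_tests_command(duration_seconds=3600, test_count=11):
    """Return the argv list that enables test0..test10 for the given duration."""
    params = [f"memtest.test{i}=true" for i in range(test_count)]
    params.append(f"memtest.test_duration={duration_seconds}")
    return ["dcgmi", "diag", "-r", "4", "-p", ";".join(params)]


argv = all_tests_command()
print(shlex.join(argv))  # the parameter string is wrapped in single quotes
```

Passing `argv` as a list to `subprocess.run` would avoid shell quoting altogether, since no shell is involved in that case.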

Run test0 for one minute 10 times, displaying the results each minute:

dcgmi diag --iterations 10 -r 4 -p memtest.test0=true\;memtest.test7=false\;memtest.test10=false\;memtest.test_duration=60