DCGM Diagnostics¶
Overview¶
The NVIDIA Validation Suite (NVVS) is now called DCGM Diagnostics. As of DCGM v1.5, running NVVS as a standalone utility is deprecated, and all of its functionality (including command-line options) is available via the DCGM command-line utility (‘dcgmi’). For brevity, the rest of the document may use DCGM Diagnostics and NVVS interchangeably.
DCGM Diagnostic Goals¶
DCGM Diagnostics are designed to:
Provide a system-level tool, in production environments, to assess cluster readiness levels before a workload is deployed.
Facilitate multiple run modes:
Interactive via an administrator or user in plain text.
Scripted via another tool with easily parseable output.
Provide multiple test timeframes to facilitate different preparedness or failure conditions:
Level 1 tests to use as a readiness metric
Level 2 tests to use as an epilogue on failure
Level 3 tests to be run by an administrator as post-mortem
Integrate the following concepts into a single tool to discover deployment, system software and hardware configuration issues, basic diagnostics, integration issues, and relative system performance.
Deployment and Software Issues
NVML library access and versioning
CUDA library access and versioning
Software conflicts
Hardware Issues and Diagnostics
Pending Page Retirements
PCIe interface checks
NVLink interface checks
Framebuffer and memory checks
Compute engine checks
Integration Issues
PCIe replay counter checks
Topological limitations
Permissions, driver, and cgroups checks
Basic power and thermal constraint checks
Stress Checks
Power and thermal stress
Throughput stress
Constant relative system performance
Maximum relative system performance
Memory Bandwidth
Provide troubleshooting help
Easily integrate into Cluster Scheduler and Cluster Management applications
Reduce downtime and failed GPU jobs
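As a rough illustration of how the three test levels might map onto a cluster scheduler's job lifecycle, the sketch below wraps dcgmi in two small shell functions. The phase names (prolog, epilogue, postmortem), the function names, and the DCGMI_BIN override are hypothetical and not part of DCGM. The sketch also assumes that dcgmi exits with a non-zero status when a diagnostic does not pass; verify this against your DCGM version before relying on it.

```shell
# Map a scheduler phase to a diagnostic run level, following the level
# descriptions in the goals list. The phase names are hypothetical.
diag_level_for_phase() {
    case "$1" in
        prolog)     echo 1 ;;  # readiness check before a job starts
        epilogue)   echo 2 ;;  # follow-up after a job has failed
        postmortem) echo 3 ;;  # administrator-driven investigation
        *)          echo "unknown phase: $1" >&2; return 2 ;;
    esac
}

# Run the diagnostic for a phase and print a simple verdict.
# DCGMI_BIN is a hypothetical override so the helper can be exercised
# on machines without DCGM installed.
# Assumption: dcgmi exits non-zero when the diagnostic does not pass.
run_phase_diag() {
    local level
    level="$(diag_level_for_phase "$1")" || return 2
    if "${DCGMI_BIN:-dcgmi}" diag -r "$level" >/dev/null 2>&1; then
        echo "pass"
    else
        echo "fail"
        return 1
    fi
}
```

A scheduler prolog script could then call `run_phase_diag prolog` and drain the node when it prints "fail".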
Beyond the Scope of the DCGM Diagnostics¶
DCGM Diagnostics are not designed to:
Provide comprehensive hardware diagnostics
Actively fix problems
Replace the field diagnosis tools. Please refer to http://docs.nvidia.com/deploy/hw-field-diag/index.html for that process.
Facilitate any RMA process. Please refer to http://docs.nvidia.com/deploy/rma-process/index.html for those procedures.
Overview of Plugins¶
The NVIDIA Validation Suite consists of a series of plugins that are each designed to accomplish a different goal.
Deployment Plugin¶
The deployment plugin’s purpose is to verify the compute environment is ready to run CUDA applications and is able to load the NVML library.
Preconditions¶
LD_LIBRARY_PATH must include the path to the CUDA libraries, which for version X.Y of CUDA is normally /usr/local/cuda-X.Y/lib64. This can be set by running export LD_LIBRARY_PATH=/usr/local/cuda-X.Y/lib64
The Linux nouveau driver must not be running, and should be blacklisted since it will conflict with the NVIDIA driver
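The two preconditions above can be checked before running the plugin. The following is a minimal sketch, not part of DCGM: the function names are hypothetical, and the nouveau check assumes the standard /proc/modules format, in which each line begins with the module name followed by a space.

```shell
# Succeeds if the colon-separated list in $1 contains the directory $2.
path_list_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Succeeds if a module named "nouveau" appears in a /proc/modules-style
# listing. The file argument defaults to /proc/modules.
nouveau_loaded() {
    grep -q '^nouveau ' "${1:-/proc/modules}"
}

# Example usage (X.Y stands for the installed CUDA version, as in the text):
#   path_list_contains "$LD_LIBRARY_PATH" /usr/local/cuda-X.Y/lib64 \
#       || echo "CUDA libraries are not on LD_LIBRARY_PATH"
#   nouveau_loaded && echo "nouveau is loaded; blacklist it and reboot"
```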
Configuration Parameters¶
None at this time.
Stat Outputs¶
None at this time.
Failure¶
The plugin will fail if:
The corresponding device nodes for the target GPU(s) are being blocked by the operating system (e.g. cgroups) or exist without r/w permissions for the current user.
The NVML library libnvidia-ml.so cannot be loaded
The CUDA runtime libraries cannot be loaded
The nouveau driver is found to be loaded
Any pages are pending retirement on the target GPU(s)
There are any pending row remaps or failed row remappings on the target GPU(s)
Any other graphics processes are running on the target GPU(s) while the plugin runs
PCIe - GPU Bandwidth Plugin¶
The GPU bandwidth plugin’s purpose is to measure the bandwidth and latency to and from the GPUs and the host.
Preconditions¶
None
Sub tests¶
The plugin consists of several self-tests that each measure a different aspect of bandwidth or latency. Each subtest has either a pinned/unpinned pair or a p2p enabled/p2p disabled pair of identical tests. Pinned/unpinned tests use either pinned or unpinned memory when copying data between the host and the GPUs.
This plugin will use NVLink to communicate between GPUs when possible. Otherwise, communication between GPUs will occur over PCIe.
Each sub test is represented with a tag that is used both for specifying configuration parameters for the sub test and for outputting stats for the sub test. P2p enabled/p2p disabled tests enable or disable GPUs on the same card talking to each other directly rather than through the PCIe bus.
Sub Test Tag | Pinned/Unpinned or P2P Enabled/P2P Disabled | Description |
---|---|---|
h2d_d2h_single_pinned | Pinned | Device <-> Host Bandwidth, one GPU at a time |
h2d_d2h_single_unpinned | Unpinned | Device <-> Host Bandwidth, one GPU at a time |
h2d_d2h_latency_pinned | Pinned | Device <-> Host Latency, one GPU at a time |
h2d_d2h_latency_unpinned | Unpinned | Device <-> Host Latency, one GPU at a time |
p2p_bw_p2p_enabled | P2P Enabled | Device <-> Device bandwidth, one GPU pair at a time |
p2p_bw_p2p_disabled | P2P Disabled | Device <-> Device bandwidth, one GPU pair at a time |
p2p_bw_concurrent_p2p_enabled | P2P Enabled | Device <-> Device bandwidth, concurrently, focusing on bandwidth between GPUs likely to be directly connected to each other -> for each (index / 2) and (index / 2)+1 |
p2p_bw_concurrent_p2p_disabled | P2P Disabled | Device <-> Device bandwidth, concurrently, focusing on bandwidth between GPUs likely to be directly connected to each other -> for each (index / 2) and (index / 2)+1 |
1d_exch_bw_p2p_enabled | P2P Enabled | Device <-> Device bandwidth, concurrently, focusing on bandwidth between GPUs, every GPU either sending to the GPU with the index higher than itself (l2r) or to the GPU with the index lower than itself (r2l) |
1d_exch_bw_p2p_disabled | P2P Disabled | Device <-> Device bandwidth, concurrently, focusing on bandwidth between GPUs, every GPU either sending to the GPU with the index higher than itself (l2r) or to the GPU with the index lower than itself (r2l) |
p2p_latency_p2p_enabled | P2P Enabled | Device <-> Device Latency, one GPU pair at a time |
p2p_latency_p2p_disabled | P2P Disabled | Device <-> Device Latency, one GPU pair at a time |
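To make the l2r and r2l directions of the 1d_exch_bw sub tests concrete, the sketch below lists which GPU index sends to which neighbor for a given GPU count. It is an illustration only: the function name is hypothetical, and it assumes that GPUs at either end of the index range have no partner (no wraparound), which the description does not state.

```shell
# List sender->receiver pairs for a one-dimensional exchange.
#   $1: number of GPUs
#   $2: direction, "l2r" (send to index + 1) or "r2l" (send to index - 1)
# Assumption: the exchange does not wrap around at the ends.
exchange_pairs() {
    local n="$1" dir="$2" i=0 peer
    while [ "$i" -lt "$n" ]; do
        if [ "$dir" = "l2r" ]; then
            peer=$((i + 1))
        else
            peer=$((i - 1))
        fi
        # Skip senders whose neighbor would fall outside the index range.
        if [ "$peer" -ge 0 ] && [ "$peer" -lt "$n" ]; then
            echo "${i}->${peer}"
        fi
        i=$((i + 1))
    done
}
```

For example, `exchange_pairs 4 l2r` prints one pair per line for GPUs 0 through 3.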
Pulse Test Diagnostic¶
Overview¶
The Pulse Test is part of the new level 4 tests. The pulse test is meant to fluctuate the power usage to create spikes in current flow on the board to ensure that the power supply is fully functional and can handle wide fluctuations in current.
Test Description¶
By default, the test runs kernels with high transiency in order to create spikes in the current running to the GPU. Default parameters have been verified to create worst-case scenario failures by measuring with oscilloscopes.
The test iteratively runs different kernels while tweaking internal parameters to ensure that spikes are produced; work across GPUs is synchronized to create extra stress on the power supply.
Supported Parameters¶
Parameter | Description | Default |
---|---|---|
test_duration | seconds for an internal step, not full time | 500 |
kernel | kernel to execute | sgemm |
exit_on_error | Exit on error | 0 |
internal_loops | kernel calls between checks | 1024 |
alpha | Alpha | 2.0 |
beta | Beta | -1.0 |
waves | Minimum saturation factor | 1 |
k_size | K value | 4096 |
min_k_size | Minimum k schmoo value | 32 |
freq0 | Frequency in Hz | 22000 |
duty0 | Duty as a fraction | 0.5 |
freq1 | Frequency in Hz | 22000 |
duty1 | Duty as a fraction | 0.5 |
sync_timeout | Wait time when syncing GPUs | 10000 |
random_seed | Random seed | 0xDEADCAFE |
matrix_size_mode | standard, max_alloc, square, forced | standard |
force_m | Forced M value | 64 |
force_n | Forced N value | 64 |
inject_errors | Number of errors to inject in test results | 0 |
debug0 | Debug flag | 0 |
check_mode | crc or diff | diff |
use_curand | Use curand for random numbers | 1 |
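For orientation, a frequency in Hz and a duty fraction can be converted into a period and an active time per period using the conventional duty-cycle definition (active time = period x duty). The sketch below performs that arithmetic with awk. It assumes the freq and duty parameters follow this conventional meaning, which the table does not spell out, and the function name is hypothetical.

```shell
# Print the period and the active time per period, in microseconds,
# for a frequency in Hz ($1) and a duty fraction ($2).
# Assumption: duty is the fraction of each period spent active.
pulse_timing_us() {
    awk -v freq="$1" -v duty="$2" 'BEGIN {
        period = 1000000 / freq
        printf "%.2f %.2f\n", period, period * duty
    }'
}
```

With the defaults of 22000 Hz and 0.5, `pulse_timing_us 22000 0.5` prints `45.45 22.73`.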
Sample Commands¶
Run the entire diagnostic suite, including the pulse test:
$ dcgmi diag -r 4
Run just the pulse test:
$ dcgmi diag -r pulse_test
Run just the pulse test, but at a lower frequency:
$ dcgmi diag -r pulse_test -p pulse_test.freq0=3000
Run just the pulse test at a lower frequency and for a shorter time:
$ dcgmi diag -r pulse_test -p "pulse_test.freq0=5000;pulse_test.test_duration=180"
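As the last command shows, several parameters are passed to -p as one string of plugin.key=value entries separated by semicolons. A small helper along the following lines can assemble that string in scripts. The function name is hypothetical, and the sketch does no validation of parameter names or values.

```shell
# Join "key=value" settings for one plugin into a -p argument:
# each key is prefixed with "<plugin>." and entries are separated by ";".
#   $1: plugin name, remaining arguments: key=value pairs
build_diag_params() {
    local plugin="$1" out="" kv
    shift
    for kv in "$@"; do
        # Add a ";" separator only when out already holds an entry.
        out="${out:+$out;}${plugin}.${kv}"
    done
    printf '%s\n' "$out"
}

# Example usage:
#   dcgmi diag -r pulse_test \
#       -p "$(build_diag_params pulse_test freq0=5000 test_duration=180)"
```

Quoting the command substitution keeps the shell from treating the semicolons as command separators.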
Failure Conditions¶
The pulse test will fail if the power supply unit cannot handle the spikes in the current.
It will also fail if unrecoverable memory errors, temperature violations, or XIDs occur during the test.
Memtest Diagnostic¶
Overview¶
Beginning with DCGM 2.4.0, DCGM Diagnostics support an additional level 4 of diagnostics (-r 4). The first of these additional diagnostics is memtest. Similar to memtest86, the DCGM memtest exercises GPU memory with various test patterns. Each pattern is given a separate test, and administrators can enable or disable each test individually.
Test Descriptions¶
Note
Test runtimes refer to average seconds per single iteration on a single A100 40GB GPU.
Test 0 [Walking 1 bit] - This test changes one bit at a time in memory to see if it goes to a different memory location. It is designed to test the address wires. Runtime: ~3 seconds.
Test 1 [Address check] - Each memory location is filled with its own address, followed by a check to see if the value in each memory location still agrees with the address. Runtime: < 1 second.
Test 2 [Moving inversions, ones&zeros] - This test uses the moving inversions algorithm from memtest86 with patterns of all ones and zeros. Runtime: ~4 seconds.
Test 3 [Moving inversions, 8 bit pat] - Same as test 2 but uses an 8-bit-wide pattern of “walking” ones and zeros. Runtime: ~4 seconds.
Test 4 [Moving inversions, random pattern] - Same algorithm as test 2 but the data pattern is a random number and its complement. A total of 60 patterns are used. The random number sequence is different with each pass, so multiple passes can increase effectiveness. Runtime: ~2 seconds.
Test 5 [Block move, 64 moves] - This test moves blocks of memory. Memory is initialized with shifting patterns that are inverted every 8 bytes. Then these blocks of memory are moved around. After the moves are completed the data patterns are checked. Runtime: ~1 second.
Test 6 [Moving inversions, 32 bit pat] - This is a variation of the moving inversions algorithm that shifts the data pattern left one bit for each successive address. To use all possible data patterns 32 passes are made during the test. Runtime: ~155 seconds.
Test 7 [Random number sequence] - A 1MB block of memory is initialized with random patterns. These patterns and their complements are used in moving inversion tests with the rest of memory. Runtime: ~2 seconds.
Test 8 [Modulo 20, random pattern] - A random pattern is generated and used to set every 20th memory location. The remaining memory locations are set to the complement of the pattern. This is repeated 20 times, and each time the location that receives the pattern is shifted right. Runtime: ~10 seconds.
Test 9 [Bit fade test, 2 patterns] - The bit fade test initializes all memory with a pattern and then sleeps for 1 minute. Then memory is examined to see if any memory bits have changed. All ones and all zero patterns are used. Runtime: ~244 seconds.
Test 10 [Memory stress] - A random pattern is generated and a large kernel is launched to set all memory to the pattern. A new read-and-write kernel is launched immediately after the previous write kernel to check whether there are any errors in memory and to set the memory to the complement. This process is repeated 1000 times for one pattern. The kernel is written to achieve the maximum bandwidth between global memory and the GPU. Runtime: ~6 seconds.
Note
By default, Test 7 and Test 10 alternate for a period of 10 minutes. If any errors are detected, the diagnostic will fail.
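When choosing which tests to enable, it can help to estimate how long one iteration of a selection takes. The sketch below sums the per-iteration runtimes quoted in the test descriptions. The function and variable names are hypothetical, the "< 1 second" figure for the address check is rounded up to 1, and the figures apply only to the single A100 40GB reference GPU mentioned in the note.

```shell
# Per-iteration runtimes in seconds for tests 0 through 10, copied from
# the descriptions above ("< 1 second" for test 1 is rounded up to 1).
MEMTEST_RUNTIMES="3 1 4 4 2 1 155 2 10 244 6"

# Sum the runtimes of the selected tests, given by index.
# With no arguments, sum all eleven tests.
memtest_pass_seconds() {
    local selected=" $* " total=0 idx=0 t
    for t in $MEMTEST_RUNTIMES; do
        case "$selected" in
            # "  " is the no-argument case; otherwise match this index.
            "  "|*" $idx "*) total=$((total + t)) ;;
        esac
        idx=$((idx + 1))
    done
    echo "$total"
}
```

For example, `memtest_pass_seconds` prints 432 for all eleven tests, and `memtest_pass_seconds 7 10` prints 8 for the default pair, so most of a full pass is spent in tests 6 and 9.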
Supported Parameters¶
Parameter | Syntax | Default |
---|---|---|
test0 | boolean | false |
test1 | boolean | false |
test2 | boolean | false |
test3 | boolean | false |
test4 | boolean | false |
test5 | boolean | false |
test6 | boolean | false |
test7 | boolean | true |
test8 | boolean | false |
test9 | boolean | false |
test10 | boolean | true |
test_duration | seconds | 600 |
Sample Commands¶
Run test7 and test10 for 10 minutes (this is the default):
$ dcgmi diag -r 4
Run each test serially for 1 hour then display results:
$ dcgmi diag -r 4 -p \
memtest.test0=true\;memtest.test1=true\;memtest.test2=true\;memtest.test3=true\;memtest.test4=true\;memtest.test5=true\;memtest.test6=true\;memtest.test7=true\;memtest.test8=true\;memtest.test9=true\;memtest.test10=true\;memtest.test_duration=3600
Run test0 for one minute 10 times, displaying the results each minute:
$ dcgmi diag --iterations 10 \
-r 4 -p memtest.test0=true\;memtest.test7=false\;memtest.test10=false\;memtest.test_duration=60
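Because test7 and test10 default to true, selecting a different set of tests means turning those two off explicitly, as the last command does. The following sketch builds a parameter string that sets every memtest flag explicitly, so the result does not depend on the defaults. The function name is hypothetical, and the sketch assumes the parameter names listed in the table above.

```shell
# Build a memtest parameter string that enables exactly the listed tests
# (indices 0-10), disables all others, and sets test_duration in seconds.
#   $1: test_duration in seconds, remaining arguments: test indices
memtest_params() {
    local duration="$1" selected out="" i=0
    shift
    selected=" $* "
    while [ "$i" -le 10 ]; do
        case "$selected" in
            *" $i "*) out="${out}memtest.test${i}=true;" ;;
            *)        out="${out}memtest.test${i}=false;" ;;
        esac
        i=$((i + 1))
    done
    printf '%s\n' "${out}memtest.test_duration=${duration}"
}

# Example usage, equivalent in effect to the last command above:
#   dcgmi diag --iterations 10 -r 4 -p "$(memtest_params 60 0)"
```

Quoting the command substitution replaces the backslash-escaped semicolons used in the commands above.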