Targeted Stress Plugin
Overview
The Targeted Stress plugin is part of the level 3 tests. The plugin maintains a constant stress level on the GPU by continuously queuing matrix operations and adjusting the workload to achieve the target performance.
This test is designed to stress the GPU at a specific performance target rather than maximum stress, allowing for controlled performance testing and validation of GPU stability under sustained load.
Test Description
The Targeted Stress plugin performs the following operations:
Matrix Multiplication Operations: Continuously runs GEMM (General Matrix Multiply) operations using cuBLAS, in either double precision (DGEMM) or single precision (SGEMM), as selected by the use_dgemm parameter.
Performance Targeting: Monitors actual GPU performance and adjusts the workload to maintain the specified target performance level in GFLOPS.
Multi-Stream Processing: Utilizes multiple CUDA streams per GPU to pipeline operations and maximize GPU utilization while maintaining the target stress level.
Performance Validation: Ensures that the achieved performance meets the minimum ratio threshold of the target performance, accounting for memory transfer overhead.
Health Monitoring: Continuously monitors GPU health metrics including temperature, memory errors, PCIe replays, and other standard error conditions.
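The arithmetic behind performance targeting and validation can be sketched as follows. This is an illustration only, not the plugin's implementation: the function names and the square matrix dimension `n` are hypothetical, and the plugin's actual matrix size and pacing logic are not described here. The one external fact used is the standard estimate of 2n³ floating-point operations for an n×n GEMM.

```python
def gemm_flops(n: int) -> float:
    """Approximate floating-point operations in one n x n x n GEMM (2 * n^3)."""
    return 2.0 * n ** 3


def ops_per_second_for_target(target_gflops: float, n: int) -> float:
    """GEMM operations per second needed to sustain target_gflops (hypothetical helper)."""
    return target_gflops * 1e9 / gemm_flops(n)


def achieved_gflops(ops_completed: int, n: int, elapsed_seconds: float) -> float:
    """Average GFLOPS over a measurement window."""
    return ops_completed * gemm_flops(n) / elapsed_seconds / 1e9


def meets_target(achieved: float, target_gflops: float = 100.0,
                 min_ratio: float = 0.95) -> bool:
    """Pass criterion: achieved performance is at least min_ratio of the target."""
    return achieved >= target_gflops * min_ratio


# Assumed example: n = 1024 is an arbitrary size chosen for illustration.
rate = ops_per_second_for_target(100.0, 1024)
print(f"{rate:.1f} GEMM ops/s needed for 100 GFLOPS at n=1024")
```

With the default target of 100 GFLOPS and the default ratio of 0.95, any measured average of at least 95 GFLOPS would satisfy the criterion in this sketch.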
Supported Parameters
The following table lists the global parameters for the targeted stress plugin:
| Parameter Name | Type | Default | Description |
|---|---|---|---|
| test_duration | Double | 30.0 | Duration of the test in seconds. |
| target_stress | Double | 100.0 | Target performance level in GFLOPS that the test should maintain. |
| target_perf_min_ratio | Double | 0.95 | Minimum ratio of achieved performance to target performance (0.0 to 1.0). |
| temperature_max | Double | Blank | Maximum allowed temperature in degrees Celsius during the test. |
| is_allowed | Bool | See description | Whether the targeted stress test is allowed to run. Must be “True” for the test to execute. Note: When unspecified, DCGM searches for configuration files for the specified GPU; if no configuration is found, the parameter defaults to “False”. |
| use_dgemm | Bool | True | Use double precision GEMM (DGEMM) instead of single precision (SGEMM). |
| cuda_streams_per_gpu | Double | 8.0 | Number of CUDA streams to use per GPU for pipelining operations. |
| ops_per_stream_queue | Double | 100.0 | Number of operations to queue per stream before waiting for completion. |
| max_pcie_replays | Double | 160.0 | Maximum allowed PCIe replay count during the test. |
| max_memory_clock | Double | 0.0 | Maximum allowed memory clock frequency (0.0 = no limit). |
| max_graphics_clock | Double | 0.0 | Maximum allowed graphics clock frequency (0.0 = no limit). |
| max_sbe_errors | Double | Blank | Threshold beyond which SBE (Single Bit Error) count constitutes a failure. |
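A small user-side check can catch misspelled parameter names or an out-of-range ratio before a run is launched. The sketch below is a hypothetical helper, not part of DCGM: it only encodes the parameter names listed in the table and the documented 0.0 to 1.0 range of target_perf_min_ratio, and it makes no claim about how DCGM itself validates input.

```python
# Parameter names taken from the table above.
KNOWN_PARAMETERS = {
    "test_duration", "target_stress", "target_perf_min_ratio",
    "temperature_max", "is_allowed", "use_dgemm",
    "cuda_streams_per_gpu", "ops_per_stream_queue", "max_pcie_replays",
    "max_memory_clock", "max_graphics_clock", "max_sbe_errors",
}


def check_overrides(overrides: dict) -> list:
    """Return a list of problems found in a dict of parameter overrides."""
    problems = []
    for name in overrides:
        if name not in KNOWN_PARAMETERS:
            problems.append(f"unknown parameter: {name}")
    ratio = overrides.get("target_perf_min_ratio")
    if ratio is not None and not 0.0 <= ratio <= 1.0:
        problems.append("target_perf_min_ratio must be between 0.0 and 1.0")
    return problems


print(check_overrides({"target_stres": 200.0, "target_perf_min_ratio": 1.5}))
```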
Sample Commands
Run the targeted stress test for 2 minutes:
$ dcgmi diag -r targeted_stress -p targeted_stress.test_duration=120.0
Run the targeted stress test targeting 200 GFLOPS performance:
$ dcgmi diag -r targeted_stress -p targeted_stress.target_stress=200.0
Run the targeted stress test with single precision operations:
$ dcgmi diag -r targeted_stress -p targeted_stress.use_dgemm=False
Run the targeted stress test with a relaxed performance requirement (90% of target instead of the default 95%):
$ dcgmi diag -r targeted_stress -p targeted_stress.target_perf_min_ratio=0.90
Run the targeted stress test with a temperature limit of 85°C:
$ dcgmi diag -r targeted_stress -p targeted_stress.temperature_max=85.0
Run the targeted stress test with custom stream configuration:
$ dcgmi diag -r targeted_stress -p targeted_stress.cuda_streams_per_gpu=4.0 -p targeted_stress.ops_per_stream_queue=50.0
Run the targeted stress test as part of level 3 diagnostics:
$ dcgmi diag -r 3 -p targeted_stress.test_duration=120.0
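The sample commands that select the plugin with `-r targeted_stress` follow one pattern: one `-p targeted_stress.<name>=<value>` pair per overridden parameter. When runs are scripted, that pattern can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not a DCGM API; it only builds and prints the command line and does not execute it.

```python
import shlex


def build_targeted_stress_command(overrides: dict) -> list:
    """Build a dcgmi argument list with one -p pair per override (hypothetical helper)."""
    args = ["dcgmi", "diag", "-r", "targeted_stress"]
    for name, value in overrides.items():
        args += ["-p", f"targeted_stress.{name}={value}"]
    return args


cmd = build_targeted_stress_command(
    {"cuda_streams_per_gpu": 4.0, "ops_per_stream_queue": 50.0}
)
# Print the command for review; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps each argument separate, which avoids quoting problems if the list is later handed to `subprocess.run`.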
Failure Conditions
The test will fail if achieved performance is below the minimum ratio threshold (target_perf_min_ratio, 95% by default) of the target performance (target_stress, 100 GFLOPS by default) during the test.
The test will fail if unrecoverable memory errors, temperature violations, or XIDs occur during the test, or if the SBE count exceeds the specified threshold (max_sbe_errors).
The test will fail if the PCIe replay count exceeds the specified maximum (max_pcie_replays, 160 by default) during the test.
The test will fail if memory or graphics clock frequencies exceed the specified limits (max_memory_clock and max_graphics_clock) during the test.
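The threshold-based failure conditions can be expressed as a single check over measured values. The sketch below is an assumption-laden illustration rather than the plugin's logic: the function and argument names are hypothetical, a blank default is modeled as `None` and treated as "no check" (an assumption), and unrecoverable memory errors and XIDs are left out because they are events rather than thresholds.

```python
from typing import List, Optional


def failure_reasons(
    achieved_gflops: float,
    pcie_replays: int,
    memory_clock: float,
    graphics_clock: float,
    sbe_errors: int = 0,
    temperature: Optional[float] = None,
    target_stress: float = 100.0,
    target_perf_min_ratio: float = 0.95,
    max_pcie_replays: float = 160.0,
    max_memory_clock: float = 0.0,      # 0.0 = no limit
    max_graphics_clock: float = 0.0,    # 0.0 = no limit
    max_sbe_errors: Optional[float] = None,   # blank default, assumed to mean no check
    temperature_max: Optional[float] = None,  # blank default, assumed to mean no check
) -> List[str]:
    """Return the threshold-based failure conditions that the measurements trigger."""
    reasons = []
    if achieved_gflops < target_stress * target_perf_min_ratio:
        reasons.append("performance below target_perf_min_ratio of target_stress")
    if max_sbe_errors is not None and sbe_errors > max_sbe_errors:
        reasons.append("SBE count exceeds max_sbe_errors")
    if (temperature_max is not None and temperature is not None
            and temperature > temperature_max):
        reasons.append("temperature exceeds temperature_max")
    if pcie_replays > max_pcie_replays:
        reasons.append("PCIe replay count exceeds max_pcie_replays")
    if max_memory_clock > 0.0 and memory_clock > max_memory_clock:
        reasons.append("memory clock exceeds max_memory_clock")
    if max_graphics_clock > 0.0 and graphics_clock > max_graphics_clock:
        reasons.append("graphics clock exceeds max_graphics_clock")
    return reasons


# Made-up measurements for illustration: low performance and too many replays.
print(failure_reasons(achieved_gflops=90.0, pcie_replays=200,
                      memory_clock=1200.0, graphics_clock=1400.0))
```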