Targeted Stress Plugin

Overview

The Targeted Stress plugin is part of the level 3 tests. The plugin maintains a constant stress level on the GPU by continuously queuing matrix operations and adjusting the workload to achieve the target performance.

This test is designed to stress the GPU at a specific performance target rather than maximum stress, allowing for controlled performance testing and validation of GPU stability under sustained load.

Test Description

The Targeted Stress plugin performs the following operations:

Matrix Multiplication Operations: Continuously runs GEMM (General Matrix Multiply) operations using cuBLAS, in either double precision (DGEMM) or single precision (SGEMM) as selected by the use_dgemm parameter.
Performance Targeting: Monitors actual GPU performance and adjusts the workload to maintain the specified target performance level in GFLOPS.
Multi-Stream Processing: Utilizes multiple CUDA streams per GPU to pipeline operations and maximize GPU utilization while maintaining the target stress level.
Performance Validation: Verifies that the achieved performance reaches at least the minimum ratio (target_perf_min_ratio) of the target performance, accounting for memory transfer overhead.
Health Monitoring: Continuously monitors GPU health metrics including temperature, memory errors, PCIe replays, and other standard error conditions.
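To make the performance target concrete, the sketch below converts a GFLOPS target into a GEMM launch rate. It relies on the standard estimate that multiplying an M×K matrix by a K×N matrix costs about 2·M·N·K floating-point operations. The 1024×1024 matrix size and all function names are assumptions chosen for illustration; this page does not state the dimensions the plugin uses, and the code is not the plugin's implementation.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Approximate FLOP count of one (m x k) @ (k x n) GEMM: a multiply and an add per term."""
    return 2 * m * n * k


def gemms_per_second(target_gflops: float, m: int, n: int, k: int) -> float:
    """GEMM launches per second needed to sustain target_gflops."""
    return target_gflops * 1e9 / gemm_flops(m, n, k)


def achieved_gflops(gemm_count: int, elapsed_s: float, m: int, n: int, k: int) -> float:
    """GFLOPS achieved by gemm_count GEMMs that completed in elapsed_s seconds."""
    return gemm_count * gemm_flops(m, n, k) / elapsed_s / 1e9


# Hypothetical square matrices; 100.0 is the documented default for target_stress.
DIM = 1024
rate = gemms_per_second(100.0, DIM, DIM, DIM)
print(f"{rate:.1f} GEMMs/s sustain 100 GFLOPS at {DIM}x{DIM}")

# With the documented defaults, one full round of queued work is
# cuda_streams_per_gpu * ops_per_stream_queue = 8 * 100 operations per GPU.
print(f"{8 * 100} operations queued per round")
```

A controller that measures achieved_gflops over a window and queues more or fewer operations in the next window is one plausible way to hold the target, though the plugin's actual adjustment strategy is not described here.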

Supported Parameters

The following table lists the global parameters for the targeted stress plugin:

Parameter Name	Type	Default	Description
test_duration	Double	30.0	Duration of the test in seconds.
target_stress	Double	100.0	Target performance level in GFLOPS that the test should maintain.
target_perf_min_ratio	Double	0.95	Minimum ratio of achieved performance to target performance (0.0 to 1.0).
temperature_max	Double	Blank	Maximum allowed temperature in degrees Celsius during the test.
is_allowed	Bool	See description	Whether the targeted stress test is allowed to run. Must be “True” for the test to execute. Note: When unspecified, DCGM searches for configuration files for the specified GPU; if no configuration is found, the parameter defaults to “False”.
use_dgemm	Bool	True	Use double precision GEMM (DGEMM) instead of single precision (SGEMM).
cuda_streams_per_gpu	Double	8.0	Number of CUDA streams to use per GPU for pipelining operations.
ops_per_stream_queue	Double	100.0	Number of operations to queue per stream before waiting for completion.
max_pcie_replays	Double	160.0	Maximum allowed PCIe replay count during the test.
max_memory_clock	Double	0.0	Maximum allowed memory clock frequency (0.0 = no limit).
max_graphics_clock	Double	0.0	Maximum allowed graphics clock frequency (0.0 = no limit).
max_sbe_errors	Double	Blank	Threshold beyond which SBE (Single Bit Error) count constitutes a failure.
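When scripting around the plugin, it can help to keep the documented defaults in one place and check overrides before passing them on. The following is a minimal sketch, assuming that "Blank" defaults can be represented as None and that rejecting unknown names or out-of-range ratios is desirable on the caller's side; DEFAULTS and merge_overrides are hypothetical names, and how DCGM itself validates parameters is not described on this page.

```python
# Defaults copied from the parameter table above. None stands in for "Blank";
# is_allowed has no fixed default (DCGM resolves it from configuration files).
DEFAULTS = {
    "test_duration": 30.0,
    "target_stress": 100.0,
    "target_perf_min_ratio": 0.95,
    "temperature_max": None,
    "is_allowed": None,
    "use_dgemm": True,
    "cuda_streams_per_gpu": 8.0,
    "ops_per_stream_queue": 100.0,
    "max_pcie_replays": 160.0,
    "max_memory_clock": 0.0,
    "max_graphics_clock": 0.0,
    "max_sbe_errors": None,
}


def merge_overrides(overrides: dict) -> dict:
    """Return the defaults updated with overrides, rejecting obvious mistakes."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}
    ratio = params["target_perf_min_ratio"]
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("target_perf_min_ratio must be between 0.0 and 1.0")
    return params


params = merge_overrides({"target_stress": 200.0})
print(params["target_stress"], params["test_duration"])
```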

Sample Commands

Run the targeted stress test for 2 minutes instead of the default 30 seconds:

$ dcgmi diag -r targeted_stress -p targeted_stress.test_duration=120.0

Run the targeted stress test targeting 200 GFLOPS performance:

$ dcgmi diag -r targeted_stress -p targeted_stress.target_stress=200.0

Run the targeted stress test with single precision operations:

$ dcgmi diag -r targeted_stress -p targeted_stress.use_dgemm=False

Run the targeted stress test with a relaxed performance requirement (90% of target instead of the default 95%):

$ dcgmi diag -r targeted_stress -p targeted_stress.target_perf_min_ratio=0.90

Run the targeted stress test with a temperature limit of 85°C:

$ dcgmi diag -r targeted_stress -p targeted_stress.temperature_max=85.0

Run the targeted stress test with custom stream configuration:

$ dcgmi diag -r targeted_stress -p targeted_stress.cuda_streams_per_gpu=4.0 -p targeted_stress.ops_per_stream_queue=50.0

Run the targeted stress test as part of level 3 diagnostics:

$ dcgmi diag -r 3 -p targeted_stress.test_duration=120.0
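The commands above all follow the same pattern: one -p targeted_stress.NAME=VALUE argument per parameter. A minimal sketch of assembling such a command line from a dictionary is shown below. The helper name build_dcgmi_command is hypothetical, and the code only builds and prints the arguments; it does not invoke dcgmi.

```python
import shlex


def build_dcgmi_command(params: dict, run: str = "targeted_stress") -> list:
    """Build an argv list with one -p argument per targeted_stress parameter."""
    argv = ["dcgmi", "diag", "-r", run]
    for name, value in params.items():
        argv += ["-p", f"targeted_stress.{name}={value}"]
    return argv


cmd = build_dcgmi_command({"cuda_streams_per_gpu": 4.0, "ops_per_stream_queue": 50.0})
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run on a host where dcgmi is installed; keeping it as a list avoids shell quoting problems.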

Failure Conditions

The test will fail if achieved performance falls below the minimum ratio (target_perf_min_ratio, 0.95 by default) of the target performance (target_stress, 100 GFLOPS by default) during the test. With the defaults, this means any result below 95 GFLOPS fails.
The test will fail if unrecoverable memory errors, temperature violations, or XIDs occur during the test, or if the SBE count exceeds the specified threshold (max_sbe_errors).
The test will fail if PCIe replay count exceeds the specified maximum (max_pcie_replays, 160 by default) during the test.
The test will fail if memory or graphics clock frequencies exceed specified limits (max_memory_clock and max_graphics_clock) during the test.
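The threshold-based conditions above can be summarized as a small check. This is a sketch under stated assumptions, not DCGM's logic: the Limits class, the find_failures function, and the keys of the measured dictionary are hypothetical, a blank threshold is assumed to mean "not checked", and XIDs and unrecoverable memory errors are left out because they are events rather than thresholds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Limits:
    """Thresholds with the defaults documented in the parameter table."""
    target_stress: float = 100.0
    target_perf_min_ratio: float = 0.95
    max_pcie_replays: float = 160.0
    max_memory_clock: float = 0.0            # 0.0 = no limit
    max_graphics_clock: float = 0.0          # 0.0 = no limit
    max_sbe_errors: Optional[float] = None   # blank: assumed "not checked"
    temperature_max: Optional[float] = None  # blank: assumed "not checked"


def find_failures(measured: dict, limits: Limits) -> List[str]:
    """Return the names of the documented thresholds that measured violates."""
    failures = []
    if measured["gflops"] < limits.target_perf_min_ratio * limits.target_stress:
        failures.append("performance")
    if measured["pcie_replays"] > limits.max_pcie_replays:
        failures.append("pcie_replays")
    if limits.max_memory_clock > 0.0 and measured["memory_clock"] > limits.max_memory_clock:
        failures.append("memory_clock")
    if limits.max_graphics_clock > 0.0 and measured["graphics_clock"] > limits.max_graphics_clock:
        failures.append("graphics_clock")
    if limits.max_sbe_errors is not None and measured["sbe_errors"] > limits.max_sbe_errors:
        failures.append("sbe_errors")
    if limits.temperature_max is not None and measured["temperature"] > limits.temperature_max:
        failures.append("temperature")
    return failures


# Made-up measurements for illustration: 93 GFLOPS is below 0.95 * 100 GFLOPS.
sample = {"gflops": 93.0, "pcie_replays": 12, "memory_clock": 1215,
          "graphics_clock": 1410, "sbe_errors": 0, "temperature": 71.0}
print(find_failures(sample, Limits()))
```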