Targeted Stress Plugin

Overview

The Targeted Stress plugin is part of the level 3 tests. The plugin maintains a constant stress level on the GPU by continuously queuing matrix operations and adjusting the workload to achieve the target performance.

This test is designed to stress the GPU at a specific performance target rather than maximum stress, allowing for controlled performance testing and validation of GPU stability under sustained load.

Test Description

The Targeted Stress plugin performs the following operations:

  1. Matrix Multiplication Operations: Continuously runs GEMM (General Matrix Multiply) operations using cuBLAS, in either double precision (DGEMM) or single precision (SGEMM) as selected by the use_dgemm parameter.

  2. Performance Targeting: Monitors actual GPU performance and adjusts the workload to maintain the specified target performance level in GFLOPS.

  3. Multi-Stream Processing: Utilizes multiple CUDA streams per GPU to pipeline operations and maximize GPU utilization while maintaining the target stress level.

  4. Performance Validation: Ensures that the achieved performance meets the minimum ratio threshold of the target performance, accounting for memory transfer overhead.

  5. Health Monitoring: Continuously monitors GPU health metrics including temperature, memory errors, PCIe replays, and other standard error conditions.
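The relationship between GEMM work and a GFLOPS target can be illustrated with a little arithmetic. The sketch below is not the plugin's implementation: it only assumes the standard count of 2*m*n*k floating-point operations for one m x k by k x n matrix multiply, and the matrix dimensions and function names are hypothetical, chosen for illustration.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations in one C = A @ B GEMM.

    Each of the m*n output elements needs k multiplies and k adds.
    """
    return 2 * m * n * k


def gemms_per_second_for_target(target_gflops: float, m: int, n: int, k: int) -> float:
    """GEMM completions per second needed to sustain target_gflops."""
    return target_gflops * 1e9 / gemm_flops(m, n, k)


def achieved_gflops(gemm_count: int, elapsed_s: float, m: int, n: int, k: int) -> float:
    """Average GFLOPS over a window in which gemm_count GEMMs completed."""
    return gemm_count * gemm_flops(m, n, k) / elapsed_s / 1e9


# Hypothetical 1000 x 1000 x 1000 GEMMs against the default 100 GFLOPS target.
rate = gemms_per_second_for_target(100.0, 1000, 1000, 1000)
print(f"GEMMs per second needed: {rate}")
print(f"Achieved over 30 s: {achieved_gflops(1500, 30.0, 1000, 1000, 1000)} GFLOPS")
```

Under these assumptions, a workload that falls behind the target would queue more operations, and one that runs ahead would queue fewer.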

Supported Parameters

The following table lists the global parameters for the targeted stress plugin:

Parameter Name         Type    Default          Description
---------------------  ------  ---------------  -----------
test_duration          Double  30.0             Duration of the test in seconds.
target_stress          Double  100.0            Target performance level in GFLOPS that the test should maintain.
target_perf_min_ratio  Double  0.95             Minimum ratio of achieved performance to target performance (0.0 to 1.0).
temperature_max        Double  Blank            Maximum allowed temperature in degrees Celsius during the test.
is_allowed             Bool    See description  Whether the targeted stress test is allowed to run. Must be “True” for the test to execute. Note: When unspecified, DCGM searches for configuration files for the specified GPU; if no configuration is found, the parameter defaults to “False”.
use_dgemm              Bool    True             Use double precision GEMM (DGEMM) instead of single precision (SGEMM).
cuda_streams_per_gpu   Double  8.0              Number of CUDA streams to use per GPU for pipelining operations.
ops_per_stream_queue   Double  100.0            Number of operations to queue per stream before waiting for completion.
max_pcie_replays       Double  160.0            Maximum allowed PCIe replay count during the test.
max_memory_clock       Double  0.0              Maximum allowed memory clock frequency (0.0 = no limit).
max_graphics_clock     Double  0.0              Maximum allowed graphics clock frequency (0.0 = no limit).
max_sbe_errors         Double  Blank            Threshold beyond which SBE (Single Bit Error) count constitutes a failure.

Sample Commands

Run a quick targeted stress test for 2 minutes:

$ dcgmi diag -r targeted_stress -p targeted_stress.test_duration=120.0

Run the targeted stress test targeting 200 GFLOPS performance:

$ dcgmi diag -r targeted_stress -p targeted_stress.target_stress=200.0

Run the targeted stress test with single precision operations:

$ dcgmi diag -r targeted_stress -p targeted_stress.use_dgemm=False

Run the targeted stress test with a relaxed performance requirement (90% of target instead of the default 95%):

$ dcgmi diag -r targeted_stress -p targeted_stress.target_perf_min_ratio=0.90

Run the targeted stress test with temperature limit of 85°C:

$ dcgmi diag -r targeted_stress -p targeted_stress.temperature_max=85.0

Run the targeted stress test with custom stream configuration:

$ dcgmi diag -r targeted_stress -p targeted_stress.cuda_streams_per_gpu=4.0 -p targeted_stress.ops_per_stream_queue=50.0

Run the targeted stress test as part of level 3 diagnostics:

$ dcgmi diag -r 3 -p targeted_stress.test_duration=120.0
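
The parameter overrides in the commands above all take the form -p targeted_stress.<parameter>=<value>, with one -p flag per parameter. When the command is assembled by a script, a small helper along the following lines may be convenient. This is a minimal sketch: build_dcgmi_command is a hypothetical name, not part of DCGM, and the helper only constructs the argument list without running it.

```python
import shlex


def build_dcgmi_command(overrides: dict) -> list:
    """Build a dcgmi argument list with one -p flag per parameter override.

    Hypothetical helper: mirrors the flag pattern of the sample commands.
    """
    cmd = ["dcgmi", "diag", "-r", "targeted_stress"]
    for name, value in overrides.items():
        cmd += ["-p", f"targeted_stress.{name}={value}"]
    return cmd


cmd = build_dcgmi_command({"cuda_streams_per_gpu": 4.0, "ops_per_stream_queue": 50.0})
print(shlex.join(cmd))
```

The resulting list can be passed directly to subprocess.run on a host where dcgmi is installed.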

Failure Conditions

  • The test will fail if achieved performance is below the minimum ratio threshold (target_perf_min_ratio, 95% by default) of the target performance (target_stress, 100 GFLOPS by default) during the test.

  • The test will fail if unrecoverable memory errors, temperature violations, or XIDs occur during the test, or if the SBE count exceeds the specified threshold (max_sbe_errors).

  • The test will fail if PCIe replay count exceeds the specified maximum (max_pcie_replays, 160 by default) during the test.

  • The test will fail if memory or graphics clock frequencies exceed specified limits (max_memory_clock and max_graphics_clock) during the test.
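
The numeric thresholds above can be made concrete with a short calculation. With the defaults, the performance floor is 100 GFLOPS x 0.95 = 95 GFLOPS. The sketch below is illustrative only and is not how DCGM evaluates results internally: the function name and the measurement arguments are hypothetical, "Blank" defaults are modelled as None, and temperature violations, XIDs, and unrecoverable memory errors are left out.

```python
from typing import List, Optional


def targeted_stress_failures(
    achieved_gflops: float,
    pcie_replays: float = 0.0,
    sbe_errors: float = 0.0,
    memory_clock: float = 0.0,
    graphics_clock: float = 0.0,
    *,
    target_stress: float = 100.0,
    target_perf_min_ratio: float = 0.95,
    max_pcie_replays: float = 160.0,
    max_sbe_errors: Optional[float] = None,  # "Blank" default: no threshold set
    max_memory_clock: float = 0.0,           # 0.0 = no limit
    max_graphics_clock: float = 0.0,         # 0.0 = no limit
) -> List[str]:
    """Return the names of the threshold checks that the measurements violate."""
    failures = []
    if achieved_gflops < target_stress * target_perf_min_ratio:
        failures.append("performance")
    if pcie_replays > max_pcie_replays:
        failures.append("pcie_replays")
    if max_sbe_errors is not None and sbe_errors > max_sbe_errors:
        failures.append("sbe_errors")
    if max_memory_clock > 0.0 and memory_clock > max_memory_clock:
        failures.append("memory_clock")
    if max_graphics_clock > 0.0 and graphics_clock > max_graphics_clock:
        failures.append("graphics_clock")
    return failures


print(targeted_stress_failures(96.0))                    # above the 95 GFLOPS floor
print(targeted_stress_failures(94.0, pcie_replays=161))  # below floor, too many replays
```

An empty list means none of the modelled thresholds were violated.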