Targeted Power Plugin
Overview
The Targeted Power plugin is part of the level 3 and higher tests. Its goal is to drive a GPU towards its Thermal Design Power (TDP) and sustain that power draw throughout the test, in order to ensure that the GPU can perform under a power load. This is achieved by using matrix sizes and GEMM operations that are empirically determined to sustain the required power load on the GPU.
Test Description
This test’s core purpose is to sustain a high level of power usage. It relies on CUDA and performs large matrix multiplications simultaneously on each GPU in order to keep the GPUs busy and drawing power. Each GPU has a large workload that is sustained throughout the test; the workload does not pulse.
Supported Parameters
The following table lists the global parameters for the Targeted Power plugin:
| Parameter Name | Type | Default | Description |
|---|---|---|---|
| test_duration | Double | 120.0 | This is the time in seconds that the test should run. |
| target_power | Double | Defaults to the GPU’s Thermal Design Power (TDP) - 1. For example, a GPU with a max power draw of 400.0 W would have a target of 399.0 watts. | This is the target power that the test is attempting to achieve for this device. |
| target_power_min_ratio | Double | 75.0 | The minimum percentage of the target power that must be reached for the test to be considered passing. |
| use_dgemm | Bool | True | If set to true, the test will use 64 bit precision in its matrix multiplications instead of 32 bit. |
| is_allowed | Bool | False | Specifies whether or not this test is allowed to run. |
Sample Commands
Run the power test for 10 minutes:
$ dcgmi diag -r targeted_power -p targeted_power.test_duration=600.0
Run the level 3 diagnostic with a 5 minute targeted power test:
$ dcgmi diag -r 3 -p targeted_power.test_duration=300.0
Run the targeted power test targeting 200 W of power usage:
$ dcgmi diag -r targeted_power -p targeted_power.target_power=200.0
Run the level 4 test, skipping targeted power:
$ dcgmi diag -r 4 -p targeted_power.is_allowed=false
Run the targeted power test, using single precision (32 bit):
$ dcgmi diag -r targeted_power -p targeted_power.use_dgemm=false
Failure Conditions
The test will fail if the GPU does not reach at least target_power_min_ratio (75% by default) of the target_power (TDP - 1 by default) during the test.
The test will fail if unrecoverable memory errors, temperature violations, or XIDs occur during the test.
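The power-related failure condition comes down to simple arithmetic, which the following Python sketch makes explicit. The function names are hypothetical. The sketch also assumes that the observed value is the highest power draw the GPU reached during the test; this document does not say how the plugin samples or aggregates power readings.

```python
def default_target_power(tdp_watts):
    """Default target_power: the GPU's TDP minus 1 W."""
    return tdp_watts - 1.0


def minimum_passing_power(target_power_watts, min_ratio_percent=75.0):
    """Lowest power draw in watts that satisfies target_power_min_ratio."""
    return target_power_watts * min_ratio_percent / 100.0


def power_check_passes(observed_watts, target_power_watts, min_ratio_percent=75.0):
    """True if the observed power reaches the required share of the target.

    Assumption: observed_watts is the highest power reached during the test.
    """
    return observed_watts >= minimum_passing_power(
        target_power_watts, min_ratio_percent
    )


if __name__ == "__main__":
    target = default_target_power(400.0)      # 399.0 W for a 400 W GPU
    print(minimum_passing_power(target))      # 299.25
    print(power_check_passes(310.0, target))  # True
    print(power_check_passes(250.0, target))  # False
```

With the default parameters, a GPU with a 400 W TDP has a 399 W target and must reach at least 299.25 W for the power check to pass.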