Targeted Power Plugin

Overview

The Targeted Power plugin is part of the level 3 and higher tests. Its goal is to drive a GPU toward its Thermal Design Power (TDP) and sustain that level throughout the test, in order to verify that the GPU can perform under a power load. This is achieved by using matrix sizes and GEMMs (general matrix multiplications) that are empirically determined to sustain the required power load on the GPU.

Test Description

This test’s core purpose is to sustain a high level of power usage. It relies on CUDA and performs large matrix multiplications simultaneously on each GPU in order to keep the GPUs busy and drawing power. Each GPU has a large workload that is sustained throughout the test; the workload does not pulse.

Supported Parameters

The following table lists the global parameters for the Targeted Power plugin:

| Parameter Name | Type | Default | Description |
|---|---|---|---|
| test_duration | Double | 120.0 | This is the time in seconds that the test should run. |
| target_power | Double | Defaults to the GPU’s Thermal Design Power (TDP) - 1. For example, a GPU with a max power draw of 400.0 W would have a target of 399.0 watts. | This is the target power that the test is attempting to achieve for this device. |
| target_power_min_ratio | Double | 75.0 | The minimum percentage of the target power that must be reached for the test to be considered passing. |
| use_dgemm | Bool | True | If set to true, the test will use 64 bit precision in its matrix multiplications instead of 32 bit. |
| is_allowed | Bool | False | Specifies whether or not this test is allowed to run. |

Sample Commands

Run the power test for 10 minutes:

$ dcgmi diag -r targeted_power -p targeted_power.test_duration=600.0

Run the level 3 diagnostic with a 5 minute targeted power test:

$ dcgmi diag -r 3 -p targeted_power.test_duration=300.0

Run the targeted power test with a target of 200 W of power usage:

$ dcgmi diag -r targeted_power -p targeted_power.target_power=200.0

Run the level 4 test, skipping targeted power:

$ dcgmi diag -r 4 -p targeted_power.is_allowed=false

Run the targeted power test, using single precision (32 bit):

$ dcgmi diag -r targeted_power -p targeted_power.use_dgemm=false
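When these commands are generated from a script, the argument list can be assembled programmatically. The following is a minimal sketch, not part of DCGM: the helper name build_diag_command is hypothetical, and joining several overrides with a semicolon inside a single -p argument is an assumption that should be checked against the help output of the installed dcgmi version.

```python
def build_diag_command(run, overrides):
    """Return an argv list for `dcgmi diag` with targeted_power overrides.

    `run` is a diagnostic level (for example 3) or a test name.
    `overrides` maps targeted_power parameter names to values.
    This helper is hypothetical and only builds the command; it does not run it.
    """
    def fmt(value):
        # bool is checked first so True/False render as true/false.
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)

    argv = ["dcgmi", "diag", "-r", str(run)]
    if overrides:
        # Assumption: multiple parameters are separated by ';' in one -p value.
        params = ";".join(
            f"targeted_power.{name}={fmt(value)}"
            for name, value in overrides.items()
        )
        argv += ["-p", params]
    return argv


if __name__ == "__main__":
    cmd = build_diag_command("targeted_power", {"test_duration": 600.0})
    print(" ".join(cmd))
    # prints: dcgmi diag -r targeted_power -p targeted_power.test_duration=600.0
```

Returning a list rather than a single string lets the result be passed directly to subprocess.run without shell quoting, which matters if the semicolon-separated form is used.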

Failure Conditions

  • The test will fail if the GPU cannot reach at least target_power_min_ratio (75% by default) of the target_power (TDP - 1 by default) during the test.

  • The test will fail if unrecoverable memory errors, temperature violations, or XIDs occur during the test.
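
The power threshold in the first failure condition can be worked out by hand. The sketch below is illustrative only and is not the plugin's implementation: the function names are hypothetical, target_power_min_ratio is treated as a percentage as the parameter description states, and how the plugin aggregates power readings during the test is not specified here, so the comparison simply takes a single "reached" value.

```python
def minimum_passing_power(tdp_watts, target_power=None, target_power_min_ratio=75.0):
    """Return the lowest power (in watts) that still satisfies the ratio check.

    Hypothetical helper mirroring the documented defaults:
    target_power defaults to TDP - 1, and the ratio defaults to 75 percent.
    """
    if target_power is None:
        target_power = tdp_watts - 1.0
    # Assumption: the ratio is expressed as a percentage of target_power.
    return target_power * target_power_min_ratio / 100.0


def passes_power_check(reached_watts, tdp_watts, **kwargs):
    """Return True if the reached power meets the minimum passing power."""
    return reached_watts >= minimum_passing_power(tdp_watts, **kwargs)


if __name__ == "__main__":
    # A 400 W GPU with defaults: target is 399 W, so the minimum is 299.25 W.
    print(minimum_passing_power(400.0))                      # prints 299.25
    # With target_power=200.0 the minimum drops to 150 W.
    print(minimum_passing_power(400.0, target_power=200.0))  # prints 150.0
```

For a GPU with a 400.0 W TDP and default parameters, the test therefore needs to reach at least 299.25 W to avoid failing on power alone.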