Pulse Test Diagnostic

Overview

The Pulse Test is part of the new level 4 tests. It deliberately fluctuates power usage to create spikes in current flow on the board, verifying that the power supply is fully functional and can handle wide fluctuations in current.

Test Description

By default, the test runs kernels with high transiency in order to create spikes in the current running to the GPU. The default parameters have been verified, through oscilloscope measurements, to produce worst-case failure scenarios.

The test iteratively runs different kernels while tweaking internal parameters to ensure that spikes are produced. Work across GPUs is synchronized to create extra stress on the power supply.

Supported Parameters

Parameter         Description                                  Default
---------         -----------                                  -------
test_duration     Seconds for an internal step, not full time  500
kernel            Kernel to execute                            sgemm
exit_on_error     Exit on error                                0
internal_loops    Kernel calls between checks                  1024
alpha             Alpha                                        2.0
beta              Beta                                         -1.0
waves             Minimum saturation factor                    1
k_size            K value                                      4096
min_k_size        Minimum k schmoo value                       32
freq0             Frequency in Hz                              22000
duty0             Duty as a fraction                           0.5
freq1             Frequency in Hz                              22000
duty1             Duty as a fraction                           0.5
sync_timeout      Wait time when syncing GPUs                  10000
random_seed       Random seed                                  0xDEADCAFE
matrix_size_mode  standard, max_alloc, square, forced          standard
force_m           Forced M value                               64
force_n           Forced N value                               64
inject_errors     Number of errors to inject in test results   0
debug0            Debug flag                                   0
check_mode        crc or diff                                  diff
use_curand        Use curand for random numbers                1
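
These parameters are passed to dcgmi diag with -p, each prefixed with the test name and separated by semicolons, as in the Sample Commands section of this document. The following is a minimal sketch of a shell helper that assembles such a string. The name pulse_params is hypothetical and is not part of DCGM.

```shell
# Hypothetical helper (not part of DCGM): join name=value pairs into the
# "pulse_test.<name>=<value>;..." form that dcgmi diag -p takes.
pulse_params() {
    out=""
    for pair in "$@"; do
        # Prefix each pair with the test name; separate pairs with ';'.
        if [ -z "$out" ]; then
            out="pulse_test.$pair"
        else
            out="$out;pulse_test.$pair"
        fi
    done
    printf '%s\n' "$out"
}

# Usage (requires DCGM on the host):
#   dcgmi diag -r pulse_test -p "$(pulse_params freq0=5000 test_duration=180)"
```

Quoting the command substitution matters, because an unquoted semicolon would be treated by the shell as a command separator.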

Note

In some cases with DCGM 2.4 and DCGM 3.0, users may encounter the following issue when running the Pulse test:

| Pulse Test | Fail - All |
| Warning | GPU 0There was an internal error during the t |
| | est: 'The pulse test exited with non-zero sta |
| | tus 1', GPU 0There was an internal error duri |
| | ng the test: 'The pulse test reported the err |
| | or: Exception raised during execution: Faile |
| | d opening file ubergemm.log for writing: Perm |
| | ission denied terminate called after throwing |
| | an instance of 'boost::wrapexcept<boost::pro |
| | perty_tree::xml_parser::xml_parser_error>' |
| | what(): result.xml: cannot open file ' |

When running GPU diagnostics, DCGM by default drops privileges and uses an unprivileged service account to run the diagnostics. If the service account does not have write access to the directory where the diagnostics are run, users may encounter this issue. To summarize, the issue occurs when both of these conditions are true:

  1. The nvidia-dcgm service is active and the nv-hostengine process is running (and no changes have been made to DCGM’s default installation configuration).

  2. The user attempts to run dcgmi diag -r 4. In this case, dcgmi diag connects to the running nv-hostengine (which was started by default under /root), and thus the Pulse test is unable to create any logs.
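
Both conditions can be checked from a shell before running the diagnostic. The following is a rough sketch, not an official DCGM procedure: can_write_dir is a hypothetical helper that tests write access for whichever user runs it, and the commented lines assume a Linux host with pgrep and sudo available and a single nv-hostengine process.

```shell
# Hypothetical helper: succeeds if directory $1 exists and is writable
# by the user running this function.
can_write_dir() {
    [ -d "$1" ] && [ -w "$1" ]
}

# Usage sketch (assumptions: Linux /proc, pgrep, sudo, one nv-hostengine):
#   systemctl is-active nvidia-dcgm                    # condition 1
#   cwd=$(sudo readlink "/proc/$(pgrep -x nv-hostengine)/cwd")
#   # Apply the same test as the nvidia-dcgm service account:
#   sudo -u nvidia-dcgm sh -c '[ -d "$1" ] && [ -w "$1" ]' sh "$cwd" \
#       || echo "nvidia-dcgm cannot write to $cwd"
```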

This issue will be fixed in a future release of DCGM. In the meantime, users can do either of the following to work around the issue:

  1. Stop the nvidia-dcgm service before running the pulse_test:

    $ sudo systemctl stop nvidia-dcgm
    

    Now run the pulse_test:

    $ dcgmi diag -r pulse_test
    

    Restart the nvidia-dcgm service once the diagnostics are completed:

    $ sudo systemctl start nvidia-dcgm
    
  2. Edit the systemd unit service file to include a WorkingDirectory option, so that the service is started in a location writable by the nvidia-dcgm user (make sure that the directory shown in the example below, /tmp/dcgm-temp, has been created):

    [Service]
    
     ...
    
     WorkingDirectory=/tmp/dcgm-temp
     ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm
    
     ...
    

    Reload the systemd configuration and start the nvidia-dcgm service:

    $ sudo systemctl daemon-reload
    
    $ sudo systemctl start nvidia-dcgm
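
The second workaround only helps if the working directory exists and is writable by the service account before the service starts. The following is a minimal sketch for preparing it, assuming the /tmp/dcgm-temp path and the nvidia-dcgm account from the example above. The name prepare_workdir is hypothetical, and changing ownership requires root.

```shell
# Hypothetical helper: create directory $1 and, if an owner is given as
# $2, hand the directory to that account. Succeeds if the directory
# exists afterwards.
prepare_workdir() {
    dir=$1
    owner=${2:-}
    mkdir -p "$dir" || return 1
    if [ -n "$owner" ]; then
        # chown to another account needs root privileges.
        chown "$owner" "$dir" || return 1
    fi
    [ -d "$dir" ]
}

# Usage (as root, before starting the service):
#   prepare_workdir /tmp/dcgm-temp nvidia-dcgm
```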
    

Sample Commands

Run the entire diagnostic suite, including the pulse test:

dcgmi diag -r 4

Run just the pulse test:

dcgmi diag -r pulse_test

Run just the pulse test, but at a lower frequency:

dcgmi diag -r pulse_test -p pulse_test.freq0=3000

Run just the pulse test at a lower frequency and for a shorter time:

dcgmi diag -r pulse_test -p "pulse_test.freq0=5000;pulse_test.test_duration=180"
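
To get a feel for what the frequency values in these commands mean in time terms, a frequency in Hz can be converted to a period, and the duty fraction to an active time per period. This sketch assumes that duty has the usual duty-cycle meaning (the fraction of each period spent under load); the document does not spell this out, and pulse_timing is a hypothetical helper.

```shell
# Hypothetical helper: print the pulse period and the active time per
# period, both in microseconds, for a frequency in Hz ($1) and a duty
# fraction ($2). Assumes duty is the fraction of each period under load.
pulse_timing() {
    awk -v freq="$1" -v duty="$2" 'BEGIN {
        period_us = 1000000 / freq
        printf "%.2f %.2f\n", period_us, period_us * duty
    }'
}

pulse_timing 22000 0.5   # default freq0 and duty0
pulse_timing 3000 0.5    # lower frequency, default duty
```

Under that assumption, the default 22000 Hz corresponds to a period of about 45.45 microseconds, while 3000 Hz stretches it to about 333.33 microseconds.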

Failure Conditions

  • The pulse test will fail if the power supply unit cannot handle the spikes in the current.

  • It will also fail if unrecoverable memory errors, temperature violations, or XIDs occur during the test.
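
For scripted runs, a failure can be picked out of the text output. The following is a minimal sketch that assumes the result row looks like the "| Pulse Test | Fail - All |" line shown in the Note earlier in this document; pulse_test_failed is a hypothetical helper, and the exact row layout may differ between DCGM versions.

```shell
# Hypothetical helper: read dcgmi diag text output on stdin and succeed
# if the "Pulse Test" result row reports a failure.
pulse_test_failed() {
    grep 'Pulse Test' | grep -q 'Fail'
}

# Usage (requires DCGM on the host):
#   dcgmi diag -r pulse_test | pulse_test_failed && echo "Pulse test failed"
```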