Pulse Test Diagnostic
Overview
The Pulse Test is part of the new level 4 tests. It deliberately fluctuates power usage to create spikes in current flow on the board, verifying that the power supply is fully functional and can handle wide fluctuations in current.
Test Description
By default, the test runs kernels with high transiency in order to create spikes in the current running to the GPU. Default parameters have been verified to create worst-case scenario failures by measuring with oscilloscopes.
The test iteratively runs different kernels while tweaking internal parameters to ensure that spikes are produced. Work across GPUs is synchronized to create extra stress on the power supply.
Parameter | Description | Default |
---|---|---|
test_duration | Seconds to spend on an iteration. This is not the exact amount of time the test will take. | 60 |
patterns | A comma-separated list of pattern indices the pulse test should use. Valid indices depend on the type of SKU. Hopper: 0-22. Ampere / Volta / Ada: 0-20. | All |
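The index ranges in the table can be checked before a run. The following is a minimal sketch, assuming the architecture is supplied by the caller as a lowercase name; `max_pattern_index` and `validate_patterns` are hypothetical helper names and are not part of DCGM.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of DCGM): validate a comma-separated
# pattern list against the index ranges listed in the parameter table.

max_pattern_index() {
    # Print the highest valid pattern index for an architecture name.
    case "$1" in
        hopper)           echo 22 ;;
        ampere|volta|ada) echo 20 ;;
        *)                return 1 ;;
    esac
}

validate_patterns() {
    # Usage: validate_patterns <arch> <comma-separated indices>
    local arch="$1" list="$2" max idx
    local -a indices
    max="$(max_pattern_index "$arch")" || {
        echo "unknown architecture: $arch" >&2
        return 2
    }
    [ -n "$list" ] || return 1
    # Split on commas without triggering filename expansion.
    IFS=',' read -r -a indices <<< "$list"
    for idx in "${indices[@]}"; do
        case "$idx" in
            ''|*[!0-9]*)
                echo "not a non-negative integer: '$idx'" >&2
                return 1 ;;
        esac
        if [ "$idx" -gt "$max" ]; then
            echo "index $idx exceeds maximum $max for $arch" >&2
            return 1
        fi
    done
    return 0
}
```

For example, `validate_patterns hopper "0,5,22"` succeeds, while `validate_patterns ampere "21"` returns a non-zero status.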
Note
In some cases with DCGM 2.4 and DCGM 3.0, users may encounter the following issue with running the Pulse test:
| Pulse Test | Fail - All |
| Warning | GPU 0There was an internal error during the t |
| | est: 'The pulse test exited with non-zero sta |
| | tus 1', GPU 0There was an internal error duri |
| | ng the test: 'The pulse test reported the err |
| | or: Exception raised during execution: Faile |
| | d opening file ubergemm.log for writing: Perm |
| | ission denied terminate called after throwing |
| | an instance of 'boost::wrapexcept<boost::pro |
| | perty_tree::xml_parser::xml_parser_error>' |
| | what(): result.xml: cannot open file ' |
When running GPU diagnostics, DCGM by default drops privileges and uses an unprivileged service account to run the diagnostics. If the service account does not have write access to the directory where the diagnostics are run, users may encounter this issue. To summarize, the issue happens when both of these conditions are true:
The nvidia-dcgm service is active and the nv-hostengine process is running (and no changes have been made to DCGM’s default install configuration).
The user attempts to run dcgmi diag -r 4. In this case, dcgmi diag connects to the running nv-hostengine (which was started by default under /root), and thus the Pulse test is unable to create any logs.
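To see whether a host matches these conditions, the service state and the working directory of the running host engine can be inspected. This is a rough sketch, assuming a Linux host with systemd, `pgrep`, and `/proc`; `process_cwd` and `check_pulse_log_conditions` are hypothetical helper names, and reading another user's `/proc/<pid>/cwd` typically requires root.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of DCGM).

# Print the working directory of a process via Linux /proc.
# Prints nothing if the link cannot be read.
process_cwd() {
    readlink "/proc/$1/cwd" 2>/dev/null
}

# Report the service state and the working directory of nv-hostengine.
check_pulse_log_conditions() {
    local pid cwd
    if ! systemctl is-active --quiet nvidia-dcgm 2>/dev/null; then
        echo "nvidia-dcgm service is not active"
        return 0
    fi
    pid="$(pgrep -x nv-hostengine | head -n 1)"
    if [ -z "$pid" ]; then
        echo "nv-hostengine is not running"
        return 0
    fi
    cwd="$(process_cwd "$pid")"
    echo "nv-hostengine (pid $pid) working directory: ${cwd:-unknown}"
}
```

If the reported working directory is one the service account cannot write to, the error shown in the note is the likely outcome.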
This issue will be fixed in a future release of DCGM. In the meantime, users can do either of the following to work around the issue:
Stop the nvidia-dcgm service before running the pulse_test:
$ sudo systemctl stop nvidia-dcgm
Now run the pulse_test:
$ dcgmi diag -r pulse_test
Restart the nvidia-dcgm service once the diagnostics are completed:
$ sudo systemctl restart nvidia-dcgm
Edit the systemd unit service file to include a WorkingDirectory option, so that the service is started in a location writeable by the nvidia-dcgm user (be sure that the directory shown in the example below, /tmp/dcgm-temp, is created):
[Service]
...
WorkingDirectory=/tmp/dcgm-temp
ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm
...
Reload the systemd configuration and start the nvidia-dcgm service:
$ sudo systemctl daemon-reload
$ sudo systemctl start nvidia-dcgm
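The first workaround (stop the service, run the test, restart the service) can be wrapped so that the service is restarted even when the diagnostic fails. This is a minimal sketch rather than an official script; `run_cmd`, `run_pulse_with_service_stopped`, and the `DRY_RUN` variable are hypothetical names introduced here for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of DCGM).
# Set DRY_RUN=1 to print the commands instead of executing them.

run_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run_pulse_with_service_stopped() {
    local status=0
    run_cmd sudo systemctl stop nvidia-dcgm || return 1
    # Remember the diagnostic's exit status, but continue so that
    # the service is restarted regardless of the outcome.
    run_cmd dcgmi diag -r pulse_test || status=$?
    run_cmd sudo systemctl restart nvidia-dcgm || return 1
    return "$status"
}
```

Running `DRY_RUN=1 run_pulse_with_service_stopped` prints the three commands in order without touching the service, which is a convenient way to review the sequence first.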
Sample Commands
Run the entire diagnostic suite, including the pulse test:
$ dcgmi diag -r 4
Run just the pulse test:
$ dcgmi diag -r pulse_test
Run just the pulse test, but at a lower frequency:
$ dcgmi diag -r pulse_test -p pulse_test.freq0=3000
Run just the pulse test at a lower frequency and with test_duration set to 180 seconds (the default is 60):
$ dcgmi diag -r pulse_test -p "pulse_test.freq0=5000;pulse_test.test_duration=180"
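When several parameters are combined, the `-p` argument follows the `pulse_test.<name>=<value>` form, joined with semicolons, as in the last command above. The sketch below builds that string from name=value pairs; `pulse_params` is a hypothetical helper, not a DCGM tool, and it does not check that the parameter names are valid.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DCGM): join name=value pairs into
# the "pulse_test.<name>=<value>;..." form used with the -p option.
pulse_params() {
    local out="" pair
    for pair in "$@"; do
        # Prepend the previous content and a semicolon only when out is non-empty.
        out="${out:+$out;}pulse_test.$pair"
    done
    printf '%s\n' "$out"
}
```

With this helper, the last sample command could be written as `dcgmi diag -r pulse_test -p "$(pulse_params freq0=5000 test_duration=180)"`.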
Failure Conditions
The pulse test will fail if the power supply unit cannot handle the spikes in the current.
It will also fail if unrecoverable memory errors, temperature violations, or XIDs occur during the test.
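For scripted runs, the Pulse Test result can be pulled out of the table that `dcgmi diag` prints. This is a rough sketch, assuming the output uses the pipe-delimited row layout shown in the note earlier in this section (for example `| Pulse Test | Fail - All |`); `pulse_test_status` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DCGM): read dcgmi diag table output
# on stdin and print the result cell of the "Pulse Test" row.
pulse_test_status() {
    awk -F'|' '$2 ~ /Pulse Test/ {
        # Trim surrounding whitespace from the result column.
        gsub(/^[ \t]+|[ \t]+$/, "", $3)
        print $3
        exit
    }'
}
```

A typical use is `dcgmi diag -r pulse_test | pulse_test_status`, which prints only the result cell and prints nothing if no Pulse Test row is present.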