Pulse Test Diagnostic

Overview

The Pulse Test is part of the new level 4 tests. It deliberately fluctuates power usage to create spikes in current flow on the board, verifying that the power supply is fully functional and can handle wide fluctuations in current.

Test Description

By default, the test runs kernels with high transiency in order to create spikes in the current flowing to the GPU. The default parameters have been verified with oscilloscope measurements to produce worst-case failure scenarios.

The test iteratively runs different kernels while tweaking internal parameters to ensure that spikes are produced. Work across GPUs is synchronized to create extra stress on the power supply.


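| Parameter     | Description                                                                                                                                                               | Default |
|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|
| test_duration | Seconds to spend on an iteration. This is not the exact amount of time the test will take.                                                                                | 60      |
| patterns      | Specify a comma-separated list of pattern indices the pulse test should use. Valid indices depend on the type of SKU. Hopper: 0-22; Ampere / Volta / Ada: 0-20.           | All     |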
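
The index limits listed for the patterns parameter can be checked before a run. The sketch below is a hypothetical helper (pulse_patterns_arg is not part of DCGM) that validates indices against those limits and prints a parameter string. It assumes that patterns is passed through -p in the same pulse_test.<name>=<value> form used by the other parameters in this document.

```shell
# Hypothetical helper: build a -p argument for the "patterns" parameter.
# Index limits come from the parameter table: Hopper 0-22, others 0-20.
# Assumption: patterns uses the same pulse_test.<name>=<value> form as
# the other pulse test parameters.
pulse_patterns_arg() {
    arch=$1
    shift
    case "$arch" in
        hopper) max=22 ;;
        ampere|volta|ada) max=20 ;;
        *) echo "unknown architecture: $arch" >&2; return 1 ;;
    esac
    list=""
    for idx in "$@"; do
        # Accept only non-negative integers.
        case "$idx" in
            ''|*[!0-9]*) echo "not a pattern index: $idx" >&2; return 1 ;;
        esac
        if [ "$idx" -gt "$max" ]; then
            echo "pattern index $idx exceeds $max for $arch" >&2
            return 1
        fi
        # Append with a comma separator after the first entry.
        list="${list:+$list,}$idx"
    done
    [ -n "$list" ] || { echo "no pattern indices given" >&2; return 1; }
    printf 'pulse_test.patterns=%s\n' "$list"
}
```

With the helper defined, a run restricted to three patterns on a Hopper system might look like: dcgmi diag -r pulse_test -p "$(pulse_patterns_arg hopper 0 5 22)"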

Note

In some cases with DCGM 2.4 and DCGM 3.0, users may encounter the following issue when running the Pulse test:

| Pulse Test | Fail - All |
| Warning | GPU 0There was an internal error during the t |
| | est: 'The pulse test exited with non-zero sta |
| | tus 1', GPU 0There was an internal error duri |
| | ng the test: 'The pulse test reported the err |
| | or: Exception raised during execution: Faile |
| | d opening file ubergemm.log for writing: Perm |
| | ission denied terminate called after throwing |
| | an instance of 'boost::wrapexcept<boost::pro |
| | perty_tree::xml_parser::xml_parser_error>' |
| | what(): result.xml: cannot open file ' |

When running GPU diagnostics, by default, DCGM drops privileges and uses an unprivileged service account to run the diagnostics. If the service account does not have write access to the directory where diagnostics are run, then users may encounter this issue. To summarize, the issue happens when both of these conditions are true:

  1. The nvidia-dcgm service is active and the nv-hostengine process is running (and no changes have been made to DCGM’s default install configurations)

  2. The user attempts to run dcgmi diag -r 4. In this case, dcgmi diag connects to the running nv-hostengine (which was started by default under /root) and thus the Pulse test is unable to create any logs.
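
The first condition can be checked from a shell before starting a diagnostic run. The following is a minimal sketch, assuming a systemd host with pgrep installed; the function name hostengine_service_running is made up for illustration.

```shell
# Sketch: succeed only if the nvidia-dcgm service is active and an
# nv-hostengine process is running (condition 1 above).
# The function name is hypothetical; systemctl and pgrep are assumed
# to be available on the host.
hostengine_service_running() {
    systemctl is-active --quiet nvidia-dcgm || return 1
    pgrep -x nv-hostengine >/dev/null || return 1
    return 0
}
```

If the function succeeds, a subsequent dcgmi diag run will connect to that running nv-hostengine, so one of the workarounds described in this note is worth applying first.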

This issue will be fixed in a future release of DCGM. In the meantime, users can do either of the following to work around the issue:

  1. Stop the nvidia-dcgm service before running the pulse_test

    $ sudo systemctl stop nvidia-dcgm
    

    Now run the pulse_test:

    $ dcgmi diag -r pulse_test
    

    Restart the nvidia-dcgm service once the diagnostics are completed:

    $ sudo systemctl restart nvidia-dcgm
    
  2. Edit the systemd unit service file to include a WorkingDirectory option, so that the service is started in a location writable by the nvidia-dcgm user. Make sure that the directory shown in the example below (/tmp/dcgm-temp) exists:

    [Service]
    
     ...
    
     WorkingDirectory=/tmp/dcgm-temp
     ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm
    
     ...
    

    Reload the systemd configuration and start the nvidia-dcgm service:

    $ sudo systemctl daemon-reload
    
    $ sudo systemctl start nvidia-dcgm
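
The three commands of the first workaround can be combined so that the service is restarted even when the diagnostic fails. This is a sketch only; the function name run_pulse_test_without_service is hypothetical, and it uses only the commands already shown above.

```shell
# Sketch of workaround 1 as a single function: stop the service, run the
# pulse test, then restart the service regardless of the test result.
# The function name is hypothetical.
run_pulse_test_without_service() {
    sudo systemctl stop nvidia-dcgm || return 1
    dcgmi diag -r pulse_test
    status=$?
    # Restart the service whether or not the diagnostic succeeded.
    sudo systemctl restart nvidia-dcgm
    # Propagate the diagnostic's exit status to the caller.
    return "$status"
}
```

Returning the saved status lets a calling script react to the diagnostic result after the service is back up.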
    

Sample Commands

Run the entire diagnostic suite, including the pulse test:

$ dcgmi diag -r 4

Run just the pulse test:

$ dcgmi diag -r pulse_test

Run just the pulse test, but at a lower frequency:

$ dcgmi diag -r pulse_test -p pulse_test.freq0=3000

Run just the pulse test at a lower frequency and with a test_duration of 180 seconds:

$ dcgmi diag -r pulse_test -p "pulse_test.freq0=5000;pulse_test.test_duration=180"
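
When several parameters are combined, they are separated by semicolons, and the whole argument must be quoted because the shell otherwise treats a semicolon as a command separator. The helper below is a hypothetical convenience (pulse_params is not a DCGM command) that builds such an argument from name=value pairs.

```shell
# Hypothetical helper: join name=value pairs into one -p argument,
# prefixing each with "pulse_test." and separating them with semicolons.
pulse_params() {
    joined=""
    for pair in "$@"; do
        # Require a non-empty name and a non-empty value.
        case "$pair" in
            ?*=?*) ;;
            *) echo "expected name=value, got: $pair" >&2; return 1 ;;
        esac
        joined="${joined:+$joined;}pulse_test.$pair"
    done
    printf '%s\n' "$joined"
}
```

For example, dcgmi diag -r pulse_test -p "$(pulse_params freq0=5000 test_duration=180)" passes the same argument as the last sample command.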

Failure Conditions

  • The pulse test will fail if the power supply unit cannot handle the spikes in the current.

  • It will also fail if unrecoverable memory errors, temperature violations, or XIDs occur during the test.
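
When a failure is suspected to come from XIDs rather than from the power supply, the kernel log can be inspected, since the NVIDIA driver reports these events on lines containing "NVRM: Xid". The sketch below counts such lines in text read from standard input; the function name count_xid_events is hypothetical.

```shell
# Sketch: count NVIDIA Xid lines in kernel log text read from stdin.
# grep -c prints the number of matching lines; "|| true" keeps the exit
# status at zero when there are no matches.
count_xid_events() {
    grep -c 'NVRM: Xid' || true
}
```

A possible use is to run sudo dmesg | count_xid_events before and after the pulse test and compare the two counts. Reading the kernel log may require elevated privileges on some systems.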