Debugging and Troubleshooting

General Problem Reporting

When reporting a problem, please always include:

  • nvidia-bug-report.log.gz - produced by nvidia-bug-report.sh

  • Full output of dcgmi -v

  • Relevant and/or requested logs, as described below
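
The items above can be gathered in one pass. The following is a minimal sketch, not an official tool: the function name collect_dcgm_report, the notes.txt file, and the output directory layout are hypothetical. It assumes that nvidia-bug-report.sh writes nvidia-bug-report.log.gz into the current working directory. Steps whose commands are not installed are skipped with a note.

```shell
#!/bin/sh
# Hypothetical helper: gather the files requested in a problem report.
# Usage: collect_dcgm_report OUTPUT_DIR [LOG_FILE ...]
collect_dcgm_report() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1

    # nvidia-bug-report.sh normally needs root; it is assumed here to
    # write nvidia-bug-report.log.gz into the current directory.
    if command -v nvidia-bug-report.sh >/dev/null 2>&1; then
        (cd "$outdir" && nvidia-bug-report.sh) >/dev/null 2>&1 ||
            echo "nvidia-bug-report.sh failed" >>"$outdir/notes.txt"
    else
        echo "nvidia-bug-report.sh not found" >>"$outdir/notes.txt"
    fi

    # Full output of dcgmi -v.
    if command -v dcgmi >/dev/null 2>&1; then
        dcgmi -v >"$outdir/dcgmi-version.txt" 2>&1
    else
        echo "dcgmi not found" >>"$outdir/notes.txt"
    fi

    # Copy any relevant or requested logs passed as extra arguments.
    for log in "$@"; do
        if [ -f "$log" ]; then
            cp "$log" "$outdir/"
        else
            echo "missing log: $log" >>"$outdir/notes.txt"
        fi
    done
    return 0
}
```

For example, running collect_dcgm_report /tmp/dcgm-report /tmp/nv-hostengine.log as root would place the bug report, the version output, and the hostengine log in one directory that can be archived and attached.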

Logging

This section describes how to configure DCGM to produce detailed logs.

Enable Logging Using Standalone Hostengine

When launching nv-hostengine:

  • Add the -f /path/to/log parameter to specify where to write the log

  • Add the --log-level DEBUG parameter to specify DEBUG logging

This example will collect debug logs from the standalone hostengine for the duration of its lifetime. The log file will be written to /tmp/nv-hostengine.log. Example:

% sudo nv-hostengine -f /tmp/nv-hostengine.log --log-level DEBUG

Enable Logging Using Embedded Hostengine

When using an embedded hostengine, if the command must run as root or another privileged user, first switch to that user with the appropriate command, e.g., sudo(8). Then, in that same session:

  • Use export __DCGM_DBG_FILE=/path/to/log to specify where to write the log

  • Use export __DCGM_DBG_LVL=6 to specify DEBUG logging

  • Use env | grep __DCGM_DBG to confirm the variables are set

  • Run the desired command

This example will collect debug logs from the embedded hostengine while running the short diagnostic. The log file will be written to /tmp/embedded.log. Example:

% sudo -i
  (prompts for password)
# export __DCGM_DBG_FILE=/tmp/embedded.log
# export __DCGM_DBG_LVL=6
# env | grep __DCGM_DBG
  (output)
  __DCGM_DBG_FILE=/tmp/embedded.log
  __DCGM_DBG_LVL=6
# dcgmi diag -r short
  ...
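
In a script, the confirmation step above can be made to fail fast instead of relying on someone reading the env output. A minimal sketch, assuming a POSIX shell; require_dcgm_debug_env is a hypothetical helper name and is not part of DCGM:

```shell
#!/bin/sh
# Hypothetical helper: succeed only if both embedded-hostengine debug
# variables are set, non-empty, and exported to child processes.
require_dcgm_debug_env() {
    for name in __DCGM_DBG_FILE __DCGM_DBG_LVL; do
        # env(1) lists only exported variables, which is what a child
        # process such as dcgmi will actually inherit.
        if ! env | grep -q "^${name}=."; then
            echo "error: ${name} is not exported" >&2
            return 1
        fi
    done
}
```

A script could then run require_dcgm_debug_env && dcgmi diag -r short, so the diagnostic starts only when both variables are in place.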

Enable Diagnostic Logging

The diagnostic produces additional useful logging. When running the diagnostic through dcgmi:

  • Add the --debugLogFile /path/to/log parameter to specify where to write the log

  • Add the -d DEBUG parameter to specify DEBUG logging

This example will collect debug logs from the short diagnostic. The log file will be written to /tmp/nvvs.log. Example:

% dcgmi diag -r short --debugLogFile /tmp/nvvs.log -d DEBUG

Enable NVML Logging

In some cases, NVIDIA engineers may request NVML logs to aid in debugging.

If the command must run as root or another privileged user, first switch to that user with the appropriate command, e.g., sudo(8). Then, in that same session:

  • Use export __NVML_DBG_FILE=/path/to/log to specify where to write the log

  • Use export __NVML_DBG_LVL=DEBUG to specify DEBUG logging

  • Use env | grep __NVML_DBG to confirm the variables are set

  • While still in the same session, add any other necessary environment variables (e.g., if you are running an embedded hostengine)

  • Run the desired command

Note

When using the standalone hostengine, specify a separate __NVML_DBG_FILE for the hostengine and for the desired command, so that each process writes to its own NVML log. See the example that follows.

This example will collect NVML logs and debug logs from a standalone hostengine, as well as NVML and debug logs from the long diagnostic. The NVML logs for the hostengine will be written to /tmp/hostengine.nvml.log, and the NVML logs for the diagnostic will be written to /tmp/nvvs.nvml.log. Example:

% sudo -i
  (prompts for password)
# export __NVML_DBG_FILE=/tmp/hostengine.nvml.log
# export __NVML_DBG_LVL=DEBUG
# env | grep __NVML_DBG
  (output)
  __NVML_DBG_FILE=/tmp/hostengine.nvml.log
  __NVML_DBG_LVL=DEBUG
# nv-hostengine -f /tmp/nv-hostengine.log --log-level DEBUG
# export __NVML_DBG_FILE=/tmp/nvvs.nvml.log
# env | grep __NVML_DBG
  (output)
  __NVML_DBG_FILE=/tmp/nvvs.nvml.log
  __NVML_DBG_LVL=DEBUG
# dcgmi diag -r long --debugLogFile /tmp/nvvs.log -d DEBUG
  ...
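
The second export in the example above is easy to forget, in which case both processes would be pointed at the same NVML log. One way to avoid this is to set the variables per command instead of per session. The following is a sketch under that assumption; run_with_nvml_debug is a hypothetical helper, not part of DCGM or NVML, and the variable names and DEBUG level are the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper: run one command with NVML debug logging sent to
# its own file, so no re-export is needed between commands.
# Usage: run_with_nvml_debug NVML_LOG_FILE COMMAND [ARGS ...]
run_with_nvml_debug() {
    nvml_log=$1
    shift
    # env(1) sets the variables only for the command it launches.
    env __NVML_DBG_FILE="$nvml_log" __NVML_DBG_LVL=DEBUG "$@"
}

# Intended use from the privileged session (not executed here):
#   run_with_nvml_debug /tmp/hostengine.nvml.log \
#       nv-hostengine -f /tmp/nv-hostengine.log --log-level DEBUG
#   run_with_nvml_debug /tmp/nvvs.nvml.log \
#       dcgmi diag -r long --debugLogFile /tmp/nvvs.log -d DEBUG
```

Because the variables are never exported into the session itself, later commands in the same shell do not inherit NVML logging unless they are also wrapped.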