Debugging and Troubleshooting
General Problem Reporting
When reporting a problem, please always include:
nvidia-bug-report.log.gz, produced by nvidia-bug-report.sh
Full output of dcgmi -v
Relevant and/or requested logs, as described below
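The items listed above can be gathered into a single directory before filing a report. The following is a minimal sketch, not an official tool: the helper name collect_dcgm_report, the default directory name, and the dcgmi-version.txt file name are all hypothetical, and it assumes nvidia-bug-report.sh writes nvidia-bug-report.log.gz into the current directory. Tools that are not installed are skipped with a message.

```shell
# Hypothetical helper (not part of DCGM): gather report items into one directory.
# Usage: collect_dcgm_report [output_dir]
collect_dcgm_report() {
    out_dir=${1:-./dcgm-report}
    mkdir -p "$out_dir" || return 1
    (
        cd "$out_dir" || exit 1
        # Assumption: nvidia-bug-report.sh writes nvidia-bug-report.log.gz
        # into the current directory. It generally needs root privileges.
        if command -v nvidia-bug-report.sh >/dev/null 2>&1; then
            nvidia-bug-report.sh
        else
            echo "nvidia-bug-report.sh not found; skipping" >&2
        fi
        # Save the full output of dcgmi -v alongside the bug report.
        if command -v dcgmi >/dev/null 2>&1; then
            dcgmi -v > dcgmi-version.txt 2>&1
        else
            echo "dcgmi not found; skipping" >&2
        fi
    )
}
```

Any logs produced by the procedures in the Logging topic can then be copied into the same directory by hand.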
Logging
This topic describes the ways DCGM can be configured to produce detailed logs.
Enable Logging Using Standalone Hostengine
When launching nv-hostengine:
Add the -f /path/to/log parameter to specify where to write the log
Add the --log-level DEBUG parameter to specify DEBUG logging
This example will collect debug logs from the standalone hostengine for the duration of its lifetime. The log file will be written to /tmp/nv-hostengine.log.
Example:
% sudo nv-hostengine -f /tmp/nv-hostengine.log --log-level DEBUG
Enable Logging Using Embedded Hostengine
When using an embedded hostengine, if the command must run as the root user or another privileged user, first switch to that user with the appropriate command, e.g., sudo(8).
While still in the session where you ran sudo:
Use export __DCGM_DBG_FILE=/path/to/log to specify where to write the log
Use export __DCGM_DBG_LVL=6 to specify DEBUG logging
Use env | grep __DCGM_DBG to confirm the variables are set
Run the desired command
This example will collect debug logs from the embedded hostengine while running the short diagnostic. The log file will be written to /tmp/embedded.log.
Example:
% sudo -i
(prompts for password)
# export __DCGM_DBG_FILE=/tmp/embedded.log
# export __DCGM_DBG_LVL=6
# env | grep __DCGM_DBG
(output)
__DCGM_DBG_FILE=/tmp/embedded.log
__DCGM_DBG_LVL=6
# dcgmi diag -r short
...
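The export steps above leave the variables set for the rest of the session. As an alternative sketch, the same two variables can be scoped to a single command with env(1). The helper name with_dcgm_debug is hypothetical; only the variable names and the level value 6 come from the steps above. It should still be run from the privileged session, since sudo commonly resets the environment.

```shell
# Hypothetical helper: run one command with the DCGM debug variables set
# only for that command, leaving the current session's environment unchanged.
# Usage: with_dcgm_debug /path/to/log command [args...]
with_dcgm_debug() {
    if [ "$#" -lt 2 ]; then
        echo "usage: with_dcgm_debug /path/to/log command [args...]" >&2
        return 2
    fi
    dbg_file=$1
    shift
    # 6 is the value the steps above give for DEBUG logging.
    env __DCGM_DBG_FILE="$dbg_file" __DCGM_DBG_LVL=6 "$@"
}
```

For the example above, the equivalent invocation would be with_dcgm_debug /tmp/embedded.log dcgmi diag -r short.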
Enable Diagnostic Logging
The diagnostic produces additional useful logging. When running the diagnostic through dcgmi:
Add the --debugLogFile /path/to/log parameter to specify where to write the log
Add the -d DEBUG parameter to specify DEBUG logging
This example will collect debug logs from the short diagnostic. The log file will be written to /tmp/nvvs.log.
Example:
% dcgmi diag -r short --debugLogFile /tmp/nvvs.log -d DEBUG
Enable NVML Logging
In some cases, NVIDIA engineers may request NVML logs to aid in debugging.
If the command must run as the root user or another privileged user, first switch to that user with the appropriate command, e.g., sudo(8).
While still in the session where you ran sudo:
Use export __NVML_DBG_FILE=/path/to/log to specify where to write the log
Use export __NVML_DBG_LVL=DEBUG to specify DEBUG logging
Use env | grep __NVML_DBG to confirm the variables are set
While still in the same session, add any other necessary environment variables (e.g., if you are running an embedded hostengine)
Run the desired command
Note
If using the standalone hostengine, a separate __NVML_DBG_FILE should be specified for the hostengine and the desired command. See the example that follows.
This example will collect NVML logs and debug logs from a standalone hostengine, as well as NVML and debug logs from the long diagnostic.
The NVML logs for the hostengine will be written to /tmp/hostengine.nvml.log, and the NVML logs for the diagnostic will be written to /tmp/nvvs.nvml.log.
Example:
% sudo -i
(prompts for password)
# export __NVML_DBG_FILE=/tmp/hostengine.nvml.log
# export __NVML_DBG_LVL=DEBUG
# env | grep __NVML_DBG
(output)
__NVML_DBG_FILE=/tmp/hostengine.nvml.log
__NVML_DBG_LVL=DEBUG
# nv-hostengine -f /tmp/nv-hostengine.log --log-level DEBUG
# export __NVML_DBG_FILE=/tmp/nvvs.nvml.log
# env | grep __NVML_DBG
(output)
__NVML_DBG_FILE=/tmp/nvvs.nvml.log
__NVML_DBG_LVL=DEBUG
# dcgmi diag -r long --debugLogFile /tmp/nvvs.log -d DEBUG
...
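Because the example above writes to four separate files, it can help to confirm that each one exists and is non-empty before attaching it to a report. The following is a minimal sketch; check_logs is a hypothetical helper name, not a DCGM command.

```shell
# Hypothetical helper: report which of the given log files exist and are
# non-empty. Returns non-zero if any file is missing or empty.
# Usage: check_logs /path/to/log [...]
check_logs() {
    rc=0
    for log_path in "$@"; do
        if [ -s "$log_path" ]; then
            echo "ok: $log_path"
        else
            echo "missing or empty: $log_path" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

For the example above, the call would be check_logs /tmp/hostengine.nvml.log /tmp/nvvs.nvml.log /tmp/nv-hostengine.log /tmp/nvvs.log.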