Debugging and Troubleshooting
General Problem Reporting
When reporting a problem, always include:
- nvidia-bug-report.log.gz, produced by nvidia-bug-report.sh
- Full output of dcgmi -v
- Relevant and/or requested logs, as described below
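The checklist above can be gathered with a short script. This is a minimal sketch, not an official tool: the function name collect_dcgm_report and the file names it writes are assumptions made for illustration, and it assumes nvidia-bug-report.sh writes nvidia-bug-report.log.gz into the current working directory. Any tool that is not on PATH is skipped and noted.

```shell
#!/bin/sh
# Sketch: gather the items listed above into one directory.
# collect_dcgm_report and the output file names are hypothetical.
collect_dcgm_report() {
    out_dir=${1:-./dcgm-report}
    mkdir -p "$out_dir" || return 1

    # Full output of dcgmi -v
    if command -v dcgmi >/dev/null 2>&1; then
        dcgmi -v > "$out_dir/dcgmi-version.txt" 2>&1
    else
        echo "dcgmi not found on PATH" > "$out_dir/dcgmi-version.txt"
    fi

    # nvidia-bug-report.sh usually needs root; it is assumed here to write
    # nvidia-bug-report.log.gz into the current working directory.
    if command -v nvidia-bug-report.sh >/dev/null 2>&1; then
        ( cd "$out_dir" && nvidia-bug-report.sh >/dev/null 2>&1 )
    else
        echo "nvidia-bug-report.sh not found on PATH" \
            > "$out_dir/MISSING-bug-report.txt"
    fi

    # Any further arguments are treated as extra log files to copy.
    [ "$#" -gt 0 ] && shift
    for log in "$@"; do
        [ -f "$log" ] && cp "$log" "$out_dir/"
    done
    return 0
}
```

Usage, from a root shell in which the function has been defined: collect_dcgm_report /tmp/dcgm-report /tmp/nv-hostengine.log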
Logging
This section describes how DCGM can be configured to produce detailed logs.
Enable Logging Using Standalone Hostengine
When launching nv-hostengine:
- Add the -f /path/to/log parameter to specify where to write the log
- Add the --log-level DEBUG parameter to specify DEBUG logging
This example collects debug logs from the standalone hostengine for the duration of its lifetime. The log file is written to /tmp/nv-hostengine.log.
Example:
% sudo nv-hostengine -f /tmp/nv-hostengine.log --log-level DEBUG
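After launching the hostengine, it can be useful to confirm that the log file is actually being written before trying to reproduce a problem. The following is a minimal sketch; wait_for_log is a hypothetical helper, not part of DCGM, and the timeout value is an arbitrary choice.

```shell
# Sketch: wait until a log file exists and is non-empty.
# wait_for_log is a hypothetical helper name, not a DCGM command.
wait_for_log() {
    log_file=$1
    timeout_s=${2:-10}
    elapsed=0
    # -s is true once the file exists and has a size greater than zero
    while [ ! -s "$log_file" ]; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            echo "no log output in $log_file after ${timeout_s}s" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

Usage: wait_for_log /tmp/nv-hostengine.log 10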
Enable Logging Using Embedded Hostengine
When using an embedded hostengine, if you need to run as root or another privileged user, first switch to that user with the appropriate command, e.g., sudo(8).
While still in the same session in which you ran sudo:
- Use export __DCGM_DBG_FILE=/path/to/log to specify where to write the log
- Use export __DCGM_DBG_LVL=6 to specify DEBUG logging
- Use env | grep __DCGM_DBG to confirm the variables are set
- Run the desired command
This example collects debug logs from the embedded hostengine while running the short diagnostic. The log file is written to /tmp/embedded.log.
Example:
% sudo -i
(prompts for password)
# export __DCGM_DBG_FILE=/tmp/embedded.log
# export __DCGM_DBG_LVL=6
# env | grep __DCGM_DBG
(output)
__DCGM_DBG_FILE=/tmp/embedded.log
__DCGM_DBG_LVL=6
# dcgmi diag -r short
...
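As an alternative to exporting the variables for the whole session, they can be scoped to a single command so that later commands in the same shell are not affected. This is a sketch only: run_with_dcgm_debug is a hypothetical helper name, and it reuses the variable names and the level value 6 from the steps above. It must still be run as the privileged user when the command requires it.

```shell
# Sketch: run one command with DCGM debug logging enabled, without
# leaving the variables exported in the calling shell.
# run_with_dcgm_debug is a hypothetical helper name.
run_with_dcgm_debug() {
    log_file=$1
    shift
    # env sets the variables only for the command it executes.
    env __DCGM_DBG_FILE="$log_file" \
        __DCGM_DBG_LVL=6 \
        "$@"
}
```

Usage: run_with_dcgm_debug /tmp/embedded.log dcgmi diag -r short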
Enable Diagnostic Logging
The diagnostic produces additional useful logging. When running the diagnostic through dcgmi:
- Add the --debugLogFile /path/to/log parameter to specify where to write the log
- Add the -d DEBUG parameter to specify DEBUG logging
This example collects debug logs from the short diagnostic. The log file is written to /tmp/nvvs.log.
Example:
% dcgmi diag -r short --debugLogFile /tmp/nvvs.log -d DEBUG
Enable NVML Logging
In some cases, NVIDIA engineers may request NVML logs to aid in debugging.
If you need to run as root or another privileged user, first switch to that user with the appropriate command, e.g., sudo(8).
While still in the same session in which you ran sudo:
- Use export __NVML_DBG_FILE=/path/to/log to specify where to write the log
- Use export __NVML_DBG_LVL=DEBUG to specify DEBUG logging
- Use env | grep __NVML_DBG to confirm the variables are set
- While still in the same session, add any other necessary environment variables (e.g., if you are running an embedded hostengine)
- Run the desired command
Note
If using the standalone hostengine, specify a separate __NVML_DBG_FILE for the hostengine and for the desired command. See the example that follows.
This example collects NVML logs and debug logs from a standalone hostengine, as well as NVML and debug logs from the long diagnostic.
The NVML logs for the hostengine are written to /tmp/hostengine.nvml.log, and the NVML logs for the diagnostic are written to /tmp/nvvs.nvml.log.
Example:
% sudo -i
(prompts for password)
# export __NVML_DBG_FILE=/tmp/hostengine.nvml.log
# export __NVML_DBG_LVL=DEBUG
# env | grep __NVML_DBG
(output)
__NVML_DBG_FILE=/tmp/hostengine.nvml.log
__NVML_DBG_LVL=DEBUG
# nv-hostengine -f /tmp/nv-hostengine.log --log-level DEBUG
# export __NVML_DBG_FILE=/tmp/nvvs.nvml.log
# env | grep __NVML_DBG
(output)
__NVML_DBG_FILE=/tmp/nvvs.nvml.log
__NVML_DBG_LVL=DEBUG
# dcgmi diag -r long --debugLogFile /tmp/nvvs.log -d DEBUG
...
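The requirement that the hostengine and the command each get their own NVML log file can also be met by deriving the file name from a per-process tag. The following is a minimal sketch under stated assumptions: nvml_logged and NVML_LOG_DIR are hypothetical names introduced here, and the file naming simply mirrors the /tmp/hostengine.nvml.log and /tmp/nvvs.nvml.log paths used in the example above.

```shell
# Sketch: give each process its own NVML log file.
# nvml_logged and NVML_LOG_DIR are hypothetical names, not DCGM or NVML
# settings; only __NVML_DBG_FILE and __NVML_DBG_LVL come from the steps above.
NVML_LOG_DIR=${NVML_LOG_DIR:-/tmp}

nvml_logged() {
    tag=$1
    shift
    # env sets the variables only for the command it executes, so the
    # next call can use a different log file without any cleanup.
    env __NVML_DBG_FILE="$NVML_LOG_DIR/$tag.nvml.log" \
        __NVML_DBG_LVL=DEBUG \
        "$@"
}
```

Usage, from a root shell:
nvml_logged hostengine nv-hostengine -f /tmp/nv-hostengine.log --log-level DEBUG
nvml_logged nvvs dcgmi diag -r long --debugLogFile /tmp/nvvs.log -d DEBUG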