DCGM Release Notes
This version of DCGM (v3.3) requires a minimum R450 Linux datacenter driver (>= 450.80.02), which can be downloaded from NVIDIA Drivers. It is recommended to install the latest datacenter driver from the NVIDIA driver downloads site for use with DCGM.
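The minimum driver requirement above can be checked in a script before installing or starting DCGM. The following is a minimal sketch, not part of DCGM itself: the function names are hypothetical, and it assumes that driver versions are dotted numeric strings and that `nvidia-smi` is on the PATH for the optional query helper.

```python
import subprocess

MIN_DRIVER = (450, 80, 2)  # 450.80.02, the minimum stated in these notes


def parse_driver_version(version: str) -> tuple:
    """Turn a dotted driver version such as '535.129.03' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def meets_minimum(version: str, minimum: tuple = MIN_DRIVER) -> bool:
    """Return True if the driver version is at least the required minimum."""
    parsed = parse_driver_version(version)
    # Pad so that a two-part version like '450.80' compares as (450, 80, 0).
    width = max(len(parsed), len(minimum))
    parsed += (0,) * (width - len(parsed))
    minimum = tuple(minimum) + (0,) * (width - len(minimum))
    return parsed >= minimum


def installed_driver_versions() -> list:
    """Query nvidia-smi for the driver version reported by each GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

On a machine with the driver installed, `all(meets_minimum(v) for v in installed_driver_versions())` would indicate whether every GPU reports a sufficiently recent driver.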
Added support for L20 GPU.
Added support for A800 20bd SKU.
Added support for water-cooled A800 GPU.
Added CPU power and thermal health checks.
Added C2C support.
All XIDs during diagnostics are now reported.
Reduced the verbosity of some log messages from Error to Debug level.
Stopped checking NVLink replay counts as a failure condition.
Made EUD independent from service-account. Fixed direct run of the EUD diagnostic.
The multi-node health check script is now included in the installation packages.
Relaxed PCI width testing for QA scripts.
Fixed EUD diagnostic when MLE parsing is enabled.
Fixed setting of logging severity via dcgmi.
Fixed a crash in pulsetest.
Resolved an issue causing the diagnostic to hang on systems with an odd number of GPUs.
Added support for monitoring NVIDIA Grace CPUs
Added DCGM Diag support for the GPUs of Grace + Hopper systems (devId 2342)
Added the following fieldIds for NvSwitch power: DCGM_FI_DEV_NVSWITCH_POWER_VDD, DCGM_FI_DEV_NVSWITCH_POWER_DVDD, DCGM_FI_DEV_NVSWITCH_POWER_HVDD
Added DCGM Diag pulse test support for the L4 GPU
Reworked DCGM Diag error reporting to include more specific error categories and next steps to aid in automation workflows
Data Center Profiling metrics are now allowed on SKUs with brand DCGM_BRAND_NVIDIA_RTX like A6000.
Added error id, category, and severity to the dcgmi diag --json output for the Deployment Plugin
Added a workaround for DCGM_FI_DEV_MEMORY_TEMP being BLANK on r545 drivers. This is due to NVIDIA Bug 4300930 in the NVML library.
Fixed an uninitialized memory bug in the memtest plugin of dcgmi diag -r 4.
Added DCGM Diag support for L40S, H100 PCIe (devId 2321), and H800 PCIe (devId 233a)
Added logging of health check failures to /var/log/nv-hostengine.log in addition to the dcgmHealthCheck() API returning errors.
Fixed dcgmi diag’s Permission and OS Blocks subtest failing within containers.
dcgmi diag -r eud eud.suite_level returning Invalid Parameter
Fixed a segfault in DCGM Diag’s nvvs process when GPUs failed to initialize
DCGM Diag’s PCIe test will now utilize subprocesses and NUMA to achieve optimal D2H and H2D bandwidth on some AMD CPUs where that is required.
The DCGM Diag PCIe plugin now uses Bit Error Rates (BER) instead of static thresholds when detecting excessive PCIe replay.
Added a reminder to restart the DCGM service when DCGM Diag warns that the nvvs binary was not found.
Fixed dcgmi diag not running on ARM64 and PPC64LE platforms in DCGM 3.2.3.
Fixed RPATH for the DCGM libraries on platforms where there are dcgm libraries in /lib/ directory (ppc64le rhel).
Added a reference implementation of DCGM + NCCL multi-node testing.
Added a subtest to DCGM Diagnostic’s PCIe test that does GEMMs concurrent to P2P copies.
Added -r production_testing to DCGM Diagnostics to capture production line testing as a specific use case.
Added detection of host side PCIe replays to dcgmi diag -r production_testing as a failure condition.
Added support for profiling telemetry fieldIds 1001+ for Ada L4
Added power telemetry for NvSwitches.
Added gather-dcgm-logs.sh to gather all DCGM log files when submitting bugs
Removed DCGM’s dependency on OpenMP
Added the discrete error_id to the JSON output of DCGM Diag to enable scripting actions based on error codes.
The DCGM Diagnostic’s EUD plugin now writes its logs to /var/logs/dcgm like the rest of DCGM.
Fixed nv-hostengine thrashing the heap under heavy load.
Fixed DCP metrics on H100 sometimes returning N/A values under MIG.
dcgm_prometheus.py has been deprecated. Please use DCGM exporter for Prometheus integration
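The discrete error_id added to the JSON output of DCGM Diag (see the entry above) is intended for scripting actions based on error codes. The sketch below shows one cautious way to pull those values out of a parsed result. The layout of the JSON is not described in these notes, so the code makes no assumption about it and simply walks the whole document looking for keys named `error_id`. The function name and the sample document are hypothetical.

```python
import json


def collect_error_ids(node) -> list:
    """Recursively gather every value stored under an 'error_id' key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "error_id":
                found.append(value)
            else:
                found.extend(collect_error_ids(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_error_ids(item))
    return found


# Hypothetical sample; the real dcgmi diag JSON layout is not reproduced here.
sample = json.loads("""
{"tests": [
  {"name": "example_a", "results": [{"gpu": 0, "error_id": 7}]},
  {"name": "example_b", "results": [{"gpu": 1}]}
]}
""")
error_ids = collect_error_ids(sample)
```

In practice the input would come from the JSON output of a dcgmi diag run rather than from an inline string, and the script would map each collected id to whatever action the site's automation requires.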
Added a static library of libdcgm.
Improved the DCGM Diagnostic PCIe Plugin’s detection of broken P2P between GPUs
Fixed an issue where DCGM could hang on systems with NvSwitches
Fixed a timing issue where field IDs 1001+ could return N/A values on H100 GPUs.
Fixed dcgmproftester11 not working on drivers r515 and older
Fixed dcgmproftester10 not working for FP16, FP32, and FP64
Fixed minor bugs in the DCGM Diagnostic EUD plugin.
Added support for the 800-series of NVIDIA GPU products
Included metadata on software versions and GPUs detected when running DCGM Diagnostics
Updated configuration parameters for the Input EDPp test on H100 and added the ability for users to select a subset of the test patterns.
Fixed an issue where DCGM Diagnostics would hang in scenarios where the GPU is unable to be enumerated any longer on the PCIe bus during a diagnostics run.
Fixed an issue where DCGM fails to install on SLES 15 due to the inability to create a
Fixed a memory leak issue in libdcgmmoduleprofiling.so when monitoring MIG devices with profiling metrics
When profiling metrics are monitored (for example, dcgmi dmon -e 1001), some metrics might be reported as “N/A” for some intervals with an active CUDA context (for example, dcgmproftester12 -t 1007 -d 50 --no-dcgm-validation). This is caused by a timing issue in which samples are cleaned up before they can be used to calculate the metrics. This issue is resolved in DCGM 3.1.8.
dcgmproftester12 without any arguments may result in an error after a while:
Error -24 from InitializeGpus(). Exiting.
Fixed an error with missing symbols when running DCGM Diagnostics run levels 3 & 4. The errors manifest as follows: Unable to merge JSON results for regular Diag and EUD tests and logs will contain this error: Couldn’t load a definition for GetPluginInterfaceVersion in plugin.
Fixed an intermittent crash when running the Input EDPp tests.
Fixed incorrect failure of DCGM diagnostics on discovering inactive NVLinks.
Added support for the NVIDIA Hopper architecture and NVIDIA H100 PCIe product:
Added support for Hopper performance monitoring APIs
Added support for Hopper Multi-Instance GPU profiles
Added support for DCGM GPU diagnostics
Added support for NVIDIA Ada architecture, NVIDIA L40 product
Added telemetry for NVSwitches. See API documentation (field identifiers) for new fields.
Added support for End User Diagnostics (EUD) as a preview feature for specific PCIe products
Added support for CUDA 12
Added the ability for DCGM Diagnostics to skip the NVLink integration test when NVLinks are not enabled. This can be accomplished by adding the -p pcie.test_nvlink_status=false option to the dcgmi diag command-line.
Added support for Red Hat Enterprise Linux (RHEL) 9.
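For automation that launches DCGM Diagnostics, the NVLink skip option described above can be appended to the argument list conditionally. This is a minimal sketch: the helper name is hypothetical, and the run level used in the example (3) is an assumption chosen for illustration, not something these notes prescribe. Only the `-p pcie.test_nvlink_status=false` parameter is taken from the text.

```python
import shlex


def diag_command(run_level, skip_nvlink_check: bool = True) -> list:
    """Build a dcgmi diag argument list, optionally disabling the NVLink status test."""
    cmd = ["dcgmi", "diag", "-r", str(run_level)]
    if skip_nvlink_check:
        # Parameter string taken from the release note above.
        cmd += ["-p", "pcie.test_nvlink_status=false"]
    return cmd


# prints: dcgmi diag -r 3 -p pcie.test_nvlink_status=false
print(shlex.join(diag_command("3")))
```

Returning a list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting concerns.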
Major API changes and Deprecations
The following features have been dropped or deprecated starting with DCGM 3.0:
The socket protocol based on protobuf has been removed
The DCGM introspection APIs have been removed (except for host engine memory usage and host CPU usage)
The following field identifiers have been removed:
Support for CUDA 9 and CUDA 10 based drivers has been removed; DCGM diagnostics cannot be used on systems with these older driver installations
For reading metrics, the dcgmProfWatchFields() API is no longer supported (and will return a DCGM_ST_NOT_SUPPORTED error). Instead, the more generic dcgmWatchFields() API should be used.
sm_stress test is no longer run as default for -r 4 run levels. To invoke the test separately, dcgmi diag -r sm_stress can be used.
The Input EDPp test (“Pulse”) with -r 4 and Memory bandwidth tests are now supported for H100 PCIe in this release
Fixed an issue with the Pulse test (under -r 4) which caused the test to hang in some scenarios on A100 systems
Fixed an issue in DCGM diagnostics where a failure on one GPU would be attributed to all GPUs in a multi-GPU system.
Fixed an issue with the calculation of the
Fixed an issue with package dependencies on the libgomp package on SUSE SLES based distributions
Fixed an issue where DCGM diagnostics was not handling driver timeouts correctly
Fixed an issue where DCGM diagnostics printed Error: unable to establish a connection to the specified host: localhost when a --host parameter was not passed.
Fixed an issue in DCGM diagnostics to handle Ctrl-C signals correctly.
Fixed an issue with metrics in MIG mode where all field values would report incorrect values after a few hours.
Fixed an issue on A100 in MIG mode where some whole GPU metrics such as temperature, power etc. were returned as 0 for MIG devices.
Fixed an issue on A100 in MIG mode to report memory (DCGM_FI_DEV_FB_TOTAL) per MIG device.
Fixed an issue where package installation would fail on RHEL systems.
Removed the redundant temperature_max setting from the diag-skus.yaml configuration for DCGM Diagnostics.
Fixed an issue where DCGM with R510+ drivers was using an incorrect NVML API to return memory usage. A new field identifier DCGM_FI_DEV_FB_RESERVED was added to distinguish between the actual usage and reserved memory.
On V100, DCGM metrics may be reported as 0 after some time interval when two or more CUDA contexts are active on the GPU.
On DGX-2/HGX-2 systems, ensure that nv-hostengine and the Fabric Manager service are started before using dcgmproftester for testing the new profiling metrics. See the Getting Started section in the DCGM User Guide for details on installation.
nvidia-smi may report hardware throttling (clocks_throttle_reasons.hw_slowdown = ACTIVE) during DCGM Diagnostics (Level 3). The stressful workload results in power transients that engage the HW slowdown mechanism to ensure that the Tesla K80 product operates within the power capping limit for both long term and short term timescales. For Volta or later Tesla products, this reporting issue has been fixed and the workload transients are no longer flagged as “HW Slowdown”. The NVIDIA driver will accurately detect whether the slowdown event is due to thermal thresholds being exceeded or to an external power brake event. It is recommended that customers ignore this failure mode on Tesla K80 if the GPU temperature is within specification.
To report NVLINK bandwidth utilization, DCGM programs counters in the HW to extract the desired information. It is currently possible for certain other tools a user might run, including nvprof, to change these settings after DCGM monitoring begins. In such a situation DCGM may subsequently return errors or invalid values for the NVLINK metrics. There is currently no way within DCGM to prevent other tools from modifying this shared configuration. Once the interfering tool is done, a user of DCGM can repair the reporting by running nvidia-smi nvlink -sc 0bz; nvidia-smi nvlink -sc 1bz.
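The repair step above can be wrapped in a small helper so that it runs automatically after a profiling tool finishes. This is only a sketch: the function and constant names are hypothetical, the two commands are copied verbatim from the note, and whether those options are accepted by a given driver release is not verified here. The helper defaults to a dry run so that nothing is executed unless requested.

```python
import subprocess

# Commands copied verbatim from the note above, in the order given there.
NVLINK_RESET_COMMANDS = [
    ["nvidia-smi", "nvlink", "-sc", "0bz"],
    ["nvidia-smi", "nvlink", "-sc", "1bz"],
]


def repair_nvlink_counters(dry_run: bool = True) -> list:
    """Run (or, with dry_run, just return) the counter-reset commands in order."""
    for cmd in NVLINK_RESET_COMMANDS:
        if not dry_run:
            # Raises CalledProcessError if nvidia-smi exits with a non-zero status.
            subprocess.run(cmd, check=True)
    return NVLINK_RESET_COMMANDS
```

Calling `repair_nvlink_counters(dry_run=False)` after the interfering tool exits would issue both commands and stop at the first failure.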