DCGM Release Notes
This version of DCGM (v2.4) requires a minimum R418 driver, which can be downloaded from NVIDIA Drivers. On NVSwitch-based systems such as DGX A100 or HGX A100, a minimum Linux R450 driver (>= 450.80.02) is required. If using the profiling metrics capabilities in DCGM, a minimum Linux R418 driver (>= 418.87.01) is required. It is recommended to install the latest data center driver from the NVIDIA driver downloads site for use with DCGM.
Added the ability for DCGM Diagnostics to skip the NVLink integration test when NVLinks are not enabled. This can be accomplished by adding the `-p pcie.test_nvlink_status=false` option to the `dcgmi diag` command line.
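As a sketch of the option above (assuming `dcgmi` is installed and `nv-hostengine` is running on the local host), the parameter is passed through `-p` when launching a run level that includes the PCIe plugin:

```shell
# Run the medium diagnostic (run level 2, which includes the PCIe test)
# and skip the NVLink status check when NVLinks are not enabled.
dcgmi diag -r 2 -p pcie.test_nvlink_status=false
```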
Fixed an issue with the calculation of the
Fixed an issue with package dependencies on the `libgomp` package on SUSE SLES based distributions.
Fixed an issue where DCGM reported `N/A` for the memory field identifiers (ids: 250, 251, 252) for memory used, free and total. Also added a new field identifier `DCGM_FI_DEV_FB_USED_PERCENT` to track the percentage of memory used.
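A minimal sketch of sampling these framebuffer memory fields with `dcgmi dmon`, assuming a running `nv-hostengine` on a DCGM-supported GPU:

```shell
# Sample the framebuffer memory fields once for all GPUs.
# Field ids from the item above: 250 = total, 251 = free, 252 = used (MiB).
dcgmi dmon -e 250,251,252 -c 1
```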
Fixed printing of error messages when health checks (`dcgmi health -c`) detect errors.
Added the ability for DCGM diagnostics to run iteratively. Users can specify `--iterations X` on the `dcgmi diag` command to run X iterations of the diag successively.
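For example (a sketch assuming `dcgmi` and a supported GPU; the run level and iteration count are illustrative):

```shell
# Run the full diagnostic (run level 3) three times in succession.
dcgmi diag -r 3 --iterations 3
```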
Added run level 4 (`-r 4`) for DCGM diagnostics. This level includes a suite of memory access pattern tests. Refer to the memory access pattern tests documentation.
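Invocation is the same as the other run levels (sketch; run level 4 is long-running and assumes a supported GPU):

```shell
# Run level 4: includes the memory access pattern test suite.
dcgmi diag -r 4
```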
Added the ability for `dcgmi diag` to run as a non-root user.
Added support for new field
Added support for Debian 11.
Added new profiling metrics to track Tensor Core usage for different precision types (FP32, FP16 and INT8), exposed as new field identifiers.
Added the ability for DCGM diagnostics to detect downgraded, disabled and faulty NVLinks
Improved error logging information from DCGM diagnostics - for example, reasons for tests being skipped, relevant reasons on test failures and GPU serial numbers.
Fixed an issue in DCGM diagnostics where a failure on one GPU would be attributed to all GPUs in a multi-GPU system.
Fixed an issue where DCGM diagnostics was not handling driver timeouts correctly.
Fixed an issue so that DCGM diagnostics no longer prints `Error: unable to establish a connection to the specified host: localhost` when a `--host` parameter is not passed.
Fixed an issue in DCGM diagnostics to handle Ctrl-C signals correctly.
Fixed an issue with metrics in MIG mode where all field values would report incorrect values after a few hours.
Fixed an issue on A100 in MIG mode where some whole-GPU metrics, such as temperature and power, were returned as 0 for MIG devices.
Fixed an issue on A100 in MIG mode to report memory (`DCGM_FI_DEV_FB_TOTAL`) per MIG device.
Fixed an issue where package installation would fail on RHEL systems.
Removed the redundant `temperature_max` setting from the `diag-skus.yaml` configuration for DCGM Diagnostics.
Fixed an issue where DCGM with R510+ drivers was using an incorrect NVML API to return memory usage. A new field identifier `DCGM_FI_DEV_FB_RESERVED` was added to distinguish between actual usage and reserved memory.
On V100, DCGM metrics may be reported as 0 after some time interval when two or more CUDA contexts are active on the GPU.
On DGX-2/HGX-2 systems, ensure that `nv-hostengine` and the Fabric Manager service are started before using `dcgmproftester` for testing the new profiling metrics. See the Getting Started section in the DCGM User Guide for details on installation.
`nvidia-smi` may report hardware throttling (`clocks_throttle_reasons.hw_slowdown = ACTIVE`) during DCGM Diagnostics (Level 3). The stressful workload results in power transients that engage the HW slowdown mechanism to ensure that the Tesla K80 product operates within the power-capping limit over both long-term and short-term timescales. For Volta or later Tesla products, this reporting issue has been fixed and workload transients are no longer flagged as “HW Slowdown”. The NVIDIA driver accurately detects whether a slowdown event is due to thermal thresholds being exceeded or to an external power-brake event. It is recommended that customers ignore this failure mode on Tesla K80 if the GPU temperature is within specification.
To report NVLink bandwidth utilization, DCGM programs counters in the hardware to extract the desired information. It is currently possible for certain other tools a user might run, including `nvprof`, to change these settings after DCGM monitoring begins. In such a situation, DCGM may subsequently return errors or invalid values for the NVLink metrics. There is currently no way within DCGM to prevent other tools from modifying this shared configuration. Once the interfering tool is done, a user of DCGM can repair the reporting by running `nvidia-smi nvlink -sc 0bz; nvidia-smi nvlink -sc 1bz`.