DCGM Release Notes
4.0.0
New Features
Entity Centric Messages
dcgmi diag output has been revised to report errors and info messages along with entity information. This allows the diagnostic to report on GPUs and on non-GPU hardware, including NVIDIA Grace CPUs and NVSwitches. The dcgmDiagResponse_v struct and the dcgmRunDiag_v message format have been updated accordingly.
NVBandwidth
A new plugin launches NVBandwidth to check inter-GPU communication on a single node; it is supported for CUDA 12 (see the example below).
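A minimal invocation sketch, assuming the new plugin is exposed to dcgmi diag under the test name nvbandwidth (confirm the exact name with dcgmi diag --help on your installation):

    # Run only the NVBandwidth inter-GPU bandwidth check (test name assumed)
    dcgmi diag -r nvbandwidth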
NVLink5 Monitoring
DCGM will now use the NVSDM library (if available) to monitor NVLink5.
Several new fields were added to monitor GPU NVLinks:
DCGM_FI_DEV_NVLINK_COUNT_TX_PACKETS
DCGM_FI_DEV_NVLINK_COUNT_TX_BYTES
DCGM_FI_DEV_NVLINK_COUNT_RX_PACKETS
DCGM_FI_DEV_NVLINK_COUNT_RX_BYTES
DCGM_FI_DEV_NVLINK_COUNT_RX_MALFORMED_PACKET_ERRORS
DCGM_FI_DEV_NVLINK_COUNT_RX_BUFFER_OVERRUN_ERRORS
DCGM_FI_DEV_NVLINK_COUNT_RX_ERRORS
DCGM_FI_DEV_NVLINK_COUNT_RX_REMOTE_ERRORS
DCGM_FI_DEV_NVLINK_COUNT_RX_GENERAL_ERRORS
DCGM_FI_DEV_NVLINK_COUNT_LOCAL_LINK_INTEGRITY_ERRORS
DCGM_FI_DEV_NVLINK_COUNT_TX_DISCARDS
DCGM_FI_DEV_NVLINK_COUNT_LINK_RECOVERY_SUCCESSFUL_EVENTS
DCGM_FI_DEV_NVLINK_COUNT_LINK_RECOVERY_FAILED_EVENTS
DCGM_FI_DEV_NVLINK_COUNT_LINK_RECOVERY_EVENTS
DCGM_FI_DEV_NVLINK_COUNT_RX_SYMBOL_ERRORS
DCGM_FI_DEV_NVLINK_COUNT_SYMBOL_BER
DCGM_FI_DEV_NVLINK_ERROR_DL_CRC
DCGM_FI_DEV_NVLINK_ERROR_DL_RECOVERY
DCGM_FI_DEV_NVLINK_ERROR_DL_REPLAY
These new fields will also display in the output of dcgmi nvlink -e.
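For example, the per-GPU NVLink counters can be inspected as follows; the -g 0 GPU selector is illustrative:

    # Show NVLink error and traffic counts for GPU 0
    dcgmi nvlink -g 0 -e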
Miscellaneous
In addition to the /var/log/nvidia-dcgm directory automatically created by the deb/rpm post-install scripts, nv-hostengine will attempt at startup to create the log-file directory specified by either the DCGM_HOME_DIR environment variable or the --home-dir command line argument (see the example below).
Debug symbol packages are available for non-proprietary packages in the RPM format.
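A minimal sketch of pointing nv-hostengine at an alternate log-file directory at startup; the /opt/dcgm-home path is illustrative:

    # Via the environment variable ...
    DCGM_HOME_DIR=/opt/dcgm-home nv-hostengine
    # ... or via the command line argument
    nv-hostengine --home-dir /opt/dcgm-home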
Improvements
NVIDIA Grace CPU serial numbers are now available via DCGM API.
Diagnostic run levels 3 and 4 now include the Grace CPU EUD.
The Grace CPU EUD can be run individually via dcgmi diag -r cpu_eud (see the example after this list).
dcgmi diag output now displays detected Grace CPUs; their serial numbers are reported in the JSON output.
Additional patterns have been added to the pulse test to better cover Hopper GPUs.
The CUDA kernels used by DCGM are now compiled against CUDA 12.6.3.
/dev/kmsg is now parsed to detect some XIDs that were previously undetected.
PCIe test error messages have been improved for clarity.
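For example, the Grace CPU EUD mentioned above can be run on its own and its results, including CPU serial numbers, captured as JSON:

    # Run the Grace CPU EUD and emit the results in JSON format
    dcgmi diag -r cpu_eud --json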
Fixed Issues
The occasional hang in the pulse test for Hopper GPUs has been fixed.
Many errors that were previously misattributed to multiple GPUs are now correctly attributed to only the offending hardware.
A false positive warning on the memtest has been fixed.
The rate of PCIe replays required to cause a failure has been corrected (it was previously too low).
An incorrect abort due to signal mishandling during the PCIe test was resolved.
Fixed an issue that prevented reporting more than 50% utilization for tensor activity on L20 GPUs.
Deprecations and Breaking Changes
New JSON format for dcgmi diag
Please see dcgm_diag_schema.json for the updated format.
Removed the tmp_dir parameter from the EUD plugins (eud and cpu_eud).
Subtests of the DCGM Software Diagnostic are no longer individually reported; aggregated results of the software test are now reported instead.
NVVS (long deprecated) no longer writes human-readable output.
The dcgmActionValidate_v2() API function now prioritizes the group ID argument. Entity ID arguments will not be considered unless the group ID is set to DCGM_GROUP_NULL.
The -g argument to dcgmi diag, used to specify a list of GPUs to run the diagnostic on, has been deprecated and may be removed in a future release. For compatibility with future releases, use -i to specify the list of entities to run the diagnostic on.
dcgm.service has been demoted from a stand-alone systemd unit to an alias of the nvidia-dcgm.service systemd unit.
Installation assets are no longer shipped in a single monolithic package. Instead, installation assets have been split among several packages, allowing clients to opt out of installing assets that are not applicable to their use case.
Component packages are as follows:
datacenter-gpu-manager-4-core
Provides nv-hostengine binary and other CUDA-agnostic installation assets available through the DCGM open source product
datacenter-gpu-manager-4-cuda11
Provides the CUDA11-specific binaries available through the DCGM open source product
datacenter-gpu-manager-4-cuda12
Provides the CUDA12-specific binaries available through the DCGM open source product
datacenter-gpu-manager-4-proprietary
Provides CUDA-agnostic installation assets not distributed as part of the DCGM open source product
datacenter-gpu-manager-4-proprietary-cuda11
Provides CUDA11 binaries not distributed as part of the DCGM open source product
datacenter-gpu-manager-4-proprietary-cuda12
Provides CUDA12 binaries not distributed as part of the DCGM open source product
datacenter-gpu-manager-4-development
Provides files necessary for the development of downstream software dependent on the DCGM library
Additional information is in the package documentation.
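For example, a GPU-only deployment targeting CUDA 12 might install just the core and CUDA 12 packages; repository setup and the package manager vary by distribution:

    # Debian/Ubuntu
    apt-get install datacenter-gpu-manager-4-core datacenter-gpu-manager-4-cuda12
    # RHEL
    dnf install datacenter-gpu-manager-4-core datacenter-gpu-manager-4-cuda12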
Installation paths have been updated to more closely conform to the Filesystem Hierarchy Specification version 3.0:
Binaries previously installed to /usr/share/nvidia-validation-suite/ are now installed to /usr/libexec/datacenter-gpu-manager-4/
Administrator scripts previously installed to /usr/local/dcgm/scripts/ are now installed to /usr/sbin/
Python bindings are now installed to /usr/share/datacenter-gpu-manager-4/bindings/python3/
Sample configuration files are now installed to /usr/share/doc/datacenter-gpu-manager-4/examples/
C header files for development of software dependent on libdcgm are now installed to /usr/include/datacenter-gpu-manager-4
CMake find package modules are now installed to /usr/share/cmake/
Software development samples are now installed to /usr/src/datacenter-gpu-manager-4/
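A sketch of building a client against the relocated headers with the development package installed; my_app.c is a hypothetical source file:

    # Headers now live under /usr/include/datacenter-gpu-manager-4
    gcc -I/usr/include/datacenter-gpu-manager-4 my_app.c -ldcgm -o my_app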
Known Issues
The pulse test may crash intermittently.
3.3.9
New Features
Added support for H100 144GB BM3 (devId 2348)
Added support for H20 HBM3e (232c)
Improvements
Added SoC power utilization telemetry for Grace CPUs
Fixed Issues
Corrected an issue that caused spurious dcgmproftester failures in MIG environments
Corrected an issue that caused dcgmproftester worker processes to crash on shutdown
Corrected an issue where sequential DCGM tests would report that GPU resources are busy
Corrected an issue that caused diag -r 4 Memtest to fail with a warning on healthy H100 GPUs.
3.3.8
New Features
The DCGM diagnostic now includes the --expectedNumEntities parameter to specify the expected number of GPUs in default groups. This helps identify potential fall-off-the-bus GPUs by failing the diagnostic if the actual GPU count differs from the expected number.
The DCGM diagnostic now has an unlimited default timeout, replacing the previous 8-hour limit. Users can set a custom timeout using the --timeout command line argument (see the example after this list).
DCGM diagnostic now supports the H200NVL GPU (SKU 0x233b).
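A combined sketch of the two new options described above; the gpu:8 entity-count syntax and the timeout value in seconds are assumptions, so confirm the accepted forms with dcgmi diag --help:

    # Fail the diagnostic if fewer than 8 GPUs are visible, and cap the run at one hour
    dcgmi diag -r 3 --expectedNumEntities gpu:8 --timeout 3600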
Improvements
The DCGM Diagnostic now fails early if a pending row remapping is detected.
The nvidia-dcgm service is now configured to start in the appropriate order relative to other systemd services, including nvidia-mig-manager.
The DCGM Diagnostic now supports test parameter values up to 1024 characters, allowing for more detailed customization.
The EUD diagnostic now supports multiple specifications of the passthrough_args parameter via the command line. These specifications are subsequently concatenated to form the final parameter value.
The DCGM Diagnostic command line now allows multiple instances of the -p/--parameters option. However, each test's parameter, except for eud.passthrough_args and cpu_eud.passthrough_args, should still be specified only once.
The CPU EUD (dcgmi diag -r cpu_eud) now runs as root, following the behavior of the GPU EUD (dcgmi diag -r eud).
Fixed Issues
The DCGM Diagnostic Software Plugin now correctly attributes errors to proper GPU indices.
Fixed crashes in the DCGM Diagnostic PCIe and Memory Bandwidth plugins on systems with multiple NUMA nodes.
DCGM health monitoring of PCIe bus error rates now depends on the PCIe generation and expected throughput.
Fixed Grace CPU utilization computation.
3.3.7
New Features
Initial support for the Grace CPU EUD via the new dcgmi diag -r cpu_eud command. Requires installation of the cpueud package.
EUD is enabled for the aarch64 platform.
Critical XID events can now be parsed from kernel logs.
Improvements
DCGM now works in environments without NVIDIA GPU drivers installed, to support systems with only Grace CPUs.
The dcgmi output now includes the EUD version.
Fixed Issues
Fixed a segmentation fault in dcgmi during diagnostic runs.
Fixed an issue that stopped the dcgmproftester from working in mixed MIG environments.
Fixed an issue that did not allow the dcgmproftester to run all tests in a single-GPU environment.
T400 and T400 4GB SKUs are disabled in dcgmproftester.
3.3.6
New Features
Added support for HBM temperature sensors.
Fixed Issues
Fixed an issue where DCGM reported extremely high temperature values on some GPUs.
Fixed overflow in the Memory test.
Fixed an issue that could lead to GSP timeout errors in the OpenRM driver.
Fixed an issue where the Pulse and EUD tests could report issues with a GPU even when the GPU is healthy.
Fixed an issue that led to incorrect Grace CPU utilization and temperature values.
Fixed an issue with duplicated errors in the diagnostic reporting.
Fixed an issue that led to a paused DCGM state if the EUD test was interrupted.
3.3.5
New Features
The DCGM Diagnostic’s diagnostic plugin will now fail if any NaN values are detected in the result matrix.
Added support for H200 (devId 2335)
Added support for H20 (devId 2329)
Improvements
DCGM Diagnostic’s Targeted Power plugin will now use FP64 math to achieve higher power usage on GH200 (devId 2342)
Improved DCGM Diagnostic’s Software plugin’s ability to find installed libraries on the system as part of its library check.
Fixed Issues
Addressed an issue in the SysMon module that made DCGM startup non-deterministic.
3.3.3
New Features
Added support for the L2 GPU.
Improvements
Prevented duplicate errors from being returned in DCGM Diag’s json/text output
Fixed Issues
Fixed reporting of CUDA errors in DCGM Diag to be per-GPU rather than for all GPUs.
3.3.2
New Features
Added support for L20 GPU.
Fixed Issues
Added the gpuId to the JSON output when the DCGM Diag Deployment plugin fails.
3.3.1
New Features
Added support for A800 20bd SKU.
Added support for water-cooled A800 GPU.
Added CPU power and thermal health checks.
Added C2C support.
Improvements
All XIDs during diagnostics are now reported.
Some logs’ verbosity was reduced from Error to Debug level.
Stopped checking NVLink replay counts as a failure condition.
Made the EUD independent of the service account. Fixed direct runs of the EUD diagnostic.
The multi-node health check script is now included in the installation packages.
Relaxed PCI width testing for QA scripts.
Fixed Issues
Fixed EUD diagnostic when MLE parsing is enabled.
Fixed setting of logging severity via dcgmi.
Fixed a crash in the pulse test.
Resolved an issue causing the diagnostic to hang on systems with an odd number of GPUs.
3.3.0
New Features
Added support for monitoring NVIDIA Grace CPUs
Added DCGM Diag support for the GPUs of Grace + Hopper systems (devId 2342)
Added the following fieldIds for NvSwitch power: DCGM_FI_DEV_NVSWITCH_POWER_VDD, DCGM_FI_DEV_NVSWITCH_POWER_DVDD, DCGM_FI_DEV_NVSWITCH_POWER_HVDD
Added DCGM Diag pulse test support for the L4 GPU
Improvements
Reworked DCGM Diag error reporting to include more specific error categories and next steps to aid in automation workflows
Data Center Profiling metrics are now allowed on SKUs with the DCGM_BRAND_NVIDIA_RTX brand, such as the A6000.
Added error id, category, and severity to the dcgmi diag --json output for the Deployment Plugin
Fixed Issues
Added a workaround for DCGM_FI_DEV_MEMORY_TEMP being BLANK on r545 drivers. This is due to NVIDIA Bug 4300930 in the NVML library.
Fixed an uninitialized memory bug in the memtest plugin of dcgmi diag -r 4.
3.2.6
New Features
Added DCGM Diag support for L40S, H100 PCIe (devId 2321), and H800 PCIe (devId 233a)
Improvements
Added logging of health check failures to /var/log/nv-hostengine.log in addition to the dcgmHealthCheck() API returning errors.
Fixed Issues
Fixed dcgmi diag’s Permission and OS Blocks subtest failing within containers.
Fixed dcgmi diag -r eud eud.suite_level returning Invalid Parameter.
Fixed a segfault in DCGM Diag's nvvs process when GPUs failed to initialize.
3.2.5
New Features
DCGM Diag’s PCIe test will now utilize subprocesses and NUMA to achieve optimal D2H and H2D bandwidth on some AMD CPUs where that is required.
Improvements
The DCGM Diag PCIe plugin now uses Bit Error Rates (BER) instead of static thresholds when detecting excessive PCIe replay.
Added a reminder to restart the DCGM service when DCGM Diag warns about the nvvs binary not being found.
Fixed Issues
Fixed dcgmi diag not running on ARM64 and PPC64LE platforms in DCGM 3.2.3.
Fixed the RPATH for the DCGM libraries on platforms where DCGM libraries exist in the /lib/ directory (ppc64le RHEL).
3.2.3
New Features
Added a reference implementation of DCGM + NCCL multi-node testing.
Added a subtest to DCGM Diagnostic’s PCIe test that does GEMMs concurrent to P2P copies.
Added -r production_testing to DCGM Diagnostics to capture production line testing as a specific use case.
Added detection of host side PCIe replays to dcgmi diag -r production_testing as a failure condition.
Added support for profiling telemetry fieldIds 1001+ for Ada L4
Added power telemetry for NvSwitches.
Added gather-dcgm-logs.sh to gather all DCGM log files when submitting bugs
Improvements
Removed DCGM’s dependency on OpenMP
Added the discrete error_id to the JSON output of DCGM Diag to enable scripting actions based on error codes.
The DCGM Diagnostic’s EUD plugin now writes its logs to /var/logs/dcgm like the rest of DCGM.
Fixed Issues
Fixed nv-hostengine thrashing the heap under heavy load.
Fixed DCP metrics on H100 sometimes returning N/A values under MIG.
Deprecations
dcgm_prometheus.py has been deprecated. Please use the DCGM exporter for Prometheus integration.
3.1.8
Improvements
Added a static library of libdcgm.
Improved the DCGM Diagnostic PCIe Plugin’s detection of broken P2P between GPUs
Fixed Issues
Fixed an issue where DCGM could hang on systems with NvSwitches
Fixed a timing issue where field IDs 1001+ could return N/A values on H100 GPUs.
Fixed dcgmproftester11 not working on drivers r515 and older
Fixed dcgmproftester10 not working for FP16, FP32, and FP64
Fixed minor bugs in the DCGM Diagnostic EUD plugin.
3.1.7
Improvements
Added support for the NVIDIA L40 and NVIDIA L4 (based on the Ada Lovelace architecture)
Added support for the 800-series of NVIDIA GPU products
Included metadata on software versions and GPUs detected when running DCGM Diagnostics
Updated configuration parameters for the Input EDPp test on H100 and added the ability for users to select a subset of the test patterns.
Fixed Issues
Fixed an issue where DCGM Diagnostics would hang in scenarios where the GPU is unable to be enumerated any longer on the PCIe bus during a diagnostics run.
Fixed an issue where DCGM failed to install on SLES 15 due to the inability to create a nvidia-dcgm group.
Fixed a memory leak in libdcgmmoduleprofiling.so when monitoring MIG devices with profiling metrics.
Known Issues
When profiling metrics are monitored (for example, dcgmi dmon -e 1001), some metrics might be reported as "N/A" for some intervals while a CUDA context is active (for example, dcgmproftester12 -t 1007 -d 50 --no-dcgm-validation). This is due to a timing issue where samples are cleaned up before they can be used to calculate the metrics. This issue is resolved in DCGM 3.1.8.
Running dcgmproftester12 without any arguments may result in an error after a while: Error -24 from InitializeGpus(). Exiting.
3.1.6
Improvements
Fixed Issues
Fixed an error with missing symbols when running DCGM Diagnostics run levels 3 and 4. The errors manifest as "Unable to merge JSON results for regular Diag and EUD tests", and logs will contain this error: "Couldn't load a definition for GetPluginInterfaceVersion in plugin".
Fixed an intermittent crash when running the Input EDPp tests.
Fixed incorrect failure of DCGM diagnostics on discovering inactive NVLinks.
3.1.3
New Features
Added support for the NVIDIA Hopper architecture and NVIDIA H100 PCIe product:
Added support for Hopper performance monitoring APIs
Added support for Hopper Multi-Instance GPU profiles
Added support for DCGM GPU diagnostics
Added support for NVIDIA Ada architecture, NVIDIA L40 product
Added telemetry for NVSwitches. See the API documentation (field identifiers) for new fields.
Added support for End User Diagnostics (EUD) as a preview feature for specific PCIe products
Added support for CUDA 12
Added the ability for DCGM Diagnostics to skip the NVLink integration test when NVLinks are not enabled. This can be accomplished by adding the -p pcie.test_nvlink_status=false option to the dcgmi diag command line (see the example below).
Added support for Red Hat Enterprise Linux (RHEL) 9.
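For example (run level 3 shown; any run level that includes the PCIe test applies):

    # Skip the NVLink status check within the PCIe test when NVLinks are not enabled
    dcgmi diag -r 3 -p pcie.test_nvlink_status=false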
Major API changes and Deprecations
The following features have been dropped or deprecated starting with DCGM 3.0:
The socket protocol based on protobuf has been removed
The DCGM introspection APIs have been removed (except for host engine memory usage and host CPU usage)
The following field identifiers have been removed:
DCGM_FI_DEV_GRAPHICS_PIDS
DCGM_FI_DEV_COMPUTE_PIDS
DCGM_FI_DEV_GPU_UTIL_SAMPLES
DCGM_FI_DEV_MEM_COPY_UTIL_SAMPLES
Support for CUDA 9 and CUDA 10 based drivers has been removed; DCGM diagnostics cannot be used on systems with these older driver installations
For reading metrics, the dcgmProfWatchFields() API is no longer supported (and will return a DCGM_ST_NOT_SUPPORTED error). Instead, the more generic dcgmWatchFields() API should be used.
The sm_stress test is no longer run by default for the -r 3 and -r 4 run levels. To invoke the test separately, dcgmi diag -r sm_stress can be used.
Fixed Issues
The Input EDPp test ("Pulse") with -r 4 and Memory bandwidth tests are now supported for H100 PCIe in this release.
Fixed an issue with the Pulse test (under -r 4) which caused the test to hang in some scenarios on A100 systems.
Fixed an issue in DCGM diagnostics where a failure on one GPU would be attributed to all GPUs in a multi-GPU system.
Fixed an issue with the calculation of the DCGM_FI_DEV_FB_USED_PERCENT metric.
Fixed an issue with package dependencies on the libgomp package on SUSE SLES based distributions.
Fixed an issue where DCGM diagnostics was not handling driver timeouts correctly.
Fixed an issue so that DCGM diagnostics does not print "Error: unable to establish a connection to the specified host: localhost" when a --host parameter was not passed.
Fixed an issue in DCGM diagnostics to handle Ctrl-C signals correctly.
Fixed an issue with metrics in MIG mode where all field values would report incorrect values after a few hours.
Fixed an issue on A100 in MIG mode where some whole GPU metrics such as temperature, power etc. were returned as 0 for MIG devices.
Fixed an issue on A100 in MIG mode to report memory (DCGM_FI_DEV_FB_FREE, DCGM_FI_DEV_FB_USED, and DCGM_FI_DEV_FB_TOTAL) per MIG device.
Fixed an issue where package installation would fail on RHEL systems.
Removed the redundant temperature_max setting from the diag-skus.yaml configuration for DCGM Diagnostics.
Fixed an issue where DCGM with R510+ drivers was using an incorrect NVML API to return memory usage. A new field identifier DCGM_FI_DEV_FB_RESERVED was added to distinguish between the actual usage and reserved memory.
Known Issues
On V100, DCGM metrics may be reported as 0 after some time interval when two or more CUDA contexts are active on the GPU.
On DGX-2/HGX-2 systems, ensure that nv-hostengine and the Fabric Manager service are started before using dcgmproftester for testing the new profiling metrics. See the Getting Started section in the DCGM User Guide for details on installation.
On K80s, nvidia-smi may report hardware throttling (clocks_throttle_reasons.hw_slowdown = ACTIVE) during DCGM Diagnostics (Level 3). The stressful workload results in power transients that engage the HW slowdown mechanism to ensure that the Tesla K80 product operates within the power capping limit for both long term and short term timescales. For Volta or later Tesla products, this reporting issue has been fixed and the workload transients are no longer flagged as "HW Slowdown". The NVIDIA driver will accurately detect if the slowdown event is due to thermal thresholds being exceeded or an external power brake event. It is recommended that customers ignore this failure mode on Tesla K80 if the GPU temperature is within specification.
To report NVLink bandwidth utilization, DCGM programs counters in the hardware to extract the desired information. It is currently possible for certain other tools a user might run, including nvprof, to change these settings after DCGM monitoring begins. In such a situation, DCGM may subsequently return errors or invalid values for the NVLink metrics. There is currently no way within DCGM to prevent other tools from modifying this shared configuration. Once the interfering tool is done, a user of DCGM can repair the reporting by running nvidia-smi nvlink -sc 0bz; nvidia-smi nvlink -sc 1bz.