DCGM Release Notes

This version of DCGM (v3.1) requires a minimum R450 Linux datacenter driver (>= 450.80.02), which can be downloaded from the NVIDIA Drivers site. It is recommended to install the latest datacenter driver from the NVIDIA driver downloads site for use with DCGM.

3.1.6

Improvements

  • Added support for HGX H100 and H100 SXM products.

  • Updated the Input EDPp tests to support H100 products.

  • Added a warning when users attempt to run DCGM Diagnostics on GPUs configured in MIG mode but no MIG devices have been created.

Fixed Issues

  • Fixed an error with missing symbols when running DCGM Diagnostics run levels 3 and 4. The error manifested as "Unable to merge JSON results for regular Diag and EUD tests", and the logs contained the error "Couldn’t load a definition for GetPluginInterfaceVersion in plugin".

  • Fixed an intermittent crash when running the Input EDPp tests.

  • Fixed an issue where DCGM Diagnostics incorrectly failed upon discovering inactive NVLinks.

3.1.3

New Features

  • Added support for the NVIDIA Hopper architecture and NVIDIA H100 PCIe product:

    • Added support for Hopper performance monitoring APIs

    • Added support for Hopper Multi-Instance GPU profiles

    • Added support for DCGM GPU diagnostics

  • Added support for NVIDIA Ada architecture, NVIDIA L40 product

  • Added telemetry for NVSwitches. See the API documentation (field identifiers) for the new fields.

  • Added support for End User Diagnostics (EUD) as a preview feature for specific PCIe products

  • Added support for CUDA 12

  • Added the ability for DCGM Diagnostics to skip the NVLink integration test when NVLinks are not enabled. This can be accomplished by adding the -p pcie.test_nvlink_status=false option to the dcgmi diag command line (see the example after this list).

  • Added support for Red Hat Enterprise Linux (RHEL) 9.
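
For example, the following invocation skips the NVLink status check during a diagnostic run; the run level shown (-r 3) is an arbitrary choice, and only the -p pcie.test_nvlink_status=false parameter comes from the note above:

    dcgmi diag -r 3 -p pcie.test_nvlink_status=false   # run level 3 chosen arbitrarily for illustration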

Major API changes and Deprecations

The following features have been dropped or deprecated starting with DCGM 3.0:

  • The socket protocol based on protobuf has been removed

  • The DCGM introspection APIs have been removed (except for host engine memory usage and host CPU usage)

  • The following field identifiers have been removed:

    • DCGM_FI_DEV_GRAPHICS_PIDS

    • DCGM_FI_DEV_COMPUTE_PIDS

    • DCGM_FI_DEV_GPU_UTIL_SAMPLES

    • DCGM_FI_DEV_MEM_COPY_UTIL_SAMPLES

  • Support for CUDA 9 and CUDA 10 based drivers has been removed; DCGM diagnostics cannot be used on systems with these older driver installations

  • For reading metrics, the dcgmProfWatchFields() API is no longer supported (and will return a DCGM_ST_NOT_SUPPORTED error). Instead, the more generic dcgmWatchFields() API should be used (see the example after this list).

  • The sm_stress test is no longer run by default for the -r 3 and -r 4 run levels. To invoke the test separately, dcgmi diag -r sm_stress can be used.
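
As an illustration of the generic watch path that replaces dcgmProfWatchFields(): API clients should create a field group containing the desired profiling field IDs and pass it to dcgmWatchFields(). From the command line, profiling metrics can still be streamed with dcgmi dmon; the field IDs below (1001 for graphics engine activity, 1005 for DRAM activity) are assumptions taken from dcgm_fields.h and should be confirmed with dcgmi dmon -l:

    dcgmi dmon -e 1001,1005 -d 1000 -c 10   # field IDs assumed from dcgm_fields.h; verify with dcgmi dmon -l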

Fixed Issues

  • The Input EDPp (“Pulse”) test (run under -r 4) and the memory bandwidth test are now supported for H100 PCIe in this release

  • Fixed an issue with the Pulse test (under -r 4) which caused the test to hang in some scenarios on A100 systems

  • Fixed an issue in DCGM diagnostics where a failure on one GPU would be attributed to all GPUs in a multi-GPU system.

  • Fixed an issue with the calculation of the DCGM_FI_DEV_FB_USED_PERCENT metric

  • Fixed an issue with package dependencies on the libgomp package on SUSE SLES based distributions

  • Fixed an issue where DCGM diagnostics was not handling driver timeouts correctly

  • Fixed an issue where DCGM diagnostics would print "Error: unable to establish a connection to the specified host: localhost" when the --host parameter was not passed.

  • Fixed an issue where DCGM diagnostics did not handle Ctrl-C signals correctly.

  • Fixed an issue with metrics in MIG mode where all fields would report incorrect values after a few hours.

  • Fixed an issue on A100 in MIG mode where some whole-GPU metrics, such as temperature and power, were returned as 0 for MIG devices.

  • Fixed an issue on A100 in MIG mode so that memory metrics (DCGM_FI_DEV_FB_FREE, DCGM_FI_DEV_FB_USED, and DCGM_FI_DEV_FB_TOTAL) are reported per MIG device.

  • Fixed an issue where package installation would fail on RHEL systems.

  • Removed the redundant temperature_max setting from the diag-skus.yaml configuration for DCGM Diagnostics.

  • Fixed an issue where DCGM with R510+ drivers was using an incorrect NVML API to return memory usage. A new field identifier, DCGM_FI_DEV_FB_RESERVED, was added to distinguish between actual used memory and reserved memory.
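
As an example of inspecting the framebuffer fields from the command line, the numeric field IDs below (250, 251, 252, and 253 for DCGM_FI_DEV_FB_TOTAL, DCGM_FI_DEV_FB_FREE, DCGM_FI_DEV_FB_USED, and DCGM_FI_DEV_FB_RESERVED) are assumptions taken from dcgm_fields.h and should be confirmed with dcgmi dmon -l:

    dcgmi dmon -e 250,251,252,253 -d 1000 -c 5   # field IDs assumed; verify with dcgmi dmon -l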

Known Issues

  • On V100, DCGM metrics may be reported as 0 after some time interval when two or more CUDA contexts are active on the GPU.

  • On DGX-2/HGX-2 systems, ensure that nv-hostengine and the Fabric Manager service are started before using dcgmproftester to test the new profiling metrics (see the example command sequence at the end of this list). See the Getting Started section in the DCGM User Guide for details on installation.

  • On K80s, nvidia-smi may report hardware throttling (clocks_throttle_reasons.hw_slowdown = ACTIVE) during DCGM Diagnostics (Level 3). The stressful workload results in power transients that engage the HW slowdown mechanism to ensure that the Tesla K80 product operates within the power capping limit on both long-term and short-term timescales. For Volta or later Tesla products, this reporting issue has been fixed and the workload transients are no longer flagged as “HW Slowdown”. The NVIDIA driver will accurately detect whether the slowdown event is due to thermal thresholds being exceeded or an external power brake event. It is recommended that customers ignore this failure mode on Tesla K80 if the GPU temperature is within specification.

  • To report NVLink bandwidth utilization, DCGM programs hardware counters to extract the desired information. It is currently possible for certain other tools that a user might run, including nvprof, to change these settings after DCGM monitoring begins. In such a situation, DCGM may subsequently return errors or invalid values for the NVLink metrics. There is currently no way within DCGM to prevent other tools from modifying this shared configuration. Once the interfering tool is done, a user of DCGM can repair the reporting by running nvidia-smi nvlink -sc 0bz; nvidia-smi nvlink -sc 1bz.
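
As an illustrative startup sequence for the DGX-2/HGX-2 note above (the systemd service names nvidia-fabricmanager and nvidia-dcgm assume a standard package installation, the dcgmproftester12 binary assumes a CUDA 12 install, and the profiling field ID 1004 for Tensor Core activity is an assumption; adjust to your environment):

    sudo systemctl start nvidia-fabricmanager   # Fabric Manager service (name assumes the standard package)
    sudo systemctl start nvidia-dcgm            # starts nv-hostengine (or run nv-hostengine directly)
    dcgmproftester12 -t 1004 -d 30              # binary name and field ID are assumptions; adjust as needed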