Getting Started

Supported Platforms

DCGM currently supports the following products and environments:

  • All Kepler (K80) and newer NVIDIA datacenter (previously Tesla) GPUs

  • NVSwitch on DGX A100, HGX A100, and newer systems

  • All Maxwell and newer non-datacenter (e.g. GeForce or Quadro) GPUs

  • CUDA 7.5+ and NVIDIA Driver R450+

  • Bare metal and virtualized (full passthrough only)


  • NVIDIA Driver R450 and later is required on systems using NVSwitch, such as DGX A100 or HGX A100. Starting with DCGM 2.0, Fabric Manager (FM) for NVSwitch systems is no longer bundled with DCGM packages. FM is a separate artifact that can be installed using the CUDA network repository. For more information, see the Fabric Manager User Guide.

  • Starting with v1.3, limited DCGM functionality is available on non-datacenter GPUs. More details are available in the section Feature Overview.

Supported Linux Distributions

Linux Distributions and Architectures

| Linux Distribution       | x86 (x86_64) | Arm64 (aarch64) | POWER (ppc64le) |
| Debian 11                | X            |                 |                 |
| RHEL 8.y/Rocky Linux 8.y | X            | X               | X               |
| RHEL 9.y/Rocky Linux 9.y | X            | X               |                 |
| RHEL/CentOS 7.y          | X            |                 |                 |
| SLES/OpenSUSE 15.y       | X            | X               |                 |
| Ubuntu 22.04 LTS         | X            | X               |                 |
| Ubuntu 20.04 LTS         | X            | X               |                 |
| Ubuntu 18.04 LTS         | X            | X               | X               |


To run DCGM, the target system must include the following NVIDIA components, listed in dependency order:

  • Supported NVIDIA Datacenter Drivers

  • On HGX systems, the Fabric Manager and NVSwitch Configuration & Query (NSCQ) packages

  • DCGM Runtime and SDK

  • DCGM Python bindings (if desired)

  • (Optional) Supported CUDA Toolkit

All of the core components are available as RPMs/DEBs from NVIDIA’s website. The Python bindings are available in the /usr/local/dcgm/bindings directory after installation. Installation requires root or sudo privileges, as with any package management operation.


  • DCGM is tested and designed to run with NVIDIA Datacenter Drivers. Attempting to run on other drivers, such as a developer driver, could result in missing functionality. Refer to the documentation on the various types of branches and support timelines.

  • On HGX systems, the Fabric Manager and NSCQ libraries must be installed so that DCGM can enumerate NVSwitches and provide NVSwitch telemetry. Refer to the HGX Software Guide.

Remove Older Installations

To remove the previous installation (if any), perform the following steps (e.g. on an RPM-based system).

Make sure that nv-hostengine is not running. You can stop it using the following command:

$ sudo nv-hostengine -t
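Before removing the package, it can be useful to confirm that the host engine has actually exited. A minimal check (a sketch, not part of the official instructions) using pgrep:

```shell
# Sketch: verify no nv-hostengine process remains before package removal.
# pgrep exits non-zero when no matching process is found.
if pgrep -x nv-hostengine > /dev/null; then
    echo "nv-hostengine is still running; stop it with: sudo nv-hostengine -t"
else
    echo "nv-hostengine stopped"
fi
```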

Remove the previous installation:

$ sudo yum remove datacenter-gpu-manager



Since the CUDA repository GPG keys were rotated, any old GPG keys should be removed from the system prior to installing packages from the CUDA network repository:

$ sudo apt-key del 7fa2af80

Ubuntu LTS and Debian

Determine the distribution name:

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
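To illustrate what this produces: on Ubuntu 22.04 the os-release fields are ID=ubuntu and VERSION_ID=22.04, and the sed expression strips the dot. The sketch below hard-codes those values rather than sourcing /etc/os-release, so it runs anywhere:

```shell
# Hard-coded os-release values for Ubuntu 22.04; the real command
# sources them from /etc/os-release. The sed removes every dot.
ID=ubuntu
VERSION_ID=22.04
distribution=$(echo $ID$VERSION_ID | sed -e 's/\.//g')
echo $distribution   # prints: ubuntu2204
```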

Download the meta-package to setup the CUDA network repository:

$ wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb

Install the repository meta-data and the CUDA GPG key:

$ sudo dpkg -i cuda-keyring_1.0-1_all.deb

Update the Apt repository cache:

$ sudo apt-get update

Now, install DCGM:

$ sudo apt-get install -y datacenter-gpu-manager

RHEL / CentOS / Rocky Linux

Note: substitute <architecture> with x86_64, sbsa, or ppc64le as appropriate.

Determine the distribution name:

$ distribution=$(. /etc/os-release;echo $ID`rpm -E "%{?rhel}%{?fedora}"`)
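For example, on RHEL 8 the rpm macro %{?rhel} expands to 8, so the variable becomes rhel8. The sketch below stubs the rpm call so it is self-contained and runs anywhere:

```shell
# Stub rpm so the example is self-contained; on a real RHEL 8 host,
# rpm -E "%{?rhel}%{?fedora}" prints 8.
rpm() { echo 8; }
ID=rhel
distribution=$(echo $ID$(rpm -E "%{?rhel}%{?fedora}"))
echo $distribution   # prints: rhel8
```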

Install the repository meta-data and the CUDA GPG key:

$ sudo dnf config-manager \
    --add-repo https://developer.download.nvidia.com/compute/cuda/repos/$distribution/<architecture>/cuda-$distribution.repo

Update the repository metadata:

$ sudo dnf clean expire-cache

Now, install DCGM:

$ sudo dnf install -y datacenter-gpu-manager


SLES / OpenSUSE

Note: substitute <architecture> with x86_64, sbsa, or ppc64le as appropriate.

Determine the distribution name:

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.[0-9]//')
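On SLES 15 SP4, for example, ID=sles and VERSION_ID=15.4; this sed variant drops only the minor version rather than every dot, yielding sles15. The sketch below hard-codes those values for illustration:

```shell
# Hard-coded os-release values for SLES 15 SP4; the real command
# sources them from /etc/os-release. The sed drops ".4" (the minor version).
ID=sles
VERSION_ID=15.4
distribution=$(echo $ID$VERSION_ID | sed -e 's/\.[0-9]//')
echo $distribution   # prints: sles15
```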

Install the repository meta-data and the CUDA GPG key:

$ sudo zypper ar https://developer.download.nvidia.com/compute/cuda/repos/$distribution/<architecture>/cuda-$distribution.repo

Update the repository metadata:

$ sudo zypper refresh

Now, install DCGM:

$ sudo zypper install datacenter-gpu-manager



On HGX systems (A100/A800 and H100/H800), you will need to install the NVIDIA Switch Configuration and Query (NSCQ) library for DCGM to enumerate the NVSwitches and provide telemetry for switches. Refer to the HGX Software Guide for more information.

NSCQ needs to match the driver version branch (XXX) installed on the system. Substitute XXX with the desired driver branch in the commands below.

$ sudo apt-get install -y libnvidia-nscq-XXX


Note that the default nvidia-dcgm.service files included in the installation package use the systemd format. If DCGM is being installed on OS distributions that use the init.d format, then these files may need to be modified.

Enable the DCGM systemd service (to start on boot) and start it now:

$ sudo systemctl --now enable nvidia-dcgm
● dcgm.service - DCGM service
  Loaded: loaded (/usr/lib/systemd/system/dcgm.service; disabled; vendor preset: enabled)
  Active: active (running) since Mon 2020-10-12 12:18:57 PDT; 14s ago
Main PID: 32847 (nv-hostengine)
    Tasks: 7 (limit: 39321)
  CGroup: /system.slice/dcgm.service
          └─32847 /usr/bin/nv-hostengine -n

Oct 12 12:18:57 ubuntu1804 systemd[1]: Started DCGM service.
Oct 12 12:18:58 ubuntu1804 nv-hostengine[32847]: DCGM initialized
Oct 12 12:18:58 ubuntu1804 nv-hostengine[32847]: Host Engine Listener Started

To verify installation, use dcgmi to query the system. You should see a listing of all supported GPUs (and any NVSwitches) found in the system:

$ dcgmi discovery -l
8 GPUs found.
| GPU ID | Device Information                                                   |
| 0      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:07:00.0                                         |
|        | Device UUID: GPU-1d82f4df-3cf9-150d-088b-52f18f8654e1                |
| 1      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:0F:00.0                                         |
|        | Device UUID: GPU-94168100-c5d5-1c05-9005-26953dd598e7                |
| 2      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:47:00.0                                         |
|        | Device UUID: GPU-9387e4b3-3640-0064-6b80-5ace1ee535f6                |
| 3      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:4E:00.0                                         |
|        | Device UUID: GPU-cefd0e59-c486-c12f-418c-84ccd7a12bb2                |
| 4      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:87:00.0                                         |
|        | Device UUID: GPU-1501b26d-f3e4-8501-421d-5a444b17eda8                |
| 5      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:90:00.0                                         |
|        | Device UUID: GPU-f4180a63-1978-6c56-9903-ca5aac8af020                |
| 6      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:B7:00.0                                         |
|        | Device UUID: GPU-8b354e3e-0145-6cfc-aec6-db2c28dae134                |
| 7      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:BD:00.0                                         |
|        | Device UUID: GPU-a16e3b98-8be2-6a0c-7fac-9cb024dbc2df                |
6 NvSwitches found.
| Switch ID |
| 11        |
| 10        |
| 13        |
| 9         |
| 12        |
| 8         |

Basic Components

The DCGM SDK contains these components:

DCGM shared library

The user space shared library, libdcgm.so, is the core component of DCGM. This library implements the major underlying functionality and exposes it as a set of C-based APIs. It sits on top of the NVIDIA driver, NVML, and the CUDA Toolkit.

NVIDIA Host Engine

The NVIDIA host engine, nv-hostengine, is a thin wrapper around the DCGM shared library. Its main job is to instantiate the DCGM library as a persistent standalone process, including appropriate management of the monitoring and management activities.


  • DCGM can run as root or non-root. Some DCGM functionality, such as configuration management, is not available when running as non-root.

  • On DGX-2 or HGX-2, nv-hostengine must run as root to enable the Fabric Manager.


DCGM CLI Tool

The command line interface to DCGM, dcgmi, is a network-capable interface into the NVIDIA host engine. It exposes much of the DCGM functionality in a simple, interactive format. It is intended for users and admins who want to control DCGM, or gather relevant data, without needing to build against the programmatic interfaces. It is not intended for scripting.

Python Bindings

The Python bindings are included with the DCGM package and installed in /usr/local/dcgm/bindings.

Software Development Kit

The DCGM SDK includes examples of how to leverage major DCGM features, alongside API documentation and headers. The SDK covers both the C- and Python-based APIs, and includes examples for using DCGM in both standalone and embedded modes. These are installed in /usr/local/dcgm/sdk_samples.

Modes of Operation

The core DCGM library can be run as a standalone process or be loaded by an agent as a shared library. In both cases it provides roughly the same class of functionality and has the same overall behavior. The choice of mode depends on how it best fits within the user’s existing environment.


In both modes the DCGM library should be run as root. Many features will not work without privileged access to the GPU, including various configuration settings and diagnostics.

Embedded Mode

In this mode the agent is loaded as a shared library. This mode is provided for the following situations:

  • A 3rd-party agent already exists on the node, and

  • Extra jitter associated with an additional autonomous agent needs to be managed

By loading DCGM as a shared library and managing the timing of its activity, 3rd-party agents can control exactly when DCGM is actively using CPU and GPU resources. In this mode the 3rd-party agent should generally load the shared library at system initialization and manage the DCGM infrastructure over the lifetime of the host. Since DCGM is stateful, it is important that the library is maintained over the life of the 3rd-party agent, not invoked in a one-off fashion. In this mode all data gathering loops, management activities, etc. can be explicitly invoked and controlled via library interfaces. A 3rd-party agent may choose, for example, to synchronize DCGM activities across an entire multi-node job in this way.


In this mode it is important that the various DCGM management interfaces be executed by the 3rd-party within the designated frequency ranges, as described in the API definitions. Running too frequently will waste resources with no noticeable gain. Running too infrequently will allow for gaps in monitoring and management coverage.

Working in this mode requires a sequence of setup steps and a management thread within the 3rd-party agent that periodically triggers all necessary DCGM background work. The logic is roughly as follows:

  • On Agent startup

    System or job-level setup, e.g.
    call dcgmGroupCreate() to set up GPU groups
    call dcgmWatchFields() to manage watched metrics
    call dcgmPolicySet() to set policy
  • Periodic Background Tasks (managed)

    Trigger system management behavior, i.e.
    call dcgmUpdateAllFields() to manage metrics
    call dcgmPolicyTrigger() to manage policies
    Gather system data, e.g.
    call dcgmHealthCheck() to check health
    call dcgmGetLatestValues() to get metric updates
  • On Agent shutdown

    call dcgmShutdown() to clean up

For a more complete example, see the Embedded Mode example in the DCGM SDK.

Standalone Mode

In this mode the DCGM agent is embedded in a simple daemon provided by NVIDIA, the NVIDIA Host Engine. This mode is provided for the following situations:

  • DCGM clients prefer to interact with a daemon rather than manage a shared library resource themselves

  • Multiple clients wish to interact with DCGM, rather than a single node agent

  • Users wish to leverage the NVIDIA CLI tool, dcgmi

  • Users of DGX-2 or HGX-2 systems will need to run the Host Engine daemon to configure and monitor the NVSwitches

Generally, NVIDIA prefers this mode of operation, as it provides the most flexibility and the lowest maintenance cost to users. In this mode the DCGM library management routines are invoked transparently at default frequencies and with default behaviors, in contrast to the user control provided by Embedded Mode. Users can either leverage the dcgmi tool to interact with the daemon process or load the DCGM library with the daemon’s IP address during initialization for programmatic interaction.

The daemon leverages a socket-based interface to speak with external processes, e.g. dcgmi. Users are responsible for configuring the system initialization behavior, post DCGM install, to ensure the daemon is properly executed on startup.


On DGX-2 or HGX-2 systems, nv-hostengine is automatically started at system boot time, so that the Fabric Manager can configure and monitor the NVSwitches.

Static Library

A statically-linked stub version of the DCGM library is included to remove an explicit dependency on the DCGM shared library. This library provides wrappers to the DCGM symbols and uses dlopen() to dynamically access libdcgm.so. If the shared library is not installed, or cannot be found in the LD_LIBRARY_PATH, an error code is returned. When linking against this library, libdl must be included on the compile line, which is typically done using:

$ gcc foo.c -o foo -ldcgm_stub -ldl