Getting Started
Supported Platforms
DCGM currently supports the following products and environments:
All NVIDIA Kepler™ (K80) and newer NVIDIA datacenter (previously, NVIDIA Tesla®) GPUs
NVIDIA® NVSwitch™ on NVIDIA DGX™ A100, NVIDIA HGX™ A100 and newer.
All NVIDIA Maxwell™ and newer non-datacenter (e.g. NVIDIA® GeForce® or NVIDIA® Quadro®) GPUs
CUDA® 7.5+ and NVIDIA Driver R450+
Bare metal and virtualized (full passthrough only)
Note
NVIDIA Driver R450 and later is required on systems using NVSwitch™, such as NVIDIA DGX™ A100 or NVIDIA HGX™ A100. Starting with DCGM 2.0, Fabric Manager (FM) for NVSwitch™ systems is no longer bundled with DCGM packages. FM is a separate artifact that can be installed using the CUDA network repository. For more information, see the Fabric Manager User Guide.
Starting with v1.3, limited DCGM functionality is available on non-datacenter GPUs. More details are available in the section Feature Overview.
Supported Linux Distributions
Linux Distribution | x86 (x86_64) | Arm64 (aarch64)
---|---|---
Debian 12 | X |
RHEL 8.y/Rocky Linux 8.y | X | X
RHEL 9.y/Rocky Linux 9.y | X | X
SLES/OpenSUSE 15.y | X | X
Ubuntu 24.04 LTS | X | X
Ubuntu 22.04 LTS | X | X
Ubuntu 20.04 LTS | X | X
Installation
Pre-Requisites
The system package manager has been configured to use an NVIDIA package registry for the system’s Linux distribution. For those using a CUDA local package registry on disk, it is recommended to update to the latest version available.
Please refer to the CUDA installation guide for detailed steps.
Installations of the following NVIDIA software must be present on the system:
A supported NVIDIA Datacenter Driver
Warning
DCGM is tested and designed to run with NVIDIA Datacenter Drivers. Attempting to run on other drivers, such as a developer driver, could result in missing functionality.
Please refer to the documentation on the various types of branches and support timelines.
On systems with NVSwitch™ hardware, such as NVIDIA DGX™ systems and NVIDIA HGX™ systems:
the Fabric Manager package
the NVSwitch™ Configuration & Query (NSCQ) package
Note
For each NVIDIA driver major version, there is a corresponding Fabric Manager package and a corresponding NSCQ package, e.g. for driver major version 550, the corresponding NSCQ package is libnvidia-nscq-550. See the output of nvidia-smi for the version of the NVIDIA driver installed on the system. For more information regarding the NSCQ package, please refer to the HGX Software Guide. For more information regarding the Fabric Manager package, please refer to the Fabric Manager User Guide.
A user needs sufficient permissions to install packages through the system package manager. This could be either a root user or a user with appropriate sudo privileges.
Any existing Data Center GPU Manager (DCGM) system services have been stopped, e.g.
$ sudo systemctl list-unit-files nvidia-dcgm.service > /dev/null && \
    sudo systemctl stop nvidia-dcgm
For best results, the system should have at least:
16GB of system memory (host RAM)
as many CPU cores as GPUs.
Tip
The number of CPU cores can be determined by running nproc.
Installation
Ubuntu LTS and Debian
Remove any installations of the datacenter-gpu-manager and datacenter-gpu-manager-config packages
$ sudo dpkg --list datacenter-gpu-manager &> /dev/null && \
    sudo apt purge --yes datacenter-gpu-manager
$ sudo dpkg --list datacenter-gpu-manager-config &> /dev/null && \
    sudo apt purge --yes datacenter-gpu-manager-config
Update the package registry cache
$ sudo apt-get update
Install the datacenter-gpu-manager-4 package corresponding to the system CUDA version
$ CUDA_VERSION=$(nvidia-smi | sed -E -n 's/.*CUDA Version: ([0-9]+)[.].*/\1/p')
$ sudo apt-get install --yes \
    --install-recommends \
    datacenter-gpu-manager-4-cuda${CUDA_VERSION}
Installing the recommended packages provides additional DCGM functionality which is not present in the DCGM open-source product. To opt out of these packages and the associated functionality, replace --install-recommends with --no-install-recommends.
(Optional) Install the datacenter-gpu-manager-4 development files
$ sudo apt install --yes datacenter-gpu-manager-4-dev
RHEL / CentOS / Rocky Linux
Remove any installations of the datacenter-gpu-manager and datacenter-gpu-manager-config packages.
$ sudo dnf list --installed datacenter-gpu-manager &> /dev/null && \
    sudo dnf remove --assumeyes datacenter-gpu-manager
$ sudo dnf list --installed datacenter-gpu-manager-config &> /dev/null && \
    sudo dnf remove --assumeyes datacenter-gpu-manager-config
Update the package registry cache.
$ sudo dnf clean expire-cache
Install the datacenter-gpu-manager-4 package corresponding to the system CUDA version, its dependencies, and respective associated recommended packages.
$ CUDA_VERSION=$(nvidia-smi | sed -E -n 's/.*CUDA Version: ([0-9]+)[.].*/\1/p')
$ sudo dnf install --assumeyes \
    --setopt=install_weak_deps=True \
    datacenter-gpu-manager-4-cuda${CUDA_VERSION}
Installing the recommended packages provides additional DCGM functionality which is not present in the DCGM open-source product. To opt out of these packages and the associated functionality, replace --setopt=install_weak_deps=True with --setopt=install_weak_deps=False.
(Optional) Install the datacenter-gpu-manager-4 development files
$ sudo dnf install --assumeyes datacenter-gpu-manager-4-devel
SUSE SLES / OpenSUSE
Remove any installations of the datacenter-gpu-manager and datacenter-gpu-manager-config packages
$ sudo zypper search --installed-only --match-exact datacenter-gpu-manager &> /dev/null && \
    sudo zypper --non-interactive remove datacenter-gpu-manager
$ sudo zypper search --installed-only --match-exact datacenter-gpu-manager-config &> /dev/null && \
    sudo zypper --non-interactive remove datacenter-gpu-manager-config
Update the package registry cache
$ sudo zypper refresh
Install the datacenter-gpu-manager-4 package corresponding to the system CUDA version, its dependencies, and respective associated recommended packages.
$ CUDA_VERSION=$(nvidia-smi | sed -E -n 's/.*CUDA Version: ([0-9]+)[.].*/\1/p')
$ sudo zypper install --no-confirm \
    --recommends \
    datacenter-gpu-manager-4-cuda${CUDA_VERSION}
Installing the recommended packages provides additional DCGM functionality which is not present in the DCGM open-source product. To opt out of these packages and the associated functionality, replace --recommends with --no-recommends.
(Optional) Install the datacenter-gpu-manager-4 development files
$ sudo zypper install --no-confirm datacenter-gpu-manager-4-devel
Post-Install
Note
Note that the default nvidia-dcgm.service files included in the installation package use the systemd format. If DCGM is being installed on OS distributions that use the init.d format, then these files will need to be modified.
Enable the DCGM systemd service (so it starts on reboot), start it now, and check its status
$ sudo systemctl --now enable nvidia-dcgm
$ sudo systemctl status nvidia-dcgm
● dcgm.service - DCGM service
Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; disabled; vendor preset: enabled)
Active: active (running) since Mon 2024-12-17 12:18:57 EDT; 14s ago
Main PID: 32847 (nv-hostengine)
Tasks: 7 (limit: 39321)
CGroup: /system.slice/nvidia-dcgm.service
└─32847 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
Oct 12 12:18:57 ubuntu1804 systemd[1]: Started DCGM service.
Oct 12 12:18:58 ubuntu1804 nv-hostengine[32847]: DCGM initialized
Oct 12 12:18:58 ubuntu1804 nv-hostengine[32847]: Host Engine Listener Started
To verify installation, use dcgmi to query the system. You should see a listing of all supported GPUs (and any NVSwitches) found in the system:
$ dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:07:00.0 |
| | Device UUID: GPU-1d82f4df-3cf9-150d-088b-52f18f8654e1 |
+--------+----------------------------------------------------------------------+
| 1 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:0F:00.0 |
| | Device UUID: GPU-94168100-c5d5-1c05-9005-26953dd598e7 |
+--------+----------------------------------------------------------------------+
| 2 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:47:00.0 |
| | Device UUID: GPU-9387e4b3-3640-0064-6b80-5ace1ee535f6 |
+--------+----------------------------------------------------------------------+
| 3 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:4E:00.0 |
| | Device UUID: GPU-cefd0e59-c486-c12f-418c-84ccd7a12bb2 |
+--------+----------------------------------------------------------------------+
| 4 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:87:00.0 |
| | Device UUID: GPU-1501b26d-f3e4-8501-421d-5a444b17eda8 |
+--------+----------------------------------------------------------------------+
| 5 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:90:00.0 |
| | Device UUID: GPU-f4180a63-1978-6c56-9903-ca5aac8af020 |
+--------+----------------------------------------------------------------------+
| 6 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:B7:00.0 |
| | Device UUID: GPU-8b354e3e-0145-6cfc-aec6-db2c28dae134 |
+--------+----------------------------------------------------------------------+
| 7 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:BD:00.0 |
| | Device UUID: GPU-a16e3b98-8be2-6a0c-7fac-9cb024dbc2df |
+--------+----------------------------------------------------------------------+
6 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
| 11 |
| 10 |
| 13 |
| 9 |
| 12 |
| 8 |
+-----------+
Basic Components
The DCGM SDK contains these components:
NVIDIA Host Engine
The NVIDIA host engine, nv-hostengine, is a thin wrapper around the DCGM shared library.
Its main job is to instantiate the DCGM library as a persistent standalone process,
including appropriate management of the monitoring and management activities.
Note
DCGM can run as root or non-root. Some DCGM functionality, such as configuration management, is not available when running as non-root.
On DGX-2 or HGX-2, nv-hostengine must run as root to enable the Fabric Manager.
DCGM CLI Tool
The command line interface to DCGM, dcgmi, is a network-capable interface into the
NVIDIA host engine. It exposes much of the DCGM functionality in a simple, interactive format.
It is intended for users and admins who want to control DCGM, or gather relevant data,
without needing to build against the programmatic interfaces. It is not intended for scripting.
Python Bindings
The Python bindings are included with the DCGM package and installed in /usr/share/datacenter-gpu-manager-4/bindings/python3/.
Software Development Kit
The DCGM SDK includes examples of how to leverage major DCGM features, alongside API documentation and headers. The SDK includes coverage for both C- and Python-based APIs, and includes examples for using DCGM in both standalone and embedded modes.
These are installed in /usr/src/datacenter-gpu-manager-4/sdk_samples.
Modes of Operation
The core DCGM library can be run as a standalone process or be loaded by an agent as a shared library. In both cases it provides roughly the same class of functionality and has the same overall behavior. The choice of mode depends on how it best fits within the user’s existing environment.
Note
In both modes the DCGM library should be run as root. Many features will not work without privileged access to the GPU, including various configuration settings and diagnostics.
Embedded Mode
In this mode the agent is loaded as a shared library. This mode is provided for the following situations:
A 3rd-party agent already exists on the node, and
Extra jitter associated with an additional autonomous agent needs to be managed
By loading DCGM as a shared library and managing the timing of its activity, 3rd-party agents can control exactly when DCGM is actively using CPU and GPU resources. In this mode the 3rd-party agent should generally load the shared library at system initialization and manage the DCGM infrastructure over the lifetime of the host. Since DCGM is stateful, it is important that the library is maintained over the life of the 3rd-party agent, not invoked in a one-off fashion. In this mode all data gathering loops, management activities, etc. can be explicitly invoked and controlled via library interfaces. A 3rd-party agent may choose, for example, to synchronize DCGM activities across an entire multi-node job in this way.
Warning
In this mode it is important that the various DCGM management interfaces be executed by the 3rd-party within the designated frequency ranges, as described in the API definitions. Running too frequently will waste resources with no noticeable gain. Running too infrequently will allow for gaps in monitoring and management coverage.
Working in this mode requires a sequence of setup steps and a management thread within the 3rd-party agent that periodically triggers all necessary DCGM background work. The logic is roughly as follows:
On Agent startup
dcgmInit()
System or job-level setup, e.g.
call dcgmGroupCreate() to set up GPU groups
call dcgmWatchFields() to manage watched metrics
call dcgmPolicySet() to set policy
Periodic Background Tasks (managed)
Trigger system management behavior, i.e.
call dcgmUpdateAllFields() to manage metrics
call dcgmPolicyTrigger() to manage policies
Gather system data, e.g.
call dcgmHealthCheck() to check health
call dcgmGetLatestValues() to get metric updates
On Agent shutdown
dcgmShutdown()
Note
For a more complete example, see the Embedded Mode example in the DCGM SDK.
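To make the flow above concrete, the following is a minimal C sketch of an embedded-mode agent. It uses only the calls named above plus dcgmStartEmbedded()/dcgmStopEmbedded(); the 1-second cadence, the fixed iteration count, and the error handling are illustrative assumptions, not the SDK sample itself.

#include <stdio.h>
#include <unistd.h>
#include <dcgm_agent.h>
#include <dcgm_structs.h>

int main(void)
{
    dcgmHandle_t handle;

    /* Agent startup: initialize the library and start an embedded host
     * engine in manual mode so the agent controls all update timing. */
    if (dcgmInit() != DCGM_ST_OK)
        return 1;
    if (dcgmStartEmbedded(DCGM_OPERATION_MODE_MANUAL, &handle) != DCGM_ST_OK)
    {
        dcgmShutdown();
        return 1;
    }

    /* System or job-level setup would go here, e.g. dcgmGroupCreate(),
     * dcgmWatchFields(), dcgmPolicySet(). */

    /* Periodic background tasks, driven by the agent's own management
     * loop (a 1-second cadence and 10 iterations are illustrative). */
    for (int i = 0; i < 10; i++)
    {
        dcgmUpdateAllFields(handle, 1 /* wait for the update cycle to finish */);
        /* dcgmPolicyTrigger(), dcgmHealthCheck(), and dcgmGetLatestValues()
         * would be invoked here as needed. */
        sleep(1);
    }

    /* Agent shutdown */
    dcgmStopEmbedded(handle);
    dcgmShutdown();
    return 0;
}

Such an agent would be built against the DCGM headers and library installed by the development package, for example with something like gcc agent.c -o agent -ldcgm (assuming the package provides the libdcgm.so link).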
Standalone Mode
In this mode the DCGM agent is embedded in a simple daemon provided by NVIDIA, the NVIDIA Host Engine. This mode is provided for the following situations:
DCGM clients prefer to interact with a daemon rather than manage a shared library resource themselves
Multiple clients wish to interact with DCGM, rather than a single node agent
Users wish to leverage the NVIDIA CLI tool, dcgmi
Users of DGX-2 or HGX-2 systems will need to run the Host Engine daemon to configure and monitor the NVSwitches
Generally, NVIDIA prefers this mode of operation, as it provides the most flexibility
and lowest maintenance cost to users. In this mode the DCGM library management routines
are invoked transparently at default frequencies and with default behaviors, in contrast
to the user control provided by the Embedded Mode. Users can either leverage the dcgmi tool
to interact with the daemon process or load the DCGM library with the daemon's IP address during
initialization for programmatic interaction.
The daemon leverages a socket-based interface to speak with external processes, e.g. dcgmi.
Users are responsible for configuring the system initialization behavior, post DCGM install, to
ensure the daemon is properly executed on startup.
Note
On DGX-2 or HGX-2 systems, nv-hostengine is automatically started at system boot time, so that the Fabric Manager can configure and monitor the NVSwitches.
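As a sketch of the programmatic path described above, a client can connect to a running nv-hostengine with dcgmConnect(). The loopback address shown (which assumes the daemon is listening locally on its default port) and the error handling are illustrative assumptions.

#include <stdio.h>
#include <dcgm_agent.h>
#include <dcgm_structs.h>

int main(void)
{
    dcgmHandle_t handle;

    /* Initialize the client library and connect to a standalone
     * nv-hostengine; 127.0.0.1 assumes the daemon runs on this host. */
    if (dcgmInit() != DCGM_ST_OK)
        return 1;
    if (dcgmConnect("127.0.0.1", &handle) != DCGM_ST_OK)
    {
        fprintf(stderr, "Could not connect to nv-hostengine\n");
        dcgmShutdown();
        return 1;
    }

    /* From here the same APIs used in embedded mode apply, e.g.
     * dcgmGroupCreate(), dcgmWatchFields(), dcgmGetLatestValues(). */

    dcgmDisconnect(handle);
    dcgmShutdown();
    return 0;
}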
Static Library
A statically-linked stub version of the DCGM library has been included for the purposes of being
able to remove an explicit dependency on the DCGM shared library. This library provides wrappers to
the DCGM symbols and uses dlopen() to dynamically access libdcgm.so.4. If the shared library is not
installed, or cannot be found in the LD_LIBRARY_PATH, an error code is returned. When linking
against this library, libdl must be included in the compile line, which is typically done using:
$ gcc foo.c -o foo -ldcgm_stub -ldl
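For reference, a minimal foo.c that the compile line above could build might look like the following sketch; the printed messages are illustrative, and only dcgmInit() and dcgmShutdown() from the DCGM API are used.

#include <stdio.h>
#include <dcgm_agent.h>
#include <dcgm_structs.h>

int main(void)
{
    /* dcgmInit() goes through the stub, which dlopen()s libdcgm.so.4;
     * if the shared library cannot be found, an error code is returned. */
    dcgmReturn_t result = dcgmInit();
    if (result != DCGM_ST_OK)
    {
        fprintf(stderr, "dcgmInit returned %d (is libdcgm.so.4 installed?)\n", result);
        return 1;
    }
    printf("DCGM stub loaded libdcgm.so.4 successfully\n");
    dcgmShutdown();
    return 0;
}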