Getting Started
Supported Platforms
DCGM currently supports the following products and environments:
All NVIDIA Kepler™ (K80) and newer NVIDIA datacenter (previously, NVIDIA Tesla®) GPUs
NVIDIA® NVSwitch™ on NVIDIA DGX™ A100, NVIDIA HGX™ A100 and newer.
All NVIDIA Maxwell™ and newer non-datacenter (e.g. NVIDIA® GeForce® or NVIDIA® Quadro®) GPUs
CUDA® 7.5+ and NVIDIA Driver R450+
Bare metal and virtualized (full passthrough only)
Note
NVIDIA Driver R450 and later is required on systems using NVSwitch™, such as NVIDIA DGX™ A100 or NVIDIA HGX™ A100. Starting with DCGM 2.0, Fabric Manager (FM) for NVSwitch™ systems is no longer bundled with DCGM packages. FM is a separate artifact that can be installed using the CUDA network repository. For more information, see the Fabric Manager User Guide.
Starting with v1.3, limited DCGM functionality is available on non-datacenter GPUs. More details are available in the section Feature Overview.
Supported Linux Distributions
Linux Distribution | x86 (x86_64) | Arm64 (aarch64)
---|---|---
Debian 12 | X |
RHEL 8.y/Rocky Linux 8.y | X | X
RHEL 9.y/Rocky Linux 9.y | X | X
SLES/OpenSUSE 15.y | X | X
Ubuntu 24.04 LTS | X | X
Ubuntu 22.04 LTS | X | X
Ubuntu 20.04 LTS | X | X
Installation
Pre-Requisites
The system package manager has been configured to use an NVIDIA package registry for the system’s Linux distribution. For those using a CUDA local package registry on disk, it is recommended to update to the latest version available.
Please refer to the CUDA installation guide for detailed steps.
Installations of the following NVIDIA software must be present on the system:
A supported NVIDIA Datacenter Driver
Warning
DCGM is tested and designed to run with NVIDIA Datacenter Drivers. Attempting to run on other drivers, such as a developer driver, could result in missing functionality.
Please refer to the documentation on the various types of branches and support timelines.
On systems with NVSwitch™ hardware, such as NVIDIA DGX™ systems and NVIDIA HGX™ systems:
the Fabric Manager package
the NVSwitch™ Configuration & Query (NSCQ) package
Note
For each NVIDIA driver major version, there is a corresponding Fabric Manager package and a corresponding NSCQ package, e.g. for driver major version 550, the corresponding NSCQ package is libnvidia-nscq-550. See the output of nvidia-smi for the version of the NVIDIA driver installed on the system. For more information regarding the NSCQ package, please refer to the HGX Software Guide. For more information regarding the Fabric Manager package, please refer to the Fabric Manager User Guide.
A user needs sufficient permissions to install packages through the system package manager. This could be either a root user or a user with appropriate sudo privileges.
Any existing Data Center GPU Manager (DCGM) system services have been stopped, e.g.
$ sudo systemctl list-unit-files nvidia-dcgm.service > /dev/null && \
    sudo systemctl stop nvidia-dcgm
For best results, the system should have at least:
16GB of system memory (host RAM)
as many CPU cores as GPUs.
Tip
The number of CPU cores can be determined by running nproc.
Installation
Ubuntu LTS and Debian
Remove any installations of the datacenter-gpu-manager and datacenter-gpu-manager-config packages
$ sudo dpkg --list datacenter-gpu-manager &> /dev/null && \
    sudo apt purge --yes datacenter-gpu-manager
$ sudo dpkg --list datacenter-gpu-manager-config &> /dev/null && \
    sudo apt purge --yes datacenter-gpu-manager-config
Update the package registry cache
$ sudo apt-get update
Install the datacenter-gpu-manager-4 package corresponding to the system CUDA version
$ CUDA_VERSION=$(nvidia-smi | sed -E -n 's/.*CUDA Version: ([0-9]+)[.].*/\1/p')
$ sudo apt-get install --yes \
    --install-recommends \
    datacenter-gpu-manager-4-cuda${CUDA_VERSION}
Installing the recommended packages provides additional DCGM functionality which is not present in the DCGM open-source product. To opt out of these packages and the associated functionality, replace --install-recommends with --no-install-recommends.
(Optional) Install the datacenter-gpu-manager-4 development files
$ sudo apt install --yes datacenter-gpu-manager-4-dev
RHEL / CentOS / Rocky Linux
Remove any installations of the datacenter-gpu-manager and datacenter-gpu-manager-config packages.
$ sudo dnf list --installed datacenter-gpu-manager &> /dev/null && \
    sudo dnf remove --assumeyes datacenter-gpu-manager
$ sudo dnf list --installed datacenter-gpu-manager-config &> /dev/null && \
    sudo dnf remove --assumeyes datacenter-gpu-manager-config
Update the package registry cache.
$ sudo dnf clean expire-cache
Install the datacenter-gpu-manager-4 package corresponding to the system CUDA version, its dependencies, and respective associated recommended packages.
$ CUDA_VERSION=$(nvidia-smi | sed -E -n 's/.*CUDA Version: ([0-9]+)[.].*/\1/p')
$ sudo dnf install --assumeyes \
    --setopt=install_weak_deps=True \
    datacenter-gpu-manager-4-cuda${CUDA_VERSION}
Installing the recommended packages provides additional DCGM functionality which is not present in the DCGM open-source product. To opt out of these packages and the associated functionality, replace --setopt=install_weak_deps=True with --setopt=install_weak_deps=False.
(Optional) Install the datacenter-gpu-manager-4 development files
$ sudo dnf install --assumeyes datacenter-gpu-manager-4-devel
SUSE SLES / OpenSUSE
Remove any installations of the datacenter-gpu-manager and datacenter-gpu-manager-config packages
$ sudo zypper search --installed-only --match-exact datacenter-gpu-manager &> /dev/null && \
    sudo zypper --non-interactive remove datacenter-gpu-manager
$ sudo zypper search --installed-only --match-exact datacenter-gpu-manager-config &> /dev/null && \
    sudo zypper --non-interactive remove datacenter-gpu-manager-config
Update the package registry cache
$ sudo zypper refresh
Install the datacenter-gpu-manager-4 package corresponding to the system CUDA version, its dependencies, and respective associated recommended packages.
$ CUDA_VERSION=$(nvidia-smi | sed -E -n 's/.*CUDA Version: ([0-9]+)[.].*/\1/p')
$ sudo zypper install --no-confirm \
    --recommends \
    datacenter-gpu-manager-4-cuda${CUDA_VERSION}
Installing the recommended packages provides additional DCGM functionality which is not present in the DCGM open-source product. To opt out of these packages and the associated functionality, replace --recommends with --no-recommends.
(Optional) Install the datacenter-gpu-manager-4 development files
$ sudo zypper install --no-confirm datacenter-gpu-manager-4-devel
Post-Install
Note
Note that the default nvidia-dcgm.service files included in the installation package use the systemd format. If DCGM is being installed on OS distributions that use the init.d format, then these files will need to be modified.
Enable the DCGM systemd service (so it starts on reboot), start it now, and check its status
$ sudo systemctl --now enable nvidia-dcgm
$ sudo systemctl status nvidia-dcgm
● dcgm.service - DCGM service
Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; disabled; vendor preset: enabled)
Active: active (running) since Mon 2024-12-17 12:18:57 EDT; 14s ago
Main PID: 32847 (nv-hostengine)
Tasks: 7 (limit: 39321)
CGroup: /system.slice/nvidia-dcgm.service
└─32847 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
Oct 12 12:18:57 ubuntu1804 systemd[1]: Started DCGM service.
Oct 12 12:18:58 ubuntu1804 nv-hostengine[32847]: DCGM initialized
Oct 12 12:18:58 ubuntu1804 nv-hostengine[32847]: Host Engine Listener Started
To verify installation, use dcgmi to query the system. You should see a listing of all supported GPUs (and any NVSwitches) found in the system:
$ dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:07:00.0 |
| | Device UUID: GPU-1d82f4df-3cf9-150d-088b-52f18f8654e1 |
+--------+----------------------------------------------------------------------+
| 1 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:0F:00.0 |
| | Device UUID: GPU-94168100-c5d5-1c05-9005-26953dd598e7 |
+--------+----------------------------------------------------------------------+
| 2 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:47:00.0 |
| | Device UUID: GPU-9387e4b3-3640-0064-6b80-5ace1ee535f6 |
+--------+----------------------------------------------------------------------+
| 3 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:4E:00.0 |
| | Device UUID: GPU-cefd0e59-c486-c12f-418c-84ccd7a12bb2 |
+--------+----------------------------------------------------------------------+
| 4 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:87:00.0 |
| | Device UUID: GPU-1501b26d-f3e4-8501-421d-5a444b17eda8 |
+--------+----------------------------------------------------------------------+
| 5 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:90:00.0 |
| | Device UUID: GPU-f4180a63-1978-6c56-9903-ca5aac8af020 |
+--------+----------------------------------------------------------------------+
| 6 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:B7:00.0 |
| | Device UUID: GPU-8b354e3e-0145-6cfc-aec6-db2c28dae134 |
+--------+----------------------------------------------------------------------+
| 7 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:BD:00.0 |
| | Device UUID: GPU-a16e3b98-8be2-6a0c-7fac-9cb024dbc2df |
+--------+----------------------------------------------------------------------+
6 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
| 11 |
| 10 |
| 13 |
| 9 |
| 12 |
| 8 |
+-----------+
Basic Components
The DCGM SDK contains these components:
NVIDIA Host Engine
The NVIDIA host engine, nv-hostengine, is a thin wrapper around the DCGM shared library.
Its main job is to instantiate the DCGM library as a persistent standalone process,
including appropriate management of the monitoring and management activities.
Note
DCGM can run as root or non-root. Some DCGM functionality, such as configuration management, is not available when running as non-root.
On DGX-2 or HGX-2, nv-hostengine must run as root to enable the Fabric Manager.
DCGM CLI Tool
The command line interface to DCGM, dcgmi, is a network-capable interface into the
NVIDIA host engine. It exposes much of the DCGM functionality in a simple, interactive format.
It is intended for users and admins who want to control DCGM, or gather relevant data,
without needing to build against the programmatic interfaces. It is not intended for scripting.
Python Bindings
The Python bindings are included with the DCGM package and installed in /usr/share/datacenter-gpu-manager-4/bindings/python3/.
Software Development Kit
The DCGM SDK includes examples of how to leverage major DCGM features, alongside API documentation and headers. The SDK includes coverage for both C- and Python-based APIs, and includes examples for using DCGM in both standalone and embedded modes.
These are installed in /usr/src/datacenter-gpu-manager-4/sdk_samples.
Modes of Operation
The core DCGM library can be run as a standalone process or be loaded by an agent as a shared library. In both cases it provides roughly the same class of functionality and has the same overall behavior. The choice of mode depends on how it best fits within the user’s existing environment.
Note
In both modes the DCGM library should be run as root. Many features will not work without privileged access to the GPU, including various configuration settings and diagnostics.
Embedded Mode
In this mode the agent is loaded as a shared library. This mode is provided for the following situations:
A 3rd-party agent already exists on the node, and
Extra jitter associated with an additional autonomous agent needs to be managed
By loading DCGM as a shared library and managing the timing of its activity, 3rd-party agents can control exactly when DCGM is actively using CPU and GPU resources. In this mode the 3rd-party agent should generally load the shared library at system initialization and manage the DCGM infrastructure over the lifetime of the host. Since DCGM is stateful, it is important that the library is maintained over the life of the 3rd-party agent, not invoked in a one-off fashion. In this mode all data gathering loops, management activities, etc. can be explicitly invoked and controlled via library interfaces. A 3rd-party agent may choose, for example, to synchronize DCGM activities across an entire multi-node job in this way.
Warning
In this mode it is important that the various DCGM management interfaces be executed by the 3rd-party within the designated frequency ranges, as described in the API definitions. Running too frequently will waste resources with no noticeable gain. Running too infrequently will allow for gaps in monitoring and management coverage.
Working in this mode requires a sequence of setup steps and a management thread within the 3rd-party agent that periodically triggers all necessary DCGM background work. The logic is roughly as follows:
On Agent startup
dcgmInit()
System or job-level setup, e.g.
call dcgmGroupCreate() to set up GPU groups
call dcgmWatchFields() to manage watched metrics
call dcgmPolicySet() to set policy
Periodic Background Tasks (managed)
Trigger system management behavior, i.e.
call dcgmUpdateAllFields() to manage metrics
call dcgmPolicyTrigger() to manage policies
Gather system data, e.g.
call dcgmHealthCheck() to check health
call dcgmGetLatestValues() to get metric updates
On Agent shutdown
dcgmShutdown()
Note
For a more complete example, see the Embedded Mode example in the DCGM SDK.
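To make the flow above concrete, the following is a minimal C sketch of an embedded-mode agent. It uses only the calls named above plus dcgmStartEmbedded()/dcgmStopEmbedded(); the 1-second cadence, the fixed iteration count, and the error handling are illustrative assumptions, not the SDK sample itself.

#include <stdio.h>
#include <unistd.h>
#include <dcgm_agent.h>
#include <dcgm_structs.h>

int main(void)
{
    dcgmHandle_t handle;

    /* Agent startup: initialize the library and start an embedded host
     * engine in manual mode so the agent controls all update timing. */
    if (dcgmInit() != DCGM_ST_OK)
        return 1;
    if (dcgmStartEmbedded(DCGM_OPERATION_MODE_MANUAL, &handle) != DCGM_ST_OK)
    {
        dcgmShutdown();
        return 1;
    }

    /* System or job-level setup would go here, e.g. dcgmGroupCreate(),
     * dcgmWatchFields(), dcgmPolicySet(). */

    /* Periodic background tasks, driven by the agent's own management
     * loop (a 1-second cadence and 10 iterations are illustrative). */
    for (int i = 0; i < 10; i++)
    {
        dcgmUpdateAllFields(handle, 1 /* wait for the update cycle to finish */);
        /* dcgmPolicyTrigger(), dcgmHealthCheck(), and dcgmGetLatestValues()
         * would be invoked here as needed. */
        sleep(1);
    }

    /* Agent shutdown */
    dcgmStopEmbedded(handle);
    dcgmShutdown();
    return 0;
}

Such an agent would be built against the DCGM headers and library installed by the development package, for example with something like gcc agent.c -o agent -ldcgm (assuming the package provides the libdcgm.so link).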
Standalone Mode
In this mode the DCGM agent is embedded in a simple daemon provided by NVIDIA, the NVIDIA Host Engine. This mode is provided for the following situations:
DCGM clients prefer to interact with a daemon rather than manage a shared library resource themselves
Multiple clients wish to interact with DCGM, rather than a single node agent
Users wish to leverage the NVIDIA CLI tool, dcgmi
Users of DGX-2 or HGX-2 systems will need to run the Host Engine daemon to configure and monitor the NVSwitches
Generally, NVIDIA prefers this mode of operation, as it provides the most flexibility
and lowest maintenance cost to users. In this mode the DCGM library management routines
are invoked transparently at default frequencies and with default behaviors, in contrast
to the user control provided by the Embedded Mode. Users can either leverage the dcgmi tool
to interact with the daemon process or load the DCGM library with the daemon's IP address during
initialization for programmatic interaction.
The daemon leverages a socket-based interface to speak with external processes, e.g. dcgmi.
Users are responsible for configuring the system initialization behavior, post DCGM install, to
ensure the daemon is properly executed on startup.
Note
On DGX-2 or HGX-2 systems, nv-hostengine is automatically started at system boot time, so that the Fabric Manager can configure and monitor the NVSwitches.
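As a sketch of the programmatic path described above, a client can connect to a running nv-hostengine with dcgmConnect(). The loopback address shown (which assumes the daemon is listening locally on its default port) and the error handling are illustrative assumptions.

#include <stdio.h>
#include <dcgm_agent.h>
#include <dcgm_structs.h>

int main(void)
{
    dcgmHandle_t handle;

    /* Initialize the client library and connect to a standalone
     * nv-hostengine; 127.0.0.1 assumes the daemon runs on this host. */
    if (dcgmInit() != DCGM_ST_OK)
        return 1;
    if (dcgmConnect("127.0.0.1", &handle) != DCGM_ST_OK)
    {
        fprintf(stderr, "Could not connect to nv-hostengine\n");
        dcgmShutdown();
        return 1;
    }

    /* From here the same APIs used in embedded mode apply, e.g.
     * dcgmGroupCreate(), dcgmWatchFields(), dcgmGetLatestValues(). */

    dcgmDisconnect(handle);
    dcgmShutdown();
    return 0;
}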
Static Library
A statically-linked stub version of the DCGM library has been included for the purposes of being
able to remove an explicit dependency on the DCGM shared library. This library provides wrappers to
the DCGM symbols and uses dlopen() to dynamically access libdcgm.so.4. If the shared library is not
installed, or cannot be found in the LD_LIBRARY_PATH, an error code is returned. When linking
against this library, libdl must be included in the compile line, which is typically done using:
$ gcc foo.c -o foo -ldcgm_stub -ldl
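For reference, a minimal foo.c that the compile line above could build might look like the following sketch; the printed messages are illustrative, and only dcgmInit() and dcgmShutdown() from the DCGM API are used.

#include <stdio.h>
#include <dcgm_agent.h>
#include <dcgm_structs.h>

int main(void)
{
    /* dcgmInit() goes through the stub, which dlopen()s libdcgm.so.4;
     * if the shared library cannot be found, an error code is returned. */
    dcgmReturn_t result = dcgmInit();
    if (result != DCGM_ST_OK)
    {
        fprintf(stderr, "dcgmInit returned %d (is libdcgm.so.4 installed?)\n", result);
        return 1;
    }
    printf("DCGM stub loaded libdcgm.so.4 successfully\n");
    dcgmShutdown();
    return 0;
}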