Getting Started

Supported Platforms

DCGM currently supports the following products and environments:

  • All K80 and newer Tesla GPUs
  • NVSwitch on DGX A100, HGX A100. Note that for DGX-2 and HGX-2 systems, while a minimum version of DCGM 1.7 is required, DCGM 2.0 is recommended.
  • All Maxwell and newer non-Tesla GPUs
    Note: Starting with v1.3, limited DCGM functionality is available on non-Tesla GPUs. More details are available in the section Feature Overview.
  • CUDA 7.5+ and NVIDIA Driver R418+

    NVIDIA Driver R450 or later is required on systems using NVSwitch, such as DGX A100 or HGX A100. Starting with DCGM 2.0, Fabric Manager (FM) for NVSwitch systems is no longer bundled with DCGM packages. FM is a separate artifact that can be installed using the CUDA network repository. For more information, see the Fabric Manager User Guide.

  • Bare metal and virtualized (full passthrough only)
  • Supported Linux distributions and architectures are shown in the table below:
    Table 1. Linux Distributions and Architectures
    Linux Distribution  x86 (x86_64)    Arm (aarch64)    POWER (ppc64le)
    RHEL 8.y            X               X                X
    RHEL/CentOS 7.y     X               -                -
    SLES 15             X               X                -
    Ubuntu 20.04 LTS    X               X                -
    Ubuntu 18.04 LTS    X               X                X
    Ubuntu 16.04 LTS    X               -                -

Installation

To run DCGM, the target system must include the following NVIDIA components, listed in dependency order:
  1. Supported NVIDIA Datacenter Driver
  2. Supported CUDA Toolkit
  3. DCGM Runtime and SDK
  4. DCGM Python bindings (if desired)
All of the core components are available as RPMs/DEBs from NVIDIA’s website. The Python bindings are available in the /usr/src/dcgm/bindings directory after installation. The user must be root or have sudo privileges to install the packages, as with any such packaging operation.
Note: DCGM is tested and designed to run with NVIDIA Datacenter Drivers. Attempting to run on other drivers, such as a developer driver, could result in missing functionality.

To remove the previous installation (if any), perform the following steps (e.g. on an RPM-based system).

  1. Make sure that nv-hostengine is not running. You can stop it using the following command:
    # sudo nv-hostengine -t
  2. Remove the previous installation.
    # sudo yum remove datacenter-gpu-manager
It is recommended to install DCGM via Linux package managers from the CUDA network repository.
  • Install DCGM on Ubuntu
    1. Install repository metadata. Note: substitute <architecture> with x86_64, sbsa, or ppc64le as appropriate.
      $ distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
      $ echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/<architecture> /" | sudo tee /etc/apt/sources.list.d/cuda.list
    2. Trust the CUDA public GPG key

      When installing on Ubuntu 20.04/18.04:

      $ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/<architecture>/7fa2af80.pub

      When installing on Ubuntu 16.04:

      $ sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/$distribution/<architecture>/7fa2af80.pub

      Pin file to prioritize CUDA repository:

      $ wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/<architecture>/cuda-$distribution.pin
      $ sudo mv cuda-$distribution.pin /etc/apt/preferences.d/cuda-repository-pin-600
    3. Update the Apt repository cache
      $ sudo apt-get update
    4. Install DCGM
      $ sudo apt-get install -y datacenter-gpu-manager
    5. Enable DCGM systemd service (on reboot) and start it now
      $ sudo systemctl --now enable nvidia-dcgm
  • Install DCGM on RHEL
    1. Install repository metadata and the GPG key. Note: substitute <architecture> with x86_64, sbsa, or ppc64le as appropriate.
      $ distribution=$(. /etc/os-release;echo $ID`rpm -E "%{?rhel}%{?fedora}"`)
      $ sudo dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/$distribution/<architecture>/cuda-rhel8.repo
    2. Update metadata
      $ sudo dnf clean expire-cache
    3. Install DCGM
      $ sudo dnf install -y datacenter-gpu-manager
    4. Enable DCGM systemd service (on reboot) and start it now
      $ sudo systemctl --now enable nvidia-dcgm

Note that the default nvidia-dcgm.service files included in the installation package use the systemd format. If DCGM is being installed on OS distributions that use the init.d format, then these files may need to be modified.

To verify installation, start the standalone host engine and use dcgmi to query the system. You should see a listing of all supported GPUs:

# nv-hostengine 
Starting host engine using port number : 5555 

# dcgmi discovery -l 
2 GPUs found. 
+--------+-------------------------------------------------------------------+ 
| GPU ID | Device Information                                                | 
+========+===================================================================+ 
| 0      | Name: Tesla K80                                                   |
|        | PCI Bus ID: 0000:07:00.0                                          | 
|        | Device UUID: GPU-000000000000000000000000000000000000             | 
+--------+-------------------------------------------------------------------+ 
| 1      | Name: Tesla K80                                                   | 
|        | PCI Bus ID: 0000:08:00.0                                          | 
|        | Device UUID: GPU-111111111111111111111111111111111115             |
+--------+-------------------------------------------------------------------+ 

# nv-hostengine -t
Host engine successfully terminated. 

Basic Components

DCGM shared library
The user space shared library, libdcgm.so, is the core component of DCGM. This library implements the major underlying functionality and exposes this as a set of C-based APIs. It sits on top of the NVIDIA driver, NVML, and the CUDA Toolkit.
NVIDIA Host Engine
The NVIDIA host engine, nv-hostengine, is a thin wrapper around the DCGM shared library. Its main job is to instantiate the DCGM library as a persistent standalone process, including appropriate management of the monitoring and management activities.
Note:
  • DCGM can run as root or non-root. Some DCGM functionality, such as configuration management, is not available when running as non-root.
  • On DGX-2 or HGX-2, nv-hostengine must run as root to enable the Fabric Manager.
DCGM CLI Tool
The command line interface to DCGM, dcgmi, is a network-capable interface into the NVIDIA host engine. It exposes much of the DCGM functionality in a simple, interactive format. It is intended for users and admins who want to control DCGM, or gather relevant data, without needing to build against the programmatic interfaces. It is not intended for scripting.
Python Bindings
The Python bindings are included with the DCGM package and installed in /usr/src/dcgm/bindings.
Software Development Kit
The DCGM SDK includes examples of how to leverage major DCGM features, alongside API documentation and headers. The SDK covers both the C-based and Python-based APIs, and includes examples of using DCGM in both standalone and embedded modes.
These are installed in /usr/src/dcgm/sdk_samples.

Modes of Operation

The core DCGM library can be run as a standalone process or be loaded by an agent as a shared library. In both cases it provides roughly the same class of functionality and has the same overall behavior. The choice of mode depends on how it best fits within the user’s existing environment.

Note: In both modes the DCGM library should be run as root. Many features will not work without privileged access to the GPU, including various configuration settings and diagnostics.

Embedded Mode

In this mode the agent is loaded as a shared library. This mode is provided for the following situations:

  • A 3rd-party agent already exists on the node, and
  • Extra jitter associated with an additional autonomous agent needs to be managed
By loading DCGM as a shared library and managing the timing of its activity, 3rd-party agents can control exactly when DCGM is actively using CPU and GPU resources.

In this mode the 3rd-party agent should generally load the shared library at system initialization and manage the DCGM infrastructure over the lifetime of the host. Since DCGM is stateful, it is important that the library is maintained over the life of the 3rd-party agent, not invoked in a one-off fashion. In this mode all data gathering loops, management activities, etc. can be explicitly invoked and controlled via library interfaces. A 3rd-party agent may choose, for example, to synchronize DCGM activities across an entire multi-node job in this way.

CAUTION:
In this mode it is important that the various DCGM management interfaces be executed by the 3rd-party within the designated frequency ranges, as described in the API definitions. Running too frequently will waste resources with no noticeable gain. Running too infrequently will allow for gaps in monitoring and management coverage.
Working in this mode requires a sequence of setup steps and a management thread within the 3rd-party agent that periodically triggers all necessary DCGM background work. The logic is roughly as follows:
  • On Agent Startup
    dcgmInit() 
    
    System or job-level setup, e.g. 
    call dcgmGroupCreate() to set up GPU groups 
    call dcgmWatchFields() to manage watched metrics 
    call dcgmPolicySet() to set policy 
  • Periodic Background Tasks (managed)
    Trigger system management behavior, i.e.
     call dcgmUpdateAllFields() to manage metrics
     call dcgmPolicyTrigger() to manage policies 
    
    Gather system data, e.g.
     call dcgmHealthCheck() to check health
     call dcgmGetLatestValues() to get metric updates 
  • On Agent Shutdown
    dcgmShutdown()
Note: For a more complete example, see the Embedded Mode example in the DCGM SDK.
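
The following is a minimal, illustrative sketch of the flow above in C. It assumes the DCGM development headers (dcgm_agent.h, dcgm_structs.h) and libdcgm.so are installed; error handling is reduced to early returns, and the group/watch/policy setup shown in the outline is omitted for brevity.

/*
 * embedded.c - minimal embedded-mode sketch (illustrative only).
 * Build (library and header paths may differ on your system):
 *   gcc embedded.c -o embedded -ldcgm
 */
#include <unistd.h>

#include <dcgm_agent.h>
#include <dcgm_structs.h>

int main(void)
{
    dcgmHandle_t handle;
    int i;

    /* On agent startup: initialize the library and start an embedded
     * instance in MANUAL mode so this process decides when DCGM does
     * its background work. */
    if (dcgmInit() != DCGM_ST_OK)
        return 1;
    if (dcgmStartEmbedded(DCGM_OPERATION_MODE_MANUAL, &handle) != DCGM_ST_OK)
        return 1;

    /* Periodic background task: trigger the field-update loop at a
     * frequency chosen by the agent (once per second here). */
    for (i = 0; i < 10; i++)
    {
        dcgmUpdateAllFields(handle, 1 /* wait for the update to complete */);
        /* ... gather data here, e.g. via dcgmGetLatestValues() ... */
        sleep(1);
    }

    /* On agent shutdown: stop the embedded instance and clean up. */
    dcgmStopEmbedded(handle);
    dcgmShutdown();
    return 0;
}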

Standalone Mode

In this mode the DCGM agent is embedded in a simple daemon provided by NVIDIA, the NVIDIA Host Engine. This mode is provided for the following situations:
  • DCGM clients prefer to interact with a daemon rather than manage a shared library resource themselves
  • Multiple clients wish to interact with DCGM, rather than a single node agent
  • Users wish to leverage the NVIDIA CLI tool, DCGMI
  • Users of DGX-2 or HGX-2 systems will need to run the Host Engine daemon to configure and monitor the NVSwitches
Generally, NVIDIA prefers this mode of operation, as it provides the most flexibility and the lowest maintenance cost to users. In this mode the DCGM library management routines are invoked transparently at default frequencies and with default behaviors, in contrast to the explicit user control provided by Embedded Mode. Users can either use the DCGMI tool to interact with the daemon process or, for programmatic interaction, load the DCGM library and connect to the daemon's IP address during initialization.
The daemon uses a socket-based interface to communicate with external processes, e.g. DCGMI. Users are responsible for configuring the system initialization behavior after DCGM is installed, to ensure the daemon is properly started at boot.
Note:
  • Helper installation scripts for daemon setup will be included in the next Release Candidate package.
  • On DGX-2 or HGX-2 systems, nv-hostengine is automatically started at system boot time, so that the Fabric Manager can configure and monitor the NVSwitches.
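
As noted above, clients can also interact with a running nv-hostengine programmatically rather than through DCGMI. The following is a minimal, illustrative sketch of such a client; it assumes the daemon is listening on the default port 5555 on the local host and that the DCGM development headers are installed. Error handling is reduced for brevity.

/*
 * client.c - minimal standalone-mode client sketch (illustrative only).
 * Assumes nv-hostengine is already running on 127.0.0.1:5555.
 * Build (library and header paths may differ on your system):
 *   gcc client.c -o client -ldcgm
 */
#include <stdio.h>

#include <dcgm_agent.h>
#include <dcgm_structs.h>

int main(void)
{
    dcgmHandle_t handle;
    unsigned int gpuIds[DCGM_MAX_NUM_DEVICES];
    int count = 0;

    if (dcgmInit() != DCGM_ST_OK)
        return 1;

    /* Connect to the daemon over its socket interface. */
    if (dcgmConnect("127.0.0.1:5555", &handle) != DCGM_ST_OK)
    {
        dcgmShutdown();
        return 1;
    }

    /* Simple query, analogous to "dcgmi discovery -l". */
    if (dcgmGetAllDevices(handle, gpuIds, &count) == DCGM_ST_OK)
        printf("%d GPUs found.\n", count);

    dcgmDisconnect(handle);
    dcgmShutdown();
    return 0;
}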

Static Library

A statically-linked stub version of the DCGM library is included so that an explicit dependency on the DCGM shared library can be avoided. This library provides wrappers to the DCGM symbols and uses dlopen() to dynamically access libdcgm.so. If the shared library is not installed, or cannot be found in LD_LIBRARY_PATH, an error code is returned. When linking against this library, libdl must also be included on the compile line, which is typically done as follows:

# gcc foo.c -o foo -ldcgm_stub -ldl
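
For reference, a minimal foo.c matching the compile line above might look like the following sketch (illustrative only). It simply initializes DCGM through the stub, which dlopen()s libdcgm.so at run time, and reports the result; if the shared library cannot be loaded, dcgmInit() returns an error code instead of the program failing to start.

/* foo.c - minimal sketch of a program linked against the stub library. */
#include <stdio.h>

#include <dcgm_agent.h>
#include <dcgm_structs.h>

int main(void)
{
    dcgmReturn_t ret = dcgmInit();
    if (ret != DCGM_ST_OK)
    {
        /* Covers both real initialization errors and the case where
         * the stub could not locate libdcgm.so. */
        printf("dcgmInit failed with error code %d\n", (int)ret);
        return 1;
    }
    printf("DCGM initialized through the stub library.\n");
    dcgmShutdown();
    return 0;
}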