Overview

The NVIDIA® Data Center GPU Manager (DCGM) simplifies administration of NVIDIA Datacenter (previously “Tesla”) GPUs in cluster and datacenter environments. At its heart, DCGM is an intelligent, lightweight user space library/agent that performs a variety of functions on each host system:

  • GPU behavior monitoring

  • GPU configuration management

  • GPU policy oversight

  • GPU health and diagnostics

  • GPU accounting and process statistics

  • NVSwitch configuration and monitoring

This functionality is accessible programmatically through public APIs and interactively through CLI tools. DCGM is designed to run either as a standalone entity or as an embedded library within management tools. This document gives an overview of DCGM’s main goals and features and is intended for system administrators, ISV developers, and individual users managing groups of NVIDIA GPUs.

Terminology

Terms used in this document

Term                    Meaning

DCGM                    NVIDIA Data Center GPU Manager

NVIDIA Host Engine      Standalone executable wrapper for the DCGM shared library

Host Engine daemon      Daemon mode of operation for the NVIDIA Host Engine

Fabric Manager          A module within the Host Engine daemon that supports the NVSwitch fabric on DGX-2 or HGX-2

3rd-party DCGM Agent    Any node-level process from a 3rd party that runs DCGM in Embedded Mode

Embedded Mode           DCGM executing as a shared library within a 3rd-party DCGM agent

Standalone Mode         DCGM executing as a standalone process via the Host Engine

System Validation       Health checks encompassing the GPU, board and surrounding environment

HW diagnostic           System validation component focusing on GPU hardware correctness

RAS event               Reliability, Availability, Serviceability event; covers both fatal and non-fatal GPU issues

NVML                    NVIDIA Management Library

Focus Areas

DCGM’s design is geared towards the following key functional areas.

Provide robust, online health and diagnostics

The ability to ascertain the health of a GPU and its interaction with the surrounding system is a critical management need. This need comes in various forms, from passive background monitoring to quick system validation to extensive hardware diagnostics. In all cases it is important to provide these features with minimal impact on the system and minimal additional environmental requirements. DCGM provides extensive automated and non-automated health and diagnostic capabilities.

Enable job-level statistics and continuous GPU telemetry

Understanding GPU usage is important for schedulers and resource managers. Tying this information together with RAS events, performance information and other telemetry, especially at the boundaries of a workload, is very useful in explaining job behavior and root-causing potential performance or execution issues. DCGM provides mechanisms to gather, group and analyze data at the job level. DCGM also provides continuous GPU telemetry with very low performance overhead.
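The idea of grouping telemetry at the boundaries of a workload can be sketched in a few lines of plain Python. This is only an illustration of the concept, not DCGM's API: the `Sample` record, its fields, the `summarize_job` helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One hypothetical telemetry reading for a single GPU."""
    timestamp: float     # seconds since an arbitrary epoch
    gpu_id: int
    utilization: float   # percent, 0-100
    power_watts: float
    ecc_errors: int      # cumulative error counter at this instant


def summarize_job(samples, start, end):
    """Aggregate, per GPU, the samples with start <= timestamp <= end."""
    per_gpu = {}
    for s in samples:
        if start <= s.timestamp <= end:
            per_gpu.setdefault(s.gpu_id, []).append(s)

    summary = {}
    for gpu_id, rows in per_gpu.items():
        rows.sort(key=lambda r: r.timestamp)
        summary[gpu_id] = {
            "avg_utilization": mean(r.utilization for r in rows),
            "max_power_watts": max(r.power_watts for r in rows),
            # Errors attributed to the job: counter growth inside the window.
            "ecc_errors_during_job": rows[-1].ecc_errors - rows[0].ecc_errors,
        }
    return summary


# Made-up readings: the job runs from t=10 to t=20.
samples = [
    Sample(0.0, 0, 5.0, 60.0, 0),     # before the job
    Sample(10.0, 0, 90.0, 250.0, 0),
    Sample(20.0, 0, 100.0, 300.0, 2),
    Sample(30.0, 0, 4.0, 65.0, 2),    # after the job
]
job_summary = summarize_job(samples, start=10.0, end=20.0)
```

Restricting the aggregation to the job window is what lets an error counter that grew during the run be attributed to that job rather than to whatever ran before or after it.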

Configure NVSwitches

On DGX-2 or HGX-2, all GPUs communicate by way of NVSwitch. The Fabric Manager component of DCGM configures the switches to form a single memory fabric among all participating GPUs, and monitors the NVLinks that support the fabric.

Note

As of v2.x, DCGM no longer includes the Fabric Manager; it is a separate component that must be installed on NVSwitch-based systems.

Define and enforce GPU configuration state

The behavior of NVIDIA GPUs can be controlled by users to match requirements of particular environments or applications. This includes performance characteristics such as clock settings, exclusivity constraints like compute mode, and environmental controls like power limits. DCGM provides enforcement and persistence mechanisms to ensure behavioral consistency across related GPUs.
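A minimal sketch of what enforcing a configuration across related GPUs means in practice: compare each GPU's current settings against a target and correct any drift. The `GpuConfig` fields, the setting values, and the `find_drift` and `enforce` helpers are hypothetical and chosen for illustration; they are not DCGM's data model.

```python
from dataclasses import dataclass, asdict


@dataclass
class GpuConfig:
    """Hypothetical subset of per-GPU settings."""
    compute_mode: str        # e.g. "DEFAULT" or "EXCLUSIVE_PROCESS"
    power_limit_watts: int
    app_clock_mhz: int


def find_drift(target, actual):
    """Return {setting: (actual_value, target_value)} for each mismatch."""
    want, have = asdict(target), asdict(actual)
    return {k: (have[k], want[k]) for k in want if have[k] != want[k]}


def enforce(target, group):
    """Reset every drifted GPU in the group to the target.

    Returns a dict of the corrections that were made, keyed by GPU id.
    """
    corrections = {}
    for gpu_id, actual in group.items():
        drift = find_drift(target, actual)
        if drift:
            corrections[gpu_id] = drift
            group[gpu_id] = GpuConfig(**asdict(target))
    return corrections


target = GpuConfig("EXCLUSIVE_PROCESS", 250, 1200)
group = {
    0: GpuConfig("EXCLUSIVE_PROCESS", 250, 1200),
    1: GpuConfig("DEFAULT", 250, 1200),   # drifted, e.g. after a reset
}
corrections = enforce(target, group)
```

In this sketch only GPU 1 is corrected, and a second call to `enforce` finds nothing left to change.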

Automate GPU management policies

NVIDIA GPUs have advanced capabilities that facilitate error containment and identify problem areas. Automated policies that define GPU response to certain classes of events, including recovery from errors and isolation of bad hardware, ensure higher reliability and a simplified administration environment. DCGM provides policies for common situations that require notification or automated action.
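Conceptually, a policy pairs a condition on GPU events with a response. The sketch below models that pairing in plain Python; the `Policy` class, the event keys, the 90 °C threshold, and the action labels are assumptions made for this example and do not reflect DCGM's actual policy definitions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    """Hypothetical policy: a named condition plus the action it triggers."""
    name: str
    condition: Callable[[dict], bool]
    action: str   # illustrative labels such as "notify" or "reset_gpu"


def evaluate(policies, event):
    """Return the (policy name, action) pairs triggered by one event."""
    return [(p.name, p.action) for p in policies if p.condition(event)]


policies = [
    # Any double-bit ECC error triggers an automated action.
    Policy("double-bit ECC", lambda e: e.get("dbe_count", 0) > 0, "reset_gpu"),
    # Crossing an assumed thermal threshold only triggers a notification.
    Policy("thermal limit", lambda e: e.get("temperature_c", 0) >= 90, "notify"),
]

event = {"gpu_id": 3, "temperature_c": 93, "dbe_count": 0}
triggered = evaluate(policies, event)
```

Keeping the condition separate from the action is what allows the same event class to be mapped to a notification in one environment and to automated recovery in another.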

Target Users

DCGM is targeted at the following users:

  • OEMs and ISVs wishing to improve GPU integration within their software.

  • Datacenter admins managing their own GPU-enabled infrastructure.

  • Individual users and FAEs needing better insight into GPU behavior, especially during problem analysis.

  • DGX-2 and HGX-2 users, all of whom use the Fabric Manager to configure and monitor the NVSwitch fabric.

DCGM provides different interfaces to serve different consumers and use cases. Programmatic access via C, Python and Go is geared towards integration with 3rd-party software. Python interfaces are also geared towards admin-centric scripting environments. CLI-based tools are present to provide an interactive out-of-the-box experience for end users. Each interface provides roughly equivalent functionality.
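As a small illustration of admin-centric scripting, the sketch below shells out to the `dcgmi` CLI from Python to list the GPUs DCGM can see. It assumes `dcgmi` is installed and on the PATH and that a host engine is running; the `list_gpus_text` helper and its injectable `which` and `run` parameters are hypothetical conveniences for this example, and the output is returned as raw text rather than parsed.

```python
import shutil
import subprocess

# "dcgmi discovery -l" lists the GPUs visible to DCGM.
DISCOVERY_CMD = ["dcgmi", "discovery", "-l"]


def list_gpus_text(which=shutil.which, run=subprocess.run):
    """Return dcgmi's GPU listing as text, or None if dcgmi is missing or fails.

    `which` and `run` are injectable so the function can be exercised
    on machines without DCGM installed.
    """
    if which(DISCOVERY_CMD[0]) is None:
        return None
    result = run(DISCOVERY_CMD, capture_output=True, text=True, check=False)
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    listing = list_gpus_text()
    print(listing if listing is not None else "dcgmi not available")
```

For deeper integration, the C, Python, and Go bindings avoid parsing CLI output altogether; wrapping the CLI is mainly useful for quick scripts.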