Overview

What is DCGM

The NVIDIA® Data Center GPU Manager (DCGM) simplifies administration of NVIDIA Tesla GPUs in cluster and datacenter environments. At its heart, DCGM is an intelligent, lightweight user space library/agent that performs a variety of functions on each host system:
  • GPU behavior monitoring
  • GPU configuration management
  • GPU policy oversight
  • GPU health and diagnostics
  • GPU accounting and process statistics
  • NVSwitch configuration and monitoring
This functionality is accessible programmatically through public APIs and interactively through CLI tools. DCGM is designed to run either as a standalone entity or as an embedded library within management tools.

This document provides an overview of DCGM’s main goals and features. It is intended for system administrators, ISV developers, and individual users managing groups of Tesla GPUs.

Terminology

Term                    Meaning
DCGM                    NVIDIA Data Center GPU Manager
NVIDIA Host Engine      Standalone executable wrapper for the DCGM shared library
Host Engine daemon      Daemon mode of operation for the NVIDIA Host Engine
Fabric Manager          A module within the Host Engine daemon that supports the NVSwitch fabric on DGX-2
3rd-party DCGM Agent    Any node-level process from a 3rd party that runs DCGM in Embedded Mode
Embedded Mode           DCGM executing as a shared library within a 3rd-party DCGM agent
Standalone Mode         DCGM executing as a standalone process via the Host Engine
System Validation       Health checks encompassing the GPU, board, and surrounding environment
HW diagnostic           System validation component focusing on GPU hardware correctness
RAS event               Reliability, Availability, Serviceability event; covers both fatal and non-fatal GPU issues
NVML                    NVIDIA Management Library

Focus Areas

DCGM’s design is geared towards the following key functional areas.

Manage GPUs as collections of related resources. Most large-scale GPU deployments have multiple GPUs per host, and often multiple hosts per job. Operators generally want homogeneous behavior across these related resources, even as specific expectations change from job to job or user to user, and even as multiple jobs use resources on the same host simultaneously. DCGM applies a group-centric philosophy to node-level GPU management.

Configure NVSwitches. On DGX-2, all GPUs communicate by way of NVSwitch. The Fabric Manager component of DCGM configures the switches to form a single memory fabric among all participating GPUs, and monitors the NVLinks that support the fabric.

Define and enforce GPU configuration state. The behavior of NVIDIA GPUs can be controlled by users to match requirements of particular environments or applications. This includes performance characteristics such as clock settings, exclusivity constraints like compute mode, and environmental controls like power limits. DCGM provides enforcement and persistence mechanisms to ensure behavioral consistency across related GPUs.
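
The idea of enforcing a configuration across related GPUs can be illustrated with a small model. The following is a minimal sketch in plain Python, not DCGM code: the GpuConfig and GpuGroup classes, their fields, and the sample values are all hypothetical and exist only to show the concept of a group-wide target state that is re-applied when an individual GPU drifts from it.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical illustration only: these classes are NOT part of the DCGM API.


@dataclass(frozen=True)
class GpuConfig:
    compute_mode: str        # exclusivity constraint, e.g. "DEFAULT"
    power_limit_watts: int   # environmental control
    sm_clock_mhz: int        # performance characteristic


@dataclass
class GpuGroup:
    name: str
    target: GpuConfig
    # Current per-GPU state keyed by GPU id. A real agent would read
    # this from the driver; here it is supplied by hand.
    current: Dict[int, GpuConfig] = field(default_factory=dict)

    def drifted(self) -> List[int]:
        """Return ids of GPUs whose state differs from the group target."""
        return sorted(
            gpu_id for gpu_id, cfg in self.current.items() if cfg != self.target
        )

    def enforce(self) -> List[int]:
        """Reset every drifted GPU to the target; return the ids changed."""
        changed = self.drifted()
        for gpu_id in changed:
            self.current[gpu_id] = self.target
        return changed


target = GpuConfig("EXCLUSIVE_PROCESS", 250, 1380)
group = GpuGroup(
    "job-42",
    target,
    {
        0: target,
        1: GpuConfig("DEFAULT", 250, 1380),  # compute mode no longer matches
        2: target,
    },
)
changed = group.enforce()
print(changed)
```

In this model the target state belongs to the group rather than to any single GPU, which is what lets one call restore consistency for every member.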

Automate GPU management policies. NVIDIA GPUs have advanced capabilities that facilitate error containment and identify problem areas. Automated policies that define GPU response to certain classes of events, including recovery from errors and isolation of bad hardware, ensure higher reliability and a simplified administration environment. DCGM provides policies for common situations that require notification or automated action.
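
A policy of this kind pairs a class of event with a response. The sketch below models that pairing in plain Python under stated assumptions: the Policy class, the event names ("pcie_replay", "double_bit_ecc"), the thresholds, and the notify and isolate actions are hypothetical and are not DCGM identifiers or defaults.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical illustration only: names below are NOT DCGM identifiers.


@dataclass
class Policy:
    condition: str                # event class being watched
    threshold: int                # trigger once the count reaches this value
    action: Callable[[int], str]  # called with the GPU id; returns a log line


def notify(gpu_id: int) -> str:
    return f"notify: GPU {gpu_id} violated policy"


def isolate(gpu_id: int) -> str:
    return f"isolate: GPU {gpu_id} removed from scheduling"


def evaluate(policies: List[Policy], counters: Dict[int, Dict[str, int]]) -> List[str]:
    """Check each GPU's event counters against every policy and run the
    action of each policy whose threshold has been reached."""
    log = []
    for gpu_id in sorted(counters):
        for policy in policies:
            if counters[gpu_id].get(policy.condition, 0) >= policy.threshold:
                log.append(policy.action(gpu_id))
    return log


policies = [
    Policy("pcie_replay", threshold=8, action=notify),
    Policy("double_bit_ecc", threshold=1, action=isolate),
]
counters = {
    0: {"pcie_replay": 2},
    1: {"pcie_replay": 9, "double_bit_ecc": 1},
}
log = evaluate(policies, counters)
print("\n".join(log))
```

Keeping the condition, threshold, and action together in one record means the same evaluation loop can serve both notification-only policies and policies that take automated action.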

Provide robust, online health and diagnostics. The ability to ascertain the health of a GPU and its interaction with the surrounding system is a critical management need. This need comes in various forms, from passive background monitoring to quick system validation to extensive hardware diagnostics. In all cases it is important to provide these features with minimal impact on the system and minimal additional environmental requirements. DCGM provides extensive automated and non-automated health and diagnostic capabilities.

Enable job-level statistics and accounting. Understanding GPU usage is important for schedulers and resource managers. Tying this information together with RAS events, performance information, and other telemetry, especially at the boundaries of a workload, helps explain job behavior and root-cause potential performance or execution issues. DCGM provides mechanisms to gather, group, and analyze data at the job level.
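
Grouping telemetry at the boundaries of a workload can be pictured as filtering samples by GPU and time window and then summarizing them. The following plain-Python sketch shows that idea only; the Sample record, its fields, the job_summary helper, and the sample values are hypothetical and do not reflect the data DCGM actually records or how it stores it.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Set

# Hypothetical illustration only: this is NOT how DCGM stores or exposes data.


@dataclass
class Sample:
    timestamp: float        # seconds since an arbitrary epoch
    gpu_id: int
    utilization_pct: float
    power_watts: float
    ecc_errors: int         # errors observed since the previous sample


def job_summary(
    samples: List[Sample], gpu_ids: Set[int], start: float, end: float
) -> Dict[str, float]:
    """Summarize the samples taken on the job's GPUs between start and end."""
    window = [
        s for s in samples
        if s.gpu_id in gpu_ids and start <= s.timestamp <= end
    ]
    if not window:
        return {"samples": 0}
    return {
        "samples": len(window),
        "avg_utilization_pct": mean(s.utilization_pct for s in window),
        "max_power_watts": max(s.power_watts for s in window),
        "ecc_errors": sum(s.ecc_errors for s in window),
    }


samples = [
    Sample(0.0, 0, 10.0, 60.0, 0),     # before the job starts
    Sample(10.0, 0, 90.0, 240.0, 0),
    Sample(10.0, 1, 80.0, 230.0, 1),
    Sample(20.0, 0, 100.0, 250.0, 0),
    Sample(20.0, 1, 70.0, 220.0, 0),
    Sample(30.0, 0, 5.0, 55.0, 0),     # after the job ends
]
summary = job_summary(samples, gpu_ids={0, 1}, start=10.0, end=20.0)
print(summary)
```

Because the summary is keyed by the job's GPU set and time window rather than by host, the same samples can be attributed to different jobs that share a host at different times.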

Target Users

DCGM is targeted at the following users:
  • OEMs and ISVs wishing to improve GPU integration within their software.
  • Datacenter admins managing their own GPU-enabled infrastructure.
  • Individual users and FAEs needing better insight into GPU behavior, especially during problem analysis.
  • DGX-2 users, all of whom use the Fabric Manager to configure and monitor the NVSwitch fabric.
DCGM provides different interfaces to serve different consumers and use cases. Programmatic access via C and Python is geared towards integration with 3rd-party software. Python interfaces are also geared towards admin-centric scripting environments. CLI-based tools are present to provide an interactive out-of-the-box experience for end users. Each interface provides roughly equivalent functionality.