Overview

What is DCGM

The NVIDIA® Data Center GPU Manager (DCGM) simplifies administration of NVIDIA Tesla GPUs in cluster and datacenter environments. At its heart, DCGM is an intelligent, lightweight user space library/agent that performs a variety of functions on each host system:
  • GPU behavior monitoring
  • GPU configuration management
  • GPU policy oversight
  • GPU health and diagnostics
  • GPU accounting and process statistics
  • NVSwitch configuration and monitoring
This functionality is accessible programmatically through public APIs and interactively through CLI tools. It is designed to be run either as a standalone entity or as an embedded library within management tools.

This document provides an overview of DCGM's main goals and features and is intended for system administrators, ISV developers, and individual users managing groups of Tesla GPUs.

Terminology

Term                    Meaning
----                    -------
DCGM                    NVIDIA's Data Center GPU Manager
NVIDIA Host Engine      Standalone executable wrapper for the DCGM shared library
Host Engine daemon      Daemon mode of operation for the NVIDIA Host Engine
Fabric Manager          A module within the Host Engine daemon that supports the NVSwitch fabric on DGX-2 and HGX-2
3rd-party DCGM Agent    Any node-level process from a third party that runs DCGM in Embedded Mode
Embedded Mode           DCGM executing as a shared library within a 3rd-party DCGM agent
Standalone Mode         DCGM executing as a standalone process via the Host Engine
System Validation       Health checks encompassing the GPU, board, and surrounding environment
HW Diagnostic           System validation component focusing on GPU hardware correctness
RAS event               Reliability, Availability, Serviceability event, corresponding to both fatal and non-fatal GPU issues
NVML                    NVIDIA Management Library

Focus Areas

DCGM’s design is geared towards the following key functional areas.

Manage GPUs as collections of related resources. In the majority of large-scale GPU deployments there are multiple GPUs per host, and often multiple hosts per job. In most cases there is a strong desire to ensure homogeneity of behavior across these related resources, even as specific expectations may change from job to job or user to user, and even as multiple jobs may use resources on the same host simultaneously. DCGM applies a group-centric philosophy to node-level GPU management.
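For example, the dcgmi CLI can gather related GPUs into a named group that subsequent configuration, health, and accounting commands operate on as a unit. The following sketch assumes a running Host Engine daemon; the group name, the GPU IDs, and the group ID returned by the create step (shown here as 1) are illustrative:

    $ dcgmi group -c gpu_group    # create an empty group named "gpu_group"; DCGM returns its group ID
    $ dcgmi group -g 1 -a 0,1     # add GPUs 0 and 1 to group 1
    $ dcgmi group -g 1 -i         # show the members of group 1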

Configure NVSwitches. On DGX-2 or HGX-2, all GPUs communicate by way of NVSwitch. The Fabric Manager component of DCGM configures the switches to form a single memory fabric among all participating GPUs, and monitors the NVLinks that support the fabric.

Define and enforce GPU configuration state. The behavior of NVIDIA GPUs can be controlled by users to match requirements of particular environments or applications. This includes performance characteristics such as clock settings, exclusivity constraints like compute mode, and environmental controls like power limits. DCGM provides enforcement and persistence mechanisms to ensure behavioral consistency across related GPUs.
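As a sketch of what this looks like from the CLI (the 250 W power limit and group ID 1 are illustrative; see dcgmi config --help for the full set of settable properties and flag spellings):

    $ dcgmi config -g 1 --set -P 250    # request a 250 W power limit on every GPU in group 1
    $ dcgmi config -g 1 --enforce       # re-apply the last requested configuration, e.g. after a GPU reset
    $ dcgmi config -g 1 --get           # compare target vs. current configuration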

Automate GPU management policies. NVIDIA GPUs have advanced capabilities that facilitate error containment and identify problem areas. Automated policies that define GPU response to certain classes of events, including recovery from errors and isolation of bad hardware, ensure higher reliability and a simplified administration environment. DCGM provides policies for common situations that require notification or automated action.
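A minimal illustration with the dcgmi CLI (group ID 1 is illustrative, and the exact --set semantics should be confirmed against dcgmi policy --help): the first command registers interest in ECC and PCIe errors with no automated response, and the second blocks and prints a notification whenever a watched event fires.

    $ dcgmi policy -g 1 --set 0,0 -e -p   # watch ECC (-e) and PCIe (-p) errors; 0,0 = no action, no validation
    $ dcgmi policy -g 1 --reg             # register for and listen to policy notifications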

Provide robust, online health and diagnostics. The ability to ascertain the health of a GPU and its interaction with the surrounding system is a critical management need. This need comes in various forms, from passive background monitoring to quick system validation to extensive hardware diagnostics. In all cases it is important to provide these features with minimal impact on the system and minimal additional environmental requirements. DCGM provides extensive automated and non-automated health and diagnostic capabilities.
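For instance, with the dcgmi CLI (group ID 1 and the chosen watch set are illustrative), background health watches and the quick system validation suite can be driven as follows:

    $ dcgmi health -g 1 -s mpi    # enable background watches for memory (m), PCIe (p), and infoROM (i)
    $ dcgmi health -g 1 -c        # report the current health of the group
    $ dcgmi diag -g 1 -r 1        # run the quick (level 1) system validation suite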

Enable job-level statistics and accounting. Understanding GPU usage is important for schedulers and resource managers. Tying this information together with RAS events, performance information and other telemetry, especially at the boundaries of a workload, is very useful in explaining job behavior and root-causing potential performance or execution issues. DCGM provides mechanisms to gather, group, and analyze data at the job level.
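A sketch of the CLI workflow, assuming group ID 1 and using "job42" as an arbitrary, user-chosen job label:

    $ dcgmi stats -g 1 -e         # enable process and job statistics recording on group 1
    $ dcgmi stats -g 1 -s job42   # mark the start of the job
    #   ... the workload runs ...
    $ dcgmi stats -x job42        # mark the end of the job
    $ dcgmi stats -j job42        # display the aggregated report for the job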

Target Users

DCGM is targeted at the following users:
  • OEMs and ISVs wishing to improve GPU integration within their software.
  • Datacenter admins managing their own GPU enabled infrastructure.
  • Individual users and FAEs needing better insight into GPU behavior, especially during problem analysis.
  • All DGX-2 and HGX-2 users, who rely on the Fabric Manager to configure and monitor the NVSwitch fabric.
DCGM provides different interfaces to serve different consumers and use cases. Programmatic access via C and Python is geared towards integration with 3rd-party software; the Python interfaces also suit admin-centric scripting environments. CLI-based tools provide an interactive out-of-the-box experience for end users. Each interface provides roughly equivalent functionality.
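As a first taste of the out-of-the-box CLI experience, assuming the Host Engine daemon (nv-hostengine) is running locally:

    $ dcgmi discovery -l          # list the GPUs (and NVSwitches, where present) visible to DCGM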