Overview¶
The NVIDIA® Data Center GPU Manager (DCGM) simplifies administration of NVIDIA Datacenter (previously “Tesla”) GPUs in cluster and datacenter environments. At its heart, DCGM is an intelligent, lightweight user space library/agent that performs a variety of functions on each host system:
GPU behavior monitoring
GPU configuration management
GPU policy oversight
GPU health and diagnostics
GPU accounting and process statistics
NVSwitch configuration and monitoring
This functionality is accessible programmatically through public APIs and interactively through CLI tools. It is designed to run either as a standalone entity or as an embedded library within management tools. This document provides an overview of DCGM's main goals and features and is intended for system administrators, ISV developers, and individual users managing groups of NVIDIA GPUs.
Terminology¶
| Term | Meaning |
|---|---|
| DCGM | NVIDIA's Datacenter GPU Manager |
| NVIDIA Host Engine | Standalone executable wrapper for the DCGM shared library |
| Host Engine daemon | Daemon mode of operation for the NVIDIA Host Engine |
| Fabric Manager | A module within the Host Engine daemon that supports the NVSwitch fabric on DGX-2 or HGX-2 |
| 3rd-party DCGM Agent | Any node-level process from a 3rd party that runs DCGM in Embedded Mode |
| Embedded Mode | DCGM executing as a shared library within a 3rd-party DCGM agent |
| Standalone Mode | DCGM executing as a standalone process via the Host Engine |
| System Validation | Health checks encompassing the GPU, board, and surrounding environment |
| HW diagnostic | System validation component focusing on GPU hardware correctness |
| RAS event | Reliability, Availability, Serviceability event; covers both fatal and non-fatal GPU issues |
| NVML | NVIDIA Management Library |
Focus Areas¶
DCGM’s design is geared towards the following key functional areas.
Provide robust, online health and diagnostics¶
The ability to ascertain the health of a GPU and its interaction with the surrounding system is a critical management need. This need comes in various forms, from passive background monitoring to quick system validation to extensive hardware diagnostics. In all cases it is important to provide these features with minimal impact on the system and minimal additional environmental requirements. DCGM provides extensive automated and non-automated health and diagnostic capabilities.
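As a sketch of how these capabilities are exposed through the `dcgmi` CLI (option names are illustrative and may vary across DCGM versions; a GPU group such as group 1 is assumed to exist):

```shell
# Quick system validation of GPU group 1; -r selects the run level,
# from 1 (short, non-invasive) up to longer hardware diagnostics
dcgmi diag -g 1 -r 1

# Enable passive background health watches on the group, then poll status
dcgmi health -g 1 -s a   # -s a: watch all supported subsystems
dcgmi health -g 1 -c     # report current health of the group
```

These commands require a running NVIDIA Host Engine (`nv-hostengine`) and NVIDIA GPUs on the node.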
Enable job-level statistics and continuous GPU telemetry¶
Understanding GPU usage is important for schedulers and resource managers. Tying this information together with RAS events, performance information and other telemetry, especially at the boundaries of a workload, is very useful in explaining job behavior and root-causing potential performance or execution issues. DCGM provides mechanisms to gather, group and analyze data at the job level. DCGM also provides continuous GPU telemetry with very low performance overhead.
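One way this surfaces in the `dcgmi` CLI (flags and field IDs are illustrative and may differ by DCGM version; `myjob001` is a hypothetical job identifier chosen here):

```shell
# Enable process/job statistics recording on GPU group 1
dcgmi stats -g 1 --enable

# Bracket a workload with a job identifier, then retrieve the job report
dcgmi stats -g 1 -s myjob001   # start recording under this job id
# ... run the GPU workload ...
dcgmi stats -g 1 -x myjob001   # stop recording
dcgmi stats -g 1 -j myjob001   # display the aggregated job statistics

# Continuous telemetry: stream selected fields (e.g. GPU temperature and
# power draw) once per second
dcgmi dmon -e 150,155 -d 1000
```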
Configure NVSwitches¶
On DGX-2 or HGX-2, all GPUs communicate by way of NVSwitch. The Fabric Manager component of DCGM configures the switches to form a single memory fabric among all participating GPUs, and monitors the NVLinks that support the fabric.
Note
As of v2.x, DCGM no longer includes the Fabric Manager; it is a separate component that must be installed on NVSwitch-based systems.
Define and enforce GPU configuration state¶
The behavior of NVIDIA GPUs can be controlled by users to match requirements of particular environments or applications. This includes performance characteristics such as clock settings, exclusivity constraints like compute mode, and environmental controls like power limits. DCGM provides enforcement and persistence mechanisms to ensure behavioral consistency across related GPUs.
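A minimal sketch of this workflow with `dcgmi`, assuming GPUs 0 and 1 on the node and that the created group receives ID 1 (flag names may vary by DCGM version):

```shell
# Create a named group containing GPUs 0 and 1; dcgmi prints the group ID
dcgmi group -c prodgpus -a 0,1

# Define a target configuration on the group, e.g. a 250 W power limit
dcgmi config -g 1 --set -P 250

# Inspect the current configuration, and re-apply (enforce) the target
# state, e.g. after a GPU reset or driver reload
dcgmi config -g 1 --get
dcgmi config -g 1 --enforce
```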
Automate GPU management policies¶
NVIDIA GPUs have advanced capabilities that facilitate error containment and identify problem areas. Automated policies that define GPU response to certain classes of events, including recovery from errors and isolation of bad hardware, ensure higher reliability and a simplified administration environment. DCGM provides policies for common situations that require notification or automated action.
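As an illustrative `dcgmi` sketch (the exact `--set` arguments and condition flags depend on the DCGM version, so treat these as assumptions rather than a definitive invocation):

```shell
# Register a policy on group 1 that reacts to double-bit ECC errors (-e)
# and PCIe errors (-p); "0,0" selects the action,validation pair
dcgmi policy -g 1 --set 0,0 -e -p

# Show the registered policies, then block waiting for violation callbacks
dcgmi policy -g 1 --get
dcgmi policy -g 1 --reg
```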
Target Users¶
DCGM is targeted at the following users:
OEMs and ISVs wishing to improve GPU integration within their software.
Datacenter admins managing their own GPU-enabled infrastructure.
Individual users and FAEs needing better insight into GPU behavior, especially during problem analysis.
All DGX-2 and HGX-2 users, who use the Fabric Manager to configure and monitor the NVSwitch fabric.
DCGM provides different interfaces to serve different consumers and use cases. Programmatic access via C, Python and Go is geared towards integration with 3rd-party software. Python interfaces are also geared towards admin-centric scripting environments. CLI-based tools are present to provide an interactive out-of-the-box experience for end users. Each interface provides roughly equivalent functionality.
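As one sketch of the programmatic path, the Python bindings shipped with DCGM (module names such as `pydcgm` and `dcgm_structs` reflect common DCGM releases) can connect to a running Host Engine and enumerate GPUs; this requires DCGM installed and `nv-hostengine` running locally:

```python
# Assumes the DCGM Python bindings are on PYTHONPATH (typically installed
# under the DCGM bindings directory) and nv-hostengine is listening locally.
import pydcgm
import dcgm_structs

# Connect to the standalone Host Engine (Standalone Mode)
handle = pydcgm.DcgmHandle(ipAddress="127.0.0.1",
                           opMode=dcgm_structs.DCGM_OPERATION_MODE_AUTO)
system = handle.GetSystem()

# Enumerate the GPUs DCGM can see on this host
gpu_ids = system.discovery.GetAllGpuIds()
print("DCGM sees GPUs:", gpu_ids)

handle.Shutdown()
```

An equivalent out-of-the-box check from the CLI is `dcgmi discovery -l`, which lists the GPUs visible to DCGM.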