Data Center GPU Manager User Guide

This document describes how to use the NVIDIA Data Center GPU Manager (DCGM) software.

Table of Contents

1. Overview
    1.1. What is DCGM
    1.2. Focus Areas
    1.3. Target Users
2. Getting Started
    2.1. Supported Platforms
    2.2. Installation
    2.3. Basic Components
    2.4. Modes of Operation
        2.4.1. Embedded Mode
        2.4.2. Standalone Mode
    2.5. Static Library
3. Feature Overview
    3.1. Groups
    3.2. Configuration
    3.3. Policy
        3.3.1. Notifications
        3.3.2. Actions
    3.4. Job Stats
    3.5. Health and Diagnostics
        3.5.1. Background Health Checks
        3.5.2. Active Health Checks
    3.6. Topology
    3.7. NVLink Counters
    3.8. Field Groups
    3.9. Link Status
4. Integrating with DCGM
    4.1. Integrating with DCGM Reader
        4.1.1. Reading Using the Dictionary
        4.1.2. Reading Using Inheritance
        4.1.3. Completing the Proof of Concept
        4.1.4. Additional Customization
    4.2. Integrating with Prometheus and Grafana
        4.2.1. Starting the Prometheus Server
        4.2.2. Starting the Prometheus Client
        4.2.3. Integrating with Grafana
        4.2.4. Customizing the Prometheus Client
5. DCGM Diagnostics
    5.1. Overview
        5.1.1. DCGM Diagnostics Goals
        5.1.2. Beyond the Scope of the DCGM Diagnostics
        5.1.3. Dependencies
        5.1.4. Supported Products
    5.2. Using DCGM Diagnostics
        5.2.1. Command line options
        5.2.2. Usage Examples
        5.2.3. Configuration file
        5.2.4. Global parameters
        5.2.5. GPU parameters
        5.2.6. Test Parameters
    5.3. Overview of Plugins
        5.3.1. Deployment Plugin
        5.3.2. PCIe - GPU Bandwidth Plugin
        5.3.3. Memory Bandwidth Plugin
        5.3.4. SM Stress Plugin
        5.3.5. Hardware Diagnostic Plugin
        5.3.6. Targeted Stress Plugin
        5.3.7. Power Plugin
    5.4. Test Output
        5.4.1. JSON Output
6. DCGM Modularity
    6.1. Module List
    6.2. Blacklisting Modules
Notices