Version: 3.1 (Latest)
NVIDIA DCGM Documentation
Modules
Administrative
Init and Shutdown
Auxiliary information about DCGM engine
System
Discovery
Grouping
Field Grouping
Status Handling
Configuration
Setup and Management
Manual Invocation
Field APIs
Process Statistics
Job Statistics
Health Monitor
Policies
Setup and Management
Manual Invocation
Topology
Metadata
Topology
Modules
Profiling
Enums and Macros
Structure Definitions
Field Types
Field Scope
Field Entity
Field Identifiers
DCGMAPI_Admin_ExecCtrl