Copyright © 2026, NVIDIA Corporation.

GPU Health Monitor Configuration

Overview

The GPU Health Monitor module watches GPU health using NVIDIA DCGM (Data Center GPU Manager) and reports hardware failures. This document covers all Helm configuration options for system administrators.

DCGM Deployment Modes

DCGM always runs as a DaemonSet with one pod per GPU node. The GPU Health Monitor can connect to it in one of two modes:

DCGM with Kubernetes Service

DCGM DaemonSet exposes a Kubernetes service. GPU Health Monitor pods connect to DCGM on their local node via this service endpoint.

Characteristics:

  • DCGM runs as a DaemonSet (one pod per GPU node)
  • Kubernetes service provides DNS endpoint for DCGM
  • GPU Health Monitor connects via service DNS name

DCGM with Host Networking

DCGM DaemonSet uses host networking. GPU Health Monitor pods connect to DCGM via localhost:5555 on the host network.

Characteristics:

  • DCGM runs as a DaemonSet with hostNetwork: true
  • No Kubernetes service needed
  • GPU Health Monitor connects to localhost:5555

Configuration Reference

Module Enable/Disable

Controls whether the gpu-health-monitor module is deployed in the cluster.

global:
  gpuHealthMonitor:
    enabled: true

Resources

Defines CPU and memory resource requests and limits for the gpu-health-monitor pod.

gpu-health-monitor:
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 128Mi

Logging

Controls verbosity of gpu-health-monitor logs.

gpu-health-monitor:
  verbose: "False"  # Options: "True", "False"

DCGM Configuration

DCGM Service Mode

Configuration for connecting to DCGM running as a Kubernetes service.

gpu-health-monitor:
  dcgm:
    dcgmK8sServiceEnabled: true
    service:
      endpoint: "nvidia-dcgm.gpu-operator.svc"
      port: 5555

Parameters

dcgmK8sServiceEnabled

Enables connection to DCGM via a Kubernetes service. When true, the monitor uses service.endpoint and service.port. When false, it connects to localhost:5555 (host networking mode).

service.endpoint

Kubernetes service DNS name for DCGM. Typically the DCGM service deployed by GPU Operator.

service.port

Port where DCGM is listening. Default is 5555.

DCGM Service Examples

Example 1: GPU Operator DCGM Service

dcgm:
  dcgmK8sServiceEnabled: true
  service:
    endpoint: "nvidia-dcgm.gpu-operator.svc"
    port: 5555

Example 2: Custom Namespace DCGM Service

dcgm:
  dcgmK8sServiceEnabled: true
  service:
    endpoint: "dcgm-service.custom-namespace.svc.cluster.local"
    port: 5555
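
The service examples show only the dcgm block. As a rough sketch, a complete values file for service mode might look like the following. It assumes the keys nest as shown in the Module Enable/Disable, DCGM Service Mode, and Host Networking sections of this page; check the nesting against the chart's own values.yaml before applying it.

```yaml
# Sketch only: nesting is assumed from the fragments on this page.
global:
  gpuHealthMonitor:
    enabled: true              # deploy the gpu-health-monitor module

gpu-health-monitor:
  dcgm:
    dcgmK8sServiceEnabled: true
    service:
      endpoint: "nvidia-dcgm.gpu-operator.svc"  # DCGM service from GPU Operator
      port: 5555
  useHostNetworking: false     # not needed when connecting through a service
```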

Host Networking

Enables host network mode for GPU Health Monitor pods.

gpu-health-monitor:
  useHostNetworking: false

Set to true when DCGM is deployed with host networking (dcgm.dcgmK8sServiceEnabled: false). In this mode, GPU Health Monitor connects to DCGM via localhost:5555 on the host network.

Example: Host Networking Mode for connecting to DCGM

dcgm:
  dcgmK8sServiceEnabled: false

useHostNetworking: true
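
The fragment above omits the parent key. Assuming both settings sit under gpu-health-monitor, as the other snippets on this page suggest, a fuller host networking sketch could read:

```yaml
# Sketch only: assumes both keys live under gpu-health-monitor.
gpu-health-monitor:
  dcgm:
    dcgmK8sServiceEnabled: false   # skip the service; connect to localhost:5555
  useHostNetworking: true          # pod shares the host network with DCGM
```

In this arrangement the service.endpoint and service.port values are not used, because the monitor connects to localhost:5555.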

Additional Volumes

Extension point for mounting additional host paths required by DCGM in specific environments.

Configuration Structure

gpu-health-monitor:
  additionalVolumeMounts: []
  additionalHostVolumes: []

Parameters

additionalVolumeMounts

List of volume mounts to add to the GPU Health Monitor container. Each mount specifies where a volume should be mounted inside the container.

additionalHostVolumes

List of host path volumes to make available to the pod. Each volume references a path on the host node.

When to Use Additional Volumes

Additional volumes are required in environments where DCGM needs access to GPU drivers or libraries installed in non-standard host locations.

Common scenarios:

  • GCP GKE nodes with GPU drivers in /home/kubernetes/bin/nvidia
  • Custom driver installation paths
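
For the custom driver path scenario, a minimal sketch might pair one host volume with one mount. The host path /opt/nvidia-drivers and the volume name custom-nvidia-dir are hypothetical placeholders, not values from the chart; the container mount path is borrowed from the GKE example on this page and may differ in your environment.

```yaml
# Sketch only: the host path and volume name are hypothetical.
gpu-health-monitor:
  additionalVolumeMounts:
    - mountPath: /usr/local/nvidia   # where the container sees the drivers
      name: custom-nvidia-dir        # must match the volume name below
      readOnly: true

  additionalHostVolumes:
    - name: custom-nvidia-dir
      hostPath:
        path: /opt/nvidia-drivers    # hypothetical driver location on the node
        type: Directory
```

As with standard Kubernetes volumes, each mount refers to its volume by name, so the two name fields need to agree.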

Volume Mount Examples

Example 1: GCP GKE Configuration

GCP GKE installs NVIDIA drivers and Vulkan ICD files in custom locations that the DCGM SDK needs to access.

gpu-health-monitor:
  additionalVolumeMounts:
    - mountPath: /usr/local/nvidia
      name: nvidia-install-dir-host
      readOnly: true
    - mountPath: /etc/vulkan/icd.d
      name: vulkan-icd-mount
      readOnly: true

  additionalHostVolumes:
    - name: nvidia-install-dir-host
      hostPath:
        path: /home/kubernetes/bin/nvidia
        type: Directory
    - name: vulkan-icd-mount
      hostPath:
        path: /home/kubernetes/bin/nvidia/vulkan/icd.d
        type: Directory