GPU Health Monitor Configuration


Overview

The GPU Health Monitor module monitors GPU health using NVIDIA DCGM (Data Center GPU Manager) and reports hardware failures. This document covers all Helm configuration options for system administrators.

DCGM Deployment Modes

DCGM always runs as a DaemonSet with one pod per GPU node. The GPU Health Monitor can connect to DCGM in one of two modes:

DCGM with Kubernetes Service

The DCGM DaemonSet exposes a Kubernetes service. GPU Health Monitor pods connect to DCGM on their local node via this service endpoint.

Characteristics:

  • DCGM runs as a DaemonSet (one pod per GPU node)
  • Kubernetes service provides DNS endpoint for DCGM
  • GPU Health Monitor connects via service DNS name
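
For reference, a minimal Helm values sketch for this mode might look like the following (the endpoint shown assumes the default DCGM service deployed by GPU Operator; the full options are described under DCGM Configuration below):

gpu-health-monitor:
  useHostNetworking: false
  dcgm:
    dcgmK8sServiceEnabled: true
    service:
      endpoint: "nvidia-dcgm.gpu-operator.svc"  # assumed GPU Operator default; adjust to your DCGM service
      port: 5555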

DCGM with Host Networking

The DCGM DaemonSet uses host networking. GPU Health Monitor pods connect to DCGM via localhost:5555 on the host network.

Characteristics:

  • DCGM runs as a DaemonSet with hostNetwork: true
  • No Kubernetes service needed
  • GPU Health Monitor connects to localhost:5555
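
A minimal values sketch for this mode might look like the following (key names match the Configuration Reference below):

gpu-health-monitor:
  useHostNetworking: true        # join the host network so localhost reaches the node's DCGM
  dcgm:
    dcgmK8sServiceEnabled: false # no Kubernetes service; connect to localhost:5555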

Configuration Reference

Module Enable/Disable

Controls whether the gpu-health-monitor module is deployed in the cluster.

global:
  gpuHealthMonitor:
    enabled: true

Resources

Defines CPU and memory resource requests and limits for the gpu-health-monitor pod.

gpu-health-monitor:
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 128Mi

Logging

Controls verbosity of gpu-health-monitor logs.

gpu-health-monitor:
  verbose: "False"  # Options: "True", "False"

DCGM Configuration

DCGM Service Mode

Configuration for connecting to DCGM running as a Kubernetes service.

gpu-health-monitor:
  dcgm:
    dcgmK8sServiceEnabled: true
    service:
      endpoint: "nvidia-dcgm.gpu-operator.svc"
      port: 5555

Parameters

dcgmK8sServiceEnabled

Enables connection to DCGM via a Kubernetes service. When true, the GPU Health Monitor uses service.endpoint and service.port. When false, it connects to localhost:5555 (host networking mode).

service.endpoint

Kubernetes service DNS name for DCGM. Typically the DCGM service deployed by GPU Operator.

service.port

Port where DCGM is listening. Default is 5555.

DCGM Service Examples

Example 1: GPU Operator DCGM Service

dcgm:
  dcgmK8sServiceEnabled: true
  service:
    endpoint: "nvidia-dcgm.gpu-operator.svc"
    port: 5555

Example 2: Custom Namespace DCGM Service

dcgm:
  dcgmK8sServiceEnabled: true
  service:
    endpoint: "dcgm-service.custom-namespace.svc.cluster.local"
    port: 5555

Host Networking

Enables host network mode for GPU Health Monitor pods.

gpu-health-monitor:
  useHostNetworking: false

Set to true when DCGM is deployed with host networking (dcgm.dcgmK8sServiceEnabled: false). In this mode, GPU Health Monitor connects to DCGM via localhost:5555 on the host network.

Example: Connecting to DCGM via Host Networking

dcgm:
  dcgmK8sServiceEnabled: false

useHostNetworking: true

Additional Volumes

Extension point for mounting additional host paths required by DCGM in specific environments.

Configuration Structure

gpu-health-monitor:
  additionalVolumeMounts: []
  additionalHostVolumes: []

Parameters

additionalVolumeMounts

List of volume mounts to add to the GPU Health Monitor container. Each mount specifies where a volume should be mounted inside the container.

additionalHostVolumes

List of host path volumes to make available to the pod. Each volume references a path on the host node.
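
As a minimal sketch of how the two lists pair up, the name field of each mount must match the name of a host volume; the paths below are hypothetical placeholders (see the GKE example under Volume Mount Examples for a real-world configuration):

gpu-health-monitor:
  additionalVolumeMounts:
    - mountPath: /usr/local/nvidia   # where the host directory appears inside the container
      name: driver-dir               # must match a volume name in additionalHostVolumes
      readOnly: true
  additionalHostVolumes:
    - name: driver-dir
      hostPath:
        path: /opt/custom/nvidia     # hypothetical host path; replace with your actual location
        type: Directory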

When to Use Additional Volumes

Additional volumes are required in environments where DCGM needs access to GPU drivers or libraries installed in non-standard host locations.

Common scenarios:

  • GCP GKE nodes with GPU drivers in /home/kubernetes/bin/nvidia
  • Custom driver installation paths (see the hypothetical example below)

Volume Mount Examples

Example 1: GCP GKE Configuration

GCP GKE installs NVIDIA drivers and Vulkan ICD files in custom locations that the DCGM SDK needs to access.

gpu-health-monitor:
  additionalVolumeMounts:
    - mountPath: /usr/local/nvidia
      name: nvidia-install-dir-host
      readOnly: true
    - mountPath: /etc/vulkan/icd.d
      name: vulkan-icd-mount
      readOnly: true

  additionalHostVolumes:
    - name: nvidia-install-dir-host
      hostPath:
        path: /home/kubernetes/bin/nvidia
        type: Directory
    - name: vulkan-icd-mount
      hostPath:
        path: /home/kubernetes/bin/nvidia/vulkan/icd.d
        type: Directory
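
Example 2: Custom Driver Installation Path

A sketch for the "custom driver installation paths" scenario above. The host path /opt/nvidia/driver is an assumed placeholder; replace it with the actual driver location on your nodes:

gpu-health-monitor:
  additionalVolumeMounts:
    - mountPath: /usr/local/nvidia   # mount point inside the container (mirrors the GKE example above)
      name: custom-driver-dir
      readOnly: true
  additionalHostVolumes:
    - name: custom-driver-dir
      hostPath:
        path: /opt/nvidia/driver     # hypothetical custom driver install path on the host
        type: Directory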