The GPU Health Monitor module watches GPU health using NVIDIA DCGM (Data Center GPU Manager) and reports hardware failures. This document covers all Helm configuration options for system administrators.
DCGM always runs as a DaemonSet with one pod per GPU node. The GPU Health Monitor can connect to DCGM in one of two modes:
The DCGM DaemonSet exposes a Kubernetes service. GPU Health Monitor pods connect to the DCGM instance on their local node through this service endpoint.
Characteristics:
The DCGM DaemonSet uses host networking. GPU Health Monitor pods connect to DCGM at localhost:5555 on the host network.
Characteristics:
hostNetwork: true
localhost:5555
Controls whether the gpu-health-monitor module is deployed in the cluster.
Defines CPU and memory resource requests and limits for the gpu-health-monitor pod.
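As a rough illustration, resource settings in Helm charts usually follow the standard Kubernetes requests/limits shape. The fragment below is a sketch under that assumption: the gpu-health-monitor top-level key, the resources key name, and all numeric values are placeholders chosen for illustration, not defaults taken from the chart.

```yaml
# Sketch only: key placement and values are assumptions, not chart defaults.
gpu-health-monitor:
  resources:
    requests:
      cpu: 100m        # illustrative value
      memory: 128Mi    # illustrative value
    limits:
      cpu: 500m        # illustrative value
      memory: 512Mi    # illustrative value
```

Check the chart's values.yaml for the actual key names and defaults before overriding them.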
Controls verbosity of gpu-health-monitor logs.
Configuration for connecting to DCGM running as a Kubernetes service.
Enables connection to DCGM via Kubernetes service. When true, uses service.endpoint and service.port. When false, connects to localhost:5555 (sidecar mode).
Kubernetes service DNS name for DCGM. Typically the DCGM service deployed by GPU Operator.
Port where DCGM is listening. Default is 5555.
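The three service settings above can be sketched together as a values fragment. The nesting follows the dotted names used in this document (dcgm.dcgmK8sServiceEnabled, service.endpoint, service.port); the gpu-health-monitor top-level key and the endpoint value are assumptions for illustration only.

```yaml
# Sketch only: top-level key and endpoint value are assumptions.
gpu-health-monitor:
  dcgm:
    # Connect through the DCGM Kubernetes service instead of localhost:5555.
    dcgmK8sServiceEnabled: true
    service:
      # Hypothetical DNS name; replace with the DCGM service in your cluster.
      endpoint: "nvidia-dcgm.gpu-operator.svc"
      # Default DCGM port per this document.
      port: 5555
```

Confirm the service name in your cluster, since it depends on how the GPU Operator was installed.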
Enables host network mode for GPU Health Monitor pods.
Set to true when DCGM is deployed with host networking (dcgm.dcgmK8sServiceEnabled: false). In this mode, GPU Health Monitor connects to DCGM via localhost:5555 on the host network.
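A host-networking setup might be expressed roughly as follows. Only dcgm.dcgmK8sServiceEnabled is named in this document; the useHostNetworking key and the gpu-health-monitor top-level key are hypothetical stand-ins, so take the real names from the chart's values.yaml.

```yaml
# Sketch only: "useHostNetworking" is a hypothetical key name.
gpu-health-monitor:
  dcgm:
    # Skip the Kubernetes service; DCGM is reached at localhost:5555.
    dcgmK8sServiceEnabled: false
  # Hypothetical toggle that runs the monitor pods on the host network.
  useHostNetworking: true
```

The two settings are meant to be changed together: with the service disabled, the monitor can only reach localhost:5555 if it shares the host network with DCGM.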
Extension point for mounting additional host paths required by DCGM in specific environments.
List of volume mounts to add to the GPU Health Monitor container. Each mount specifies where a volume should be mounted inside the container.
List of host path volumes to make available to the pod. Each volume references a path on the host node.
Additional volumes are required in environments where DCGM needs access to GPU drivers or libraries installed in non-standard host locations.
Common scenarios:
/home/kubernetes/bin/nvidia
GCP GKE installs NVIDIA drivers and Vulkan ICD files in custom locations that the DCGM SDK needs to access.
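For the GKE scenario, a values fragment might look roughly like the sketch below. The host path /home/kubernetes/bin/nvidia comes from this document; the key names additionalVolumeMounts and additionalHostVolumes, the volume name, and the in-container mount path are assumptions for illustration. The entries themselves follow the standard Kubernetes volumeMounts and hostPath volume shapes.

```yaml
# Sketch only: key names, volume name, and mountPath are assumptions.
gpu-health-monitor:
  # Hypothetical key: mounts added to the GPU Health Monitor container.
  additionalVolumeMounts:
    - name: nvidia-install-dir
      mountPath: /usr/local/nvidia   # assumed in-container location
      readOnly: true
  # Hypothetical key: host paths made available to the pod.
  additionalHostVolumes:
    - name: nvidia-install-dir
      hostPath:
        path: /home/kubernetes/bin/nvidia   # GKE driver location from this document
        type: Directory
```

Each mount's name must match a volume's name, as with any Kubernetes pod spec.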