# Runbook: GPU Health Monitor DCGM Connectivity Failures
## Overview
The GPU health monitor requires a connection to NVIDIA DCGM (Data Center GPU Manager) for all GPU health checks. A connectivity failure halts GPU monitoring entirely on the affected node.
Key points:

- DCGM can be exposed via a Kubernetes service or on localhost
- Failures generate the `GpuDcgmConnectivityFailure` node condition
- GPU health monitoring is lost completely on the affected node
## Symptoms
- Node condition `GpuDcgmConnectivityFailure` is present
- GPU monitor logs show DCGM connection errors
## Procedure
### 1. Check GPU Monitor Logs
Look for:

- `"Error getting DCGM handle"`
- `"DCGM connectivity failure detected"`
- `"Failed to connect to DCGM"`
### 2. Identify DCGM Configuration

Check which DCGM mode is in use by verifying whether the GPU Operator DCGM service exists:
If the service exists, the cluster is using Kubernetes Service Mode. If the service doesn’t exist or is not exposed, the cluster is using Localhost Mode.
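A minimal check for the service, assuming the GPU Operator's default service name (`nvidia-dcgm`) and namespace (`gpu-operator`):

```shell
# If the nvidia-dcgm Service exists, the cluster is in Kubernetes Service
# Mode; otherwise it is in Localhost Mode. Service name and namespace
# follow GPU Operator defaults and may differ in your cluster.
if kubectl get svc nvidia-dcgm -n gpu-operator >/dev/null 2>&1; then
  echo "Kubernetes Service Mode"
else
  echo "Localhost Mode"
fi
```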
Verify the gpu-health-monitor pod configuration matches the expected mode:
Expected configurations:

- Kubernetes Service Mode: `--dcgm-addr nvidia-dcgm.gpu-operator.svc:5555` and `--dcgm-k8s-service-enabled true`
- Localhost Mode: `--dcgm-addr localhost:5555` and `--dcgm-k8s-service-enabled false` (requires `hostNetwork: true`)
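To see which flags the pod is actually running with, the container args can be dumped from the DaemonSet. The DaemonSet name (`gpu-health-monitor`) and namespace (`nvsentinel`) are assumptions:

```shell
# Print the gpu-health-monitor container arguments and keep only the
# DCGM-related flags. DaemonSet name and namespace are assumptions.
kubectl get daemonset gpu-health-monitor -n nvsentinel \
  -o jsonpath='{.spec.template.spec.containers[0].args}' \
  | tr ',' '\n' | grep -E 'dcgm-addr|dcgm-k8s-service-enabled'
```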
These values come from the Helm values `dcgm.dcgmK8sServiceEnabled` and `dcgm.service.endpoint`/`dcgm.service.port`.
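For Kubernetes Service Mode, the corresponding Helm values might look like the following sketch. The key paths are taken from the names above; the concrete values are assumptions matching the expected flags:

```yaml
# Sketch of Helm values for Kubernetes Service Mode; values shown are
# assumptions matching the expected --dcgm-addr flag above.
dcgm:
  dcgmK8sServiceEnabled: true
  service:
    endpoint: nvidia-dcgm.gpu-operator.svc
    port: 5555
```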
### 3. Verify DCGM Pod Running

The DCGM pod must be Running on the same node as the failing GPU monitor.
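A quick way to check, assuming the GPU Operator namespace and an `app=nvidia-dcgm` label (both may differ in your cluster):

```shell
# List DCGM pods scheduled on the affected node; expect one in Running
# state. Namespace and label selector are assumptions.
NODE=worker-gpu-01   # replace with the affected node name
kubectl get pods -n gpu-operator -l app=nvidia-dcgm \
  --field-selector spec.nodeName="$NODE" -o wide
```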
### 4. Test DCGM Connectivity
Test DCGM connectivity from within the gpu-health-monitor pod:
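If `dcgmi` is not available inside the pod, a plain TCP check against the configured address can stand in. This is a sketch: the pod name, the `nvsentinel` namespace, and the presence of bash (for `/dev/tcp`) in the container are all assumptions.

```shell
# Check that the DCGM port is reachable from inside the gpu-health-monitor
# pod. DCGM_ADDR must match the pod's --dcgm-addr flag; pod name,
# namespace, and bash availability in the container are assumptions.
POD=gpu-health-monitor-abcde                   # replace with the pod on the node
DCGM_ADDR=nvidia-dcgm.gpu-operator.svc:5555    # or localhost:5555
kubectl exec -n nvsentinel "$POD" -- \
  bash -c "timeout 5 bash -c '</dev/tcp/${DCGM_ADDR%:*}/${DCGM_ADDR#*:}' \
    && echo reachable || echo unreachable"
```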
If DCGM commands fail, check:

- DCGM service exists: `kubectl get svc -n gpu-operator | grep dcgm`
- DCGM pod is running on the same node
- Network policies allow traffic from the nvsentinel namespace to the gpu-operator namespace
- For localhost mode: verify `hostNetwork: true` in the gpu-health-monitor DaemonSet
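The checks above can be run in one sweep; namespaces, names, and selectors follow the assumptions used earlier in this runbook:

```shell
# Diagnostic sweep for the failure checklist. Namespaces and selectors
# are assumptions -- adjust for your cluster.
NODE=worker-gpu-01   # replace with the affected node name
kubectl get svc -n gpu-operator | grep dcgm           # DCGM service exists
kubectl get pods -n gpu-operator -l app=nvidia-dcgm \
  --field-selector spec.nodeName="$NODE"              # DCGM pod on the node
kubectl get networkpolicy -n gpu-operator             # policies that may block traffic
kubectl get daemonset gpu-health-monitor -n nvsentinel \
  -o jsonpath='{.spec.template.spec.hostNetwork}'     # should be true for localhost mode
```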