GPU Monitor DCGM Failures | NVIDIA NVSentinel Documentation

Overview

The GPU health monitor needs a connection to NVIDIA DCGM for every GPU health check. If that connection fails, GPU monitoring stops entirely on the affected node.

Key points:

DCGM can be exposed through a Kubernetes service or on localhost
Connectivity failures set the GpuDcgmConnectivityFailure node condition
The affected node loses all GPU health monitoring

Symptoms

Node condition GpuDcgmConnectivityFailure present
GPU monitor logs show DCGM connection errors
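
To find every node that currently shows the first symptom, the node conditions can be queried across the cluster. The following is a minimal sketch, not part of NVSentinel: the function name nodes_with_dcgm_failure is hypothetical, and it assumes the condition reports status True while the failure is active.

```shell
# List nodes whose GpuDcgmConnectivityFailure condition is currently True.
# Assumption: the condition uses status "True" while the failure is active.
nodes_with_dcgm_failure() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="GpuDcgmConnectivityFailure")].status}{"\n"}{end}' |
    awk -F'\t' '$2 == "True" { print $1 }'
}
```

Running nodes_with_dcgm_failure with no arguments prints one node name per line. Nodes where the condition is False or absent are omitted.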

Procedure

1. Check GPU Monitor Logs

$ kubectl logs -n nvsentinel <GPU_MONITOR_POD> --tail=50 | grep -i dcgm

Look for:

"Error getting DCGM handle"
"DCGM connectivity failure detected"
"Failed to connect to DCGM"

2. Identify DCGM Configuration

Determine which DCGM mode is in use by checking whether the GPU Operator DCGM service exists:

# Check if DCGM service exists
$ kubectl get svc -n gpu-operator nvidia-dcgm

# If service exists, check service details
$ kubectl get svc -n gpu-operator nvidia-dcgm -o yaml

If the service exists, the cluster is using Kubernetes Service Mode. If the service doesn’t exist or is not exposed, the cluster is using Localhost Mode.

Verify the gpu-health-monitor pod configuration matches the expected mode:

$ kubectl get pod -n nvsentinel <GPU_MONITOR_POD> -o yaml | grep -A 2 "dcgm-addr"

Expected configurations:

Kubernetes Service Mode: --dcgm-addr nvidia-dcgm.gpu-operator.svc:5555 and --dcgm-k8s-service-enabled true
Localhost Mode: --dcgm-addr localhost:5555 and --dcgm-k8s-service-enabled false (requires hostNetwork: true)

These flags are set from the Helm values dcgm.dcgmK8sServiceEnabled, dcgm.service.endpoint, and dcgm.service.port.
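
The comparison against the expected configurations can be scripted. The sketch below reads a pod spec on stdin and reports which mode it appears to use. The function name classify_dcgm_mode and its output labels are hypothetical, and it matches only the two addresses listed under the expected configurations, so a cluster with a custom endpoint is reported as unknown.

```shell
# Classify the DCGM mode from a pod spec read on stdin.
# Only the two documented addresses are recognized; anything else is "unknown".
classify_dcgm_mode() {
  spec="$(cat)"
  case "$spec" in
    *nvidia-dcgm.gpu-operator.svc:5555*)
      echo "kubernetes-service" ;;
    *localhost:5555*)
      # Localhost Mode only works when the pod shares the host network.
      case "$spec" in
        *"hostNetwork: true"*) echo "localhost" ;;
        *) echo "localhost-without-hostnetwork" ;;
      esac ;;
    *)
      echo "unknown" ;;
  esac
}
```

Example usage: kubectl get pod -n nvsentinel <GPU_MONITOR_POD> -o yaml | classify_dcgm_mode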

3. Verify the DCGM Pod Is Running

# Check DCGM pod on affected node
$ kubectl get pods -n gpu-operator -l app=nvidia-dcgm -o wide

# Check DCGM logs
$ kubectl logs -n gpu-operator <DCGM_POD> --tail=30

The DCGM pod must be in the Running state on the same node as the failing GPU monitor.
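
On a large cluster it can help to restrict the pod listing to the affected node. The following sketch uses a kubectl field selector on spec.nodeName together with the app=nvidia-dcgm label from the command above. The function names are hypothetical. Note that a pod in the Running phase can still have containers that are not ready.

```shell
# Print the name and phase of DCGM pods scheduled on one node.
dcgm_pods_on_node() {
  node="$1"
  kubectl get pods -n gpu-operator -l app=nvidia-dcgm \
    --field-selector "spec.nodeName=${node}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}

# Succeed only if at least one DCGM pod on the node is in the Running phase.
dcgm_running_on_node() {
  dcgm_pods_on_node "$1" |
    awk -F'\t' '$2 == "Running" { found = 1 } END { exit !found }'
}
```

Example usage: dcgm_running_on_node <NODE_NAME> && echo "DCGM is running"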

4. Test DCGM Connectivity

Test DCGM connectivity from within the gpu-health-monitor pod:

# Exec into the GPU monitor pod
$ kubectl exec -it -n nvsentinel <GPU_MONITOR_POD> -- /bin/bash

# Inside the pod, for Kubernetes Service Mode, use the service endpoint
$ dcgmi discovery -l --host nvidia-dcgm.gpu-operator.svc:5555

# Inside the pod, for Localhost Mode, use localhost
$ dcgmi discovery -l --host localhost:5555

If the dcgmi commands fail, check that:

The DCGM service exists: kubectl get svc -n gpu-operator | grep dcgm
The DCGM pod is running on the same node as the GPU monitor
Network policies allow traffic from the nvsentinel namespace to the gpu-operator namespace
For Localhost Mode: hostNetwork: true is set in the gpu-health-monitor DaemonSet
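
The last two checks in this list can be sketched as follows. The function names are hypothetical. The hostNetwork check reads the running pod's spec rather than the DaemonSet, which avoids assuming the DaemonSet's name. The NetworkPolicy listing only shows which policies exist, so whether they allow the traffic still has to be reviewed by hand.

```shell
# Succeed if the given GPU monitor pod runs with hostNetwork: true.
# Returns 2 if kubectl itself fails (for example, pod not found).
pod_uses_host_network() {
  pod="$1"
  value="$(kubectl get pod -n nvsentinel "$pod" -o jsonpath='{.spec.hostNetwork}')" || return 2
  [ "$value" = "true" ]
}

# List NetworkPolicy objects in both namespaces for manual review.
list_dcgm_network_policies() {
  for ns in nvsentinel gpu-operator; do
    echo "== ${ns} =="
    kubectl get networkpolicy -n "$ns"
  done
}
```

Example usage: pod_uses_host_network <GPU_MONITOR_POD> || echo "hostNetwork is not enabled"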

5. Verify Resolution

# Check that the condition has cleared
$ kubectl describe node <NODE_NAME> | grep GpuDcgmConnectivityFailure
# Should show: Status: False (or condition absent)

# Watch GPU monitor logs for health checks
$ kubectl logs -n nvsentinel <GPU_MONITOR_POD> -f | grep "Publish DCGM"
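
For unattended verification, the condition check can be polled instead of inspected by hand. This is a minimal sketch under stated assumptions: the function names and the DCGM_POLL_INTERVAL variable are hypothetical, and a condition that is False or absent is treated as cleared, matching the note in the command above.

```shell
# Succeed if GpuDcgmConnectivityFailure is False or absent on the node.
# Returns 2 if kubectl itself fails (for example, node not found).
dcgm_condition_cleared() {
  node="$1"
  status="$(kubectl get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="GpuDcgmConnectivityFailure")].status}')" || return 2
  [ "$status" != "True" ]
}

# Poll until the condition clears. Second argument: number of attempts (default 30).
# DCGM_POLL_INTERVAL sets the delay between attempts in seconds (default 10).
wait_for_dcgm_recovery() {
  node="$1"
  attempts="${2:-30}"
  while [ "$attempts" -gt 0 ]; do
    dcgm_condition_cleared "$node" && return 0
    attempts=$((attempts - 1))
    sleep "${DCGM_POLL_INTERVAL:-10}"
  done
  return 1
}
```

Example usage: wait_for_dcgm_recovery <NODE_NAME> 30 && echo "condition cleared"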