Runbook: Health Monitor UDS Communication Failures

View as Markdown

Overview

Health monitors (GPU, NVSwitch, syslog, CSP) publish events via gRPC over Unix Domain Socket (UDS) to platform-connector. Communication failures block all health event reporting.

Key points:

  • Platform-connector runs as a DaemonSet (one pod per node)
  • Each node has its own UDS socket (/var/run/nvsentinel.sock)
  • Both platform-connector and health monitors must mount /var/run/nvsentinel from hostPath
  • Platform-connector requires MongoDB connection during startup

Symptoms

  • Metric health_events_insertion_to_uds_error or trigger_uds_send_errors_total increasing
  • Health monitor logs show gRPC errors (code 14: Unavailable)
  • No health events in MongoDB despite monitors running

Procedure

1. Identify Affected Node

$# Find which node has the failing health monitor
$kubectl get pods -n nvsentinel -l app.kubernetes.io/name=gpu-health-monitor -o wide
$
$# Check health monitor logs for UDS errors
$kubectl logs -n nvsentinel <HEALTH_MONITOR_POD>

Look for:

  • "code = Unavailable" → Socket closed or platform-connector not running
  • "connection refused" → Socket doesn’t exist
  • "broken pipe" → Socket was closed mid-communication

2. Check Platform Connector on That Node

$# Find platform-connector pod on the same node
$kubectl get pods -n nvsentinel -l app.kubernetes.io/name=nvsentinel -o wide | grep <NODE_NAME>
$
$# Check platform-connector logs
$kubectl logs -n nvsentinel <PLATFORM_CONNECTOR_POD>

Look for errors:

  • "failed to initialize database store connector" → MongoDB connection failed
  • "failed to create database client" → MongoDB authentication or network issue
  • "failed to listen on unix socket" → Volume mount issue

3. Verify MongoDB Connectivity

Platform-connector requires MongoDB connection during startup. If MongoDB is unavailable, platform-connector will fail to start.

$# Check MongoDB pods are running
$kubectl get pods -n nvsentinel -l app.kubernetes.io/name=mongodb
$# All pods should be Running and Ready
$
$# Check certificates (mongo-root-ca, mongo-app-client-cert, mongo-server-cert-*)
$kubectl get certificates -n nvsentinel
$# All should show READY = True
$
$# If certificates not ready, check cert-manager
$kubectl get pods -n cert-manager
$
$# Check MongoDB database creation job
$kubectl get job -n nvsentinel create-mongodb-database
$# Should show COMPLETIONS: 1/1

If the MongoDB job needs to be rerun:

$# Save and recreate the job
$kubectl get job create-mongodb-database -n nvsentinel -o yaml > create-mongodb-database.yaml
$kubectl delete job -n nvsentinel create-mongodb-database
$kubectl apply -f create-mongodb-database.yaml

Platform-connector connects to MongoDB on port 27017 with TLS. Check network policies:

$kubectl get networkpolicies -n nvsentinel -o yaml

4. Verify Volume Mounts

$# Check platform-connector mount
$kubectl get daemonset platform-connectors -n nvsentinel -o yaml | grep -A 3 "/var/run/nvsentinel"
$
$# Check health monitor mount
$kubectl get daemonset gpu-health-monitor -n nvsentinel -o yaml | grep -A 3 "/var/run/nvsentinel"

Both should be mounted from hostPath at /var/run/nvsentinel.

6. Verify Resolution

$# Watch health monitor logs for successful sends
$kubectl logs -n nvsentinel <GPU_MONITOR_POD> -f