Runbook: Health Monitor UDS Communication Failures
Overview
Health monitors (GPU, NVSwitch, syslog, CSP) publish events to platform-connector via gRPC over a Unix Domain Socket (UDS). Communication failures block all health event reporting from the affected node.
Key points:
- Platform-connector runs as a DaemonSet (one pod per node)
- Each node has its own UDS socket (/var/run/nvsentinel.sock)
- Both platform-connector and health monitors must mount /var/run/nvsentinel from hostPath
- Platform-connector requires a MongoDB connection during startup
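A quick way to see which state the socket is in is to probe it directly from the node (or from a pod that mounts the directory). The following is a minimal sketch, not part of the project: the function name probe_uds and its return strings are hypothetical, and it only tests that something accepts connections on the socket, not that the gRPC service behind it is healthy.

```python
import os
import socket
import stat

# Socket path taken from the runbook; pass another path when testing elsewhere.
DEFAULT_SOCKET_PATH = "/var/run/nvsentinel.sock"


def probe_uds(path: str = DEFAULT_SOCKET_PATH, timeout: float = 2.0) -> str:
    """Return 'ok', 'missing', 'not-a-socket', 'refused', or 'error: ...'."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"  # nothing at this path (mount or startup problem)
    except OSError as exc:
        return f"error: {exc.strerror or exc}"
    if not stat.S_ISSOCK(mode):
        return "not-a-socket"  # a regular file or directory sits at the path
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(path)
        except ConnectionRefusedError:
            return "refused"  # stale socket file, no process is listening
        except OSError as exc:
            return f"error: {exc.strerror or exc}"
    return "ok"


if __name__ == "__main__":
    print(probe_uds())
```

A result of "missing" points toward the volume mount or a platform-connector that never started, while "refused" suggests the socket file was left behind by a process that is no longer listening.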
Symptoms
- Metric health_events_insertion_to_uds_error or trigger_uds_send_errors_total increasing
- Health monitor logs show gRPC errors (code 14: Unavailable)
- No health events in MongoDB despite monitors running
Procedure
1. Identify Affected Node
Look for these errors in the health monitor logs:
- "code = Unavailable" → Socket closed or platform-connector not running
- "connection refused" → Socket doesn’t exist, or a stale socket file has no listener
- "broken pipe" → Socket was closed mid-communication
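When many monitor logs have to be scanned, the pattern list above can be applied mechanically. This is an illustrative sketch only: the table and the function classify_monitor_log_line are hypothetical helpers, and the substrings are the ones listed in this step. Because a gRPC error line often contains both "code = Unavailable" and a more specific message, the more specific patterns are checked first.

```python
from typing import Optional

# (substring, likely cause) pairs from step 1, most specific first.
UDS_ERROR_CAUSES = [
    ("connection refused", "socket does not exist or has no listener"),
    ("broken pipe", "socket was closed mid-communication"),
    ("code = Unavailable", "socket closed or platform-connector not running"),
]


def classify_monitor_log_line(line: str) -> Optional[str]:
    """Return the likely cause for a log line, or None if no pattern matches."""
    for needle, cause in UDS_ERROR_CAUSES:
        if needle in line:
            return cause
    return None


if __name__ == "__main__":
    import sys

    # Read log lines from stdin and print only the ones that match.
    for raw in sys.stdin:
        cause = classify_monitor_log_line(raw)
        if cause:
            print(f"{cause}: {raw.rstrip()}")
```

Piping a monitor's log output into the script prints only the matching lines, each prefixed with its probable cause.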
2. Check Platform Connector on That Node
Look for errors:
- "failed to initialize database store connector" → MongoDB connection failed
- "failed to create database client" → MongoDB authentication or network issue
- "failed to listen on unix socket" → Volume mount issue
3. Verify MongoDB Connectivity
Platform-connector requires a MongoDB connection during startup. If MongoDB is unavailable, platform-connector fails to start.
If the MongoDB job needs to be rerun:
Platform-connector connects to MongoDB on port 27017 with TLS. Check that network policies allow this traffic.
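To separate a network-level block from a MongoDB-level problem, it can help to test the port directly from the affected node or pod network. The sketch below is an assumption-laden illustration: probe_mongodb_port is a hypothetical helper, the host name must be supplied by the operator, and certificate verification is deliberately disabled because the goal is only to see whether the TCP connection and TLS handshake get through. A deployment that requires client certificates may still reject the connection later.

```python
import socket
import ssl


def probe_mongodb_port(host: str, port: int = 27017,
                       timeout: float = 3.0, use_tls: bool = True) -> str:
    """Return 'tcp-ok', 'tls-ok (...)', 'unreachable: ...', or 'tls-failed: ...'."""
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        # Refused, timed out, or DNS failure: suspect network policy or service.
        return f"unreachable: {exc}"
    with raw:
        if not use_tls:
            return "tcp-ok"
        ctx = ssl.create_default_context()
        # Assumption: only reachability is tested, so verification is turned off.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        try:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return f"tls-ok ({tls.version()})"
        except OSError as exc:  # ssl.SSLError is a subclass of OSError
            return f"tls-failed: {exc}"


if __name__ == "__main__":
    import sys

    # Usage: python probe.py <mongodb-host>
    print(probe_mongodb_port(sys.argv[1]))
```

An "unreachable" result points toward network policies or the MongoDB service itself, while "tls-failed" means the port is open but the handshake did not complete.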
4. Verify Volume Mounts
Both platform-connector and the health monitors should mount /var/run/nvsentinel from hostPath.
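The mount requirement can also be checked against a pod's JSON manifest. The following is a rough sketch, assuming the manifest follows the standard Kubernetes Pod schema (spec.volumes[].hostPath.path and spec.containers[].volumeMounts[]); the function hostpath_mount_problems is a hypothetical helper, and it assumes the directory is mounted at the same path inside the container as on the host.

```python
import json
import sys
from typing import List

# Directory from the runbook that must be shared via hostPath.
REQUIRED_HOST_PATH = "/var/run/nvsentinel"


def hostpath_mount_problems(pod: dict, required: str = REQUIRED_HOST_PATH) -> List[str]:
    """Return a list of problems; an empty list means the mount looks correct."""
    spec = pod.get("spec", {})
    # Names of volumes backed by a hostPath at the required directory.
    volume_names = {
        v.get("name")
        for v in spec.get("volumes", [])
        if (v.get("hostPath") or {}).get("path", "").rstrip("/") == required
    }
    if not volume_names:
        return [f"no hostPath volume for {required}"]
    # At least one container must mount that volume at the required path.
    for container in spec.get("containers", []):
        for mount in container.get("volumeMounts", []):
            if (mount.get("name") in volume_names
                    and mount.get("mountPath", "").rstrip("/") == required):
                return []
    return [f"no container mounts the hostPath volume at {required}"]


if __name__ == "__main__":
    # Reads one pod manifest in JSON form from stdin.
    for problem in hostpath_mount_problems(json.load(sys.stdin)):
        print(problem)
```

The script expects a single pod manifest on stdin, for example the output of kubectl get pod with -o json for the platform-connector or health monitor pod on the affected node.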