Runbook: Node Condition Update Failures
Overview
Node conditions reflect hardware health status (GPU, NVSwitch). Update failures prevent accurate health reporting and can impact scheduling decisions.
Key points:
- Conditions are updated when fatal events occur
- Update failures block visibility into health status
Symptoms
- Metric nvsentinel_node_condition_update_total{status="failed"} is increasing
- Node conditions don’t reflect current hardware health
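One way to confirm the first symptom is to ask Prometheus how much the failure counter grew over a recent window. The sketch below assumes the metric is scraped by a Prometheus server reachable at PROM_URL (a hypothetical address, not taken from this runbook) and uses the standard /api/v1/query endpoint. The helper names are illustrative.

```shell
# Hypothetical Prometheus address; replace with your own.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Build a PromQL expression for growth of the failure counter over a window.
failed_update_query() {
  local window="${1:-15m}"
  printf 'increase(nvsentinel_node_condition_update_total{status="failed"}[%s])' "$window"
}

# Send the query to the Prometheus HTTP API and print the raw JSON response.
query_failed_updates() {
  curl -fsS --get "$PROM_URL/api/v1/query" \
    --data-urlencode "query=$(failed_update_query "$1")"
}
```

Calling query_failed_updates 15m returns one result per matching series. A value above zero suggests that updates failed within that window.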
Procedure
1. Check Platform Connector Logs
Look for error codes:
- 429 → API server throttling
- 403 → RBAC permission denied
- 404 → Node doesn’t exist
- 409 → Conflict (should auto-resolve with retries)
- Connection refused/timeout → API server unreachable
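The lookup above can be sketched as a small helper plus a log filter. The namespace and label selector are assumptions (this runbook does not name them), and the filter assumes the status code or error text appears verbatim in the log line.

```shell
# Assumed namespace and pod label for the platform connector; adjust to your install.
NAMESPACE="${NAMESPACE:-nvsentinel}"
SELECTOR="${SELECTOR:-app.kubernetes.io/name=platform-connector}"

# Translate an error code from the logs into the likely cause listed above.
classify_error() {
  case "$1" in
    429) echo "API server throttling" ;;
    403) echo "RBAC permission denied" ;;
    404) echo "Node doesn't exist" ;;
    409) echo "Conflict (should auto-resolve with retries)" ;;
    *)   echo "Unknown; check API server reachability" ;;
  esac
}

# Print connector log lines from the last hour that mention a known failure signature.
connector_errors() {
  kubectl logs -n "$NAMESPACE" -l "$SELECTOR" --since=1h --tail=-1 \
    | grep -Ei '(^|[^0-9])(429|403|404|409)([^0-9]|$)|connection refused|timeout'
}
```

A bare three-digit match can also hit unrelated numbers such as durations or byte counts, so treat the filtered lines as candidates to read rather than as a verdict.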
2. Verify API Server is Reachable
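A minimal reachability check, assuming kubectl is configured for the affected cluster, is to query the API server's /readyz endpoint. It runs from wherever you invoke it, so it confirms reachability from your workstation and not necessarily from the connector pod's network. The function name is illustrative.

```shell
# Query the API server readiness endpoint with a short timeout.
# Prints "reachable" if the request succeeds, "unreachable" otherwise.
apiserver_status() {
  if kubectl get --raw='/readyz' --request-timeout=10s >/dev/null 2>&1; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}
```

If this reports unreachable, the remaining steps will also fail, so resolve connectivity first.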
3. Verify RBAC Permissions
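The permission check can be sketched with kubectl auth can-i, impersonating the connector's service account. The namespace and service account name below are assumptions, so substitute the ones your deployment uses. The status subresource is passed with --subresource because can-i would read nodes/status as a node named "status".

```shell
# Assumed namespace and service account; replace with your deployment's values.
NAMESPACE="${NAMESPACE:-nvsentinel}"
SERVICE_ACCOUNT="${SERVICE_ACCOUNT:-platform-connector}"

# Print yes/no for whether the service account may perform a verb on nodes/status.
can_update_node_status() {
  local verb="${1:-patch}"
  kubectl auth can-i "$verb" nodes --subresource=status \
    --as="system:serviceaccount:${NAMESPACE}:${SERVICE_ACCOUNT}"
}
```

Run it once per verb, for example can_update_node_status patch and can_update_node_status update.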
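The permission check (kubectl auth can-i, run as the platform connector’s service account) should return yes. If it returns no, check the ClusterRole bound to that service account.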
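To inspect the role, one approach is to find which ClusterRole is bound to the service account and then print its rules. The ClusterRole name below is hypothetical; take the real name from the binding list.

```shell
# Hypothetical ClusterRole name; take the real one from the binding list.
CLUSTER_ROLE="${CLUSTER_ROLE:-platform-connector}"

# List ClusterRoleBindings with their role and subjects, to locate the role in use.
list_bindings() {
  kubectl get clusterrolebindings -o wide
}

# Print the role's rules so the verbs granted on nodes/status can be read off.
show_cluster_role() {
  kubectl describe clusterrole "$CLUSTER_ROLE"
}
```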
The ClusterRole should include the update and patch verbs for nodes/status.
4. Check Node Exists
If the node was deleted or renamed, updates will fail.
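A minimal existence check, assuming you know the node name from the failing log line, could look like this. The function name is illustrative.

```shell
# Report whether a node object with the given name exists.
# Caveat: "missing" is also printed if the API server cannot be reached,
# so run this only after the reachability check has passed.
node_exists() {
  if kubectl get node "$1" >/dev/null 2>&1; then
    echo "present"
  else
    echo "missing"
  fi
}
```

If the node is reported missing, compare the name against the output of kubectl get nodes to spot a rename.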