Runbook: Node Conditions Without Cordoning
Overview
This runbook handles cases where a node has GPU health conditions but is NOT cordoned. This typically indicates:
- The fault-quarantine CEL configuration is excluding the health event from cordoning
- The node was manually uncordoned by an operator
Procedure
1. Check Node Status
Verify the node has conditions but is schedulable:
The node should show Ready, and SchedulingDisabled should NOT be present in its status.
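As a minimal sketch of this check, the helper below prints the node summary from kubectl get node and then reads .spec.unschedulable, which Kubernetes sets to true on a cordoned node. The function name check_schedulable and the node name in the usage comment are placeholders for illustration, not part of NVSentinel.

```shell
# Hypothetical helper: report whether a node is still schedulable.
check_schedulable() {
  cs_node="$1"
  # Summary line; STATUS reads "Ready,SchedulingDisabled" when the node is cordoned.
  kubectl get node "$cs_node" || return 2
  # .spec.unschedulable is "true" on a cordoned node and empty otherwise.
  cs_flag=$(kubectl get node "$cs_node" -o jsonpath='{.spec.unschedulable}') || return 2
  if [ "$cs_flag" = "true" ]; then
    echo "$cs_node is cordoned"
    return 1
  fi
  echo "$cs_node is schedulable"
}

# Usage (placeholder node name):
#   check_schedulable gpu-node-01
```

The helper returns 0 for a schedulable node, 1 for a cordoned node, and 2 if kubectl itself fails.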
View the conditions:
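One way to list the conditions compactly is a jsonpath query over .status.conditions; kubectl describe node shows the same data in its Conditions section. The helper name list_conditions is illustrative only.

```shell
# Hypothetical helper: print one line per node condition (type, status, reason).
list_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
}

# Usage (placeholder node name):
#   list_conditions gpu-node-01
```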
2. Check for Manual Uncordon Annotation
Check if the node was manually uncordoned:
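The annotation key is not named in this section, so the sketch below takes it as an argument rather than guessing it; use the key documented for your NVSentinel deployment. It searches the node JSON for a quoted key, which is a rough match and could also hit a label with the same name.

```shell
# Hypothetical helper: print the line holding the given annotation key, if any.
# Exits non-zero when the key is not found.
find_annotation() {
  fa_node="$1"
  fa_key="$2"
  # kubectl prints JSON object entries as  "key": "value"
  kubectl get node "$fa_node" -o json | grep -F "\"$fa_key\":"
}

# Usage (both arguments are placeholders):
#   find_annotation gpu-node-01 <manual-uncordon-annotation-key>
```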
If the annotation is present:
The node was manually uncordoned by an operator. When this annotation exists:
- Fault-quarantine will NOT re-cordon the node for the same health event
- Someone has acknowledged the condition and decided to keep the node schedulable
- The health issue may be deemed non-critical or remediation is planned for later
To investigate who uncordoned:
Check Kubernetes audit logs or your organization’s access policy to determine who manually uncordoned the node and why.
3. Check Fault Quarantine Configuration
If no manual uncordon annotation exists, the health event is likely excluded by CEL rules.
Check the fault-quarantine configuration:
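Assuming the configuration is stored in a ConfigMap (as the last step of this runbook suggests), it can be dumped as sketched below. The namespace and ConfigMap name are deployment-specific and are passed as arguments here rather than assumed.

```shell
# Hypothetical helper: dump the fault-quarantine ConfigMap as YAML.
show_quarantine_config() {
  sq_namespace="$1"
  sq_configmap="$2"
  kubectl get configmap "$sq_configmap" -n "$sq_namespace" -o yaml
}

# Usage (placeholders):
#   show_quarantine_config <namespace> <configmap-name>
```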
Look at the rule-sets section. Each ruleset has CEL expressions under match.all or match.any.
Example exclusion patterns:
- A rule that requires event.isFatal == true only matches fatal events. Non-fatal events won’t trigger cordoning.
- A rule that filters out DCGM_FR_CLOCK_THROTTLE_POWER excludes those errors from cordoning.
- A rule that filters on the k8saas.nvidia.com/ManagedByNVSentinel=false label excludes nodes carrying that label.
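To spot patterns like these quickly, the configuration can be filtered for the identifiers mentioned above. This is a rough text search, not a CEL parser, and the namespace and ConfigMap name are placeholders supplied by the caller.

```shell
# Hypothetical helper: show config lines (with line numbers) that mention
# fatality checks, the NVSentinel management label, or DCGM error codes.
list_rule_expressions() {
  lr_namespace="$1"
  lr_configmap="$2"
  kubectl get configmap "$lr_configmap" -n "$lr_namespace" -o yaml |
    grep -n -E 'isFatal|ManagedByNVSentinel|DCGM_FR_'
}

# Usage (placeholders):
#   list_rule_expressions <namespace> <configmap-name>
```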
4. Determine Why the Node Wasn’t Cordoned
Based on the node conditions present, identify the likely reason:
Common reasons for no cordoning:
- Non-fatal event - The health check detected an issue but marked it as non-fatal, and rules require event.isFatal == true
- Error code exclusion - Specific error codes are filtered (e.g., thermal warnings, transient errors)
- Node label exclusion - Node has the k8saas.nvidia.com/ManagedByNVSentinel=false label
- Check name exclusion - Certain health checks are excluded from cordoning rules
Check if the node has the exclusion label:
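A minimal sketch of the label lookup, using kubectl jsonpath with the dots in the label key escaped. The helper name is illustrative; it prints the label value, or nothing if the label is absent.

```shell
# Hypothetical helper: print the value of the ManagedByNVSentinel label.
managed_by_nvsentinel_label() {
  # Dots inside the label key must be escaped in kubectl jsonpath.
  kubectl get node "$1" \
    -o jsonpath='{.metadata.labels.k8saas\.nvidia\.com/ManagedByNVSentinel}'
}

# Usage (placeholder node name):
#   managed_by_nvsentinel_label gpu-node-01
```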
If the value is false, NVSentinel is not managing this node.
5. Determine Next Steps
If the health condition is acceptable:
No action needed. The configuration is working as intended - monitoring the issue without impacting scheduling.
If the node should be cordoned:
Option 1: Update the fault-quarantine configuration to include this event type.
Option 2: Manually cordon the node:
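A manual cordon can be sketched as follows: kubectl cordon marks the node unschedulable, and the follow-up kubectl get node confirms that SchedulingDisabled now appears in the status. The helper name and node name are placeholders.

```shell
# Hypothetical helper: cordon a node, then show its summary to confirm.
cordon_and_confirm() {
  cc_node="$1"
  kubectl cordon "$cc_node" || return 1
  # STATUS should now include SchedulingDisabled.
  kubectl get node "$cc_node"
}

# Usage (placeholder node name):
#   cordon_and_confirm gpu-node-01
```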
Note: Manual cordoning won’t trigger NVSentinel’s remediation flow. For full remediation, the health event needs to match a CEL rule.
If a configuration change is needed:
Update the fault-quarantine ConfigMap and restart the deployment:
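The sketch below assumes the updated ConfigMap is kept as a manifest file and that fault-quarantine runs as a Deployment. The namespace, deployment name, and manifest path are all placeholders passed in by the caller; adjust them to your installation.

```shell
# Hypothetical helper: apply an updated ConfigMap manifest, restart the
# deployment so it reloads the configuration, and wait for the rollout.
apply_config_and_restart() {
  ar_namespace="$1"
  ar_deployment="$2"
  ar_manifest="$3"
  kubectl apply -n "$ar_namespace" -f "$ar_manifest" || return 1
  kubectl rollout restart "deployment/$ar_deployment" -n "$ar_namespace" || return 1
  kubectl rollout status "deployment/$ar_deployment" -n "$ar_namespace"
}

# Usage (placeholders):
#   apply_config_and_restart <namespace> <deployment-name> <configmap-manifest.yaml>
```

Stopping at the first failed command avoids restarting the deployment against a ConfigMap that was never applied.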