Runbook: Cordoned Nodes
Overview
This runbook covers nodes that NVSentinel has cordoned in the cluster. A cordoned node indicates a detected hardware issue that requires investigation and remediation.
Key points:
- Nodes progress through states: quarantined → draining → drain-succeeded → remediating → remediation-succeeded
- Terminal failure states require manual intervention
- Each cordoned node must be individually triaged based on its state
Procedure
1. Identify Cordoned Nodes
Find nodes cordoned by NVSentinel:
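For example (the nvsentinel/state label key below is an assumption; substitute the label key your NVSentinel deployment actually applies):

```shell
# List cordoned (unschedulable) nodes carrying an NVSentinel state label.
# The label key "nvsentinel/state" is an assumption; adjust to your deployment.
kubectl get nodes -l nvsentinel/state \
  -o custom-columns=NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable
```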
If no nodes are returned: The cordoning was done by another system. Do not use this runbook.
2. Save Health Event for Each Cordoned Node
Save the health event details for reference:
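A minimal sketch, assuming the health event is carried on the node object itself (e.g. in annotations); if your deployment stores events elsewhere, dump that source instead:

```shell
NODE=<node-name>   # replace with the cordoned node's name
# Save the full node object, including any NVSentinel annotations that
# carry the health event, for use in later steps.
kubectl get node "$NODE" -o yaml > "${NODE}-health-event.yaml"
```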
This shows:
- errorCode - The problem with the node
- recommendedAction - How to remediate
- Other diagnostic information
3. Check Node State
Check the NVSentinel state label:
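For example (the nvsentinel/state label key is an assumption; use your deployment's actual key):

```shell
# Show the NVSentinel state label as a column alongside the node.
kubectl get node <node-name> -L nvsentinel/state
```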
Possible states:
- quarantined - Fault detected, waiting for drain to start
- draining - Node is being drained
- drain-succeeded - Drain completed, waiting for remediation
- drain-failed - TERMINAL STATE - Drain failed
- remediating - CR creation in progress (transitions quickly)
- remediation-succeeded - CR created, remediation may be in progress
- remediation-failed - TERMINAL STATE - Unsupported action or CR creation failed
- (no label) - Should not be cordoned
Proceed to the appropriate section based on the state.
4a. If State is drain-failed (Terminal State)
Drain failed and no remediation will be attempted. Check for stuck pods:
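For example:

```shell
# List pods still scheduled on the node; anything other than DaemonSet pods
# may be what blocked the drain.
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> -o wide
```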
If pods are stuck:
Check for PodDisruptionBudgets blocking eviction:
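For example:

```shell
# A PDB covering a stuck pod with ALLOWED DISRUPTIONS of 0 will block eviction.
kubectl get pdb --all-namespaces
```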
Force delete stuck pods:
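For example:

```shell
# Force-delete a stuck pod. This skips graceful termination; use with care.
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0
```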
After pods are cleared, manually remediate:
Check the recommended action from the health event:
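A quick way to do this, assuming the health event was saved to a file in step 2 (the field name is an assumption; inspect the saved file for the actual key):

```shell
# Pull the recommended action out of the saved health event.
grep -i recommendedAction <node-name>-health-event.yaml
```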
Follow the remediation action manually (reboot via CSP console, contact support, etc.).
Monitor if the health check starts passing. If it does, the node should be automatically uncordoned. If the health check doesn’t pass, investigate with your organization’s support team.
4b. If State is remediation-failed (Terminal State)
This means either the recommended action is unsupported or CR creation failed.
Check fault-remediation logs:
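For example (the deployment name and namespace are assumptions; adjust to where fault-remediation runs in your install):

```shell
kubectl logs -n nvsentinel deployment/fault-remediation --tail=200
```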
If action is unsupported:
The recommended action cannot be automated. Manually remediate based on the recommended action from the saved health event.
Monitor if the health check starts passing. If it does, the node should be automatically uncordoned. If the health check doesn’t pass, investigate with your organization’s support team.
If CR creation failed:
Investigate why CR creation failed from the logs. Common causes:
- API server issues
- Invalid CR template
After resolving the issue, manually remediate the node.
Monitor if the health check starts passing. If it does, the node should be automatically uncordoned. If the health check doesn’t pass, investigate with your organization’s support team.
4c. If State is quarantined, draining, drain-succeeded, or remediating
Node is in an active remediation flow. Wait for the process to complete.
Monitor state changes:
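For example (the nvsentinel/state label key is an assumption):

```shell
# Watch the node's NVSentinel state label as it progresses through the flow.
kubectl get node <node-name> -L nvsentinel/state -w
```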
4d. If State is remediation-succeeded
This means the CR was created successfully. However, the actual remediation operation (e.g. reboot or reset) may still be in progress or may have failed.
Check if the remediation CR exists and its status:
For COMPONENT_RESET actions:
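For example (the gpureset resource name is an assumption; list your cluster's CRDs with `kubectl get crd` if it differs):

```shell
# Find the GPUReset CR for this node and inspect its status conditions.
kubectl get gpureset --all-namespaces | grep <node-name>
kubectl describe gpureset <cr-name>
```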
For RESTART_VM and RESTART_BM actions:
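For example (the rebootnode resource name is an assumption; verify it against your installed CRDs):

```shell
# Find the RebootNode CR for this node and inspect its status conditions.
kubectl get rebootnode --all-namespaces | grep <node-name>
kubectl describe rebootnode <cr-name>
```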
If CR shows successful completion:
The node should automatically uncordon once health checks pass. Monitor:
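For example:

```shell
# Watch for the node to become schedulable again once health checks pass.
kubectl get node <node-name> -w
```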
If node remains cordoned after 10 minutes:
Check fault-quarantine logs to determine why:
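For example (the deployment name and namespace are assumptions; adjust to your install):

```shell
kubectl logs -n nvsentinel deployment/fault-quarantine --tail=200 | grep <node-name>
```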
If health checks are passing but node remains cordoned, manually uncordon:
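For example:

```shell
kubectl uncordon <node-name>
```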
If CR shows failure or is stuck:
GPUReset:
When processing a GPUReset request, Janitor executes the following steps exposed through status conditions:
- Ready: check that there are no in-progress GPUReset CRs executing against the same node.
- ServicesTornDown: remove gpu-operator services including the nvidia-device-plugin-daemonset, nvidia-dcgm, and nvidia-dcgm-exporter.
- ResetJobCreated: create a privileged job against the given node with the GPU needing reset.
- ResetJobCompleted: wait for the pod from the privileged job to disable persistence mode, reset the target GPU using nvidia-smi, re-enable persistence mode, and write a syslog event indicating the GPU has been reset.
- ServicesRestored: restore gpu-operator services including the nvidia-device-plugin-daemonset, nvidia-dcgm, and nvidia-dcgm-exporter.
- Complete: complete reconciling of the GPUReset CR.
Check Janitor controller-manager logs for why the ServicesTornDown, ResetJobCreated, or ServicesRestored steps failed during processing of a GPUReset. If the ResetJobCompleted step fails, check the logs for the reset pod:
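For example (the Janitor deployment name and namespace are assumptions, as is the reset job's pod naming; locate the pod by node if unsure):

```shell
# Janitor controller-manager logs for ServicesTornDown / ResetJobCreated /
# ServicesRestored failures.
kubectl logs -n janitor deployment/janitor-controller-manager --tail=200

# For ResetJobCompleted failures, find the reset job's pod on the node
# and read its logs.
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> | grep -i reset
kubectl logs -n <namespace> <reset-pod-name>
```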
RebootNode:
When processing a RebootNode request, Janitor executes the following steps exposed through status conditions:
- SignalSent: issue a reboot request against the node for the given CSP.
- NodeReady: wait for the node to return to a Ready status after a reboot is triggered.
Check Janitor controller-manager logs for why the SignalSent or NodeReady steps failed during processing of a RebootNode.
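For example (deployment name and namespace are assumptions):

```shell
kubectl logs -n janitor deployment/janitor-controller-manager --tail=200 | grep -i rebootnode
```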
5. Manual remediations
If a remediation action needs to be retried or executed manually (for example, if the health event published a CONTACT_SUPPORT recommended action), you may need to manually create a RebootNode or GPUReset CR outside of NVSentinel.
WARNING:
- If a node progressed through the draining → drain-succeeded → remediating → remediation-succeeded states but was not automatically uncordoned, it is safe to manually create a new RebootNode CR for the RESTART_VM or RESTART_BM actions, because the node was already fully drained. Similarly, it is safe to create a new GPUReset CR targeting the same node and GPU pair for COMPONENT_RESET actions, because the GPU needing reset was drained of any workload pods assigned to that GPU.
- If an operator wants to remediate the given node with a RebootNode CR and the previous failed remediation action was a GPUReset, the operator must ensure that the node is fully drained, because NVSentinel would have only partially drained the node (only the pods assigned to the individual GPU needing reset). Follow the steps outlined in 4a for handling the drain-failed state to manually execute a full drain against the node before creating a RebootNode.
- If an operator wants to remediate a different GPU than the one targeted in the initial GPUReset CR, the operator must ensure that all pods using the given GPU are drained, or the reset can fail because processes could still hold a handle on the GPU.
To manually trigger a reboot:
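A sketch only: the apiVersion, kind casing, and spec fields below are assumptions, not the actual schema. Check the installed CRD (e.g. `kubectl explain rebootnode`) and mirror an existing CR before applying:

```shell
# Hypothetical RebootNode CR; replace apiVersion/fields with your CRD's schema.
cat <<EOF | kubectl apply -f -
apiVersion: janitor.example.com/v1alpha1   # assumption: use your CRD's group/version
kind: RebootNode
metadata:
  name: reboot-<node-name>
spec:
  nodeName: <node-name>   # field name is an assumption
EOF
```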
To manually trigger a GPU reset:
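Likewise a sketch under the same caveats; verify the GPUReset CRD's schema (e.g. `kubectl explain gpureset`) before applying:

```shell
# Hypothetical GPUReset CR; replace apiVersion/fields with your CRD's schema.
cat <<EOF | kubectl apply -f -
apiVersion: janitor.example.com/v1alpha1   # assumption: use your CRD's group/version
kind: GPUReset
metadata:
  name: gpu-reset-<node-name>
spec:
  nodeName: <node-name>   # field names are assumptions
  gpuID: "0"              # target GPU needing reset
EOF
```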
After creating the maintenance CR, monitor if the health check starts passing. If it does, the node should be automatically uncordoned. If the health check doesn’t pass, investigate with your organization’s support team.
6. Handling False Positives
If you believe the health event is a false positive, review the details saved earlier:
To override the cordoning:
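For example (the nvsentinel/state label key is an assumption; the trailing dash removes the label):

```shell
# Uncordon the node and clear the NVSentinel state label.
kubectl uncordon <node-name>
kubectl label node <node-name> nvsentinel/state-
```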
Report the issue to your organization’s support team for investigation.
7. Verify Node Recovery
After remediation, monitor node status:
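For example:

```shell
# Watch the node until it is Ready and schedulable, then spot-check conditions.
kubectl get node <node-name> -w
kubectl describe node <node-name> | grep -A 8 "Conditions:"
```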