Runbook: Circuit Breaker Troubleshooting

Overview

This runbook guides you through investigating and resetting a tripped circuit breaker. The circuit breaker automatically trips when too many nodes are cordoned within a short time window, blocking new remediation actions until manually reset.

Key points:

  • Circuit breaker does NOT auto-reset - requires manual intervention
  • In-progress remediation continues, but no new actions are taken
  • Only reset after investigating and fixing the root cause

Procedure

1. Verify Status

$ kubectl get cm circuit-breaker -n nvsentinel -o jsonpath='{.data.status}'
  • TRIPPED = protection mode active
  • CLOSED = normal operation
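
If you expect the status to change while you work, polling saves re-running the command by hand. A minimal sketch, assuming the watch utility is available on your workstation:

$ # Re-run the status check every 10 seconds
$ watch -n 10 "kubectl get cm circuit-breaker -n nvsentinel -o jsonpath='{.data.status}'"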

2. Identify Cordoned Nodes

$ # List cordoned nodes
$ kubectl get nodes | grep SchedulingDisabled

$ # Count affected nodes
$ kubectl get nodes | grep SchedulingDisabled | wc -l
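
If you prefer not to depend on grep matching the STATUS column, you can filter on spec.unschedulable directly. A minimal sketch using kubectl's jsonpath support:

$ # Print the name of every cordoned node, one per line
$ kubectl get nodes -o jsonpath='{range .items[?(@.spec.unschedulable==true)]}{.metadata.name}{"\n"}{end}'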

3. Investigate Root Cause

Check each cordoned node:

$ kubectl describe node <NODE_NAME>

Review the Conditions and Events sections for signs of hardware failures.
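
To review the Conditions section for every cordoned node in one pass, the commands above can be combined into a loop. A sketch; the sed range assumes the default kubectl describe layout, where Addresses follows Conditions:

$ for node in $(kubectl get nodes | awk '/SchedulingDisabled/ {print $1}'); do
>   echo "=== ${node} ==="
>   # Print only the Conditions section of the describe output
>   kubectl describe node "${node}" | sed -n '/^Conditions:/,/^Addresses:/p'
> done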

4. Analyze the Pattern

  • Same error on all nodes? → Configuration issue or infrastructure problem
  • Specific node group affected? → Likely a deployment issue or a rack-level infrastructure problem
  • Issues already resolved? → Temporary infrastructure problem
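
Printing cordoned nodes alongside topology labels can make these patterns easier to spot. A sketch, assuming your nodes carry the well-known topology.kubernetes.io/zone and node.kubernetes.io/instance-type labels (your cluster may use different keys, e.g. a rack label):

$ # Add zone and instance-type columns; keep the header plus cordoned rows
$ kubectl get nodes -L topology.kubernetes.io/zone -L node.kubernetes.io/instance-type \
>     | grep -E 'NAME|SchedulingDisabled'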

5. Check for Flapping Health Checks

Query kube-state-metrics data in Prometheus to see historical node conditions:

# Count condition transitions over the last hour (flapping = frequent changes)
changes(kube_node_status_condition{condition=~"Gpu.*|SysLogs.*", status="true"}[1h]) > 0

# View current GPU/Syslog conditions across all nodes
kube_node_status_condition{condition=~"Gpu.*|SysLogs.*", status="true"} == 1

If health checks are flapping, the typical cause is an infrastructure issue that keeps resolving and recurring on its own (network blips, storage latency, etc.). This warrants deeper investigation of the underlying infrastructure before you reset anything.
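
If you'd rather run these queries from a terminal than the Prometheus UI, the HTTP API works too. A sketch, assuming Prometheus runs as service prometheus in namespace monitoring (adjust both to your setup) and that jq is installed:

$ # Forward the Prometheus API to localhost
$ kubectl port-forward svc/prometheus 9090:9090 -n monitoring &
$ # Count condition transitions per node over the last hour
$ curl -s 'http://localhost:9090/api/v1/query' \
>     --data-urlencode 'query=changes(kube_node_status_condition{condition=~"Gpu.*|SysLogs.*", status="true"}[1h]) > 0' \
>     | jq '.data.result[] | {node: .metric.node, transitions: .value[1]}'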

6. Uncordon Affected Nodes

Manually uncordon nodes that are now healthy:

$ kubectl uncordon <NODE_NAME>

Repeat for each affected node.
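
If many nodes are affected and you have confirmed each one is healthy, a loop saves typing. A sketch; note that this uncordons every cordoned node, so review the list from step 2 first:

$ for node in $(kubectl get nodes | awk '/SchedulingDisabled/ {print $1}'); do
>   kubectl uncordon "${node}"
> done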

7. Reset Circuit Breaker

⚠️ Only reset after fixing the root cause

Choose the appropriate reset mode based on your situation:

Option A: Discard accumulated events

Use cursor: CREATE to discard events that accumulated while the circuit breaker was tripped. This prevents immediate re-tripping from stale events.

$ kubectl patch cm circuit-breaker -n nvsentinel \
> -p '{"data":{"status":"CLOSED","cursor":"CREATE"}}'
$ kubectl rollout restart deploy fault-quarantine -n nvsentinel
$ kubectl rollout status deploy fault-quarantine -n nvsentinel

Option B: Process accumulated events

Use cursor: RESUME (or omit cursor) to process all events that occurred while tripped. Only use this if you’re confident the accumulated events won’t cause immediate re-tripping.

$ kubectl patch cm circuit-breaker -n nvsentinel \
> -p '{"data":{"status":"CLOSED","cursor":"RESUME"}}'
$ kubectl rollout restart deploy fault-quarantine -n nvsentinel
$ kubectl rollout status deploy fault-quarantine -n nvsentinel

Note: The cursor automatically resets to RESUME after startup, so you don’t need to change it back manually.
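
After either option, a quick sanity check confirms both fields were updated before you move on to verification:

$ # Print the current status and cursor values from the ConfigMap
$ kubectl get cm circuit-breaker -n nvsentinel \
>     -o jsonpath='status={.data.status} cursor={.data.cursor}{"\n"}'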

8. Verify Normal Operation

Monitor for 15-30 minutes:

$ # Watch circuit breaker status
$ kubectl get cm circuit-breaker -n nvsentinel -o jsonpath='{.data.status}'

$ # Monitor nodes
$ kubectl get nodes -w
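
To avoid watching the terminal for the full window, a small loop can flag a re-trip for you. A minimal sketch covering roughly 30 minutes:

$ # Poll every 20 seconds, 90 times (~30 minutes total)
$ for i in $(seq 1 90); do
>   status=$(kubectl get cm circuit-breaker -n nvsentinel -o jsonpath='{.data.status}')
>   if [ "${status}" = "TRIPPED" ]; then echo "Circuit breaker re-tripped!"; break; fi
>   sleep 20
> done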