Fault Quarantine Configuration
Overview
The Fault Quarantine module isolates nodes with detected hardware or software failures by cordoning and/or tainting them. This document covers all Helm configuration options available for system administrators.
Configuration Reference
Module Enable/Disable
Controls whether the fault-quarantine module is deployed in the cluster.
Note: This module depends on the datastore, so the datastore must also be enabled.
Resources
Defines CPU and memory resource requests and limits for the fault-quarantine pod.
Logging
Sets the verbosity level for fault-quarantine logs.
Label Prefix
Defines the prefix for all node labels that the module creates to track the cordon/uncordon lifecycle.
Generated labels:
<labelPrefix>cordon-by - Service that cordoned the node
<labelPrefix>cordon-reason - Reason for cordoning
<labelPrefix>cordon-timestamp - Cordon timestamp (format: 2006-01-02T15-04-05Z)
<labelPrefix>uncordon-by - Service that uncordoned the node
<labelPrefix>uncordon-timestamp - Uncordon timestamp (format: 2006-01-02T15-04-05Z)
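The timestamp format uses hyphens in place of colons because Kubernetes label values may not contain colons. The following is a minimal sketch of how the cordon labels could be assembled. The helper name `cordon_labels`, the prefix `example.com/`, the service name, and the reason value are hypothetical, and the sketch assumes timestamps are written in UTC, as the trailing `Z` suggests.

```python
from datetime import datetime, timezone

# strftime equivalent of the documented format 2006-01-02T15-04-05Z
LABEL_TIME_FORMAT = "%Y-%m-%dT%H-%M-%SZ"


def cordon_labels(label_prefix, service, reason, when):
    """Build the three cordon labels (hypothetical helper, not module code)."""
    # Assumption: timestamps are normalized to UTC before formatting.
    stamp = when.astimezone(timezone.utc).strftime(LABEL_TIME_FORMAT)
    return {
        f"{label_prefix}cordon-by": service,
        f"{label_prefix}cordon-reason": reason,
        f"{label_prefix}cordon-timestamp": stamp,
    }


labels = cordon_labels(
    "example.com/",        # hypothetical label prefix
    "fault-quarantine",    # hypothetical service name
    "gpu-fatal-error",     # hypothetical reason value
    datetime(2024, 3, 5, 14, 30, 0, tzinfo=timezone.utc),
)
for key, value in labels.items():
    print(f"{key}={value}")
```

The uncordon labels would follow the same pattern with the `uncordon-by` and `uncordon-timestamp` suffixes.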
Circuit Breaker
Prevents too many nodes from being quarantined simultaneously, protecting against cluster-wide cascading failures.
Configuration
Parameters
enabled
Enables or disables circuit breaker protection. When disabled, there is no limit on the number of nodes that can be quarantined.
percentage
Maximum percentage of total cluster nodes that can be quarantined within the time window. When this limit is exceeded, the circuit breaker trips and blocks all new quarantine actions.
duration
Time window for tracking cordon events. The circuit breaker counts unique node cordons within this sliding window.
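The interaction between percentage and duration can be illustrated with a small sliding-window sketch. This is not the module's implementation: the class name, the `allow` method, and the exact trip rule (a cordon is rejected when it would push the number of unique nodes in the window above the limit) are assumptions made for illustration. The sketch also assumes the breaker recovers on its own as old cordon events age out of the window, which the real module may or may not do.

```python
from datetime import datetime, timedelta


class CircuitBreaker:
    """Sliding-window limiter for node cordons (illustrative only)."""

    def __init__(self, percentage, duration, total_nodes):
        self.percentage = percentage    # max % of cluster nodes per window
        self.duration = duration        # window length as a timedelta
        self.total_nodes = total_nodes
        self.cordons = {}               # node name -> latest cordon time

    def allow(self, node, now):
        """Return True and record the cordon if the limit permits it."""
        cutoff = now - self.duration
        # Drop cordon events that have aged out of the window.
        self.cordons = {n: t for n, t in self.cordons.items() if t > cutoff}
        # Count unique nodes in the window, including the candidate.
        unique = set(self.cordons) | {node}
        limit = self.total_nodes * self.percentage / 100
        if len(unique) > limit:
            return False  # breaker tripped: block the new quarantine
        self.cordons[node] = now
        return True


breaker = CircuitBreaker(percentage=20, duration=timedelta(minutes=5), total_nodes=10)
start = datetime(2024, 1, 1, 12, 0, 0)
results = [
    breaker.allow("node-a", start),
    breaker.allow("node-b", start + timedelta(minutes=1)),
    breaker.allow("node-c", start + timedelta(minutes=2)),
    breaker.allow("node-c", start + timedelta(minutes=5, seconds=30)),
]
print(results)  # [True, True, False, True]
```

With 10 nodes and a 20% limit, at most two unique nodes can be cordoned in any five-minute window. The third request is blocked, and the same request succeeds once the cordon of node-a has left the window.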
Configuration Examples
Aggressive:
Conservative:
Disabled:
Rule Sets
Rule sets define conditions for quarantining nodes using CEL expressions. Each rule set specifies match conditions (when to trigger) and actions (what to do).
Rule Set Structure
Parameters
version
Rule set format version for future compatibility.
name
Unique identifier used in logs, metrics, and as part of the cordon-reason label.
priority
Optional integer for resolving conflicts when multiple rule sets apply the same taint key-value pair. Higher values take precedence.
match
Defines conditions that must be satisfied for the rule set to trigger. Supports all (AND) and any (OR) logic.
kind
Specifies the object type that the CEL expression is evaluated against. Valid values: HealthEvent (evaluates against the health event data) or Node (evaluates against the Kubernetes node object).
expression
CEL (Common Expression Language) expression that evaluates to true or false. For the HealthEvent kind, access fields via the event variable. For the Node kind, access fields via the node variable.
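The way match, kind, and expression fit together can be sketched as follows. Python has no CEL evaluator in its standard library, so plain Python callables stand in for CEL expressions here. The function name `evaluate_match`, the event field `isFatal`, and the node label `example.com/managed` are hypothetical, and the sketch assumes a match block contains either all or any, not both.

```python
def evaluate_match(match, event, node):
    """Evaluate a match block against a health event and a node.

    Each condition's "expression" is a Python callable standing in
    for a CEL expression (assumption made for this sketch).
    """

    def check(condition):
        kind = condition["kind"]
        if kind == "HealthEvent":
            return bool(condition["expression"](event))
        if kind == "Node":
            return bool(condition["expression"](node))
        raise ValueError(f"unsupported kind: {kind}")

    if "all" in match:   # AND: every condition must hold
        return all(check(c) for c in match["all"])
    if "any" in match:   # OR: at least one condition must hold
        return any(check(c) for c in match["any"])
    return False


# Hypothetical event and node data for illustration.
event = {"isFatal": True}
node = {"metadata": {"labels": {"example.com/managed": "true"}}}

match_all = {
    "all": [
        {"kind": "HealthEvent", "expression": lambda event: event["isFatal"]},
        {
            "kind": "Node",
            "expression": lambda node: node["metadata"]["labels"].get(
                "example.com/managed"
            ) == "true",
        },
    ]
}
match_any = {
    "any": [
        {"kind": "HealthEvent", "expression": lambda event: not event["isFatal"]},
        {"kind": "Node", "expression": lambda node: "labels" in node["metadata"]},
    ]
}

print(evaluate_match(match_all, event, node))
print(evaluate_match(match_any, event, node))
```

Both calls return True for this data: every condition in the all block holds, and the second condition in the any block holds even though the first does not.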
cordon
Specifies whether to mark the node as unschedulable when the rule matches.
taint
Optional Kubernetes taint to apply. Depending on its effect, a taint can prevent new pods from being scheduled on the node (NoSchedule) or evict pods that are already running there (NoExecute).
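The priority rule described above for conflicting taints can be sketched as below. The function name `resolve_taints`, the rule set names, and the taint key are hypothetical. The sketch assumes that a missing priority counts as 0 and that the first rule set wins on a tie; neither behavior is stated in this document.

```python
def resolve_taints(matched_rule_sets):
    """Pick one rule set per taint key-value pair, preferring higher priority."""
    winners = {}
    for rule_set in matched_rule_sets:
        taint = rule_set.get("taint")
        if taint is None:
            continue  # rule set only cordons, nothing to resolve
        pair = (taint["key"], taint["value"])
        current = winners.get(pair)
        # Assumption: a missing priority is treated as 0; ties keep the first.
        if current is None or rule_set.get("priority", 0) > current.get("priority", 0):
            winners[pair] = rule_set
    return winners


matched = [
    {
        "name": "gpu-warning",
        "priority": 10,
        "taint": {"key": "example.com/gpu", "value": "faulty", "effect": "NoSchedule"},
    },
    {
        "name": "gpu-fatal",
        "priority": 50,
        "taint": {"key": "example.com/gpu", "value": "faulty", "effect": "NoExecute"},
    },
    {"name": "cordon-only", "cordon": True},
]

winners = resolve_taints(matched)
for (key, value), rule_set in winners.items():
    print(f"{key}={value}: {rule_set['name']} ({rule_set['taint']['effect']})")
```

Here both rule sets apply the same key-value pair with different effects, and the rule set with priority 50 determines which effect is applied.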