Fault Quarantine Configuration

Overview

The Fault Quarantine module isolates nodes with detected hardware or software failures by cordoning and/or tainting them. This document covers all Helm configuration options available for system administrators.

Configuration Reference

Module Enable/Disable

Controls whether the fault-quarantine module is deployed in the cluster.

    global:
      faultQuarantine:
        enabled: true

Note: This module depends on the datastore, so make sure the datastore is enabled as well.

Resources

Defines CPU and memory resource requests and limits for the fault-quarantine pod.

    fault-quarantine:
      resources:
        limits:
          cpu: "1"
          memory: "1Gi"
        requests:
          cpu: "1"
          memory: "1Gi"

Logging

Sets the verbosity level for fault-quarantine logs.

    fault-quarantine:
      logLevel: info  # Options: debug, info, warn, error

Label Prefix

Defines the prefix for all node labels that the module creates to track the cordon/uncordon lifecycle.

    fault-quarantine:
      labelPrefix: "k8saas.nvidia.com/"

Generated labels:

  • <labelPrefix>cordon-by - Service that cordoned the node
  • <labelPrefix>cordon-reason - Reason for cordoning
  • <labelPrefix>cordon-timestamp - Cordon timestamp (format: 2006-01-02T15-04-05Z)
  • <labelPrefix>uncordon-by - Service that uncordoned the node
  • <labelPrefix>uncordon-timestamp - Uncordon timestamp (format: 2006-01-02T15-04-05Z)
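The timestamp format above uses hyphens where a standard timestamp would use colons, because Kubernetes label values may not contain colons. The following is a minimal Python sketch of how such a label could be produced and parsed back. The helper names are hypothetical, and treating the trailing `Z` as UTC is an assumption; this is not the module's own code.

```python
from datetime import datetime, timezone

# strftime equivalent of the documented layout 2006-01-02T15-04-05Z.
LABEL_TIME_FORMAT = "%Y-%m-%dT%H-%M-%SZ"


def cordon_timestamp_label(prefix, when):
    """Return (label key, label value) for a cordon timestamp (hypothetical helper)."""
    # Assumption: the trailing "Z" means the value is expressed in UTC.
    utc = when.astimezone(timezone.utc)
    return f"{prefix}cordon-timestamp", utc.strftime(LABEL_TIME_FORMAT)


def parse_label_timestamp(value):
    """Parse a label value back into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(value, LABEL_TIME_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)
```

Tooling that reads these labels can parse them with the same format string, for example to compute how long a node has been cordoned.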

Circuit Breaker

Prevents too many nodes from being quarantined simultaneously, protecting against cluster-wide cascading failures.

Configuration

    fault-quarantine:
      circuitBreaker:
        enabled: true
        percentage: 50
        duration: "5m"

Parameters

enabled

Enables or disables circuit breaker protection. When disabled, there is no limit on how many nodes can be quarantined.

percentage

Maximum percentage of total cluster nodes that can be quarantined within the time window. When exceeded, the circuit breaker trips and blocks all new quarantine actions.

duration

Time window for tracking cordon events. The circuit breaker counts unique node cordons within this sliding window.
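To make the interaction between `percentage` and `duration` concrete, here is a minimal Python sketch of a sliding-window check over unique node cordons. The class and method names are hypothetical, and the exact comparison (strictly greater than the threshold) and expiry behavior are assumptions; the real module may differ, for example in how a tripped breaker is reset.

```python
from collections import deque


class CordonCircuitBreaker:
    """Illustrative sliding-window breaker; not the module's implementation."""

    def __init__(self, total_nodes, percentage, window_seconds):
        self.total_nodes = total_nodes
        self.percentage = percentage
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, node_name), oldest first

    def _nodes_in_window(self, now):
        # Drop cordon events that have fallen out of the sliding window.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()
        return {node for _, node in self._events}

    def allow(self, node, now):
        """Return True if cordoning `node` keeps the cluster within the limit."""
        nodes = self._nodes_in_window(now)
        nodes.add(node)  # unique nodes only: a repeat cordon adds nothing
        # Assumption: the breaker blocks once the share exceeds `percentage`.
        if len(nodes) * 100 > self.percentage * self.total_nodes:
            return False
        self._events.append((now, node))
        return True
```

With 10 nodes, `percentage: 50`, and a 5-minute window, this sketch admits five distinct nodes and rejects a sixth until earlier cordons age out of the window.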

Configuration Examples

Aggressive (the breaker trips once more than 20% of nodes are cordoned within 10 minutes):

    circuitBreaker:
      enabled: true
      percentage: 20
      duration: "10m"

Conservative (the breaker trips only once more than 75% of nodes are cordoned within 3 minutes):

    circuitBreaker:
      enabled: true
      percentage: 75
      duration: "3m"

Disabled (no limit on quarantined nodes):

    circuitBreaker:
      enabled: false

Rule Sets

Rule sets define conditions for quarantining nodes using CEL expressions. Each rule set specifies match conditions (when to trigger) and actions (what to do).

Rule Set Structure

    fault-quarantine:
      ruleSets:
        - version: "1"
          name: "ruleset-name"
          priority: 100

          match:
            all:
              - kind: "HealthEvent"
                expression: "event.agent == 'gpu-health-monitor' && event.componentClass == 'GPU' && event.isFatal == true"
              - kind: "Node"
                expression: |
                  !('k8saas.nvidia.com/ManagedByNVSentinel' in node.metadata.labels && node.metadata.labels['k8saas.nvidia.com/ManagedByNVSentinel'] == "false")

            any:
              - kind: "HealthEvent"
                expression: "event.agent == 'syslog-health-monitor' && event.componentClass == 'GPU' && event.isFatal == true"

          cordon:
            shouldCordon: true

          taint:
            key: "nvidia.com/gpu-error"
            value: "fatal"
            effect: "NoSchedule"

Parameters

version

Rule set format version for future compatibility.

name

Unique identifier used in logs, metrics, and as part of the cordon-reason label.

priority

Optional integer for resolving conflicts when multiple rule sets apply the same taint key-value pair. Higher values take precedence.
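As an illustration of the precedence rule, the following Python sketch picks one winning rule set per taint key-value pair. The function name and the dictionary layout are hypothetical. Treating a missing `priority` as 0 and keeping the first rule set on a tie are assumptions, since neither case is specified here.

```python
def resolve_taints(matched_rule_sets):
    """Return {(key, value): effect}, keeping the highest-priority rule set per pair."""
    winners = {}
    for rule_set in matched_rule_sets:
        taint = rule_set.get("taint")
        if taint is None:
            continue  # taint is optional; cordon-only rule sets are skipped here
        pair = (taint["key"], taint["value"])
        current = winners.get(pair)
        # Assumption: a missing priority counts as 0, and ties keep the first match.
        if current is None or rule_set.get("priority", 0) > current.get("priority", 0):
            winners[pair] = rule_set
    return {pair: rs["taint"]["effect"] for pair, rs in winners.items()}
```

If two matching rule sets both apply `nvidia.com/gpu-error=fatal` but with different effects, only the effect from the rule set with the higher `priority` is kept in this sketch.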

match

Defines conditions that must be satisfied for the rule set to trigger. Supports all (AND) and any (OR) logic.

kind

Specifies the object type that the CEL expression is evaluated against. Valid values: HealthEvent (evaluates against the health event data) or Node (evaluates against the Kubernetes Node object).

expression

CEL (Common Expression Language) expression that must evaluate to true or false. For the HealthEvent kind, access fields through the event variable. For the Node kind, access fields through the node variable.
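The sketch below mirrors, in plain Python, how `all` and `any` blocks combine per-kind expressions, and what the Node expression from the rule set structure above accepts. The function names and the dictionary-based event and node shapes are hypothetical stand-ins for the CEL `event` and `node` variables; the module itself evaluates CEL, not Python.

```python
MANAGED_LABEL = "k8saas.nvidia.com/ManagedByNVSentinel"


def node_is_managed(node):
    """Python mirror of the Node expression: reject only an explicit "false" label."""
    labels = node["metadata"]["labels"]
    return not (MANAGED_LABEL in labels and labels[MANAGED_LABEL] == "false")


def gpu_fatal(event):
    """Python mirror of the gpu-health-monitor HealthEvent expression."""
    return (
        event["agent"] == "gpu-health-monitor"
        and event["componentClass"] == "GPU"
        and event["isFatal"] is True
    )


def evaluate(conditions, mode, event, node):
    """Evaluate conditions with AND semantics for "all" and OR semantics for "any"."""
    results = [
        cond["predicate"](event if cond["kind"] == "HealthEvent" else node)
        for cond in conditions
    ]
    return all(results) if mode == "all" else any(results)
```

Note that a node without the label, or with the label set to any value other than "false", still satisfies the Node expression; only an explicit "false" opts the node out.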

cordon

Specifies whether to mark the node as unschedulable when the rule matches.

taint

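Optional Kubernetes taint to apply. The effect determines the behavior: NoSchedule and PreferNoSchedule affect only the scheduling of new pods, while NoExecute also evicts running pods that do not tolerate the taint.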

Example Rule Sets

Example 1: Fatal GPU errors from the GPU health monitor, on nodes that are not labeled k8saas.nvidia.com/ManagedByNVSentinel=false

    ruleSets:
      - version: "1"
        name: "GPU fatal error ruleset"
        match:
          all:
            - kind: "HealthEvent"
              expression: "event.agent == 'gpu-health-monitor' && event.componentClass == 'GPU' && event.isFatal == true"
            - kind: "Node"
              expression: |
                !('k8saas.nvidia.com/ManagedByNVSentinel' in node.metadata.labels &&
                node.metadata.labels['k8saas.nvidia.com/ManagedByNVSentinel'] == "false")
        cordon:
          shouldCordon: true

Example 2: Fatal syslog errors excluding XID 45, on nodes that are not labeled k8saas.nvidia.com/ManagedByNVSentinel=false

    ruleSets:
      - version: "1"
        name: "Syslog fatal error ruleset"
        match:
          all:
            - kind: "HealthEvent"
              expression: |
                event.agent == 'syslog-health-monitor' &&
                event.componentClass == 'GPU' &&
                event.isFatal == true &&
                (event.errorCode == null || !event.errorCode.exists(e, e == '45'))
            - kind: "Node"
              expression: |
                !('k8saas.nvidia.com/ManagedByNVSentinel' in node.metadata.labels &&
                node.metadata.labels['k8saas.nvidia.com/ManagedByNVSentinel'] == "false")
        cordon:
          shouldCordon: true
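To check which events the XID 45 exclusion in Example 2 lets through, the HealthEvent expression can be mirrored in plain Python. This is a sketch under assumptions: the event is modeled as a dictionary, `errorCode` is assumed to be a list of strings (as the `exists` macro suggests), and a missing field is treated like null.

```python
def syslog_fatal_excluding_xid45(event):
    """Python mirror of the Example 2 HealthEvent expression (illustrative only)."""
    codes = event.get("errorCode")  # assumption: a missing field behaves like null
    return (
        event["agent"] == "syslog-health-monitor"
        and event["componentClass"] == "GPU"
        and event["isFatal"] is True
        # Mirrors: errorCode == null || !errorCode.exists(e, e == '45')
        and (codes is None or not any(code == "45" for code in codes))
    )
```

One consequence worth noting: an event that carries XID 45 together with other codes is also excluded, because the check rejects any event whose code list contains '45'.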