Fault Remediation Configuration
Overview
The Fault Remediation module creates maintenance Custom Resources (CRs) that trigger external repair systems to fix faulty nodes. This document covers all Helm configuration options and extension points for system administrators.
Configuration Reference
Module Enable/Disable
Controls whether the fault-remediation module is deployed in the cluster.
Note: This module consumes the results of fault-quarantine and node-drainer, and it requires the datastore. Make sure the datastore and both of those modules are enabled as well.
Resources
Defines CPU and memory resource requests and limits for the fault-remediation pod.
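As a minimal sketch, the fragment below uses the standard Kubernetes requests/limits layout. The parent key `fault-remediation` and all numbers are assumptions for illustration, not chart defaults; check the chart's values.yaml for the actual path.

```yaml
# Hypothetical values.yaml fragment; the parent key and the numbers are assumptions.
fault-remediation:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 256Mi
```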
Logging
Sets the verbosity level for fault-remediation logs.
Maintenance Resource Configuration
Defines the Custom Resource that will be created to trigger remediation actions.
Configuration Structure
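A minimal sketch of how the parameters in this section might fit together. Only the paths maintenance.apiGroup, maintenance.version, maintenance.kind, and maintenance.namespace are confirmed by the template variable list; the placement of the other keys under `maintenance` and every value shown are assumptions.

```yaml
# Hypothetical sketch; key nesting beyond maintenance.{apiGroup,version,kind,namespace}
# and every value below are assumptions, not chart defaults.
maintenance:
  apiGroup: maintenance.example.com   # placeholder API group of your operator's CRD
  version: v1alpha1                   # placeholder CRD version
  kind: NodeMaintenance               # placeholder Kind
  scope: Namespaced                   # assumed spelling of the namespaced option
  namespace: maintenance-system       # placeholder namespace for created CRs
  completeConditionType: Completed    # placeholder condition name set by your operator
  equivalenceGroup: restart           # group name borrowed from the RESTART_VM example
```

The supersedingEquivalenceGroups, impactedEntityScope, and templates keys are left out of this sketch because their exact value shapes are not specified here.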
Parameters
apiGroup
API group of the maintenance CRD installed by your maintenance operator.
version
API version of the maintenance CRD.
kind
Kubernetes Kind of the maintenance CRD.
scope
Determines whether the maintenance CRD is cluster-scoped or namespaced.
completeConditionType
Status condition name to check for maintenance completion. Used to prevent duplicate CRs when multiple faults occur on the same node. If condition status is True, maintenance is complete.
namespace
Kubernetes namespace where maintenance CRs will be created.
equivalenceGroup
Defines which remediation actions are considered equivalent for deduplication. Actions in the same group deduplicate against each other, regardless of CRD type, while a previously created CR is still in a non-terminal state.
supersedingEquivalenceGroups
Defines additional equivalence groups that are also considered equivalent for deduplication. For example, the COMPONENT_RESET action in the reset group should be deduplicated against the RESTART_VM action in the restart group: rebooting a node has the same effect as resetting a GPU, whereas the inverse is not true.
impactedEntityScope
For the COMPONENT_RESET action, the impacted entity scope should be defined so that there’s a unique equivalence group for each entity. The unique equivalence group is constructed by appending the value for the given impacted entity to the equivalence group name. For example, each GPU needing reset will be in its own equivalence group named like reset-<GPU_UUID>.
templates
Go template that generates the maintenance CR YAML. See Template Extension Point section below.
Template Extension Point
The maintenance template is a Go template that generates the Kubernetes CR YAML for remediation actions.
Available Template Variables
- .NodeName (string): Name of the node requiring maintenance
- .HealthEventID (string): Unique ID of the triggering health event
- .HealthEvent (HealthEvent): The entire content of the triggering health event
- .RecommendedAction (int): Numeric action code from the health event (see health_event.proto)
- .RecommendedActionName (string): Action name from the health event
- .ImpactedEntityScopeValue (string): The GPU_UUID used in COMPONENT_RESET remediation actions
- .ApiGroup (string): Value from maintenance.apiGroup
- .Version (string): Value from maintenance.version
- .Kind (string): Value from maintenance.kind
- .Namespace (string): Value from maintenance.namespace
Template Examples
Example 1: Basic Reboot Template
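A minimal sketch of a reboot template that uses only the documented template variables. The `spec` fields (`nodeName`, `action`) and the `maintenance-` name prefix are hypothetical, since the real schema is defined by your maintenance operator's CRD.

```yaml
# Go template that renders a maintenance CR.
# The spec fields are hypothetical; use the schema of your own CRD.
apiVersion: "{{ .ApiGroup }}/{{ .Version }}"
kind: "{{ .Kind }}"
metadata:
  # Node name plus health event ID keeps the CR name unique per fault.
  name: "maintenance-{{ .NodeName }}-{{ .HealthEventID }}"
  namespace: "{{ .Namespace }}"
spec:
  nodeName: "{{ .NodeName }}"
  action: reboot
```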
Example 2: Template with Conditional Logic
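A sketch of a template that branches on the recommended action. It compares .RecommendedActionName rather than the numeric .RecommendedAction, because the numeric codes are defined in health_event.proto and are not listed here. The `spec` fields and the assumption that the action name is rendered exactly as `COMPONENT_RESET` are illustrative.

```yaml
# Go template with a branch on the recommended action.
# The spec fields and the "reset"/"reboot" values are hypothetical.
apiVersion: "{{ .ApiGroup }}/{{ .Version }}"
kind: "{{ .Kind }}"
metadata:
  name: "maintenance-{{ .NodeName }}-{{ .HealthEventID }}"
  namespace: "{{ .Namespace }}"
spec:
  nodeName: "{{ .NodeName }}"
{{- if eq .RecommendedActionName "COMPONENT_RESET" }}
  # Reset only the affected GPU.
  action: reset
  gpuUUID: "{{ .ImpactedEntityScopeValue }}"
{{- else }}
  # Any other action falls back to a node reboot in this sketch.
  action: reboot
{{- end }}
```

The `{{-` trim markers remove the newline before each control line, so the rendered YAML contains no blank lines inside `spec`.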
Template Guidelines
- Unique Names: Use .NodeName and .HealthEventID in the CR name to ensure uniqueness
- Owner Reference: The module automatically adds the Node as owner for automatic cleanup
- Action Codes: Use conditional logic based on .RecommendedAction for different repair types
Update Retry Configuration
Controls retry behavior when updating node annotations after creating maintenance CRs.
Parameters
maxRetries
Maximum number of retry attempts if annotation updates fail due to conflicts or network issues.
retryDelaySeconds
Base delay in seconds between retry attempts. Uses exponential backoff.
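A minimal sketch of the two retry parameters. The parent key `updateRetry` and both values are assumptions. The exact backoff factor is not specified here; if the delay doubled on each attempt, a 10-second base would give waits of roughly 10s, 20s, 40s, and so on.

```yaml
# Hypothetical fragment; the parent key and values are assumptions.
updateRetry:
  maxRetries: 5          # at most five retries after a failed annotation update
  retryDelaySeconds: 10  # base delay; later attempts wait longer (exponential backoff)
```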
Log Collector Configuration
Optionally collects diagnostic logs from nodes before remediation.
Configuration
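A minimal sketch of a log collector block built from the parameters documented in this section. The parent key `logCollector`, the image name, the upload URL, and all values are placeholders and assumptions, not chart defaults.

```yaml
# Hypothetical fragment; the parent key, image name, URL, and values are assumptions.
logCollector:
  enabled: true
  image:
    repository: registry.example.com/log-collector   # placeholder image
    pullPolicy: IfNotPresent
  uploadURL: http://log-upload.example.com/upload     # placeholder endpoint
  gpuOperatorNamespaces: "gpu-operator"               # comma-separated list
  enableGcpSosCollection: false
  enableAwsSosCollection: false
  timeout: 10m                                        # assumed duration format
```

The `env` parameter is omitted from this sketch because its value shape (list or map) is not specified here.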
Parameters
enabled
Enable or disable automatic log collection before creating maintenance CRs.
image.repository
Container image for the log collector.
image.pullPolicy
Pull policy for the log collector image.
uploadURL
HTTP endpoint where collected logs will be uploaded.
gpuOperatorNamespaces
Comma-separated list of namespaces containing GPU operator components for log collection.
enableGcpSosCollection
Enable collection of GCP-specific SOS reports.
enableAwsSosCollection
Enable collection of AWS-specific SOS reports.
timeout
Maximum time to wait for log collection job to complete.
env
Additional environment variables to pass to the log collector container.
Integration with External Operators
The fault-remediation module is designed to integrate with external maintenance operators:
- CR Creation: Fault-remediation creates a maintenance CR based on your template
- Operator Detection: Your maintenance operator watches for new CRs
- Remediation Execution: Operator performs the actual remediation (reboot, terminate, etc.)
- Status Update: Operator updates the CR status with completion/failure information
- Completion Detection: Fault-remediation checks completeConditionType to detect completion
Operator Requirements
Your maintenance operator must:
- Watch for CRs matching your configured apiGroup, version, and kind
- Update CR status with a condition matching completeConditionType
- Set condition status to True on success, False on failure
- Handle node reboots, terminations, or other remediation actions
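As an illustration of the status requirement above, the fragment below shows what a hypothetical operator might write to a maintenance CR after a successful repair. It assumes the operator follows the standard Kubernetes condition layout and that completeConditionType is set to `Completed`; the reason and message strings are invented for the example.

```yaml
# Status written by a hypothetical maintenance operator after a successful repair.
# "Completed" stands in for whatever completeConditionType is configured.
status:
  conditions:
    - type: Completed
      status: "True"            # "False" would signal failure
      reason: RebootSucceeded   # illustrative reason
      message: Node rebooted and passed health checks
      lastTransitionTime: "2024-01-01T00:00:00Z"
```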