The Fault Remediation module is NVSentinel’s bridge to external repair systems. After a node has been quarantined and drained, this module creates maintenance requests that trigger break-fix workflows - such as node reboots, hardware replacements, or cloud provider interventions.
Think of it as a dispatch coordinator - similar to how a facility manager calls in specialists when equipment needs repair, Fault Remediation notifies your maintenance systems that a node is ready for servicing.
After NVSentinel isolates a faulty node and evacuates workloads, the hardware needs to be fixed:
The Fault Remediation module creates Kubernetes Custom Resources (CRDs) that external operators (like Janitor) watch and act upon to perform the actual repair work.
The Fault Remediation module watches the datastore for drained nodes that need repair:
External operators watch for these CRs and perform the actual maintenance work (reboot, terminate, replace, etc.).
Configure the Fault Remediation module through Helm values:
Flexible Go template system to match your maintenance operator:
Only creates requests when needed:
Updates node labels throughout remediation lifecycle:
remediating: Maintenance request createdremediation-succeeded: Maintenance completedremediation-failed: Maintenance encountered errorsGather diagnostics before remediation for troubleshooting and root cause analysis.
The Fault Remediation module creates CRDs consumed by external operators:
Janitor Operator: Watches for maintenance CRDs and performs cloud provider API calls to reboot/terminate nodes. Custom Break-Fix Systems: Define custom CRD schemas and deploy operators to integrate with your own maintenance systems. Manual Workflow Systems: Deploy a controller that creates tickets from CRs for manual processing in your ticketing system.