Fault Remediation Configuration

Overview

The Fault Remediation module creates maintenance Custom Resources (CRs) that trigger external repair systems to fix faulty nodes. This document covers all Helm configuration options and extension points for system administrators.

Configuration Reference

Module Enable/Disable

Controls whether the fault-remediation module is deployed in the cluster.

```yaml
global:
  faultRemediation:
    enabled: true
```

Note: This module consumes results from the fault-quarantine and node-drainer modules and requires the datastore. Ensure the datastore and both of those modules are enabled.
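As an illustrative sketch, a values file satisfying these dependencies might look like the following. The key names for the companion modules are assumptions based on this module's own key; verify the exact paths against your chart's values.yaml.

```yaml
global:
  datastore:
    enabled: true         # required backing store
  faultQuarantine:
    enabled: true         # assumed key name; verify against your chart
  nodeDrainer:
    enabled: true         # assumed key name; verify against your chart
  faultRemediation:
    enabled: true
```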

Resources

Defines CPU and memory resource requests and limits for the fault-remediation pod.

```yaml
fault-remediation:
  resources:
    limits:
      cpu: "200m"
      memory: "300Mi"
    requests:
      cpu: "200m"
      memory: "300Mi"
```

Logging

Sets the verbosity level for fault-remediation logs.

```yaml
fault-remediation:
  logLevel: info  # Options: debug, info, warn, error
```

Maintenance Resource Configuration

Defines the Custom Resource that will be created to trigger remediation actions.

Configuration Structure

```yaml
fault-remediation:
  maintenance:
    actions:
      "COMPONENT_RESET":
        apiGroup: "janitor.dgxc.nvidia.com"
        version: "v1alpha1"
        kind: "GPUReset"
        scope: "Cluster"
        completeConditionType: "Complete"
        templateFileName: "gpureset-template.yaml"
        equivalenceGroup: "reset"
        supersedingEquivalenceGroups: ["restart"]
        impactedEntityScope: "GPU_UUID"

    templates:
      "gpureset-template.yaml": |
        apiVersion: {{ .ApiGroup }}/{{ .Version }}
        kind: GPUReset
        metadata:
          name: maintenance-{{ .HealthEvent.NodeName }}-{{ .HealthEventID }}
        spec:
          nodeName: {{ .HealthEvent.NodeName }}
          selector:
            uuids:
              - {{ .ImpactedEntityScopeValue }}
```

Parameters

apiGroup

API group of the maintenance CRD installed by your maintenance operator.

version

API version of the maintenance CRD.

kind

Kubernetes Kind of the maintenance CRD.

scope

Determines whether the maintenance CRD is cluster-scoped or namespaced.

completeConditionType

Name of the status condition to check for maintenance completion. This prevents duplicate CRs when multiple faults occur on the same node: if the condition's status is True, the maintenance is considered complete.

namespace

Kubernetes namespace where maintenance CRs will be created.

equivalenceGroup

Defines which remediation actions are considered equivalent for deduplication. Actions in the same group will deduplicate against each other regardless of CRD type if a previous CRD is in a non-terminal state.

supersedingEquivalenceGroups

Defines additional equivalence groups whose pending actions also count as equivalent for deduplication. For example, the COMPONENT_RESET action in the reset group is deduplicated against the RESTART_VM action in the restart group, because rebooting a node has the same effect as resetting a GPU; the inverse does not hold, so RESTART_VM is not deduplicated against pending GPU resets.
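The asymmetry can be expressed directly in the action configuration. This sketch reuses the action names and groups from the examples in this document:

```yaml
maintenance:
  actions:
    "COMPONENT_RESET":
      equivalenceGroup: "reset"
      # A pending RESTART_VM (group "restart") also satisfies a GPU reset,
      # so new COMPONENT_RESET CRs are deduplicated against it.
      supersedingEquivalenceGroups: ["restart"]
    "RESTART_VM":
      equivalenceGroup: "restart"
      # No supersedingEquivalenceGroups here: a pending GPU reset does
      # not substitute for a full node restart.
```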

impactedEntityScope

For the COMPONENT_RESET action, the impacted entity scope should be defined so that there’s a unique equivalence group for each entity. The unique equivalence group is constructed by appending the value for the given impacted entity to the equivalence group name. For example, each GPU needing reset will be in its own equivalence group named like reset-<GPU_UUID>.

templates

Map of Go templates, keyed by file name, that generate the maintenance CR YAML. See the Template Extension Point section below.

Template Extension Point

The maintenance template is a Go template that generates the Kubernetes CR YAML for remediation actions.

Available Template Variables

  • .NodeName (string) - Name of the node requiring maintenance
  • .HealthEventID (string) - Unique ID of the triggering health event
  • .HealthEvent (HealthEvent) - The entire content of the triggering health event
  • .RecommendedAction (int) - Numeric action code from health event (see health_event.proto)
  • .RecommendedActionName (string) - Action name from the health event
  • .ImpactedEntityScopeValue (string) - The GPU_UUID used in COMPONENT_RESET remediation actions
  • .ApiGroup (string) - Value from maintenance.apiGroup
  • .Version (string) - Value from maintenance.version
  • .Kind (string) - Value from maintenance.kind
  • .Namespace (string) - Value from maintenance.namespace

Template Examples

Example 1: Basic Reboot Template

```yaml
maintenance:
  actions:
    "RESTART_VM":
      apiGroup: "janitor.dgxc.nvidia.com"
      version: "v1alpha1"
      kind: "RebootNode"
      scope: "Cluster"
      completeConditionType: "NodeReady"
      templateFileName: "rebootnode-template.yaml"
      equivalenceGroup: "restart"

  templates:
    "rebootnode-template.yaml": |
      apiVersion: janitor.dgxc.nvidia.com/v1alpha1
      kind: RebootNode
      metadata:
        name: maintenance-{{ .NodeName }}-{{ .HealthEventID }}
      spec:
        nodeName: {{ .NodeName }}
```

Example 2: Template with Conditional Logic

```yaml
maintenance:
  actions:
    "RESTART_VM":
      apiGroup: "maintenance.example.com"
      version: "v1"
      kind: "NodeMaintenance"
      scope: "Cluster"
      completeConditionType: "NodeReady"
      templateFileName: "maintenance-template.yaml"
      equivalenceGroup: "maintenance"

  templates:
    "maintenance-template.yaml": |
      apiVersion: maintenance.example.com/v1
      kind: NodeMaintenance
      metadata:
        name: maintenance-{{ .NodeName }}-{{ .HealthEventID }}
      spec:
        nodeName: {{ .NodeName }}
        {{- if eq .RecommendedAction 15 }}
        action: reboot
        {{- else if eq .RecommendedAction 25 }}
        action: terminate
        {{- else }}
        action: investigate
        {{- end }}
```

Template Guidelines

  1. Unique Names: Use .NodeName and .HealthEventID in CR name to ensure uniqueness
  2. Owner Reference: The module automatically adds the Node as owner for automatic cleanup
  3. Action Codes: Use conditional logic based on .RecommendedAction for different repair types

Update Retry Configuration

Controls retry behavior when updating node annotations after creating maintenance CRs.

```yaml
fault-remediation:
  updateRetry:
    maxRetries: 5
    retryDelaySeconds: 10
```

Parameters

maxRetries

Maximum number of retry attempts if annotation updates fail due to conflicts or network issues.

retryDelaySeconds

Base delay in seconds between retry attempts. Delays grow with exponential backoff, so each successive retry waits longer than the last (e.g., with a base of 10 seconds and a backoff factor of 2: 10s, 20s, 40s, ...).

Log Collector Configuration

Optionally collects diagnostic logs from nodes before remediation.

Configuration

```yaml
fault-remediation:
  logCollector:
    enabled: false
    image:
      repository: ghcr.io/nvidia/nvsentinel/log-collector
      pullPolicy: IfNotPresent
    uploadURL: "http://nvsentinel-incluster-file-server.nvsentinel.svc.cluster.local/upload"
    gpuOperatorNamespaces: "gpu-operator"
    enableGcpSosCollection: false
    enableAwsSosCollection: false
    timeout: "10m"
    env: {}
```

Parameters

enabled

Enable or disable automatic log collection before creating maintenance CRs.

image.repository

Container image for the log collector.

image.pullPolicy

Pull policy for the log collector image.

uploadURL

HTTP endpoint where collected logs will be uploaded.

gpuOperatorNamespaces

Comma-separated list of namespaces containing GPU operator components for log collection.

enableGcpSosCollection

Enable collection of GCP-specific SOS reports.

enableAwsSosCollection

Enable collection of AWS-specific SOS reports.

timeout

Maximum time to wait for log collection job to complete.

env

Additional environment variables to pass to the log collector container.
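For example, to inject extra environment variables into the collector. The variable names and values below are placeholders for illustration, not documented settings:

```yaml
fault-remediation:
  logCollector:
    env:
      HTTP_PROXY: "http://proxy.internal:3128"  # placeholder value
      EXTRA_DEBUG: "true"                       # placeholder value
```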

Integration with External Operators

The fault-remediation module is designed to integrate with external maintenance operators:

  1. CR Creation: Fault-remediation creates a maintenance CR based on your template
  2. Operator Detection: Your maintenance operator watches for new CRs
  3. Remediation Execution: Operator performs the actual remediation (reboot, terminate, etc.)
  4. Status Update: Operator updates the CR status with completion/failure information
  5. Completion Detection: Fault-remediation checks completeConditionType to detect completion

Operator Requirements

Your maintenance operator must:

  • Watch for CRs matching your configured apiGroup, version, and kind
  • Update CR status with a condition matching completeConditionType
  • Set condition status to True on success, False on failure
  • Handle node reboots, terminations, or other remediation actions

Example Operator Status Update

```yaml
status:
  conditions:
    - type: NodeReady
      status: "True"
      reason: RebootComplete
      message: Node successfully rebooted and returned to Ready state
      lastTransitionTime: "2025-11-28T10:30:00Z"
```