NVSentinel Integration Guide

NVSentinel detects GPU and hardware failures and exposes them using standard Kubernetes primitives. This document provides a high-level overview of how to integrate with NVSentinel for scheduling, monitoring, and remediation purposes.

Integration Model

Think of NVSentinel integration in five layers:

  1. Is a node bad? → Check Taints

    • Taints mark nodes with hardware issues
    • Use taints for scheduling decisions and filtering
    • React to taint presence/absence in automation
  2. Why is a node bad? → Check Node Conditions

    • Conditions provide detailed diagnostic information
    • Use conditions for monitoring, alerting, and dashboards
    • Each condition explains what hardware component failed
  3. Can I use my own remediation? → Provide a Custom Resource

    • NVSentinel triggers external systems via CRs
    • Integrate with cloud APIs, DCIM, or custom controllers
    • You retain full control over how nodes are repaired
  4. How do I customize drain behavior? → Configure per-namespace eviction modes

    • Control how workloads are evicted from failing nodes
    • Define different policies for stateless vs stateful workloads
    • Set timeouts and grace periods per namespace
  5. Should GPU pods run diagnostics before the workload starts? → Enable Preflight

    • Opt-in per namespace; webhook injects init-container checks (DCGM, optional NCCL)
    • Multi-node jobs use gang discovery (native Workload API or PodGroup-style schedulers like Volcano and Run:ai)
    • Separate from the MongoDB health-event pipeline (see Data flow)

Quick Start

For Scheduling Decisions:

Find nodes with NVSentinel taints (if configured):

$kubectl get nodes -o json | jq '.items[]
> | select(.spec.taints[]?
> | select(.key | startswith("nvidia.com/")))
> | .metadata.name'

For Monitoring:

Get detailed failure information:

$kubectl get nodes -o json | jq '.items[].status.conditions[]
> | select(.type | startswith("Gpu"))'

For Pod Tolerations:

# Match the taint configured in your fault-quarantine rulesets
tolerations:
  - key: "nvidia.com/gpu-xid-error"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

Architecture

DATA_FLOW.md provides more context; at a high level, NVSentinel detects hardware failures and applies graduated responses via:

  1. Detection: Health monitors check GPU, system logs, and cloud maintenance events
  2. Classification: Platform connectors validate and set node conditions
  3. Quarantine: Fault quarantine evaluates rules and applies taints/cordons
  4. Evacuation: Node drainer evicts workloads per configured policies
  5. Remediation: Fault remediation triggers external systems via CRs

┌─────────────────────┐
│  Health Monitors    │  GPU, Syslog, CSP health detection
└──────────┬──────────┘
           │ Detect failures
           ▼
┌─────────────────────┐
│ Platform Connectors │  Set NodeConditions (why is it bad?)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Fault Quarantine   │  Apply Cordon/Taints (node is bad)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Node Drainer      │  Evict workloads per policy
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Fault Remediation   │  Trigger external systems (your CR)
└─────────────────────┘

1. Is a Node Bad? Check Taints

Use taints for all scheduling and automation decisions.

Taints are the primary signal that a node has hardware issues. External systems should watch for taint presence/absence to make scheduling decisions, trigger alerts, or initiate remediation workflows.

Note: Taints are optional and disabled by default. You must configure them in fault-quarantine rulesets by uncommenting the taint section. NVSentinel only cordons nodes by default.

Taint Structure

Format: User-configurable via rulesets. Common patterns:

Option 1: Component-specific (recommended)

nvidia.com/gpu-xid-error
nvidia.com/gpu-nvlink-error
nvidia.com/syslog-xid-error

Option 2: Hierarchical (proposed pattern)

gpu.health/memory-error
nvlink.health/link-down
nvswitch.health/fatal-error

Default Taint Examples

NVSentinel’s test suite demonstrates these taint configurations:

| Taint Key | Value | Effect | Use Case |
| --- | --- | --- | --- |
| nvidia.com/gpu-xid-error | true | NoSchedule | GPU XID critical errors |
| nvidia.com/gpu-nvlink-error | true | NoSchedule | NVLink connection failures |
| nvidia.com/syslog-xid-error | true | NoSchedule | Syslog-detected XID errors |
| nvidia.com/gpu-error | true | NoSchedule | Generic GPU hardware errors |

You can configure any taint keys/values in your rulesets based on your needs.

Taint Effect Guidelines

| Effect | Use Case | Impact |
| --- | --- | --- |
| NoSchedule | Fatal errors requiring remediation | New pods without toleration won’t be scheduled |
| PreferNoSchedule | Degraded state or warnings | Scheduler tries to avoid but will schedule if necessary |
| NoExecute | Immediate evacuation needed | Existing pods without toleration are evicted (rarely used) |

Configuring Taints

Taints are defined in Fault Quarantine rulesets. Here’s an example showing how to enable taints:

# distros/kubernetes/nvsentinel/charts/fault-quarantine/values.yaml
rulesets:
  - version: "1"
    name: "GPU XID Critical Errors"
    priority: 100
    match:
      any:
        - kind: "HealthEvent"
          expression: 'event.checkName == "GpuXidError" && event.isFatal == true'
    # Uncomment to enable tainting:
    #taint:
    #  key: "nvidia.com/gpu-xid-error"  # Choose your own key format
    #  value: "true"                    # Or use "fatal", "degraded", etc.
    #  effect: "NoSchedule"
    cordon:
      shouldCordon: true  # Enabled by default

Key Points:

  • Taints are commented out by default - you must enable them
  • You control the taint key format (nvidia.com/* or gpu.health/* or any custom format)
  • You control the taint values (true, fatal, degraded, etc.)
  • Cordoning is enabled by default; tainting is opt-in

Integration Patterns

Check if node has any NVIDIA-related taints:

$kubectl get nodes -o json | jq '.items[]
> | select(.spec.taints[]?
> | select(.key | startswith("nvidia.com/")))
> | .metadata.name'

Check for specific error type:

$kubectl get nodes -o json | jq '.items[]
> | select(.spec.taints[]?
> | select(.key == "nvidia.com/gpu-xid-error"))
> | .metadata.name'

Tolerate specific taints in pod specs:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
    # Match the exact taint configured in your rulesets
    - key: "nvidia.com/gpu-xid-error"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"

Watch for taint changes (automation):

informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    UpdateFunc: func(oldObj, newObj interface{}) {
        newNode := newObj.(*corev1.Node)

        // Check for NVIDIA-related taints
        for _, taint := range newNode.Spec.Taints {
            if strings.HasPrefix(taint.Key, "nvidia.com/") {
                // Trigger alert, update scheduler, etc.
                log.Printf("Node %s has taint %s=%s",
                    newNode.Name, taint.Key, taint.Value)
            }
        }
    },
})

2. Why is a Node Bad? Check Node Conditions

Use node conditions for monitoring, alerting, and detailed diagnostics.

While taints tell you “this node is bad”, conditions tell you why it’s bad. Use conditions for dashboards, alerts, and troubleshooting.

Monitoring with kube-state-metrics

Prerequisites: Install kube-state-metrics to expose node conditions as Prometheus metrics. NVSentinel sets node conditions via the Kubernetes API, but kube-state-metrics is required to convert these into metrics.

Available Metrics:

# Monitor specific GPU health conditions
kube_node_status_condition{condition="GpuMemWatch",status="true"} == 1
kube_node_status_condition{condition="GpuNvlinkWatch",status="true"} == 1
kube_node_status_condition{condition="SysLogsXIDError",status="true"} == 1

# Count unhealthy GPU nodes
count(kube_node_status_condition{condition=~"Gpu.*|SysLogs.*",status="true"})

# Alert on any GPU or syslog condition
kube_node_status_condition{condition=~"Gpu.*|SysLogs.*",status="true"}

Example Prometheus Alert:

- alert: GPUNodeUnhealthy
  expr: |
    kube_node_status_condition{condition=~"Gpu.*",status="true"} == 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "GPU node {{ $labels.node }} has condition {{ $labels.condition }}"
    description: "Node {{ $labels.node }} is unhealthy due to {{ $labels.condition }}"

Grafana Dashboard Query:

# Show all nodes with active GPU health conditions
kube_node_status_condition{condition=~"Gpu.*|SysLogs.*",status="true"}

Note: NVSentinel also exposes its own Prometheus metrics for internal operations. See /nvsentinel/observability/metrics-reference for the complete list of NVSentinel-native metrics.

Condition Structure

Platform Connectors set NodeConditions based on health monitor checks. Each condition explains what hardware component failed.

Naming: PascalCase, directly from health monitor check names
Examples: GpuMemWatch, GpuThermalWatch, SysLogsXIDError

Condition vs Event Behavior

NVSentinel uses different Kubernetes primitives based on error severity:

| Error Type | Condition Set | Event Created? | Use Case |
| --- | --- | --- | --- |
| Fatal (isFatal=true) | ✅ Yes (status=True) | ❌ No | Critical errors requiring quarantine/remediation |
| Non-Fatal (isFatal=false) | ❌ No | ✅ Yes | Warnings, transient issues, informational |
| Healthy (isHealthy=true) | ✅ Yes (status=False) | ❌ No | Health recovery, condition cleared |

Why this design?

  • Conditions are durable state - used for errors that require action (cordon, drain, remediation)
  • Events are transient notifications - used for warnings and non-critical issues that don’t require node isolation

Using Events for Non-Fatal Errors

Non-fatal errors (like thermal throttling warnings or transient issues) create Kubernetes Events instead of node conditions. This prevents alert fatigue while still providing visibility.

View recent events for a node:

$kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=gpu-node-01 \
> --sort-by='.lastTimestamp'

Filter for GPU-related events:

$kubectl get events --all-namespaces \
> -o json | jq '.items[] | select(.type | startswith("Gpu") or startswith("SysLogs"))'

Watch for real-time events:

$kubectl get events --watch --field-selector involvedObject.kind=Node

Example non-fatal event:

apiVersion: v1
kind: Event
metadata:
  name: gpu-node-01.17a3b2c4d5e6f7
  namespace: default
involvedObject:
  kind: Node
  name: gpu-node-01
reason: Warning
message: "[DCGM_FR_CLOCK_THROTTLE_THERMAL] GPU thermal throttling detected - RecommendedAction: NONE"
type: GpuThermalWatch
source:
  component: gpu-health-monitor
  host: gpu-node-01
firstTimestamp: "2025-11-06T10:05:00Z"
lastTimestamp: "2025-11-06T10:05:00Z"
count: 1

Integration patterns for events:

// Watch for non-fatal GPU events
eventInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc: func(obj interface{}) {
        event := obj.(*corev1.Event)
        if event.InvolvedObject.Kind == "Node" &&
            (strings.HasPrefix(event.Type, "Gpu") || strings.HasPrefix(event.Type, "SysLogs")) {
            // Log warning, update dashboard, etc.
            log.Printf("Non-fatal issue on %s: %s",
                event.InvolvedObject.Name, event.Message)
        }
    },
})

Condition Status

| Status | Meaning |
| --- | --- |
| True | Error/fault detected |
| False | Component healthy |
| Unknown | Health state cannot be determined |

Condition Message Format

Messages include error codes and recommended actions:

[ErrorCode1, ErrorCode2] Human-readable description - RecommendedAction: ACTION_NAME

Example:

conditions:
  - type: GpuMemoryError
    status: "True"
    reason: HardwareFailure
    message: "[DCGM_FR_FAULTY_MEMORY] GPU memory failure detected on GPU 0 - RecommendedAction: RESTART_VM"
    lastTransitionTime: "2025-11-06T10:00:00Z"

Standard Condition Types

GPU Conditions (from GPU Health Monitor - DCGM)

  • GpuMemWatch - GPU memory failures (ECC errors, faulty memory)
  • GpuThermalWatch - Thermal throttling or temperature violations
  • GpuPcieWatch - PCIe link issues (replay rate, bandwidth)
  • GpuPowerWatch - Power-related issues
  • GpuInforomWatch - Inforom corruption detected
  • GpuSmWatch - Streaming Multiprocessor errors
  • GpuNvlinkWatch - NVLink connection failures
  • GpuMcuWatch - Microcontroller unit errors
  • GpuPmuWatch - Power management unit errors
  • GpuDriverWatch - GPU driver errors
  • GpuCpusetWatch - CPU affinity issues

Syslog Conditions (from Syslog Health Monitor)

  • SysLogsXIDError - GPU XID errors detected in system logs
  • SysLogsSXIDError - NVSwitch SXID errors detected in system logs
  • SysLogsGPUFallenOff - GPU fallen off bus errors detected in system logs

NVSwitch Conditions

  • NVSwitchFatalError - Fatal NVSwitch hardware error
  • NVSwitchDown - NVSwitch unavailable
  • NVSwitchNonFatalError - Non-fatal NVSwitch errors (warnings)

System Conditions

  • DCGMError - DCGM daemon or API failures
  • CSPMaintenance - Cloud provider scheduled maintenance
  • SyslogError - System log analysis detected issues

Integration Patterns

Monitor specific condition types:

$kubectl get nodes -o json | jq '.items[]
> | select(.status.conditions[] | select(.type=="GpuMemWatch" and .status=="True"))
> | .metadata.name'

Watch for condition changes:

$kubectl get nodes -w -o json | jq -c 'select(.status.conditions[] | select(.type | startswith("Gpu")))'

Prometheus alert example:

groups:
  - name: nvsentinel
    rules:
      - alert: GpuMemoryError
        expr: kube_node_status_condition{condition="GpuMemWatch",status="true"} == 1
        annotations:
          summary: "GPU memory error on {{ $labels.node }}"

client-go example:

informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    UpdateFunc: func(oldObj, newObj interface{}) {
        newNode := newObj.(*corev1.Node)
        for _, condition := range newNode.Status.Conditions {
            if strings.HasPrefix(string(condition.Type), "Gpu") && condition.Status == corev1.ConditionTrue {
                // Send alert with condition.Message
                log.Printf("GPU issue on %s: %s", newNode.Name, condition.Message)
            }
        }
    },
})

3. Can I Use My Own Remediation? Provide a Custom Resource

NVSentinel triggers external systems by creating Kubernetes Custom Resources.

After detecting and draining a failing node, NVSentinel creates a CR that your controller watches. This gives you full control over remediation - integrate with cloud APIs, DCIM systems, or custom workflows.

Integration Architecture

┌────────────────────┐
│ Fault Remediation  │  Watches drained nodes
│      Module        │
└─────────┬──────────┘
          │ Creates CR based on RecommendedAction
          ▼
┌────────────────────┐
│  Kubernetes API    │  Custom Resource created
│  (RebootNode,      │
│   TerminateNode)   │
└─────────┬──────────┘
          │ Watched by external controller
          ▼
┌────────────────────┐
│  External System   │  Janitor, cloud APIs, DCIM
│ (Your Controller)  │
└────────────────────┘

Configuration

Configure the maintenance CR template and behavior:

# distros/kubernetes/nvsentinel/charts/fault-remediation/values.yaml
maintenance:
  # API group of your maintenance CRD
  apiGroup: "janitor.dgxc.nvidia.com"
  version: "v1alpha1"
  kind: "RebootNode"

  # Completion condition to check before creating new CRs
  # Prevents duplicate remediation requests for the same node
  completeConditionType: "NodeReady"

  # Namespace where maintenance CRs will be created
  namespace: "nvsentinel"

  # Resource names for RBAC permissions
  resourceNames:
    - "rebootnodes"
    - "terminatenodes"

  # Go template for generating maintenance CRs
  # Available variables: .ApiGroup, .Version, .RecommendedAction, .NodeName, .HealthEventID
  template: |
    apiVersion: {{ .ApiGroup }}/{{ .Version }}
    kind: {{ if eq .RecommendedAction 2 }}RebootNode{{ else }}TerminateNode{{ end }}
    metadata:
      name: maintenance-{{ .NodeName }}-{{ .HealthEventID }}
      namespace: {{ .Namespace }}
    spec:
      nodeName: {{ .NodeName }}
      reason: "Health event {{ .HealthEventID }}"
      force: false

# Retry configuration for CR creation
updateRetry:
  maxRetries: 5
  retryDelaySeconds: 10

Custom Resource Template

The template uses Go template syntax with these variables:

| Variable | Type | Description |
| --- | --- | --- |
| .ApiGroup | string | API group from maintenance.apiGroup |
| .Version | string | API version from maintenance.version |
| .Kind | string | Resource kind from maintenance.kind |
| .RecommendedAction | int | Numeric action code (2=reboot, 15=terminate) |
| .NodeName | string | Name of the node requiring remediation |
| .HealthEventID | string | Unique ID of the triggering health event |
| .Namespace | string | Namespace from maintenance.namespace |

RecommendedAction Codes

| Code | Action | Typical Use Case |
| --- | --- | --- |
| 2 | COMPONENT_RESET | GPU/driver reset, reboot node |
| 5 | CONTACT_SUPPORT | Manual intervention needed |
| 15 | RESTART_VM | Reboot VM instance |
| 24 | RESTART_BM | Reboot bare metal node |
| 25 | REPLACE_VM | Terminate and replace VM |
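
Because the template is standard Go text/template syntax, you can exercise it outside the cluster before deploying. A sketch that renders the default template with the variables from the table above (the struct and sample values are illustrative):

package main

import (
    "os"
    "text/template"
)

// maintenanceVars mirrors the documented template variables.
type maintenanceVars struct {
    ApiGroup          string
    Version           string
    Kind              string
    RecommendedAction int
    NodeName          string
    HealthEventID     string
    Namespace         string
}

const crTemplate = `apiVersion: {{ .ApiGroup }}/{{ .Version }}
kind: {{ if eq .RecommendedAction 2 }}RebootNode{{ else }}TerminateNode{{ end }}
metadata:
  name: maintenance-{{ .NodeName }}-{{ .HealthEventID }}
  namespace: {{ .Namespace }}
spec:
  nodeName: {{ .NodeName }}
  reason: "Health event {{ .HealthEventID }}"
  force: false
`

func main() {
    tmpl := template.Must(template.New("cr").Parse(crTemplate))
    // RecommendedAction 2 (COMPONENT_RESET) renders a RebootNode CR.
    vars := maintenanceVars{
        ApiGroup:          "janitor.dgxc.nvidia.com",
        Version:           "v1alpha1",
        RecommendedAction: 2,
        NodeName:          "gpu-node-01",
        HealthEventID:     "673bac8e9f1234567890abcd",
        Namespace:         "nvsentinel",
    }
    if err := tmpl.Execute(os.Stdout, vars); err != nil {
        panic(err)
    }
}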

Integration Examples

Example 1: Janitor Controller Integration

Janitor controller watches for RebootNode and TerminateNode CRs:

apiVersion: janitor.dgxc.nvidia.com/v1alpha1
kind: RebootNode
metadata:
  name: maintenance-gpu-node-01-673bac8e9f1234567890abcd
  namespace: nvsentinel
spec:
  nodeName: gpu-node-01
  reason: "Health event 673bac8e9f1234567890abcd"
  force: false
status:
  conditions:
    - type: NodeReady
      status: "False"
      reason: "RebootInProgress"
Example 2: Cloud Provider Integration

Custom template for cloud-specific maintenance:

maintenance:
  apiGroup: "cloud.example.com"
  version: "v1"
  kind: "NodeMaintenance"
  template: |
    apiVersion: {{ .ApiGroup }}/{{ .Version }}
    kind: NodeMaintenance
    metadata:
      name: {{ .NodeName }}-{{ .HealthEventID }}
    spec:
      nodeName: {{ .NodeName }}
      action: {{ if eq .RecommendedAction 2 }}"reboot"{{ else if eq .RecommendedAction 15 }}"restart"{{ else }}"replace"{{ end }}
      provider:
        region: "us-west-2"
        instanceId: "{{ .NodeName }}"

Example 3: DCIM Integration

Template for data center infrastructure management:

maintenance:
  apiGroup: "dcim.example.com"
  version: "v1alpha1"
  kind: "ServerMaintenance"
  template: |
    apiVersion: {{ .ApiGroup }}/{{ .Version }}
    kind: ServerMaintenance
    metadata:
      name: server-{{ .NodeName }}
    spec:
      serverName: {{ .NodeName }}
      maintenanceType: {{ if eq .RecommendedAction 2 }}"reboot"{{ else }}"replace"{{ end }}
      priority: "high"
      ticketId: "HEALTH-{{ .HealthEventID }}"

Completion Detection

Fault Remediation checks the completeConditionType status on existing CRs before creating new ones:

  • Status: True - Maintenance completed successfully, new CR can be created
  • Status: False - Maintenance failed, new CR can be created for retry
  • Condition Missing - Maintenance in progress, skip CR creation

This prevents duplicate remediation requests for nodes with ongoing maintenance.
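
If your own controller needs the same guard, a sketch of the check using client-go's dynamic client follows. It inspects only rebootnodes for brevity; the group, version, namespace, and condition type mirror the configuration shown above, and error handling is trimmed. This is an illustrative consumer-side reimplementation, not NVSentinel's actual code:

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
)

// canCreateCR reports whether the completion rules above would allow creating
// a new maintenance CR for nodeName: an existing CR whose completion
// condition is missing means maintenance is still in progress.
func canCreateCR(ctx context.Context, dyn dynamic.Interface, nodeName string) (bool, error) {
    gvr := schema.GroupVersionResource{
        Group:    "janitor.dgxc.nvidia.com", // maintenance.apiGroup
        Version:  "v1alpha1",                // maintenance.version
        Resource: "rebootnodes",             // one of maintenance.resourceNames
    }
    crs, err := dyn.Resource(gvr).Namespace("nvsentinel").List(ctx, metav1.ListOptions{})
    if err != nil {
        return false, err
    }
    for _, cr := range crs.Items {
        name, _, _ := unstructured.NestedString(cr.Object, "spec", "nodeName")
        if name != nodeName {
            continue
        }
        conds, found, _ := unstructured.NestedSlice(cr.Object, "status", "conditions")
        if !found {
            return false, nil // no conditions yet: maintenance in progress, skip
        }
        complete := false
        for _, c := range conds {
            if cond, ok := c.(map[string]interface{}); ok && cond["type"] == "NodeReady" {
                complete = true // Status True or False both allow a new CR
            }
        }
        if !complete {
            return false, nil // completion condition missing: skip creation
        }
    }
    return true, nil
}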

Testing Your Integration

  1. Validate Template Syntax:

     # Dry-run mode to validate template without creating CRs
     $helm install nvsentinel --set global.dryRun=true ...
  2. Monitor CR Creation:

     # Watch for maintenance CRs
     $kubectl get rebootnodes -n nvsentinel -w
  3. Check Fault Remediation Logs:

     $kubectl logs -n nvsentinel deployment/fault-remediation -f

Configuration Location: distros/kubernetes/nvsentinel/charts/fault-remediation/values.yaml

4. How Do I Customize Drain Behavior? Configure Eviction Modes

Control how workloads are evicted from failing nodes.

The Node Drainer module handles graceful workload eviction from cordoned nodes. Eviction behavior can be customized per namespace to accommodate different workload types and operational requirements.

Eviction Modes

NVSentinel supports three eviction modes:

| Mode | Behavior | Use Case |
| --- | --- | --- |
| Immediate | Pod evicted immediately without waiting | Fast failover for stateless workloads |
| AllowCompletion | Wait for pod to gracefully terminate | Respects terminationGracePeriodSeconds for stateful workloads |
| DeleteAfterTimeout | Wait up to timeout, then force delete | Long-running jobs that need time to checkpoint |

Configuration

Configure eviction behavior in Helm values:

# distros/kubernetes/nvsentinel/charts/node-drainer/values.yaml
# Eviction timeout in seconds for pod eviction operations
evictionTimeoutInSeconds: "60"

# System namespaces are skipped during drain
systemNamespaces: "^(nvsentinel|kube-system|gpu-operator|gmp-system|network-operator)$"

# Time after which pods in DeleteAfterTimeout mode will be force deleted
deleteAfterTimeoutMinutes: 60

# Time after which a pod in NotReady state is considered stuck
notReadyTimeoutMinutes: 5

# Per-namespace eviction configuration
userNamespaces:
  # Default for all user namespaces
  - name: "*"
    mode: "AllowCompletion"

  # Fast failover for stateless web services
  - name: "web-tier"
    mode: "Immediate"

  # Allow ML training jobs to checkpoint before eviction
  - name: "ml-training"
    mode: "DeleteAfterTimeout"

Eviction Workflow

  1. System Namespace Skip: Pods in system namespaces (kube-system, nvsentinel, etc.) are never evicted
  2. Mode Selection: Eviction mode determined by namespace match (most specific wins; see the sketch after this list)
  3. Graceful Termination: Respects pod’s terminationGracePeriodSeconds for AllowCompletion mode
  4. Timeout Handling: Force deletes stuck or timed-out pods based on configuration
  5. NotReady Detection: Automatically force deletes pods stuck in NotReady state beyond threshold
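
A sketch of the "most specific wins" selection in step 2, assuming exact namespace names plus a "*" catch-all as shown in the configuration above (the function is illustrative, not NVSentinel's actual implementation):

package main

import "fmt"

type namespaceMode struct {
    Name string // exact namespace name, or "*" as the catch-all default
    Mode string // "Immediate", "AllowCompletion", or "DeleteAfterTimeout"
}

// evictionModeFor picks the eviction mode for a namespace: an exact
// match beats the "*" wildcard default, regardless of list order.
func evictionModeFor(ns string, configs []namespaceMode) string {
    mode := ""
    for _, c := range configs {
        if c.Name == ns {
            return c.Mode // exact match wins immediately
        }
        if c.Name == "*" {
            mode = c.Mode // remember the default
        }
    }
    return mode
}

func main() {
    configs := []namespaceMode{
        {Name: "*", Mode: "AllowCompletion"},
        {Name: "web-tier", Mode: "Immediate"},
        {Name: "ml-training", Mode: "DeleteAfterTimeout"},
    }
    fmt.Println(evictionModeFor("web-tier", configs)) // Immediate
    fmt.Println(evictionModeFor("payments", configs)) // AllowCompletion (default)
}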

Example: Multi-Tier Application

userNamespaces:
  # Critical database - wait for graceful shutdown
  - name: "database"
    mode: "AllowCompletion"

  # Batch processing - allow time for checkpoint
  - name: "batch-jobs"
    mode: "DeleteAfterTimeout"

  # Web frontend - fast failover
  - name: "frontend"
    mode: "Immediate"

  # Default for everything else
  - name: "*"
    mode: "AllowCompletion"

Configuration Location: distros/kubernetes/nvsentinel/charts/node-drainer/values.yaml

Topology Awareness (Topograph)

When Topograph is deployed in the cluster, it applies four node labels describing the physical network topology:

  • network.topology.nvidia.com/accelerator — NVLink domain (clique) ID
  • network.topology.nvidia.com/leaf — leaf switch identifier
  • network.topology.nvidia.com/spine — spine switch identifier
  • network.topology.nvidia.com/core — core switch identifier

These keys are included by default in the Metadata Augmentor’s allowedLabels, so NVSentinel automatically propagates them into health event metadata on clusters where Topograph has applied them. On clusters without Topograph, the labels are absent and the Metadata Augmentor simply skips them — no configuration change is required either way.

Downstream consumers of NVSentinel events (fault-quarantine CEL rules, remediation custom resources, dashboards, blast-radius analysis) can then reason about topological locality. For example, a CEL rule can compare the network.topology.nvidia.com/accelerator value across a set of recent events to determine whether a fault is isolated to a single NVLink domain or spans multiple.
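
For illustration, a consumer-side sketch that groups recent health events by the accelerator (NVLink domain) label to gauge blast radius; the healthEvent struct is hypothetical and stands in for whatever event shape your pipeline receives:

package main

import "fmt"

// healthEvent is a hypothetical stand-in for an NVSentinel health event
// carrying the node labels propagated by the Metadata Augmentor.
type healthEvent struct {
    NodeName string
    Labels   map[string]string
}

// nvlinkDomains counts events per NVLink domain. One domain with many
// events suggests a fault isolated to a single clique; several domains
// suggest a broader, fabric-level problem.
func nvlinkDomains(events []healthEvent) map[string]int {
    counts := map[string]int{}
    for _, e := range events {
        if d, ok := e.Labels["network.topology.nvidia.com/accelerator"]; ok {
            counts[d]++
        }
    }
    return counts
}

func main() {
    events := []healthEvent{
        {NodeName: "gpu-node-01", Labels: map[string]string{"network.topology.nvidia.com/accelerator": "clique-7"}},
        {NodeName: "gpu-node-02", Labels: map[string]string{"network.topology.nvidia.com/accelerator": "clique-7"}},
    }
    fmt.Println(nvlinkDomains(events)) // map[clique-7:2] => isolated to one NVLink domain
}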

The authoritative reference for these labels — value semantics, hashing behavior for long identifiers, and provider matrix — is topograph’s docs/reference/node-labels.md.

Error Code Mapping Reference

NVSentinel maps DCGM error codes to recommended actions using a canonical CSV file.

Mapping File: distros/kubernetes/nvsentinel/charts/gpu-health-monitor/files/dcgmerrorsmapping.csv

| Action | Meaning | Typical Resolution |
| --- | --- | --- |
| RESTART_VM | Software-recoverable error | Node reboot via janitor |
| COMPONENT_RESET | Hardware reset required | GPU/driver reset |
| CONTACT_SUPPORT | Manual intervention needed | Create support ticket, manual investigation |
| NONE | Health check informational | No action required |

Example Mappings

| DCGM Error Code | Recommended Action | Typical Condition |
| --- | --- | --- |
| DCGM_FR_FAULTY_MEMORY | CONTACT_SUPPORT | GpuMemoryError |
| DCGM_FR_VOLATILE_DBE_DETECTED | COMPONENT_RESET | GpuMemoryError |
| DCGM_FR_NVLINK_DOWN | RESTART_VM | NVLinkDown |
| DCGM_FR_NVSWITCH_FATAL_ERROR | CONTACT_SUPPORT | NVSwitchFatalError |
| DCGM_FR_CLOCK_THROTTLE_THERMAL | NONE | GpuThermalWatch |
| DCGM_FR_SXID_ERROR | RESTART_VM | GpuXidError |

The full mapping contains 121 error codes; see the CSV file for the complete reference.
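
If your tooling needs the same mapping, loading the CSV directly avoids hardcoding it. A sketch that assumes the first two columns are the DCGM error code and the recommended action (check the file's actual header before relying on this):

package main

import (
    "encoding/csv"
    "fmt"
    "os"
)

// loadDCGMActions reads the error-code-to-action mapping CSV.
// Assumed column layout: error code first, recommended action second.
func loadDCGMActions(path string) (map[string]string, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    rows, err := csv.NewReader(f).ReadAll()
    if err != nil {
        return nil, err
    }
    actions := make(map[string]string, len(rows))
    for i, row := range rows {
        if i == 0 || len(row) < 2 {
            continue // skip header and malformed rows
        }
        actions[row[0]] = row[1]
    }
    return actions, nil
}

func main() {
    actions, err := loadDCGMActions("dcgmerrorsmapping.csv")
    if err != nil {
        panic(err)
    }
    fmt.Println(actions["DCGM_FR_FAULTY_MEMORY"]) // e.g. CONTACT_SUPPORT
}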

Node Status Examples

Example 1: Node with Fatal GPU XID Error (With Optional Taint)

apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01
spec:
  unschedulable: true  # Cordoned (enabled by default)
  taints:
    # Optional - only present if configured in rulesets
    - key: "nvidia.com/gpu-xid-error"
      value: "true"
      effect: "NoSchedule"
status:
  conditions:
    - type: Ready
      status: "False"
      reason: "GpuHealthCheckFailed"
      message: "GPU health check failed"
    - type: SysLogsXIDError
      status: "True"
      reason: "HardwareFailure"
      message: "[DCGM_FR_SXID_ERROR] GPU XID error detected on GPU 0 - RecommendedAction: RESTART_VM"
      lastTransitionTime: "2025-11-06T10:00:00Z"

Example 2: Node with Non-Fatal GPU Thermal Issue

apiVersion: v1
kind: Node
metadata:
  name: gpu-node-02
spec:
  # May or may not be cordoned depending on ruleset configuration
  taints:
    # Optional - only present if configured in rulesets
    - key: "nvidia.com/gpu-thermal"
      value: "true"
      effect: "PreferNoSchedule"
status:
  conditions:
    - type: Ready
      status: "True"
    - type: GpuThermalWatch
      status: "True"
      reason: "ThermalThrottling"
      message: "[DCGM_FR_CLOCK_THROTTLE_THERMAL] GPU thermal throttling detected - RecommendedAction: NONE"
      lastTransitionTime: "2025-11-06T10:02:00Z"

Example 3: Healthy Node

apiVersion: v1
kind: Node
metadata:
  name: gpu-node-03
status:
  conditions:
    - type: Ready
      status: "True"
    - type: GpuMemWatch
      status: "False"
      reason: "HealthCheckPassed"
      message: "GPU memory health check passed"
      lastTransitionTime: "2025-11-06T10:10:00Z"
    - type: GpuThermalWatch
      status: "False"
      reason: "HealthCheckPassed"
      message: "GPU thermal health check passed"
      lastTransitionTime: "2025-11-06T10:10:00Z"

Implementation Notes

Module Responsibilities

| Module | Responsibility | What It Sets |
| --- | --- | --- |
| Platform Connectors | Process health events, update node status | NodeConditions |
| Fault Quarantine | Apply operational policies | Taints, cordon status |
| Node Drainer | Evict workloads | Drain nodes |
| Fault Remediation | Trigger maintenance | Create maintenance CRs |

Configuration Files

  • Error Mapping: distros/kubernetes/nvsentinel/charts/gpu-health-monitor/files/dcgmerrorsmapping.csv
  • Quarantine Rules: distros/kubernetes/nvsentinel/charts/fault-quarantine/values.yaml
  • Module Config: distros/kubernetes/nvsentinel/values.yaml

Code Locations

  • Condition Setting: platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • Taint Application: fault-quarantine/pkg/informer/k8s_client.go
  • Drain Logic: node-drainer/pkg/drainer/drainer.go
  • Remediation Triggering: fault-remediation/pkg/remediation/remediation.go

Contributing

This document describes the proposed API contract for NVSentinel node health signaling. Changes to condition types, taint keys, or label keys require review and follow the deprecation policy.

To propose changes:

  1. Open an issue describing the use case
  2. Discuss impact on external integrations
  3. Follow the versioning and deprecation guidelines
  4. Update this document as part of the PR