NVSentinel Integration Guide

NVSentinel detects GPU and hardware failures and exposes them using standard Kubernetes primitives. This document provides a high-level overview of how to integrate with NVSentinel for scheduling, monitoring, and remediation purposes.

Integration Model

Think of NVSentinel integration in five layers:

  1. Is a node bad? → Check Taints

    • Taints mark nodes with hardware issues
    • Use taints for scheduling decisions and filtering
    • React to taint presence/absence in automation
  2. Why is a node bad? → Check Node Conditions

    • Conditions provide detailed diagnostic information
    • Use conditions for monitoring, alerting, and dashboards
    • Each condition explains what hardware component failed
  3. Can I use my own remediation? → Provide a Custom Resource

    • NVSentinel triggers external systems via CRs
    • Integrate with cloud APIs, DCIM, or custom controllers
    • You retain full control over how nodes are repaired
  4. How do I customize drain behavior? → Configure per-namespace eviction modes

    • Control how workloads are evicted from failing nodes
    • Define different policies for stateless vs stateful workloads
    • Set timeouts and grace periods per namespace
  5. Should GPU pods run diagnostics before the workload starts? → Enable Preflight

    • Opt-in per namespace; webhook injects init-container checks (DCGM, optional NCCL)
    • Multi-node jobs use gang discovery (native Workload API or PodGroup-style schedulers like Volcano and Run:ai)
    • Separate from the MongoDB health-event pipeline (see Data flow)

Quick Start

For Scheduling Decisions:

Find nodes with NVSentinel taints (if configured):

$kubectl get nodes -o json | jq '.items[]
> | select(.spec.taints[]?
> | select(.key | startswith("nvidia.com/")))
> | .metadata.name'

For Monitoring:

Get detailed failure information:

$kubectl get nodes -o json | jq '.items[].status.conditions[]
> | select(.type | startswith("Gpu"))'

For Pod Tolerations:

# Match the taint configured in your fault-quarantine rulesets
tolerations:
  - key: "nvidia.com/gpu-xid-error"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

Architecture

DATA_FLOW.md provides more context; at a high level, NVSentinel detects hardware failures and applies graduated responses via:

  1. Detection: Health monitors check GPU, system logs, and cloud maintenance events
  2. Classification: Platform connectors validate and set node conditions
  3. Quarantine: Fault quarantine evaluates rules and applies taints/cordons
  4. Evacuation: Node drainer evicts workloads per configured policies
  5. Remediation: Fault remediation triggers external systems via CRs

┌─────────────────────┐
│  Health Monitors    │  GPU, Syslog, CSP health detection
└──────────┬──────────┘
           │ Detect failures
           ▼
┌─────────────────────┐
│ Platform Connectors │  Set NodeConditions (why is it bad?)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Fault Quarantine   │  Apply Cordon/Taints (node is bad)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Node Drainer      │  Evict workloads per policy
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Fault Remediation   │  Trigger external systems (your CR)
└─────────────────────┘

1. Is a Node Bad? Check Taints

Use taints for all scheduling and automation decisions.

Taints are the primary signal that a node has hardware issues. External systems should watch for taint presence/absence to make scheduling decisions, trigger alerts, or initiate remediation workflows.

Note: Taints are optional and disabled by default. You must configure them in fault-quarantine rulesets by uncommenting the taint section. NVSentinel only cordons nodes by default.

Taint Structure

Format: User-configurable via rulesets. Common patterns:

Option 1: Component-specific (recommended)

nvidia.com/gpu-xid-error
nvidia.com/gpu-nvlink-error
nvidia.com/syslog-xid-error

Option 2: Hierarchical (proposed pattern)

gpu.health/memory-error
nvlink.health/link-down
nvswitch.health/fatal-error

Default Taint Examples

NVSentinel’s test suite demonstrates these taint configurations:

| Taint Key | Value | Effect | Use Case |
| --- | --- | --- | --- |
| nvidia.com/gpu-xid-error | true | NoSchedule | GPU XID critical errors |
| nvidia.com/gpu-nvlink-error | true | NoSchedule | NVLink connection failures |
| nvidia.com/syslog-xid-error | true | NoSchedule | Syslog-detected XID errors |
| nvidia.com/gpu-error | true | NoSchedule | Generic GPU hardware errors |

You can configure any taint keys/values in your rulesets based on your needs.

Taint Effect Guidelines

| Effect | Use Case | Impact |
| --- | --- | --- |
| NoSchedule | Fatal errors requiring remediation | New pods without toleration won’t be scheduled |
| PreferNoSchedule | Degraded state or warnings | Scheduler tries to avoid but will schedule if necessary |
| NoExecute | Immediate evacuation needed | Existing pods without toleration are evicted (rarely used) |

Configuring Taints

Taints are defined in Fault Quarantine rulesets. Here’s an example showing how to enable taints:

# distros/kubernetes/nvsentinel/charts/fault-quarantine/values.yaml
rulesets:
  - version: "1"
    name: "GPU XID Critical Errors"
    priority: 100
    match:
      any:
        - kind: "HealthEvent"
          expression: 'event.checkName == "GpuXidError" && event.isFatal == true'
    # Uncomment to enable tainting:
    #taint:
    #  key: "nvidia.com/gpu-xid-error"  # Choose your own key format
    #  value: "true"                    # Or use "fatal", "degraded", etc.
    #  effect: "NoSchedule"
    cordon:
      shouldCordon: true  # Enabled by default

Key Points:

  • Taints are commented out by default - you must enable them
  • You control the taint key format (nvidia.com/* or gpu.health/* or any custom format)
  • You control the taint values (true, fatal, degraded, etc.)
  • Cordoning is enabled by default; tainting is opt-in

Integration Patterns

Check if node has any NVIDIA-related taints:

$kubectl get nodes -o json | jq '.items[]
> | select(.spec.taints[]?
> | select(.key | startswith("nvidia.com/")))
> | .metadata.name'

Check for specific error type:

$kubectl get nodes -o json | jq '.items[]
> | select(.spec.taints[]?
> | select(.key == "nvidia.com/gpu-xid-error"))
> | .metadata.name'

Tolerate specific taints in pod specs:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
    # Match the exact taint configured in your rulesets
    - key: "nvidia.com/gpu-xid-error"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"

Watch for taint changes (automation):

informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    UpdateFunc: func(oldObj, newObj interface{}) {
        newNode := newObj.(*corev1.Node)

        // Check for NVIDIA-related taints
        for _, taint := range newNode.Spec.Taints {
            if strings.HasPrefix(taint.Key, "nvidia.com/") {
                // Trigger alert, update scheduler, etc.
                log.Printf("Node %s has taint %s=%s",
                    newNode.Name, taint.Key, taint.Value)
            }
        }
    },
})

2. Why is a Node Bad? Check Node Conditions

Use node conditions for monitoring, alerting, and detailed diagnostics.

While taints tell you “this node is bad”, conditions tell you why it’s bad. Use conditions for dashboards, alerts, and troubleshooting.

Monitoring with kube-state-metrics

Prerequisites: Install kube-state-metrics to expose node conditions as Prometheus metrics. NVSentinel sets node conditions via the Kubernetes API, but kube-state-metrics is required to convert these into metrics.

Available Metrics:

# Monitor specific GPU health conditions
kube_node_status_condition{condition="GpuMemWatch",status="true"} == 1
kube_node_status_condition{condition="GpuNvlinkWatch",status="true"} == 1
kube_node_status_condition{condition="SysLogsXIDError",status="true"} == 1

# Count unhealthy GPU nodes
count(kube_node_status_condition{condition=~"Gpu.*|SysLogs.*",status="true"})

# Alert on any GPU or syslog condition
kube_node_status_condition{condition=~"Gpu.*|SysLogs.*",status="true"}

Example Prometheus Alert:

- alert: GPUNodeUnhealthy
  expr: |
    kube_node_status_condition{condition=~"Gpu.*",status="true"} == 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "GPU node {{ $labels.node }} has condition {{ $labels.condition }}"
    description: "Node {{ $labels.node }} is unhealthy due to {{ $labels.condition }}"

Grafana Dashboard Query:

# Show all nodes with active GPU health conditions
kube_node_status_condition{condition=~"Gpu.*|SysLogs.*",status="true"}

Note: NVSentinel also exposes its own Prometheus metrics for internal operations. See /nvsentinel/observability/metrics-reference for the complete list of NVSentinel-native metrics.

Condition Structure

Platform Connectors set NodeConditions based on health monitor checks. Each condition explains what hardware component failed.

Naming: PascalCase, directly from health monitor check names
Examples: GpuMemWatch, GpuThermalWatch, SysLogsXIDError

Condition vs Event Behavior

NVSentinel uses different Kubernetes primitives based on error severity:

| Error Type | Condition Set | Event Created? | Use Case |
| --- | --- | --- | --- |
| Fatal (isFatal=true) | ✅ Yes (status=True) | ❌ No | Critical errors requiring quarantine/remediation |
| Non-Fatal (isFatal=false) | ❌ No | ✅ Yes | Warnings, transient issues, informational |
| Healthy (isHealthy=true) | ✅ Yes (status=False) | ❌ No | Health recovery, condition cleared |

Why this design?

  • Conditions are durable state - used for errors that require action (cordon, drain, remediation)
  • Events are transient notifications - used for warnings and non-critical issues that don’t require node isolation

Using Events for Non-Fatal Errors

Non-fatal errors (like thermal throttling warnings or transient issues) create Kubernetes Events instead of node conditions. This prevents alert fatigue while still providing visibility.

View recent events for a node:

$kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=gpu-node-01 \
> --sort-by='.lastTimestamp'

Filter for GPU-related events:

$kubectl get events --all-namespaces \
> -o json | jq '.items[] | select(.type | startswith("Gpu") or startswith("SysLogs"))'

Watch for real-time events:

$kubectl get events --watch --field-selector involvedObject.kind=Node

Example non-fatal event:

apiVersion: v1
kind: Event
metadata:
  name: gpu-node-01.17a3b2c4d5e6f7
  namespace: default
involvedObject:
  kind: Node
  name: gpu-node-01
reason: Warning
message: "[DCGM_FR_CLOCK_THROTTLE_THERMAL] GPU thermal throttling detected - RecommendedAction: NONE"
type: GpuThermalWatch
source:
  component: gpu-health-monitor
  host: gpu-node-01
firstTimestamp: "2025-11-06T10:05:00Z"
lastTimestamp: "2025-11-06T10:05:00Z"
count: 1

Integration patterns for events:

// Watch for non-fatal GPU events
eventInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc: func(obj interface{}) {
        event := obj.(*corev1.Event)
        if event.InvolvedObject.Kind == "Node" &&
            (strings.HasPrefix(event.Type, "Gpu") || strings.HasPrefix(event.Type, "SysLogs")) {
            // Log warning, update dashboard, etc.
            log.Printf("Non-fatal issue on %s: %s",
                event.InvolvedObject.Name, event.Message)
        }
    },
})

Condition Status

| Status | Meaning |
| --- | --- |
| True | Error/fault detected |
| False | Component healthy |
| Unknown | Health state cannot be determined |

Condition Message Format

Messages include error codes and recommended actions:

[ErrorCode1, ErrorCode2] Human-readable description - RecommendedAction: ACTION_NAME

Example:

conditions:
  - type: GpuMemoryError
    status: "True"
    reason: HardwareFailure
    message: "[DCGM_FR_FAULTY_MEMORY] GPU memory failure detected on GPU 0 - RecommendedAction: RESTART_VM"
    lastTransitionTime: "2025-11-06T10:00:00Z"

Standard Condition Types

GPU Conditions (from GPU Health Monitor - DCGM)

  • GpuMemWatch - GPU memory failures (ECC errors, faulty memory)
  • GpuThermalWatch - Thermal throttling or temperature violations
  • GpuPcieWatch - PCIe link issues (replay rate, bandwidth)
  • GpuPowerWatch - Power-related issues
  • GpuInforomWatch - Inforom corruption detected
  • GpuSmWatch - Streaming Multiprocessor errors
  • GpuNvlinkWatch - NVLink connection failures
  • GpuMcuWatch - Microcontroller unit errors
  • GpuPmuWatch - Power management unit errors
  • GpuDriverWatch - GPU driver errors
  • GpuCpusetWatch - CPU affinity issues

Syslog Conditions (from Syslog Health Monitor)

  • SysLogsXIDError - GPU XID errors detected in system logs
  • SysLogsSXIDError - NVSwitch SXID errors detected in system logs
  • SysLogsGPUFallenOff - GPU fallen off bus errors detected in system logs

NVSwitch Conditions

  • NVSwitchFatalError - Fatal NVSwitch hardware error
  • NVSwitchDown - NVSwitch unavailable
  • NVSwitchNonFatalError - Non-fatal NVSwitch errors (warnings)

System Conditions

  • DCGMError - DCGM daemon or API failures
  • CSPMaintenance - Cloud provider scheduled maintenance
  • SyslogError - System log analysis detected issues

Integration Patterns

Monitor specific condition types:

$kubectl get nodes -o json | jq '.items[]
> | select(.status.conditions[] | select(.type=="GpuMemWatch" and .status=="True"))
> | .metadata.name'

Watch for condition changes:

$kubectl get nodes -w -o json | jq -c 'select(.status.conditions[] | select(.type | startswith("Gpu")))'

Prometheus alert example:

groups:
  - name: nvsentinel
    rules:
      - alert: GpuMemoryError
        expr: kube_node_status_condition{condition="GpuMemWatch",status="true"} == 1
        annotations:
          summary: "GPU memory error on {{ $labels.node }}"

client-go example:

informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    UpdateFunc: func(oldObj, newObj interface{}) {
        newNode := newObj.(*corev1.Node)
        for _, condition := range newNode.Status.Conditions {
            if strings.HasPrefix(string(condition.Type), "Gpu") && condition.Status == corev1.ConditionTrue {
                // Send alert with condition.Message
                log.Printf("GPU issue on %s: %s", newNode.Name, condition.Message)
            }
        }
    },
})

3. Can I Use My Own Remediation? Provide a Custom Resource

NVSentinel triggers external systems by creating Kubernetes Custom Resources.

After detecting and draining a failing node, NVSentinel creates a CR that your controller watches. This gives you full control over remediation - integrate with cloud APIs, DCIM systems, or custom workflows.

Integration Architecture

┌────────────────────┐
│ Fault Remediation  │  Watches drained nodes
│      Module        │
└─────────┬──────────┘
          │ Creates CR based on RecommendedAction
          ▼
┌────────────────────┐
│  Kubernetes API    │  Custom Resource created
│  (RebootNode,      │
│   TerminateNode)   │
└─────────┬──────────┘
          │ Watched by external controller
          ▼
┌────────────────────┐
│  External System   │  Janitor, cloud APIs, DCIM
│ (Your Controller)  │
└────────────────────┘

Configuration

Configure the maintenance CR template and behavior:

# distros/kubernetes/nvsentinel/charts/fault-remediation/values.yaml
maintenance:
  # API group of your maintenance CRD
  apiGroup: "janitor.dgxc.nvidia.com"
  version: "v1alpha1"
  kind: "RebootNode"

  # Completion condition to check before creating new CRs
  # Prevents duplicate remediation requests for the same node
  completeConditionType: "NodeReady"

  # Namespace where maintenance CRs will be created
  namespace: "nvsentinel"

  # Resource names for RBAC permissions
  resourceNames:
    - "rebootnodes"
    - "terminatenodes"

  # Go template for generating maintenance CRs
  # Available variables: .ApiGroup, .Version, .RecommendedAction, .NodeName, .HealthEventID
  template: |
    apiVersion: {{ .ApiGroup }}/{{ .Version }}
    kind: {{ if eq .RecommendedAction 2 }}RebootNode{{ else }}TerminateNode{{ end }}
    metadata:
      name: maintenance-{{ .NodeName }}-{{ .HealthEventID }}
      namespace: {{ .Namespace }}
    spec:
      nodeName: {{ .NodeName }}
      reason: "Health event {{ .HealthEventID }}"
      force: false

# Retry configuration for CR creation
updateRetry:
  maxRetries: 5
  retryDelaySeconds: 10

Custom Resource Template

The template uses Go template syntax with these variables:

| Variable | Type | Description |
| --- | --- | --- |
| .ApiGroup | string | API group from maintenance.apiGroup |
| .Version | string | API version from maintenance.version |
| .Kind | string | Resource kind from maintenance.kind |
| .RecommendedAction | int | Numeric action code (2=reboot, 15=terminate) |
| .NodeName | string | Name of the node requiring remediation |
| .HealthEventID | string | Unique ID of the triggering health event |
| .Namespace | string | Namespace from maintenance.namespace |

RecommendedAction Codes

| Code | Action | Typical Use Case |
| --- | --- | --- |
| 2 | COMPONENT_RESET | GPU/driver reset, reboot node |
| 5 | CONTACT_SUPPORT | Manual intervention needed |
| 15 | RESTART_VM | Reboot VM instance |
| 24 | RESTART_BM | Reboot bare metal node |
| 25 | REPLACE_VM | Terminate and replace VM |
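
Because the template is standard Go text/template syntax, you can exercise it outside the cluster before deploying. A sketch that renders the default template with the variables from the table above (the struct and sample values are illustrative):

package main

import (
    "os"
    "text/template"
)

// maintenanceVars mirrors the documented template variables.
type maintenanceVars struct {
    ApiGroup          string
    Version           string
    Kind              string
    RecommendedAction int
    NodeName          string
    HealthEventID     string
    Namespace         string
}

const crTemplate = `apiVersion: {{ .ApiGroup }}/{{ .Version }}
kind: {{ if eq .RecommendedAction 2 }}RebootNode{{ else }}TerminateNode{{ end }}
metadata:
  name: maintenance-{{ .NodeName }}-{{ .HealthEventID }}
  namespace: {{ .Namespace }}
spec:
  nodeName: {{ .NodeName }}
  reason: "Health event {{ .HealthEventID }}"
  force: false
`

func main() {
    tmpl := template.Must(template.New("cr").Parse(crTemplate))
    // RecommendedAction 2 (COMPONENT_RESET) renders a RebootNode CR.
    vars := maintenanceVars{
        ApiGroup:          "janitor.dgxc.nvidia.com",
        Version:           "v1alpha1",
        RecommendedAction: 2,
        NodeName:          "gpu-node-01",
        HealthEventID:     "673bac8e9f1234567890abcd",
        Namespace:         "nvsentinel",
    }
    if err := tmpl.Execute(os.Stdout, vars); err != nil {
        panic(err)
    }
}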

Integration Examples

Example 1: Janitor Controller Integration

Janitor controller watches for RebootNode and TerminateNode CRs:

apiVersion: janitor.dgxc.nvidia.com/v1alpha1
kind: RebootNode
metadata:
  name: maintenance-gpu-node-01-673bac8e9f1234567890abcd
  namespace: nvsentinel
spec:
  nodeName: gpu-node-01
  reason: "Health event 673bac8e9f1234567890abcd"
  force: false
status:
  conditions:
    - type: NodeReady
      status: "False"
      reason: "RebootInProgress"
Example 2: Cloud Provider Integration

Custom template for cloud-specific maintenance:

maintenance:
  apiGroup: "cloud.example.com"
  version: "v1"
  kind: "NodeMaintenance"
  template: |
    apiVersion: {{ .ApiGroup }}/{{ .Version }}
    kind: NodeMaintenance
    metadata:
      name: {{ .NodeName }}-{{ .HealthEventID }}
    spec:
      nodeName: {{ .NodeName }}
      action: {{ if eq .RecommendedAction 2 }}"reboot"{{ else if eq .RecommendedAction 15 }}"restart"{{ else }}"replace"{{ end }}
      provider:
        region: "us-west-2"
        instanceId: "{{ .NodeName }}"

Example 3: DCIM Integration

Template for data center infrastructure management:

maintenance:
  apiGroup: "dcim.example.com"
  version: "v1alpha1"
  kind: "ServerMaintenance"
  template: |
    apiVersion: {{ .ApiGroup }}/{{ .Version }}
    kind: ServerMaintenance
    metadata:
      name: server-{{ .NodeName }}
    spec:
      serverName: {{ .NodeName }}
      maintenanceType: {{ if eq .RecommendedAction 2 }}"reboot"{{ else }}"replace"{{ end }}
      priority: "high"
      ticketId: "HEALTH-{{ .HealthEventID }}"

Completion Detection

Fault Remediation checks the completeConditionType status on existing CRs before creating new ones:

  • Status: True - Maintenance completed successfully, new CR can be created
  • Status: False - Maintenance failed, new CR can be created for retry
  • Condition Missing - Maintenance in progress, skip CR creation

This prevents duplicate remediation requests for nodes with ongoing maintenance.
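
If your own controller needs the same guard, a sketch of the check using client-go's dynamic client follows. It inspects only rebootnodes for brevity; the group, version, namespace, and condition type mirror the configuration shown above, and error handling is trimmed. This is an illustrative consumer-side reimplementation, not NVSentinel's actual code:

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
)

// canCreateCR reports whether the completion rules above would allow creating
// a new maintenance CR for nodeName: an existing CR whose completion
// condition is missing means maintenance is still in progress.
func canCreateCR(ctx context.Context, dyn dynamic.Interface, nodeName string) (bool, error) {
    gvr := schema.GroupVersionResource{
        Group:    "janitor.dgxc.nvidia.com", // maintenance.apiGroup
        Version:  "v1alpha1",                // maintenance.version
        Resource: "rebootnodes",             // one of maintenance.resourceNames
    }
    crs, err := dyn.Resource(gvr).Namespace("nvsentinel").List(ctx, metav1.ListOptions{})
    if err != nil {
        return false, err
    }
    for _, cr := range crs.Items {
        name, _, _ := unstructured.NestedString(cr.Object, "spec", "nodeName")
        if name != nodeName {
            continue
        }
        conds, found, _ := unstructured.NestedSlice(cr.Object, "status", "conditions")
        if !found {
            return false, nil // no conditions yet: maintenance in progress, skip
        }
        complete := false
        for _, c := range conds {
            if cond, ok := c.(map[string]interface{}); ok && cond["type"] == "NodeReady" {
                complete = true // Status True or False both allow a new CR
            }
        }
        if !complete {
            return false, nil // completion condition missing: skip creation
        }
    }
    return true, nil
}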

Testing Your Integration

  1. Validate Template Syntax:

     # Dry-run mode to validate template without creating CRs
     $helm install nvsentinel --set global.dryRun=true ...
  2. Monitor CR Creation:

     # Watch for maintenance CRs
     $kubectl get rebootnodes -n nvsentinel -w
  3. Check Fault Remediation Logs:

     $kubectl logs -n nvsentinel deployment/fault-remediation -f

Configuration Location: distros/kubernetes/nvsentinel/charts/fault-remediation/values.yaml

4. How Do I Customize Drain Behavior? Configure Eviction Modes

Control how workloads are evicted from failing nodes.

The Node Drainer module handles graceful workload eviction from cordoned nodes. Eviction behavior can be customized per namespace to accommodate different workload types and operational requirements.

Eviction Modes

NVSentinel supports three eviction modes:

| Mode | Behavior | Use Case |
| --- | --- | --- |
| Immediate | Pod evicted immediately without waiting | Fast failover for stateless workloads |
| AllowCompletion | Wait for pod to gracefully terminate | Respects terminationGracePeriodSeconds for stateful workloads |
| DeleteAfterTimeout | Wait up to timeout, then force delete | Long-running jobs that need time to checkpoint |

Configuration

Configure eviction behavior in Helm values:

# distros/kubernetes/nvsentinel/charts/node-drainer/values.yaml
# Eviction timeout in seconds for pod eviction operations
evictionTimeoutInSeconds: "60"

# System namespaces are skipped during drain
systemNamespaces: "^(nvsentinel|kube-system|gpu-operator|gmp-system|network-operator)$"

# Time after which pods in DeleteAfterTimeout mode will be force deleted
deleteAfterTimeoutMinutes: 60

# Time after which a pod in NotReady state is considered stuck
notReadyTimeoutMinutes: 5

# Per-namespace eviction configuration
userNamespaces:
  # Default for all user namespaces
  - name: "*"
    mode: "AllowCompletion"

  # Fast failover for stateless web services
  - name: "web-tier"
    mode: "Immediate"

  # Allow ML training jobs to checkpoint before eviction
  - name: "ml-training"
    mode: "DeleteAfterTimeout"

Eviction Workflow

  1. System Namespace Skip: Pods in system namespaces (kube-system, nvsentinel, etc.) are never evicted
  2. Mode Selection: Eviction mode determined by namespace match (most specific wins; see the sketch after this list)
  3. Graceful Termination: Respects pod’s terminationGracePeriodSeconds for AllowCompletion mode
  4. Timeout Handling: Force deletes stuck or timed-out pods based on configuration
  5. NotReady Detection: Automatically force deletes pods stuck in NotReady state beyond threshold
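
A sketch of the "most specific wins" selection in step 2, assuming exact namespace names plus a "*" catch-all as shown in the configuration above (the function is illustrative, not NVSentinel's actual implementation):

package main

import "fmt"

type namespaceMode struct {
    Name string // exact namespace name, or "*" as the catch-all default
    Mode string // "Immediate", "AllowCompletion", or "DeleteAfterTimeout"
}

// evictionModeFor picks the eviction mode for a namespace: an exact
// match beats the "*" wildcard default, regardless of list order.
func evictionModeFor(ns string, configs []namespaceMode) string {
    mode := ""
    for _, c := range configs {
        if c.Name == ns {
            return c.Mode // exact match wins immediately
        }
        if c.Name == "*" {
            mode = c.Mode // remember the default
        }
    }
    return mode
}

func main() {
    configs := []namespaceMode{
        {Name: "*", Mode: "AllowCompletion"},
        {Name: "web-tier", Mode: "Immediate"},
        {Name: "ml-training", Mode: "DeleteAfterTimeout"},
    }
    fmt.Println(evictionModeFor("web-tier", configs)) // Immediate
    fmt.Println(evictionModeFor("payments", configs)) // AllowCompletion (default)
}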

Example: Multi-Tier Application

userNamespaces:
  # Critical database - wait for graceful shutdown
  - name: "database"
    mode: "AllowCompletion"

  # Batch processing - allow time for checkpoint
  - name: "batch-jobs"
    mode: "DeleteAfterTimeout"

  # Web frontend - fast failover
  - name: "frontend"
    mode: "Immediate"

  # Default for everything else
  - name: "*"
    mode: "AllowCompletion"

Configuration Location: distros/kubernetes/nvsentinel/charts/node-drainer/values.yaml

Topology Awareness (Topograph)

When Topograph is deployed in the cluster, it applies four node labels describing the physical network topology:

  • network.topology.nvidia.com/accelerator — NVLink domain (clique) ID
  • network.topology.nvidia.com/leaf — leaf switch identifier
  • network.topology.nvidia.com/spine — spine switch identifier
  • network.topology.nvidia.com/core — core switch identifier

These keys are included by default in the Metadata Augmentor’s allowedLabels, so NVSentinel automatically propagates them into health event metadata on clusters where Topograph has applied them. On clusters without Topograph, the labels are absent and the Metadata Augmentor simply skips them — no configuration change is required either way.

Downstream consumers of NVSentinel events (fault-quarantine CEL rules, remediation custom resources, dashboards, blast-radius analysis) can then reason about topological locality. For example, a CEL rule can compare the network.topology.nvidia.com/accelerator value across a set of recent events to determine whether a fault is isolated to a single NVLink domain or spans multiple.
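
For illustration, a consumer-side sketch that groups recent health events by the accelerator (NVLink domain) label to gauge blast radius; the healthEvent struct is hypothetical and stands in for whatever event shape your pipeline receives:

package main

import "fmt"

// healthEvent is a hypothetical stand-in for an NVSentinel health event
// carrying the node labels propagated by the Metadata Augmentor.
type healthEvent struct {
    NodeName string
    Labels   map[string]string
}

// nvlinkDomains counts events per NVLink domain. One domain with many
// events suggests a fault isolated to a single clique; several domains
// suggest a broader, fabric-level problem.
func nvlinkDomains(events []healthEvent) map[string]int {
    counts := map[string]int{}
    for _, e := range events {
        if d, ok := e.Labels["network.topology.nvidia.com/accelerator"]; ok {
            counts[d]++
        }
    }
    return counts
}

func main() {
    events := []healthEvent{
        {NodeName: "gpu-node-01", Labels: map[string]string{"network.topology.nvidia.com/accelerator": "clique-7"}},
        {NodeName: "gpu-node-02", Labels: map[string]string{"network.topology.nvidia.com/accelerator": "clique-7"}},
    }
    fmt.Println(nvlinkDomains(events)) // map[clique-7:2] => isolated to one NVLink domain
}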

The authoritative reference for these labels — value semantics, hashing behavior for long identifiers, and provider matrix — is topograph’s docs/reference/node-labels.md.

Error Code Mapping Reference

NVSentinel maps DCGM error codes to recommended actions using a canonical CSV file.

Mapping File: distros/kubernetes/nvsentinel/charts/gpu-health-monitor/files/dcgmerrorsmapping.csv

| Action | Meaning | Typical Resolution |
| --- | --- | --- |
| RESTART_VM | Software-recoverable error | Node reboot via janitor |
| COMPONENT_RESET | Hardware reset required | GPU/driver reset |
| CONTACT_SUPPORT | Manual intervention needed | Create support ticket, manual investigation |
| NONE | Health check informational | No action required |

Example Mappings

| DCGM Error Code | Recommended Action | Typical Condition |
| --- | --- | --- |
| DCGM_FR_FAULTY_MEMORY | CONTACT_SUPPORT | GpuMemoryError |
| DCGM_FR_VOLATILE_DBE_DETECTED | COMPONENT_RESET | GpuMemoryError |
| DCGM_FR_NVLINK_DOWN | RESTART_VM | NVLinkDown |
| DCGM_FR_NVSWITCH_FATAL_ERROR | CONTACT_SUPPORT | NVSwitchFatalError |
| DCGM_FR_CLOCK_THROTTLE_THERMAL | NONE | GpuThermalWatch |
| DCGM_FR_SXID_ERROR | RESTART_VM | GpuXidError |

The full mapping contains 121 error codes; see the CSV file for the complete reference.
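
If your tooling needs the same mapping, loading the CSV directly avoids hardcoding it. A sketch that assumes the first two columns are the DCGM error code and the recommended action (check the file's actual header before relying on this):

package main

import (
    "encoding/csv"
    "fmt"
    "os"
)

// loadDCGMActions reads the error-code-to-action mapping CSV.
// Assumed column layout: error code first, recommended action second.
func loadDCGMActions(path string) (map[string]string, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    rows, err := csv.NewReader(f).ReadAll()
    if err != nil {
        return nil, err
    }
    actions := make(map[string]string, len(rows))
    for i, row := range rows {
        if i == 0 || len(row) < 2 {
            continue // skip header and malformed rows
        }
        actions[row[0]] = row[1]
    }
    return actions, nil
}

func main() {
    actions, err := loadDCGMActions("dcgmerrorsmapping.csv")
    if err != nil {
        panic(err)
    }
    fmt.Println(actions["DCGM_FR_FAULTY_MEMORY"]) // e.g. CONTACT_SUPPORT
}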

Node Status Examples

Example 1: Node with Fatal GPU XID Error (With Optional Taint)

apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01
spec:
  unschedulable: true  # Cordoned (enabled by default)
  taints:
    # Optional - only present if configured in rulesets
    - key: "nvidia.com/gpu-xid-error"
      value: "true"
      effect: "NoSchedule"
status:
  conditions:
    - type: Ready
      status: "False"
      reason: "GpuHealthCheckFailed"
      message: "GPU health check failed"
    - type: SysLogsXIDError
      status: "True"
      reason: "HardwareFailure"
      message: "[DCGM_FR_SXID_ERROR] GPU XID error detected on GPU 0 - RecommendedAction: RESTART_VM"
      lastTransitionTime: "2025-11-06T10:00:00Z"

Example 2: Node with Non-Fatal GPU Thermal Issue

apiVersion: v1
kind: Node
metadata:
  name: gpu-node-02
spec:
  # May or may not be cordoned depending on ruleset configuration
  taints:
    # Optional - only present if configured in rulesets
    - key: "nvidia.com/gpu-thermal"
      value: "true"
      effect: "PreferNoSchedule"
status:
  conditions:
    - type: Ready
      status: "True"
    - type: GpuThermalWatch
      status: "True"
      reason: "ThermalThrottling"
      message: "[DCGM_FR_CLOCK_THROTTLE_THERMAL] GPU thermal throttling detected - RecommendedAction: NONE"
      lastTransitionTime: "2025-11-06T10:02:00Z"

Example 3: Healthy Node

apiVersion: v1
kind: Node
metadata:
  name: gpu-node-03
status:
  conditions:
    - type: Ready
      status: "True"
    - type: GpuMemWatch
      status: "False"
      reason: "HealthCheckPassed"
      message: "GPU memory health check passed"
      lastTransitionTime: "2025-11-06T10:10:00Z"
    - type: GpuThermalWatch
      status: "False"
      reason: "HealthCheckPassed"
      message: "GPU thermal health check passed"
      lastTransitionTime: "2025-11-06T10:10:00Z"

Implementation Notes

Module Responsibilities

| Module | Responsibility | What It Sets |
| --- | --- | --- |
| Platform Connectors | Process health events, update node status | NodeConditions |
| Fault Quarantine | Apply operational policies | Taints, cordon status |
| Node Drainer | Evict workloads | Drain nodes |
| Fault Remediation | Trigger maintenance | Create maintenance CRs |

Configuration Files

  • Error Mapping: distros/kubernetes/nvsentinel/charts/gpu-health-monitor/files/dcgmerrorsmapping.csv
  • Quarantine Rules: distros/kubernetes/nvsentinel/charts/fault-quarantine/values.yaml
  • Module Config: distros/kubernetes/nvsentinel/values.yaml

Code Locations

  • Condition Setting: platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • Taint Application: fault-quarantine/pkg/informer/k8s_client.go
  • Drain Logic: node-drainer/pkg/drainer/drainer.go
  • Remediation Triggering: fault-remediation/pkg/remediation/remediation.go

Contributing

This document describes the proposed API contract for NVSentinel node health signaling. Changes to condition types, taint keys, or label keys require review and follow the deprecation policy.

To propose changes:

  1. Open an issue describing the use case
  2. Discuss impact on external integrations
  3. Follow the versioning and deprecation guidelines
  4. Update this document as part of the PR