NVSentinel detects GPU and hardware failures and exposes them using standard Kubernetes primitives. This document provides a high-level overview of how to integrate with NVSentinel for scheduling, monitoring, and remediation.
Think of NVSentinel integration in five layers:
Is a node bad? → Check Taints
Why is a node bad? → Check Node Conditions
Can I use my own remediation? → Provide a Custom Resource
How do I customize drain behavior? → Configure per-namespace eviction modes
Should GPU pods run diagnostics before the workload starts? → Enable Preflight
For Scheduling Decisions:
Find nodes with NVSentinel taints (if configured):
For Monitoring:
Get detailed failure information:
For Pod Tolerations:
DATA_FLOW.md provides more context on this. At a high level, NVSentinel detects hardware failures and applies graduated responses via:
Use taints for all scheduling and automation decisions.
Taints are the primary signal that a node has hardware issues. External systems should watch for taint presence/absence to make scheduling decisions, trigger alerts, or initiate remediation workflows.
Note: Taints are optional and disabled by default. You must configure them in fault-quarantine rulesets by uncommenting the taint section. NVSentinel only cordons nodes by default.
Format: User-configurable via rulesets. Common patterns:
Option 1: Component-specific (recommended)
Option 2: Hierarchical (proposed pattern)
NVSentinel’s test suite demonstrates these taint configurations:
You can configure any taint keys/values in your rulesets based on your needs.
Taints are defined in Fault Quarantine rulesets. Here’s an example showing how to enable taints:
Key Points:
Taint keys are user-defined (nvidia.com/*, gpu.health/*, or any custom format)
Taint values are user-defined (true, fatal, degraded, etc.)
Check if node has any NVIDIA-related taints:
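A minimal sketch of this check in Python, assuming the node list has been exported with `kubectl get nodes -o json`. The function name `nodes_with_taint_prefix`, the sample node names, and the `nvidia.com/gpu-error` key are illustrative only; substitute the keys your own rulesets define.

```python
def nodes_with_taint_prefix(node_list, prefixes=("nvidia.com/",)):
    """Map node name -> matching taints for nodes whose taint keys start with a prefix."""
    prefixes = tuple(prefixes)
    matches = {}
    for node in node_list.get("items", []):
        # spec.taints is absent (or null) on untainted nodes.
        taints = node.get("spec", {}).get("taints") or []
        hits = [t for t in taints if t.get("key", "").startswith(prefixes)]
        if hits:
            matches[node["metadata"]["name"]] = hits
    return matches


# Hypothetical input, shaped like `kubectl get nodes -o json` output.
sample_nodes = {
    "items": [
        {
            "metadata": {"name": "node-a"},
            "spec": {"taints": [
                {"key": "nvidia.com/gpu-error", "value": "true", "effect": "NoSchedule"},
            ]},
        },
        {"metadata": {"name": "node-b"}, "spec": {}},
    ]
}

tainted = nodes_with_taint_prefix(sample_nodes)
print(sorted(tainted))
```

Matching on a key prefix rather than an exact key keeps the check usable when rulesets define several component-specific taints under one domain.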
Check for specific error type:
Tolerate specific taints in pod specs:
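To reason about which pods will still schedule onto a tainted node, it can help to model the standard Kubernetes toleration matching rules. The sketch below is not NVSentinel code. It covers only the `Equal` and `Exists` operators, and the taint key and values are hypothetical.

```python
def tolerates(toleration, taint):
    """Return True if a pod toleration matches a node taint (Equal/Exists operators only)."""
    effect = toleration.get("effect", "")
    # An empty effect in the toleration matches every taint effect.
    if effect and effect != taint.get("effect"):
        return False
    key = toleration.get("key", "")
    operator = toleration.get("operator", "Equal")  # Equal is the Kubernetes default
    if not key:
        # An empty key is only meaningful with Exists and then matches every key.
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


# Hypothetical taint; real keys and values come from your rulesets.
taint = {"key": "nvidia.com/gpu-error", "value": "fatal", "effect": "NoSchedule"}

print(tolerates({"key": "nvidia.com/gpu-error", "operator": "Exists"}, taint))
print(tolerates({"key": "nvidia.com/gpu-error", "operator": "Equal", "value": "degraded"}, taint))
```

A toleration using `Exists` on a specific key tolerates every value of that taint, which is usually broader than intended when taint values encode severity.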
Watch for taint changes (automation):
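One simple way to automate this is to compare two snapshots of a node's `spec.taints` and act on the difference. The sketch below assumes a polling approach and a hypothetical helper name `taint_changes`; a production controller would more likely use the Kubernetes watch API.

```python
def taint_changes(before, after):
    """Compare two taint lists for one node; return (added, removed) as sets of tuples."""
    def as_set(taints):
        return {
            (t.get("key"), t.get("value", ""), t.get("effect", ""))
            for t in taints or []
        }

    old, new = as_set(before), as_set(after)
    return new - old, old - new


# Hypothetical snapshots of spec.taints taken at two points in time.
previous = []
current = [{"key": "nvidia.com/gpu-error", "value": "true", "effect": "NoSchedule"}]

added, removed = taint_changes(previous, current)
print("added:", added, "removed:", removed)
```

An added taint can trigger an alert or remediation workflow, while a removed taint signals that the node is eligible for scheduling again.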
Use node conditions for monitoring, alerting, and detailed diagnostics.
While taints tell you “this node is bad”, conditions tell you why it’s bad. Use conditions for dashboards, alerts, and troubleshooting.
Prerequisites: Install kube-state-metrics to expose node conditions as Prometheus metrics. NVSentinel sets node conditions via the Kubernetes API, but kube-state-metrics is required to convert these into metrics.
Available Metrics:
Example Prometheus Alert:
Grafana Dashboard Query:
Note: NVSentinel also exposes its own Prometheus metrics for internal operations. See /nvsentinel/observability/metrics-reference for the complete list of NVSentinel-native metrics.
Platform Connectors set NodeConditions based on health monitor checks. Each condition explains what hardware component failed.
Naming: PascalCase, directly from health monitor check names
Examples: GpuMemWatch, GpuThermalWatch, SysLogsXIDError
NVSentinel uses different Kubernetes primitives based on error severity:
Why this design?
Non-fatal errors (like thermal throttling warnings or transient issues) create Kubernetes Events instead of node conditions. This prevents alert fatigue while still providing visibility.
View recent events for a node:
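As a rough sketch, events exported with `kubectl get events -o json` can be filtered per node in Python. This assumes core/v1 Event objects (with `involvedObject`, `type`, and `lastTimestamp` fields). The event reasons shown are placeholders, not actual NVSentinel reasons.

```python
def events_for_node(event_list, node_name, event_type=None):
    """Return events that reference the given Node, oldest first."""
    selected = []
    for ev in event_list.get("items", []):
        obj = ev.get("involvedObject", {})
        if obj.get("kind") != "Node" or obj.get("name") != node_name:
            continue
        if event_type and ev.get("type") != event_type:
            continue
        selected.append(ev)
    # RFC 3339 timestamps sort correctly as strings; missing ones sort first.
    return sorted(selected, key=lambda ev: ev.get("lastTimestamp") or "")


# Hypothetical event list; reasons are placeholders.
sample_events = {
    "items": [
        {"involvedObject": {"kind": "Node", "name": "node-a"}, "type": "Warning",
         "reason": "ExampleReasonB", "lastTimestamp": "2025-01-01T10:05:00Z"},
        {"involvedObject": {"kind": "Node", "name": "node-a"}, "type": "Warning",
         "reason": "ExampleReasonA", "lastTimestamp": "2025-01-01T10:00:00Z"},
        {"involvedObject": {"kind": "Pod", "name": "node-a"}, "type": "Normal",
         "reason": "Scheduled", "lastTimestamp": "2025-01-01T09:00:00Z"},
    ]
}

node_events = events_for_node(sample_events, "node-a", event_type="Warning")
print([ev["reason"] for ev in node_events])
```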
Filter for GPU-related events:
Watch for real-time events:
Example non-fatal event:
Integration patterns for events:
Messages include error codes and recommended actions:
Example:
GpuMemWatch - GPU memory failures (ECC errors, faulty memory)
GpuThermalWatch - Thermal throttling or temperature violations
GpuPcieWatch - PCIe link issues (replay rate, bandwidth)
GpuPowerWatch - Power-related issues
GpuInforomWatch - Inforom corruption detected
GpuSmWatch - Streaming Multiprocessor errors
GpuNvlinkWatch - NVLink connection failures
GpuMcuWatch - Microcontroller unit errors
GpuPmuWatch - Power management unit errors
GpuDriverWatch - GPU driver errors
GpuCpusetWatch - CPU affinity issues
SysLogsXIDError - GPU XID errors detected in system logs
SysLogsSXIDError - NVSwitch SXID errors detected in system logs
SysLogsGPUFallenOff - GPU fallen off bus errors detected in system logs
NVSwitchFatalError - Fatal NVSwitch hardware error
NVSwitchDown - NVSwitch unavailable
NVSwitchNonFatalError - Non-fatal NVSwitch errors (warnings)
DCGMError - DCGM daemon or API failures
CSPMaintenance - Cloud provider scheduled maintenance
SyslogError - System log analysis detected issues
Monitor specific condition types:
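A minimal Python sketch for picking out selected condition types from a single node object, such as the output of `kubectl get node <name> -o json`. It assumes a status of `"True"` means the fault is currently present; confirm that interpretation against your cluster before alerting on it. The sample node is hypothetical.

```python
def active_conditions(node, types):
    """Return the node's conditions whose type is in `types` and whose status is "True"."""
    conditions = node.get("status", {}).get("conditions") or []
    wanted = set(types)
    return [
        c for c in conditions
        if c.get("type") in wanted and c.get("status") == "True"
    ]


# Hypothetical node object; messages are omitted on purpose.
sample_node = {
    "metadata": {"name": "node-a"},
    "status": {"conditions": [
        {"type": "Ready", "status": "True"},
        {"type": "GpuMemWatch", "status": "True"},
        {"type": "GpuThermalWatch", "status": "False"},
    ]},
}

failing = active_conditions(sample_node, ["GpuMemWatch", "GpuThermalWatch"])
print([c["type"] for c in failing])
```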
Watch for condition changes:
Prometheus alert example:
client-go example:
NVSentinel triggers external systems by creating Kubernetes Custom Resources.
After detecting and draining a failing node, NVSentinel creates a CR that your controller watches. This gives you full control over remediation - integrate with cloud APIs, DCIM systems, or custom workflows.
Configure the maintenance CR template and behavior:
The template uses Go template syntax with these variables:
The Janitor controller watches for RebootNode and TerminateNode CRs:
Custom template for cloud-specific maintenance:
Template for data center infrastructure management:
Fault Remediation checks the completeConditionType status on existing CRs before creating new ones:
This prevents duplicate remediation requests for nodes with ongoing maintenance.
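The duplicate check described above can be sketched as follows. This is an illustration of the idea, not the Fault Remediation implementation. It assumes the CR reports progress under `status.conditions`, and that status `"True"` on the condition named by `completeConditionType` marks the earlier request as finished. The condition name `Completed` is a placeholder.

```python
def should_create_cr(existing_cr, complete_condition_type):
    """Return True when no maintenance request is still in flight for the node."""
    if existing_cr is None:
        return True  # no earlier request exists
    conditions = existing_cr.get("status", {}).get("conditions") or []
    for cond in conditions:
        if cond.get("type") == complete_condition_type:
            # Assumed rule: status "True" marks the earlier request as finished.
            return cond.get("status") == "True"
    # Condition not reported yet, so treat maintenance as still ongoing.
    return False


# Hypothetical CRs for the same node.
in_progress = {"status": {"conditions": []}}
finished = {"status": {"conditions": [{"type": "Completed", "status": "True"}]}}

print(should_create_cr(None, "Completed"))
print(should_create_cr(in_progress, "Completed"))
print(should_create_cr(finished, "Completed"))
```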
Validate Template Syntax:
Monitor CR Creation:
Check Fault Remediation Logs:
Configuration Location: distros/kubernetes/nvsentinel/charts/fault-remediation/values.yaml
Control how workloads are evicted from failing nodes.
The Node Drainer module handles graceful workload eviction from cordoned nodes. Eviction behavior can be customized per namespace to accommodate different workload types and operational requirements.
NVSentinel supports three eviction modes:
Configure eviction behavior in Helm values:
terminationGracePeriodSeconds for AllowCompletion mode
Configuration Location: distros/kubernetes/nvsentinel/charts/node-drainer/values.yaml
When Topograph is deployed in the cluster, it applies four node labels describing the physical network topology:
network.topology.nvidia.com/accelerator - NVLink domain (clique) ID
network.topology.nvidia.com/leaf - leaf switch identifier
network.topology.nvidia.com/spine - spine switch identifier
network.topology.nvidia.com/core - core switch identifier
These keys are included by default in the Metadata Augmentor’s allowedLabels, so NVSentinel automatically propagates them into health event metadata on clusters where Topograph has applied them. On clusters without Topograph, the labels are absent and the Metadata Augmentor simply skips them; no configuration change is required either way.
Downstream consumers of NVSentinel events (fault-quarantine CEL rules, remediation custom resources, dashboards, blast-radius analysis) can then reason about topological locality. For example, a CEL rule can compare the network.topology.nvidia.com/accelerator value across a set of recent events to determine whether a fault is isolated to a single NVLink domain or spans multiple.
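The same locality comparison can be sketched outside CEL. The Python example below assumes each event is available as a dictionary with a `metadata` mapping that holds the propagated labels; that shape, the function names, and the domain IDs are hypothetical.

```python
ACCELERATOR_LABEL = "network.topology.nvidia.com/accelerator"


def nvlink_domains(events, label=ACCELERATOR_LABEL):
    """Collect the distinct label values across events that carry the label."""
    return {
        ev["metadata"][label]
        for ev in events
        if label in ev.get("metadata", {})
    }


def fault_scope(events):
    """Classify a set of events by how many NVLink domains they touch."""
    domains = nvlink_domains(events)
    if not domains:
        return "unknown"  # e.g. Topograph is not deployed
    return "single-domain" if len(domains) == 1 else "multi-domain"


# Hypothetical recent events; domain IDs are placeholders.
recent = [
    {"node": "node-a", "metadata": {ACCELERATOR_LABEL: "domain-1"}},
    {"node": "node-b", "metadata": {ACCELERATOR_LABEL: "domain-1"}},
    {"node": "node-c", "metadata": {}},
]

print(fault_scope(recent))
```

Events without the label are ignored rather than counted as a separate domain, which matches the behavior on clusters where Topograph has not applied the labels.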
The authoritative reference for these labels (value semantics, hashing behavior for long identifiers, and provider matrix) is Topograph’s docs/reference/node-labels.md.
NVSentinel maps DCGM error codes to recommended actions using a canonical CSV file.
Mapping File: distros/kubernetes/nvsentinel/charts/gpu-health-monitor/files/dcgmerrorsmapping.csv
The full mapping contains 121 error codes. See the CSV file for the complete reference.
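A consumer that wants the same mapping could load the CSV into a lookup table. The sketch below assumes the file has a header row; the column names and rows are placeholders made up for illustration and are not taken from dcgmerrorsmapping.csv, so check the real header before using it.

```python
import csv
import io


def load_action_map(csv_text, code_column, action_column):
    """Build {error_code: recommended_action} from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row[code_column].strip(): row[action_column].strip()
        for row in reader
    }


# Placeholder content; NOT the real file's columns or values.
example_csv = """error_code,recommended_action
EXAMPLE_ERROR_A,EXAMPLE_ACTION_X
EXAMPLE_ERROR_B,EXAMPLE_ACTION_Y
"""

actions = load_action_map(example_csv, "error_code", "recommended_action")
print(actions.get("EXAMPLE_ERROR_A"))
print(actions.get("UNKNOWN_CODE", "no mapping"))
```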
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/files/dcgmerrorsmapping.csv
distros/kubernetes/nvsentinel/charts/fault-quarantine/values.yaml
distros/kubernetes/nvsentinel/values.yaml
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
fault-quarantine/pkg/informer/k8s_client.go
node-drainer/pkg/drainer/drainer.go
fault-remediation/pkg/remediation/remediation.go
This document describes the proposed API contract for NVSentinel node health signaling. Changes to condition types, taint keys, or label keys require review and follow the deprecation policy.
To propose changes: