# NVSentinel Integration Guide
NVSentinel detects GPU and hardware failures and exposes them using standard Kubernetes primitives. This document provides a high-level overview of how to integrate with NVSentinel for scheduling, monitoring, and remediation.
## Integration Model
Think of NVSentinel integration in five layers:
- **Is a node bad? → Check Taints**
  - Taints mark nodes with hardware issues
  - Use taints for scheduling decisions and filtering
  - React to taint presence/absence in automation
- **Why is a node bad? → Check Node Conditions**
  - Conditions provide detailed diagnostic information
  - Use conditions for monitoring, alerting, and dashboards
  - Each condition explains what hardware component failed
- **Can I use my own remediation? → Provide a Custom Resource**
  - NVSentinel triggers external systems via CRs
  - Integrate with cloud APIs, DCIM, or custom controllers
  - You retain full control over how nodes are repaired
- **How do I customize drain behavior? → Configure per-namespace eviction modes**
  - Control how workloads are evicted from failing nodes
  - Define different policies for stateless vs. stateful workloads
  - Set timeouts and grace periods per namespace
- **Should GPU pods run diagnostics before the workload starts? → Enable Preflight**
  - Opt-in per namespace; a webhook injects init-container checks (DCGM, optional NCCL)
  - Multi-node jobs use gang discovery (native Workload API or PodGroup-style schedulers like Volcano and Run:ai)
  - Separate from the MongoDB health-event pipeline (see Data Flow)
## Quick Start

**For Scheduling Decisions:**
Find nodes with NVSentinel taints (if configured):
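A minimal sketch, assuming `nvidia.com/*` taint keys; substitute whatever key prefix your rulesets define:

```bash
kubectl get nodes -o json | jq -r \
  '.items[] | select(any(.spec.taints[]?; .key | startswith("nvidia.com/"))) | .metadata.name'
```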
**For Monitoring:**
Get detailed failure information:
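For example, with standard kubectl (no NVSentinel-specific tooling required):

```bash
# Human-readable: the Conditions section lists type, status, and message
kubectl describe node <node-name>

# Machine-readable: conditions as JSON
kubectl get node <node-name> -o jsonpath='{.status.conditions}' | jq .
```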
**For Pod Tolerations:**
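A sketch using the component-specific taint pattern shown later in this guide; match the key and effect to your ruleset configuration:

```yaml
tolerations:
  - key: "nvidia.com/gpu-mem"   # illustrative; taint keys are user-configurable
    operator: "Exists"
    effect: "NoSchedule"
```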
## Architecture

DATA_FLOW.md provides more detail; at a high level, NVSentinel detects hardware failures and applies graduated responses via:
- Detection: Health monitors check GPU, system logs, and cloud maintenance events
- Classification: Platform connectors validate and set node conditions
- Quarantine: Fault quarantine evaluates rules and applies taints/cordons
- Evacuation: Node drainer evicts workloads per configured policies
- Remediation: Fault remediation triggers external systems via CRs
## 1. Is a Node Bad? Check Taints
Use taints for all scheduling and automation decisions.
Taints are the primary signal that a node has hardware issues. External systems should watch for taint presence/absence to make scheduling decisions, trigger alerts, or initiate remediation workflows.
**Note:** Taints are optional and disabled by default. You must configure them in fault-quarantine rulesets by uncommenting the `taint` section. NVSentinel only cordons nodes by default.
### Taint Structure
**Format:** User-configurable via rulesets. Common patterns:
**Option 1: Component-specific (recommended)**
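For example (key name illustrative; you define it in your rulesets):

```yaml
key: nvidia.com/gpu-mem    # one taint key per failing component
value: "true"
effect: NoSchedule
```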
**Option 2: Hierarchical (proposed pattern)**
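For example:

```yaml
key: gpu.health/status     # single key; severity carried in the value
value: fatal               # e.g. fatal, degraded
effect: NoSchedule
```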
### Default Taint Examples
NVSentinel’s test suite demonstrates working taint configurations.
You can configure any taint keys/values in your rulesets based on your needs.
### Taint Effect Guidelines
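The effects themselves are standard Kubernetes semantics; the severity mapping below is a suggested starting point, not an NVSentinel requirement:

| Effect | Behavior | Suggested use |
|--------|----------|---------------|
| `NoSchedule` | New pods without a matching toleration are not scheduled | Fatal hardware errors |
| `PreferNoSchedule` | Scheduler avoids the node when possible | Degraded but usable nodes |
| `NoExecute` | Existing pods without a toleration are evicted | Failures requiring immediate evacuation |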
### Configuring Taints
Taints are defined in Fault Quarantine rulesets. Here’s an example showing how to enable taints:
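A minimal sketch of an enabled `taint` block; the authoritative ruleset schema lives in the fault-quarantine chart's values.yaml, so treat the field names and CEL expression here as illustrative:

```yaml
rulesets:
  - name: gpu-memory-fatal
    # CEL rule matching fatal GPU memory events (expression illustrative)
    rule: 'event.conditionType == "GpuMemWatch" && event.isFatal'
    cordon: true               # cordoning is on by default
    # Uncomment to also taint the node:
    taint:
      key: nvidia.com/gpu-mem
      value: "true"
      effect: NoSchedule
```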
**Key Points:**

- Taints are commented out by default; you must enable them
- You control the taint key format (`nvidia.com/*`, `gpu.health/*`, or any custom format)
- You control the taint values (`true`, `fatal`, `degraded`, etc.)
- Cordoning is enabled by default; tainting is opt-in
### Integration Patterns
**Check if node has any NVIDIA-related taints:**
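Assuming `nvidia.com/*` keys:

```bash
kubectl get node <node-name> -o jsonpath='{.spec.taints}' | grep -q 'nvidia.com/' \
  && echo "node is tainted" || echo "node is clean"
```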
**Check for specific error type:**
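For example, nodes tainted for GPU memory failures (key illustrative):

```bash
kubectl get nodes -o json | jq -r \
  '.items[] | select(any(.spec.taints[]?; .key == "nvidia.com/gpu-mem")) | .metadata.name'
```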
**Tolerate specific taints in pod specs:**
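For example, a diagnostics pod that is allowed onto degraded nodes (key/value illustrative):

```yaml
spec:
  tolerations:
    - key: "gpu.health/status"
      operator: "Equal"
      value: "degraded"
      effect: "NoSchedule"
```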
**Watch for taint changes (automation):**
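A simple shell sketch; production automation would typically use an informer instead (see the client-go example in section 2):

```bash
# Emits a name/taints line whenever a node object changes
kubectl get nodes --watch -o jsonpath='{.metadata.name}{"\t"}{.spec.taints}{"\n"}'
```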
## 2. Why is a Node Bad? Check Node Conditions
Use node conditions for monitoring, alerting, and detailed diagnostics.
While taints tell you “this node is bad”, conditions tell you why it’s bad. Use conditions for dashboards, alerts, and troubleshooting.
### Monitoring with kube-state-metrics
**Prerequisites:** Install kube-state-metrics to expose node conditions as Prometheus metrics. NVSentinel sets node conditions via the Kubernetes API, but kube-state-metrics is required to convert these into metrics.
**Available Metrics:**
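Node conditions surface through kube-state-metrics' standard `kube_node_status_condition` metric, for example:

```text
kube_node_status_condition{node="gpu-node-1", condition="GpuMemWatch", status="true"} 1
```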
**Example Prometheus Alert:**
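A sketch; the alert name and thresholds are yours to choose:

```yaml
groups:
  - name: nvsentinel
    rules:
      - alert: NodeGpuMemoryFailure
        expr: kube_node_status_condition{condition="GpuMemWatch", status="true"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory failure on {{ $labels.node }}"
```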
**Grafana Dashboard Query:**
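For example, a panel counting nodes per active NVSentinel condition:

```promql
sum by (condition) (
  kube_node_status_condition{condition=~"Gpu.*|SysLogs.*|NVSwitch.*", status="true"}
)
```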
**Note:** NVSentinel also exposes its own Prometheus metrics for internal operations. See `/nvsentinel/observability/metrics-reference` for the complete list of NVSentinel-native metrics.
### Condition Structure
Platform Connectors set NodeConditions based on health monitor checks. Each condition explains what hardware component failed.
**Naming:** PascalCase, taken directly from health monitor check names
**Examples:** `GpuMemWatch`, `GpuThermalWatch`, `SysLogsXIDError`
### Condition vs Event Behavior
NVSentinel uses different Kubernetes primitives based on error severity:
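In summary:

| Severity | Primitive | Rationale |
|----------|-----------|-----------|
| Fatal / actionable | Node condition | Durable state that drives cordon, drain, and remediation |
| Non-fatal / warning | Kubernetes Event | Transient visibility without node isolation |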
Why this design?
- Conditions are durable state - used for errors that require action (cordon, drain, remediation)
- Events are transient notifications - used for warnings and non-critical issues that don’t require node isolation
### Using Events for Non-Fatal Errors
Non-fatal errors (like thermal throttling warnings or transient issues) create Kubernetes Events instead of node conditions. This prevents alert fatigue while still providing visibility.
**View recent events for a node:**
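```bash
kubectl get events \
  --field-selector involvedObject.kind=Node,involvedObject.name=<node-name> \
  --sort-by='.lastTimestamp'
```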
**Filter for GPU-related events:**
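```bash
kubectl get events -A --field-selector involvedObject.kind=Node | grep -i gpu
```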
**Watch for real-time events:**
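```bash
kubectl get events --watch --field-selector involvedObject.kind=Node
```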
**Example non-fatal event:**
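Illustrative only; the exact reason and message strings come from the health monitors:

```text
LAST SEEN   TYPE      REASON            OBJECT            MESSAGE
2m          Warning   GpuThermalWatch   node/gpu-node-1   GPU 3: thermal throttling detected (non-fatal)
```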
**Integration patterns for events:**
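For example, forwarding GPU-related node events to a webhook (the endpoint and reason filter are hypothetical):

```bash
kubectl get events --watch --field-selector involvedObject.kind=Node -o json \
  | jq --unbuffered -c 'select(.reason | test("Gpu|Thermal|Xid"; "i"))' \
  | while read -r ev; do
      curl -s -X POST -H 'Content-Type: application/json' \
        -d "$ev" https://alerts.example.com/nvsentinel    # hypothetical endpoint
    done
```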
### Condition Status
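NVSentinel conditions follow the usual Kubernetes convention for error-style conditions:

| Status | Meaning |
|--------|---------|
| `True` | The failure is currently present |
| `False` | The check is passing or the failure has cleared |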
### Condition Message Format
Messages include error codes and recommended actions:
**Example:**
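The general shape (actual codes come from the DCGM error mapping described later in this guide):

```text
<check name>: <error description> (error code <code>). Recommended action: <ACTION>
```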
### Standard Condition Types
#### GPU Conditions (from GPU Health Monitor - DCGM)

- `GpuMemWatch` - GPU memory failures (ECC errors, faulty memory)
- `GpuThermalWatch` - Thermal throttling or temperature violations
- `GpuPcieWatch` - PCIe link issues (replay rate, bandwidth)
- `GpuPowerWatch` - Power-related issues
- `GpuInforomWatch` - Inforom corruption detected
- `GpuSmWatch` - Streaming Multiprocessor errors
- `GpuNvlinkWatch` - NVLink connection failures
- `GpuMcuWatch` - Microcontroller unit errors
- `GpuPmuWatch` - Power management unit errors
- `GpuDriverWatch` - GPU driver errors
- `GpuCpusetWatch` - CPU affinity issues
#### Syslog Conditions (from Syslog Health Monitor)

- `SysLogsXIDError` - GPU XID errors detected in system logs
- `SysLogsSXIDError` - NVSwitch SXID errors detected in system logs
- `SysLogsGPUFallenOff` - GPU fallen off bus errors detected in system logs
#### NVSwitch Conditions

- `NVSwitchFatalError` - Fatal NVSwitch hardware error
- `NVSwitchDown` - NVSwitch unavailable
- `NVSwitchNonFatalError` - Non-fatal NVSwitch errors (warnings)
#### System Conditions

- `DCGMError` - DCGM daemon or API failures
- `CSPMaintenance` - Cloud provider scheduled maintenance
- `SyslogError` - System log analysis detected issues
### Integration Patterns
**Monitor specific condition types:**
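```bash
# Status of a single condition type across all nodes
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="GpuMemWatch")].status}{"\n"}{end}'
```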
**Watch for condition changes:**
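```bash
kubectl get node <node-name> --watch \
  -o jsonpath='{.status.conditions[?(@.type=="GpuThermalWatch")]}{"\n"}'
```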
**Prometheus alert example:**
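Another rule sketch, this time paging on fatal NVSwitch errors:

```yaml
- alert: NVSwitchFatalError
  expr: kube_node_status_condition{condition="NVSwitchFatalError", status="true"} == 1
  for: 1m
  labels:
    severity: critical
```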
**client-go example:**
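A minimal sketch using a raw watch (a production controller would use an informer with resync); assumes in-cluster credentials:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster credentials; use clientcmd for out-of-cluster development.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Raw watch on all nodes.
	w, err := clientset.CoreV1().Nodes().Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		node, ok := ev.Object.(*corev1.Node)
		if !ok {
			continue
		}
		for _, cond := range node.Status.Conditions {
			// NVSentinel condition types mirror health monitor check names.
			if cond.Type == "GpuMemWatch" && cond.Status == corev1.ConditionTrue {
				fmt.Printf("node %s: %s: %s\n", node.Name, cond.Type, cond.Message)
			}
		}
	}
}
```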
## 3. Can I Use My Own Remediation? Provide a Custom Resource
NVSentinel triggers external systems by creating Kubernetes Custom Resources.
After detecting and draining a failing node, NVSentinel creates a CR that your controller watches. This gives you full control over remediation - integrate with cloud APIs, DCIM systems, or custom workflows.
### Integration Architecture
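At a high level:

```text
Fault Remediation ──creates──▶ maintenance CR (e.g. RebootNode)
        ▲                               │
        │ checks completion condition   ▼
        └────────────────── your controller (cloud API / DCIM / custom)
```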
### Configuration
Configure the maintenance CR template and behavior:
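A sketch of the relevant values (field names illustrative; the chart's values.yaml, linked below, is authoritative):

```yaml
faultRemediation:
  # Go template rendered into the maintenance CR (variables illustrative)
  crTemplate: |
    apiVersion: maintenance.example.com/v1alpha1   # hypothetical group/version
    kind: {{ .RecommendedAction }}                 # e.g. RebootNode, TerminateNode
    metadata:
      name: {{ .NodeName }}-maintenance
    spec:
      nodeName: {{ .NodeName }}
  # Status condition type checked for completion detection
  completeConditionType: Complete
```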
### Custom Resource Template
The template uses Go template syntax; the available variables are supplied by Fault Remediation (the sketches in this guide assume names like `{{ .NodeName }}` and `{{ .RecommendedAction }}`, which are illustrative).
### RecommendedAction Codes
### Integration Examples
#### Example 1: Janitor Controller Integration
The Janitor controller watches for `RebootNode` and `TerminateNode` CRs:
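An illustrative CR instance as the Janitor controller would see it (apiVersion hypothetical; kind comes from your template):

```yaml
apiVersion: maintenance.example.com/v1alpha1
kind: RebootNode
metadata:
  name: gpu-node-1-maintenance
spec:
  nodeName: gpu-node-1
status:
  conditions:
    - type: Complete      # Fault Remediation's completeConditionType
      status: "True"      # set by Janitor when the reboot finishes
```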
#### Example 2: Cloud Provider Integration
Custom template for cloud-specific maintenance:
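A hedged sketch; the CRD name and fields are placeholders for whatever your cloud controller consumes:

```yaml
crTemplate: |
  apiVersion: maintenance.example.com/v1alpha1
  kind: CloudNodeRepair
  metadata:
    name: {{ .NodeName }}-repair
  spec:
    nodeName: {{ .NodeName }}
    action: {{ .RecommendedAction }}
    provider: aws            # hypothetical field
```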
#### Example 3: DCIM Integration
Template for data center infrastructure management:
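Similarly hedged; a ticket-style CR that a DCIM bridge might watch:

```yaml
crTemplate: |
  apiVersion: maintenance.example.com/v1alpha1
  kind: DcimTicket
  metadata:
    name: {{ .NodeName }}-ticket
  spec:
    nodeName: {{ .NodeName }}
    severity: critical
    summary: "NVSentinel recommended action: {{ .RecommendedAction }}"
```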
### Completion Detection
Fault Remediation checks the `completeConditionType` status on existing CRs before creating new ones:

- `Status: True` - Maintenance completed successfully; a new CR can be created
- `Status: False` - Maintenance failed; a new CR can be created for retry
- Condition missing - Maintenance in progress; skip CR creation
This prevents duplicate remediation requests for nodes with ongoing maintenance.
### Testing Your Integration
**1. Validate Template Syntax:**
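One approach: render the template with sample values, then validate the result server-side (sketch):

```bash
kubectl apply --dry-run=server -f rendered-cr.yaml
```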
**2. Monitor CR Creation:**
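The resource name depends on the CRD your template targets:

```bash
kubectl get rebootnodes -A --watch    # illustrative resource name
```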
**3. Check Fault Remediation Logs:**
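The namespace matches NVSentinel's install namespace; the deployment name is assumed:

```bash
kubectl logs -n nvsentinel deployment/fault-remediation -f
```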
**Configuration Location:** `distros/kubernetes/nvsentinel/charts/fault-remediation/values.yaml`
## 4. How Do I Customize Drain Behavior? Configure Eviction Modes
Control how workloads are evicted from failing nodes.
The Node Drainer module handles graceful workload eviction from cordoned nodes. Eviction behavior can be customized per namespace to accommodate different workload types and operational requirements.
### Eviction Modes

NVSentinel supports three eviction modes, selected per namespace via the configuration below.
### Configuration
Configure eviction behavior in Helm values:
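A sketch (field and mode names are illustrative except `AllowCompletion`, which appears in the workflow below; the chart's values.yaml is authoritative):

```yaml
nodeDrainer:
  defaultMode: AllowCompletion
  namespaceOverrides:
    - namespace: batch-training
      mode: AllowCompletion        # let jobs run to completion
      timeoutSeconds: 3600
    - namespace: web-frontend
      mode: Immediate              # hypothetical mode name
```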
### Eviction Workflow

- **System Namespace Skip:** Pods in system namespaces (`kube-system`, `nvsentinel`, etc.) are never evicted
- **Mode Selection:** The eviction mode is determined by namespace match (most specific wins)
- **Graceful Termination:** Respects the pod's `terminationGracePeriodSeconds` in `AllowCompletion` mode
- **Timeout Handling:** Force deletes stuck or timed-out pods based on configuration
- **NotReady Detection:** Automatically force deletes pods stuck in NotReady state beyond a threshold
### Example: Multi-Tier Application
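For instance, evict stateless tiers immediately while letting stateful tiers finish (same illustrative schema as above):

```yaml
namespaceOverrides:
  - namespace: frontend
    mode: Immediate                # hypothetical mode name
  - namespace: postgres
    mode: AllowCompletion
    timeoutSeconds: 7200           # give the database time to checkpoint
```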
**Configuration Location:** `distros/kubernetes/nvsentinel/charts/node-drainer/values.yaml`
## Topology Awareness (Topograph)
When Topograph is deployed in the cluster, it applies four node labels describing the physical network topology:
- `network.topology.nvidia.com/accelerator` — NVLink domain (clique) ID
- `network.topology.nvidia.com/leaf` — leaf switch identifier
- `network.topology.nvidia.com/spine` — spine switch identifier
- `network.topology.nvidia.com/core` — core switch identifier
These keys are included by default in the Metadata Augmentor’s allowedLabels, so NVSentinel automatically propagates them into health event metadata on clusters where Topograph has applied them. On clusters without Topograph, the labels are absent and the Metadata Augmentor simply skips them — no configuration change is required either way.
Downstream consumers of NVSentinel events (fault-quarantine CEL rules, remediation custom resources, dashboards, blast-radius analysis) can then reason about topological locality. For example, a CEL rule can compare the `network.topology.nvidia.com/accelerator` value across a set of recent events to determine whether a fault is isolated to a single NVLink domain or spans multiple.
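A hedged sketch of such a rule (the `events` variable and metadata field paths are illustrative):

```text
events.all(e,
  e.metadata['network.topology.nvidia.com/accelerator'] ==
  events[0].metadata['network.topology.nvidia.com/accelerator'])
```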
The authoritative reference for these labels — value semantics, hashing behavior for long identifiers, and provider matrix — is Topograph's `docs/reference/node-labels.md`.
## Error Code Mapping Reference
NVSentinel maps DCGM error codes to recommended actions using a canonical CSV file.
**Mapping File:** `distros/kubernetes/nvsentinel/charts/gpu-health-monitor/files/dcgmerrorsmapping.csv`
### Recommended Actions
### Example Mappings

The full mapping contains 121 error codes; see the CSV file for the complete reference.
## Node Status Examples
### Example 1: Node with Fatal GPU XID Error (With Optional Taint)
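Illustrative fragments of the node object (values assumed):

```yaml
spec:
  unschedulable: true              # cordoned by Fault Quarantine
  taints:
    - key: nvidia.com/gpu-xid      # present only if tainting is enabled
      value: "true"
      effect: NoSchedule
status:
  conditions:
    - type: SysLogsXIDError
      status: "True"
      message: "XID error detected in system logs (illustrative message)"
```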
### Example 2: Node with Non-Fatal GPU Thermal Issue
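Per the condition-vs-event design above, a non-fatal thermal issue surfaces as an Event and the node stays schedulable (sketch):

```text
$ kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=gpu-node-2
LAST SEEN   TYPE      REASON            OBJECT            MESSAGE
5m          Warning   GpuThermalWatch   node/gpu-node-2   GPU 0: thermal throttling (non-fatal)
```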
### Example 3: Healthy Node
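A healthy node carries no NVSentinel taints, is not cordoned, and any NVSentinel conditions report `False` (sketch):

```yaml
spec: {}                            # no taints, unschedulable not set
status:
  conditions:
    - type: GpuMemWatch
      status: "False"               # check passing
```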
## Implementation Notes
### Module Responsibilities
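Summarizing the pipeline described under Architecture:

| Module | Responsibility |
|--------|----------------|
| Health monitors (GPU/DCGM, syslog, CSP) | Detect hardware failures and maintenance events |
| Platform Connectors | Validate events and set node conditions |
| Fault Quarantine | Evaluate rulesets; cordon and optionally taint nodes |
| Node Drainer | Evict workloads per configured eviction policies |
| Fault Remediation | Create maintenance CRs for external systems |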
### Configuration Files

- **Error Mapping:** `distros/kubernetes/nvsentinel/charts/gpu-health-monitor/files/dcgmerrorsmapping.csv`
- **Quarantine Rules:** `distros/kubernetes/nvsentinel/charts/fault-quarantine/values.yaml`
- **Module Config:** `distros/kubernetes/nvsentinel/values.yaml`
### Code Locations

- **Condition Setting:** `platform-connectors/pkg/connectors/kubernetes/process_node_events.go`
- **Taint Application:** `fault-quarantine/pkg/informer/k8s_client.go`
- **Drain Logic:** `node-drainer/pkg/drainer/drainer.go`
- **Remediation Triggering:** `fault-remediation/pkg/remediation/remediation.go`
## Related Documentation
- ADR-003: Rule-Based Node Quarantine - CEL-based quarantine rules
- ADR-009: Fault Remediation Triggering - Remediation workflow
- Data Flow Documentation - End-to-end event flow
- Helm Chart Configuration - Deployment configuration
## Contributing
This document describes the proposed API contract for NVSentinel node health signaling. Changes to condition types, taint keys, or label keys require review and follow the deprecation policy.
To propose changes:
- Open an issue describing the use case
- Discuss impact on external integrations
- Follow the versioning and deprecation guidelines
- Update this document as part of the PR