NVSentinel Metrics
This document outlines all Prometheus metrics exposed by NVSentinel components.
Table of Contents
- Fault Quarantine Module
- Node Drainer Module
- Fault Remediation Module
- Labeler Module
- Janitor
- Platform Connectors
- Health Monitors
Fault Quarantine Module
Event Processing Metrics
Node Quarantine Metrics
Taint and Cordon Metrics
Ruleset Evaluation Metrics
Circuit Breaker Metrics
Node Drainer Module
Event Processing Metrics
Node Draining Metrics
Fault Remediation Module
Event Processing Metrics
Log Collector Metrics
File Server Metrics
HTTP Request Metrics
Log Rotation Metrics
Labeler Module
Event Processing Metrics
Janitor
Action Metrics
Platform Connectors
Kubernetes Connector Metrics
Workqueue Metrics
These metrics track the internal ring buffer workqueue performance:
Note: <name> in the metric names is replaced with the actual workqueue name at runtime.
Health Monitors
GPU Health Monitor
These metrics track GPU health events detected via DCGM (Data Center GPU Manager):
Syslog Health Monitor
The syslog health monitor tracks GPU-related errors detected from system logs.
XID Error Metrics
XID (GPU Error ID) errors are NVIDIA GPU driver errors:
SXID Error Metrics
SXID errors are NVSwitch-related errors:
GPU Fallen Off Bus Metrics
CSP Health Monitor
The CSP health monitor tracks cloud provider maintenance events and node health issues.
CSP Client Metrics
Event Processing Metrics
Datastore Metrics
Trigger Engine Metrics
Metrics Configuration
Scraping Metrics
All NVSentinel components expose Prometheus metrics on a metrics endpoint (typically :2112/metrics). The metrics can be scraped by Prometheus using standard scrape configurations.
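As a rough illustration of what a scrape returns, the sketch below fetches a metrics endpoint and parses the Prometheus text exposition format into tuples. It is a minimal sketch, not part of NVSentinel: the function names `parse_metrics` and `scrape` are hypothetical, the default URL simply assumes the typical `:2112/metrics` endpoint on localhost, and the parser ignores timestamps and other less common parts of the format.

```python
import re
from urllib.request import urlopen

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value" (escaped characters allowed).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def scrape(url="http://localhost:2112/metrics", timeout=5.0):
    """Fetch and parse a metrics endpoint (the default URL is an assumption)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling `scrape()` against a port-forwarded component pod would return one tuple per exposed sample, which is handy for quick checks before a Prometheus server is wired up.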
Helm Chart Configuration
The NVSentinel Helm chart automatically creates a PodMonitor resource for Prometheus Operator integration:
The PodMonitor scrapes every NVSentinel component pod at the /metrics path on the container port named metrics.
Annotation-based Discovery
Components can be configured to include Prometheus scrape annotations:
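To make annotation-based discovery concrete, the sketch below shows how a scraper might turn pod annotations into a scrape URL. It assumes the widely used community convention of `prometheus.io/scrape`, `prometheus.io/port`, and `prometheus.io/path` annotations; whether NVSentinel sets exactly these keys is an assumption, and `scrape_target` is a hypothetical helper.

```python
def scrape_target(pod_ip, annotations):
    """Return the scrape URL for a pod, or None if it does not opt in.

    The annotation keys follow the common prometheus.io/* convention,
    which is assumed here rather than confirmed for NVSentinel.
    """
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    # Fall back to the typical NVSentinel endpoint when a key is absent.
    port = annotations.get("prometheus.io/port", "2112")
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"http://{pod_ip}:{port}{path}"


# Example: a pod that opts in and declares only its port.
print(scrape_target("10.0.0.7", {
    "prometheus.io/scrape": "true",
    "prometheus.io/port": "2112",
}))
```

Note that these annotations only take effect when the Prometheus scrape configuration contains relabeling rules that honor them; they are a convention, not built-in Prometheus behavior.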
Metric Types Reference
- Counter: A cumulative metric that only increases or resets to zero on restart
- Gauge: A metric that can arbitrarily go up and down
- Histogram: Samples observations and counts them in configurable buckets
- Summary: Similar to histogram but calculates configurable quantiles over a sliding time window
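The counter and histogram semantics above can be sketched in a few lines. This is an illustration of the general Prometheus behavior, not NVSentinel code: `counter_increase` and `bucket_counts` are hypothetical helpers, and the reset handling is a simplified version of what query functions such as `increase()` do (no extrapolation).

```python
def counter_increase(samples):
    """Total increase of a counter across a series of scraped values.

    A drop between consecutive samples is treated as a reset to zero,
    so the post-reset value counts in full as new increase.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # the process restarted and the counter began again
    return total


def bucket_counts(observations, upper_bounds):
    """Cumulative histogram buckets: each counts observations <= its bound."""
    bounds = sorted(upper_bounds) + [float("inf")]  # +Inf bucket holds everything
    return {bound: sum(1 for x in observations if x <= bound) for bound in bounds}


# A counter scraped four times, with a restart between the 2nd and 3rd scrape.
print(counter_increase([5, 8, 2, 4]))
# Three observed durations sorted into buckets with bounds 0.5 and 1.0.
print(bucket_counts([0.1, 0.4, 2.0], [0.5, 1.0]))
```

Because counters can reset, dashboards and alerts should query them through rate-style functions rather than comparing raw values; gauges, by contrast, can be read directly.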
Common Label Values
Status Labels
- success/failed: Operation outcome
- started/succeeded/failed: Action lifecycle status
Action Types
- reboot: Node reboot action
- terminate: Node termination action
CSP Labels
- gcp: Google Cloud Platform
- aws: Amazon Web Services
Trigger Types
- quarantine: Node quarantine trigger
- healthy: Node healthy trigger