NVSentinel Data Flow Documentation
This document illustrates how data flows through the NVSentinel system, from detection through remediation.
Table of Contents
- Overview
- Preflight (optional admission checks)
- Core Data Structure: HealthEvent
- Component Data Flow
- Detailed Sequence Diagrams
- Data Transformations
- Data Flow Summary
- Connection Methods
- Key Insights
Overview
NVSentinel uses a publish-subscribe pattern through MongoDB change streams:
- Health Monitors detect issues and publish HealthEvent messages via gRPC
- Platform Connectors persist events to MongoDB and update Kubernetes
- Core Modules subscribe to MongoDB change streams and react independently
- Kubernetes API is the final actuator for all remediation actions
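The four roles above can be sketched with an in-memory stand-in for MongoDB. This is only an illustration of the publish-subscribe shape: `EventStore`, `platform_connector`, and the two handler functions are hypothetical names, and a real deployment would use gRPC and MongoDB change streams rather than direct function calls.

```python
from typing import Callable


class EventStore:
    """Stand-in for MongoDB: keeps documents and notifies subscribers on insert."""

    def __init__(self) -> None:
        self.documents: list[dict] = []
        self._subscribers: list[Callable[[dict], None]] = []

    def subscribe(self, handler: Callable[[dict], None]) -> None:
        self._subscribers.append(handler)

    def insert(self, document: dict) -> None:
        self.documents.append(document)
        # Every subscriber sees every insert and decides on its own what to do.
        for handler in self._subscribers:
            handler(document)


def platform_connector(store: EventStore, events: list[dict]) -> None:
    """Persist each received event (the gRPC hop is omitted in this sketch)."""
    for event in events:
        store.insert(dict(event))


quarantined: list[str] = []
analyzed: list[str] = []


def quarantine_handler(document: dict) -> None:
    # Reacts only to fatal events.
    if document.get("isFatal"):
        quarantined.append(document["nodeName"])


def analyzer_handler(document: dict) -> None:
    # Observes all events.
    analyzed.append(document["nodeName"])


store = EventStore()
store.subscribe(quarantine_handler)
store.subscribe(analyzer_handler)
platform_connector(store, [
    {"nodeName": "node-a", "isFatal": True},
    {"nodeName": "node-b", "isFatal": False},
])
```

The monitors never call the handlers directly, which is the decoupling the pattern is meant to provide.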
Preflight (optional admission checks)
Preflight is not a continuously running health monitor. Instead, a mutating admission webhook injects init containers into GPU pods in labeled namespaces. When a check detects a failure, the init container sends a health event to the platform connector over the Unix domain socket (PLATFORM_CONNECTOR_SOCKET), and the event then follows the normal ingestion path. Checks that pass produce no events. Multi-node checks use gang discovery and ConfigMap coordination (ADR-026, configuration).
Core Data Structure: HealthEvent
All data flowing through NVSentinel is based on the HealthEvent protobuf message:
HealthEvent Message Structure
Example HealthEvent: GPU XID Error
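As a rough illustration, a HealthEvent for an XID error might look like the following. The dataclass covers only the fields this document mentions by name (agent, nodeName, componentClass, errorCode, isFatal, metadata, and a recommended action); the exact field set, the enum value `REBOOT_NODE`, the agent name, and the sample values are assumptions, and the real protobuf definition may differ.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class HealthEvent:
    """Illustrative subset of the HealthEvent message; not the real schema."""

    agent: str
    nodeName: str
    componentClass: str
    errorCode: str
    isFatal: bool
    recommendedAction: str = "NONE"  # hypothetical enum value
    message: str = ""
    metadata: dict = field(default_factory=dict)


# XID 48 is the driver's code for a double-bit ECC error.
xid_event = HealthEvent(
    agent="syslog-health-monitor",     # assumed agent name
    nodeName="gpu-node-01",
    componentClass="GPU",
    errorCode="XID-48",
    isFatal=True,
    recommendedAction="REBOOT_NODE",   # hypothetical enum value
    message="Double-bit ECC error reported by the driver",
    metadata={"pci": "0000:3b:00"},
)

event_as_dict = asdict(xid_event)
```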
Component Data Flow
1. GPU Health Monitor
What it captures:
- GPU temperature and power
- ECC errors (single-bit, double-bit)
- GPU throttling events
What it emits:
- HealthEvent via gRPC to Platform Connectors
- Metrics to Prometheus (separate path)
Example flow:
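A minimal sketch of this flow, assuming the monitor samples one telemetry reading per GPU and maps it to HealthEvent-like dictionaries. The threshold, the error code strings, and the reading keys are hypothetical and chosen only for illustration.

```python
# Hypothetical threshold; a real monitor would take this from configuration.
MAX_TEMPERATURE_C = 90


def gpu_reading_to_events(node_name: str, reading: dict) -> list[dict]:
    """Turn one GPU telemetry reading into zero or more HealthEvent-like dicts."""

    def event(error_code: str, is_fatal: bool, message: str) -> dict:
        return {
            "agent": "gpu-health-monitor",  # assumed agent name
            "nodeName": node_name,
            "componentClass": "GPU",
            "errorCode": error_code,
            "isFatal": is_fatal,
            "message": message,
        }

    events = []
    if reading.get("ecc_double_bit", 0) > 0:
        # Double-bit ECC errors are uncorrectable, so this sketch marks them fatal.
        events.append(event("ECC_DOUBLE_BIT", True, "uncorrectable ECC error"))
    if reading.get("temperature_c", 0) >= MAX_TEMPERATURE_C:
        events.append(event("GPU_OVER_TEMPERATURE", False,
                            f"temperature {reading['temperature_c']} C"))
    if reading.get("throttled", False):
        events.append(event("GPU_THROTTLED", False, "clock throttling active"))
    return events
```

A healthy reading yields an empty list, so nothing is sent to the Platform Connectors in that case.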
2. Syslog Health Monitor
What it captures:
- XID errors (GPU hardware faults)
- SXID errors (NVSwitch faults)
- GPU fell off the bus events
What it emits:
- HealthEvent via gRPC to Platform Connectors
Example flow:
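As a sketch of this flow, the snippet below extracts an XID number from a kernel log line in the form the NVIDIA driver writes (`NVRM: Xid (PCI:...): <code>, ...`) and wraps it in a HealthEvent-like dictionary. The `XID-<n>` error code format follows the policy example in this document; the agent name and dictionary layout are assumptions.

```python
import re

# Matches driver messages such as:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<detail>.*)"
)


def parse_xid_line(line: str, node_name: str):
    """Return a HealthEvent-like dict for an Xid log line, or None if no match."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    return {
        "agent": "syslog-health-monitor",  # assumed agent name
        "nodeName": node_name,
        "componentClass": "GPU",
        "errorCode": f"XID-{match['xid']}",
        "message": match["detail"].strip(),
        "metadata": {"pci": match["pci"]},
    }
```

Lines that do not match return `None`, so unrelated kernel messages produce no events.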
3. CSP Health Monitor
What it captures:
- Cloud provider maintenance schedules (GCP, AWS, OCI)
- Upcoming VM migrations
- Hardware replacement notices
What it emits:
- HealthEvent via gRPC to Platform Connectors
Example flow:
4. Platform Connectors
What it receives:
- HealthEvents message via gRPC (array of HealthEvent)
- gRPC method: HealthEventOccurredV1(HealthEvents) returns (Empty)
What it does:
- Validates the event (schema, required fields)
- Inserts event into MongoDB health_events collection
- Updates Kubernetes node condition (if applicable)
- Updates Kubernetes node events (if applicable)
What it emits:
- MongoDB document (HealthEvent serialized)
- Kubernetes Node condition update (for fatal failures)
- Kubernetes Node events (for non-fatal issues)
- Metrics to Prometheus
Data transformation:
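One way to picture this transformation is a validate-then-enrich step. The sketch below is an assumption-laden illustration: the list of required fields and the initial status value `NEW` are hypothetical, while `created_at` and `status` are taken from the index fields listed later in this document.

```python
from datetime import datetime, timezone

# Hypothetical set of required fields; the real validation rules may differ.
REQUIRED_FIELDS = ("agent", "nodeName", "componentClass", "errorCode")


def validate_event(event: dict) -> list[str]:
    """Return a list of validation problems (empty when the event is acceptable)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not event.get(name)]
    if not isinstance(event.get("isFatal", False), bool):
        problems.append("isFatal must be a boolean")
    return problems


def to_document(event: dict, now=None) -> dict:
    """Build the document to insert into the health_events collection."""
    problems = validate_event(event)
    if problems:
        raise ValueError("; ".join(problems))
    document = dict(event)
    document["created_at"] = now or datetime.now(timezone.utc)
    document["status"] = "NEW"  # hypothetical initial status
    return document
```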
5. Fault Quarantine Module
What it receives:
- MongoDB change stream events
- Watches for: new HealthEvents with isFatal: true or specific error codes
Decision logic:
CEL Policy Evaluation:
- Uses Common Expression Language (CEL) for flexible policy definitions
- Policies can be defined per-node via annotations or cluster-wide via ConfigMap
- CEL expressions can evaluate any HealthEvent field (errorCode, componentClass, metadata, etc.)
- Example policy: event.errorCode.contains("XID-48") || (event.componentClass == "GPU" && event.isFatal)
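To make the example policy concrete, here is the same condition written as a plain Python predicate. This is not a CEL evaluator, only a hand translation of that one expression, and it assumes the event is available as a dictionary with a string `errorCode`.

```python
def should_quarantine(event: dict) -> bool:
    """Hand-written equivalent of the example CEL policy above."""
    error_code = event.get("errorCode", "")
    is_fatal_gpu = event.get("componentClass") == "GPU" and bool(event.get("isFatal"))
    return "XID-48" in error_code or is_fatal_gpu
```

For instance, a non-fatal `XID-48` event matches through the first clause, while a fatal event for a non-GPU component matches neither clause.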
What it emits:
- Kubernetes API call: PATCH /api/v1/nodes/{nodeName}
  - Sets spec.unschedulable = true (cordon)
  - Optionally sets taints based on configuration
- MongoDB update: Sets event status = "QUARANTINED"
- Node annotation with quarantine reason
Example API payload:
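A plausible shape for this payload, assuming the module sends a JSON merge patch to the node. Setting `spec.unschedulable` to `true` is the standard Kubernetes cordon mechanism and the taint fields (`key`, `value`, `effect`) are the standard taint schema; the annotation key and taint key below are hypothetical placeholders.

```python
import json

# Hypothetical annotation key; the real key is defined by the module's configuration.
QUARANTINE_ANNOTATION = "example.invalid/quarantine-reason"


def build_cordon_patch(reason: str, taints=None) -> dict:
    """Build a patch body that cordons a node and records why."""
    patch = {
        "metadata": {"annotations": {QUARANTINE_ANNOTATION: reason}},
        "spec": {"unschedulable": True},
    }
    if taints:
        # In a JSON merge patch, a list replaces the existing list as a whole.
        patch["spec"]["taints"] = list(taints)
    return patch


cordon_body = build_cordon_patch(
    "XID-48 on gpu-node-01",
    taints=[{"key": "example.invalid/quarantined", "value": "true", "effect": "NoSchedule"}],
)
cordon_json = json.dumps(cordon_body, indent=2)
```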
6. Node Drainer Module
What it receives:
- MongoDB change stream events
- Watches for: nodes cordoned by Quarantine Module
Decision logic:
What it emits:
- Kubernetes API calls:
  - GET /api/v1/pods (list pods on node)
  - DELETE /api/v1/namespaces/{ns}/pods/{pod} (evict each pod)
- MongoDB update: Sets event status = "DRAINED"
Eviction payload:
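For reference, a sketch of the request pieces involved when pods are evicted through the Kubernetes Eviction API. In that API, an `Eviction` object of `policy/v1` is POSTed to the pod's `eviction` subresource, and pods on a node can be listed with a `spec.nodeName` field selector. Whether the Node Drainer uses exactly these calls is an assumption; the helper names are hypothetical.

```python
from urllib.parse import urlencode


def pods_on_node_query(node_name: str) -> str:
    """Query string that limits a pod list to a single node."""
    return urlencode({"fieldSelector": f"spec.nodeName={node_name}"})


def eviction_path(namespace: str, pod: str) -> str:
    """Path of the eviction subresource for one pod."""
    return f"/api/v1/namespaces/{namespace}/pods/{pod}/eviction"


def build_eviction(namespace: str, pod: str, grace_period_seconds=None) -> dict:
    """Body of a policy/v1 Eviction request."""
    body = {
        "apiVersion": "policy/v1",
        "kind": "Eviction",
        "metadata": {"name": pod, "namespace": namespace},
    }
    if grace_period_seconds is not None:
        body["deleteOptions"] = {"gracePeriodSeconds": grace_period_seconds}
    return body
```

Evictions honor PodDisruptionBudgets, which is the usual reason to prefer them over deleting pods directly during a drain.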
7. Fault Remediation Module
What it receives:
- MongoDB change stream events
- Watches for: events with specific RecommendedActions
Decision logic:
What it emits:
- Kubernetes Custom Resource (CRD):
Note: The custom resource is consumed by an external operator (e.g., Janitor) that handles the actual maintenance workflow.
8. Health Events Analyzer
What it receives:
- MongoDB change stream events (all events)
What it does:
- Pattern detection (recurring errors)
- Trend analysis (error frequency increasing)
- Correlation (multiple failures on same rack)
What it emits:
- New HealthEvents (for correlated/aggregated issues)
- Aggregated metrics to Prometheus
- Alert annotations to HealthEvents
- Dashboard data
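Pattern detection of the kind listed above can be as simple as counting repeats. The sketch below flags (node, error code) pairs that occur at least a given number of times in a batch of events; the threshold and the function name are hypothetical, and the real analyzer's rules are not described in this document.

```python
from collections import Counter


def find_recurring(events: list[dict], threshold: int = 3) -> list[tuple]:
    """Return sorted (nodeName, errorCode) pairs seen at least `threshold` times."""
    counts = Counter((event["nodeName"], event["errorCode"]) for event in events)
    return sorted(pair for pair, seen in counts.items() if seen >= threshold)
```

A recurring pair found this way could then be published as a new, aggregated HealthEvent.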
Detailed Sequence Diagrams
Scenario 1: GPU XID Error Detection to Node Quarantine
Scenario 2: Full Remediation Flow
Data Transformations
gRPC to MongoDB
Input (gRPC):
Output (MongoDB):
MongoDB Change Stream to Module
Change Stream Event:
Module receives:
- Deserializes fullDocument into HealthEvent struct
- Evaluates based on module-specific logic
- Takes action via Kubernetes API
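The first of these steps can be sketched as follows. MongoDB change events carry an `operationType` and, for inserts, the new document under `fullDocument`; the helper name and the choice to react only to inserts are assumptions made for this illustration.

```python
def event_from_change(change: dict):
    """Extract the HealthEvent fields from a change stream event, or return None."""
    if change.get("operationType") != "insert":
        return None
    document = change.get("fullDocument")
    if not document:
        return None
    # Drop MongoDB's internal identifier; the rest maps onto HealthEvent fields.
    return {key: value for key, value in document.items() if key != "_id"}
```

A module would pass the returned dictionary to its own decision logic and ignore `None` results.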
HealthEvent to Kubernetes Node Condition
HealthEvent:
Kubernetes Node Condition:
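As an illustration of this mapping, the function below builds a node condition from a fatal event. The keys (`type`, `status`, `reason`, `message`, `lastHeartbeatTime`, `lastTransitionTime`) are the standard Kubernetes NodeCondition fields; the condition type naming and the reason string are hypothetical.

```python
from datetime import datetime, timezone


def to_node_condition(event: dict, now=None):
    """Build a NodeCondition dict for a fatal event; return None otherwise."""
    if not event.get("isFatal"):
        return None
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    timestamp = moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "type": f"{event['componentClass']}Unhealthy",  # hypothetical naming scheme
        "status": "True",
        "reason": "FatalHealthEvent",                   # hypothetical reason
        "message": event.get("message") or event["errorCode"],
        "lastHeartbeatTime": timestamp,
        "lastTransitionTime": timestamp,
    }
```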
HealthEvent to Kubernetes CRD
HealthEvent:
RebootNode CRD:
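A hedged sketch of what such a resource could look like. Only the kind `RebootNode` comes from this document; the API group and version, the `spec` layout, and the name prefix are placeholders, since the real schema is defined by the CRD installed in the cluster.

```python
# Placeholder group/version; substitute the one defined by the installed CRD.
PLACEHOLDER_API_VERSION = "example.invalid/v1alpha1"


def build_reboot_node(event: dict, api_version: str = PLACEHOLDER_API_VERSION) -> dict:
    """Build a RebootNode custom resource manifest for the event's node."""
    node_name = event["nodeName"]
    return {
        "apiVersion": api_version,
        "kind": "RebootNode",
        # generateName lets the API server append a unique suffix.
        "metadata": {"generateName": f"reboot-{node_name}-"},
        "spec": {"nodeName": node_name},  # hypothetical spec layout
    }
```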
Data Flow Summary
Connection Methods
gRPC Connections
- Protocol: HTTP/2 + Protocol Buffers
- Service Definition: PlatformConnector.HealthEventOccurredV1
- Client: Health Monitors (GPU, Syslog, CSP)
- Server: Platform Connectors
- Port: Configurable (default: 50051)
- TLS: Optional (cert-manager integration)
MongoDB Connections
- Write Path: Platform Connectors → MongoDB (insert)
- Read Path: All core modules ← MongoDB (change streams)
- Connection String: mongodb://nvsentinel-mongodb:27017/nvsentinel
- Collection: health_events
- Indexes: nodeName, agent, created_at, status
Kubernetes API Connections
- Authentication: ServiceAccount tokens
- Authorization: RBAC (Roles/ClusterRoles)
- API Groups: v1 (core), policy/v1 (eviction), custom CRDs
- Operations: GET, PATCH, DELETE, CREATE, WATCH
Key Insights
- Decoupled Architecture: Monitors don’t know about modules, and modules don’t know about monitors
- Single Data Model: HealthEvent is the universal language
- Event Sourcing: MongoDB change streams enable reactive processing
- Kubernetes-Native: Final actions all go through K8s API (auditability)
- Idempotent Operations: Modules can re-process events safely
- Metadata Rich: HealthEvent metadata field allows extensibility without schema changes