NVSentinel Data Flow Documentation


This document illustrates how data flows through the NVSentinel system, from detection through remediation.

Overview

NVSentinel uses a publish-subscribe pattern through MongoDB change streams:

  1. Health Monitors detect issues and publish HealthEvent messages via gRPC
  2. Platform Connectors persist events to MongoDB and update Kubernetes
  3. Core Modules subscribe to MongoDB change streams and react independently
  4. Kubernetes API is the final actuator for all remediation actions
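The subscribe step (3) can be sketched in Python. This is illustrative only: the pipeline and field paths follow the MongoDB document layout shown later in this document, and the pymongo loop in the comment is an assumption, not the actual NVSentinel code.

```python
def fatal_event_pipeline():
    """Change-stream $match pipeline selecting newly inserted fatal events."""
    return [{
        "$match": {
            "operationType": "insert",
            "fullDocument.healthevent.isFatal": True,
        }
    }]

def extract_event(change):
    """Pull the HealthEvent payload out of a change-stream document."""
    return change["fullDocument"]["healthevent"]

# With pymongo, a module's subscribe loop would look roughly like:
#   for change in db.health_events.watch(fatal_event_pipeline()):
#       handle(extract_event(change))

if __name__ == "__main__":
    change = {
        "operationType": "insert",
        "fullDocument": {"healthevent": {"agent": "gpu-health-monitor",
                                         "isFatal": True,
                                         "nodeName": "gpu-node-42"}},
    }
    print(extract_event(change)["nodeName"])  # gpu-node-42
```

Because every module builds its own pipeline, each can filter independently without coordinating with the others.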

Preflight (optional admission checks)

Preflight does not publish events through MongoDB or platform connectors. A mutating admission webhook injects init containers into GPU pods in labeled namespaces. When a check detects a failure, the init container sends a health event to the platform connector over the Unix domain socket (PLATFORM_CONNECTOR_SOCKET), which then follows the normal ingestion path. This is separate from the change-stream pipeline above because healthy checks produce no events at all. Multi-node checks use gang discovery and ConfigMap coordination (ADR-026, configuration).


Core Data Structure: HealthEvent

All data flowing through NVSentinel is based on the HealthEvent protobuf message:

HealthEvent Message Structure

```protobuf
message HealthEvent {
  uint32 version = 1;                       // Protocol version

  // Source identification
  string agent = 2;                         // Monitor name (e.g., "gpu-health-monitor")
  string componentClass = 3;                // Component type (e.g., "GPU", "NIC")
  string nodeName = 13;                     // Kubernetes node name

  // Health status
  string checkName = 4;                     // Specific check (e.g., "XID_ERROR")
  bool isFatal = 5;                         // Critical failure
  bool isHealthy = 6;                       // Current health state
  string message = 7;                       // Human-readable description

  // Classification
  RecommendedAction recommendedAction = 8;  // What should be done
  repeated string errorCode = 9;            // Error identifiers (e.g., ["XID-48"])
  repeated Entity entitiesImpacted = 10;    // Affected resources

  // Metadata
  map<string, string> metadata = 11;        // Key-value pairs (GPU UUID, driver version, etc.)
  google.protobuf.Timestamp generatedTimestamp = 12;

  // Behavior overrides
  BehaviourOverrides quarantineOverrides = 14;
  BehaviourOverrides drainOverrides = 15;
}

enum RecommendedAction {
  NONE = 0;
  COMPONENT_RESET = 2;
  CONTACT_SUPPORT = 5;
  RESTART_VM = 15;
  RESTART_BM = 24;
  REPLACE_VM = 25;
  UNKNOWN = 99;
}

message Entity {
  string entityType = 1;   // e.g., "GPU", "NODE", "POD"
  string entityValue = 2;  // e.g., GPU UUID, node name
}
```

Example HealthEvent: GPU XID Error

```json
{
  "version": 1,
  "agent": "gpu-health-monitor",
  "componentClass": "GPU",
  "checkName": "XID_ERROR_48",
  "isFatal": true,
  "isHealthy": false,
  "message": "GPU 0 reported XID 48 (Double Bit ECC Error)",
  "recommendedAction": "REPLACE_VM",
  "errorCode": ["XID-48"],
  "entitiesImpacted": [
    {
      "entityType": "GPU",
      "entityValue": "GPU-12345678-abcd-1234-abcd-123456789abc"
    }
  ],
  "metadata": {
    "gpu_uuid": "GPU-12345678-abcd-1234-abcd-123456789abc",
    "gpu_index": "0",
    "driver_version": "535.104.05",
    "severity": "CRITICAL"
  },
  "generatedTimestamp": "2025-10-28T10:15:30Z",
  "nodeName": "gpu-node-42",
  "quarantineOverrides": null,
  "drainOverrides": null
}
```

Component Data Flow

1. GPU Health Monitor

What it captures:

  • GPU temperature and power
  • ECC errors (single-bit, double-bit)
  • GPU throttling events

What it emits:

  • HealthEvent via gRPC to Platform Connectors
  • Metrics to Prometheus (separate path)

Example flow:

DCGM reports ECC error on GPU 0
Monitor creates HealthEvent:
- agent: "gpu-health-monitor"
- componentClass: "GPU"
- checkName: "ECC_ERROR"
- isFatal: false
- recommendedAction: NONE
- errorCode: ["ECC-DBE"]
Sends via gRPC: HealthEventOccurredV1(HealthEvents)
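The flow above, expressed as a minimal Python sketch (the helper name `make_ecc_event` is hypothetical; DCGM polling and the gRPC stub are omitted, and field names follow the HealthEvent message definition):

```python
from datetime import datetime, timezone

def make_ecc_event(node_name, gpu_uuid, gpu_index):
    """Assemble the ECC HealthEvent from the example flow above."""
    return {
        "version": 1,
        "agent": "gpu-health-monitor",
        "componentClass": "GPU",
        "nodeName": node_name,
        "checkName": "ECC_ERROR",
        "isFatal": False,
        "isHealthy": False,
        "message": f"GPU {gpu_index} reported a double-bit ECC error",
        "recommendedAction": "NONE",  # enum value from RecommendedAction above
        "errorCode": ["ECC-DBE"],
        "entitiesImpacted": [{"entityType": "GPU", "entityValue": gpu_uuid}],
        "metadata": {"gpu_uuid": gpu_uuid, "gpu_index": str(gpu_index)},
        "generatedTimestamp": datetime.now(timezone.utc).isoformat(),
    }

# The monitor would then batch events into a HealthEvents message and call
# HealthEventOccurredV1 on the platform connector's gRPC endpoint.
```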

2. Syslog Health Monitor

What it captures:

  • XID errors (GPU hardware faults)
  • SXID errors (GPU software errors)
  • GPU fell off the bus events

What it emits:

  • HealthEvent via gRPC to Platform Connectors

Example flow:

journalctl shows XID 48 error
Monitor creates HealthEvent:
- agent: "syslog-health-monitor"
- componentClass: "GPU"
- checkName: "XID_ERROR_48"
- isFatal: true
- recommendedAction: REPLACE_VM
- errorCode: ["XID-48"]
Sends via gRPC

3. CSP Health Monitor

What it captures:

  • Cloud provider maintenance schedules (GCP, AWS, OCI)
  • Upcoming VM migrations
  • Hardware replacement notices

What it emits:

  • HealthEvent via gRPC to Platform Connectors

Example flow:

GCP API reports scheduled maintenance
Monitor creates HealthEvent:
- agent: "csp-health-monitor"
- componentClass: "CSP"
- checkName: "SCHEDULED_MAINTENANCE"
- isFatal: false
- recommendedAction: NONE
- metadata: {"maintenance_start": "2025-11-01T00:00:00Z"}
Sends via gRPC

4. Platform Connectors

What it receives:

  • HealthEvents message via gRPC (array of HealthEvent)
  • gRPC method: HealthEventOccurredV1(HealthEvents) returns (Empty)

What it does:

  1. Validates the event (schema, required fields)
  2. Inserts event into MongoDB health_events collection
  3. Updates Kubernetes node condition (if applicable)
  4. Updates Kubernetes node events (if applicable)

What it emits:

  • MongoDB document (HealthEvent serialized)
  • Kubernetes Node condition update (for fatal failures)
  • Kubernetes Node events (for non-fatal issues)
  • Metrics to Prometheus

Data transformation:

```text
gRPC HealthEvents

Validate each HealthEvent

MongoDB Insert: {
  "_id": ObjectId("..."),
  "createdAt": ISODate("2025-10-28T10:15:30Z"),
  "healthevent": {
    "version": 1,
    "agent": "gpu-health-monitor",
    "nodeName": "gpu-node-42",
    // ... all other HealthEvent fields
  },
  "healtheventstatus": {
    "nodequarantined": null,          // null, "Quarantined", "UnQuarantined", or "AlreadyQuarantined"
    "userpodsevictionstatus": {
      "status": "NotStarted",         // "NotStarted", "InProgress", "Failed", "Succeeded", or "AlreadyDrained"
      "message": ""                   // Optional error/status message
    },
    "faultremediated": null,          // null or boolean
    "lastremediationtimestamp": null  // null or ISODate
  }
}

If isFatal == true:
  Kubernetes Node Condition: {
    "type": "GPUHealthy",
    "status": "False",
    "reason": "XID_ERROR_48",
    "message": "GPU 0 reported XID 48"
  }
Else:
  Kubernetes Node Event: {
    "type": "Warning",
    "reason": "GPUHealthIssue",
    "message": "GPU 0 reported ECC error",
    "involvedObject": {Node}
  }
```
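The envelope construction above can be sketched as a small transformation (a Python sketch; `REQUIRED_FIELDS` and the helper names are illustrative assumptions, not the connector's actual validation rules):

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = ("version", "agent", "componentClass", "checkName", "nodeName")

def validate(event):
    """Reject events missing required fields before persistence."""
    missing = [f for f in REQUIRED_FIELDS if f not in event]
    if missing:
        raise ValueError(f"HealthEvent missing fields: {missing}")

def to_document(event, now=None):
    """Wrap a validated HealthEvent in the MongoDB envelope shown above."""
    validate(event)
    return {
        "createdAt": now or datetime.now(timezone.utc),
        "healthevent": event,
        "healtheventstatus": {
            "nodequarantined": None,
            "userpodsevictionstatus": {"status": "NotStarted", "message": ""},
            "faultremediated": None,
            "lastremediationtimestamp": None,
        },
    }
```

Downstream modules only ever mutate `healtheventstatus`; the original `healthevent` payload is preserved as written.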

5. Fault Quarantine Module

What it receives:

  • MongoDB change stream events
  • Watches for: new HealthEvents with isFatal: true or specific error codes

Decision logic:

```go
// Evaluate CEL-based policy first
policy := getCELPolicy(event.NodeName)
if policy.Evaluate(event) {
    // CEL policy determines if quarantine is needed
    if !event.QuarantineOverrides.Skip {
        cordonNode(event.NodeName)
    }
}

// Fallback to built-in logic
if event.IsFatal || event.RecommendedAction == REPLACE_VM {
    if !event.QuarantineOverrides.Skip {
        cordonNode(event.NodeName)
    }
}
```

CEL Policy Evaluation:

  • Uses Common Expression Language (CEL) for flexible policy definitions
  • Policies can be defined per-node via annotations or cluster-wide via ConfigMap
  • CEL expressions can evaluate any HealthEvent field (errorCode, componentClass, metadata, etc.)
  • Example policy: event.errorCode.contains("XID-48") || (event.componentClass == "GPU" && event.isFatal)
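Combined, the policy check, built-in fallback, and override handling amount to a predicate like the following (a Python sketch; the real module evaluates compiled CEL, which is abstracted here as a boolean `policy_matches` argument):

```python
def should_quarantine(event, policy_matches=False):
    """Decide whether to cordon the node for a given HealthEvent dict.

    Overrides win first, then the CEL policy result, then built-in logic.
    """
    overrides = event.get("quarantineOverrides") or {}
    if overrides.get("skip"):
        return False  # explicit per-event override: never quarantine
    if policy_matches:
        return True   # CEL policy matched
    # Built-in fallback logic
    return bool(event.get("isFatal")) or event.get("recommendedAction") == "REPLACE_VM"
```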

What it emits:

  • Kubernetes API call: PATCH /api/v1/nodes/{nodeName}
    • Sets spec.unschedulable = true (cordon)
    • Optionally sets taints based on configuration
  • MongoDB update: sets healtheventstatus.nodequarantined = "Quarantined"
  • Node annotation with quarantine reason

Example API payload:

```text
PATCH /api/v1/nodes/gpu-node-42
{
  "spec": {
    "unschedulable": true,
    "taints": [
      {
        "key": "nvsentinel.nvidia.com/unhealthy",
        "value": "XID_ERROR_48",
        "effect": "NoSchedule"
      }
    ]
  },
  "metadata": {
    "annotations": {
      "nvsentinel.nvidia.com/quarantined": "true",
      "nvsentinel.nvidia.com/quarantine-reason": "XID_ERROR_48",
      "nvsentinel.nvidia.com/quarantine-timestamp": "2025-10-28T10:15:35Z"
    }
  }
}
```

6. Node Drainer Module

What it receives:

  • MongoDB change stream events
  • Watches for: nodes cordoned by Quarantine Module

Decision logic:

```go
if node.IsQuarantined && !event.DrainOverrides.Skip {
    drainNodePods(node.Name) // evict user pods gracefully
}
```

What it emits:

  • Kubernetes API calls:
    • GET /api/v1/pods (list pods on node)
    • DELETE /api/v1/namespaces/{ns}/pods/{pod} (evict each pod)
  • MongoDB update: sets healtheventstatus.userpodsevictionstatus.status = "Succeeded" (or "Failed")

Eviction payload:

```text
POST /api/v1/namespaces/default/pods/training-job-xyz/eviction
{
  "apiVersion": "policy/v1",
  "kind": "Eviction",
  "metadata": {
    "name": "training-job-xyz",
    "namespace": "default"
  },
  "deleteOptions": {
    "gracePeriodSeconds": 300
  }
}
```
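The eviction payloads can be generated per pod. A Python sketch under the assumption that pods are represented as simple dicts (`plan_drain` and its pod field names are illustrative, not the drainer's actual types):

```python
def eviction_body(pod_name, namespace, grace_seconds=300):
    """Build the policy/v1 Eviction payload shown above."""
    return {
        "apiVersion": "policy/v1",
        "kind": "Eviction",
        "metadata": {"name": pod_name, "namespace": namespace},
        "deleteOptions": {"gracePeriodSeconds": grace_seconds},
    }

def plan_drain(pods, node_name):
    """Select pods running on the quarantined node and build one eviction each."""
    return [eviction_body(p["name"], p["namespace"])
            for p in pods if p["nodeName"] == node_name]
```

Using the Eviction subresource (rather than a plain DELETE) means PodDisruptionBudgets are honored by the API server.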

7. Fault Remediation Module

What it receives:

  • MongoDB change stream events
  • Watches for: events with specific RecommendedActions

Decision logic:

```go
if event.RecommendedAction == REPLACE_VM {
    createBreakFixTicketCRD(event)
}
```

What it emits:

  • Kubernetes Custom Resource (CRD):
```yaml
apiVersion: janitor.dgxc.nvidia.com/v1alpha1
kind: RebootNode
metadata:
  name: maintenance-gpu-node-42-6720abc123def456789
spec:
  nodeName: gpu-node-42
```

Note: The CRD is consumed by an external operator (e.g., Janitor) that handles the actual maintenance workflow.
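Building the CRD from the event is a small transformation. A Python sketch (the `event_id` suffix mirrors the MongoDB `_id` embedded in the example name above; the helper name is hypothetical):

```python
def reboot_node_crd(node_name, event_id):
    """Build the RebootNode manifest shown above; the name embeds the event _id
    so repeated processing of the same event is idempotent."""
    return {
        "apiVersion": "janitor.dgxc.nvidia.com/v1alpha1",
        "kind": "RebootNode",
        "metadata": {"name": f"maintenance-{node_name}-{event_id}"},
        "spec": {"nodeName": node_name},
    }
```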

8. Health Events Analyzer

What it receives:

  • MongoDB change stream events (all events)

What it does:

  • Pattern detection (recurring errors)
  • Trend analysis (error frequency increasing)
  • Correlation (multiple failures on same rack)

What it emits:

  • New HealthEvents (for correlated/aggregated issues)
  • Aggregated metrics to Prometheus
  • Alert annotations to HealthEvents
  • Dashboard data
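Pattern detection over the event stream can be as simple as counting (nodeName, errorCode) pairs within a window. A Python sketch of the recurring-error case (the threshold and grouping key are illustrative assumptions, not the analyzer's actual rules):

```python
from collections import Counter

def recurring_errors(events, threshold=3):
    """Return (nodeName, errorCode) pairs seen at least `threshold` times."""
    counts = Counter(
        (e["nodeName"], code)
        for e in events
        for code in e.get("errorCode", [])
    )
    return [pair for pair, n in counts.items() if n >= threshold]
```

A hit would cause the analyzer to emit a new, aggregated HealthEvent back through the same pipeline.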

Detailed Sequence Diagrams

Scenario 1: GPU XID Error Detection to Node Quarantine

Scenario 2: Full Remediation Flow


Data Transformations

gRPC to MongoDB

Input (gRPC):

```text
HealthEvents {
  version: 1
  events: [
    HealthEvent {
      agent: "gpu-health-monitor"
      nodeName: "gpu-node-42"
      isFatal: true
      // ... other fields
    }
  ]
}
```

Output (MongoDB):

```js
{
  "_id": ObjectId("6720abc123def456789"),
  "createdAt": ISODate("2025-10-28T10:15:30.123Z"),
  "healthevent": {
    "version": 1,
    "agent": "gpu-health-monitor",
    "nodeName": "gpu-node-42",
    "isFatal": true
    // ... all HealthEvent fields preserved
  },
  "healtheventstatus": {
    "nodequarantined": null,          // null, "Quarantined", "UnQuarantined", or "AlreadyQuarantined"
    "userpodsevictionstatus": {
      "status": "NotStarted",         // "NotStarted", "InProgress", "Failed", "Succeeded", or "AlreadyDrained"
      "message": ""                   // Optional error/status message
    },
    "faultremediated": null,          // null or boolean
    "lastremediationtimestamp": null  // null or ISODate
  }
}
```

MongoDB Change Stream to Module

Change Stream Event:

```js
{
  "_id": {"_data": "..."},
  "operationType": "insert",
  "fullDocument": {
    "_id": ObjectId("6720abc123def456789"),
    "healthevent": {
      "version": 1,
      "agent": "gpu-health-monitor",
      "isFatal": true,
      "nodeName": "gpu-node-42"
      // ... full HealthEvent
    },
    "healtheventstatus": {
      // ... status fields as in MongoDB document
    },
    "createdAt": ISODate("2025-10-28T10:15:30.123Z")
  }
}
```

Module receives:

  • Deserializes fullDocument into HealthEvent struct
  • Evaluates based on module-specific logic
  • Takes action via Kubernetes API

HealthEvent to Kubernetes Node Condition

HealthEvent:

```json
{
  "checkName": "XID_ERROR_48",
  "isFatal": true,
  "message": "GPU 0 reported XID 48"
}
```

Kubernetes Node Condition:

```yaml
conditions:
- type: GPUHealthy
  status: "False"
  reason: XID_ERROR_48
  message: "GPU 0 reported XID 48"
  lastTransitionTime: "2025-10-28T10:15:30Z"
```
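This mapping is mechanical. A Python sketch (the `component` parameter, generalizing the `GPUHealthy` condition type to other component classes, is an assumption):

```python
def to_node_condition(event, component="GPU"):
    """Map a HealthEvent dict to the node condition fields shown above."""
    return {
        "type": f"{component}Healthy",
        "status": "False" if event["isFatal"] else "True",
        "reason": event["checkName"],
        "message": event["message"],
    }
```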

HealthEvent to Kubernetes CRD

HealthEvent:

```json
{
  "nodeName": "gpu-node-42",
  "checkName": "XID_ERROR_48",
  "recommendedAction": "RESTART_BM"
}
```

RebootNode CRD:

```yaml
apiVersion: janitor.dgxc.nvidia.com/v1alpha1
kind: RebootNode
metadata:
  name: maintenance-gpu-node-42-6720abc123def456789
spec:
  nodeName: gpu-node-42
```

Data Flow Summary

| Source | Data Format | Transport | Destination | Action |
|---|---|---|---|---|
| GPU Monitor | HealthEvent (protobuf) | gRPC | Platform Connectors | Publish event |
| Syslog Monitor | HealthEvent (protobuf) | gRPC | Platform Connectors | Publish event |
| CSP Monitor | HealthEvent (protobuf) | gRPC | Platform Connectors | Publish event |
| Platform Connectors | HealthEvent (BSON) | MongoDB insert | MongoDB | Persist event |
| Platform Connectors | Node (JSON) | Kubernetes API | K8s Nodes | Update condition |
| MongoDB | ChangeStream (BSON) | MongoDB change stream | All modules | Subscribe to events |
| Fault Quarantine | Node (JSON) | Kubernetes API | K8s Nodes | Cordon node |
| Node Drainer | Pod Eviction (JSON) | Kubernetes API | K8s Pods | Evict pods |
| Fault Remediation | CRD (YAML) | Kubernetes API | K8s CRDs | Create repair request |

Connection Methods

gRPC Connections

  • Protocol: HTTP/2 + Protocol Buffers
  • Service Definition: PlatformConnector.HealthEventOccurredV1
  • Client: Health Monitors (GPU, Syslog, CSP)
  • Server: Platform Connectors
  • Port: Configurable (default: 50051)
  • TLS: Optional (cert-manager integration)

MongoDB Connections

  • Write Path: Platform Connectors → MongoDB (insert)
  • Read Path: All core modules ← MongoDB (change streams)
  • Connection String: mongodb://nvsentinel-mongodb:27017/nvsentinel
  • Collection: health_events
  • Indexes: nodeName, agent, createdAt, status

Kubernetes API Connections

  • Authentication: ServiceAccount tokens
  • Authorization: RBAC (Roles/ClusterRoles)
  • API Groups: v1 (core), policy/v1 (eviction), custom CRDs
  • Operations: GET, PATCH, DELETE, CREATE, WATCH

Key Insights

  1. Decoupled Architecture: Monitors don’t know about modules, modules don’t know about monitors
  2. Single Data Model: HealthEvent is the universal language
  3. Event Sourcing: MongoDB change streams enable reactive processing
  4. Kubernetes-Native: Final actions all go through K8s API (auditability)
  5. Idempotent Operations: Modules can re-process events safely
  6. Metadata Rich: HealthEvent metadata field allows extensibility without schema changes