NVSentinel Data Flow Documentation


This document illustrates how data flows through the NVSentinel system, from detection through remediation.

Overview

NVSentinel uses a publish-subscribe pattern through MongoDB change streams:

  1. Health Monitors detect issues and publish HealthEvent messages via gRPC
  2. Platform Connectors persist events to MongoDB and update Kubernetes
  3. Core Modules subscribe to MongoDB change streams and react independently
  4. Kubernetes API is the final actuator for all remediation actions
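The subscribe step (3) can be sketched in Python. This is illustrative only: the pipeline and field paths follow the MongoDB document layout shown later in this document, and the pymongo loop in the comment is an assumption, not the actual NVSentinel code.

```python
def fatal_event_pipeline():
    """Change-stream $match pipeline selecting newly inserted fatal events."""
    return [{
        "$match": {
            "operationType": "insert",
            "fullDocument.healthevent.isFatal": True,
        }
    }]

def extract_event(change):
    """Pull the HealthEvent payload out of a change-stream document."""
    return change["fullDocument"]["healthevent"]

# With pymongo, a module's subscribe loop would look roughly like:
#   for change in db.health_events.watch(fatal_event_pipeline()):
#       handle(extract_event(change))

if __name__ == "__main__":
    change = {
        "operationType": "insert",
        "fullDocument": {"healthevent": {"agent": "gpu-health-monitor",
                                         "isFatal": True,
                                         "nodeName": "gpu-node-42"}},
    }
    print(extract_event(change)["nodeName"])  # gpu-node-42
```

Because every module builds its own pipeline, each can filter independently without coordinating with the others.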

Preflight (optional admission checks)

Preflight does not publish events through MongoDB or platform connectors. A mutating admission webhook injects init containers into GPU pods in labeled namespaces. When a check detects a failure, the init container sends a health event to the platform connector over the Unix domain socket (PLATFORM_CONNECTOR_SOCKET), which then follows the normal ingestion path. This is separate from the change-stream pipeline above because healthy checks produce no events at all. Multi-node checks use gang discovery and ConfigMap coordination (ADR-026, configuration).


Core Data Structure: HealthEvent

All data flowing through NVSentinel is based on the HealthEvent protobuf message:

HealthEvent Message Structure

```protobuf
message HealthEvent {
  uint32 version = 1;                       // Protocol version

  // Source identification
  string agent = 2;                         // Monitor name (e.g., "gpu-health-monitor")
  string componentClass = 3;                // Component type (e.g., "GPU", "NIC")
  string nodeName = 13;                     // Kubernetes node name

  // Health status
  string checkName = 4;                     // Specific check (e.g., "XID_ERROR")
  bool isFatal = 5;                         // Critical failure
  bool isHealthy = 6;                       // Current health state
  string message = 7;                       // Human-readable description

  // Classification
  RecommendedAction recommendedAction = 8;  // What should be done
  repeated string errorCode = 9;            // Error identifiers (e.g., ["XID-48"])
  repeated Entity entitiesImpacted = 10;    // Affected resources

  // Metadata
  map<string, string> metadata = 11;        // Key-value pairs (GPU UUID, driver version, etc.)
  google.protobuf.Timestamp generatedTimestamp = 12;

  // Behavior overrides
  BehaviourOverrides quarantineOverrides = 14;
  BehaviourOverrides drainOverrides = 15;
}

enum RecommendedAction {
  NONE = 0;
  COMPONENT_RESET = 2;
  CONTACT_SUPPORT = 5;
  RESTART_VM = 15;
  RESTART_BM = 24;
  REPLACE_VM = 25;
  UNKNOWN = 99;
}

message Entity {
  string entityType = 1;   // e.g., "GPU", "NODE", "POD"
  string entityValue = 2;  // e.g., GPU UUID, node name
}
```

Example HealthEvent: GPU XID Error

```json
{
  "version": 1,
  "agent": "gpu-health-monitor",
  "componentClass": "GPU",
  "checkName": "XID_ERROR_48",
  "isFatal": true,
  "isHealthy": false,
  "message": "GPU 0 reported XID 48 (Double Bit ECC Error)",
  "recommendedAction": "REPLACE_VM",
  "errorCode": ["XID-48"],
  "entitiesImpacted": [
    {
      "entityType": "GPU",
      "entityValue": "GPU-12345678-abcd-1234-abcd-123456789abc"
    }
  ],
  "metadata": {
    "gpu_uuid": "GPU-12345678-abcd-1234-abcd-123456789abc",
    "gpu_index": "0",
    "driver_version": "535.104.05",
    "severity": "CRITICAL"
  },
  "generatedTimestamp": "2025-10-28T10:15:30Z",
  "nodeName": "gpu-node-42",
  "quarantineOverrides": null,
  "drainOverrides": null
}
```

Component Data Flow

1. GPU Health Monitor

What it captures:

  • GPU temperature and power
  • ECC errors (single-bit, double-bit)
  • GPU throttling events

What it emits:

  • HealthEvent via gRPC to Platform Connectors
  • Metrics to Prometheus (separate path)

Example flow:

DCGM reports ECC error on GPU 0
Monitor creates HealthEvent:
- agent: "gpu-health-monitor"
- componentClass: "GPU"
- checkName: "ECC_ERROR"
- isFatal: false
- recommendedAction: NONE
- errorCode: ["ECC-DBE"]
Sends via gRPC: HealthEventOccurredV1(HealthEvents)
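The flow above, expressed as a minimal Python sketch (the helper name `make_ecc_event` is hypothetical; DCGM polling and the gRPC stub are omitted, and field names follow the HealthEvent message definition):

```python
from datetime import datetime, timezone

def make_ecc_event(node_name, gpu_uuid, gpu_index):
    """Assemble the ECC HealthEvent from the example flow above."""
    return {
        "version": 1,
        "agent": "gpu-health-monitor",
        "componentClass": "GPU",
        "nodeName": node_name,
        "checkName": "ECC_ERROR",
        "isFatal": False,
        "isHealthy": False,
        "message": f"GPU {gpu_index} reported a double-bit ECC error",
        "recommendedAction": "NONE",  # enum value from RecommendedAction above
        "errorCode": ["ECC-DBE"],
        "entitiesImpacted": [{"entityType": "GPU", "entityValue": gpu_uuid}],
        "metadata": {"gpu_uuid": gpu_uuid, "gpu_index": str(gpu_index)},
        "generatedTimestamp": datetime.now(timezone.utc).isoformat(),
    }

# The monitor would then batch events into a HealthEvents message and call
# HealthEventOccurredV1 on the platform connector's gRPC endpoint.
```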

2. Syslog Health Monitor

What it captures:

  • XID errors (GPU hardware faults)
  • SXID errors (GPU software errors)
  • GPU fell off the bus events

What it emits:

  • HealthEvent via gRPC to Platform Connectors

Example flow:

journalctl shows XID 48 error
Monitor creates HealthEvent:
- agent: "syslog-health-monitor"
- componentClass: "GPU"
- checkName: "XID_ERROR_48"
- isFatal: true
- recommendedAction: REPLACE_VM
- errorCode: ["XID-48"]
Sends via gRPC

3. CSP Health Monitor

What it captures:

  • Cloud provider maintenance schedules (GCP, AWS, OCI)
  • Upcoming VM migrations
  • Hardware replacement notices

What it emits:

  • HealthEvent via gRPC to Platform Connectors

Example flow:

GCP API reports scheduled maintenance
Monitor creates HealthEvent:
- agent: "csp-health-monitor"
- componentClass: "CSP"
- checkName: "SCHEDULED_MAINTENANCE"
- isFatal: false
- recommendedAction: NONE
- metadata: {"maintenance_start": "2025-11-01T00:00:00Z"}
Sends via gRPC

4. Platform Connectors

What it receives:

  • HealthEvents message via gRPC (array of HealthEvent)
  • gRPC method: HealthEventOccurredV1(HealthEvents) returns (Empty)

What it does:

  1. Validates the event (schema, required fields)
  2. Inserts event into MongoDB health_events collection
  3. Updates Kubernetes node condition (if applicable)
  4. Updates Kubernetes node events (if applicable)

What it emits:

  • MongoDB document (HealthEvent serialized)
  • Kubernetes Node condition update (for fatal failures)
  • Kubernetes Node events (for non-fatal issues)
  • Metrics to Prometheus

Data transformation:

```text
gRPC HealthEvents

Validate each HealthEvent

MongoDB Insert: {
  "_id": ObjectId("..."),
  "createdAt": ISODate("2025-10-28T10:15:30Z"),
  "healthevent": {
    "version": 1,
    "agent": "gpu-health-monitor",
    "nodeName": "gpu-node-42",
    // ... all other HealthEvent fields
  },
  "healtheventstatus": {
    "nodequarantined": null,          // null, "Quarantined", "UnQuarantined", or "AlreadyQuarantined"
    "userpodsevictionstatus": {
      "status": "NotStarted",         // "NotStarted", "InProgress", "Failed", "Succeeded", or "AlreadyDrained"
      "message": ""                   // Optional error/status message
    },
    "faultremediated": null,          // null or boolean
    "lastremediationtimestamp": null  // null or ISODate
  }
}

If isFatal == true:
  Kubernetes Node Condition: {
    "type": "GPUHealthy",
    "status": "False",
    "reason": "XID_ERROR_48",
    "message": "GPU 0 reported XID 48"
  }
Else:
  Kubernetes Node Event: {
    "type": "Warning",
    "reason": "GPUHealthIssue",
    "message": "GPU 0 reported ECC error",
    "involvedObject": {Node}
  }
```
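The envelope construction above can be sketched as a small transformation (a Python sketch; `REQUIRED_FIELDS` and the helper names are illustrative assumptions, not the connector's actual validation rules):

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = ("version", "agent", "componentClass", "checkName", "nodeName")

def validate(event):
    """Reject events missing required fields before persistence."""
    missing = [f for f in REQUIRED_FIELDS if f not in event]
    if missing:
        raise ValueError(f"HealthEvent missing fields: {missing}")

def to_document(event, now=None):
    """Wrap a validated HealthEvent in the MongoDB envelope shown above."""
    validate(event)
    return {
        "createdAt": now or datetime.now(timezone.utc),
        "healthevent": event,
        "healtheventstatus": {
            "nodequarantined": None,
            "userpodsevictionstatus": {"status": "NotStarted", "message": ""},
            "faultremediated": None,
            "lastremediationtimestamp": None,
        },
    }
```

Downstream modules only ever mutate `healtheventstatus`; the original `healthevent` payload is preserved as written.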

5. Fault Quarantine Module

What it receives:

  • MongoDB change stream events
  • Watches for: new HealthEvents with isFatal: true or specific error codes

Decision logic:

```go
// Evaluate CEL-based policy first
policy := getCELPolicy(event.NodeName)
if policy.Evaluate(event) {
    // CEL policy determines if quarantine is needed
    if !event.QuarantineOverrides.Skip {
        cordonNode(event.NodeName)
    }
}

// Fallback to built-in logic
if event.IsFatal || event.RecommendedAction == REPLACE_VM {
    if !event.QuarantineOverrides.Skip {
        cordonNode(event.NodeName)
    }
}
```

CEL Policy Evaluation:

  • Uses Common Expression Language (CEL) for flexible policy definitions
  • Policies can be defined per-node via annotations or cluster-wide via ConfigMap
  • CEL expressions can evaluate any HealthEvent field (errorCode, componentClass, metadata, etc.)
  • Example policy: event.errorCode.contains("XID-48") || (event.componentClass == "GPU" && event.isFatal)
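Combined, the policy check, built-in fallback, and override handling amount to a predicate like the following (a Python sketch; the real module evaluates compiled CEL, which is abstracted here as a boolean `policy_matches` argument):

```python
def should_quarantine(event, policy_matches=False):
    """Decide whether to cordon the node for a given HealthEvent dict.

    Overrides win first, then the CEL policy result, then built-in logic.
    """
    overrides = event.get("quarantineOverrides") or {}
    if overrides.get("skip"):
        return False  # explicit per-event override: never quarantine
    if policy_matches:
        return True   # CEL policy matched
    # Built-in fallback logic
    return bool(event.get("isFatal")) or event.get("recommendedAction") == "REPLACE_VM"
```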

What it emits:

  • Kubernetes API call: PATCH /api/v1/nodes/{nodeName}
    • Sets spec.unschedulable = true (cordon)
    • Optionally sets taints based on configuration
  • MongoDB update: sets healtheventstatus.nodequarantined = "Quarantined"
  • Node annotation with quarantine reason

Example API payload:

```text
PATCH /api/v1/nodes/gpu-node-42
{
  "spec": {
    "unschedulable": true,
    "taints": [
      {
        "key": "nvsentinel.nvidia.com/unhealthy",
        "value": "XID_ERROR_48",
        "effect": "NoSchedule"
      }
    ]
  },
  "metadata": {
    "annotations": {
      "nvsentinel.nvidia.com/quarantined": "true",
      "nvsentinel.nvidia.com/quarantine-reason": "XID_ERROR_48",
      "nvsentinel.nvidia.com/quarantine-timestamp": "2025-10-28T10:15:35Z"
    }
  }
}
```

6. Node Drainer Module

What it receives:

  • MongoDB change stream events
  • Watches for: nodes cordoned by Quarantine Module

Decision logic:

```go
if node.IsQuarantined && !event.DrainOverrides.Skip {
    drainNodePods(node.Name) // evict user pods gracefully
}
```

What it emits:

  • Kubernetes API calls:
    • GET /api/v1/pods (list pods on node)
    • DELETE /api/v1/namespaces/{ns}/pods/{pod} (evict each pod)
  • MongoDB update: sets healtheventstatus.userpodsevictionstatus.status = "Succeeded" (or "Failed")

Eviction payload:

```text
POST /api/v1/namespaces/default/pods/training-job-xyz/eviction
{
  "apiVersion": "policy/v1",
  "kind": "Eviction",
  "metadata": {
    "name": "training-job-xyz",
    "namespace": "default"
  },
  "deleteOptions": {
    "gracePeriodSeconds": 300
  }
}
```
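The eviction payloads can be generated per pod. A Python sketch under the assumption that pods are represented as simple dicts (`plan_drain` and its pod field names are illustrative, not the drainer's actual types):

```python
def eviction_body(pod_name, namespace, grace_seconds=300):
    """Build the policy/v1 Eviction payload shown above."""
    return {
        "apiVersion": "policy/v1",
        "kind": "Eviction",
        "metadata": {"name": pod_name, "namespace": namespace},
        "deleteOptions": {"gracePeriodSeconds": grace_seconds},
    }

def plan_drain(pods, node_name):
    """Select pods running on the quarantined node and build one eviction each."""
    return [eviction_body(p["name"], p["namespace"])
            for p in pods if p["nodeName"] == node_name]
```

Using the Eviction subresource (rather than a plain DELETE) means PodDisruptionBudgets are honored by the API server.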

7. Fault Remediation Module

What it receives:

  • MongoDB change stream events
  • Watches for: events with specific RecommendedActions

Decision logic:

```go
if event.RecommendedAction == REPLACE_VM {
    createBreakFixTicketCRD(event)
}
```

What it emits:

  • Kubernetes Custom Resource (CRD):
```yaml
apiVersion: janitor.dgxc.nvidia.com/v1alpha1
kind: RebootNode
metadata:
  name: maintenance-gpu-node-42-6720abc123def456789
spec:
  nodeName: gpu-node-42
```

Note: The CRD is consumed by an external operator (e.g., Janitor) that handles the actual maintenance workflow.
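Building the CRD from the event is a small transformation. A Python sketch (the `event_id` suffix mirrors the MongoDB `_id` embedded in the example name above; the helper name is hypothetical):

```python
def reboot_node_crd(node_name, event_id):
    """Build the RebootNode manifest shown above; the name embeds the event _id
    so repeated processing of the same event is idempotent."""
    return {
        "apiVersion": "janitor.dgxc.nvidia.com/v1alpha1",
        "kind": "RebootNode",
        "metadata": {"name": f"maintenance-{node_name}-{event_id}"},
        "spec": {"nodeName": node_name},
    }
```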

8. Health Events Analyzer

What it receives:

  • MongoDB change stream events (all events)

What it does:

  • Pattern detection (recurring errors)
  • Trend analysis (error frequency increasing)
  • Correlation (multiple failures on same rack)

What it emits:

  • New HealthEvents (for correlated/aggregated issues)
  • Aggregated metrics to Prometheus
  • Alert annotations to HealthEvents
  • Dashboard data
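Pattern detection over the event stream can be as simple as counting (nodeName, errorCode) pairs within a window. A Python sketch of the recurring-error case (the threshold and grouping key are illustrative assumptions, not the analyzer's actual rules):

```python
from collections import Counter

def recurring_errors(events, threshold=3):
    """Return (nodeName, errorCode) pairs seen at least `threshold` times."""
    counts = Counter(
        (e["nodeName"], code)
        for e in events
        for code in e.get("errorCode", [])
    )
    return [pair for pair, n in counts.items() if n >= threshold]
```

A hit would cause the analyzer to emit a new, aggregated HealthEvent back through the same pipeline.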

Detailed Sequence Diagrams

Scenario 1: GPU XID Error Detection to Node Quarantine

Scenario 2: Full Remediation Flow


Data Transformations

gRPC to MongoDB

Input (gRPC):

```text
HealthEvents {
  version: 1
  events: [
    HealthEvent {
      agent: "gpu-health-monitor"
      nodeName: "gpu-node-42"
      isFatal: true
      // ... other fields
    }
  ]
}
```

Output (MongoDB):

```js
{
  "_id": ObjectId("6720abc123def456789"),
  "createdAt": ISODate("2025-10-28T10:15:30.123Z"),
  "healthevent": {
    "version": 1,
    "agent": "gpu-health-monitor",
    "nodeName": "gpu-node-42",
    "isFatal": true
    // ... all HealthEvent fields preserved
  },
  "healtheventstatus": {
    "nodequarantined": null,          // null, "Quarantined", "UnQuarantined", or "AlreadyQuarantined"
    "userpodsevictionstatus": {
      "status": "NotStarted",         // "NotStarted", "InProgress", "Failed", "Succeeded", or "AlreadyDrained"
      "message": ""                   // Optional error/status message
    },
    "faultremediated": null,          // null or boolean
    "lastremediationtimestamp": null  // null or ISODate
  }
}
```

MongoDB Change Stream to Module

Change Stream Event:

```js
{
  "_id": {"_data": "..."},
  "operationType": "insert",
  "fullDocument": {
    "_id": ObjectId("6720abc123def456789"),
    "healthevent": {
      "version": 1,
      "agent": "gpu-health-monitor",
      "isFatal": true,
      "nodeName": "gpu-node-42"
      // ... full HealthEvent
    },
    "healtheventstatus": {
      // ... status fields as in MongoDB document
    },
    "createdAt": ISODate("2025-10-28T10:15:30.123Z")
  }
}
```

Module receives:

  • Deserializes fullDocument into HealthEvent struct
  • Evaluates based on module-specific logic
  • Takes action via Kubernetes API

HealthEvent to Kubernetes Node Condition

HealthEvent:

```json
{
  "checkName": "XID_ERROR_48",
  "isFatal": true,
  "message": "GPU 0 reported XID 48"
}
```

Kubernetes Node Condition:

```yaml
conditions:
- type: GPUHealthy
  status: "False"
  reason: XID_ERROR_48
  message: "GPU 0 reported XID 48"
  lastTransitionTime: "2025-10-28T10:15:30Z"
```
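This mapping is mechanical. A Python sketch (the `component` parameter, generalizing the `GPUHealthy` condition type to other component classes, is an assumption):

```python
def to_node_condition(event, component="GPU"):
    """Map a HealthEvent dict to the node condition fields shown above."""
    return {
        "type": f"{component}Healthy",
        "status": "False" if event["isFatal"] else "True",
        "reason": event["checkName"],
        "message": event["message"],
    }
```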

HealthEvent to Kubernetes CRD

HealthEvent:

```json
{
  "nodeName": "gpu-node-42",
  "checkName": "XID_ERROR_48",
  "recommendedAction": "RESTART_BM"
}
```

RebootNode CRD:

```yaml
apiVersion: janitor.dgxc.nvidia.com/v1alpha1
kind: RebootNode
metadata:
  name: maintenance-gpu-node-42-6720abc123def456789
spec:
  nodeName: gpu-node-42
```

Data Flow Summary

| Source | Data Format | Transport | Destination | Action |
|---|---|---|---|---|
| GPU Monitor | HealthEvent (protobuf) | gRPC | Platform Connectors | Publish event |
| Syslog Monitor | HealthEvent (protobuf) | gRPC | Platform Connectors | Publish event |
| CSP Monitor | HealthEvent (protobuf) | gRPC | Platform Connectors | Publish event |
| Platform Connectors | HealthEvent (BSON) | MongoDB insert | MongoDB | Persist event |
| Platform Connectors | Node (JSON) | Kubernetes API | K8s Nodes | Update condition |
| MongoDB | ChangeStream (BSON) | MongoDB change stream | All modules | Subscribe to events |
| Fault Quarantine | Node (JSON) | Kubernetes API | K8s Nodes | Cordon node |
| Node Drainer | Pod Eviction (JSON) | Kubernetes API | K8s Pods | Evict pods |
| Fault Remediation | CRD (YAML) | Kubernetes API | K8s CRDs | Create repair request |

Connection Methods

gRPC Connections

  • Protocol: HTTP/2 + Protocol Buffers
  • Service Definition: PlatformConnector.HealthEventOccurredV1
  • Client: Health Monitors (GPU, Syslog, CSP)
  • Server: Platform Connectors
  • Port: Configurable (default: 50051)
  • TLS: Optional (cert-manager integration)

MongoDB Connections

  • Write Path: Platform Connectors → MongoDB (insert)
  • Read Path: All core modules ← MongoDB (change streams)
  • Connection String: mongodb://nvsentinel-mongodb:27017/nvsentinel
  • Collection: health_events
  • Indexes: nodeName, agent, createdAt, status

Kubernetes API Connections

  • Authentication: ServiceAccount tokens
  • Authorization: RBAC (Roles/ClusterRoles)
  • API Groups: v1 (core), policy/v1 (eviction), custom CRDs
  • Operations: GET, PATCH, DELETE, CREATE, WATCH

Key Insights

  1. Decoupled Architecture: Monitors don’t know about modules, modules don’t know about monitors
  2. Single Data Model: HealthEvent is the universal language
  3. Event Sourcing: MongoDB change streams enable reactive processing
  4. Kubernetes-Native: Final actions all go through K8s API (auditability)
  5. Idempotent Operations: Modules can re-process events safely
  6. Metadata Rich: HealthEvent metadata field allows extensibility without schema changes