Kubernetes Object Monitor | NVIDIA NVSentinel Documentation

Overview

The Kubernetes Object Monitor watches any Kubernetes resource (nodes, pods, custom resources, etc.) and generates health events when they enter unhealthy states. It’s a policy-based monitor that uses CEL (Common Expression Language) expressions to detect problems in your cluster resources.

Think of it as a customizable watchdog for your Kubernetes cluster: you define what “unhealthy” means for different resources, and it alerts NVSentinel when problems occur.

Why Do You Need This?

While NVSentinel includes specialized monitors for GPUs and system logs, your cluster health depends on many other factors:

Node conditions: Nodes can become NotReady, have disk pressure, memory pressure, or network issues
Custom resources: Your application’s CRDs (Custom Resource Definitions) may have status fields indicating failures
Application-specific health: Resources managed by your operators or controllers may need monitoring
Integration with existing systems: Quickly integrate external monitoring systems or tools into NVSentinel by exposing their status as Kubernetes resources

The Kubernetes Object Monitor fills these gaps by letting you define custom health checks for any resource in your cluster using simple CEL expressions. This provides a quick way to integrate existing systems into NVSentinel without writing custom monitors.
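As an illustration of the node-condition case, the following sketch adds a policy that flags nodes reporting disk pressure. It follows the policy schema described under Configuration. The policy name, message, and the NODE_DISK_PRESSURE error code are illustrative assumptions rather than values defined by NVSentinel, and the sketch assumes the standard CEL `exists` macro is enabled in the monitor's CEL environment.

```yaml
kubernetes-object-monitor:
  policies:
    # Hypothetical policy: flag nodes whose DiskPressure condition is "True"
    - name: node-disk-pressure
      enabled: true
      resource:
        group: ""
        version: v1
        kind: Node
      predicate:
        # Guard with has() so nodes without conditions do not cause an evaluation error
        expression: |
          has(resource.status.conditions) &&
          resource.status.conditions.exists(c, c.type == "DiskPressure" && c.status == "True")
      healthEvent:
        componentClass: Node
        isFatal: false
        message: "Node is under disk pressure"
        recommendedAction: CONTACT_SUPPORT
        errorCode:
          - NODE_DISK_PRESSURE   # illustrative code, not a built-in
```

The same pattern should apply to other node conditions such as MemoryPressure by changing the condition type in the predicate.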

How It Works

The monitor operates using policies that you define:

Watch resources: Uses Kubernetes controllers to watch specified resource types (Nodes, Pods, Jobs, CRDs, etc.)
Evaluate health state: Evaluates CEL expressions against resource state
Detect unhealthy state: When a CEL expression evaluates to true, the resource is considered unhealthy and an unhealthy event is generated
Detect recovery: When a CEL expression evaluates to false, the resource is considered healthy and a healthy event is automatically sent
Map to nodes: Associates the health event with a specific node using CEL expressions
Publish events: Sends health events to Platform Connectors for processing by NVSentinel core modules

The monitor automatically creates Kubernetes RBAC permissions based on your policies, granting read access to the resources you want to monitor.
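The steps above can be mapped onto the fields of a single policy. The sketch below is a hypothetical policy for failed Pods, annotated with the step each field corresponds to. The policy name, message, and POD_FAILED error code are assumptions made for illustration; `status.phase` and `spec.nodeName` are standard Pod fields.

```yaml
# Hypothetical policy entry (one item of the policies list)
- name: pod-failed                   # illustrative name
  enabled: true
  resource:                          # Watch resources: the kind the monitor watches
    group: ""
    version: v1
    kind: Pod
  predicate:                         # Evaluate health state
    # true  -> resource is unhealthy, unhealthy event is generated
    # false -> resource is healthy, healthy event is sent
    expression: |
      has(resource.spec.nodeName) &&
      has(resource.status.phase) && resource.status.phase == "Failed"
  nodeAssociation:                   # Map to nodes
    expression: resource.spec.nodeName
  healthEvent:                       # Publish events: template for the published event
    componentClass: Node
    isFatal: false
    message: "Pod failed on node"
    recommendedAction: CONTACT_SUPPORT
    errorCode:
      - POD_FAILED                   # illustrative code, not a built-in
```

The `has(resource.spec.nodeName)` guard is a design choice in this sketch: it keeps unscheduled Pods from matching, so the node association expression always has a value to return.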

Configuration

Configure the Kubernetes Object Monitor through Helm values by defining policies:

kubernetes-object-monitor:
  enabled: true

  maxConcurrentReconciles: 1
  resyncPeriod: 5m

  policies:
    # Example 1: Monitor node readiness
    - name: node-not-ready
      enabled: true
      resource:
        group: ""         # Core API group (empty string)
        version: v1
        kind: Node
      predicate:
        expression: |
          resource.status.conditions.filter(c, c.type == "Ready" && c.status == "False").size() > 0
      healthEvent:
        componentClass: Node
        isFatal: true
        message: "Node is not ready"
        recommendedAction: CONTACT_SUPPORT
        errorCode:
          - NODE_NOT_READY

    # Example 2: Monitor custom resource with node association
    - name: gpu-job-failed
      enabled: true
      resource:
        group: batch.example.com
        version: v1alpha1
        kind: GPUJob
        namespace: gpu-operator  # Optional: restrict informer cache to one namespace
      predicate:
        # Detect when job fails
        expression: |
          has(resource.status.state) && resource.status.state == "Failed"
      nodeAssociation:
        # Map this job to a specific node
        expression: resource.spec.nodeName
      healthEvent:
        componentClass: GPU
        isFatal: false
        message: "GPU job failed on node"
        recommendedAction: CONTACT_SUPPORT
        errorCode:
          - GPU_JOB_FAILED
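Example 1 matches only a Ready condition whose status is "False". Kubernetes also reports the Ready condition as "Unknown" when the node controller has not heard from a node recently, so a stricter policy may want to treat any status other than "True" as unhealthy. The fragment below is a sketch of that variant of the predicate; it assumes the standard CEL `exists` macro is available in the monitor's CEL environment.

```yaml
predicate:
  # Unhealthy when the Ready condition is "False" or "Unknown"
  expression: |
    has(resource.status.conditions) &&
    resource.status.conditions.exists(c, c.type == "Ready" && c.status != "True")
```

Whether unreachable nodes should raise a fatal health event is a policy decision, so this variant is best paired with a healthEvent template chosen for that case.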

Policy Configuration

Each policy has these components:

Resource Selection

resource:
  group: ""              # API group (empty for core resources)
  version: v1            # API version
  kind: Node             # Resource kind
  # namespace: gpu-operator # Optional, only for namespaced resources

Leave namespace unset to watch all namespaces for that resource kind. For namespaced resources with large object counts, setting it reduces informer cache memory usage. Do not set it for cluster-scoped resources such as Node.
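The three scoping cases can be sketched side by side. These are separate resource blocks, each belonging to its own policy, shown here as separate YAML documents for comparison. The choice of Pod and Job as example kinds is an assumption for illustration; the gpu-operator namespace is taken from the example above.

```yaml
# Cluster-scoped resource: namespace must stay unset
resource:
  group: ""
  version: v1
  kind: Node
---
# Namespaced resource restricted to one namespace (smaller informer cache)
resource:
  group: ""
  version: v1
  kind: Pod
  namespace: gpu-operator
---
# Namespaced resource watched across all namespaces
resource:
  group: batch
  version: v1
  kind: Job
```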

Predicate (Detection Logic)

predicate:
  # CEL expression that returns true when the resource is unhealthy
  # When true: unhealthy event is sent
  # When false: healthy event is automatically sent
  expression: |
    resource.status.conditions.filter(c, c.type == "Ready" && c.status == "False").size() > 0

Available variables in predicates:

resource: The Kubernetes resource being evaluated
now: Current timestamp
lookup(version, kind, namespace, name): Fetch related resources
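As a sketch of how `now` might be used, the predicate below reports a node as unhealthy only after its Ready condition has been "False" for more than five minutes. It rests on two assumptions that are not confirmed here: that `now` is a CEL timestamp value, and that `lastTransitionTime` is exposed to the expression as an RFC 3339 string, which is why it is converted with the standard CEL `timestamp()` function.

```yaml
predicate:
  # Unhealthy only after Ready has been "False" for more than five minutes
  expression: |
    has(resource.status.conditions) &&
    resource.status.conditions.exists(c,
      c.type == "Ready" && c.status == "False" &&
      now - timestamp(c.lastTransitionTime) > duration("5m"))
```

Because the result depends on the current time, it can only change when the expression is re-evaluated. Presumably that happens on the next watch event or resync, so the effective delay may exceed five minutes depending on resyncPeriod.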

Node Association (Optional)

nodeAssociation:
  expression: resource.spec.nodeName  # CEL expression that returns node name

For resources that don’t directly reference a node, you can use lookup() to traverse relationships:

nodeAssociation:
  # Get node from a related Pod
  expression: |
    lookup('v1', 'Pod', resource.metadata.namespace, resource.spec.podName).spec.nodeName

Health Event Template

healthEvent:
  componentClass: Node                # Component type (Node, GPU, etc.)
  isFatal: true                       # Severity flag
  message: "Node is not ready"        # Human-readable message
  recommendedAction: CONTACT_SUPPORT  # Action hint
  errorCode:
    - NODE_NOT_READY                  # Error codes for classification
  quarantineOverrides:                # Optional: override node cordon behavior
    force: true                       # Or use skip: true; do not set both
  drainOverrides:                     # Optional: override pod eviction behavior
    skip: true                        # Or use force: true; do not set both

For each override block, force and skip are mutually exclusive. Use force when this policy should perform the action regardless of normal rules, or skip when this policy should bypass the action.

Key Features

Policy-Based Monitoring

Define custom health checks for any Kubernetes resource using declarative policies, with no code required.

CEL Expression Language

Use CEL for flexible, powerful condition evaluation with access to the full resource object.

Resource Relationships

The lookup() function lets you traverse resource relationships to associate health events with nodes.

Automatic RBAC

Kubernetes permissions are generated automatically from your policies, so you don’t manage RBAC manually.

State Tracking

Maintains state for each resource to detect transitions between healthy and unhealthy states.

Extensible

Monitor any resource: core resources (Nodes, Pods), namespaced resources, cluster-scoped resources, or CRDs.

Controller-Runtime Based

Uses Kubernetes controller-runtime for efficient, scalable resource watching with caching.