Kubernetes Object Monitor | NVIDIA NVSentinel Documentation

Overview

The Kubernetes Object Monitor module watches Kubernetes resources and generates health events when resources enter unhealthy states. This document covers all Helm configuration options for system administrators.

Configuration Reference

Module Enable/Disable

Controls whether the kubernetes-object-monitor module is deployed in the cluster.

global:
  kubernetesObjectMonitor:
    enabled: true

Resources

Defines CPU and memory resource requests and limits for the kubernetes-object-monitor pod.

kubernetes-object-monitor:
  resources:
    limits:
      cpu: 500m
      memory: 256Mi
    requests:
      cpu: 100m
      memory: 128Mi

Logging

Sets the verbosity level for kubernetes-object-monitor logs.

kubernetes-object-monitor:
  logLevel: info  # Options: debug, info, warn, error
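
The three settings above can live in a single Helm values file. The following is a minimal sketch that reuses the keys and values from the preceding sections; the values are not tuning recommendations. Note that the enable flag sits under global, while the other settings sit under the kubernetes-object-monitor key.

```yaml
# values.yaml (sketch): enable flag, resources, and log level together
global:
  kubernetesObjectMonitor:
    enabled: true            # deploy the module

kubernetes-object-monitor:
  logLevel: info             # one of: debug, info, warn, error
  resources:
    limits:
      cpu: 500m
      memory: 256Mi
    requests:
      cpu: 100m
      memory: 128Mi
```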

Controller Configuration

Controls behavior of the Kubernetes controller watching resources.

kubernetes-object-monitor:
  maxConcurrentReconciles: 1
  resyncPeriod: 5m

maxConcurrentReconciles

Maximum number of concurrent reconciliation workers. Higher values allow parallel processing of multiple resources.

resyncPeriod

How often the controller re-evaluates all watched resources even without changes.
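
As an illustration, a cluster with many watched objects might raise the worker count and lengthen the resync interval. The values below are hypothetical and only show the shape of the setting; they are not recommendations.

```yaml
kubernetes-object-monitor:
  maxConcurrentReconciles: 4   # hypothetical: up to four resources reconciled in parallel
  resyncPeriod: 10m            # hypothetical: full re-evaluation every ten minutes
```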

Policies Configuration

Policies define which Kubernetes resources to monitor and when to generate health events.

Policy Structure

kubernetes-object-monitor:
  policies:
    - name: "policy-name"
      enabled: true
      resource:
        group: ""
        version: v1
        kind: Node
        # namespace: gpu-operator  # Optional, only for namespaced resources
      predicate:
        expression: |
          resource.status.conditions.filter(c, c.type == "Ready" && c.status == "False").size() > 0
      nodeAssociation:
        expression: resource.spec.nodeName
      healthEvent:
        componentClass: Node
        isFatal: true
        message: "Error message"
        recommendedAction: CONTACT_SUPPORT
        errorCode:
          - ERROR_CODE
        quarantineOverrides:
          force: true  # Or use skip: true; do not set both
        drainOverrides:
          skip: true   # Or use force: true; do not set both

Parameters

name

Unique identifier for the policy used in logs and metrics.

enabled

Enables or disables the policy. Disabled policies are not compiled or evaluated.

resource

Specifies the Kubernetes resource type to monitor.

group

API group of the resource. Use the empty string "" for core resources (Pod, Node, Service, etc.).

version

API version of the resource (e.g., v1, v1beta1).

kind

Kubernetes Kind of the resource (e.g., Node, Pod, Deployment).

namespace

Optional namespace restriction for namespaced resources. When set, the monitor creates the informer cache for this resource kind only in that namespace. Leave unset to watch all namespaces. Do not set this for cluster-scoped resources such as Node.
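
For example, a resource block restricted to a single namespace might look like the sketch below. The gpu-operator namespace is taken from the commented line in the policy structure above; the choice of Pod as the kind is an assumption made for illustration.

```yaml
resource:
  group: ""                  # core API group
  version: v1
  kind: Pod                  # a namespaced kind (illustrative choice)
  namespace: gpu-operator    # informer cache is created only for this namespace
```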

predicate

CEL expression that evaluates to true when the resource is in an unhealthy state. It is evaluated with the resource variable bound to the full resource object.

expression

CEL expression that accesses the resource via the resource variable.

nodeAssociation

Optional CEL expression that maps the resource to a specific Kubernetes node name.

expression

CEL expression that returns a string node name.

healthEvent

Defines the health event to generate when the predicate matches.

componentClass

Component type for the health event (e.g., Node, GPU, Pod).

isFatal

Boolean indicating if this is a fatal error that should trigger quarantine.

message

Human-readable error message included in the health event.

recommendedAction

Action code defined in the health event proto (see health_event.proto).

errorCode

Array of error code strings for categorization and filtering.

quarantineOverrides

Optional behavior override for fault-quarantine. force forces node cordoning regardless of normal rules; skip skips node cordoning for the generated health event. Set at most one of force or skip.

drainOverrides

Optional behavior override for node-drainer. force forces immediate pod eviction regardless of configured namespace drain modes; skip skips pod eviction and marks the event as already drained. Set at most one of force or skip.
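
A minimal sketch of a healthEvent block that cordons the node but leaves its workloads running, assuming the field layout from the policy structure above. The message and error code are hypothetical placeholders.

```yaml
healthEvent:
  componentClass: Node
  isFatal: true
  message: "Node needs attention"     # hypothetical message
  recommendedAction: CONTACT_SUPPORT
  errorCode:
    - EXAMPLE_ERROR_CODE              # hypothetical code
  quarantineOverrides:
    force: true    # cordon regardless of the normal fault-quarantine rules
  drainOverrides:
    skip: true     # no pod eviction; the event is marked as already drained
```

Each override block sets only one of force or skip, as required above.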

CEL Expressions

Predicate Expressions

Access the resource object via the resource variable.

Common Patterns

Check if condition exists and is True:

expression: |
  resource.status.conditions.filter(c, c.type == "Ready" && c.status == "True").size() > 0

Check field value:

expression: |
  has(resource.status.phase) && resource.status.phase == "Failed"

Check that a label exists and is set to 'true':

expression: |
  'failure' in resource.metadata.labels && resource.metadata.labels['failure'] == 'true'
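
These patterns can be combined. The sketch below uses has() to guard against a missing conditions list before testing it, and uses the standard CEL exists() macro as a shorter equivalent of filter(...).size() > 0. The DiskPressure condition type is an illustrative choice, and whether status.conditions can be absent depends on the kind being watched.

```yaml
expression: |
  has(resource.status.conditions) &&
  resource.status.conditions.exists(c, c.type == "DiskPressure" && c.status == "True")
```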

Node Association Expressions

Map resources to nodes using CEL expressions.

Direct Field Reference

nodeAssociation:
  expression: resource.spec.nodeName

Using lookup() Function

The lookup() function retrieves other Kubernetes resources during evaluation.

Signature:

lookup(version, kind, namespace, name) -> resource object

Parameters:

version (string) - API version (e.g., "v1", "apps/v1")
kind (string) - Resource Kind (e.g., "Pod", "Node")
namespace (string) - Namespace (use empty string "" for cluster-scoped resources)
name (string) - Resource name

Examples:

Get node from pod reference:

nodeAssociation:
  expression: |
    lookup('v1', 'Pod', resource.metadata.namespace, resource.spec.podName).spec.nodeName
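
For a cluster-scoped target, the namespace argument is the empty string. The sketch below assumes a hypothetical custom resource whose spec.nodeRef field holds a Node name; the field name is not part of any real API.

```yaml
nodeAssociation:
  expression: |
    lookup('v1', 'Node', '', resource.spec.nodeRef).metadata.name
```

The lookup here is only illustrative, since spec.nodeRef already contains the node name. A lookup is mainly useful when the node name must be read from a second object, as in the Pod example above.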

Policy Examples

Example 1: Node Not Ready

Monitor nodes that are not in Ready state.

policies:
  - name: node-not-ready
    enabled: true
    resource:
      group: ""
      version: v1
      kind: Node
    predicate:
      expression: |
        resource.status.conditions.filter(c, c.type == "Ready" && c.status == "False").size() > 0
    healthEvent:
      componentClass: Node
      isFatal: true
      message: "Node is not ready"
      recommendedAction: CONTACT_SUPPORT
      errorCode:
        - NODE_NOT_READY

Example 2: Node Needs Repair

Monitor custom node conditions.

policies:
  - name: NodeNeedsRepair
    enabled: true
    resource:
      group: ""
      version: v1
      kind: Node
    predicate:
      expression: |
        resource.status.conditions.filter(c, c.type == "kubernetes.acme.com/NeedsRepair" && c.status == "True").size() > 0
    healthEvent:
      componentClass: Node
      isFatal: true
      message: "Node needs repair"
      recommendedAction: REPLACE_VM
      errorCode:
        - NODE_NEEDS_REPAIR
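
Policies for non-Node resources follow the same shape but add nodeAssociation so that the event can be tied to a node. The sketch below covers failed Pods in one namespace, combining the field-value pattern and the direct field reference shown earlier. The policy name, namespace, message, error code, and the choice of isFatal: false are illustrative assumptions.

```yaml
policies:
  - name: pod-failed                 # hypothetical policy name
    enabled: true
    resource:
      group: ""
      version: v1
      kind: Pod
      namespace: gpu-operator        # limits the informer cache to one namespace
    predicate:
      expression: |
        has(resource.status.phase) && resource.status.phase == "Failed"
    nodeAssociation:
      expression: resource.spec.nodeName   # node the Pod is scheduled on
    healthEvent:
      componentClass: Pod
      isFatal: false                 # assumption: report without triggering quarantine
      message: "Pod is in Failed phase"
      recommendedAction: CONTACT_SUPPORT
      errorCode:
        - POD_FAILED                 # hypothetical code
```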

RBAC Permissions

RBAC permissions are automatically generated based on configured policies:

Node resources: granted write permissions (patch/update) for annotations
All other resources: granted read-only permissions (get/list/watch)

When adding a new policy for a Custom Resource, ensure the CRD is installed before deploying the kubernetes-object-monitor.
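
For orientation, the rules generated for a configuration with one Node policy and one Pod policy might resemble the ClusterRole below. This is a sketch based on the description above, not the chart's actual output: the role name is hypothetical, and the inclusion of read verbs on nodes is an assumption.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubernetes-object-monitor    # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch", "patch", "update"]   # write access for annotations
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]                      # read-only
```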

Policy Design Guidelines

Predicate Specificity: Write predicates that clearly identify unhealthy states
Node Association: Provide nodeAssociation for non-Node resources to enable quarantine
Error Codes: Use descriptive error codes for filtering and categorization
Fatal vs. Non-Fatal: Set isFatal: true only for errors requiring node quarantine
Testing: Use dry-run mode to test policy expressions before production deployment
Performance: Avoid expensive operations in predicates (e.g., multiple nested lookups)
Namespace Scope: For namespaced resources with many objects, set resource.namespace to reduce informer cache memory usage