
Copyright © 2026, NVIDIA Corporation.


Kubernetes Object Monitor Configuration


Overview

The Kubernetes Object Monitor module watches Kubernetes resources and generates health events when resources enter unhealthy states. This document covers all Helm configuration options for system administrators.

Configuration Reference

Module Enable/Disable

Controls whether the kubernetes-object-monitor module is deployed in the cluster.

    global:
      kubernetesObjectMonitor:
        enabled: true

Resources

Defines CPU and memory resource requests and limits for the kubernetes-object-monitor pod.

    kubernetes-object-monitor:
      resources:
        limits:
          cpu: 500m
          memory: 256Mi
        requests:
          cpu: 100m
          memory: 128Mi

Logging

Sets the verbosity level for kubernetes-object-monitor logs.

    kubernetes-object-monitor:
      logLevel: info # Options: debug, info, warn, error

Controller Configuration

Controls behavior of the Kubernetes controller watching resources.

    kubernetes-object-monitor:
      maxConcurrentReconciles: 1
      resyncPeriod: 5m

maxConcurrentReconciles

Maximum number of concurrent reconciliation workers. Higher values allow parallel processing of multiple resources.

resyncPeriod

How often the controller re-evaluates all watched resources even without changes.
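
As a rough illustration of how these two settings can be adjusted together, the sketch below raises the worker count and lengthens the resync interval. The values 4 and 10m are assumptions chosen for illustration, not recommendations from this reference; only the key names come from the configuration above.

```yaml
kubernetes-object-monitor:
  # Assumed value: allow up to four resources to be reconciled in parallel.
  maxConcurrentReconciles: 4
  # Assumed value: re-evaluate every watched resource every 10 minutes,
  # even when no change has been observed.
  resyncPeriod: 10m
```

A shorter resyncPeriod re-checks resources more often at the cost of more evaluation work, so it is probably worth changing one setting at a time and watching the module's logs and metrics.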

Policies Configuration

Policies define which Kubernetes resources to monitor and when to generate health events.

Policy Structure

    kubernetes-object-monitor:
      policies:
        - name: "policy-name"
          enabled: true
          resource:
            group: ""
            version: v1
            kind: Node
          predicate:
            expression: |
              resource.status.conditions.filter(c, c.type == "Ready" && c.status == "False").size() > 0
          nodeAssociation:
            expression: resource.spec.nodeName
          healthEvent:
            componentClass: Node
            isFatal: true
            message: "Error message"
            recommendedAction: CONTACT_SUPPORT
            errorCode:
              - ERROR_CODE
            quarantineOverrides:
              force: true # Or use skip: true; do not set both
            drainOverrides:
              skip: true # Or use force: true; do not set both

Parameters

name

Unique identifier for the policy. It appears in logs and metrics.

enabled

Enables or disables the policy. Disabled policies are not compiled or evaluated.

resource

Specifies the Kubernetes resource type to monitor.

group

API group of the resource. Use empty string "" for core resources (Pod, Node, Service, etc.).

version

API version of the resource (e.g., v1, v1beta1).

kind

Kubernetes Kind of the resource (e.g., Node, Pod, Deployment).

predicate

CEL expression that evaluates to true when the resource is in an unhealthy state. It is evaluated with the resource variable bound to the full resource object.

expression

CEL expression that accesses the resource through the resource variable.

nodeAssociation

Optional CEL expression that maps the resource to a specific Kubernetes node name.

expression

CEL expression that returns a string node name.

healthEvent

Defines the health event to generate when the predicate matches.

componentClass

Component type for the health event (e.g., Node, GPU, Pod).

isFatal

Boolean indicating whether this is a fatal error that should trigger quarantine.

message

Human-readable error message included in the health event.

recommendedAction

Action code defined in the health event proto (see health_event.proto).

errorCode

Array of error code strings for categorization and filtering.

quarantineOverrides

Optional behavior override for fault-quarantine. Setting force cordons the node regardless of the normal rules; setting skip suppresses cordoning for the generated health event. Set at most one of force or skip.

drainOverrides

Optional behavior override for node-drainer. Setting force evicts pods immediately regardless of the configured namespace drain modes; setting skip suppresses pod eviction and marks the event as already drained. Set at most one of force or skip.
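
To make the override semantics concrete, here is a minimal sketch of a healthEvent that lets fault-quarantine apply its normal rules while telling node-drainer not to evict pods. It assumes the override blocks are nested under healthEvent; the message and error code are hypothetical placeholders, not values defined by this reference.

```yaml
healthEvent:
  componentClass: Node
  isFatal: true
  message: "Node flagged for inspection"   # hypothetical message
  recommendedAction: CONTACT_SUPPORT
  errorCode:
    - NODE_FLAGGED                         # hypothetical error code
  # quarantineOverrides is left unset, so fault-quarantine follows its normal rules.
  drainOverrides:
    skip: true   # no pod eviction; the event is marked as already drained
```

Setting both force and skip in the same override block is not allowed, so each block carries at most one of the two keys.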

CEL Expressions

Predicate Expressions

Access the resource object via the resource variable.

Common Patterns

Check if condition exists and is True:

    expression: |
      resource.status.conditions.filter(c, c.type == "Ready" && c.status == "True").size() > 0

Check field value:

    expression: |
      has(resource.status.phase) && resource.status.phase == "Failed"

Check that a label exists and has a specific value:

    expression: |
      'failure' in resource.metadata.labels && resource.metadata.labels['failure'] == 'true'
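
Standard CEL also offers the exists() macro, which can state a condition check more directly than filter(...).size() > 0. The sketch below is an equivalent way to detect a Ready condition with status "False". It assumes the monitor's CEL environment enables the standard macros, which this page does not state explicitly.

```yaml
expression: |
  // Guard against resources whose conditions list has not been populated yet.
  has(resource.status.conditions) &&
  resource.status.conditions.exists(c, c.type == "Ready" && c.status == "False")
```

The has() guard follows the same pattern as the field-value check above and avoids an evaluation error when the conditions field is absent.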

Node Association Expressions

Map resources to nodes using CEL expressions.

Direct Field Reference

    nodeAssociation:
      expression: resource.spec.nodeName

Using lookup() Function

The lookup() function retrieves other Kubernetes resources during evaluation.

Signature:

    lookup(version, kind, namespace, name) -> resource object

Parameters:

  • version (string) - API version (e.g., "v1", "apps/v1")
  • kind (string) - Resource Kind (e.g., "Pod", "Node")
  • namespace (string) - Namespace (use empty string "" for cluster-scoped resources)
  • name (string) - Resource name

Examples:

Get node from pod reference:

    nodeAssociation:
      expression: |
        lookup('v1', 'Pod', resource.metadata.namespace, resource.spec.podName).spec.nodeName
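
For a cluster-scoped target, the namespace argument is the empty string. The following sketch assumes a hypothetical custom resource whose spec.nodeRef field holds a node name; that field is invented for illustration and is not part of any resource described here.

```yaml
nodeAssociation:
  expression: |
    // Node is cluster-scoped, so the namespace argument is "".
    // spec.nodeRef is a hypothetical field on the watched custom resource.
    lookup('v1', 'Node', '', resource.spec.nodeRef).metadata.name
```

Each lookup() call retrieves another resource during evaluation, so a direct field reference is preferable whenever the node name is already present on the watched resource.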

Policy Examples

Example 1: Node Not Ready

Monitor nodes that are not in Ready state.

    policies:
      - name: node-not-ready
        enabled: true
        resource:
          group: ""
          version: v1
          kind: Node
        predicate:
          expression: |
            resource.status.conditions.filter(c, c.type == "Ready" && c.status == "False").size() > 0
        healthEvent:
          componentClass: Node
          isFatal: true
          message: "Node is not ready"
          recommendedAction: CONTACT_SUPPORT
          errorCode:
            - NODE_NOT_READY

Example 2: Node Needs Repair

Monitor custom node conditions.

    policies:
      - name: NodeNeedsRepair
        enabled: true
        resource:
          group: ""
          version: v1
          kind: Node
        predicate:
          expression: |
            resource.status.conditions.filter(c, c.type == "kubernetes.acme.com/NeedsRepair" && c.status == "True").size() > 0
        healthEvent:
          componentClass: Node
          isFatal: true
          message: "Node needs repair"
          recommendedAction: REPLACE_VM
          errorCode:
            - NODE_NEEDS_REPAIR
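
Both examples above watch Node resources directly. As a complementary sketch, a policy for a non-Node resource might combine the field-value predicate with a nodeAssociation so the event can be tied to a node. The policy name, message, and error code below are hypothetical, and the choice of isFatal: false and CONTACT_SUPPORT is an assumption for illustration only.

```yaml
policies:
  - name: pod-failed               # hypothetical policy name
    enabled: true
    resource:
      group: ""
      version: v1
      kind: Pod
    predicate:
      expression: |
        has(resource.status.phase) && resource.status.phase == "Failed"
    nodeAssociation:
      # Pods record the node they were scheduled onto in spec.nodeName.
      expression: resource.spec.nodeName
    healthEvent:
      componentClass: Pod
      isFatal: false               # assumed: report only, no quarantine
      message: "Pod is in Failed phase"
      recommendedAction: CONTACT_SUPPORT
      errorCode:
        - POD_FAILED               # hypothetical error code
```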

RBAC Permissions

RBAC permissions are generated automatically from the configured policies:

  • Node resources: granted write permissions (patch/update) for annotations
  • All other resources: granted read-only permissions (get/list/watch)

When adding a new policy for a Custom Resource, ensure the CRD is installed before deploying the kubernetes-object-monitor.

Policy Design Guidelines

  1. Predicate Specificity: Write predicates that clearly identify unhealthy states
  2. Node Association: Provide nodeAssociation for non-Node resources to enable quarantine
  3. Error Codes: Use descriptive error codes for filtering and categorization
  4. Fatal vs. Non-Fatal: Set isFatal: true only for errors requiring node quarantine
  5. Testing: Use dry-run mode to test policy expressions before production deployment
  6. Performance: Avoid expensive operations in predicates (e.g., multiple nested lookups)