Kubernetes Object Monitor Configuration
Overview
The Kubernetes Object Monitor module watches Kubernetes resources and generates health events when resources enter unhealthy states. This document covers all Helm configuration options for system administrators.
Configuration Reference
Module Enable/Disable
Controls whether the kubernetes-object-monitor module is deployed in the cluster.
Resources
Defines CPU and memory resource requests and limits for the kubernetes-object-monitor pod.
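As a rough illustration, the fragment below follows the standard Kubernetes requests/limits layout. The position of the resources key in the chart's values file and the numbers themselves are assumptions; check the chart's values.yaml for the actual path and defaults.

```yaml
# Illustrative values only; the key path and the numbers are assumptions.
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi
```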
Logging
Sets the verbosity level for kubernetes-object-monitor logs.
Controller Configuration
Controls behavior of the Kubernetes controller watching resources.
maxConcurrentReconciles
Maximum number of concurrent reconciliation workers. Higher values allow parallel processing of multiple resources.
resyncPeriod
How often the controller re-evaluates all watched resources even without changes.
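A minimal sketch of how these two settings might appear in a values file. The controller parent key, the duration-string format, and both values are assumptions made for illustration; only the two field names come from this document.

```yaml
# Hypothetical parent key; verify against the chart's values.yaml.
controller:
  maxConcurrentReconciles: 4   # illustrative value
  resyncPeriod: 10m            # assumed duration-string format
```

Raising maxConcurrentReconciles trades higher API server and CPU load for faster processing when many resources change at once.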
Policies Configuration
Policies define which Kubernetes resources to monitor and when to generate health events.
Policy Structure
Parameters
name
Unique identifier for the policy used in logs and metrics.
enabled
Enables or disables the policy. Disabled policies are not compiled or evaluated.
resource
Specifies the Kubernetes resource type to monitor.
group
API group of the resource. Use an empty string ("") for core resources (Pod, Node, Service, etc.).
version
API version of the resource (e.g., v1, v1beta1).
kind
Kubernetes Kind of the resource (e.g., Node, Pod, Deployment).
predicate
CEL expression that evaluates to true when the resource is in an unhealthy state. It is evaluated with the resource variable bound to the full resource object.
expression
CEL expression that accesses the resource via the resource variable.
nodeAssociation
Optional CEL expression that maps the resource to a specific Kubernetes node name.
expression
CEL expression that returns a string node name.
healthEvent
Defines the health event to generate when the predicate matches.
componentClass
Component type for the health event (e.g., Node, GPU, Pod).
isFatal
Boolean indicating if this is a fatal error that should trigger quarantine.
message
Human-readable error message included in the health event.
recommendedAction
Action code defined in the health event proto (see health_event.proto).
errorCode
Array of error code strings for categorization and filtering.
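The parameters above might fit together roughly as follows. This is a sketch, not the chart's canonical layout: the policies top-level key, the policy name, the error code, and the recommendedAction placeholder are all assumptions, and the nesting mirrors the parameter hierarchy described in this section.

```yaml
policies:                      # assumed top-level key
  - name: example-pod-failed   # hypothetical policy name
    enabled: true
    resource:
      group: ""                # empty string selects the core API group
      version: v1
      kind: Pod
    predicate:
      # True when the Pod is unhealthy
      expression: 'resource.status.phase == "Failed"'
    nodeAssociation:
      # Must return the node name as a string
      expression: 'resource.spec.nodeName'
    healthEvent:
      componentClass: Pod
      isFatal: false
      message: "Pod entered the Failed phase"
      recommendedAction: ACTION_PLACEHOLDER   # replace with a value from health_event.proto
      errorCode:
        - POD_FAILED           # hypothetical error code
```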
CEL Expressions
Predicate Expressions
Access the resource object via the resource variable.
Common Patterns
Check if condition exists and is True:
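One way to express this uses the standard CEL exists macro over the conditions list. MemoryPressure is used here only as an example of a Node condition that is True when the node is unhealthy.

```yaml
predicate:
  # Matches when a MemoryPressure condition is present with status "True"
  expression: 'resource.status.conditions.exists(c, c.type == "MemoryPressure" && c.status == "True")'
```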
Check field value:
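A sketch that compares a single field, assuming the policy watches Pods, whose status.phase field can take the value Failed:

```yaml
predicate:
  # Matches Pods whose phase is exactly "Failed"
  expression: 'resource.status.phase == "Failed"'
```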
Check label exists:
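A possible form combines the CEL has() macro with the in operator. The label key below is hypothetical. The has() guard is there because metadata.labels may be absent on resources that carry no labels.

```yaml
predicate:
  # example.com/needs-attention is a hypothetical label key
  expression: 'has(resource.metadata.labels) && "example.com/needs-attention" in resource.metadata.labels'
```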
Node Association Expressions
Map resources to nodes using CEL expressions.
Direct Field Reference
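When the resource already carries the node name, the expression can return that field directly. A sketch for Pods, whose spec.nodeName field holds the node the Pod is scheduled on:

```yaml
nodeAssociation:
  # Returns the node name as a string
  expression: 'resource.spec.nodeName'
```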
Using lookup() Function
The lookup() function retrieves other Kubernetes resources during evaluation.
Signature: lookup(version, kind, namespace, name)
Parameters:
- version (string) - API version (e.g., "v1", "apps/v1")
- kind (string) - Resource Kind (e.g., "Pod", "Node")
- namespace (string) - Namespace (use empty string "" for cluster-scoped resources)
- name (string) - Resource name
Examples:
Get node from pod reference:
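A hedged sketch of this case. It assumes a custom resource with a hypothetical spec.podRef field that names a Pod, that the arguments follow the order of the parameter list above, and that lookup() returns the referenced object so its fields can be read.

```yaml
nodeAssociation:
  # spec.podRef is a hypothetical field; adjust to the actual resource schema.
  expression: 'lookup("v1", "Pod", resource.spec.podRef.namespace, resource.spec.podRef.name).spec.nodeName'
```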
Policy Examples
Example 1: Node Not Ready
Monitor nodes that are not in Ready state.
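Such a policy could look roughly like this. The policies key, the policy name, the message, the error code, and the recommendedAction placeholder are assumptions. nodeAssociation is left out because it is optional and the watched resource is itself a Node.

```yaml
policies:                      # assumed top-level key
  - name: node-not-ready       # hypothetical policy name
    enabled: true
    resource:
      group: ""
      version: v1
      kind: Node
    predicate:
      # True when a Ready condition is present and its status is not "True"
      expression: 'resource.status.conditions.exists(c, c.type == "Ready" && c.status != "True")'
    healthEvent:
      componentClass: Node
      isFatal: true
      message: "Node is not in Ready state"
      recommendedAction: ACTION_PLACEHOLDER   # replace with a value from health_event.proto
      errorCode:
        - NODE_NOT_READY       # hypothetical error code
```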
Example 2: Node Needs Repair
Monitor custom node conditions.
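A sketch under stated assumptions: NeedsRepair is a hypothetical condition type written to the Node's status by some other component, and the policies key, policy name, message, error code, and recommendedAction placeholder are illustrative.

```yaml
policies:                      # assumed top-level key
  - name: node-needs-repair    # hypothetical policy name
    enabled: true
    resource:
      group: ""
      version: v1
      kind: Node
    predicate:
      # NeedsRepair is a hypothetical custom condition type
      expression: 'resource.status.conditions.exists(c, c.type == "NeedsRepair" && c.status == "True")'
    healthEvent:
      componentClass: Node
      isFatal: true
      message: "Node reports a condition indicating it needs repair"
      recommendedAction: ACTION_PLACEHOLDER   # replace with a value from health_event.proto
      errorCode:
        - NODE_NEEDS_REPAIR    # hypothetical error code
```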
RBAC Permissions
RBAC permissions are automatically generated based on configured policies:
- Node resources: receive write permissions (patch/update) for annotations
- All other resources: receive read-only permissions (get/list/watch)
When adding a new policy for a Custom Resource, ensure the CRD is installed before deploying the kubernetes-object-monitor.
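To make the rule above concrete, the generated permissions for a configuration with one Node policy and one Deployment policy might resemble the ClusterRole below. This is an illustration only: the role name is assumed, the actual rendered manifest may differ, and the read verbs on nodes are an assumption based on the controller needing to watch them.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubernetes-object-monitor   # assumed name
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    # patch/update per this section; the read verbs are an assumption
    verbs: ["get", "list", "watch", "patch", "update"]
  - apiGroups: ["apps"]
    resources: ["deployments"]      # example non-Node resource from a policy
    verbs: ["get", "list", "watch"]
```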
Policy Design Guidelines
- Predicate Specificity: Write predicates that clearly identify unhealthy states
- Node Association: Provide nodeAssociation for non-Node resources to enable quarantine
- Error Codes: Use descriptive error codes for filtering and categorization
- Fatal vs NonFatal: Set isFatal: true only for errors requiring node quarantine
- Testing: Use dry-run mode to test policy expressions before production deployment
- Performance: Avoid expensive operations in predicates (e.g., multiple nested lookups)