Monitoring Critical Operators (DaemonSet Pods in the gpu-operator and network-operator Namespaces)


Overview

For an NVIDIA GPU cluster to function correctly, the critical infrastructure components in the gpu-operator and network-operator namespaces must be healthy. If these operators fail, the underlying hardware cannot be used effectively.

NVSentinel provides a built-in mechanism to monitor these operators and report health events when their pods are not running correctly.

Configuration

To monitor the GPU and Network operators, you must enable the kubernetes-object-monitor component and define the monitoring policies in your NVSentinel values.yaml.

These policies monitor DaemonSet pods in the gpu-operator and network-operator namespaces. A health event is generated if a DaemonSet pod:

  • Has been assigned to a node, AND
  • Is unhealthy: either not in Running/Succeeded state, OR has a container in CrashLoopBackOff

The policy detects pods that are genuinely stuck in any non-progressing state such as:

  • Stuck in init container execution (pod phase is Pending)
  • Pending due to resource constraints
  • CrashLoopBackOff in main containers (pod phase is Running but container is crashing)
  • CrashLoopBackOff in init containers (pod phase is Pending)
  • ImagePullBackOff errors
  • Any other state preventing the pod from becoming healthy

Pod Health Tracking (DaemonSet Only)

The policies track individual DaemonSet-owned pods by name. When a pod’s health state changes:

  • Pod becomes unhealthy → Node is cordoned
  • Pod becomes healthy → Node is uncordoned
  • Pod is deleted → Node is uncordoned (if a replacement pod comes up unhealthy, it will re-cordon the node)

This approach ensures that:

  • Healthy pods always result in uncordoned nodes
  • Multiple unhealthy pods on the same node are tracked independently
  • Each pod must become healthy (or be deleted) for the node to be uncordoned
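
To confirm what the monitor has done to a node, you can check the cordon (unschedulable) flag directly; a minimal check using standard kubectl:

    $ # Lists each node with its unschedulable flag; <none> means the node is schedulable
    $ kubectl get nodes -o custom-columns=NAME:.metadata.name,CORDONED:.spec.unschedulable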

Add the following configuration to your values.yaml:

# 1. Enable the component
global:
  kubernetesObjectMonitor:
    enabled: true

# 2. Configure the policies
kubernetes-object-monitor:
  maxConcurrentReconciles: 1
  resyncPeriod: 5m
  policies:
    # Policy 1: Monitor GPU Operator DaemonSet Pods
    - name: gpu-operator-pods-health
      enabled: true
      resource:
        group: ""
        version: v1
        kind: Pod
      predicate:
        # Trigger event if:
        # 1. Pod is in gpu-operator namespace
        # 2. Pod is owned by a DaemonSet (we only monitor DaemonSet pods)
        # 3. Pod has been assigned to a node (nodeName is set)
        # 4. Pod is unhealthy: either NOT in Running/Succeeded state OR in CrashLoopBackOff
        # 5. Pod has been running for at least the configured threshold (grace period)
        #
        # Note: CrashLoopBackOff pods have phase=Running but the container is in Waiting state
        # with reason=CrashLoopBackOff, so we must check containerStatuses explicitly.
        expression: |
          resource.metadata.namespace == 'gpu-operator' &&
          has(resource.metadata.ownerReferences) &&
          resource.metadata.ownerReferences.exists(r, r.kind == 'DaemonSet') &&
          has(resource.spec.nodeName) && resource.spec.nodeName != "" &&
          has(resource.status.startTime) &&
          now - timestamp(resource.status.startTime) > duration('30m') &&
          (
            (resource.status.phase != 'Running' && resource.status.phase != 'Succeeded') ||
            (
              has(resource.status.containerStatuses) &&
              resource.status.containerStatuses.exists(cs,
                has(cs.state.waiting) &&
                has(cs.state.waiting.reason) &&
                cs.state.waiting.reason == 'CrashLoopBackOff'
              )
            )
          )
      nodeAssociation:
        expression: resource.spec.nodeName
      healthEvent:
        componentClass: Software
        isFatal: true
        message: "GPU Operator DaemonSet pod is not healthy"
        recommendedAction: CONTACT_SUPPORT
        errorCode:
          - GPU_OPERATOR_POD_UNHEALTHY

    # Policy 2: Monitor Network Operator DaemonSet Pods
    - name: network-operator-pod-health
      enabled: true
      resource:
        group: ""
        version: v1
        kind: Pod
      predicate:
        expression: |
          resource.metadata.namespace == 'network-operator' &&
          has(resource.metadata.ownerReferences) &&
          resource.metadata.ownerReferences.exists(r, r.kind == 'DaemonSet') &&
          has(resource.spec.nodeName) && resource.spec.nodeName != "" &&
          has(resource.status.startTime) &&
          now - timestamp(resource.status.startTime) > duration('30m') &&
          (
            (resource.status.phase != 'Running' && resource.status.phase != 'Succeeded') ||
            (
              has(resource.status.containerStatuses) &&
              resource.status.containerStatuses.exists(cs,
                has(cs.state.waiting) &&
                has(cs.state.waiting.reason) &&
                cs.state.waiting.reason == 'CrashLoopBackOff'
              )
            )
          )
      nodeAssociation:
        expression: resource.spec.nodeName
      healthEvent:
        componentClass: Software
        isFatal: true
        message: "Network Operator DaemonSet pod is not healthy"
        recommendedAction: CONTACT_SUPPORT
        errorCode:
          - NETWORK_OPERATOR_POD_UNHEALTHY

Detection Logic

The policy triggers when all of the following conditions are true:

Condition       | Check
----------------|--------------------------------------------------------------------------
Namespace       | Pod is in the gpu-operator or network-operator namespace
DaemonSet owned | Pod has a DaemonSet owner reference
Node assigned   | Pod has spec.nodeName set (scheduled to a node)
Time threshold  | Pod has been running for longer than the configured threshold
Unhealthy       | Pod phase is NOT Running/Succeeded, OR a container is in CrashLoopBackOff

Note: Only DaemonSet pods are monitored. Pods owned by ReplicaSets, Deployments, Jobs, or standalone pods are not monitored by these policies. This is because DaemonSet pods are the critical infrastructure components that affect GPU node health.
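
If you are unsure whether a particular pod falls under these policies, inspect its owner references; a quick check with standard kubectl (the pod name is a placeholder):

    $ # Prints the kind of each owner; DaemonSet pods report "DaemonSet"
    $ kubectl get pod <pod-name> -n gpu-operator -o jsonpath='{.metadata.ownerReferences[*].kind}'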

What This Catches

Stuck State                          | Pod Phase | Detected?
-------------------------------------|-----------|------------------------------
Stuck in init containers             | Pending   | Yes (phase check)
Init container CrashLoopBackOff      | Pending   | Yes (phase check)
Main container CrashLoopBackOff      | Running   | Yes (containerStatuses check)
Pending (scheduling/resource issues) | Pending   | Yes (phase check)
ImagePullBackOff / ErrImagePull      | Pending   | Yes (phase check)
Failed phase                         | Failed    | Yes (phase check)
Normal initialization (< threshold)  | Any       | No (grace period)
Healthy pod                          | Running   | No (healthy)
Completed job                        | Succeeded | No (completed)
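
To see which pods in a namespace would currently fail the phase check, a rough approximation using standard kubectl field selectors (this does not catch the CrashLoopBackOff-while-Running case, which requires inspecting containerStatuses):

    $ # Pods whose phase is neither Running nor Succeeded
    $ kubectl get pods -n gpu-operator --field-selector=status.phase!=Running,status.phase!=Succeeded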

Pod Tracking Behavior

The kubernetes-object-monitor tracks each pod individually by name. This simple approach provides clear and predictable behavior:

Scenario: Pod Becomes Unhealthy

  1. Pod enters unhealthy state (e.g., init container stuck, CrashLoopBackOff) → Node is cordoned after threshold
  2. Pod becomes healthy (Running with all containers ready) → Node is uncordoned

Scenario: Main Container CrashLoopBackOff

  1. Container crashes repeatedly → Pod phase stays Running, but container enters CrashLoopBackOff
  2. Policy detects via containerStatuses check → Node is cordoned
  3. Container is fixed and becomes healthy → Node is uncordoned

Scenario: Pod Deletion

  1. Pod fails → Node is cordoned
  2. Admin deletes the pod → Node is uncordoned
  3. Replacement pod is created → If unhealthy, node is re-cordoned after threshold
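
For step 2, deleting the pod with standard kubectl is enough; the owning DaemonSet recreates it automatically (pod name and namespace are placeholders):

    $ kubectl delete pod <pod-name> -n gpu-operator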

Scenario: Multiple Unhealthy Pods

  1. Pod A fails → Node is cordoned
  2. Pod B also fails → Both tracked in annotation
  3. Pod A becomes healthy → Node stays cordoned (Pod B still unhealthy)
  4. Pod B becomes healthy → Node is uncordoned

State Key Format

The monitor uses a simple state key format: policyName/namespace/podName

This ensures each pod is tracked independently, and the node is only uncordoned when all tracked pods are healthy or deleted.
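
For example, an unhealthy pod caught by the first policy would be tracked under a key like the following (the pod name here is hypothetical):

    gpu-operator-pods-health/gpu-operator/nvidia-device-plugin-daemonset-x7k2p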

Configuration Options

Adjusting the Time Threshold

You can adjust the 30m (30 minutes) threshold based on your environment:

  • duration('10m') - 10 minutes (more aggressive, may cause false positives for slow image pulls)
  • duration('1h') - 1 hour (more lenient, delays detection of stuck pods)

Choose a value that exceeds your longest expected pod initialization time.
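
For example, to extend the grace period to one hour, change only the duration in each policy's predicate expression; a minimal sketch of the affected line (everything else in the expression stays as in the configuration above):

    # In the predicate expression, replace duration('30m') with the new threshold:
    now - timestamp(resource.status.startTime) > duration('1h') &&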

How the Time Check Works

The now - timestamp(resource.status.startTime) > duration('30m') expression:

  • now - Current timestamp (provided by the CEL environment)
  • timestamp(resource.status.startTime) - The time at which the pod started (status.startTime)
  • duration('30m') - The threshold duration (30 minutes)
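
As a rough worked example (timestamps are hypothetical):

    # status.startTime = 2024-01-01T10:00:00Z
    # evaluation time  = 2024-01-01T10:45:00Z
    # now - timestamp(resource.status.startTime) = 45m, and 45m > duration('30m') is true,
    # so the time condition is met; at 10:20:00Z the pod would still be inside its grace period.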

Resync Period

The resyncPeriod controls how often the monitor re-evaluates all resources:

  • Default: 5m (5 minutes)
  • For faster detection, reduce to 1m or 30s
  • Trade-off: Lower values increase API server load

kubernetes-object-monitor:
  resyncPeriod: 1m  # Re-evaluate every minute

Troubleshooting

Investigating Health Events

If you receive these events, investigate the pod status:

  1. Check pod status and events:

    $ kubectl get pods -n gpu-operator -o wide
    $ kubectl describe pod <pod-name> -n gpu-operator
  2. Check container logs (if containers have started):

    $ kubectl logs -n gpu-operator <pod-name>
    $ # For init containers:
    $ kubectl logs -n gpu-operator <pod-name> -c <init-container-name>
  3. Common issues to look for:

    • ImagePullBackOff - Check image name and registry credentials
    • CrashLoopBackOff - Check container logs for crash reason
    • Pending - Check node resources and scheduling constraints
    • Init container stuck - Check init container logs
  4. Verify node health:

    $ kubectl get node <node-name>
    $ kubectl describe node <node-name>

Common Issues

Problem: Pods are not being monitored (no health events)

  1. Check if the pod matches the policy predicate:

    $ kubectl get pod <pod-name> -n <namespace> -o yaml

    Verify the pod is in the correct namespace and meets all predicate conditions.

  2. Check kubernetes-object-monitor logs:

    $ kubectl logs -n nvsentinel deployment/kubernetes-object-monitor
  3. Verify the policy is enabled in your configuration.
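
If the policy block itself may be missing, inspecting the deployed Helm values is a quick check; this assumes the release is named nvsentinel and installed in the nvsentinel namespace:

    $ # Show the user-supplied values, including the kubernetes-object-monitor policies
    $ helm get values nvsentinel -n nvsentinel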

Problem: Node stays cordoned after pod is healthy

  1. Check if the DaemonSet still targets the node:

    $ kubectl get ds <daemonset-name> -n <namespace> -o yaml | grep -A 10 nodeSelector
  2. Check the policy match annotation on the node:

    $ kubectl get node <node-name> -o jsonpath='{.metadata.annotations.nvsentinel\.dgxc\.nvidia\.com/policy-matches}'