Monitoring Critical Operators (DaemonSet Pods in gpu-operator & network-operator Namespace)
Overview
For an NVIDIA GPU cluster to function correctly, the critical infrastructure components in the `gpu-operator` and `network-operator` namespaces must be healthy. If these operators fail, the underlying hardware cannot be utilized effectively.
NVSentinel provides a built-in mechanism to monitor these operators and report health events when their pods are not running correctly.
Configuration
To monitor the GPU and Network operators, you must enable the `kubernetes-object-monitor` component and define the monitoring policies in your NVSentinel `values.yaml`. Add the following configuration:
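The exact values schema is defined by your NVSentinel chart version; the sketch below only illustrates the shape of such a configuration, and key names such as `policies`, `resource`, and `predicate` are assumptions to verify against your chart. The CEL predicate encodes the detection conditions described in the rest of this page.

```yaml
# Sketch only: verify key names against your NVSentinel chart's values schema.
kubernetes-object-monitor:
  enabled: true
  resyncPeriod: 5m  # how often all resources are re-evaluated (see below)
  policies:
    - name: gpu-operator-daemonset-pods  # the policy name becomes part of the state key
      resource:
        apiVersion: v1
        kind: Pod
        namespace: gpu-operator
      # DaemonSet-owned pod, assigned to a node, unhealthy (bad phase or a
      # CrashLoopBackOff container), and started more than 30 minutes ago.
      predicate: &daemonsetPodUnhealthy |
        resource.metadata.ownerReferences.exists(ref, ref.kind == 'DaemonSet') &&
        has(resource.spec.nodeName) &&
        (
          !(resource.status.phase in ['Running', 'Succeeded']) ||
          resource.status.containerStatuses.exists(cs,
            has(cs.state.waiting) && cs.state.waiting.reason == 'CrashLoopBackOff')
        ) &&
        now - timestamp(resource.status.startTime) > duration('30m')
    - name: network-operator-daemonset-pods
      resource:
        apiVersion: v1
        kind: Pod
        namespace: network-operator
      predicate: *daemonsetPodUnhealthy  # same conditions, different namespace
```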
These policies monitor DaemonSet pods in the gpu-operator and network-operator namespaces. A health event is generated if a DaemonSet pod:
- Has been assigned to a node, AND
- Is unhealthy: either not in `Running`/`Succeeded` state, OR has a container in `CrashLoopBackOff`
Pod Health Tracking (DaemonSet Only)
The policies track individual DaemonSet-owned pods by name. When a pod's health state changes:
- Pod becomes unhealthy → Node is cordoned
- Pod becomes healthy → Node is uncordoned
- Pod is deleted → Node is uncordoned (if a replacement pod comes up unhealthy, it will re-cordon the node)
This approach ensures that:
- Healthy pods always result in uncordoned nodes
- Multiple unhealthy pods on the same node are tracked independently
- Each pod must become healthy (or be deleted) for the node to be uncordoned
Detection Logic
The policy triggers when all of the following conditions are true:
- The pod is owned by a DaemonSet in the `gpu-operator` or `network-operator` namespace
- The pod has been assigned to a node (`spec.nodeName` is set)
- The pod is unhealthy: not in the `Running`/`Succeeded` phase, or a container is in `CrashLoopBackOff`
- The pod's `status.startTime` is older than the configured threshold (default 30 minutes)
Note: Only DaemonSet pods are monitored. Pods owned by ReplicaSets, Deployments, or Jobs, as well as standalone pods, are not monitored by these policies. This is because DaemonSet pods are the critical infrastructure components that affect GPU node health.
What This Catches
The policy detects pods that are genuinely stuck in any non-progressing state, such as:
- Stuck in init container execution (pod phase is `Pending`)
- `Pending` due to resource constraints
- `CrashLoopBackOff` in main containers (pod phase is `Running` but a container is crashing)
- `CrashLoopBackOff` in init containers (pod phase is `Pending`)
- `ImagePullBackOff` errors
- Any other state preventing the pod from becoming healthy
Pod Tracking Behavior
The `kubernetes-object-monitor` tracks each pod individually by name. This simple approach provides clear and predictable behavior:
Scenario: Pod Becomes Unhealthy
- Pod enters unhealthy state (e.g., init container stuck, `CrashLoopBackOff`) → Node is cordoned after threshold
- Pod becomes healthy (`Running` with all containers ready) → Node is uncordoned
Scenario: Main Container CrashLoopBackOff
- Container crashes repeatedly → Pod phase stays `Running`, but the container enters `CrashLoopBackOff`
- Policy detects this via the `containerStatuses` check → Node is cordoned
- Container is fixed and becomes healthy → Node is uncordoned
Scenario: Pod Deletion
- Pod fails → Node is cordoned
- Admin deletes the pod → Node is uncordoned
- Replacement pod is created → If unhealthy, node is re-cordoned after threshold
Scenario: Multiple Unhealthy Pods
- Pod A fails → Node is cordoned
- Pod B also fails → Both pods are tracked in the node's annotation
- Pod A becomes healthy → Node stays cordoned (Pod B still unhealthy)
- Pod B becomes healthy → Node is uncordoned
State Key Format
The monitor uses a simple state key format: `policyName/namespace/podName`
This ensures each pod is tracked independently, and the node is only uncordoned when all tracked pods are healthy or deleted.
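For example, two unhealthy pods tracked by a policy named `gpu-operator-daemonset-pods` (the policy and pod names here are illustrative) get separate state keys:

```
gpu-operator-daemonset-pods/gpu-operator/nvidia-driver-daemonset-x7k2q
gpu-operator-daemonset-pods/gpu-operator/nvidia-device-plugin-daemonset-9fj3m
```

The node is uncordoned only after both keys report healthy (or their pods are deleted).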
Configuration Options
Adjusting the Time Threshold
You can adjust the `30m` (30 minutes) threshold based on your environment:
- `duration('10m')` - 10 minutes (more aggressive; may cause false positives for slow image pulls)
- `duration('1h')` - 1 hour (more lenient; delays detection of stuck pods)
Choose a value that exceeds your longest expected pod initialization time.
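For example, to relax detection to one hour, change only the duration in the predicate's time clause (a fragment of the CEL expression shown earlier):

```
now - timestamp(resource.status.startTime) > duration('1h')
```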
How the Time Check Works
The `now - timestamp(resource.status.startTime) > duration('30m')` expression combines:
- `now` - the current timestamp (provided by the CEL environment)
- `timestamp(resource.status.startTime)` - when the pod started
- `duration('30m')` - the threshold duration (30 minutes)
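For example (illustrative timestamps): a pod with `startTime: 2025-06-01T10:00:00Z` evaluated at `10:35:00Z` has been running for 35 minutes, which exceeds `duration('30m')`, so the time condition is satisfied.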
Resync Period
The `resyncPeriod` setting controls how often the monitor re-evaluates all resources:
- Default: `5m` (5 minutes)
- For faster detection, reduce it to `1m` or `30s`
- Trade-off: lower values increase API server load
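Assuming the same illustrative values layout as the sketch above:

```yaml
kubernetes-object-monitor:
  resyncPeriod: 1m  # re-evaluate every minute (higher API server load)
```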
Troubleshooting
Investigating Health Events
If you receive these events, investigate the pod status:
- Check pod status and events (see the command sketch after this list)
- Check container logs, if containers have started
- Common issues to look for:
  - `ImagePullBackOff` - check the image name and registry credentials
  - `CrashLoopBackOff` - check container logs for the crash reason
  - `Pending` - check node resources and scheduling constraints
  - Init container stuck - check the init container logs
- Verify node health
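A sketch of the corresponding kubectl commands; pod, container, and node names are placeholders, and `gpu-operator` stands in for whichever namespace the affected pod is in:

```bash
# Pod status and recent events
kubectl describe pod <pod-name> -n gpu-operator

# Container logs (add --previous to see the last crashed run)
kubectl logs <pod-name> -n gpu-operator --all-containers

# Init container logs
kubectl logs <pod-name> -n gpu-operator -c <init-container-name>

# Node health and conditions
kubectl describe node <node-name>
```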
Common Issues
Problem: Pods are not being monitored (no health events)
- Check whether the pod matches the policy predicate: verify the pod is in the correct namespace and meets all predicate conditions (see the commands below)
- Check the `kubernetes-object-monitor` logs
- Verify the policy is enabled in your configuration
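A sketch of commands for these checks; the monitor's namespace and workload name are assumptions, so adjust them to match your NVSentinel installation:

```bash
# Inspect the pod's ownerReferences, phase, and containerStatuses
kubectl get pod <pod-name> -n gpu-operator -o yaml

# Follow the monitor's logs (deployment name and namespace are illustrative)
kubectl logs -n nvsentinel deploy/kubernetes-object-monitor -f
```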
Problem: Node stays cordoned after pod is healthy
- Check whether the DaemonSet still targets the node
- Check the policy match annotation on the node (see the command sketch below)
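A sketch of commands for these checks; the annotation key is an implementation detail of the monitor, so inspect the full annotation map rather than guessing a key:

```bash
# Is a pod from the DaemonSet still scheduled on this node?
kubectl get pods -n gpu-operator -o wide --field-selector spec.nodeName=<node-name>

# Dump the node's annotations and look for the monitor's policy match entries
kubectl get node <node-name> -o jsonpath='{.metadata.annotations}'
```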