Monitoring Critical Operators (DaemonSet Pods in the gpu-operator and network-operator Namespaces)


Overview

For an NVIDIA GPU cluster to function correctly, the critical infrastructure components in the gpu-operator and network-operator namespaces must be healthy. If these operators fail, the underlying hardware cannot be used effectively.

NVSentinel provides a built-in mechanism to monitor these operators and report health events when their pods are not running correctly.

Configuration

To monitor the GPU and Network operators, you must enable the kubernetes-object-monitor component and define the monitoring policies in your NVSentinel values.yaml.

These policies monitor DaemonSet pods in the gpu-operator and network-operator namespaces. A health event is generated if a DaemonSet pod:

  • Has been assigned to a node, AND
  • Is unhealthy: either not in Running/Succeeded state, OR has a container in CrashLoopBackOff

The policy detects pods that are genuinely stuck in any non-progressing state such as:

  • Stuck in init container execution (pod phase is Pending)
  • Pending due to resource constraints
  • CrashLoopBackOff in main containers (pod phase is Running but container is crashing)
  • CrashLoopBackOff in init containers (pod phase is Pending)
  • ImagePullBackOff errors
  • Any other state preventing the pod from becoming healthy

Pod Health Tracking (DaemonSet Only)

The policies track individual DaemonSet-owned pods by name. When a pod’s health state changes:

  • Pod becomes unhealthy → Node is cordoned
  • Pod becomes healthy → Node is uncordoned
  • Pod is deleted → Node is uncordoned (if a replacement pod comes up unhealthy, it will re-cordon the node)

This approach ensures that:

  • Healthy pods always result in uncordoned nodes
  • Multiple unhealthy pods on the same node are tracked independently
  • Each pod must become healthy (or be deleted) for the node to be uncordoned
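
To confirm what the monitor has done to a node, you can check the cordon (unschedulable) flag directly; a minimal check using standard kubectl:

    $ # Lists each node with its unschedulable flag; <none> means the node is schedulable
    $ kubectl get nodes -o custom-columns=NAME:.metadata.name,CORDONED:.spec.unschedulable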

Add the following configuration to your values.yaml:

# 1. Enable the component
global:
  kubernetesObjectMonitor:
    enabled: true

# 2. Configure the policies
kubernetes-object-monitor:
  maxConcurrentReconciles: 1
  resyncPeriod: 5m
  policies:
    # Policy 1: Monitor GPU Operator DaemonSet Pods
    - name: gpu-operator-pods-health
      enabled: true
      resource:
        group: ""
        version: v1
        kind: Pod
      predicate:
        # Trigger event if:
        # 1. Pod is in gpu-operator namespace
        # 2. Pod is owned by a DaemonSet (we only monitor DaemonSet pods)
        # 3. Pod has been assigned to a node (nodeName is set)
        # 4. Pod is unhealthy: either NOT in Running/Succeeded state OR in CrashLoopBackOff
        # 5. Pod has been running for at least the configured threshold (grace period)
        #
        # Note: CrashLoopBackOff pods have phase=Running but the container is in Waiting state
        # with reason=CrashLoopBackOff, so we must check containerStatuses explicitly.
        expression: |
          resource.metadata.namespace == 'gpu-operator' &&
          has(resource.metadata.ownerReferences) &&
          resource.metadata.ownerReferences.exists(r, r.kind == 'DaemonSet') &&
          has(resource.spec.nodeName) && resource.spec.nodeName != "" &&
          has(resource.status.startTime) &&
          now - timestamp(resource.status.startTime) > duration('30m') &&
          (
            (resource.status.phase != 'Running' && resource.status.phase != 'Succeeded') ||
            (
              has(resource.status.containerStatuses) &&
              resource.status.containerStatuses.exists(cs,
                has(cs.state.waiting) &&
                has(cs.state.waiting.reason) &&
                cs.state.waiting.reason == 'CrashLoopBackOff'
              )
            )
          )
      nodeAssociation:
        expression: resource.spec.nodeName
      healthEvent:
        componentClass: Software
        isFatal: true
        message: "GPU Operator DaemonSet pod is not healthy"
        recommendedAction: CONTACT_SUPPORT
        errorCode:
          - GPU_OPERATOR_POD_UNHEALTHY

    # Policy 2: Monitor Network Operator DaemonSet Pods
    - name: network-operator-pod-health
      enabled: true
      resource:
        group: ""
        version: v1
        kind: Pod
      predicate:
        expression: |
          resource.metadata.namespace == 'network-operator' &&
          has(resource.metadata.ownerReferences) &&
          resource.metadata.ownerReferences.exists(r, r.kind == 'DaemonSet') &&
          has(resource.spec.nodeName) && resource.spec.nodeName != "" &&
          has(resource.status.startTime) &&
          now - timestamp(resource.status.startTime) > duration('30m') &&
          (
            (resource.status.phase != 'Running' && resource.status.phase != 'Succeeded') ||
            (
              has(resource.status.containerStatuses) &&
              resource.status.containerStatuses.exists(cs,
                has(cs.state.waiting) &&
                has(cs.state.waiting.reason) &&
                cs.state.waiting.reason == 'CrashLoopBackOff'
              )
            )
          )
      nodeAssociation:
        expression: resource.spec.nodeName
      healthEvent:
        componentClass: Software
        isFatal: true
        message: "Network Operator DaemonSet pod is not healthy"
        recommendedAction: CONTACT_SUPPORT
        errorCode:
          - NETWORK_OPERATOR_POD_UNHEALTHY

Detection Logic

The policy triggers when all of the following conditions are true:

Condition       | Check
----------------|--------------------------------------------------------------------------
Namespace       | Pod is in the gpu-operator or network-operator namespace
DaemonSet owned | Pod has a DaemonSet owner reference
Node assigned   | Pod has spec.nodeName set (scheduled to a node)
Time threshold  | Pod has been running for longer than the configured threshold
Unhealthy       | Pod phase is NOT Running/Succeeded, OR a container is in CrashLoopBackOff

Note: Only DaemonSet pods are monitored. Pods owned by ReplicaSets, Deployments, Jobs, or standalone pods are not monitored by these policies. This is because DaemonSet pods are the critical infrastructure components that affect GPU node health.
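
If you are unsure whether a particular pod falls under these policies, inspect its owner references; a quick check with standard kubectl (the pod name is a placeholder):

    $ # Prints the kind of each owner; DaemonSet pods report "DaemonSet"
    $ kubectl get pod <pod-name> -n gpu-operator -o jsonpath='{.metadata.ownerReferences[*].kind}'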

What This Catches

Stuck State                          | Pod Phase | Detected?
-------------------------------------|-----------|------------------------------
Stuck in init containers             | Pending   | Yes (phase check)
Init container CrashLoopBackOff      | Pending   | Yes (phase check)
Main container CrashLoopBackOff      | Running   | Yes (containerStatuses check)
Pending (scheduling/resource issues) | Pending   | Yes (phase check)
ImagePullBackOff / ErrImagePull      | Pending   | Yes (phase check)
Failed phase                         | Failed    | Yes (phase check)
Normal initialization (< threshold)  | Any       | No (grace period)
Healthy pod                          | Running   | No (healthy)
Completed job                        | Succeeded | No (completed)
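
To see which pods in a namespace would currently fail the phase check, a rough approximation using standard kubectl field selectors (this does not catch the CrashLoopBackOff-while-Running case, which requires inspecting containerStatuses):

    $ # Pods whose phase is neither Running nor Succeeded
    $ kubectl get pods -n gpu-operator --field-selector=status.phase!=Running,status.phase!=Succeeded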

Pod Tracking Behavior

The kubernetes-object-monitor tracks each pod individually by name. This simple approach provides clear and predictable behavior:

Scenario: Pod Becomes Unhealthy

  1. Pod enters unhealthy state (e.g., init container stuck, CrashLoopBackOff) → Node is cordoned after threshold
  2. Pod becomes healthy (Running with all containers ready) → Node is uncordoned

Scenario: Main Container CrashLoopBackOff

  1. Container crashes repeatedly → Pod phase stays Running, but container enters CrashLoopBackOff
  2. Policy detects via containerStatuses check → Node is cordoned
  3. Container is fixed and becomes healthy → Node is uncordoned

Scenario: Pod Deletion

  1. Pod fails → Node is cordoned
  2. Admin deletes the pod → Node is uncordoned
  3. Replacement pod is created → If unhealthy, node is re-cordoned after threshold
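
For step 2, deleting the pod with standard kubectl is enough; the owning DaemonSet recreates it automatically (pod name and namespace are placeholders):

    $ kubectl delete pod <pod-name> -n gpu-operator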

Scenario: Multiple Unhealthy Pods

  1. Pod A fails → Node is cordoned
  2. Pod B also fails → Both tracked in annotation
  3. Pod A becomes healthy → Node stays cordoned (Pod B still unhealthy)
  4. Pod B becomes healthy → Node is uncordoned

State Key Format

The monitor uses a simple state key format: policyName/namespace/podName

This ensures each pod is tracked independently, and the node is only uncordoned when all tracked pods are healthy or deleted.
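
For example, an unhealthy pod caught by the first policy would be tracked under a key like the following (the pod name here is hypothetical):

    gpu-operator-pods-health/gpu-operator/nvidia-device-plugin-daemonset-x7k2p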

Configuration Options

Adjusting the Time Threshold

You can adjust the 30m (30 minutes) threshold based on your environment:

  • duration('10m') - 10 minutes (more aggressive, may cause false positives for slow image pulls)
  • duration('1h') - 1 hour (more lenient, delays detection of stuck pods)

Choose a value that exceeds your longest expected pod initialization time.
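
For example, to extend the grace period to one hour, change only the duration in each policy's predicate expression; a minimal sketch of the affected line (everything else in the expression stays as in the configuration above):

    # In the predicate expression, replace duration('30m') with the new threshold:
    now - timestamp(resource.status.startTime) > duration('1h') &&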

How the Time Check Works

The now - timestamp(resource.status.startTime) > duration('30m') expression:

  • now - Current timestamp (provided by the CEL environment)
  • timestamp(resource.status.startTime) - The time at which the pod started (status.startTime)
  • duration('30m') - The threshold duration (30 minutes)
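
As a rough worked example (timestamps are hypothetical):

    # status.startTime = 2024-01-01T10:00:00Z
    # evaluation time  = 2024-01-01T10:45:00Z
    # now - timestamp(resource.status.startTime) = 45m, and 45m > duration('30m') is true,
    # so the time condition is met; at 10:20:00Z the pod would still be inside its grace period.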

Resync Period

The resyncPeriod controls how often the monitor re-evaluates all resources:

  • Default: 5m (5 minutes)
  • For faster detection, reduce to 1m or 30s
  • Trade-off: Lower values increase API server load

kubernetes-object-monitor:
  resyncPeriod: 1m  # Re-evaluate every minute

Troubleshooting

Investigating Health Events

If you receive these events, investigate the pod status:

  1. Check pod status and events:

    $ kubectl get pods -n gpu-operator -o wide
    $ kubectl describe pod <pod-name> -n gpu-operator
  2. Check container logs (if containers have started):

    $ kubectl logs -n gpu-operator <pod-name>
    $ # For init containers:
    $ kubectl logs -n gpu-operator <pod-name> -c <init-container-name>
  3. Common issues to look for:

    • ImagePullBackOff - Check image name and registry credentials
    • CrashLoopBackOff - Check container logs for crash reason
    • Pending - Check node resources and scheduling constraints
    • Init container stuck - Check init container logs
  4. Verify node health:

    $ kubectl get node <node-name>
    $ kubectl describe node <node-name>

Common Issues

Problem: Pods are not being monitored (no health events)

  1. Check if the pod matches the policy predicate:

    $ kubectl get pod <pod-name> -n <namespace> -o yaml

    Verify the pod is in the correct namespace and meets all predicate conditions.

  2. Check kubernetes-object-monitor logs:

    $ kubectl logs -n nvsentinel deployment/kubernetes-object-monitor
  3. Verify the policy is enabled in your configuration.
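
If the policy block itself may be missing, inspecting the deployed Helm values is a quick check; this assumes the release is named nvsentinel and installed in the nvsentinel namespace:

    $ # Show the user-supplied values, including the kubernetes-object-monitor policies
    $ helm get values nvsentinel -n nvsentinel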

Problem: Node stays cordoned after pod is healthy

  1. Check if the DaemonSet still targets the node:

    $ kubectl get ds <daemonset-name> -n <namespace> -o yaml | grep -A 10 nodeSelector
  2. Check the policy match annotation on the node:

    $ kubectl get node <node-name> -o jsonpath='{.metadata.annotations.nvsentinel\.dgxc\.nvidia\.com/policy-matches}'