Preflight configuration
Preflight is a mutating admission webhook that injects GPU diagnostic init containers (DCGM diagnostics, optional NCCL loopback / all-reduce) into pods that request GPUs in namespaces you opt in via labels. It answers “is this GPU healthy enough to start?” before your workload runs—separate from continuous GPU health monitoring.
For architecture, tradeoffs, and metrics design, see ADR-026: Preflight checks.
Prerequisites
- Helm subchart is off by default; enable with global.preflight.enabled (see below).
- cert-manager (or OpenShift service CA) for webhook TLS, the same expectation as the rest of the NVSentinel chart.
- DCGM reachable from injected init containers (typically the NVIDIA GPU Operator’s DCGM / hostengine service). Configure the endpoint via DCGM_HOSTENGINE_ADDR on the preflight-dcgm-diag init container.
- Multi-node / gang checks (e.g. preflight-nccl-allreduce): enable gang coordination and configure gang discovery for your scheduler (see below).
Enable preflight
- Set the global flag:
- Configure the preflight subchart under the top-level preflight: key (values merge into distros/kubernetes/nvsentinel/charts/preflight/values.yaml). At minimum, review initContainers (including DCGM and NCCL env vars) and webhook.failurePolicy.
- Label namespaces where injection should apply:
The chart default namespaceSelector matches that label.
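As a minimal sketch of the first two steps, the values override below sets the global flag and pins the webhook failure policy. The file name is hypothetical, and the nesting of webhook.failurePolicy under preflight: is inferred from the key names used on this page. Ignore and Fail are the two values Kubernetes accepts for a webhook failurePolicy; the choice shown here is for illustration only, so check the chart defaults before relying on it.

```yaml
# values-preflight.yaml (hypothetical file name)
global:
  preflight:
    enabled: true          # turns on the preflight subchart (off by default)

preflight:
  webhook:
    # Ignore admits pods when the webhook cannot be reached;
    # Fail rejects them instead. Illustrative choice, not a recommendation.
    failurePolicy: Ignore
```

With Ignore, a webhook outage lets GPU pods start without preflight checks; with Fail, the same outage blocks pod creation in labeled namespaces.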
Init container placement
By default the webhook appends preflight init containers after any existing init containers in the pod spec. This ensures provider-injected setup containers (e.g., GCP TCPXO daemon) complete before preflight checks run.
Set initContainerPlacement to change this behavior:
Use prepend when preflight checks must run before other init containers — for example, to gate workload setup on GPU health validation.
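A minimal sketch of that setting, assuming initContainerPlacement sits at the top level of the preflight subchart values (the exact nesting is not stated here, so verify it against values.yaml):

```yaml
preflight:
  # Run preflight init containers before any existing init containers.
  # Leaving the key unset keeps the default behavior (append after them).
  initContainerPlacement: prepend
```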
Per-pod check selection
By default, all init containers with defaultEnabled: true (or omitted, which defaults to true) are injected into every GPU pod. To select a subset of checks for a specific pod, annotate it:
Only the named containers are injected, in the order they appear in the annotation. If the annotation contains a duplicate or unknown container name, admission is rejected with an error.
An empty value disables all checks:
When the annotation is absent, defaultEnabled on each init container controls whether it runs. For gang-aware checks (nccl-allreduce), all pods in the gang must have the same annotation value — mismatches are detected and fail fast before torchrun launches.
See ADR-034 for design details.
Init containers (check configuration)
The initContainers list in the preflight chart defines which checks the webhook injects. Each entry is a standard corev1.Container plus preflight-specific controls such as defaultEnabled, inheritUserEnv, and inheritUserVolumeMounts — you control images, env vars, resource limits, security contexts, and volume mounts.
The webhook automatically injects these env vars into every init container (you do not need to set them):
For gang-aware containers the webhook also injects GANG_ID, GANG_CONFIG_DIR, GANG_TIMEOUT_SECONDS, and POD_NAME.
By default, the built-in checks use curated environments and do not inherit matching env vars or volume mounts from workload containers. To intentionally mirror workload NCCL/fabric configuration for a specific check, set inheritUserEnv: true and/or inheritUserVolumeMounts: true on that initContainers entry.
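To make the shape of an entry concrete, here is a hedged sketch of a single initContainers item. It assumes the preflight-specific controls are sibling keys of the standard container fields; the image reference and the memory limit are hypothetical placeholders rather than chart defaults.

```yaml
preflight:
  initContainers:
    - name: preflight-nccl-loopback
      image: registry.example.com/preflight-nccl-loopback:tag  # hypothetical image
      defaultEnabled: false            # not injected when the pod has no check-selection annotation
      inheritUserEnv: false            # keep the curated environment
      inheritUserVolumeMounts: false   # do not copy workload volume mounts
      resources:                       # standard corev1.Container field
        limits:
          memory: 1Gi                  # illustrative value only
```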
preflight-dcgm-diag
Runs DCGM diagnostics against every GPU allocated to the pod via the remote hostengine.
Example values override:
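The sketch below shows one possible override that points the check at a remote hostengine. The address is an assumption (a DCGM service named nvidia-dcgm in a gpu-operator namespace, listening on port 5555); substitute the service that exists in your cluster.

```yaml
preflight:
  initContainers:
    - name: preflight-dcgm-diag
      env:
        - name: DCGM_HOSTENGINE_ADDR
          # Assumed address; replace with your cluster's DCGM hostengine endpoint.
          value: "nvidia-dcgm.gpu-operator.svc:5555"
```

Helm replaces list values wholesale rather than merging them item by item, so an override of initContainers generally needs to restate every entry you want to keep.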
If a DCGM_ST_* status still prevents the diagnostic from completing after
retries, preflight-dcgm-diag emits a non-fatal unhealthy HealthEvent with
RecommendedAction=NONE and exits successfully so the workload is not blocked by
a preflight infrastructure failure.
preflight-nccl-loopback
Single-node NCCL all-reduce across all GPUs on the node. Validates intra-node interconnect (NVLink or PCIe).
Example values override:
preflight-nccl-allreduce
Multi-node NCCL all-reduce across the entire gang. Requires gangCoordination.enabled: true and a gang-aware scheduler.
The container also requires IPC_LOCK capability for RDMA memory registration:
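A minimal sketch of both requirements, using the standard container securityContext. Only the fields relevant to this point are shown; a real entry would also carry the image and environment settings.

```yaml
preflight:
  gangCoordination:
    enabled: true                    # required for multi-node checks
  initContainers:
    - name: preflight-nccl-allreduce
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]          # lets the process lock memory for RDMA registration
```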
Fabric-specific NCCL configuration
When a check opts in with inheritUserEnv or inheritUserVolumeMounts, the webhook copies matching NCCL env vars and volume mounts from the pod’s main containers using glob patterns:
This means if your training container already has the correct NCCL_TOPO_FILE, FI_PROVIDER, or LD_LIBRARY_PATH, an opted-in preflight init container can inherit them with no manual configuration.
Inheritance is configured per init container. The built-in checks set inheritUserEnv: false and inheritUserVolumeMounts: false by default so that workload-specific NCCL tuning does not skew preflight results. Enable the flags only for checks that should intentionally mirror the workload environment:
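As an illustrative sketch, the override below opts the multi-node check into inheritance while leaving the DCGM check on its curated environment. Which checks to opt in is a per-cluster decision; the pairing here is an assumption for illustration.

```yaml
preflight:
  initContainers:
    - name: preflight-nccl-allreduce
      inheritUserEnv: true             # copy matching env vars from the pod's main containers
      inheritUserVolumeMounts: true    # copy matching volume mounts as well
    - name: preflight-dcgm-diag
      inheritUserEnv: false            # keep the curated environment
      inheritUserVolumeMounts: false
```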
For standalone testing (e.g. busybox main container), use ncclAllreduceExtraEnv and gangCoordination.extraHostPathMounts to provide fabric config explicitly.
Gang discovery
Gang discovery identifies pods that belong to the same scheduling group so multi-node preflight checks (NCCL all-reduce) know their peers. A pod carries a “gang anchor”—a reference to a parent object—that holds gang metadata such as the minimum member count.
Two discovery mechanisms are supported:
Native Kubernetes: schedulingGroup / workloadRef
The default when gangDiscovery is left empty ({}). Preflight first uses the Kubernetes 1.36 native PodGroup API when available, then falls back to the Kubernetes 1.35 native Workload API.
The PodGroup resource (scheduling.k8s.io/v1alpha2) and spec.schedulingGroup are alpha in Kubernetes 1.36 and disabled by default. Enable the GenericWorkload feature gate on the API server and scheduler to use this path.
In Kubernetes 1.36, each pod links to a PodGroup resource via spec.schedulingGroup:
The PodGroup object contains a gang policy with minCount:
No gangDiscovery configuration is needed for this path.
In Kubernetes 1.35, each pod links to a native Workload resource via spec.workloadRef:
The Workload object contains pod groups and gang policy:
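A hedged sketch of the Kubernetes 1.35 pairing follows. Field names reflect the alpha Workload API (scheduling.k8s.io/v1alpha1) as understood at the time of writing and may change between releases; all object names, the namespace, and the image are hypothetical.

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-workload        # hypothetical name
  namespace: ml-jobs             # hypothetical namespace
spec:
  podGroups:
    - name: workers
      policy:
        gang:
          minCount: 4            # minimum gang size that preflight reads
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: ml-jobs
spec:
  workloadRef:
    name: training-workload      # the gang anchor: points at the Workload above
    podGroup: workers            # pod group within that Workload
  containers:
    - name: trainer
      image: registry.example.com/trainer:tag   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1
```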
No gangDiscovery configuration is needed for this fallback path either.
The default chart RBAC grants read access to both native resources:
scheduling.k8s.io/podgroups for Kubernetes 1.36 and
scheduling.k8s.io/workloads for Kubernetes 1.35.
PodGroup-based schedulers (Volcano, Run:ai / OSMO, and similar)
For schedulers that use PodGroup CRDs, configure gangDiscovery with:
Volcano example:
Volcano sets the scheduling.k8s.io/group-name annotation on each pod. The discoverer reads that annotation, fetches the corresponding PodGroup CRD, and evaluates minCountExpr to determine expected gang size.
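For orientation, these are the two objects the discoverer reads in the Volcano case, sketched with hypothetical names. In Volcano's PodGroup CRD (scheduling.volcano.sh/v1beta1) the gang size lives in spec.minMember, so minCountExpr would typically resolve to that field; the exact expression syntax is defined by the chart and is not shown here.

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: training-pg              # hypothetical name
  namespace: ml-jobs             # hypothetical namespace
spec:
  minMember: 4                   # expected gang size
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: ml-jobs
  annotations:
    scheduling.k8s.io/group-name: training-pg   # links the pod to the PodGroup above
spec:
  schedulerName: volcano
  containers:
    - name: trainer
      image: registry.example.com/trainer:tag   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1
```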
OSMO + Kai scheduler example:
Here membership is determined by a pod label instead of an annotation. The rest of the flow is the same: look up the PodGroup CRD and extract minCount via CEL.
Gang coordination
When gangCoordination.enabled is true (default in the preflight chart), the controller coordinates multi-node checks through ConfigMaps:
- At admission time the webhook creates a skeleton ConfigMap for the gang and injects it as a volume mount on the pod’s preflight init containers.
- As pods become ready the gang controller populates the ConfigMap with peer information (IP, rank).
- Init containers read the ConfigMap at gangCoordination.configMapMountPath (default /etc/preflight) to discover the master address and peer list.
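The two settings named in the steps above can be written out explicitly as follows. This is a sketch that simply restates the documented defaults; other gangCoordination keys are omitted.

```yaml
preflight:
  gangCoordination:
    enabled: true                        # default in the preflight chart
    configMapMountPath: /etc/preflight   # where init containers read the gang ConfigMap
```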
Each gang ConfigMap contains:
ConfigMaps are labeled nvsentinel.nvidia.com/managed-by: preflight and named with a preflight- prefix.
Key gangCoordination values
For DRA / device claims mirrored into init containers, see ADR-026 §DRA Integration and mirrorResourceClaims above.
Key Helm values (subchart)
Object selector (pod-level filtering)
By default the webhook intercepts all GPU pods in labeled namespaces. To further restrict which pods are intercepted, set objectSelector with standard Kubernetes label selectors. When empty ({}), no objectSelector is emitted and all pods in matching namespaces are intercepted.
Example — only intercept pods explicitly labeled for preflight:
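A minimal sketch using matchLabels. The label key and value are hypothetical, and placing objectSelector at the top level of the preflight values is an assumption to verify against values.yaml.

```yaml
preflight:
  objectSelector:
    matchLabels:
      example.com/preflight: "enabled"   # hypothetical label; pods without it are skipped
```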
matchExpressions are also supported:
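A sketch of the same idea with a set-based expression. The key and values are hypothetical, and the placement of objectSelector under preflight: is again assumed; In is one of the standard Kubernetes label selector operators (In, NotIn, Exists, DoesNotExist).

```yaml
preflight:
  objectSelector:
    matchExpressions:
      - key: example.com/workload-tier       # hypothetical label key
        operator: In
        values: ["training", "benchmark"]    # hypothetical values
```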
This is useful when you want namespace-wide opt-in via namespaceSelector but only run preflight on specific workloads within those namespaces.
Full defaults and comments: distros/kubernetes/nvsentinel/charts/preflight/values.yaml.
Tilt development often trims init containers to DCGM-only; see distros/kubernetes/nvsentinel/values-tilt.yaml.
Observability
- Webhook pod: liveness/readiness probes use /healthz on the webhook port.
- Prometheus metric names for check containers and the injector are specified in ADR-026 § Metrics; wire scrapers to your init container images and deployment as your environment allows.
Related documentation
- ADR-026: Preflight checks
- ADR-035: Inline DCGM config — design rationale for inline env var configuration
- gRPC / TLS authentication (mentions preflight among webhooks)
- Helm chart README
- E2E test entry point: tests/preflight_test.go (build tag amd64_group)