Preflight
Overview
Preflight validates GPU health before your workload starts. It is a mutating admission webhook that injects diagnostic init containers into GPU pods, catching hardware and interconnect failures at pod creation time rather than mid-training.
Think of it as a pre-flight checklist for an aircraft — you want to know the engines work before takeoff, not after.
Why Do You Need This?
NVSentinel’s health monitors (GPU, syslog, CSP) continuously detect failures at runtime and trigger quarantine. Preflight complements them by adding a point-in-time gate at pod creation:
- Timing gap: A GPU can degrade between the last health-monitor poll and job startup. Preflight closes that window with an on-demand check right before the workload runs
- Interconnect coverage: Runtime monitors focus on individual GPU health. Preflight’s NCCL checks validate the actual communication path — NVLink, PCIe, InfiniBand, EFA — that a distributed job will use
- Fast failure: Without preflight, a bad interconnect is typically discovered minutes into training when NCCL operations hang or bandwidth drops. Preflight fails the pod in seconds, before any compute is wasted
- Multi-node validation: Single-node monitors can't verify cross-node fabric health. The gang-aware `nccl-allreduce` check exercises the real multi-node path end-to-end
If a preflight check fails, the pod stays in `Init:Error`, a health event enters the standard NVSentinel pipeline, and the node proceeds through quarantine and remediation — the same workflow the runtime monitors use.
How It Works
Preflight runs as a Deployment with a mutating admission webhook:
- Namespace opt-in: Label namespaces with `nvsentinel.nvidia.com/preflight=enabled`
- Pod admission: When a GPU pod is created in a labeled namespace, the webhook intercepts the request. Optionally, an `objectSelector` can further restrict which pods are intercepted based on pod labels
- Init container injection: The webhook injects diagnostic init containers into the pod spec (appended after existing init containers by default; set `initContainerPlacement: prepend` to insert them before)
- Checks run: Init containers execute sequentially before the main workload starts
- Health reporting: Each check reports results as health events via the Platform Connector (gRPC over a Unix domain socket)
- Pass/fail: If all checks pass (exit code 0), the main containers start normally. If any check fails, the pod stays in `Init:Error` and a health event triggers quarantine
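The namespace opt-in described above can be expressed as a plain manifest. The label key and value come from this page; the namespace name is only an example:

```yaml
# Opt a namespace in to preflight injection.
# "training" is an example name; the label is the one documented above.
apiVersion: v1
kind: Namespace
metadata:
  name: training
  labels:
    nvsentinel.nvidia.com/preflight: "enabled"
```

Once the label is applied, any GPU pod created in this namespace (and matching the optional `objectSelector`, if configured) will have the diagnostic init containers injected.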
Available Checks
All checks are optional — configure which ones to inject via `initContainers` in the Helm values.
Gang Coordination (Multi-Node Checks)
The `preflight-nccl-allreduce` check requires coordination across all nodes in a scheduling gang. Preflight handles this through:
- Gang discovery — identifies which pods belong to the same group using scheduler annotations/labels (Volcano, Run:ai/OSMO, or native Kubernetes WorkloadRef)
- ConfigMap-based coordination — the gang controller writes peer information (IPs, ranks) into a ConfigMap that init containers poll until all members are registered
- PyTorch distributed bootstrap — once all peers are known, the check uses `torchrun` to execute a multi-node NCCL all-reduce benchmark
Gang coordination requires a gang-aware scheduler and `gangCoordination.enabled: true` (the default).
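As a rough illustration of the ConfigMap-based coordination step, the gang controller's output might look like the sketch below. Only the fact that peer IPs and ranks are recorded is stated on this page; the object name, key layout, and IPs are hypothetical:

```yaml
# Hypothetical coordination ConfigMap written by the gang controller.
# Init containers poll this object until every gang member is registered.
apiVersion: v1
kind: ConfigMap
metadata:
  name: preflight-gang-example   # illustrative name
data:
  "0": "10.0.0.11"   # rank -> pod IP (layout assumed for illustration)
  "1": "10.0.0.12"
```

When the number of entries reaches the gang size, each init container knows every peer's address and rank and can bootstrap the `torchrun` rendezvous.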
Configuration
Enable preflight in the parent chart:
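A hedged sketch of what the parent-chart values might look like — of these keys, only `initContainerPlacement` and `gangCoordination.enabled` are named on this page; the rest (the `preflight.enabled` toggle and the `initContainers` list shape) are illustrative assumptions, so consult the Preflight configuration guide for the authoritative reference:

```yaml
# Illustrative values sketch, not a verbatim reference.
preflight:
  enabled: true                    # assumed top-level toggle
  initContainerPlacement: append   # or "prepend" to run before existing init containers
  initContainers:                  # which checks to inject (shape assumed)
    - name: preflight-nccl-allreduce
  gangCoordination:
    enabled: true                  # default; required for multi-node checks
```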
Key configuration areas, including per-check env vars, fabric-specific NCCL setup, and gang discovery examples, are covered in Preflight configuration.
Related Documentation
- Preflight configuration guide — full Helm values reference
- ADR-026: Preflight checks — architecture and design rationale
- ADR-035: Inline DCGM config — design rationale for inline env var configuration
- GPU Health Monitor — continuous runtime GPU monitoring (complementary to preflight)