Preflight

View as Markdown

Overview

Preflight validates GPU health before your workload starts. It is a mutating admission webhook that injects diagnostic init containers into GPU pods, catching hardware and interconnect failures at pod creation time rather than mid-training.

Think of it as a pre-flight checklist for an aircraft — you want to know the engines work before takeoff, not after.

Why Do You Need This?

NVSentinel’s health monitors (GPU, syslog, CSP) continuously detect failures at runtime and trigger quarantine. Preflight complements them by adding a point-in-time gate at pod creation:

  • Timing gap: A GPU can degrade between the last health-monitor poll and job startup. Preflight closes that window with an on-demand check right before the workload runs
  • Interconnect coverage: Runtime monitors focus on individual GPU health. Preflight’s NCCL checks validate the actual communication path — NVLink, PCIe, InfiniBand, EFA — that a distributed job will use
  • Fast failure: Without preflight, a bad interconnect is typically discovered minutes into training when NCCL operations hang or bandwidth drops. Preflight fails the pod in seconds, before any compute is wasted
  • Multi-node validation: Single-node monitors can’t verify cross-node fabric health. The gang-aware nccl-allreduce check exercises the real multi-node path end-to-end

If a preflight check fails, the pod stays in Init:Error, a health event enters the standard NVSentinel pipeline, and the node proceeds through quarantine and remediation — the same workflow the runtime monitors use.

How It Works

Preflight runs as a Deployment with a mutating admission webhook:

  1. Namespace opt-in: Label namespaces with nvsentinel.nvidia.com/preflight=enabled
  2. Pod admission: When a GPU pod is created in a labeled namespace, the webhook intercepts the request. Optionally, an objectSelector can further restrict which pods are intercepted based on pod labels
  3. Init container injection: The webhook injects diagnostic init containers into the pod spec (appended after existing init containers by default; set initContainerPlacement: prepend to insert before)
  4. Checks run: Init containers execute sequentially before the main workload starts
  5. Health reporting: Each check reports results as health events via the Platform Connector (gRPC over Unix domain socket)
  6. Pass/fail: If all checks pass (exit code 0), the main containers start normally. If any check fails, the pod stays in Init:Error and a health event triggers quarantine

Available Checks

CheckScopeWhat it validatesApproximate duration
preflight-dcgm-diagSingle nodeGPU hardware health via DCGM diagnostics (ECC, PCIe, thermal, stress)30s–15min depending on diag level
preflight-nccl-loopbackSingle nodeIntra-node GPU-to-GPU interconnect (NVLink or PCIe) via NCCL all-reduce~5s
preflight-nccl-allreduceMulti-node (gang)Cross-node GPU communication over the cluster fabric (InfiniBand, EFA, TCPXO)~30s + gang wait time

All checks are optional — configure which ones to inject via initContainers in the Helm values.

Gang Coordination (Multi-Node Checks)

The preflight-nccl-allreduce check requires coordination across all nodes in a scheduling gang. Preflight handles this through:

  1. Gang discovery — identifies which pods belong to the same group using scheduler annotations/labels (Volcano, Run:ai/OSMO, or native Kubernetes WorkloadRef)
  2. ConfigMap-based coordination — the gang controller writes peer information (IPs, ranks) into a ConfigMap that init containers poll until all members are registered
  3. PyTorch distributed bootstrap — once all peers are known, the check uses torchrun to execute a multi-node NCCL all-reduce benchmark

Gang coordination requires a gang-aware scheduler and gangCoordination.enabled: true (default).

Configuration

Enable preflight in the parent chart:

1global:
2 preflight:
3 enabled: true

Key configuration areas:

AreaDescription
preflight.initContainersWhich checks to inject, their images, env vars, and resource limits. DCGM config (DCGM_HOSTENGINE_ADDR, DCGM_DIAG_LEVEL) is defined as env vars on the preflight-dcgm-diag container
preflight.gangDiscoveryScheduler-specific gang identification (Volcano, Run:ai, native K8s)
preflight.gangCoordinationMulti-node coordination timeouts, NCCL topology, extra mounts
preflight.webhookTLS, failure policy, cert provider
preflight.namespaceSelectorWhich namespaces the webhook applies to
preflight.objectSelectorWhich pods the webhook applies to (label-based filtering within selected namespaces)

For detailed configuration including per-check env vars, fabric-specific NCCL setup, and gang discovery examples, see Preflight configuration.