Preflight configuration

Preflight is a mutating admission webhook that injects GPU diagnostic init containers (DCGM diagnostics, optional NCCL loopback / all-reduce) into pods that request GPUs in namespaces you opt in via labels. It answers “is this GPU healthy enough to start?” before your workload runs—separate from continuous GPU health monitoring.

For architecture, tradeoffs, and metrics design, see ADR-026: Preflight checks.

Prerequisites

  • The preflight Helm subchart is disabled by default; enable it with global.preflight.enabled (see below).
  • cert-manager (or OpenShift service CA) for webhook TLS—same expectation as the rest of the NVSentinel chart.
  • DCGM reachable from injected init containers (typically the NVIDIA GPU Operator’s DCGM / hostengine service). Configure the endpoint via DCGM_HOSTENGINE_ADDR on the preflight-dcgm-diag init container.
  • Multi-node / gang checks (e.g. preflight-nccl-allreduce): enable gang coordination and configure gang discovery for your scheduler (see below).

Enable preflight

  1. Set the global flag:

```yaml
global:
  preflight:
    enabled: true
```

  2. Configure the preflight subchart under the top-level preflight: key (values merge into distros/kubernetes/nvsentinel/charts/preflight/values.yaml). At minimum, review initContainers (including DCGM and NCCL env vars) and webhook.failurePolicy; a sketch follows this list.

  3. Label namespaces where injection should apply:

```bash
kubectl label namespace <namespace> nvsentinel.nvidia.com/preflight=enabled
```

The chart default namespaceSelector matches that label.
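
For step 2, a minimal combined values override might look like the sketch below. It only uses keys referenced on this page; confirm exact field names and defaults against the subchart's values.yaml before relying on them.

```yaml
# values-override.yaml (sketch; confirm keys and defaults against
# distros/kubernetes/nvsentinel/charts/preflight/values.yaml)
global:
  preflight:
    enabled: true

preflight:
  webhook:
    failurePolicy: Ignore          # "Fail" blocks GPU pods whenever the webhook is unavailable
  initContainers:
    - name: preflight-dcgm-diag
      env:
        - name: DCGM_HOSTENGINE_ADDR
          value: "nvidia-dcgm.gpu-operator.svc:5555"
        - name: DCGM_DIAG_LEVEL
          value: "2"
```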

Init container placement

By default the webhook appends preflight init containers after any existing init containers in the pod spec. This ensures provider-injected setup containers (e.g., GCP TCPXO daemon) complete before preflight checks run.

Set initContainerPlacement to change this behavior:

1# "append" (default): add after existing init containers
2# "prepend": add before existing init containers
3initContainerPlacement: "prepend"

Use prepend when preflight checks must run before other init containers — for example, to gate workload setup on GPU health validation.

Per-pod check selection

By default, all init containers with defaultEnabled: true (or omitted, which defaults to true) are injected into every GPU pod. To select a subset of checks for a specific pod, annotate it:

```yaml
metadata:
  annotations:
    nvsentinel.nvidia.com/preflight-checks: "preflight-dcgm-diag,preflight-nccl-loopback"
```

Only the named containers are injected, in the order they appear in the annotation. Duplicate or unknown container names cause admission to be rejected with an error.

An empty value disables all checks:

```yaml
nvsentinel.nvidia.com/preflight-checks: ""
```

When the annotation is absent, defaultEnabled on each init container controls whether it runs. For gang-aware checks (nccl-allreduce), all pods in the gang must have the same annotation value — mismatches are detected and fail fast before torchrun launches.

See ADR-034 for design details.

Init containers (check configuration)

The initContainers list in the preflight chart defines which checks the webhook injects. Each entry is a standard corev1.Container — you control images, env vars, resource limits, security contexts, and volume mounts.

The webhook automatically injects these env vars into every init container (you do not need to set them):

| Env var | Source | Purpose |
| --- | --- | --- |
| NODE_NAME | Downward API (spec.nodeName) | Kubernetes node name for health events |
| PLATFORM_CONNECTOR_SOCKET | Chart connectorSocket | Unix socket for the platform-connector gRPC endpoint |
| PROCESSING_STRATEGY | Chart processingStrategy | EXECUTE_REMEDIATION or STORE_ONLY; controls downstream action |

For gang-aware containers the webhook also injects GANG_ID, GANG_CONFIG_DIR, GANG_TIMEOUT_SECONDS, and POD_NAME.
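
As a rough illustration, the injected env on an admitted check container ends up looking something like the fragment below; the socket path and strategy value are placeholders, with the real values coming from the chart's connectorSocket and processingStrategy settings.

```yaml
initContainers:
  - name: preflight-dcgm-diag
    env:
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName       # Downward API
      - name: PLATFORM_CONNECTOR_SOCKET
        value: "/var/run/nvsentinel.sock"  # illustrative path; taken from the chart's connectorSocket
      - name: PROCESSING_STRATEGY
        value: "STORE_ONLY"                # or EXECUTE_REMEDIATION
```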

preflight-dcgm-diag

Runs DCGM diagnostics against every GPU allocated to the pod via the remote hostengine.

| Env var | Default | Description |
| --- | --- | --- |
| DCGM_DIAG_LEVEL | 2 | Diagnostic depth: 1 = short (approx. 30 s, software deployment checks), 2 = medium (approx. 2 min, adds PCIe and basic GPU stress), 3 = long (approx. 15 min, adds Diagnostic plugin stress), 4 = xlong (1-2 hr, extended stress) |
| DCGM_HOSTENGINE_ADDR | nvidia-dcgm.gpu-operator.svc:5555 | DCGM hostengine gRPC endpoint |

Example values override:

```yaml
initContainers:
  - name: preflight-dcgm-diag
    image:
      repository: ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag
      tag: ""
    env:
      - name: DCGM_HOSTENGINE_ADDR
        value: "nvidia-dcgm.gpu-operator.svc:5555"
      - name: DCGM_DIAG_LEVEL
        value: "2"
    volumeMounts:
      - name: nvsentinel-socket
        mountPath: /var/run
```

preflight-nccl-loopback

Single-node NCCL all-reduce across all GPUs on the node. Validates intra-node interconnect (NVLink or PCIe).

| Env var | Default | Description |
| --- | --- | --- |
| BW_THRESHOLD_GBPS | 150 | Minimum acceptable bus bandwidth in GB/s. NVLink interconnect typically sustains 150+ GB/s; set to approx. 15 GB/s for PCIe interconnect |
| TEST_SIZE_MB | 256 | Message size in MB for the all-reduce benchmark |
| SKIP_BANDWIDTH_CHECK | false | When true, pass if the benchmark completes regardless of measured bandwidth |

Example values override:

```yaml
initContainers:
  - name: preflight-nccl-loopback
    image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:latest
    env:
      - name: BW_THRESHOLD_GBPS
        value: "15" # PCIe interconnect
      - name: TEST_SIZE_MB
        value: "512"
```

preflight-nccl-allreduce

Multi-node NCCL all-reduce across the entire gang. Requires gangCoordination.enabled: true and a gang-aware scheduler.

| Env var | Default | Description |
| --- | --- | --- |
| BW_THRESHOLD_GBPS | 100 | Minimum acceptable bus bandwidth in GB/s |
| MESSAGE_SIZES | 4G | Comma-separated message sizes for the benchmark (e.g. "4G", "4G,8G"). Code default is 4G,8G; the Helm chart overrides it to 4G |
| BENCHMARK_ITERS | 20 | Number of timed iterations per message size |
| WARMUP_ITERS | 5 | Warmup iterations before timing begins |
| NCCL_REDUCE_OP | sum | Reduction operation (sum, prod, min, max, avg) |
| SKIP_BANDWIDTH_CHECK | false | Pass if the benchmark completes regardless of bandwidth |
| NCCL_DEBUG | (unset) | NCCL log verbosity (INFO, WARN, etc.) |
| NCCL_DEBUG_SUBSYS | (unset) | NCCL subsystems to log (INIT,NET, etc.) |

The container also requires the IPC_LOCK capability for RDMA memory registration:

```yaml
initContainers:
  - name: preflight-nccl-allreduce
    image: ghcr.io/nvidia/nvsentinel/preflight-nccl-allreduce:latest
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    env:
      - name: BW_THRESHOLD_GBPS
        value: "100"
      - name: MESSAGE_SIZES
        value: "4G"
```

Fabric-specific NCCL configuration

In production the webhook automatically copies NCCL env vars and volume mounts from the pod’s main containers into preflight init containers using glob patterns:

```yaml
ncclEnvPatterns: ["NCCL_*", "FI_*", "LD_LIBRARY_PATH", "UCX_*", "TORCH_NCCL_*", "CUDA_DEVICE_ORDER"]
volumeMountPatterns: ["host-opt-amazon*", "nvtcpxo-*", "nccl-*", "dev-shm"]
```

This means if your training container already has the correct NCCL_TOPO_FILE, FI_PROVIDER, or LD_LIBRARY_PATH, the preflight init containers inherit them with no manual configuration.

For standalone testing (e.g. busybox main container), use ncclAllreduceExtraEnv and gangCoordination.extraHostPathMounts to provide fabric config explicitly.
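
A sketch of that standalone setup follows; the env var values, library paths, and hostPath entry shape are illustrative assumptions rather than chart defaults, so check them against the subchart's values.yaml.

```yaml
# Sketch for a standalone test pod whose main container sets no NCCL env.
ncclAllreduceExtraEnv:
  - name: NCCL_DEBUG
    value: "INFO"
  - name: LD_LIBRARY_PATH
    value: "/usr/local/cuda/lib64:/opt/nccl/lib"   # illustrative library locations

gangCoordination:
  extraHostPathMounts:
    - name: nccl-libs          # illustrative entry shape
      hostPath: /opt/nccl
      mountPath: /opt/nccl
```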

Gang discovery

Gang discovery identifies pods that belong to the same scheduling group so multi-node preflight checks (NCCL all-reduce) know their peers. A pod carries a “gang anchor”—a reference to a parent object—that holds gang metadata such as the minimum member count.

Two discovery mechanisms are supported:

Native Kubernetes (1.35+): workloadRef

The Workload resource (scheduling.k8s.io/v1alpha1) and spec.workloadRef are alpha in Kubernetes 1.35 and disabled by default. Enable the GenericWorkload feature gate on the API server and scheduler to use this path.

This path is the default when gangDiscovery is left empty ({}). Each pod links to a Workload resource via spec.workloadRef:

```yaml
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
```

The Workload object contains a gang policy with minCount:

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
spec:
  podGroups:
    - name: workers
      policy:
        gang:
          minCount: 2
```

No gangDiscovery configuration is needed for this path.

PodGroup-based schedulers (Volcano, Run:ai / OSMO, and similar)

For schedulers that use PodGroup CRDs, configure gangDiscovery with:

| Field | Purpose |
| --- | --- |
| name | Discoverer identifier, used in the gang ID prefix and logging (e.g. "volcano") |
| annotationKeys | Pod annotation keys checked (in order) for the PodGroup name |
| labelKeys | Optional pod label keys checked as a fallback |
| podGroupGVR | group, version, resource of the PodGroup CRD |
| minCountExpr | CEL expression to extract the minimum member count from the PodGroup object. Receives podGroup as the unstructured object. Default: "podGroup.spec.minMember" |

Volcano example:

```yaml
gangDiscovery:
  name: "volcano"
  annotationKeys:
    - "scheduling.k8s.io/group-name"
  podGroupGVR:
    group: "scheduling.volcano.sh"
    version: "v1beta1"
    resource: "podgroups"
  minCountExpr: "podGroup.spec.minMember"
```

Volcano sets the scheduling.k8s.io/group-name annotation on each pod. The discoverer reads that annotation, fetches the corresponding PodGroup CRD, and evaluates minCountExpr to determine expected gang size.

OSMO + Kai scheduler example:

```yaml
gangDiscovery:
  name: "osmo-with-kai"
  labelKeys:
    - "osmo.group_uuid"
  podGroupGVR:
    group: "scheduling.run.ai"
    version: "v2alpha2"
    resource: "podgroups"
  minCountExpr: "podGroup.spec.minMember"
```

Here membership is determined by a pod label instead of an annotation. The rest of the flow is the same: look up the PodGroup CRD and extract minCount via CEL.

Gang coordination

When gangCoordination.enabled is true (default in the preflight chart), the controller coordinates multi-node checks through ConfigMaps:

  1. At admission time the webhook creates a skeleton ConfigMap for the gang and injects it as a volume mount on the pod’s preflight init containers.
  2. As pods become ready the gang controller populates the ConfigMap with peer information (IP, rank).
  3. Init containers read the ConfigMap at gangCoordination.configMapMountPath (default /etc/preflight) to discover the master address and peer list.

Each gang ConfigMap contains:

| Key | Value |
| --- | --- |
| expected_count | Minimum members needed (from the Workload / PodGroup CRD) |
| peers | Newline-separated list of podName;podIP;rank |
| master_addr | IP of the rank-0 pod |
| master_port | Port for PyTorch distributed TCP bootstrap (default 29500) |
| gang_id | Unique gang identifier (discoverer prefix + namespace + group) |

ConfigMaps are labeled nvsentinel.nvidia.com/managed-by: preflight and named with a preflight- prefix.
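
Put together, a populated gang ConfigMap might look roughly like the following; the exact name, gang ID format, pod names, and IPs are illustrative.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: preflight-volcano-training-ns-training-job   # illustrative; real names use the preflight- prefix
  labels:
    nvsentinel.nvidia.com/managed-by: preflight
data:
  expected_count: "2"
  peers: |
    training-job-worker-0;10.0.1.15;0
    training-job-worker-1;10.0.2.31;1
  master_addr: "10.0.1.15"          # IP of the rank-0 pod
  master_port: "29500"
  gang_id: "volcano/training-ns/training-job"   # illustrative: discoverer prefix + namespace + group
```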

Key gangCoordination values

```yaml
gangCoordination:
  enabled: true
  timeout: "10m"                 # Max wait for all members to register
  masterPort: 29500              # PyTorch distributed bootstrap port
  configMapMountPath: "/etc/preflight"

  # Azure InfiniBand topology (required for NDv4/v5)
  ncclTopoConfigMap: ""          # Pre-existing ConfigMap name, or use ncclTopoShape
  ncclTopoShape: ""              # "ndv4" or "ndv5" to auto-create from bundled XML

  extraHostPathMounts: []        # Host paths for NCCL/OFI/CUDA libraries
  extraVolumeMounts: []          # Mount existing pod volumes (e.g. GCP TCPXO plugin)
  # mirrorResourceClaims: true   # Mirror DRA claims to init containers (default true)
```

For DRA / device claims mirrored into init containers, see ADR-026 § DRA Integration and mirrorResourceClaims above.

Key Helm values (subchart)

| Area | Location |
| --- | --- |
| Webhook TLS, failure policy, cert provider | preflight.webhook |
| Init container placement (append/prepend) | preflight.initContainerPlacement |
| Injected init container images and env | preflight.initContainers |
| GPU / network resource names | preflight.gpuResourceNames, preflight.networkResourceNames |
| Copy NCCL / fabric env and mounts from user containers | preflight.ncclEnvPatterns, preflight.volumeMountPatterns |
| Gang discovery | preflight.gangDiscovery |
| Gang coordination (timeouts, topology, mounts) | preflight.gangCoordination |
| Namespace selector for the webhook | preflight.namespaceSelector |
| Pod-level selector for the webhook | preflight.objectSelector |

Object selector (pod-level filtering)

By default the webhook intercepts all GPU pods in labeled namespaces. To further restrict which pods are intercepted, set objectSelector with standard Kubernetes label selectors. When empty ({}), no objectSelector is emitted and all pods in matching namespaces are intercepted.

Example — only intercept pods explicitly labeled for preflight:

```yaml
objectSelector:
  matchLabels:
    nvsentinel.nvidia.com/preflight: "enabled"
```

matchExpressions are also supported:

```yaml
objectSelector:
  matchExpressions:
    - key: nvsentinel.nvidia.com/preflight
      operator: In
      values: ["enabled", "true"]
```

This is useful when you want namespace-wide opt-in via namespaceSelector but only run preflight on specific workloads within those namespaces.

Full defaults and comments: distros/kubernetes/nvsentinel/charts/preflight/values.yaml.

Tilt development often trims init containers to DCGM-only; see distros/kubernetes/nvsentinel/values-tilt.yaml.

Observability

  • Webhook pod: liveness/readiness probes use /healthz on the webhook port.
  • Prometheus metric names for check containers and the injector are specified in ADR-026 § Metrics; wire scrapers to your init container images and deployment as your environment allows.