Validating a Cluster
Task-oriented walkthrough for running aicr validate against a GPU cluster — from
capturing a snapshot through interpreting results. Covers both training and
inference workloads and all three validation phases (deployment, conformance,
performance).
For per-flag reference, see CLI reference: aicr validate. For the architectural view of how snapshot + recipe flow into the validator, see Data flow: Stage 3 Validate.
When to validate
Readiness pre-flight constraints (K8s version, OS, kernel) run implicitly before any phase. If pre-flight fails, no validator Jobs are deployed.
The workflow
- Snapshot — capture current cluster state (K8s / OS / GPU / topology) once.
- Recipe — generate the target configuration for your workload (training vs inference, platform, accelerator).
- Validate — run one or all phases against the snapshot and live cluster.
Prerequisites
- aicr CLI installed (see installation).
- kubectl configured for the target cluster (the validator dispatches K8s Jobs; pre-flight only needs the snapshot).
- Cluster service account with RBAC to create Jobs, ConfigMaps, and read cluster state (AICR creates its own aicr-validation namespace on first run).
Training performance validation
Training performance runs an NCCL all-reduce benchmark — a Kubeflow TrainJob
that runs all_reduce_perf across GPU nodes and measures aggregate bus
bandwidth. Three check variants are available; the recipe picks the one (or
ones) that match the target fabric:
GB200/EKS recipes (both training and inference intents) enable -net and
-nvls together rather than the auto-detect variant, because those nodes
expose two inter-node fabrics simultaneously and a single auto-detect test
would only exercise one of them.
GB200/OKE recipes enable -nvls only: OKE NET/RDMA stays out of the support
matrix until the OCI testbed proves a non-Socket NCCL transport end to end, so
OKE validates the NVL72 IMEX fabric without an EFA/NET counterpart.
The generated recipe lists the selected variant(s) under
validation.performance.checks with a platform-tuned bandwidth constraint
(example: >= 300 GB/s for H100 + EFA; >= 40 GB/s NET and >= 500 GB/s
NVLS for GB200 + EFA, each sized for a 2-node pair).
Node-shape assumption. These bus-bandwidth floors are fixed absolute values calibrated on full, high-bandwidth nodes (8-GPU H100 NVLink/SXM with multi-NIC transport). They are not normalized for node fabric or GPU count, so a smaller or different-fabric H100 SKU (e.g. a single-GPU-per-node shape) can false-fail a healthy run. Making the NCCL gate fabric/transport-class aware is tracked in #1256.
Expected flow (~5–10 min per variant): readiness pre-flight → deploy
TrainingRuntime + TrainJob in aicr-validation → worker pods reach
Running → run all_reduce_perf → parse peak bus bandwidth → verify the
intended transport actually carried traffic (for -net / -nvls) → compare
to recipe constraint (10 % tolerance) → cleanup.
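The final comparison step can be pictured with a minimal sketch. It assumes the 10 % tolerance relaxes the floor multiplicatively, so a measurement up to 10 % below the recipe constraint still passes. The text states the tolerance but not the exact formula, and the function name `passes_floor` is hypothetical, not part of AICR.

```python
def passes_floor(measured: float, floor: float, tolerance: float = 0.10) -> bool:
    """Return True when `measured` satisfies a `>= floor` constraint.

    Assumption: the tolerance lowers the effective floor to
    floor * (1 - tolerance). The real validator may apply it differently.
    """
    effective_floor = floor * (1.0 - tolerance)
    return measured >= effective_floor


# Floors quoted in the text: 300 GB/s (H100 + EFA), 40 GB/s NET (GB200 + EFA).
print(passes_floor(285.0, 300.0))  # within 10 % of the 300 GB/s floor
print(passes_floor(250.0, 300.0))  # more than 10 % below the floor
```

Under this reading, the tolerance only absorbs small run-to-run variation; it does not compensate for the node-shape mismatch described above.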
A passing CTRF entry:
Note: this guide does not yet list per-platform expected-bandwidth baselines (EKS + EFA, GKE + TCPXO, AKS, etc.). The recipe’s constraint value is the current pass/fail floor; measured values above that floor are treated as passing regardless of platform.
To run deployment validation first (recommended — verifies GPU Operator, DRA driver, and Kubeflow Trainer are installed and healthy before the benchmark):
Inference performance validation
Inference performance runs the inference-perf check — deploys a
DynamoGraphDeployment with a vLLM-served model (Qwen/Qwen3-8B by default,
overridable per accelerator — see below) plus an AIPerf benchmark Job, and
measures end-to-end output-token throughput and time-to-first-token (TTFT) p99.
Warm-up: AIPerf sends a wave of warm-up requests before the measured run,
so vLLM’s one-time CUDA-graph / JIT compilation (tens of seconds on a cold
worker) is excluded from the reported throughput and p99 TTFT — the numbers
reflect steady state, not cold start. Warm-up scales with concurrency and is
tunable via AICR_INFERENCE_PERF_WARMUP_PER_CONCURRENCY (see the
validator reference).
Determinism: the benchmark is driven reproducibly so the verdict reflects the
deployment, not run-to-run RNG — a fixed random seed, fixed input/output token
counts (stddev 0), a pinned synthetic-prompt pool, and greedy decoding
(temperature: 0). Note that throughput (not the latency tail) is the stable,
discriminating signal at high concurrency; TTFT p99 near the saturation knee can
still vary with batching/scheduling, which is why the TTFT constraint is a
generous ceiling rather than a tight target.
The generated recipe includes dynamo-platform in componentRefs and lists
inference-perf under validation.performance.checks with pass/fail
constraints plus benchmark inputs:
Node-shape assumption. The inference-throughput floor is a fixed
absolute value calibrated on a full node (the shared >= 50000
gate was measured on 8-GPU H100; GB200 on a 4-GPU node). It is not
normalized for GPU count, and the evaluator only scales it down for partial
occupancy — not across node sizes — so a smaller H100 SKU (e.g. 1-/2-GPU
shapes such as p5.4xlarge, AKS NC80adis) can false-fail a healthy run.
inference-ttft-p99 is a per-request latency at fixed concurrency-per-GPU
and does not need GPU-count normalization. A normalized per-GPU throughput
floor is tracked in #1254.
inference-model and inference-concurrency-per-gpu are optional: omit them to
use the compiled defaults (Qwen3-8B at 256 concurrent requests per GPU), set them
per overlay to tune model and load for each accelerator, or override globally
with the AICR_INFERENCE_PERF_MODEL / AICR_INFERENCE_PERF_CONCURRENCY_PER_GPU
catalog knobs (recipe wins over catalog env wins over default).
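The precedence rule (recipe, then catalog env, then compiled default) can be sketched as a simple fallback chain. This is an illustration only: the function name and the flat-mapping shape of the inputs are assumptions, while the key names, knob names, and defaults are taken from the text above.

```python
from typing import Mapping, Tuple

# Compiled defaults named in the text.
DEFAULT_MODEL = "Qwen/Qwen3-8B"
DEFAULT_CONCURRENCY_PER_GPU = 256


def resolve_inference_inputs(
    recipe_inputs: Mapping[str, object],
    catalog_env: Mapping[str, str],
) -> Tuple[str, int]:
    """Resolve model and per-GPU concurrency: recipe > catalog env > default.

    `recipe_inputs` is a hypothetical flat view of the check's benchmark
    inputs; `catalog_env` is the env block of the inference-perf catalog entry.
    """
    model = (
        recipe_inputs.get("inference-model")
        or catalog_env.get("AICR_INFERENCE_PERF_MODEL")
        or DEFAULT_MODEL
    )
    concurrency = (
        recipe_inputs.get("inference-concurrency-per-gpu")
        or catalog_env.get("AICR_INFERENCE_PERF_CONCURRENCY_PER_GPU")
        or DEFAULT_CONCURRENCY_PER_GPU
    )
    return str(model), int(concurrency)


# The recipe sets concurrency only; the catalog sets both (placeholder values).
print(resolve_inference_inputs(
    {"inference-concurrency-per-gpu": 128},
    {"AICR_INFERENCE_PERF_MODEL": "example-org/example-model",
     "AICR_INFERENCE_PERF_CONCURRENCY_PER_GPU": "512"},
))
```

In this example the recipe's concurrency wins over the catalog value, while the model falls through to the catalog because the recipe omits it.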
inference-routing-mode selects the Dynamo 1.2 Kubernetes routing path. The
default dynamo-router mode deploys a Dynamo frontend with KV-cache-aware
routing (DYN_ROUTER_MODE=kv). Normal frontend-to-worker request/response
traffic uses Dynamo’s request plane (Dynamo 1.2 defaults to TCP); AICR does not
set DYN_REQUEST_PLANE=nats. Workers publish local vLLM KV-cache events with
the vLLM ZMQ publisher and the Dynamo worker runtime relays those events onto
the NATS-backed event plane for the router to consume. Set it to gateway-epp
to exercise GAIE/EPP: the validator deploys an EPP component, worker frontend
sidecars in direct mode, and an HTTPRoute through the AICR-managed inference
gateway. The direct-mode sidecars honor EPP routing headers; they are not the
ZMQ-to-NATS relay.
Model-weights cache and AICR_INFERENCE_PERF_MODEL_CACHE_STORAGE_CLASS. The benchmark downloads
the model once into a PVC and serves all workers from it (on by default;
avoids per-IP Hugging Face throttling). The cache PVC needs a StorageClass: it
uses the cluster’s default StorageClass unless you set
AICR_INFERENCE_PERF_MODEL_CACHE_STORAGE_CLASS. On a cluster with no default
StorageClass (common on EKS — e.g. only a non-default gp2) and no value set,
the check fails fast in seconds with guidance rather than hanging; set
AICR_INFERENCE_PERF_MODEL_CACHE_STORAGE_CLASS=<name> (e.g. gp2/gp3 on EKS,
standard-rwo on GKE) on the inference-perf catalog entry’s env (or via a
catalog overlay in the aicr validate --data <dir> directory), or disable the cache with
AICR_INFERENCE_PERF_MODEL_CACHE_SIZE=off. Like the other
AICR_INFERENCE_PERF_* knobs, this is a catalog/--data setting — it is
not read from the shell environment of the process running aicr validate
(only HF_TOKEN is). AICR-deployed EKS clusters get a default gp3 StorageClass
from the aws-ebs-csi-driver component, so the cache works there with no knob.
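The decision described above (cache off, explicit StorageClass, cluster default, or fail fast) can be sketched as follows. The function name, its arguments, and the error text are hypothetical; only the knob semantics come from the text.

```python
from typing import Optional, Tuple


def resolve_model_cache(
    cache_size: Optional[str],
    storage_class_knob: Optional[str],
    cluster_default_sc: Optional[str],
) -> Tuple[bool, Optional[str]]:
    """Return (cache_enabled, storage_class) for the model-weights PVC.

    cache_size         -> AICR_INFERENCE_PERF_MODEL_CACHE_SIZE ("off" disables)
    storage_class_knob -> AICR_INFERENCE_PERF_MODEL_CACHE_STORAGE_CLASS
    cluster_default_sc -> name of the cluster's default StorageClass, if any
    """
    if cache_size == "off":
        return False, None
    if storage_class_knob:
        return True, storage_class_knob
    if cluster_default_sc:
        return True, cluster_default_sc
    # Mirrors the fail-fast behavior; the message wording is illustrative.
    raise RuntimeError(
        "no default StorageClass and no explicit StorageClass set for the model cache"
    )


print(resolve_model_cache(None, "gp3", None))   # explicit knob is used
print(resolve_model_cache("off", None, None))   # cache disabled
```

The explicit knob is checked before the cluster default, which matches the "unless you set" wording in the text.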
Debugging a failed run with AICR_INFERENCE_PERF_NO_CLEANUP. By default the
validator deletes the per-run namespace (DGD, workers, frontend, AIPerf Job) on
both success and failure. To investigate a failure (e.g. a "timed out waiting for inference endpoint to serve requests" error), set AICR_INFERENCE_PERF_NO_CLEANUP=1
and the validator leaves everything in place so you can kubectl logs the
frontend/workers and curl /v1/models and /v1/chat/completions live. Unlike the
other AICR_INFERENCE_PERF_* knobs, this one is read from the shell
environment of the process running aicr validate (forwarded to the
inference-perf pod, like HF_TOKEN), not from the catalog. Debug-only: you must
delete the aicr-inference-perf-<suffix> namespace manually afterward, or it
keeps GPU workers running.
Expected flow (~5–7 min on H100): readiness pre-flight → deploy
ResourceClaimTemplate + DynamoGraphDeployment in a per-run namespace
aicr-inference-perf-<8-hex-suffix> → wait for state=successful (image pull
+ model load) → /health probe → AIPerf benchmark Job parses throughput + TTFT p99 → compare to recipe constraints (10 % tolerance) → cleanup.
All Dynamo Frontend and worker pods pin to a single GPU node via
kubernetes.io/hostname for a stable per-node baseline. On a shared cluster
where some GPUs on a candidate node are already held by another workload’s
DRA ResourceClaim, the validator picks the candidate with the most free
GPUs and sizes the benchmark to that count — so the check does not need an
explicit hostname override to avoid saturated nodes. The inference-throughput
gate is a full-node baseline, so when the benchmark runs on fewer than the
node’s full GPU count the gate is scaled down by the same freeGPUs / nodeGPUs
fraction (throughput scales ~linearly at fixed concurrency-per-GPU) — a healthy
per-GPU result on a partially occupied node is not failed against a full-node
number. TTFT p99 is a per-request latency and is not scaled. Concurrent
aicr validate invocations are isolated from each other by the run-specific
suffix on both the namespace and the inner AIPerf Job name.
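Node selection and gate scaling can be illustrated with a short sketch. The `Candidate` type and both function names are hypothetical; the selection rule (most free GPUs) and the freeGPUs / nodeGPUs scaling come from the text, and the 50000 floor is the shared gate value quoted earlier.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    hostname: str
    node_gpus: int  # total GPUs on the node
    free_gpus: int  # GPUs not held by an existing DRA ResourceClaim


def pick_node(candidates: List[Candidate]) -> Candidate:
    """Pick the candidate with the most free GPUs (first one wins on ties)."""
    usable = [c for c in candidates if c.free_gpus > 0]
    if not usable:
        raise RuntimeError("no candidate GPU node has free GPUs")
    return max(usable, key=lambda c: c.free_gpus)


def scaled_throughput_floor(full_node_floor: float, node: Candidate) -> float:
    """Scale the full-node throughput gate by freeGPUs / nodeGPUs."""
    return full_node_floor * node.free_gpus / node.node_gpus


nodes = [Candidate("node-a", 8, 2), Candidate("node-b", 8, 6)]
chosen = pick_node(nodes)
print(chosen.hostname, scaled_throughput_floor(50000, chosen))
```

Only the throughput gate is scaled here; the TTFT p99 ceiling would be compared unchanged.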
A passing CTRF entry (measured on EKS H100, 8 × H100 GPUs, Qwen/Qwen3-8B at 256 concurrency/GPU):
The RESULT: prefix on the first two lines is the contract documented in
pkg/validator/validator.go — any check that wants its summary lines echoed
to the CLI’s own output (not just the CTRF report) opts in by emitting that
prefix. The validator runtime strips the prefix when echoing; the full
prefixed line stays in stdout[].
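The echo contract can be pictured with a minimal sketch. It assumes the prefix sits at the start of the line and that whitespace after it is trimmed; the function name is hypothetical and the authoritative behavior is whatever pkg/validator/validator.go implements.

```python
from typing import List

RESULT_PREFIX = "RESULT:"


def lines_to_echo(stdout_lines: List[str]) -> List[str]:
    """Return the opted-in summary lines with the RESULT: prefix stripped.

    The input list is left untouched, mirroring how the full prefixed
    line stays in the CTRF stdout[] array.
    """
    echoed = []
    for line in stdout_lines:
        if line.startswith(RESULT_PREFIX):
            echoed.append(line[len(RESULT_PREFIX):].lstrip())
    return echoed


# Placeholder lines, not real validator output.
sample = ["RESULT: first summary line", "debug detail", "RESULT: second summary line"]
print(lines_to_echo(sample))
```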
To run deployment validation first (recommended — verifies GPU Operator, DRA driver, Dynamo operator, KAI scheduler, and supporting components are installed and healthy):
Skip scenarios
The inference validator has three explicit skip guards so it never runs where
it can’t succeed. Each produces a status: skipped CTRF entry with a specific
reason. Skipped checks are not failures: the validator container exits
with code 2 internally (mapped to CTRF skipped), but aicr validate itself
exits 0 for skipped/passed/other phases — a skipped inference check never
drives a non-zero CLI exit on its own.
Guards fire before any cluster mutation, so skips are cheap (typically < 10 s).
Running all phases
Phases run sequentially. By default all phases run and produce results
regardless of earlier failures. Pass --fail-fast to stop after the first
phase that fails (e.g., to skip a 65-minute inference-perf run when deployment
already failed).
Scoping CNCF submission evidence to specific features
The --feature flag scopes which CNCF AI conformance features get behavioral
evidence collected. It only applies to the CNCF-submission evidence collector
and is rejected by the CLI unless --cncf-submission is also set (which in
turn requires --evidence-dir). It does not scope the regular
--phase conformance validator run — that one always evaluates every check
defined in the recipe.
Empty --feature (the default) collects evidence for every feature.
Valid feature names (from pkg/evidence/cncf/collector.go):
Emitting recipe evidence
When a recipe PR targets hardware AICR maintainers cannot independently
re-run, the contributor needs to attach a signed evidence bundle so a
maintainer can verify the recipe offline. aicr validate produces the
bundle as a side effect when --emit-attestation is set; adding --push
signs it (cosign keyless via Sigstore) and uploads it to an OCI registry.
This is a different artifact from the CNCF-submission evidence above —
the two flag families produce independent outputs and may run from a
single aicr validate invocation.
The --push tag is just a human-readable label — the sha256: digest is
what pins the bundle, so tag choice never affects verification (the verifier
pulls by digest). Omit the tag, as above, and aicr derives a unique
per-recipe one, <recipe-slug>-<short-fingerprint> (e.g.
ghcr.io/<owner>/aicr-evidence:h100-eks-ubuntu-training-3f9a1c2b4d5e), so
distinct attestations never collide on a shared tag. Pass an explicit tag to
override.
After the command finishes:
Commit pointer.yaml to recipes/evidence/<recipe>.yaml; the bundle
itself lives in OCI. Then self-verify before opening the PR — the same
verifier runs against the committed pointer in the CI gate, so exit 0
locally means the gate will pass:
Flag reference:
Registry requirements: the registry must support the OCI 1.1 Referrers API (or its tag-schema fallback) so the Sigstore Bundle can be attached to the artifact. Known-good registries: GHCR, GitLab Container Registry, Harbor (≥ 2.8), AWS ECR, Google Artifact Registry, Azure Container Registry, JFrog Artifactory. Without referrer support the bundle pushes but the signature is not discoverable, and the verifier records signature-verify as “skipped (unsigned)” even on a signed bundle.
OIDC token resolution. --push resolves an identity token through
this precedence chain: --identity-token (or COSIGN_IDENTITY_TOKEN)
→ ambient GitHub Actions OIDC (ACTIONS_ID_TOKEN_REQUEST_URL
present) → --oidc-device-flow (or AICR_OIDC_DEVICE_FLOW=true) →
interactive browser. CI pipelines typically rely on the ambient
GitHub Actions path; local workstations get the browser flow.
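The precedence chain can be written out as a small decision function. This is a sketch of the ordering only: the function name and the returned labels are hypothetical, while the flags and environment variable names are those listed above.

```python
from typing import Mapping, Optional


def pick_token_source(
    env: Mapping[str, str],
    identity_token_flag: Optional[str] = None,
    device_flow_flag: bool = False,
) -> str:
    """Return which OIDC path --push would use, in documented precedence order."""
    # 1. Explicit token: --identity-token or COSIGN_IDENTITY_TOKEN.
    if identity_token_flag or env.get("COSIGN_IDENTITY_TOKEN"):
        return "identity-token"
    # 2. Ambient GitHub Actions OIDC.
    if env.get("ACTIONS_ID_TOKEN_REQUEST_URL"):
        return "github-actions-oidc"
    # 3. Device flow: --oidc-device-flow or AICR_OIDC_DEVICE_FLOW=true.
    if device_flow_flag or env.get("AICR_OIDC_DEVICE_FLOW") == "true":
        return "device-flow"
    # 4. Fallback: interactive browser.
    return "interactive-browser"


print(pick_token_source({"ACTIONS_ID_TOKEN_REQUEST_URL": "https://example.invalid"}))
print(pick_token_source({}))
```

To inspect a real environment, pass `os.environ` as `env`.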
Local-only mode (no registry access). Omitting --push still
produces a complete bundle on disk — the verifier records the
signature step as “skipped (unsigned)” and the manifest-hash chain
becomes self-consistency only. Useful for catching accidental
corruption during development, but unsuitable for the CI gate, which
requires a signed bundle bound to a pointer.
For the full producer-and-consumer walkthrough — including OCI-only
verification, the tamper demo, and JSON output for CI gates — see
Recipe Evidence Demo.
For the bundle format and verifier semantics, see
ADR-007.
For the maintainer-side review checklist, see
Maintaining Recipe Contributions.
For the per-flag reference on aicr evidence verify, see
CLI reference.
Input modes
Snapshot and recipe can come from a file, an HTTPS URL, or a Kubernetes ConfigMap:
The ConfigMap form is useful when the snapshot is captured by an in-cluster agent — see agent deployment.
Dry-run mode
--no-cluster runs the validator against the snapshot alone, skipping all
Kubernetes API calls. Declarative constraints still evaluate; behavioral checks
report skipped - no-cluster mode (test mode).
Useful for CI pipelines that validate a recipe against a captured snapshot without needing cluster access.
CI/CD integration
aicr validate exits non-zero when any phase fails. CTRF JSON is emitted to
stdout (or to --output <file>), so a pipeline can gate promotion on both the
exit code and the structured report:
Exit codes follow Unix conventions and are derived from the CLI’s structured
error codes (see pkg/errors/exitcode.go):
Important: two quirks to be aware of when gating a pipeline on exit code:
- Only phase status failed drives a non-zero exit. A phase whose status is other (check crashed, pod OOM, activeDeadlineSeconds exceeded) still produces exit 0. Pipelines that need to catch those outcomes must inspect the CTRF report and look at per-phase status or the summary.other count, not rely on exit code alone.
- Exit 5 is narrower than it sounds. A timeout inside a check’s own logic (DynamoGraphDeployment not ready, inference endpoint never healthy, AIPerf Job pod-wait deadline) surfaces as a failed phase, not as a structured ErrCodeTimeout, so the CLI exits 8. Only timeouts at the CLI-to-cluster layer (snapshot-agent wait, validator-Job wait) retain their ErrCodeTimeout classification all the way through to exit 5.
Scripts that gate on validation outcome should treat any non-zero code as
failure rather than branching on specific values, and should additionally
check CTRF summary.failed and summary.other for a complete picture.
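A gate that follows this advice might look like the sketch below. It assumes the report nests its summary under a top-level `results` object, as in the published CTRF schema, and falls back to a top-level `summary` otherwise; the function name is hypothetical.

```python
import json


def validation_ok(exit_code: int, ctrf_text: str) -> bool:
    """Pass only when the CLI exited 0 and no check is failed or other."""
    report = json.loads(ctrf_text)
    # Assumption: CTRF places the summary under "results"; tolerate a flat layout too.
    summary = report.get("results", report).get("summary", {})
    failed = int(summary.get("failed", 0))
    other = int(summary.get("other", 0))
    return exit_code == 0 and failed == 0 and other == 0


# Exit code 0 with one "other" outcome (e.g. a crashed check) still fails the gate.
sample = json.dumps({"results": {"summary": {"passed": 3, "failed": 0, "other": 1}}})
print(validation_ok(0, sample))
```

In a pipeline, `ctrf_text` would be the contents of the file written via --output, and `exit_code` the status returned by aicr validate.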
For informational-only runs (report results without failing the build):
Troubleshooting
Readiness pre-flight fails
The CLI logs each readiness constraint comparison before any phase runs:
Fix: upgrade the cluster, or pick a recipe whose readiness constraints match the cluster’s actual versions.
Non-standard GPU labels or taints
Default GPU-node discovery looks for nodeGroup, node.kubernetes.io/instance-type, or GPU-related label substrings. If your cluster uses custom labels, override the scheduling of inner workloads with --node-selector and --toleration:
These flags affect the inner benchmark pods that run on GPU nodes (NCCL workers, Dynamo workers), not the validator orchestrator Job itself. For inference-perf specifically, --node-selector narrows the pool of candidate GPU nodes — the validator then picks the candidate with the most free GPUs (after accounting for in-use DRA allocations) and pins all Dynamo Frontend + worker pods to that node via kubernetes.io/hostname. The AIPerf benchmark runner pod is CPU-only, uses a tolerate-all / no-nodeSelector pod spec, and is unaffected by these flags.
A check reports skipped unexpectedly
Skips are always deliberate and always carry a reason, but the location of the reason in the CTRF entry depends on how the skip happened:
- Check-level skips (the CheckFunc ran and returned validators.Skip(reason), e.g. Guards A/B/C on inference, or --no-cluster from inside a check): the reason appears in stdout as level=INFO msg=SKIP reason="…".
- Phase-level skips (the CheckFunc never ran, e.g. with --fail-fast a prior phase failed so subsequent phases synthesize skip entries; also --no-cluster for checks that the runner marks skipped before dispatch): the reason appears in message, not stdout.
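A helper that looks in both places could be sketched as follows. It assumes each CTRF test entry is a dict with an optional `stdout` list and an optional `message` string, and that check-level skips use the log format quoted above; the function name is hypothetical.

```python
import re
from typing import Optional

# Matches the check-level skip log line: level=INFO msg=SKIP reason="..."
_SKIP_RE = re.compile(r'msg=SKIP reason="([^"]*)"')


def skip_reason(entry: dict) -> Optional[str]:
    """Return the skip reason from stdout (check-level) or message (phase-level)."""
    for line in entry.get("stdout") or []:
        match = _SKIP_RE.search(line)
        if match:
            return match.group(1)
    return entry.get("message") or None


# Placeholder entries, not real validator output.
check_level = {"status": "skipped", "stdout": ['level=INFO msg=SKIP reason="example reason"']}
phase_level = {"status": "skipped", "message": "example phase-level reason"}
print(skip_reason(check_level))
print(skip_reason(phase_level))
```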
Common reasons and their cause:
ai-service-metrics fails with “Prometheus unreachable”
On EKS clusters that split worker and system pods across separate security
groups (e.g. DGXC EKS with distinct customer/system ENI subnets), the
conformance check ai-service-metrics can fail non-deterministically with:
The validator orchestrator Job tolerates every taint and sets a preferred
dependencyAffinity toward Prometheus, so the scheduler co-locates it with the
Prometheus pod when possible. The preference is best-effort, so on fallback it
can still land on any worker node — including one whose ENI is in a security
group whose ingress to the Prometheus-hosting SG is missing or asymmetric. On
such a fallback the outcome is not stable across re-runs: image-locality
scoring tends to keep the pod on whatever node won the first scheduling
decision, so a passing run on a fresh cluster does not prove the SG topology is
correct.
This is a cluster-side prerequisite, not an AICR bug per se — see
EKS Dynamo Networking Prerequisites
for the SG ingress rules required for Prometheus (tcp/9090). The preferred
dependencyAffinity (#933,
resolved) makes a bad placement far less likely, but the 9090 SG rule remains
the reliable guarantee since the affinity is best-effort.
Workaround when SG changes are not available: re-run the check until the orchestrator lands on a node whose SG can reach Prometheus, then leave the image cached there so image-locality keeps subsequent runs on the same node. This is unreliable and should not be used as the steady-state validation strategy.
Benchmark Job stuck or timed out
Each performance check has a Job-level activeDeadlineSeconds set by the catalog’s timeout:. For inference-perf, the full pipeline (workload ready → endpoint health → benchmark) can take up to 30 min on cold-start clusters. If it still times out:
Common causes: image pull throttling, vLLM model load slowness, and every
candidate GPU node being fully saturated by existing DRA (ResourceClaim)
allocations. In the saturated case the validator fails fast with a message
like no candidate GPU node has free GPUs — all N matched node(s) are saturated by existing DRA ResourceClaim allocations; the fix is to free
GPUs on one of the candidate nodes, or to pass
--node-selector kubernetes.io/hostname=<node> to target a specific node
you know is free. On clusters where the DRA API is not installed or the
validator’s service account cannot list resourceclaims, the check falls
back to sizing purely from Status.Allocatable["nvidia.com/gpu"] — which
does not account for in-use DRA devices and can leave the benchmark
Pending until timeout on a partially-occupied node.
Related
- CLI reference: aicr validate (full flag reference and per-command examples)
- CLI reference: aicr snapshot (snapshot capture options)
- CLI reference: aicr recipe (recipe generation flags)
- Agent deployment (capture snapshots via an in-cluster Job)
- Data flow: Stage 3 Validate (how the validator engine is built)
- Validator Development Guide (add a new validator; contributor-facing)