Validating a Cluster
Task-oriented walkthrough for running aicr validate against a GPU cluster — from
capturing a snapshot through interpreting results. Covers both training and
inference workloads and all three validation phases (deployment, performance,
conformance).
For per-flag reference, see CLI reference: aicr validate. For the architectural view of how snapshot + recipe flow into the validator, see Data flow: Stage 3 Validate.
When to validate
Readiness pre-flight constraints (K8s version, OS, kernel) run implicitly before any phase. If pre-flight fails, no validator Jobs are deployed.
The workflow
- Snapshot — capture current cluster state (K8s / OS / GPU / topology) once.
- Recipe — generate the target configuration for your workload (training vs inference, platform, accelerator).
- Validate — run one or all phases against the snapshot and live cluster.
Prerequisites
- aicr CLI installed (see installation).
- kubectl configured for the target cluster (validator dispatches K8s Jobs; pre-flight only needs the snapshot).
- Cluster service account with RBAC to create Jobs, ConfigMaps, and read cluster state (AICR creates its own aicr-validation namespace on first run).
Training performance validation
Training performance runs an NCCL all-reduce benchmark — a Kubeflow TrainJob
that runs all_reduce_perf across GPU nodes and measures aggregate bus
bandwidth. Three check variants are available; the recipe picks the one (or
ones) that match the target fabric:
GB200/EKS recipes (both training and inference intents) enable -net and
-nvls together rather than the auto-detect variant, because those nodes
expose two inter-node fabrics simultaneously and a single auto-detect test
would only exercise one of them.
The generated recipe lists the selected variant(s) under
validation.performance.checks with a platform-tuned bandwidth constraint
(example: >= 300 GB/s for H100 + EFA; >= 40 GB/s NET and >= 500 GB/s
NVLS for GB200 + EFA, each sized for a 2-node pair).
Expected flow (~5–10 min per variant): readiness pre-flight → deploy
TrainingRuntime + TrainJob in aicr-validation → worker pods reach
Running → run all_reduce_perf → parse peak bus bandwidth → verify the
intended transport actually carried traffic (for -net / -nvls) → compare
to recipe constraint (10 % tolerance) → cleanup.
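As a rough illustration of the final comparison step, the sketch below treats the 10 % tolerance as a relaxation of the recipe floor, so a measured value passes when it reaches at least 90 % of the constraint. This guide does not spell out how AICR applies the tolerance internally, so both the formula and the function name meets_floor are assumptions.

```python
def meets_floor(measured_gbps: float, floor_gbps: float, tolerance: float = 0.10) -> bool:
    """Return True when the measured bandwidth clears the tolerance-adjusted floor.

    Assumption: the tolerance lowers the floor to floor * (1 - tolerance).
    """
    if not 0.0 <= tolerance < 1.0:
        raise ValueError("tolerance must be in [0, 1)")
    effective_floor = floor_gbps * (1.0 - tolerance)
    return measured_gbps >= effective_floor


if __name__ == "__main__":
    # With the ">= 300 GB/s" example constraint, the effective floor is about 270 GB/s.
    print(meets_floor(285.0, 300.0))  # above the relaxed floor
    print(meets_floor(250.0, 300.0))  # below the relaxed floor
```

Under this reading, a run that measures 285 GB/s against a 300 GB/s constraint would pass, while 250 GB/s would fail.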
A passing CTRF entry:
Note: this guide does not yet list per-platform expected-bandwidth baselines (EKS + EFA, GKE + TCPXO, AKS, etc.). The recipe’s constraint value is the current pass/fail floor; measured values above that floor are treated as passing regardless of platform.
To run deployment validation first (recommended — verifies GPU Operator, DRA driver, and Kubeflow Trainer are installed and healthy before the benchmark):
Inference performance validation
Inference performance runs the inference-perf check — deploys a
DynamoGraphDeployment with a small vLLM-served model (Qwen/Qwen3-0.6B by
default) plus an AIPerf benchmark Job, and measures end-to-end output-token
throughput and time-to-first-token (TTFT) p99.
The generated recipe includes dynamo-platform in componentRefs and lists
inference-perf under validation.performance.checks with two constraints
— one per metric the check produces:
Expected flow (~5–7 min on H100): readiness pre-flight → deploy
ResourceClaimTemplate + DynamoGraphDeployment in a per-run namespace
aicr-inference-perf-<8-hex-suffix> → wait for state=successful (image
pull + model load) → /health probe → AIPerf benchmark Job parses
throughput + TTFT p99 → compare to recipe constraints (10 % tolerance) →
cleanup.
All Dynamo Frontend and worker pods pin to a single GPU node via
kubernetes.io/hostname for a stable per-node baseline. On a shared cluster
where some GPUs on a candidate node are already held by another workload’s
DRA ResourceClaim, the validator picks the candidate with the most free
GPUs and sizes the benchmark to that count — so the check does not need an
explicit hostname override to avoid saturated nodes. Concurrent
aicr validate invocations are isolated from each other by the run-specific
suffix on both the namespace and the inner AIPerf Job name.
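The node-selection rule described above can be sketched as follows. This is a simplified model, not the validator's code: the function name pick_gpu_node, the dictionary inputs, the tie-break by node name, and the wording of the error message are all assumptions made for illustration. Passing no DRA usage data models the fallback that sizes purely from allocatable GPU counts.

```python
from typing import Dict, Optional, Tuple


def pick_gpu_node(
    allocatable: Dict[str, int],
    dra_in_use: Optional[Dict[str, int]] = None,
) -> Tuple[str, int]:
    """Pick the candidate node with the most free GPUs.

    allocatable: node name -> allocatable GPU count.
    dra_in_use:  node name -> GPUs held by existing DRA ResourceClaims,
                 or None when that information is unavailable.
    Returns (node name, GPU count to size the benchmark to).
    """
    if not allocatable:
        raise ValueError("no candidate GPU nodes")
    in_use = dra_in_use or {}
    free = {node: max(total - in_use.get(node, 0), 0) for node, total in allocatable.items()}
    # Assumption: ties are broken by node name so the choice is deterministic.
    best = min(free, key=lambda node: (-free[node], node))
    if free[best] == 0:
        # Paraphrased message; the real validator's wording may differ.
        raise RuntimeError(f"no candidate GPU node has free GPUs ({len(free)} matched)")
    return best, free[best]


if __name__ == "__main__":
    # Hypothetical node names and counts.
    print(pick_gpu_node({"node-a": 8, "node-b": 8}, {"node-a": 6, "node-b": 2}))
```

In the example, node-b has six free GPUs against two on node-a, so the benchmark would be pinned to node-b and sized to six GPUs.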
A passing CTRF entry (measured on EKS H100, 8 × H100 GPUs, Qwen/Qwen3-0.6B):
The RESULT: prefix on the first two lines is the contract documented in
pkg/validator/validator.go — any check that wants its summary lines echoed
to the CLI’s own output (not just the CTRF report) opts in by emitting that
prefix. The validator runtime strips the prefix when echoing; the full
prefixed line stays in stdout[].
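A minimal sketch of that echo behavior, assuming the runtime filters on a literal RESULT: prefix and trims the whitespace that follows it. The function name lines_to_echo and the sample lines are hypothetical; the authoritative contract is the one in pkg/validator/validator.go.

```python
RESULT_PREFIX = "RESULT:"


def lines_to_echo(stdout_lines):
    """Return the summary lines a CLI would echo, with the RESULT: prefix removed."""
    echoed = []
    for line in stdout_lines:
        if line.startswith(RESULT_PREFIX):
            # Assumption: whitespace after the prefix is trimmed when echoing.
            echoed.append(line[len(RESULT_PREFIX):].strip())
    return echoed


if __name__ == "__main__":
    # Made-up check output; only the first line opts in to being echoed.
    sample = ["RESULT: example summary line", "level=INFO msg=example detail"]
    for line in lines_to_echo(sample):
        print(line)
```

The input list is left untouched, which mirrors the statement that the full prefixed line stays in stdout[].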
To run deployment validation first (recommended — verifies GPU Operator, DRA driver, Dynamo operator, KAI scheduler, and supporting components are installed and healthy):
Skip scenarios
The inference validator has three explicit skip guards so it never runs where
it can’t succeed. Each produces a status: skipped CTRF entry with a specific
reason. Skipped checks are not failures: the validator container exits
with code 2 internally (mapped to CTRF skipped), but aicr validate itself
exits 0 for skipped/passed/other phases — a skipped inference check never
drives a non-zero CLI exit on its own.
Guards fire before any cluster mutation, so skips are cheap (typically < 10 s).
Running all phases
Phases run sequentially. If any phase fails, subsequent phases are skipped.
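The short-circuit rule can be modeled in a few lines. This is an illustrative sketch only: the phase order is taken from the order in which this guide lists the phases, and run_phases and its callable-based interface are hypothetical, not part of AICR.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: phases run in the order this guide lists them.
PHASE_ORDER = ["deployment", "performance", "conformance"]


def run_phases(runners: Dict[str, Callable[[], bool]]) -> List[Tuple[str, str]]:
    """Run phases in order; once one fails, mark every later phase as skipped."""
    results = []
    failed = False
    for phase in PHASE_ORDER:
        if phase not in runners:
            continue  # phase was not requested
        if failed:
            results.append((phase, "skipped"))
            continue
        ok = runners[phase]()
        results.append((phase, "passed" if ok else "failed"))
        failed = not ok
    return results


if __name__ == "__main__":
    outcome = run_phases({
        "deployment": lambda: True,
        "performance": lambda: False,
        "conformance": lambda: True,
    })
    print(outcome)
```

Here the conformance runner is never called, because the performance phase failed before it.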
Scoping CNCF submission evidence to specific features
The --feature flag scopes which CNCF AI conformance features get behavioral
evidence collected. It only applies to the CNCF-submission evidence collector
and is rejected by the CLI unless --cncf-submission is also set (which in
turn requires --evidence-dir). It does not scope the regular
--phase conformance validator run — that one always evaluates every check
defined in the recipe.
Empty --feature (the default) collects evidence for every feature.
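The flag dependencies above form a small chain that is easy to misread, so here is a sketch of the same rules as a validation function. The function name and error strings are hypothetical and the feature name in the example is a placeholder; only the dependency rules themselves come from this guide.

```python
from typing import List, Optional


def check_feature_flags(
    features: List[str],
    cncf_submission: bool,
    evidence_dir: Optional[str],
) -> Optional[str]:
    """Return an error message for an invalid flag combination, or None if valid."""
    if features and not cncf_submission:
        return "--feature requires --cncf-submission"
    if cncf_submission and not evidence_dir:
        return "--cncf-submission requires --evidence-dir"
    return None


if __name__ == "__main__":
    # "example-feature" is a placeholder, not a real feature name.
    print(check_feature_flags(["example-feature"], False, None))
    print(check_feature_flags(["example-feature"], True, "./evidence"))
    print(check_feature_flags([], True, "./evidence"))  # empty list: every feature
```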
Valid feature names (from pkg/evidence/cncf/collector.go):
Input modes
Snapshot and recipe can come from a file, an HTTPS URL, or a Kubernetes ConfigMap:
The ConfigMap form is useful when the snapshot is captured by an in-cluster agent — see agent deployment.
Dry-run mode
--no-cluster runs the validator against the snapshot alone, skipping all
Kubernetes API calls. Declarative constraints still evaluate; behavioral checks
report skipped - no-cluster mode (test mode).
Useful for CI pipelines that validate a recipe against a captured snapshot without needing cluster access.
CI/CD integration
aicr validate exits non-zero when any phase fails. CTRF JSON is emitted to
stdout (or to --output <file>), so a pipeline can gate promotion on both the
exit code and the structured report:
Exit codes follow Unix conventions and are derived from the CLI’s structured
error codes (see pkg/errors/exitcode.go):
Important: two quirks to be aware of when gating a pipeline on exit code:
- Only phase status failed drives a non-zero exit. A phase whose status is other (check crashed, pod OOM, activeDeadlineSeconds exceeded) still produces exit 0. Pipelines that need to catch those outcomes must inspect the CTRF report and look at per-phase status or the summary.other count, not rely on exit code alone.
- Exit 5 is narrower than it sounds. A timeout inside a check’s own logic (DynamoGraphDeployment not ready, inference endpoint never healthy, AIPerf Job pod-wait deadline) surfaces as a failed phase, not as a structured ErrCodeTimeout, so the CLI exits 8. Only timeouts at the CLI-to-cluster layer (snapshot-agent wait, validator-Job wait) retain their ErrCodeTimeout classification all the way through to exit 5.
Scripts that gate on validation outcome should treat any non-zero code as
failure rather than branching on specific values, and should additionally
check CTRF summary.failed and summary.other for a complete picture.
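One way to act on this advice is a small gate that reads the report and fails when either counter is non-zero. The sketch below assumes the report follows the public CTRF layout, with the counters under results.summary; the function name ctrf_gate and the sample report are hypothetical, and the actual nesting of AICR's report should be checked against a real run.

```python
import json


def ctrf_gate(report: dict) -> int:
    """Return 0 when the report has no failed or other tests, else 1."""
    summary = report.get("results", {}).get("summary", {})
    failed = int(summary.get("failed", 0))
    other = int(summary.get("other", 0))
    return 0 if failed == 0 and other == 0 else 1


if __name__ == "__main__":
    # Hypothetical report in which one check ended with status "other".
    sample = json.loads(
        '{"results": {"summary": '
        '{"tests": 3, "passed": 2, "failed": 0, "skipped": 0, "other": 1}}}'
    )
    print(ctrf_gate(sample))
```

In a pipeline, the same function could be applied to the file written via --output, with the returned value combined with the CLI's own exit code so that either signal blocks promotion.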
For informational-only runs (report results without failing the build):
Troubleshooting
Readiness pre-flight fails
The CLI logs each readiness constraint comparison before any phase runs:
Fix: upgrade the cluster, or pick a recipe whose readiness constraints match the cluster’s actual versions.
Non-standard GPU labels or taints
Default GPU-node discovery looks for nodeGroup, node.kubernetes.io/instance-type, or GPU-related label substrings. If your cluster uses custom labels, override the scheduling of inner workloads with --node-selector and --toleration:
These flags affect the inner benchmark pods that run on GPU nodes (NCCL workers, Dynamo workers), not the validator orchestrator Job itself. For inference-perf specifically, --node-selector narrows the pool of candidate GPU nodes — the validator then picks the candidate with the most free GPUs (after accounting for in-use DRA allocations) and pins all Dynamo Frontend + worker pods to that node via kubernetes.io/hostname. The AIPerf benchmark runner pod is CPU-only, uses a tolerate-all / no-nodeSelector pod spec, and is unaffected by these flags.
A check reports skipped unexpectedly
Skips are always deliberate and always carry a reason, but the location of the reason in the CTRF entry depends on how the skip happened:
- Check-level skips (the CheckFunc ran and returned validators.Skip(reason), e.g., Guards A/B/C on inference, or --no-cluster from inside a check): the reason appears in stdout as level=INFO msg=SKIP reason="…".
- Phase-level skips (the CheckFunc never ran, e.g., a prior phase failed, so subsequent phases synthesize skip entries; also --no-cluster for checks that the runner marks skipped before dispatch): the reason appears in message, not stdout.
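Because the reason can live in either field, a report-reading script has to check both. The sketch below assumes each CTRF test entry is a dictionary with status, message, and a stdout list of strings, as in the public CTRF schema; the helper name skip_reason and the sample reason text are hypothetical.

```python
import re
from typing import Optional

# Matches the log form shown above: level=INFO msg=SKIP reason="..."
SKIP_REASON = re.compile(r'SKIP reason="([^"]*)"')


def skip_reason(entry: dict) -> Optional[str]:
    """Return the skip reason for a CTRF test entry, or None if it has none."""
    if entry.get("status") != "skipped":
        return None
    # Check-level skip: the reason is logged in stdout.
    for line in entry.get("stdout") or []:
        match = SKIP_REASON.search(line)
        if match:
            return match.group(1)
    # Phase-level skip: the reason is carried in message.
    return entry.get("message") or None


if __name__ == "__main__":
    # Made-up entries for illustration.
    check_level = {"status": "skipped", "stdout": ['level=INFO msg=SKIP reason="example reason"']}
    phase_level = {"status": "skipped", "message": "example phase-level reason"}
    print(skip_reason(check_level))
    print(skip_reason(phase_level))
```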
Common reasons and their cause:
ai-service-metrics fails with “Prometheus unreachable”
On EKS clusters that split worker and system pods across separate security
groups (e.g. DGXC EKS with distinct customer/system ENI subnets), the
conformance check ai-service-metrics can fail non-deterministically with:
The validator orchestrator Job tolerates every taint and has no node-affinity toward Prometheus, so the kube-scheduler may place it on any worker node — including one whose ENI is in a security group whose ingress to the Prometheus-hosting SG is missing or asymmetric. The outcome is not stable across re-runs: image-locality scoring tends to keep the pod on whatever node won the first scheduling decision, so a passing run on a fresh cluster does not prove the SG topology is correct.
This is a cluster-side prerequisite, not an AICR bug per se — see
EKS Dynamo Networking Prerequisites
for the SG ingress rules required for Prometheus (tcp/9090). The underlying
issue is tracked at #933.
Workaround when SG changes are not available: re-run the check until the orchestrator lands on a node whose SG can reach Prometheus, then leave the image cached there so image-locality keeps subsequent runs on the same node. This is unreliable and should not be used as the steady-state validation strategy.
Benchmark Job stuck or timed out
Each performance check has a Job-level activeDeadlineSeconds set by the catalog’s timeout:. For inference-perf, the full pipeline (workload ready → endpoint health → benchmark) can take up to 30 min on cold-start clusters. If it still times out:
Common causes: image pull throttling, vLLM model load slowness, and every
candidate GPU node being fully saturated by existing DRA (ResourceClaim)
allocations. In the saturated case the validator fails fast with a message
like no candidate GPU node has free GPUs — all N matched node(s) are saturated by existing DRA ResourceClaim allocations; the fix is to free
GPUs on one of the candidate nodes, or to pass
--node-selector kubernetes.io/hostname=<node> to target a specific node
you know is free. On clusters where the DRA API is not installed or the
validator’s service account cannot list resourceclaims, the check falls
back to sizing purely from Status.Allocatable["nvidia.com/gpu"] — which
does not account for in-use DRA devices and can leave the benchmark
Pending until timeout on a partially-occupied node.
Related
- CLI reference: aicr validate (full flag reference and per-command examples)
- CLI reference: aicr snapshot (snapshot capture options)
- CLI reference: aicr recipe (recipe generation flags)
- Agent deployment (capture snapshots via an in-cluster Job)
- Data flow: Stage 3 Validate (how the validator engine is built)
- Validator Development Guide (add a new validator; contributor-facing)