Validating a Cluster

Task-oriented walkthrough for running aicr validate against a GPU cluster — from capturing a snapshot through interpreting results. Covers both training and inference workloads and all three validation phases (deployment, performance, conformance).

For per-flag reference, see CLI reference: aicr validate. For the architectural view of how snapshot + recipe flow into the validator, see Data flow: Stage 3 Validate.

When to validate

| Phase | What it answers | Typical trigger |
| --- | --- | --- |
| deployment | Are the components the recipe asks for actually installed and healthy? | After ./deploy.sh finishes, before running any workload |
| performance | Does the cluster hit expected bandwidth / throughput thresholds? | After components are ready; before going to production |
| conformance | Does the cluster support workload-specific capabilities (DRA, gang scheduling, autoscaling, …)? | Before opening the cluster to real workloads |

Readiness pre-flight constraints (K8s version, OS, kernel) run implicitly before any phase. If pre-flight fails, no validator Jobs are deployed.

The workflow

aicr snapshot ─┐
               ├─▶ aicr validate ─▶ CTRF report
aicr recipe ───┘                    (passed / failed / skipped per check)

  1. Snapshot — capture current cluster state (K8s / OS / GPU / topology) once.
  2. Recipe — generate the target configuration for your workload (training vs inference, platform, accelerator).
  3. Validate — run one or all phases against the snapshot and live cluster.

Prerequisites

  • aicr CLI installed (see installation).
  • kubectl configured for the target cluster (validator dispatches K8s Jobs; pre-flight only needs the snapshot).
  • Cluster service account with RBAC to create Jobs, ConfigMaps, and read cluster state (AICR creates its own aicr-validation namespace on first run).

Training performance validation

Training performance runs an NCCL all-reduce benchmark — a Kubeflow TrainJob that runs all_reduce_perf across GPU nodes and measures aggregate bus bandwidth. Three check variants are available; the recipe picks the one (or ones) that match the target fabric:

| Check | Transport | When it’s selected |
| --- | --- | --- |
| nccl-all-reduce-bw | Auto-detect (whatever NCCL picks) | Default for H100 on EKS/GKE, and for GB200/B200 on non-EKS services. Preserves the pre-variant behavior. |
| nccl-all-reduce-bw-net | NET (EFA on EKS) | GB200 + EKS. Asserts EFA actually carried traffic; catches silent fallback to Socket when the NVIDIA driver is missing NVreg_GrdmaPciTopoCheckOverride=1. |
| nccl-all-reduce-bw-nvls | NVLS (MNNVL across an NVL72 IMEX domain) | GB200 + EKS. Asserts the NVLS communicator actually initialized; catches silent fallback to EFA when the IMEX domain is misconfigured. |

GB200/EKS recipes (both training and inference intents) enable -net and -nvls together rather than the auto-detect variant, because those nodes expose two inter-node fabrics simultaneously and a single auto-detect test would only exercise one of them.

# Capture snapshot, generate training recipe, validate the performance phase.
$ aicr snapshot --output snapshot.yaml

$ aicr recipe --service eks --accelerator h100 --os ubuntu \
    --intent training --platform kubeflow \
    --output recipe.yaml

$ aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --phase performance

The generated recipe lists the selected variant(s) under validation.performance.checks with a platform-tuned bandwidth constraint (example: >= 300 GB/s for H100 + EFA; >= 40 GB/s NET and >= 500 GB/s NVLS for GB200 + EFA, each sized for a 2-node pair).

Expected flow (~5–10 min per variant): readiness pre-flight → deploy TrainingRuntime + TrainJob in aicr-validation → worker pods reach Running → run all_reduce_perf → parse peak bus bandwidth → verify the intended transport actually carried traffic (for -net / -nvls) → compare to recipe constraint (10 % tolerance) → cleanup.

A passing CTRF entry:

{
  "name": "nccl-all-reduce-bw-net",
  "status": "passed",
  "suite": ["performance"],
  "stdout": [
    "NCCL All Reduce bandwidth (nccl-all-reduce-bw-net): <actual> GB/s",
    "Constraint: >= <threshold> → true"
  ]
}

Note: this guide does not yet list per-platform expected-bandwidth baselines (EKS + EFA, GKE + TCPXO, AKS, etc.). The recipe’s constraint value is the current pass/fail floor; measured values above that floor are treated as passing regardless of platform.

To run deployment validation first (recommended — verifies GPU Operator, DRA driver, and Kubeflow Trainer are installed and healthy before the benchmark):

$ aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --phase deployment

Inference performance validation

Inference performance runs the inference-perf check — deploys a DynamoGraphDeployment with a small vLLM-served model (Qwen/Qwen3-0.6B by default) plus an AIPerf benchmark Job, and measures end-to-end output-token throughput and time-to-first-token (TTFT) p99.

# Capture snapshot, generate inference recipe, validate the performance phase.
$ aicr snapshot --output snapshot.yaml

$ aicr recipe --service eks --accelerator h100 --os ubuntu \
    --intent inference --platform dynamo \
    --output recipe.yaml

$ aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --phase performance

The generated recipe includes dynamo-platform in componentRefs and lists inference-perf under validation.performance.checks with two constraints — one per metric the check produces:

validation:
  performance:
    checks: [inference-perf]
    constraints:
      - name: inference-throughput   # output tokens/sec
        value: ">= 5000"
      - name: inference-ttft-p99     # time-to-first-token p99 in ms
        value: "<= 200"
Expected flow (~5–7 min on H100): readiness pre-flight → deploy ResourceClaimTemplate + DynamoGraphDeployment in a per-run namespace aicr-inference-perf-<8-hex-suffix> → wait for state=successful (image pull + model load) → /health probe → AIPerf benchmark Job parses throughput + TTFT p99 → compare to recipe constraints (10 % tolerance) → cleanup.
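The final comparison step can be illustrated with a minimal Python sketch. It is not the validator's implementation: the function name `satisfies` is hypothetical, only the two comparators that appear in the recipe constraints (`>=` and `<=`) are handled, and applying the 10 % tolerance by relaxing the threshold in the passing direction is an assumption about how the tolerance works.

```python
import re

# Matches constraint strings such as ">= 5000" or "<= 200".
_CONSTRAINT = re.compile(r"^\s*(>=|<=)\s*([0-9]+(?:\.[0-9]+)?)\s*$")


def satisfies(actual: float, constraint: str, tolerance: float = 0.10) -> bool:
    """Evaluate a measured value against a recipe-style constraint string.

    ASSUMPTION: the tolerance relaxes the threshold in the passing direction,
    i.e. ">= T" passes at T * (1 - tolerance) and "<= T" at T * (1 + tolerance).
    """
    match = _CONSTRAINT.match(constraint)
    if match is None:
        raise ValueError(f"unsupported constraint: {constraint!r}")
    comparator, threshold = match.group(1), float(match.group(2))
    if comparator == ">=":
        return actual >= threshold * (1.0 - tolerance)
    return actual <= threshold * (1.0 + tolerance)


# Values taken from the sample recipe and the sample CTRF entry in this guide.
print(satisfies(38367.28, ">= 5000"))
print(satisfies(127.90, "<= 200"))
```

Passing `tolerance=0.0` gives a strict comparison, which can be useful when checking a measured value against the recipe floor by hand.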

All Dynamo Frontend and worker pods pin to a single GPU node via kubernetes.io/hostname for a stable per-node baseline. On a shared cluster where some GPUs on a candidate node are already held by another workload’s DRA ResourceClaim, the validator picks the candidate with the most free GPUs and sizes the benchmark to that count — so the check does not need an explicit hostname override to avoid saturated nodes. Concurrent aicr validate invocations are isolated from each other by the run-specific suffix on both the namespace and the inner AIPerf Job name.
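The node-selection rule described above (pick the candidate with the most free GPUs, size the benchmark to that count) can be sketched in a few lines of Python. This is an illustration only: `pick_node`, its two input dictionaries, the alphabetical tie-break, and the shortened error text are all hypothetical and are not taken from the AICR source.

```python
from typing import Dict, Tuple


def pick_node(allocatable: Dict[str, int], in_use: Dict[str, int]) -> Tuple[str, int]:
    """Return (node, free_gpus) for the candidate with the most free GPUs.

    allocatable: GPUs each candidate node advertises.
    in_use: GPUs already held by existing DRA ResourceClaims, per node.
    """
    free = {node: total - in_use.get(node, 0) for node, total in allocatable.items()}
    candidates = {node: count for node, count in free.items() if count > 0}
    if not candidates:
        # Mirrors the fail-fast behavior when every candidate is saturated.
        raise RuntimeError(
            f"no candidate GPU node has free GPUs ({len(allocatable)} matched node(s))"
        )
    # ASSUMPTION: ties are broken by node name; max() keeps the first maximum.
    best = max(sorted(candidates), key=lambda node: candidates[node])
    return best, candidates[best]


node, gpus = pick_node({"gpu-a": 8, "gpu-b": 8}, {"gpu-a": 6, "gpu-b": 2})
print(node, gpus)
```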

A passing CTRF entry (measured on EKS H100, 8 × H100 GPUs, Qwen/Qwen3-0.6B):

{
  "name": "inference-perf",
  "status": "passed",
  "suite": ["performance"],
  "stdout": [
    "RESULT: Inference throughput: 38367.28 tokens/sec",
    "RESULT: Inference TTFT p99: 127.90 ms",
    "Throughput constraint: >= 5000 → PASS",
    "TTFT p99 constraint: <= 200 → PASS"
  ]
}

The RESULT: prefix on the first two lines is the contract documented in pkg/validator/validator.go — any check that wants its summary lines echoed to the CLI’s own output (not just the CTRF report) opts in by emitting that prefix. The validator runtime strips the prefix when echoing; the full prefixed line stays in stdout[].
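A pipeline that only has the CTRF report can recover the same summary lines by filtering stdout[] on that prefix. The following Python sketch shows one way to do it; the helper name `result_lines` is hypothetical, and trimming the whitespace after the prefix is an assumption about how the echoed line looks.

```python
from typing import List

RESULT_PREFIX = "RESULT:"


def result_lines(entry: dict) -> List[str]:
    """Return the RESULT-prefixed stdout lines of a CTRF entry, prefix removed."""
    return [
        line[len(RESULT_PREFIX):].strip()
        for line in entry.get("stdout", [])
        if line.startswith(RESULT_PREFIX)
    ]


# The passing inference-perf entry shown earlier in this section.
entry = {
    "name": "inference-perf",
    "status": "passed",
    "suite": ["performance"],
    "stdout": [
        "RESULT: Inference throughput: 38367.28 tokens/sec",
        "RESULT: Inference TTFT p99: 127.90 ms",
        "Throughput constraint: >= 5000 → PASS",
        "TTFT p99 constraint: <= 200 → PASS",
    ],
}

for line in result_lines(entry):
    print(line)
```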

To run deployment validation first (recommended — verifies GPU Operator, DRA driver, Dynamo operator, KAI scheduler, and supporting components are installed and healthy):

$ aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --phase deployment

Skip scenarios

The inference validator has three explicit skip guards so it never runs where it can’t succeed. Each produces a status: skipped CTRF entry with a specific reason. Skipped checks are not failures: the validator container exits with code 2 internally (mapped to CTRF skipped), but aicr validate itself exits 0 for skipped/passed/other phases — a skipped inference check never drives a non-zero CLI exit on its own.

| Guard | Trigger | Skip message |
| --- | --- | --- |
| A | Recipe lists inference-perf in checks: but no matching inference-throughput / inference-ttft-p99 constraints | no inference-throughput or inference-ttft-p99 constraint in recipe |
| B | inference-perf is selected but dynamo-platform is not in recipe componentRefs | skipped - dynamo-platform not in recipe components |
| C | dynamo-platform is declared but the DynamoGraphDeployment CRD is not installed on the cluster (operator not deployed yet) | skipped - DynamoGraphDeployment CRD not installed on cluster (dynamo-platform component declared but operator not deployed yet) |

Guards fire before any cluster mutation, so skips are cheap (typically < 10 s).
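To predict locally whether inference-perf would skip, the guard table can be approximated with a small Python check over a parsed recipe. This is a rough sketch, not the validator's logic: the function name is hypothetical, the A → B → C evaluation order is assumed, Guard A is assumed to fire only when neither constraint is present, and componentRefs is assumed to be a list of names or of mappings with a `name` key.

```python
from typing import Optional

INFERENCE_CONSTRAINTS = {"inference-throughput", "inference-ttft-p99"}


def _component_names(recipe: dict) -> set:
    # ASSUMPTION: componentRefs entries are either strings or {"name": ...} mappings.
    return {
        ref["name"] if isinstance(ref, dict) else ref
        for ref in recipe.get("componentRefs", [])
    }


def inference_skip_reason(recipe: dict, crd_installed: bool) -> Optional[str]:
    """Return the expected skip message, or None if the check would run."""
    perf = recipe.get("validation", {}).get("performance", {})
    declared = {c.get("name") for c in perf.get("constraints", [])}
    if not declared & INFERENCE_CONSTRAINTS:  # Guard A
        return "no inference-throughput or inference-ttft-p99 constraint in recipe"
    if "dynamo-platform" not in _component_names(recipe):  # Guard B
        return "skipped - dynamo-platform not in recipe components"
    if not crd_installed:  # Guard C
        return "skipped - DynamoGraphDeployment CRD not installed on cluster"
    return None


recipe = {
    "componentRefs": [{"name": "gpu-operator"}],
    "validation": {"performance": {
        "checks": ["inference-perf"],
        "constraints": [{"name": "inference-throughput", "value": ">= 5000"}],
    }},
}
print(inference_skip_reason(recipe, crd_installed=True))
```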

Running all phases

$ aicr validate --recipe recipe.yaml --snapshot snapshot.yaml
# equivalent to: --phase deployment --phase performance --phase conformance

Phases run sequentially. If any phase fails, subsequent phases are skipped.

Scoping CNCF submission evidence to specific features

The --feature flag scopes which CNCF AI conformance features get behavioral evidence collected. It only applies to the CNCF-submission evidence collector and is rejected by the CLI unless --cncf-submission is also set (which in turn requires --evidence-dir). It does not scope the regular --phase conformance validator run — that one always evaluates every check defined in the recipe.

$ aicr validate --recipe recipe.yaml --snapshot snapshot.yaml \
    --phase conformance \
    --cncf-submission \
    --evidence-dir ./evidence \
    --feature dra-support --feature gang-scheduling

Empty --feature (the default) collects evidence for every feature.

Valid feature names (from pkg/evidence/cncf/collector.go):

| Name | What it checks |
| --- | --- |
| dra-support | Dynamic Resource Allocation driver and ResourceSlices |
| gang-scheduling | Gang-scheduler presence and PodGroup support |
| secure-access | Cluster authn/authz posture for AI workloads |
| accelerator-metrics | GPU metrics exporter and Prometheus scrape config |
| ai-service-metrics | Inference-service metrics via custom-metrics API |
| inference-gateway | Gateway API + Inference Extension installation |
| robust-operator | Operator readiness and leader-election posture |
| pod-autoscaling | HPA / custom-metrics-driven pod autoscaling |
| cluster-autoscaling | Karpenter (preferred) or EKS managed node-group autoscaling fallback |
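The flag dependencies and the empty-means-all default described in this section can be summarized as a short Python sketch. It models the documented rules only; `resolve_features`, its parameters, and its error messages are hypothetical and do not reflect the CLI's actual code or wording.

```python
from typing import List, Optional, Sequence

# Feature names as listed in the table above.
VALID_FEATURES = (
    "dra-support", "gang-scheduling", "secure-access", "accelerator-metrics",
    "ai-service-metrics", "inference-gateway", "robust-operator",
    "pod-autoscaling", "cluster-autoscaling",
)


def resolve_features(
    features: Sequence[str],
    cncf_submission: bool,
    evidence_dir: Optional[str],
) -> List[str]:
    """Return the features that evidence would be collected for."""
    if not cncf_submission:
        if features:
            raise ValueError("--feature requires --cncf-submission")
        return []  # no CNCF evidence collection requested
    if not evidence_dir:
        raise ValueError("--cncf-submission requires --evidence-dir")
    unknown = [name for name in features if name not in VALID_FEATURES]
    if unknown:
        raise ValueError(f"unknown feature(s): {', '.join(unknown)}")
    # An empty --feature list means every feature.
    return list(features) if features else list(VALID_FEATURES)


print(resolve_features(["dra-support", "gang-scheduling"], True, "./evidence"))
```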

Input modes

Snapshot and recipe can come from a file, an HTTPS URL, or a Kubernetes ConfigMap:

# File (default)
$ aicr validate --recipe recipe.yaml --snapshot snapshot.yaml

# HTTPS URL
$ aicr validate \
    --recipe https://artifacts.example.com/recipes/h100-eks-inference.yaml \
    --snapshot https://artifacts.example.com/snapshots/prod-cluster.yaml

# Kubernetes ConfigMap (for in-cluster operators)
$ aicr validate \
    --recipe cm://gpu-operator/aicr-recipe \
    --snapshot cm://gpu-operator/aicr-snapshot

The ConfigMap form is useful when the snapshot is captured by an in-cluster agent — see agent deployment.
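For tooling that wraps aicr validate, the three input forms can be told apart by their prefix. The Python sketch below is illustrative: `classify_source` is a hypothetical helper, and reading the ConfigMap form as cm://<namespace>/<name> is inferred from the examples in this section rather than from a documented grammar.

```python
def classify_source(ref: str) -> dict:
    """Classify a --recipe / --snapshot value as file, HTTPS URL, or ConfigMap."""
    if ref.startswith("https://"):
        return {"kind": "url", "url": ref}
    if ref.startswith("cm://"):
        # ASSUMPTION: the form is cm://<namespace>/<name>.
        namespace, _, name = ref[len("cm://"):].partition("/")
        if not namespace or not name or "/" in name:
            raise ValueError(f"expected cm://<namespace>/<name>, got {ref!r}")
        return {"kind": "configmap", "namespace": namespace, "name": name}
    # Anything else is treated as a local file path (the default mode).
    return {"kind": "file", "path": ref}


for value in ("recipe.yaml", "cm://gpu-operator/aicr-snapshot"):
    print(classify_source(value))
```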

Dry-run mode

--no-cluster runs the validator against the snapshot alone, skipping all Kubernetes API calls. Declarative constraints still evaluate; behavioral checks report skipped - no-cluster mode (test mode).

$ aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --no-cluster

Useful for CI pipelines that validate a recipe against a captured snapshot without needing cluster access.

CI/CD integration

aicr validate exits non-zero when any phase fails. CTRF JSON is emitted to stdout (or to --output <file>), so a pipeline can gate promotion on both the exit code and the structured report:

$ aicr validate \
    --recipe recipe.yaml \
    --snapshot cm://gpu-operator/aicr-snapshot \
    --output ctrf.json

Exit codes follow Unix conventions and are derived from the CLI’s structured error codes (see pkg/errors/exitcode.go):

| Code | Meaning |
| --- | --- |
| 0 | All phases reported status passed, skipped, or other |
| 2 | Invalid input or request (ErrCodeInvalidRequest): bad CLI flag, malformed argument, or a validator rejecting a recipe value (e.g., an inference constraint that uses the wrong comparator direction) |
| 5 | CLI-layer timeout before a check runs: snapshot-agent Job never completes within --timeout, or the validator Job as a whole exceeds its wait deadline |
| 8 | One or more phases reported status failed, including per-check internal timeouts (e.g., DynamoGraphDeployment not ready within InferenceWorkloadReadyTimeout) |

Important: two quirks to be aware of when gating a pipeline on exit code:

  1. Only phase status failed drives a non-zero exit. A phase whose status is other (check crashed, pod OOM, activeDeadlineSeconds exceeded) still produces exit 0. Pipelines that need to catch those outcomes must inspect the CTRF report and look at per-phase status or the summary.other count, not rely on exit code alone.
  2. Exit 5 is narrower than it sounds. A timeout inside a check’s own logic (DynamoGraphDeployment not ready, inference endpoint never healthy, AIPerf Job pod-wait deadline) surfaces as a failed phase, not as a structured ErrCodeTimeout, so the CLI exits 8. Only timeouts at the CLI-to-cluster layer (snapshot-agent wait, validator-Job wait) retain their ErrCodeTimeout classification all the way through to exit 5.

Scripts that gate on validation outcome should treat any non-zero code as failure rather than branching on specific values, and should additionally check CTRF summary.failed and summary.other for a complete picture.
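A gate that follows this advice might look like the Python sketch below. It assumes the report uses the standard CTRF layout with the counters under `results.summary` (falling back to a top-level `summary` if that nesting is absent); the function name `gate_reasons` and the wording of its messages are hypothetical.

```python
from typing import List


def gate_reasons(exit_code: int, report: dict) -> List[str]:
    """Return the reasons to fail a pipeline; an empty list means promote."""
    reasons = []
    # Treat any non-zero exit as failure instead of branching on 2 / 5 / 8.
    if exit_code != 0:
        reasons.append(f"aicr validate exited with code {exit_code}")
    # ASSUMPTION: CTRF nests the summary under "results".
    summary = report.get("results", report).get("summary", {})
    for key in ("failed", "other"):
        count = summary.get(key, 0)
        if count:
            reasons.append(f"CTRF summary.{key} = {count}")
    return reasons


# Exit 0 with an "other" outcome (for example a crashed check) must still block.
report = {"results": {"summary": {"passed": 3, "failed": 0, "skipped": 1, "other": 1}}}
print(gate_reasons(0, report))
```

In a CI job, the report would typically be loaded with `json.load` from the file passed to --output, and the job would fail whenever the returned list is non-empty.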

For informational-only runs (report results without failing the build):

$ aicr validate ... --fail-on-error=false

Troubleshooting

Readiness pre-flight fails

The CLI logs each readiness constraint comparison before any phase runs:

readiness constraint failed: name=K8s.server.version expected=">= 1.34" actual=v1.33.0-eks-abc

Fix: upgrade the cluster, or pick a recipe whose readiness constraints match the cluster’s actual versions.
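To check ahead of time whether a cluster version would clear a readiness constraint like the one in the log line above, a comparison along these lines can help. It is a simplified Python sketch under stated assumptions: `k8s_version_satisfies` is a hypothetical helper, only major.minor is compared, and the set of accepted comparators is a guess rather than the recipe grammar.

```python
import operator
import re

_OPS = {
    ">=": operator.ge, "<=": operator.le, "==": operator.eq,
    ">": operator.gt, "<": operator.lt,
}
# Longer comparators come first so ">=" is not read as ">".
_CONSTRAINT = re.compile(r"^\s*(>=|<=|==|>|<)\s*v?(\d+)\.(\d+)")
_VERSION = re.compile(r"^v?(\d+)\.(\d+)")


def k8s_version_satisfies(actual: str, constraint: str) -> bool:
    """Compare a server version such as 'v1.33.0-eks-abc' to '>= 1.34'."""
    wanted = _CONSTRAINT.match(constraint)
    found = _VERSION.match(actual)
    if wanted is None or found is None:
        raise ValueError(f"cannot compare {actual!r} against {constraint!r}")
    # Tuples compare element by element: (major, minor).
    actual_pair = (int(found.group(1)), int(found.group(2)))
    wanted_pair = (int(wanted.group(2)), int(wanted.group(3)))
    return _OPS[wanted.group(1)](actual_pair, wanted_pair)


print(k8s_version_satisfies("v1.33.0-eks-abc", ">= 1.34"))
```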

Non-standard GPU labels or taints

Default GPU-node discovery looks for nodeGroup, node.kubernetes.io/instance-type, or GPU-related label substrings. If your cluster uses custom labels, override the scheduling of inner workloads with --node-selector and --toleration:

$ aicr validate \
    --recipe recipe.yaml --snapshot snapshot.yaml --phase performance \
    --node-selector my-org/gpu-pool=h100 \
    --toleration dedicated=worker-workload:NoSchedule \
    --toleration dedicated=worker-workload:NoExecute

These flags affect the inner benchmark pods that run on GPU nodes (NCCL workers, Dynamo workers), not the validator orchestrator Job itself. For inference-perf specifically, --node-selector narrows the pool of candidate GPU nodes — the validator then picks the candidate with the most free GPUs (after accounting for in-use DRA allocations) and pins all Dynamo Frontend + worker pods to that node via kubernetes.io/hostname. The AIPerf benchmark runner pod is CPU-only, uses a tolerate-all / no-nodeSelector pod spec, and is unaffected by these flags.

A check reports skipped unexpectedly

Skips are always deliberate and always carry a reason, but the location of the reason in the CTRF entry depends on how the skip happened:

  • Check-level skips (the CheckFunc ran and returned validators.Skip(reason) — e.g., Guards A/B/C on inference, --no-cluster from inside a check): reason appears in stdout as level=INFO msg=SKIP reason="…".
  • Phase-level skips (the CheckFunc never ran — e.g., a prior phase failed, so subsequent phases synthesize skip entries; also --no-cluster for checks that the runner marks skipped before dispatch): reason appears in message, not stdout.
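Because the reason can sit in either field, a report reader has to look in both. The Python sketch below checks stdout for the SKIP log line first and then falls back to message; `skip_reason` is a hypothetical helper, and the regular expression assumes the log line has exactly the `msg=SKIP reason="…"` shape quoted above.

```python
import re
from typing import Optional, Tuple

# Matches the check-level log line: level=INFO msg=SKIP reason="..."
_SKIP_LINE = re.compile(r'msg=SKIP reason="([^"]*)"')


def skip_reason(entry: dict) -> Optional[Tuple[str, str]]:
    """Return (location, reason) for a skipped CTRF entry, else None."""
    if entry.get("status") != "skipped":
        return None
    for line in entry.get("stdout", []):
        match = _SKIP_LINE.search(line)
        if match:
            return ("stdout", match.group(1))  # check-level skip
    if entry.get("message"):
        return ("message", entry["message"])  # phase-level skip
    return None


check_level = {
    "name": "inference-perf",
    "status": "skipped",
    "stdout": ['level=INFO msg=SKIP reason="skipped - dynamo-platform not in recipe components"'],
}
phase_level = {
    "name": "inference-perf",
    "status": "skipped",
    "message": "skipped due to previous phase failure",
}
print(skip_reason(check_level))
print(skip_reason(phase_level))
```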

Common reasons and their cause:

| Reason (excerpt) | Where it appears | Meaning | Fix |
| --- | --- | --- | --- |
| no inference-throughput or inference-ttft-p99 constraint in recipe | stdout | Check was invoked but recipe is missing the matching constraints | Re-generate the recipe or add the constraints |
| dynamo-platform not in recipe components | stdout | Inference check selected but dynamo-platform absent from componentRefs | Use --platform dynamo when generating the recipe |
| DynamoGraphDeployment CRD not installed | stdout | Recipe declares dynamo-platform but the operator is not deployed | Run aicr bundle + ./deploy.sh first, or wait for bootstrap to complete |
| skipped - no-cluster mode | message | --no-cluster was passed; the runner short-circuits every phase before dispatching any Job | Remove the flag to run behavioral checks |
| skipped due to previous phase failure | message | An earlier phase failed and subsequent phases are skipped | Fix the earlier phase first, then re-run |

ai-service-metrics fails with “Prometheus unreachable”

On EKS clusters that split worker and system pods across separate security groups (e.g. DGXC EKS with distinct customer/system ENI subnets), the conformance check ai-service-metrics can fail non-deterministically with:

[SERVICE_UNAVAILABLE] Prometheus unreachable at http://kube-prometheus-prometheus.monitoring.svc:9090 — verify network connectivity

The validator orchestrator Job tolerates every taint and has no node-affinity toward Prometheus, so the kube-scheduler may place it on any worker node — including one whose ENI is in a security group whose ingress to the Prometheus-hosting SG is missing or asymmetric. The outcome is not stable across re-runs: image-locality scoring tends to keep the pod on whatever node won the first scheduling decision, so a passing run on a fresh cluster does not prove the SG topology is correct.

This is a cluster-side prerequisite, not an AICR bug per se — see EKS Dynamo Networking Prerequisites for the SG ingress rules required for Prometheus (tcp/9090). The underlying issue is tracked at #933.

Workaround when SG changes are not available: re-run the check until the orchestrator lands on a node whose SG can reach Prometheus, then leave the image cached there so image-locality keeps subsequent runs on the same node. This is unreliable and should not be used as the steady-state validation strategy.

Benchmark Job stuck or timed out

Each performance check has a Job-level activeDeadlineSeconds set by the catalog’s timeout:. For inference-perf, the full pipeline (workload ready → endpoint health → benchmark) can take up to 30 min on cold-start clusters. If it still times out:

# The validator orchestrator Job and the AIPerf benchmark Job both live in aicr-validation.
# The orchestrator is named aicr-inference-perf-<hex> (random suffix per run);
# the AIPerf Job is named aicr-aiperf-<run-id-hash>.
$ kubectl -n aicr-validation get jobs | grep -E 'aicr-inference-perf-|aicr-aiperf-'

# Tail each by full Job name (label selectors require an exact match).
$ kubectl -n aicr-validation logs -l job-name=aicr-inference-perf-<hex> --tail=200
$ kubectl -n aicr-validation logs -l job-name=aicr-aiperf-<run-id-hash> --tail=200

# The Dynamo workload (DynamoGraphDeployment, Frontend, worker pods,
# ResourceClaimTemplate) lives in a separate per-run namespace:
$ kubectl get ns | grep aicr-inference-perf-
$ kubectl -n aicr-inference-perf-<suffix> get dynamographdeployments,pods,svc

Common causes: image pull throttling, vLLM model load slowness, and every candidate GPU node being fully saturated by existing DRA (ResourceClaim) allocations. In the saturated case the validator fails fast with a message like no candidate GPU node has free GPUs — all N matched node(s) are saturated by existing DRA ResourceClaim allocations; the fix is to free GPUs on one of the candidate nodes, or to pass --node-selector kubernetes.io/hostname=<node> to target a specific node you know is free. On clusters where the DRA API is not installed or the validator’s service account cannot list resourceclaims, the check falls back to sizing purely from Status.Allocatable["nvidia.com/gpu"] — which does not account for in-use DRA devices and can leave the benchmark Pending until timeout on a partially-occupied node.