# Validator Development Guide
Learn how to add new validation checks to AICR.
## Overview
AICR uses a container-per-validator model. Each validation check runs as an isolated Kubernetes Job with access to the cluster, a snapshot, and the recipe. Validators are organized into three phases, and each phase directory is built into its own container image.
Architecture:
- Declarative Catalog: Validators are defined in `recipes/validators/catalog.yaml`
- Container Contract: Exit code 0 = pass, 1 = fail, 2 = skip
- Evidence via stdout: Check output printed to stdout is captured as CTRF evidence
- Debug via stderr: Structured logs go to stderr and are streamed to the user
- CTRF Reports: Results are aggregated into Common Test Report Format JSON
## Quick Start
Adding a new check to an existing validator container requires three steps.
### Step 1: Implement the Check Function
Create a new file in the appropriate phase directory (e.g., `validators/deployment/`):
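A minimal sketch, assuming the `CheckFunc` signature takes a `*validators.Context` and returns an `error` (the real signature and import path live in the shared `validators` package); `inspectCluster` is a hypothetical helper standing in for the actual logic:

```go
package main

import (
	"fmt"
	"log/slog"
)

// checkMyCheck is an illustrative new check for validators/deployment/my_check.go.
func checkMyCheck(vctx *validators.Context) error {
	slog.Info("running my-check") // stderr → streamed live to the user

	ok, detail, err := inspectCluster(vctx) // hypothetical helper doing the real work
	if err != nil {
		return fmt.Errorf("my-check: %w", err) // non-nil error → exit 1 (failed)
	}
	if !ok {
		return fmt.Errorf("my-check failed: %s", detail)
	}

	fmt.Println(detail) // stdout → captured as CTRF evidence
	return nil          // exit 0 → passed
}
```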
### Step 2: Register in main.go
Add the check function to the dispatch map in `validators/deployment/main.go`:
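A sketch, assuming `validators.Run` accepts a map from dispatch key (matched against `os.Args[1]`) to check function:

```go
package main

// Each key here is what a catalog entry's args field selects at runtime.
func main() {
	validators.Run(map[string]validators.CheckFunc{
		"operator-health": checkOperatorHealth,
		"my-check":        checkMyCheck, // the new check from Step 1
	})
}
```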
### Step 3: Add Catalog Entry
Add an entry to `recipes/validators/catalog.yaml`:
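An illustrative entry (`args` and `timeout` are the fields referenced elsewhere in this guide; the other field names are assumptions, so copy the shape of an existing entry):

```yaml
- name: my-check
  phase: deployment
  image: ghcr.io/nvidia/aicr-validators/deployment:latest  # tag resolved by catalog.Load
  args: my-check      # must match the dispatch-map key passed to validators.Run()
  timeout: 5m         # injected into the Job as AICR_CHECK_TIMEOUT
```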
The `args` field must match the key used in the `validators.Run()` dispatch map.
## Container Contract
Every validator container must follow this contract:
### Exit Codes
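| Exit code | Meaning |
|-----------|---------|
| 0 | Check passed |
| 1 | Check failed (failure message written to the termination log) |
| 2 | Check skipped (not applicable to this cluster/recipe) |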
### I/O Channels
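| Channel | Purpose |
|---------|---------|
| stdout | Captured as CTRF evidence |
| stderr | Structured (`slog`) debug logs, streamed live to the user |
| Termination log | Failure message written by the runner when a check returns an error |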
### Mounted Data
The validator engine mounts snapshot and recipe data into the container as ConfigMaps.
### Environment Variables
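Variables referenced elsewhere in this guide (the authoritative set is whatever the Job deployer injects):

- `AICR_CHECK_TIMEOUT`: per-check timeout, injected from the catalog entry's `timeout` field
- `AICR_RUN_ID`: unique run identifier, used to derive per-run namespaces
- `AICR_CLI_VERSION`, `AICR_CLI_COMMIT`: CLI build metadata forwarded for image-tag resolution
- `AICR_VALIDATOR_IMAGE_REGISTRY`, `AICR_VALIDATOR_IMAGE_TAG`: image overrides forwarded so inner workloads resolve images with the same semantics as `catalog.Load`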
## Context API
The `validators.Context` struct provides all dependencies a check needs:
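A sketch of the shape; the field names beyond `NodeSelector` and `Tolerations` (which are referenced below) are assumptions, and the authoritative definition is in `validators/context.go`:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
)

// Context is shown here with illustrative field names only.
type Context struct {
	context.Context // parent context carrying the check timeout

	Clientset kubernetes.Interface // in-cluster typed client (assumed name)
	Dynamic   dynamic.Interface    // in-cluster dynamic client (assumed name)
	Snapshot  map[string]any       // mounted snapshot data (assumed shape)
	Recipe    map[string]any       // mounted recipe data (assumed shape)

	NodeSelector map[string]string   // optional scheduling override (nil if unset)
	Tolerations  []corev1.Toleration // optional scheduling override (nil if unset)
}
```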
`LoadContext()` builds this from the container environment: it reads the mounted ConfigMaps, creates in-cluster K8s clients, and sets the parent-context timeout via `validators/context.go:checkTimeoutFromEnv` — which honors `AICR_CHECK_TIMEOUT` (injected by the Job deployer from the catalog entry's `timeout` field) and falls back to `defaults.CheckExecutionTimeout` when unset or malformed.
### Scheduling Overrides
When creating inner workloads (pods, Jobs, TrainJobs), check `ctx.NodeSelector` and `ctx.Tolerations` before applying hardcoded platform selectors. If non-nil, these override the default scheduling constraints to support clusters with non-standard GPU node labels or taints.
Validators that use `nodeName` pinning (e.g., nvidia-smi, DRA isolation) bypass the scheduler entirely and should not apply `ctx.NodeSelector`.
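A minimal sketch of that pattern when building an inner workload's pod spec (the default selector shown is illustrative, and the `Context` field names are the ones assumed above):

```go
import corev1 "k8s.io/api/core/v1"

// applyScheduling copies user-supplied overrides onto an inner workload's pod
// spec, falling back to a hardcoded platform default only when no override is set.
func applyScheduling(vctx *validators.Context, spec *corev1.PodSpec) {
	if vctx.NodeSelector != nil {
		spec.NodeSelector = vctx.NodeSelector
	} else {
		spec.NodeSelector = map[string]string{"nvidia.com/gpu.present": "true"} // illustrative default
	}
	if vctx.Tolerations != nil {
		spec.Tolerations = vctx.Tolerations
	}
}
```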
### Helper Methods
`ctx.Timeout(d)` — Create a child context with a specific timeout:
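For example (assuming the helper returns a `context.Context` and its `context.CancelFunc`):

```go
ctx, cancel := vctx.Timeout(5 * time.Minute) // child of the check's parent context
defer cancel()

// Use ctx for any API call that should respect the shorter deadline.
pods, err := vctx.Clientset.CoreV1().Pods("gpu-operator").List(ctx, metav1.ListOptions{})
```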
## Runner Utilities
`validators.Run(checks)` — Main entry point for validator containers. Handles context loading, check dispatch by `os.Args[1]`, exit codes, and termination log writing.
`validators.Skip(reason)` — Return from a `CheckFunc` to indicate the check is not applicable. The runner exits with code 2:
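For example (`featureEnabled` is a hypothetical helper standing in for whatever the check inspects):

```go
func checkOptionalFeature(vctx *validators.Context) error {
	enabled, err := featureEnabled(vctx) // hypothetical detection helper
	if err != nil {
		return err // exit 1: the detection itself failed
	}
	if !enabled {
		// The runner exits with code 2 and reports the check as skipped.
		return validators.Skip("feature not enabled on this cluster")
	}
	// ... the actual validation ...
	return nil
}
```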
## Catalog Entry Schema
Each entry in `recipes/validators/catalog.yaml` declares the check's name, the container image to run, the `args` key dispatched by `validators.Run()`, and a per-check `timeout` (injected into the Job as `AICR_CHECK_TIMEOUT`).
Image tag resolution (applied by `catalog.Load`):
1. `:latest` tags are replaced with the CLI version (e.g., `:v0.9.5`) for release builds
2. On non-release dev builds with a valid commit, `:latest` becomes `:sha-<commit>` (matches the tags `on-push.yaml` pushes for merges to `main`)
3. Explicit version tags (e.g., `:v1.2.3`) are not modified by steps 1-2
4. `AICR_VALIDATOR_IMAGE_TAG` overrides the resolved tag on every validator image, including explicit catalog tags. Use this when running `aicr validate` from a feature-branch dev build whose commit has not been merged to `main` (no `:sha-<commit>` image has been published). Typical value: `latest`. Example: `AICR_VALIDATOR_IMAGE_TAG=latest aicr validate --phase performance ...`
5. `AICR_VALIDATOR_IMAGE_REGISTRY` overrides the registry prefix

Digest-pinned references (`name@sha256:…`) are not rewritten by step 4. A tag override is meaningless against a content-addressable pin, and naive rewriting would corrupt the digest. Step 5's registry override still applies — only the registry prefix changes, the digest is preserved verbatim.
Env-var forwarding to the validator pod: `AICR_CLI_VERSION`, `AICR_CLI_COMMIT`, `AICR_VALIDATOR_IMAGE_REGISTRY`, and `AICR_VALIDATOR_IMAGE_TAG` are forwarded from the CLI invocation into the validator container so that validators resolving inner workload images at runtime (e.g. inference-perf's AIPerf benchmark Job) apply the same semantics as `catalog.Load`. If you set `AICR_VALIDATOR_IMAGE_TAG=latest` on the CLI, the override reaches both the outer validator Job and the inner benchmark Job — they always travel together.
Pull-policy behavior when the override is set: both the outer validator Job and every inner workload Job it dispatches route through the shared `catalog.ImagePullPolicy(image)` helper (`pkg/validator/catalog/catalog.go`). The rule, in precedence order, is:
- Side-loaded refs (`ko.local/*`, `kind.local/*`) → `Never` (no registry to pull from).
- Digest-pinned refs (`name@sha256:…`) → `IfNotPresent`. Cryptographic immutability means a cached copy is always correct; forcing `Always` here would make kubelet re-contact the registry every run, which breaks disconnected / air-gapped clusters even though the image itself was never overridden.
- `AICR_VALIDATOR_IMAGE_TAG` is set → `Always`. Override values are typically mutable (`latest`, `edge`, `main`, or any tag `on-push.yaml` recreates on every merge), so `IfNotPresent` would let a node's previously cached image win over the tag's current target.
- `:latest` suffix → `Always`. Mutable tag by convention.
- Otherwise → `IfNotPresent`. Versioned tag assumed immutable enough that caching is a win.
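A sketch of those rules (illustrative only; the real helper in `pkg/validator/catalog/catalog.go` may read the override differently than the direct env lookup shown here):

```go
import (
	"os"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// ImagePullPolicy applies the precedence rules above to an image reference.
func ImagePullPolicy(image string) corev1.PullPolicy {
	switch {
	case strings.HasPrefix(image, "ko.local/") || strings.HasPrefix(image, "kind.local/"):
		return corev1.PullNever // side-loaded: nothing to pull
	case strings.Contains(image, "@sha256:"):
		return corev1.PullIfNotPresent // digest-pinned: cached copy is always correct
	case os.Getenv("AICR_VALIDATOR_IMAGE_TAG") != "":
		return corev1.PullAlways // override tags are typically mutable
	case strings.HasSuffix(image, ":latest"):
		return corev1.PullAlways // mutable by convention
	default:
		return corev1.PullIfNotPresent // versioned tag, assumed immutable
	}
}
```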
Callers in this repo: the outer validator Job's `Deployer.imagePullPolicy()` (`pkg/validator/job/deployer.go`) and the inner AIPerf benchmark pod spec in `buildAIPerfJob` (`validators/performance/inference_perf_constraint.go`). They both delegate to the same helper so their policy can't drift. When adding a new inner workload Job in `validators/<phase>/*`, set `ImagePullPolicy: catalog.ImagePullPolicy(<resolved image>)` on the container to keep the invariant.
Performance phase example — inference perf:
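An illustrative catalog entry for this check, using the same hypothetical field names as Step 3 above:

```yaml
- name: inference-perf
  phase: performance
  image: ghcr.io/nvidia/aicr-validators/performance:latest
  args: inference-perf
  timeout: 45m   # benchmark runs are long; injected as AICR_CHECK_TIMEOUT
```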
Paired constraints in an overlay (one per metric the check produces):
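A hypothetical overlay snippet; the metric names and constraint fields are illustrative, not the actual recipe schema:

```yaml
validation:
  performance:
    checks:
      - inference-perf
    constraints:
      - metric: output_tokens_per_second   # hypothetical metric name
        min: 1200                          # evaluated with the 10% tolerance described below
      - metric: p99_latency_ms             # hypothetical metric name
        max: 250
```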
## Performance Validators
Two performance checks ship today, both registered in `validators/performance/main.go`.
Both follow a consistent lifecycle:
- Deploy a fresh benchmark workload. `inference-perf` always provisions its own `DynamoGraphDeployment` into a per-run namespace (`aicr-inference-perf-<hash>`) derived from `AICR_RUN_ID`, so two concurrent runs cannot collide and a prior run's leftovers cannot be silently adopted. An earlier design sketch had a "discover existing frontend" path — it was intentionally dropped because it admitted ambiguity about which service was being benchmarked on shared clusters.
- Wait for readiness via the watch API (not polling) on the workload CR's status.
- Run the benchmark in a K8s Job, capturing stdout with sentinels that survive noisy logs.
- Parse and evaluate against recipe constraints with a 10% tolerance.
- Defer cleanup — the per-run namespace is torn down on both success and failure so leaked workloads from interrupted prior runs are reaped on the next invocation.
The inference check injects pod-scheduling fields (`nodeSelector`, `tolerations`, DRA `resourceClaims`) into the unstructured `DynamoGraphDeployment` programmatically rather than via text substitution, to avoid YAML-escape issues with taint values.
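A sketch of that injection using the unstructured helpers from `k8s.io/apimachinery`; the nested field paths inside the `DynamoGraphDeployment` spec are assumptions:

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime"
)

// injectScheduling sets nodeSelector/tolerations on the unstructured CR directly,
// so taint values never pass through text templating or YAML escaping.
func injectScheduling(dgd *unstructured.Unstructured, sel map[string]string, tols []corev1.Toleration) error {
	if sel != nil {
		m := make(map[string]interface{}, len(sel))
		for k, v := range sel {
			m[k] = v
		}
		// Field path is illustrative; the real path depends on the CRD's spec layout.
		if err := unstructured.SetNestedMap(dgd.Object, m, "spec", "podSpec", "nodeSelector"); err != nil {
			return err
		}
	}
	if len(tols) > 0 {
		raw := make([]interface{}, 0, len(tols))
		for i := range tols {
			obj, err := runtime.DefaultUnstructuredConverter.ToUnstructured(&tols[i])
			if err != nil {
				return err
			}
			raw = append(raw, obj)
		}
		if err := unstructured.SetNestedSlice(dgd.Object, raw, "spec", "podSpec", "tolerations"); err != nil {
			return err
		}
	}
	return nil
}
```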
AIPerf runner image. The benchmark Job spawned by inference-perf pulls a pre-built image (`ghcr.io/nvidia/aicr-validators/aiperf-bench:<tag>`) with `aiperf` already installed via `pip`. The image is published by the same `on-tag.yaml` workflow that publishes the three Go validator images; its Dockerfile at `validators/performance/aiperf-bench.Dockerfile` pins the `AIPERF_VERSION` build arg. Baking the install at release time (rather than running `pip install` on every benchmark pod) removes the PyPI runtime dependency, eliminates a ~30 s warmup, and keeps the check air-gap-friendly on clusters with only ghcr.io access.
## Code Walkthrough
The `operator_health.go` check demonstrates the standard pattern:
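A condensed sketch of that shape (not the actual file; the `Clientset` field name is the assumption used throughout this guide):

```go
package main

import (
	"context"
	"fmt"
	"log/slog"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func checkOperatorHealth(vctx *validators.Context) error {
	slog.Info("checking operator deployments") // stderr → streamed live

	deps, err := vctx.Clientset.AppsV1().Deployments("gpu-operator").
		List(context.Background(), metav1.ListOptions{})
	if err != nil {
		return fmt.Errorf("listing operator deployments: %w", err) // exit 1 → failed
	}
	if len(deps.Items) == 0 {
		return validators.Skip("gpu-operator not installed") // exit 2 → skipped
	}

	for _, d := range deps.Items {
		if d.Status.ReadyReplicas < d.Status.Replicas {
			return fmt.Errorf("deployment %s not ready: %d/%d replicas",
				d.Name, d.Status.ReadyReplicas, d.Status.Replicas)
		}
		fmt.Printf("deployment %s: %d/%d replicas ready\n", // stdout → CTRF evidence
			d.Name, d.Status.ReadyReplicas, d.Status.Replicas)
	}
	return nil // exit 0 → passed
}
```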
Key patterns:
- `slog.*` → stderr → streamed live to user
- `fmt.Printf` → stdout → captured as CTRF evidence
- `return nil` → exit 0 → passed
- `return errors.*` → exit 1 → failed (message written to termination log)
- `return validators.Skip(reason)` → exit 2 → skipped
## Directory Layout
Each phase directory produces one container image. Multiple checks are compiled into a single binary and selected via the first argument.
## Testing
### Unit Tests
Use fake K8s clients for isolated testing:
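A table-driven sketch using the fake clientset from client-go (constructing `validators.Context` directly is an assumption; use whatever constructor the package exposes):

```go
package main

import (
	"testing"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func TestCheckOperatorHealth(t *testing.T) {
	tests := []struct {
		name    string
		ready   int32
		wantErr bool
	}{
		{"all replicas ready", 2, false},
		{"replicas not ready", 0, true},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			dep := &appsv1.Deployment{
				ObjectMeta: metav1.ObjectMeta{Name: "gpu-operator", Namespace: "gpu-operator"},
				Status:     appsv1.DeploymentStatus{Replicas: 2, ReadyReplicas: tt.ready},
			}
			vctx := &validators.Context{Clientset: fake.NewSimpleClientset(dep)} // assumed field name
			if err := checkOperatorHealth(vctx); (err != nil) != tt.wantErr {
				t.Fatalf("got err = %v, wantErr = %v", err, tt.wantErr)
			}
		})
	}
}
```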
### Local Testing with Docker
Build and run a validator locally against mounted data:
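For example (the image name, Dockerfile location, and mount paths are illustrative; match whatever the phase's Dockerfile and `LoadContext()` expect):

```bash
# Build the phase image locally
docker build -t aicr-validators/deployment:dev -f validators/deployment/Dockerfile .

# Run a single check with snapshot/recipe data mounted from disk
docker run --rm \
  -v "$PWD/testdata/snapshot:/data/snapshot:ro" \
  -v "$PWD/testdata/recipe:/data/recipe:ro" \
  aicr-validators/deployment:dev my-check
echo "exit code: $?"   # 0 = pass, 1 = fail, 2 = skip
```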
Note: K8s API calls will fail locally unless you mount a kubeconfig. For checks that only read snapshot/recipe data, this works without cluster access.
### Testing with Custom Images
When developing validators, you can build and push a custom image to test on a live cluster before merging.
Edit the embedded catalog to point at your custom image and rebuild the CLI:
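For example (registry, Dockerfile location, and the build target are illustrative):

```bash
# Build and push the modified validator image under a unique dev tag
TAG=dev-$(git rev-parse --short HEAD)
docker build -t ghcr.io/<your-org>/aicr-validators/deployment:$TAG \
  -f validators/deployment/Dockerfile .
docker push ghcr.io/<your-org>/aicr-validators/deployment:$TAG

# Point the catalog entry's image field at that reference, then rebuild the CLI,
# e.g. with your usual build target:
make build
```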
The catalog is embedded in the binary at build time, so a rebuild is required. Revert before pushing:
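For example:

```bash
git checkout -- recipes/validators/catalog.yaml
```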
Use a unique tag for every rebuild. Catalog entries use pinned image tags, which Kubernetes resolves with `imagePullPolicy: IfNotPresent` by default — so re-pushing the same tag (e.g., `:dev`) leaves previously-pulled nodes running the stale image. In dev loops, suffix the tag per iteration (`:dev-v1`, `:dev-v2`, or `:dev-$(git rev-parse --short HEAD)`) so every rebuild forces a fresh pull cluster-wide. Release builds avoid this entirely because `on-tag.yaml` publishes semver tags that are never reused.
### Private Registry Authentication
If your image is in a private registry, create an image pull secret in the validation namespace and pass it to the CLI with `--image-pull-secret`:
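For example (the namespace and secret name are illustrative):

```bash
kubectl create secret docker-registry my-registry-creds \
  --namespace aicr-validation \
  --docker-server=ghcr.io \
  --docker-username="$REGISTRY_USER" \
  --docker-password="$REGISTRY_TOKEN"

aicr validate --phase deployment --image-pull-secret my-registry-creds ...
```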
The secret must be of type `kubernetes.io/dockerconfigjson` and exist in the validation namespace. The `--image-pull-secret` flag can be repeated for multiple registries.
## Checklist
When adding a new upstream check:
- Create `validators/{phase}/my_check.go` implementing `CheckFunc`
- Register in `validators/{phase}/main.go` dispatch map
- Add catalog entry in `recipes/validators/catalog.yaml`
- Add the check name to the recipe's `validation.{phase}.checks[]` (or omit to run all)
- Write table-driven unit tests with fake K8s clients
- Test locally with `docker run` and mounted data
- Run `make test` with race detector
## See Also
- Validator Extension Guide — External validators via `--data`
- Validator Catalog Reference — Catalog schema and entries
- Validator V2 ADR — Architecture decision record
- CLI Reference — Validate command flags