Validator Development Guide

Learn how to add new validation checks to AICR.

Overview

AICR uses a container-per-validator model. Each validation check runs as an isolated Kubernetes Job with access to the cluster, a snapshot, and the recipe. Validators are organized into three phases:

| Phase | Purpose | Example |
| --- | --- | --- |
| deployment | Verify components are installed and healthy | GPU operator pods running, expected resources present |
| performance | Verify system meets performance thresholds | NCCL all-reduce bandwidth (training), AIPerf inference throughput & TTFT p99 (inference+Dynamo) |
| conformance | Verify workload-specific requirements | DRA support, gang scheduling, autoscaling |

Architecture:

  • Declarative Catalog: Validators are defined in recipes/validators/catalog.yaml
  • Container Contract: Exit code 0 = pass, 1 = fail, 2 = skip
  • Evidence via stdout: Check output printed to stdout is captured as CTRF evidence
  • Debug via stderr: Structured logs go to stderr and are streamed to the user
  • CTRF Reports: Results are aggregated into Common Test Report Format JSON

Quick Start

Adding a new check to an existing validator container requires three steps.

Step 1: Implement the Check Function

Create a new file in the appropriate phase directory (e.g., validators/deployment/):

```go
package main

import (
    "fmt"
    "log/slog"

    "github.com/NVIDIA/aicr/pkg/errors"
    "github.com/NVIDIA/aicr/validators"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func checkMyComponent(ctx *validators.Context) error {
    slog.Info("checking my-component health")

    pods, err := ctx.Clientset.CoreV1().Pods("my-namespace").List(
        ctx.Ctx,
        metav1.ListOptions{LabelSelector: "app=my-component"},
    )
    if err != nil {
        return errors.Wrap(errors.ErrCodeInternal, "failed to list pods", err)
    }

    if len(pods.Items) == 0 {
        return errors.New(errors.ErrCodeNotFound, "no my-component pods found")
    }

    // Evidence to stdout (captured in CTRF report)
    fmt.Printf("Found %d my-component pod(s)\n", len(pods.Items))
    for _, pod := range pods.Items {
        fmt.Printf("  %s: %s\n", pod.Name, pod.Status.Phase)
    }

    return nil
}
```

Step 2: Register in main.go

Add the check function to the dispatch map in validators/deployment/main.go:

```go
func main() {
    validators.Run(map[string]validators.CheckFunc{
        "operator-health":    checkOperatorHealth,
        "expected-resources": checkExpectedResources,
        // Add your check here:
        "my-component": checkMyComponent,
    })
}
```

Step 3: Add Catalog Entry

Add an entry to recipes/validators/catalog.yaml:

```yaml
validators:
  # ... existing entries ...

  - name: my-component
    phase: deployment
    description: "Verify my-component pods are running and healthy"
    image: ghcr.io/nvidia/aicr-validators/deployment:latest
    timeout: 2m
    args: ["my-component"]
    env: []
```

The args field must match the key used in the validators.Run() dispatch map.

Container Contract

Every validator container must follow this contract:

Exit Codes

| Code | Meaning | CTRF Status |
| --- | --- | --- |
| 0 | Check passed | passed |
| 1 | Check failed | failed |
| 2 | Check skipped (not applicable) | skipped |

I/O Channels

| Channel | Purpose | Captured By |
| --- | --- | --- |
| stdout | Evidence output (human-readable check results) | CTRF report message field |
| stderr | Debug/progress logs (slog output) | Streamed live to user terminal |
| /dev/termination-log | Failure reason (max 4096 bytes) | CTRF report on failure |

Mounted Data

The validator engine mounts snapshot and recipe data as ConfigMaps:

| Path | Content | Environment Override |
| --- | --- | --- |
| /data/snapshot/snapshot.yaml | Cluster snapshot | AICR_SNAPSHOT_PATH |
| /data/recipe/recipe.yaml | Recipe with constraints | AICR_RECIPE_PATH |

Environment Variables

| Variable | Description |
| --- | --- |
| AICR_NAMESPACE | Validation namespace (fallback if ServiceAccount namespace unavailable) |
| AICR_SNAPSHOT_PATH | Override snapshot mount path |
| AICR_RECIPE_PATH | Override recipe mount path |
| AICR_VALIDATOR_IMAGE_REGISTRY | Override image registry prefix (set by user) |
| AICR_CHECK_TIMEOUT | Parent-context timeout for the check, injected by the Job deployer from the catalog entry’s timeout field (Go duration string, e.g. 30m). Falls back to defaults.CheckExecutionTimeout when unset or malformed; a malformed value is logged at WARN. Use ctx.Ctx (set by LoadContext) to honor it. |
| AICR_VALIDATOR_IMAGE_TAG | Override the resolved image tag (e.g. latest). Bypasses the default :v<version> / :sha-<commit> resolution for feature-branch dev builds whose commit has no published image. |
| AICR_NODE_SELECTOR | User-provided node selector override for inner workloads (comma-separated key=value pairs). Set by the --node-selector CLI flag. Use ctx.NodeSelector to access the parsed value. |
| AICR_TOLERATIONS | User-provided toleration override for inner workloads (comma-separated key=value:effect entries). Set by the --toleration CLI flag. Use ctx.Tolerations to access the parsed value. |
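
The comma-separated key=value format accepted by AICR_NODE_SELECTOR might be parsed along the lines of the sketch below. This is illustrative only; the actual parsing happens inside LoadContext(), and the helper name is an assumption.

```go
// Illustrative only: parse "gpu=true,pool=h100" into a node selector map.
// The real parsing is done by LoadContext(); this sketch just shows the format.
func parseNodeSelector(raw string) map[string]string {
    if raw == "" {
        return nil // nil means "fall back to platform defaults"
    }
    sel := map[string]string{}
    for _, pair := range strings.Split(raw, ",") {
        key, value, ok := strings.Cut(pair, "=")
        if !ok {
            continue // skip malformed entries
        }
        sel[strings.TrimSpace(key)] = strings.TrimSpace(value)
    }
    return sel
}
```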

Context API

The validators.Context struct provides all dependencies a check needs:

```go
type Context struct {
    Ctx           context.Context       // Parent context with timeout
    Cancel        context.CancelFunc    // Release resources (caller must defer)
    Clientset     kubernetes.Interface  // Typed K8s client
    RESTConfig    *rest.Config          // For exec, port-forward, dynamic client
    DynamicClient dynamic.Interface     // For CRD access
    Snapshot      *snapshotter.Snapshot // Captured cluster state
    Recipe        *recipe.RecipeResult  // Recipe with validation config
    Namespace     string                // Validation namespace
    NodeSelector  map[string]string     // User-provided node selector override (nil = use defaults)
    Tolerations   []corev1.Toleration   // User-provided toleration override (nil = use defaults)
}
```

LoadContext() builds this from the container environment: reads mounted ConfigMaps, creates in-cluster K8s clients, and sets the parent-context timeout via validators/context.go:checkTimeoutFromEnv — which honors AICR_CHECK_TIMEOUT (injected by the Job deployer from the catalog entry’s timeout field) and falls back to defaults.CheckExecutionTimeout when unset or malformed.
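
A minimal sketch of that timeout resolution, assuming exactly the fallback behavior described above (the real code lives in validators/context.go):

```go
// Sketch: resolve the parent-context timeout from AICR_CHECK_TIMEOUT,
// falling back to the package default when the variable is unset or malformed.
func checkTimeoutFromEnv() time.Duration {
    raw := os.Getenv("AICR_CHECK_TIMEOUT")
    if raw == "" {
        return defaults.CheckExecutionTimeout
    }
    d, err := time.ParseDuration(raw)
    if err != nil {
        slog.Warn("malformed AICR_CHECK_TIMEOUT, using default", "value", raw, "error", err)
        return defaults.CheckExecutionTimeout
    }
    return d
}
```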

Scheduling Overrides

When creating inner workloads (pods, Jobs, TrainJobs), check ctx.NodeSelector and ctx.Tolerations before applying hardcoded platform selectors. If non-nil, these override the default scheduling constraints to support clusters with non-standard GPU node labels or taints.

```go
// Apply scheduling overrides when creating inner workload pods.
nodeSelector := map[string]string{"cloud.google.com/gke-accelerator": "nvidia-h100-mega-80gb"}
if ctx.NodeSelector != nil {
    nodeSelector = ctx.NodeSelector // user override replaces platform default
}

tolerations := []corev1.Toleration{{Operator: corev1.TolerationOpExists}}
if ctx.Tolerations != nil {
    tolerations = ctx.Tolerations // user override replaces default tolerate-all
}
```

Validators that use nodeName pinning (e.g., nvidia-smi, DRA isolation) bypass the scheduler entirely and should not apply ctx.NodeSelector.
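
For contrast, a pinned inner pod might look like the fragment below; the image variable and pod naming are assumptions for illustration:

```go
// nodeName pinning: the pod is bound directly to the target node, so the
// scheduler never runs and ctx.NodeSelector would have no effect here.
pod := &corev1.Pod{
    ObjectMeta: metav1.ObjectMeta{Name: "nvidia-smi-" + node.Name},
    Spec: corev1.PodSpec{
        NodeName:      node.Name, // bypass the scheduler entirely
        RestartPolicy: corev1.RestartPolicyNever,
        Containers: []corev1.Container{{
            Name:    "nvidia-smi",
            Image:   cudaImage, // resolved elsewhere; illustrative
            Command: []string{"nvidia-smi"},
        }},
    },
}
```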

Helper Methods

ctx.Timeout(d) — Create a child context with a specific timeout:

```go
subCtx, cancel := ctx.Timeout(30 * time.Second)
defer cancel()
pods, err := ctx.Clientset.CoreV1().Pods(ns).List(subCtx, opts)
```

Runner Utilities

validators.Run(checks) — Main entry point for validator containers. Handles context loading, check dispatch by os.Args[1], exit codes, and termination log writing.
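
In outline, Run behaves roughly like the sketch below (simplified: the real runner also writes /dev/termination-log on failure, and the isSkip helper shown here is an assumption about how the Skip sentinel is detected):

```go
// Simplified outline: dispatch by os.Args[1], then map the check result
// onto the exit-code contract (0 pass, 1 fail, 2 skip).
func Run(checks map[string]CheckFunc) {
    ctx, err := LoadContext()
    if err != nil {
        slog.Error("failed to load validator context", "error", err)
        os.Exit(1)
    }
    defer ctx.Cancel()

    check, ok := checks[os.Args[1]]
    if !ok {
        slog.Error("unknown check", "name", os.Args[1])
        os.Exit(1)
    }

    switch err := check(ctx); {
    case err == nil:
        os.Exit(0) // passed
    case isSkip(err): // illustrative: detect the sentinel returned by Skip()
        os.Exit(2) // skipped
    default:
        slog.Error("check failed", "error", err)
        os.Exit(1) // failed
    }
}
```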

validators.Skip(reason) — Return from a CheckFunc to indicate the check is not applicable. The runner exits with code 2:

```go
func checkFeatureX(ctx *validators.Context) error {
    if ctx.Recipe.Validation == nil {
        return validators.Skip("no validation section in recipe")
    }
    // ... actual check logic ...
    return nil
}
```

Catalog Entry Schema

Each entry in recipes/validators/catalog.yaml:

```yaml
- name: operator-health             # Unique identifier, used in Job names
  phase: deployment                 # deployment | performance | conformance
  description: "Human-readable"     # Shown in CTRF report
  image: ghcr.io/.../img:latest     # OCI image reference
  timeout: 2m                       # Job activeDeadlineSeconds
  args: ["operator-health"]         # Container args (check name)
  env:                              # Optional environment variables
    - name: MY_VAR
      value: "my-value"
  resources:                        # Optional resource requests (omit for defaults)
    cpu: "100m"
    memory: "128Mi"
```

Image tag resolution (applied by catalog.Load):

  1. :latest tags are replaced with the CLI version (e.g., :v0.9.5) for release builds
  2. On non-release dev builds with a valid commit, :latest becomes :sha-<commit> (matches the tags on-push.yaml pushes for merges to main)
  3. Explicit version tags (e.g., :v1.2.3) are not modified by steps 1-2
  4. AICR_VALIDATOR_IMAGE_TAG overrides the resolved tag on every validator image, including explicit catalog tags. Use this when running aicr validate from a feature-branch dev build whose commit has not been merged to main (no :sha-<commit> image has been published). Typical value: latest. Example: AICR_VALIDATOR_IMAGE_TAG=latest aicr validate --phase performance ...
  5. AICR_VALIDATOR_IMAGE_REGISTRY overrides the registry prefix

Digest-pinned references (name@sha256:…) are not rewritten by step 4. A tag override is meaningless against a content-addressable pin, and naive rewriting would corrupt the digest. Step 5’s registry override still applies — only the registry prefix changes, the digest is preserved verbatim.

Env-var forwarding to the validator pod: AICR_CLI_VERSION, AICR_CLI_COMMIT, AICR_VALIDATOR_IMAGE_REGISTRY, and AICR_VALIDATOR_IMAGE_TAG are forwarded from the CLI invocation into the validator container so that validators resolving inner workload images at runtime (e.g. inference-perf’s AIPerf benchmark Job) apply the same semantics as catalog.Load. If you set AICR_VALIDATOR_IMAGE_TAG=latest on the CLI, the override reaches both the outer validator Job and the inner benchmark Job — they always travel together.

Pull-policy behavior when the override is set: both the outer validator Job and every inner workload Job it dispatches route through the shared catalog.ImagePullPolicy(image) helper (pkg/validator/catalog/catalog.go). The rule, in precedence order, is:

  1. Side-loaded refs (ko.local/*, kind.local/*) → Never (no registry to pull from).
  2. Digest-pinned refs (name@sha256:…) → IfNotPresent. Cryptographic immutability means a cached copy is always correct; forcing Always here would make kubelet re-contact the registry every run, which breaks disconnected / air-gapped clusters even though the image itself was never overridden.
  3. AICR_VALIDATOR_IMAGE_TAG is set → Always. Override values are typically mutable (latest, edge, main, or any tag on-push.yaml recreates on every merge), so IfNotPresent would let a node’s previously cached image win over the tag’s current target.
  4. :latest suffix → Always. Mutable tag by convention.
  5. Otherwise → IfNotPresent. Versioned tag assumed immutable enough that caching is a win.

Callers in this repo: the outer validator Job’s Deployer.imagePullPolicy() (pkg/validator/job/deployer.go) and the inner AIPerf benchmark pod spec in buildAIPerfJob (validators/performance/inference_perf_constraint.go). They both delegate to the same helper so their policy can’t drift. When adding a new inner workload Job in validators/<phase>/*, set ImagePullPolicy: catalog.ImagePullPolicy(<resolved image>) on the container to keep the invariant.
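
Read together, the rules reduce to a small switch. The sketch below restates the precedence order; it is not the verbatim source:

```go
// Sketch of catalog.ImagePullPolicy's precedence order as described above.
func ImagePullPolicy(image string) corev1.PullPolicy {
    switch {
    case strings.HasPrefix(image, "ko.local/"), strings.HasPrefix(image, "kind.local/"):
        return corev1.PullNever // side-loaded: no registry to pull from
    case strings.Contains(image, "@sha256:"):
        return corev1.PullIfNotPresent // digest-pinned: a cached copy is always correct
    case os.Getenv("AICR_VALIDATOR_IMAGE_TAG") != "":
        return corev1.PullAlways // override tags are typically mutable
    case strings.HasSuffix(image, ":latest"):
        return corev1.PullAlways // mutable tag by convention
    default:
        return corev1.PullIfNotPresent // versioned tag, assumed effectively immutable
    }
}
```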

Performance phase example — inference perf:

```yaml
- name: inference-perf
  phase: performance
  description: "Verify inference throughput and TTFT p99 meet thresholds using AIPerf"
  image: ghcr.io/nvidia/aicr-validators/performance:latest
  timeout: 50m
  args: ["inference-perf"]
```

Paired constraints in an overlay (one per metric the check produces):

```yaml
validation:
  performance:
    checks: [inference-perf]
    constraints:
      - name: inference-throughput   # output tokens/sec, >= threshold
        value: ">= 5000"
      - name: inference-ttft-p99     # time-to-first-token p99 in ms, <= threshold
        value: "<= 200"
```
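
Evaluating a measured metric against one of these constraint strings, with the 10% tolerance noted in the lifecycle below, might look roughly like this; the helper and its parsing are illustrative, not the actual implementation:

```go
// Illustrative: check a measured value against a ">= N" / "<= N" constraint,
// allowing a 10% tolerance in the direction of the threshold.
func meetsConstraint(measured float64, constraint string) (bool, error) {
    op, rest, ok := strings.Cut(strings.TrimSpace(constraint), " ")
    if !ok {
        return false, fmt.Errorf("malformed constraint %q", constraint)
    }
    threshold, err := strconv.ParseFloat(strings.TrimSpace(rest), 64)
    if err != nil {
        return false, fmt.Errorf("malformed threshold in %q: %w", constraint, err)
    }
    switch op {
    case ">=":
        return measured >= threshold*0.9, nil // tolerate 10% below the floor
    case "<=":
        return measured <= threshold*1.1, nil // tolerate 10% above the ceiling
    default:
        return false, fmt.Errorf("unsupported operator %q", op)
    }
}
```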

Performance Validators

Two performance checks ship today, both registered in validators/performance/main.go:

| Check | Intent | Workload | Constraints |
| --- | --- | --- | --- |
| nccl-all-reduce-bw | training | NCCL all_reduce_perf under a Kubeflow TrainJob | nccl-all-reduce-bw >= N GB/s |
| inference-perf | inference+Dynamo | DynamoGraphDeployment (vLLM, Qwen/Qwen3-0.6B) + AIPerf Job | inference-throughput >= N tok/s, inference-ttft-p99 <= N ms |

Both follow a consistent lifecycle:

  1. Deploy a fresh benchmark workload. inference-perf always provisions its own DynamoGraphDeployment into a per-run namespace (aicr-inference-perf-<hash>) derived from AICR_RUN_ID, so two concurrent runs cannot collide and a prior run’s leftovers cannot be silently adopted. An earlier design sketch had a “discover existing frontend” path — it was intentionally dropped because it admitted ambiguity about which service was being benchmarked on shared clusters.
  2. Wait for readiness via the watch API (not polling) on the workload CR’s status.
  3. Run the benchmark in a K8s Job, capturing stdout with sentinels that survive noisy logs.
  4. Parse and evaluate against recipe constraints with a 10% tolerance.
  5. Defer cleanup — the per-run namespace is torn down on both success and failure so leaked workloads from interrupted prior runs are reaped on the next invocation.
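
Step 1’s per-run namespace could be derived along these lines; the hash scheme shown is an assumption, not the actual implementation:

```go
// Illustrative: derive a short, collision-resistant namespace suffix from AICR_RUN_ID.
func perRunNamespace(runID string) string {
    sum := sha256.Sum256([]byte(runID))
    return fmt.Sprintf("aicr-inference-perf-%x", sum[:4]) // e.g. aicr-inference-perf-1a2b3c4d
}
```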

The inference check injects pod-scheduling (nodeSelector, tolerations, DRA resourceClaims) into the unstructured DynamoGraphDeployment programmatically rather than via text substitution, to avoid YAML-escape issues with taint values.
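
A hedged sketch of that pattern using the unstructured helpers, where dgd is assumed to be the *unstructured.Unstructured decoded from the testdata template and the field path is an assumption about the CR layout:

```go
// Inject the user's nodeSelector into the unstructured DynamoGraphDeployment
// programmatically, avoiding YAML text substitution and its escaping pitfalls.
// The field path is illustrative; the real CR schema may differ.
if ctx.NodeSelector != nil {
    sel := map[string]interface{}{}
    for k, v := range ctx.NodeSelector {
        sel[k] = v
    }
    if err := unstructured.SetNestedMap(dgd.Object, sel,
        "spec", "services", "frontend", "extraPodSpec", "nodeSelector"); err != nil {
        return errors.Wrap(errors.ErrCodeInternal, "failed to set nodeSelector", err)
    }
}
```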

AIPerf runner image. The benchmark Job spawned by inference-perf pulls a pre-built image (ghcr.io/nvidia/aicr-validators/aiperf-bench:<tag>) with aiperf already pip install-ed. The image is published by the same on-tag.yaml workflow that publishes the three Go validator images; its Dockerfile at validators/performance/aiperf-bench.Dockerfile pins the AIPERF_VERSION build arg. Baking the install at release time (rather than pip install on every benchmark pod) removes the PyPI runtime dependency, eliminates a ~30 s warmup, and keeps the check air-gap-friendly on clusters with only ghcr.io access.

Code Walkthrough

The operator_health.go check demonstrates the standard pattern:

```go
// validators/deployment/operator_health.go

func checkOperatorHealth(ctx *validators.Context) error {
    // 1. Use slog for debug output (goes to stderr, streamed to user)
    slog.Info("listing pods", "namespace", gpuOperatorNamespace)

    // 2. Use ctx.Clientset for K8s API calls
    pods, err := ctx.Clientset.CoreV1().Pods(gpuOperatorNamespace).List(
        ctx.Ctx,
        metav1.ListOptions{LabelSelector: gpuOperatorLabel},
    )
    if err != nil {
        // 3. Return wrapped errors for failures
        return errors.Wrap(errors.ErrCodeInternal, "failed to list pods", err)
    }

    // 4. Print evidence to stdout (captured in CTRF report)
    fmt.Printf("Found %d gpu-operator pod(s):\n", len(pods.Items))
    runningCount := 0
    for _, pod := range pods.Items {
        fmt.Printf("  %s: %s\n", pod.Name, pod.Status.Phase)
        if pod.Status.Phase == corev1.PodRunning {
            runningCount++
        }
    }

    // 5. Return nil for pass, non-nil error for fail
    if runningCount == 0 {
        return errors.New(errors.ErrCodeInternal, "no pods in Running state")
    }
    return nil
}
```

Key patterns:

  • slog.* → stderr → streamed live to user
  • fmt.Printf → stdout → captured as CTRF evidence
  • return nil → exit 0 → passed
  • return errors.* → exit 1 → failed (message written to termination log)
  • return validators.Skip(reason) → exit 2 → skipped

Directory Layout

```
validators/
├── context.go                        # Shared Context type and LoadContext()
├── runner.go                         # Run() entry point, exit code handling
├── deployment/                       # Deployment phase validators
│   ├── main.go                       # Check dispatch map
│   ├── Dockerfile                    # Container image build
│   ├── operator_health.go            # Individual check implementation
│   ├── expected_resources.go
│   └── ...
├── performance/                      # Performance phase validators
│   ├── main.go                       # Registers nccl-all-reduce-bw, inference-perf
│   ├── Dockerfile
│   ├── nccl_all_reduce_bw.go         # Training: NCCL CheckFunc wrapper
│   ├── nccl_all_reduce_bw_constraint.go   # Training: NCCL pipeline (deploy → bench → parse)
│   ├── inference_perf.go             # Inference: AIPerf CheckFunc wrapper (constraint eval)
│   ├── inference_perf_constraint.go  # Inference: Dynamo deploy → AIPerf → parse pipeline
│   ├── aiperf-bench.Dockerfile       # Pre-built AIPerf benchmark runner image
│   └── testdata/                     # Workload YAML templates (NCCL TrainJob, Dynamo CR, DRA claim)
├── conformance/                      # Conformance phase validators
│   ├── main.go
│   ├── Dockerfile
│   └── ...
└── chainsaw/                         # Chainsaw test runner utilities
    └── ...
```

Each phase directory produces one container image. Multiple checks are compiled into a single binary and selected via the first argument.

Testing

Unit Tests

Use fake K8s clients for isolated testing:

```go
func TestCheckMyComponent(t *testing.T) {
    tests := []struct {
        name    string
        pods    []corev1.Pod
        wantErr bool
    }{
        {
            name: "healthy pods",
            pods: []corev1.Pod{
                {
                    ObjectMeta: metav1.ObjectMeta{
                        Name:   "my-pod",
                        Labels: map[string]string{"app": "my-component"},
                    },
                    Status: corev1.PodStatus{Phase: corev1.PodRunning},
                },
            },
            wantErr: false,
        },
        {
            name:    "no pods found",
            pods:    []corev1.Pod{},
            wantErr: true,
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            objects := make([]runtime.Object, len(tt.pods))
            for i := range tt.pods {
                objects[i] = &tt.pods[i]
            }
            ctx := &validators.Context{
                Ctx:       context.TODO(),
                Clientset: fake.NewClientset(objects...),
                Namespace: "test",
            }
            err := checkMyComponent(ctx)
            if (err != nil) != tt.wantErr {
                t.Errorf("error = %v, wantErr %v", err, tt.wantErr)
            }
        })
    }
}
```

Local Testing with Docker

Build and run a validator locally against mounted data:

```bash
# Build the validator image
docker build -f validators/deployment/Dockerfile -t my-validator .

# Run with mounted snapshot and recipe
docker run --rm \
  -v ./snapshot.yaml:/data/snapshot/snapshot.yaml \
  -v ./recipe.yaml:/data/recipe/recipe.yaml \
  my-validator my-component

# Check exit code
echo $?   # 0=pass, 1=fail, 2=skip
```

Note: K8s API calls will fail locally unless you mount a kubeconfig. For checks that only read snapshot/recipe data, this works without cluster access.

Testing with Custom Images

When developing validators, you can build and push a custom image to test on a live cluster before merging.

Edit the embedded catalog to point at your custom image and rebuild the CLI:

```yaml
# recipes/validators/catalog.yaml
  - name: nccl-all-reduce-bw
    phase: performance
    image: my-registry.example.com/my-validator:dev   # custom image
    timeout: 30m
    args: ["nccl-all-reduce-bw"]
```

```bash
make build
./dist/aicr_*/aicr validate --recipe recipe.yaml --snapshot snapshot.yaml \
  --image-pull-secret my-registry-secret
```

The catalog is embedded in the binary at build time, so a rebuild is required. Revert before pushing:

```bash
git checkout -- recipes/validators/catalog.yaml
```

Use a unique tag for every rebuild. Catalog entries use pinned image tags, which Kubernetes resolves with imagePullPolicy: IfNotPresent by default — so re-pushing the same tag (e.g., :dev) leaves previously-pulled nodes running the stale image. In dev loops, suffix the tag per iteration (:dev-v1, :dev-v2, or :dev-$(git rev-parse --short HEAD)) so every rebuild forces a fresh pull cluster-wide. Release builds avoid this entirely because on-tag.yaml publishes semver tags that are never reused.

Private Registry Authentication

If your image is in a private registry, create an image pull secret in the validation namespace and pass it to the CLI with --image-pull-secret:

```bash
# Create the secret (use --dry-run=client | apply for idempotent create-or-update)
kubectl create secret docker-registry my-registry-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=$NGC_API_KEY \
  -n aicr-validation \
  --dry-run=client -o yaml | kubectl apply -f -

# Run validation with the secret
aicr validate \
  --recipe recipe.yaml \
  --snapshot snapshot.yaml \
  --image-pull-secret my-registry-secret
```

The secret must be of type kubernetes.io/dockerconfigjson and exist in the validation namespace. The --image-pull-secret flag can be repeated for multiple registries.

Checklist

When adding a new upstream check:

  1. Create validators/{phase}/my_check.go implementing CheckFunc
  2. Register in validators/{phase}/main.go dispatch map
  3. Add catalog entry in recipes/validators/catalog.yaml
  4. Add the check name to the recipe’s validation.{phase}.checks[] (or omit to run all)
  5. Write table-driven unit tests with fake K8s clients
  6. Test locally with docker run and mounted data
  7. Run make test with race detector

See Also