Validator Extension | NVIDIA AI Cluster Runtime

Learn how to add custom validators and override embedded ones using the --data flag.

Overview

Validators follow the same extensibility model as components. The --data flag points to a directory containing custom resources that merge with (or override) the embedded ones. For validators, this means providing a validators/catalog.yaml in your data directory.

my-data/
├── validators/
│   └── catalog.yaml          # Custom/override validator entries
├── overlays/                  # Custom recipe overlays (optional)
├── components/                # Custom component values (optional)
└── registry.yaml              # Custom component registry (optional)

External catalog entries merge with embedded entries at load time. If an external entry has the same name as an embedded one, the external entry replaces it.
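The merge rule can be sketched in a few lines of Python. This is an illustration of the name-based replacement described above, not the actual loader: the function name `merge_catalogs`, the embedded image names, and the ordering of the merged list are assumptions.

```python
from typing import Dict, List

Entry = Dict[str, object]


def merge_catalogs(embedded: List[Entry], external: List[Entry]) -> List[Entry]:
    """Merge validator entries by name; an external entry replaces an embedded one."""
    merged: Dict[str, Entry] = {}
    for entry in embedded:
        merged[str(entry["name"])] = entry
    for entry in external:
        # Same name: the external entry replaces the whole embedded entry.
        merged[str(entry["name"])] = entry
    return list(merged.values())


# Hypothetical embedded entries (image names are placeholders).
embedded = [
    {"name": "operator-health", "phase": "deployment", "image": "embedded/operator-health:v1"},
    {"name": "expected-resources", "phase": "deployment", "image": "embedded/expected-resources:v1"},
]
external = [
    {"name": "operator-health", "phase": "deployment",
     "image": "my-registry.example.com/custom-operator-health:v1.0.0"},
    {"name": "my-custom-check", "phase": "deployment",
     "image": "my-registry.example.com/my-validator:v1.0.0"},
]

merged = merge_catalogs(embedded, external)
print([e["name"] for e in merged])
# → ['operator-health', 'expected-resources', 'my-custom-check']
```

Note that the replacement is whole-entry: no fields of the embedded operator-health entry survive in the merged result.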

Adding a Custom Validator

Step 1: Write the Validator

A validator is any container that follows the exit code contract:

Exit Code	Meaning
`0`	Check passed
`1`	Check failed
`2`	Check skipped

The container receives:

Snapshot at /data/snapshot/snapshot.yaml
Recipe at /data/recipe/recipe.yaml
Kubernetes API access via in-cluster ServiceAccount

Evidence output goes to stdout. Debug logs go to stderr. On failure, write a reason to /dev/termination-log (max 4096 bytes).
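As a minimal sketch of this contract in Python, the helper below maps a result to its exit code, prints evidence to stdout, and writes a failure reason to the termination log, truncated to 4096 bytes. The names `Result`, `truncate_reason`, and `finish` are hypothetical and not part of AICR.

```python
from enum import IntEnum

TERMINATION_LOG = "/dev/termination-log"
MAX_REASON_BYTES = 4096


class Result(IntEnum):
    """Exit codes from the validator contract."""
    PASSED = 0
    FAILED = 1
    SKIPPED = 2


def truncate_reason(reason: str, limit: int = MAX_REASON_BYTES) -> bytes:
    """Encode the reason as UTF-8 and cut it to at most `limit` bytes."""
    data = reason.encode("utf-8")[:limit]
    # Drop a trailing partial multi-byte character left by the cut.
    return data.decode("utf-8", errors="ignore").encode("utf-8")


def finish(result: Result, evidence: str, reason: str = "",
           log_path: str = TERMINATION_LOG) -> int:
    """Print evidence to stdout, record a failure reason, return the exit code."""
    print(evidence)
    if result is Result.FAILED and reason:
        with open(log_path, "wb") as fh:
            fh.write(truncate_reason(reason))
    return int(result)
```

A validator's entry point would then end with something like `sys.exit(finish(Result.FAILED, evidence, "operator pod not ready"))`.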

Step 2: Build and Push the Image

$ docker build -t my-registry.example.com/my-validator:v1.0.0 .
$ docker push my-registry.example.com/my-validator:v1.0.0

Step 3: Create a Catalog Entry

Create my-data/validators/catalog.yaml:

apiVersion: aicr.nvidia.com/v1
kind: ValidatorCatalog
metadata:
  name: custom-validators
  version: "1.0.0"
validators:
  - name: my-custom-check
    phase: deployment
    description: "Verify my custom deployment requirement"
    image: my-registry.example.com/my-validator:v1.0.0
    timeout: 5m
    args: ["check"]
    env: []

Co-locating with a dependency (dependencyAffinity)

A validator catalog entry can declare dependencyAffinity to control where its orchestrator Pod is scheduled relative to the components it queries. Use this when a check’s correctness depends on pod-to-pod network reachability — the canonical case is ai-service-metrics, which dials Prometheus over a ClusterIP Service.

- name: ai-service-metrics
  phase: conformance
  image: ghcr.io/nvidia/aicr-validators/conformance:latest
  timeout: 5m
  args: ["ai-service-metrics"]
  dependencyAffinity:
    - componentRef: kube-prometheus-stack
      podLabelSelector:
        app.kubernetes.io/name: prometheus
      requirement: preferred       # or "required"; default "preferred"
      topologyKey: kubernetes.io/hostname  # default

Fields:

componentRef (required) — the name of a component in the recipe. The deployer resolves it to a namespace at spawn time using the resolved recipe’s componentRefs. If the named component is not in the recipe and requirement: required is set, the run fails before any Job is deployed with ErrCodeInvalidRequest — fix the recipe (add the component) or drop the validator from the validation phase.
podLabelSelector (required) — labels that match the dependency’s pods. All key/value pairs must match.
requirement (optional, default preferred) — required emits requiredDuringSchedulingIgnoredDuringExecution; the scheduler will refuse to place the orchestrator anywhere else. preferred emits preferredDuringSchedulingIgnoredDuringExecution with weight 100; the scheduler treats it as the dominant scoring signal but can fall back to another node if the dependency is unschedulable.
topologyKey (optional, default kubernetes.io/hostname) — the node label whose value defines co-location. The default pins to the same node; use topology.kubernetes.io/zone for zone-level locality.

When in doubt, prefer preferred. The high weight (100) dominates the scheduler’s scoring on the first run; after that run the validator image is already cached on the co-located node, so image locality supports (rather than opposes) the affinity. Use required only when the check has no chance of succeeding without exact co-location.
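To make the mapping concrete, here is a sketch of how one dependencyAffinity entry could translate into a Kubernetes podAffinity stanza. The field names inside the returned dict are the standard Kubernetes ones; the function `build_pod_affinity`, the use of the `namespaces` field, and the `monitoring` namespace in the usage note are assumptions, not the deployer's actual code.

```python
from typing import Any, Dict


def build_pod_affinity(
    namespace: str,
    pod_labels: Dict[str, str],
    requirement: str = "preferred",
    topology_key: str = "kubernetes.io/hostname",
) -> Dict[str, Any]:
    """Build a podAffinity stanza for one dependencyAffinity entry."""
    term = {
        "labelSelector": {"matchLabels": dict(pod_labels)},
        "namespaces": [namespace],  # namespace resolved from componentRef
        "topologyKey": topology_key,
    }
    if requirement == "required":
        return {"podAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [term],
        }}
    if requirement == "preferred":
        return {"podAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": 100, "podAffinityTerm": term},
            ],
        }}
    raise ValueError(f"requirement must be 'required' or 'preferred', got {requirement!r}")
```

Calling `build_pod_affinity("monitoring", {"app.kubernetes.io/name": "prometheus"})` yields a preferred term with weight 100 and the default hostname topology key.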

See #933 for the motivating case: on multi-Security-Group EKS clusters where customer-workload and system-workload ENIs sit in separate SGs with asymmetric ingress rules, the orchestrator’s ability to dial Prometheus depends entirely on which node the scheduler picks.

Step 4: Reference in Recipe

Add the check to your recipe’s validation section:

validation:
  deployment:
    checks:
      - operator-health        # Embedded validator
      - expected-resources     # Embedded validator
      - my-custom-check        # Your custom validator

If you omit the checks list, all catalog entries for the phase run, both embedded and custom.
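The selection rule can be sketched as follows. The function `select_validators` is hypothetical; treating an unknown check name as an error, and an explicitly empty list as "run nothing", are assumptions rather than documented behavior.

```python
from typing import Dict, List, Optional

Entry = Dict[str, str]


def select_validators(
    catalog: List[Entry], phase: str, checks: Optional[List[str]] = None
) -> List[Entry]:
    """Pick the catalog entries to run for one validation phase."""
    in_phase = [e for e in catalog if e["phase"] == phase]
    if checks is None:
        # No checks list in the recipe: run every entry for the phase.
        return in_phase
    by_name = {e["name"]: e for e in in_phase}
    missing = [name for name in checks if name not in by_name]
    if missing:
        # Assumption: an unknown check name is an error, not silently ignored.
        raise KeyError(f"checks not in catalog for phase {phase!r}: {missing}")
    return [by_name[name] for name in checks]
```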

Step 5: Run Validation

$ aicr validate \
>   --recipe recipe.yaml \
>   --snapshot snapshot.yaml \
>   --data ./my-data \
>   --phase deployment

Overriding Embedded Validators

To replace an embedded validator with a custom implementation, use the same name:

# my-data/validators/catalog.yaml
apiVersion: aicr.nvidia.com/v1
kind: ValidatorCatalog
metadata:
  name: custom-validators
  version: "1.0.0"
validators:
  - name: operator-health              # Same name as embedded entry
    phase: deployment
    description: "Custom operator health check with extended diagnostics"
    image: my-registry.example.com/custom-operator-health:v1.0.0
    timeout: 5m
    args: ["check"]
    env: []

The external entry replaces the embedded operator-health validator entirely.

Language-Agnostic Contract

The validator contract is a process convention, not a Go interface. Any language works as long as the container follows the exit code and I/O contract.

Bash Example

#!/usr/bin/env bash
set -euo pipefail

# Read snapshot data (mounted by the validator engine)
SNAPSHOT="/data/snapshot/snapshot.yaml"

if [[ ! -f "$SNAPSHOT" ]]; then
  echo "snapshot not found" > /dev/termination-log
  exit 1
fi

# Check: verify GPU driver version from snapshot
DRIVER_VERSION=$(yq '.measurements[] | select(.type == "GPU") | .subtypes[] | select(.name == "smi") | .data.driver_version' "$SNAPSHOT")

if [[ -z "$DRIVER_VERSION" ]]; then
  echo "GPU driver version not found in snapshot" > /dev/termination-log
  exit 1
fi

REQUIRED="550.90"

# Evidence to stdout
echo "GPU driver version: $DRIVER_VERSION"
echo "Required minimum:   $REQUIRED"

# Compare versions
if printf '%s\n%s' "$REQUIRED" "$DRIVER_VERSION" | sort -V | head -1 | grep -qx "$REQUIRED"; then
  echo "PASS: driver version meets requirement"
  exit 0
else
  MSG="FAIL: driver $DRIVER_VERSION < required $REQUIRED"
  echo "$MSG"
  echo "$MSG" > /dev/termination-log
  exit 1
fi

Dockerfile:

FROM alpine:3.21
RUN apk add --no-cache bash yq
COPY check.sh /check.sh
RUN chmod +x /check.sh
# Run as a numeric non-root UID so the image satisfies runAsNonRoot: true
USER 65534:65534
ENTRYPOINT ["/check.sh"]

Catalog entry:

- name: gpu-driver-version
  phase: deployment
  description: "Verify GPU driver meets minimum version"
  image: my-registry.example.com/gpu-driver-check:v1.0.0
  timeout: 1m
  args: []
  env: []
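Because the contract is language-agnostic, the version comparison from the Bash example can be written just as easily in another language. The Python sketch below covers only the comparison step; the function names are hypothetical, and it assumes purely numeric, dot-separated driver versions (any other format raises ValueError).

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Split a dotted version such as '550.127.05' into integer parts."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(found: str, required: str) -> bool:
    """True when the found version is at least the required version."""
    return parse_version(found) >= parse_version(required)
```

For example, `meets_minimum("550.127.05", "550.90")` is true because 127 is compared as a number against 90, while `meets_minimum("550.54.15", "550.90")` is false.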

Image Requirements

Must run as non-root (validator Jobs use runAsNonRoot: true)
Must handle the mounted data paths (/data/snapshot/, /data/recipe/)
Should respect timeout — the Job has activeDeadlineSeconds set from the catalog entry
Should write meaningful evidence to stdout for the CTRF report
Must use explicit image tags (not :latest) for reproducibility in external catalogs
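Two of these requirements are easy to check before publishing a catalog. The sketch below converts a catalog timeout into the seconds value a Job deadline would need, and flags image references without an explicit tag. Both function names are hypothetical. The accepted timeout format (Go-style strings such as 5m or 1m30s) and the decision to accept digest references are assumptions.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def timeout_to_seconds(timeout: str) -> int:
    """Convert a timeout such as '5m' or '1m30s' into whole seconds."""
    parts = re.findall(r"(\d+)([smh])", timeout)
    # Reject input with anything other than number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != timeout:
        raise ValueError(f"unsupported timeout: {timeout!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def has_explicit_tag(image: str) -> bool:
    """True when the image pins a tag other than 'latest'."""
    if "@" in image:
        # Assumption: a digest reference (repo@sha256:...) counts as pinned.
        return True
    last = image.rsplit("/", 1)[-1]  # ignore a ':port' in the registry host
    if ":" not in last:
        return False  # no tag means an implicit ':latest'
    return last.rsplit(":", 1)[1] not in ("", "latest")
```

With the entries from this guide, `timeout_to_seconds("5m")` gives 300, and `has_explicit_tag` accepts `my-registry.example.com/my-validator:v1.0.0` but rejects an image ending in `:latest`.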

Private Registries

If your validator image is in a private registry, use --image-pull-secret:

$ aicr validate \
>   --recipe recipe.yaml \
>   --data ./my-data \
>   --image-pull-secret my-registry-secret

The secret must exist in the validation namespace and be of type kubernetes.io/dockerconfigjson.
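For reference, a Secret of that type wraps a Docker config JSON document in its `.dockerconfigjson` data key. The Python sketch below builds such a manifest as a plain dict. The function name and all argument values are placeholders, and the name of the validation namespace is left to the caller because it is not specified here.

```python
import base64
import json
from typing import Any, Dict


def docker_config_secret(
    name: str, namespace: str, registry: str, username: str, password: str
) -> Dict[str, Any]:
    """Build a kubernetes.io/dockerconfigjson Secret manifest as a dict."""
    auth = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    config = {
        "auths": {
            registry: {"username": username, "password": password, "auth": auth},
        }
    }
    encoded = base64.b64encode(json.dumps(config).encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/dockerconfigjson",
        "data": {".dockerconfigjson": encoded},
    }
```

The resulting dict can be serialized to YAML or JSON and applied to the cluster; the `name` must match the value passed to --image-pull-secret.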
