Validator Extension Guide

Learn how to add custom validators and override embedded ones using the --data flag.

Overview

Validators follow the same extensibility model as components. The --data flag points to a directory containing custom resources that merge with (or override) the embedded ones. For validators, this means providing a validators/catalog.yaml in your data directory.

my-data/
├── validators/
│   └── catalog.yaml    # Custom/override validator entries
├── overlays/           # Custom recipe overlays (optional)
├── components/         # Custom component values (optional)
└── registry.yaml       # Custom component registry (optional)

External catalog entries merge with embedded entries at load time. If an external entry has the same name as an embedded one, the external entry replaces it.
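
The merge rule can be pictured with a small sketch. It is not the tool's actual loader: it assumes a simplified one-line-per-entry format (`<name> <image>`) instead of YAML, and the `embedded/...` image names are hypothetical placeholders. It only illustrates that an external entry replaces an embedded entry of the same name, while new names are appended.

```shell
# merge_catalogs EMBEDDED_FILE EXTERNAL_FILE
# Each input line is "<name> <image>" (a simplification of catalog.yaml).
# Prints the merged catalog: on a name clash the later (external) file wins,
# and entries keep the position where their name was first seen.
merge_catalogs() {
  awk '
    NF == 0 { next }                      # skip blank lines
    !($1 in entry) { order[++n] = $1 }    # remember first-seen order
    { entry[$1] = $0 }                    # later file overrides earlier one
    END { for (i = 1; i <= n; i++) print entry[order[i]] }
  ' "$1" "$2"
}

embedded=$(mktemp)
external=$(mktemp)
# Hypothetical embedded entries.
printf '%s\n' \
  'operator-health embedded/operator-health:v1' \
  'expected-resources embedded/expected-resources:v1' > "$embedded"
# External entries: one override, one new validator.
printf '%s\n' \
  'operator-health my-registry.example.com/custom-operator-health:v1.0.0' \
  'my-custom-check my-registry.example.com/my-validator:v1.0.0' > "$external"

merged=$(merge_catalogs "$embedded" "$external")
printf '%s\n' "$merged"
rm -f "$embedded" "$external"
```

The merged output lists `operator-health` with the external image, keeps `expected-resources` unchanged, and adds `my-custom-check` at the end.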

Adding a Custom Validator

Step 1: Write the Validator

A validator is any container that follows the exit code contract:

Exit Code    Meaning
0            Check passed
1            Check failed
2            Check skipped
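
When trying a validator locally before building an image, the exit code table can be read as a simple mapping. The helper names below (`status_for_exit_code`, `run_check`) are hypothetical, and treating any other exit code as "error" is an assumption of this sketch rather than documented behavior.

```shell
# status_for_exit_code CODE
# Maps a validator exit code to a status, following the exit code contract.
# Anything other than 0, 1, or 2 is reported as "error" (an assumption).
status_for_exit_code() {
  case "$1" in
    0) echo "passed" ;;
    1) echo "failed" ;;
    2) echo "skipped" ;;
    *) echo "error" ;;
  esac
}

# run_check COMMAND [ARGS...]
# Runs a validator command locally, discards its output for brevity,
# and prints the status derived from its exit code.
run_check() {
  rc=0
  "$@" >/dev/null 2>&1 || rc=$?
  status_for_exit_code "$rc"
}

run_check sh -c 'exit 0'   # prints "passed"
run_check sh -c 'exit 1'   # prints "failed"
run_check sh -c 'exit 2'   # prints "skipped"
```

Replacing `sh -c 'exit 0'` with the path to a validator script gives a quick smoke test of the exit code contract outside the cluster.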

The container receives:

  • Snapshot at /data/snapshot/snapshot.yaml
  • Recipe at /data/recipe/recipe.yaml
  • Kubernetes API access via in-cluster ServiceAccount

Evidence output goes to stdout. Debug logs go to stderr. On failure, write a reason to /dev/termination-log (max 4096 bytes).
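
A small helper can keep failure reporting consistent with this I/O contract. The sketch below is one possible shape, not part of the tool: `fail` and the `TERMINATION_LOG` override are hypothetical names, and the override exists only so the helper can be exercised outside a pod. It writes the reason to stdout as evidence and stores at most 4096 bytes of it in the termination log.

```shell
# TERMINATION_LOG is a hypothetical override for local runs;
# inside a pod the default path from the contract applies.
TERMINATION_LOG="${TERMINATION_LOG:-/dev/termination-log}"

# fail REASON
# Prints the reason on stdout (evidence), writes at most 4096 bytes of it
# to the termination log, and returns 1 (the "check failed" exit code).
fail() {
  printf '%s\n' "$1"
  printf '%s' "$1" | head -c 4096 > "$TERMINATION_LOG"
  return 1
}

# Local try-out: point the log at a temp file and fail with an oversized reason.
TERMINATION_LOG=$(mktemp)
long_reason=$(awk 'BEGIN { for (i = 0; i < 5000; i++) printf "x" }')
fail "$long_reason" >/dev/null || echo "exit code: $?"
wc -c < "$TERMINATION_LOG"    # byte count of the truncated reason
```

In a validator script, a call such as `fail "operator pod not ready" || exit 1` reports the reason and ends the check with the failure exit code.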

Step 2: Build and Push the Image

docker build -t my-registry.example.com/my-validator:v1.0.0 .
docker push my-registry.example.com/my-validator:v1.0.0

Step 3: Create a Catalog Entry

Create my-data/validators/catalog.yaml:

apiVersion: aicr.nvidia.com/v1
kind: ValidatorCatalog
metadata:
  name: custom-validators
  version: "1.0.0"
validators:
  - name: my-custom-check
    phase: deployment
    description: "Verify my custom deployment requirement"
    image: my-registry.example.com/my-validator:v1.0.0
    timeout: 5m
    args: ["check"]
    env: []
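
The timeout field matters because, as noted under Image Requirements, the Job's activeDeadlineSeconds is set from the catalog entry. The sketch below shows how values such as `5m` or `1m` translate to seconds. The function name is hypothetical, and supporting only a single `s`, `m`, or `h` suffix is an assumption; the engine's real parser may accept more forms.

```shell
# duration_to_seconds DURATION
# Converts a simple duration such as "30s", "5m", or "1h" to seconds.
# Only one numeric value with one unit suffix is handled (an assumption);
# anything else is rejected with a non-zero return code.
duration_to_seconds() {
  value=${1%[smh]}        # strip a trailing unit, if present
  unit=${1#"$value"}      # whatever was stripped is the unit
  case "$value" in
    ''|*[!0-9]*|0?*) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
  case "$unit" in
    s) echo "$value" ;;
    m) echo $((value * 60)) ;;
    h) echo $((value * 3600)) ;;
    *) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

duration_to_seconds 5m    # prints 300
duration_to_seconds 1m    # prints 60
```

A validator that needs longer than its deadline should raise the timeout in the catalog entry rather than rely on the Job being allowed to overrun.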

Step 4: Reference in Recipe

Add the check to your recipe’s validation section:

validation:
  deployment:
    checks:
      - operator-health      # Embedded validator
      - expected-resources   # Embedded validator
      - my-custom-check      # Your custom validator

If you omit the checks list, all catalog entries for the phase run (embedded + custom).
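
The selection rule can be sketched as follows. This is an illustration only: `select_checks` is a hypothetical helper, the catalog is simplified to `<name> <phase>` lines, `other-phase` is a made-up phase name used to show filtering, and the sketch does not verify that explicitly listed names exist in the catalog.

```shell
# select_checks PHASE CATALOG_FILE [CHECK...]
# With no CHECK arguments, prints every catalog entry for the phase.
# With CHECK arguments, prints only those names, in the order given.
select_checks() {
  phase=$1
  catalog=$2
  shift 2
  if [ "$#" -eq 0 ]; then
    # No explicit checks list: all entries for the phase run.
    awk -v phase="$phase" '$2 == phase { print $1 }' "$catalog"
  else
    # Explicit checks list: only the named checks run.
    printf '%s\n' "$@"
  fi
}

catalog=$(mktemp)
printf '%s\n' \
  'operator-health deployment' \
  'expected-resources deployment' \
  'my-custom-check deployment' \
  'unrelated-check other-phase' > "$catalog"

select_checks deployment "$catalog"                   # three deployment checks
select_checks deployment "$catalog" my-custom-check   # only the listed check
```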

Step 5: Run Validation

aicr validate \
  --recipe recipe.yaml \
  --snapshot snapshot.yaml \
  --data ./my-data \
  --phase deployment

Overriding Embedded Validators

To replace an embedded validator with a custom implementation, use the same name:

# my-data/validators/catalog.yaml
apiVersion: aicr.nvidia.com/v1
kind: ValidatorCatalog
metadata:
  name: custom-validators
  version: "1.0.0"
validators:
  - name: operator-health   # Same name as embedded entry
    phase: deployment
    description: "Custom operator health check with extended diagnostics"
    image: my-registry.example.com/custom-operator-health:v1.0.0
    timeout: 5m
    args: ["check"]
    env: []

The external entry replaces the embedded operator-health validator entirely.

Language-Agnostic Contract

The validator contract is a process convention, not a Go interface. Any language works as long as the container follows the exit code and I/O contract.

Bash Example

#!/usr/bin/env bash
set -euo pipefail

# Read snapshot data (mounted by the validator engine)
SNAPSHOT="/data/snapshot/snapshot.yaml"

if [[ ! -f "$SNAPSHOT" ]]; then
  echo "snapshot not found" > /dev/termination-log
  exit 1
fi

# Check: verify GPU driver version from snapshot
DRIVER_VERSION=$(yq '.measurements[] | select(.type == "GPU") | .subtypes[] | select(.name == "smi") | .data.driver_version' "$SNAPSHOT")

# yq prints "null" for a missing key, so treat that as not found as well
if [[ -z "$DRIVER_VERSION" || "$DRIVER_VERSION" == "null" ]]; then
  echo "GPU driver version not found in snapshot" > /dev/termination-log
  exit 1
fi

REQUIRED="550.90"

# Evidence to stdout
echo "GPU driver version: $DRIVER_VERSION"
echo "Required minimum: $REQUIRED"

# Compare versions: pass when REQUIRED sorts first (or ties) in version order.
# grep -F matches the version literally instead of treating "." as a wildcard.
if printf '%s\n%s\n' "$REQUIRED" "$DRIVER_VERSION" | sort -V | head -1 | grep -qxF "$REQUIRED"; then
  echo "PASS: driver version meets requirement"
  exit 0
else
  MSG="FAIL: driver $DRIVER_VERSION < required $REQUIRED"
  echo "$MSG"
  echo "$MSG" > /dev/termination-log
  exit 1
fi

Dockerfile:

FROM alpine:3.21
RUN apk add --no-cache bash yq
COPY check.sh /check.sh
RUN chmod +x /check.sh
# Run as a numeric non-root UID so the image satisfies runAsNonRoot: true
USER 65534:65534
ENTRYPOINT ["/check.sh"]

Catalog entry:

- name: gpu-driver-version
  phase: deployment
  description: "Verify GPU driver meets minimum version"
  image: my-registry.example.com/gpu-driver-check:v1.0.0
  timeout: 1m
  args: []
  env: []

Image Requirements

  • Must run as non-root (validator Jobs use runAsNonRoot: true)
  • Must handle the mounted data paths (/data/snapshot/, /data/recipe/)
  • Should respect the timeout: the Job's activeDeadlineSeconds is set from the catalog entry's timeout value
  • Should write meaningful evidence to stdout for the CTRF report
  • Must use explicit image tags (not :latest) for reproducibility in external catalogs
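
The tag requirement is easy to check before publishing a catalog. The sketch below is a hypothetical pre-flight helper, not something the tool provides. It assumes that a digest reference (`name@sha256:...`) also counts as explicit, which the requirement above does not state.

```shell
# has_explicit_tag IMAGE
# Returns 0 when the image reference carries a tag other than "latest"
# (or a digest, which this sketch assumes is acceptable), 1 otherwise.
has_explicit_tag() {
  image=$1
  # A digest pins the image exactly.
  case "$image" in
    *@*) return 0 ;;
  esac
  # Look only at the last path segment so a registry port
  # (for example localhost:5000/img) is not mistaken for a tag.
  last=${image##*/}
  case "$last" in
    *:latest) return 1 ;;
    *:?*) return 0 ;;
    *) return 1 ;;
  esac
}

for image in \
  my-registry.example.com/my-validator:v1.0.0 \
  my-registry.example.com/my-validator:latest \
  my-registry.example.com/my-validator; do
  if has_explicit_tag "$image"; then
    echo "ok:     $image"
  else
    echo "reject: $image"
  fi
done
```

Only the first reference in the loop is accepted; the `:latest` and untagged references are rejected.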

Private Registries

If your validator image is in a private registry, use --image-pull-secret:

aicr validate \
  --recipe recipe.yaml \
  --data ./my-data \
  --image-pull-secret my-registry-secret

The secret must exist in the validation namespace and be of type kubernetes.io/dockerconfigjson.
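
For orientation, a secret of type kubernetes.io/dockerconfigjson stores a small JSON document with the registry credentials. The sketch below builds that payload for placeholder credentials; the function name is hypothetical, the values `user` and `pass` are examples, and no JSON escaping is done, so it is only suitable for simple values.

```shell
# dockerconfigjson SERVER USERNAME PASSWORD
# Prints the JSON payload stored in a kubernetes.io/dockerconfigjson secret.
# The "auth" field is base64("USERNAME:PASSWORD"). Values are not escaped.
dockerconfigjson() {
  auth=$(printf '%s:%s' "$2" "$3" | base64 | tr -d '\n')
  printf '{"auths":{"%s":{"username":"%s","password":"%s","auth":"%s"}}}\n' \
    "$1" "$2" "$3" "$auth"
}

dockerconfigjson my-registry.example.com user pass
```

In practice the secret is usually created with `kubectl create secret docker-registry`, which produces this payload from its `--docker-server`, `--docker-username`, and `--docker-password` options.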

See Also