Recipe Development Guide

This guide covers how to create, modify, and validate recipe metadata.

Quick Start: Contributing a Recipe

New to recipe development? Follow these minimal steps to contribute:

1. Copy an existing overlay (details)

$ cp recipes/overlays/h100-eks-ubuntu-training.yaml recipes/overlays/gb200-eks-ubuntu-training.yaml

2. Edit criteria and components (criteria, components)

# recipes/overlays/gb200-eks-ubuntu-training.yaml
spec:
  base: eks-training # Inherit from intermediate recipe
  criteria:
    service: eks
    accelerator: gb200 # Changed from h100
    os: ubuntu
    intent: training
  componentRefs:
    - name: gpu-operator
      version: v26.3.1
      valuesFile: components/gpu-operator/eks-gb200-training.yaml
      overrides:
        driver:
          version: "580.82.07" # GB200-specific driver

3. Run tests (details)

$ make test     # Validates schema, criteria, references, constraints
$ make qualify  # Includes end-to-end tests before submitting

4. Open PR (best practices)

  • Include test output showing recipe generation works
  • Explain why the recipe is needed (new hardware, workload, platform)

Overview

Recipe metadata files define component configurations for GPU-accelerated Kubernetes deployments using a base-plus-overlay architecture with three composition mechanisms — single-parent inheritance, explicit mixin composition, and criteria-wildcard matching:

  • Base values (overlays/base.yaml) - universal defaults
  • Intermediate recipes (eks.yaml, eks-training.yaml) - shared configurations for categories
  • Leaf recipes (gb200-eks-ubuntu-training.yaml) - hardware/workload-specific overrides
  • Mixins (mixins/*.yaml) - composable fragments (OS constraints, platform components) that leaf overlays reference via spec.mixins instead of duplicating content
  • Criteria-wildcard overlays (gb200-any.yaml) - cross-cutting overlays picked up automatically by the resolver when their wildcard criteria match the query, without being referenced via spec.base or spec.mixins
  • Inline overrides - per-recipe customization without new files

Recipe files in recipes/ are embedded at compile time. Integrators can extend or override using the --data flag (see Advanced Topics).

For query matching and overlay merging internals, see Data Architecture.

Recipe Structure

Multi-Level Inheritance

Recipes use spec.base to inherit configurations. Chains progress from general (base) to specific (leaf):

base.yaml → eks.yaml → eks-training.yaml → gb200-eks-ubuntu-training.yaml
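As a rough illustration of how a spec.base chain unrolls, the sketch below walks from a leaf back to the root and reverses the result. The dictionary shape and the `resolve_chain` helper are hypothetical and only model the chain shown above; they are not the resolver's actual data structures.

```python
def resolve_chain(overlays, leaf_name):
    """Return overlay names ordered from the root of the chain down to the leaf.

    `overlays` maps an overlay name to a dict that may carry a "base" key
    (a stand-in for spec.base). This layout is an assumption for illustration.
    """
    chain = []
    seen = set()
    name = leaf_name
    while name is not None:
        if name in seen:
            raise ValueError(f"inheritance cycle at {name!r}")
        seen.add(name)
        chain.append(name)
        name = overlays[name].get("base")  # None once the root is reached
    chain.reverse()
    return chain


overlays = {
    "base": {},
    "eks": {"base": "base"},
    "eks-training": {"base": "eks"},
    "gb200-eks-ubuntu-training": {"base": "eks-training"},
}

print(" -> ".join(resolve_chain(overlays, "gb200-eks-ubuntu-training")))
```

Because each overlay names at most one parent, the walk is linear; the `seen` set only guards against a misconfigured chain that loops back on itself.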

Intermediate recipes (partial criteria) capture shared configs:

# eks-training.yaml
spec:
  base: eks
  criteria:
    service: eks
    intent: training # Partial - no accelerator/OS
  componentRefs:
    - name: gpu-operator
      valuesFile: components/gpu-operator/values-eks-training.yaml

Leaf recipes (complete criteria) match user queries:

# gb200-eks-ubuntu-training.yaml
spec:
  base: eks-training # Inherits from intermediate
  criteria:
    service: eks
    accelerator: gb200
    os: ubuntu
    intent: training # Complete
  componentRefs:
    - name: gpu-operator
      overrides:
        driver:
          version: "580.82.07" # Hardware-specific override

Leaf recipes with mixins compose shared fragments:

# h100-eks-ubuntu-training-kubeflow.yaml
spec:
  base: h100-eks-ubuntu-training
  mixins:
    - os-ubuntu         # Shared Ubuntu constraints (from recipes/mixins/)
    - platform-kubeflow # Kubeflow trainer component (from recipes/mixins/)
  criteria:
    service: eks
    accelerator: h100
    os: ubuntu
    intent: training
    platform: kubeflow

Mixins use kind: RecipeMixin and carry only constraints and componentRefs. They live in recipes/mixins/ and are applied after inheritance chain merging. See Data Architecture for details.

Some platforms declare their full component stack inline per leaf overlay rather than via a platform mixin. This is the case for --platform slurm and --platform dynamo, where each leaf carries hardware-specific tuning (GPU GRES strings, accelerator resource limits) that the mixin merge path cannot represent cleanly. Other platforms like --platform kubeflow and --platform inference still use the platform-kubeflow / platform-inference mixins shown above, since their leaf-specific tuning is minimal.

For example, --platform slurm leaves inline three componentRefs:

  • slinky-slurm-operator-crds — SchedMD Slinky CRDs
  • slinky-slurm-operator — the operator and admission webhook
  • slinky-slurm — the Slinky-managed Slurm cluster instance (Controller / LoginSet / NodeSet / RestApi), with leaf-specific overrides (e.g. H100 GRES wiring on the nodesets.slinky map)

This is the same shape dynamo-platform uses across the *-inference-dynamo leaves. See recipes/overlays/h100-eks-ubuntu-training-slurm.yaml for the full example.

When authoring a recipe targeting Talos (criteria.os: talos), append the os-talos mixin to your overlay’s spec.mixins list (e.g. spec.mixins: [os-talos], or [platform-kubeflow, os-talos] if you already mix in a non-OS fragment). OS-scoped mixins are mutually exclusive — combining os-ubuntu and os-talos in one overlay is a recipe authoring error, not a supported composition. The mixin overrides namespaces for affected components and supplies PSA-privileged Namespace manifests via componentRefs[].preManifestFiles, which are applied before each chart — see Talos integration for the component list and labels.
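The mutual-exclusion rule for OS-scoped mixins can be expressed as a small lint. This is a minimal sketch that assumes OS mixins are recognizable by the `os-` name prefix used in this guide; `check_os_mixins` is a hypothetical helper, and the guide does not say whether the project's own tests enforce the rule this way.

```python
def check_os_mixins(mixins):
    """Return the single OS-scoped mixin in `mixins`, or None if there is none.

    Raises ValueError when more than one OS mixin is listed, since combining
    them (for example os-ubuntu with os-talos) is an authoring error.
    """
    os_mixins = [name for name in mixins if name.startswith("os-")]
    if len(os_mixins) > 1:
        raise ValueError(f"OS mixins are mutually exclusive: {os_mixins}")
    return os_mixins[0] if os_mixins else None


print(check_os_mixins(["platform-kubeflow", "os-talos"]))
```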

Cross-cutting overlays with wildcard criteria apply across one criteria dimension without being referenced via spec.base or listed in spec.mixins. The resolver can return multiple independent maximal-leaf overlays for a single query, so a service: any overlay is picked up alongside the service-specific maximal leaf and its inheritance chain:

# gb200-any.yaml: applies to every GB200 query (any service, any intent)
spec:
  base: base
  criteria:
    service: any # Wildcard: matches eks, oke, gke, etc.
    accelerator: gb200
  validation:
    deployment:
      checks:
        - operator-health
        - expected-resources
        - gpu-operator-version
        - check-nvidia-smi
      constraints:
        - name: Deployment.gpu-operator.version
          value: ">= v25.10.0"

Only use this pattern when the content is truly uniform across the wildcard dimension — if values diverge per service, keep them inline in each service-specific overlay. NCCL performance thresholds, for example, are explicitly not a good fit for this pattern: each service has a different network fabric (EFA, TCPXO, RoCE, etc.) and the same bandwidth number is rarely correct across two fabrics. The intent-scoped gb200-any-training.yaml shape that previously carried a cross-service NCCL threshold was retired in #1052 in favor of per-leaf performance blocks. See Data Architecture for when to use wildcard overlays vs mixins.
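To make the wildcard behavior concrete, here is a minimal matching sketch. It assumes that a criteria field set to `any`, or omitted from the overlay, matches every query value, and that the query is fully specified. The `overlay_matches` helper is hypothetical; it does not model how the resolver picks maximal leaves or handles partially specified queries.

```python
CRITERIA_FIELDS = ("service", "accelerator", "os", "intent", "platform")


def overlay_matches(overlay_criteria, query):
    """Return True when every non-wildcard overlay field equals the query value."""
    for field in CRITERIA_FIELDS:
        wanted = overlay_criteria.get(field, "any")  # omitted field acts as a wildcard
        if wanted == "any":
            continue
        if query.get(field) != wanted:
            return False
    return True


gb200_any = {"service": "any", "accelerator": "gb200"}
gb200_leaf = {"service": "eks", "accelerator": "gb200", "os": "ubuntu", "intent": "training"}
query = {"service": "eks", "accelerator": "gb200", "os": "ubuntu", "intent": "training"}

# Both the cross-cutting overlay and the service-specific leaf match this query.
print(overlay_matches(gb200_any, query), overlay_matches(gb200_leaf, query))
```

Under these assumptions the same `gb200_any` criteria would also match a GB200 query for a different service, which is why the content of such an overlay has to be uniform across services.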

Merge order: base.yaml (lowest) → intermediate → leaf → mixins (highest)

Merge rules:

  • Constraints: same-named overridden, new added
  • ComponentRefs: same-named merged field-by-field, new added
  • validation.<phase> blocks merge per-field: checks and constraints union and deduplicate when non-empty (constraints by name, overlay wins on same-name); an explicit empty list (checks: [] / constraints: []) clears the inherited list, while an omitted/null field inherits it; nodeSelection replaced wholesale when set; timeout/infrastructure overlay-wins-if-non-empty
  • Criteria: not inherited (each recipe defines its own)
  • Mixin constraints/components must not conflict with the inheritance chain or other mixins
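The by-name merge and the three-way behavior of list fields (omitted, empty, non-empty) are easy to misread, so the sketch below models them on plain Python values. `merge_named` and `merge_checks` are hypothetical helpers written for this guide; the field-by-field merge here is shallow, and the no-conflict rule for mixins is not modelled.

```python
def merge_named(inherited, overlay):
    """Merge two lists of {"name": ...} dicts.

    Entries with the same name are merged field by field (overlay wins);
    entries with new names are appended.
    """
    merged = {item["name"]: dict(item) for item in inherited}
    for item in overlay:
        merged.setdefault(item["name"], {}).update(item)
    return list(merged.values())


def merge_checks(inherited, overlay):
    """Merge a validation checks list.

    None (field omitted) inherits, [] clears, anything else is unioned
    with the inherited list and deduplicated, preserving order.
    """
    if overlay is None:
        return list(inherited)
    if overlay == []:
        return []
    result = list(inherited)
    for check in overlay:
        if check not in result:
            result.append(check)
    return result


inherited = [
    {"name": "K8s.server.version", "value": ">= 1.30.0"},
    {"name": "OS.release.ID", "value": "ubuntu"},
]
overlay = [
    {"name": "K8s.server.version", "value": ">= 1.32.4"},
    {"name": "GPU.smi.driver-version", "value": ">= 580.82.07"},
]
for constraint in merge_named(inherited, overlay):
    print(constraint["name"], constraint["value"])

print(merge_checks(["operator-health"], ["expected-resources", "operator-health"]))
print(merge_checks(["operator-health"], []))
print(merge_checks(["operator-health"], None))
```

The constraint values in the example are illustrative only; the point is that the overlay's `K8s.server.version` replaces the inherited one while the untouched `OS.release.ID` survives.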

Component Types

Helm components (most common):

componentRefs:
  - name: gpu-operator
    type: Helm
    version: v26.3.1
    valuesFile: components/gpu-operator/values.yaml
    overrides:
      driver:
        version: "580.82.07"

Kustomize components:

componentRefs:
  - name: my-app
    type: Kustomize
    source: https://github.com/example/my-app
    tag: v1.0.0
    path: deploy/production

A component must have either helm OR kustomize configuration, not both.

Component Configuration

Configuration Patterns

Pattern 1: ValuesFile only (large, reusable configs)

componentRefs:
  - name: cert-manager
    valuesFile: components/cert-manager/eks-values.yaml

Pattern 2: Overrides only (small, recipe-specific configs)

componentRefs:
  - name: nvsentinel
    overrides:
      namespace: nvsentinel
      sentinel:
        enabled: true

Pattern 3: Hybrid (shared base + recipe tweaks)

componentRefs:
  - name: gpu-operator
    valuesFile: components/gpu-operator/eks-gb200-training.yaml
    overrides:
      driver:
        version: "580.82.07" # Override just this field

Value Merge Precedence

Values merge from lowest to highest precedence:

Base → ValuesFile → Overrides → CLI --set flags

Deep merge: only specified fields replaced, unspecified preserved. Arrays replaced entirely (not element-by-element).

Example:

# Base:       driver.version="550.54.15", driver.repository="nvcr.io/nvidia"
# ValuesFile: driver.version="570.86.16"
# Override:   driver.version="580.13.01"
# Result:     driver.version="580.13.01", driver.repository="nvcr.io/nvidia" (preserved)
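The same example can be reproduced with a few lines of Python. This is a minimal sketch of a deep merge with the stated rules (nested maps merge key by key, lists and scalars are replaced wholesale); `deep_merge` is a hypothetical helper, not the function the project uses.

```python
def deep_merge(base, override):
    """Return a new dict where `override` is layered on top of `base`.

    Nested dicts are merged recursively; any other value, including a list,
    replaces the base value entirely.
    """
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result


base = {"driver": {"version": "550.54.15", "repository": "nvcr.io/nvidia"}}
values_file = {"driver": {"version": "570.86.16"}}
overrides = {"driver": {"version": "580.13.01"}}

merged = deep_merge(deep_merge(base, values_file), overrides)
print(merged)
```

Applying the layers in precedence order leaves `driver.version` at "580.13.01" while `driver.repository` keeps its base value. A list-valued field set in a higher layer would replace the lower layer's list rather than being concatenated with it.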

File Naming Conventions

File names are for human readability—matching uses spec.criteria, not file names.

Overlay naming: {accelerator}-{service}-{os}-{intent}-{platform}.yaml (platform always last)

| File Type | Pattern | Example |
| --- | --- | --- |
| Service | {service}.yaml | eks.yaml |
| Service + intent | {service}-{intent}.yaml | eks-training.yaml |
| Full criteria | {accel}-{service}-{os}-{intent}.yaml | gb200-eks-ubuntu-training.yaml |
| + platform | {accel}-{service}-{os}-{intent}-{platform}.yaml | gb200-eks-ubuntu-training-kubeflow.yaml |
| Mixin (OS) | os-{os}.yaml | os-ubuntu.yaml |
| Mixin (platform) | platform-{platform}.yaml | platform-kubeflow.yaml |
| Component values | values-{service}-{intent}.yaml | values-eks-training.yaml |
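For leaf overlays, the naming pattern amounts to joining the criteria values in a fixed order. The helper below is a hypothetical convenience for scripts that scaffold new overlays, written only to illustrate the convention; since matching uses spec.criteria, nothing in the resolver depends on it.

```python
def overlay_filename(accelerator, service, os_name, intent, platform=None):
    """Build a leaf overlay file name; platform, when present, always comes last."""
    parts = [accelerator, service, os_name, intent]
    if platform:
        parts.append(platform)
    return "-".join(parts) + ".yaml"


print(overlay_filename("gb200", "eks", "ubuntu", "training"))
print(overlay_filename("gb200", "eks", "ubuntu", "training", platform="kubeflow"))
```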

Constraints and Validation

Constraints

Constraints validate deployment requirements against cluster snapshots:

constraints:
  - name: K8s.server.version
    value: ">= 1.32.4"
  - name: OS.release.ID
    value: ubuntu
  - name: OS.release.VERSION_ID
    value: "24.04"

Common measurement paths

| Path | Example |
| --- | --- |
| K8s.server.version | 1.32.4 |
| OS.release.ID | ubuntu, rhel |
| OS.release.VERSION_ID | 24.04 |
| GPU.smi.driver-version | 580.82.07 |

Operators: >=, <=, >, <, ==, !=, or exact match (no operator)

Add constraints when: recipe needs specific K8s features, driver versions, OS capabilities, or hardware. Skip when universal or redundant with component self-checks.
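The sketch below shows one plausible way to evaluate a constraint value against a measured value from a snapshot. It assumes dotted numeric versions (with an optional leading `v`) compared component by component, and plain string comparison for an exact match; the guide does not specify the validator's real comparison semantics, so `satisfies` and `version_key` are hypothetical.

```python
import operator
import re

OPS = {
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt,
    "==": operator.eq, "!=": operator.ne,
}
# Optional operator followed by the expected value.
PATTERN = re.compile(r"^\s*(>=|<=|==|!=|>|<)?\s*(.+?)\s*$")


def version_key(text):
    """Turn "v26.3.1" or "580.82.07" into a tuple of ints for ordering."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def satisfies(expression, actual):
    """Return True when `actual` meets the constraint `expression`."""
    op, expected = PATTERN.match(expression).groups()
    if op is None:
        return actual == expected  # no operator: exact string match
    try:
        left, right = version_key(actual), version_key(expected)
    except ValueError:
        if op not in ("==", "!="):
            raise  # ordering operators need numeric versions
        left, right = actual, expected
    return OPS[op](left, right)


print(satisfies(">= 1.32.4", "1.33.0"))
print(satisfies("ubuntu", "ubuntu"))
print(satisfies(">= v25.10.0", "v26.3.1"))
```

Quoting values such as "24.04" in YAML matters for the exact-match case: an unquoted 24.04 would be parsed as a number before any comparison takes place.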

Validation Phases

Optional multi-phase validation beyond basic constraints:

# expectedResources are declared on componentRefs, not under validation
componentRefs:
  - name: gpu-operator
    type: Helm
    expectedResources:
      - kind: Deployment
        name: gpu-operator
        namespace: gpu-operator
      - kind: DaemonSet
        name: nvidia-driver-daemonset
        namespace: gpu-operator

validation:
  # Readiness phase has no checks; constraints are evaluated inline from snapshot.
  deployment:
    checks: [expected-resources]
  performance:
    infrastructure: nccl-doctor
    checks: [nccl-bandwidth-test]

Phases: deployment, performance, conformance (readiness constraints are evaluated implicitly)

Testing

# Validate constraints
$ aicr validate --recipe recipe.yaml --snapshot snapshot.yaml

# Phase-specific
$ aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --phase deployment

# Run validation tests
$ go test -v ./pkg/recipe/... -run TestConstraintPathsUseValidMeasurementTypes

Working with Recipes

Adding a New Recipe

When: new platform, hardware, workload type, or combined criteria

Steps:

  1. Create overlay in recipes/overlays/ with criteria and componentRefs
  2. If the recipe shares OS constraints or platform components with other overlays, reference existing mixins via spec.mixins instead of duplicating (or create new mixins in recipes/mixins/)
  3. Create component values files if using valuesFile
  4. Run tests: make test
  5. Test generation: aicr recipe --service eks --accelerator gb200 --format yaml

Example:

# recipes/overlays/gb200-eks-ubuntu-training.yaml
apiVersion: aicr.nvidia.com/v1alpha1
kind: RecipeMetadata
metadata:
  name: gb200-eks-ubuntu-training
spec:
  base: eks-training
  criteria:
    service: eks
    accelerator: gb200
    os: ubuntu
    intent: training
  componentRefs:
    - name: gpu-operator
      version: v26.3.1
      valuesFile: components/gpu-operator/eks-gb200-training.yaml

Updating Recipes

Updating versions:

# Update component version
componentRefs:
  - name: gpu-operator
    version: v26.3.1 # Changed from v26.3.0

Adding components:

componentRefs:
  - name: new-component
    version: v1.0.0
    valuesFile: components/new-component/values.yaml
    dependencyRefs: [existing-component] # Optional

Test changes: aicr recipe --service eks --accelerator gb200 --format yaml

Best Practices

Do:

  • Use minimum criteria fields needed for matching
  • Keep base recipe universal and conservative
  • Use mixins for shared OS constraints or platform components instead of duplicating across leaf overlays
  • Always explain why settings exist (1-2 sentences)
  • Follow naming conventions ({accel}-{service}-{os}-{intent}-{platform})
  • Run make test before committing
  • Test recipe generation after changes

Don’t:

  • Add environment-specific settings to base
  • Over-specify criteria (too narrow = fewer matches)
  • Create duplicate criteria combinations
  • Duplicate OS or platform content across leaf overlays (use mixins instead)
  • Skip validation tests
  • Forget to update context when values change

Testing and Validation

Automated Tests

Tests in pkg/recipe/yaml_test.go validate:

  • Schema conformance (YAML structure)
  • Criteria enum values (service, accelerator, intent, OS, platform)
  • File references (valuesFile, dependencyRefs)
  • Constraint syntax (measurement paths, operators)
  • No duplicate criteria
  • Merge consistency
  • No dependency cycles
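The dependency-cycle check in that list is a standard graph problem. As a minimal sketch, a depth-first search over dependencyRefs can report the first cycle it finds; `find_cycle` and the component names in the example are hypothetical, and the project's test may be implemented differently.

```python
def find_cycle(components):
    """Return one dependency cycle as a list of names, or None if there is none.

    `components` maps a component name to the list of names in its
    dependencyRefs.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {name: WHITE for name in components}
    path = []

    def visit(name):
        state[name] = GREY
        path.append(name)
        for dep in components.get(name, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path: close the loop
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[name] = BLACK
        return None

    for name in components:
        if state[name] == WHITE:
            found = visit(name)
            if found:
                return found
    return None


acyclic = {"component-a": ["component-b"], "component-b": ["component-c"], "component-c": []}
cyclic = {"component-a": ["component-b"], "component-b": ["component-a"]}

print(find_cycle(acyclic))
print(find_cycle(cyclic))
```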

Running Tests

$ make test                                                             # All tests
$ go test -v ./pkg/recipe/...                                           # Recipe tests only
$ go test -v ./pkg/recipe/... -run TestAllMetadataFilesConformToSchema  # Specific test

Test Workflow

  1. Create recipe file in recipes/overlays/
  2. Run make test to validate
  3. Test generation: aicr recipe --service eks --accelerator gb200 --format yaml
  4. Inspect bundle: aicr bundle -r recipe.yaml -o ./test-bundles

Tests run automatically on PRs, main pushes, and release builds.

Advanced Topics

External Data Sources

Integrators can extend or override embedded recipe data using the --data flag without modifying the OSS codebase. This enables:

  • Custom recipes for proprietary hardware
  • Private component values with organization-specific settings
  • Extended registries with internal Helm charts
  • Rapid iteration without rebuilding binaries
  • New criteria values (service / accelerator / OS / intent / platform) admitted at runtime via the catalog-driven criteria registry — no rebuild required

See Data Extension for the full walkthrough (folder layout, registry rules, strict mode, debugging). The summary below is for quick reference.

Directory structure

./my-data/
├── registry.yaml            # Extends/overrides component registry
├── overlays/
│   └── custom-recipe.yaml   # New or override existing recipe
├── mixins/
│   └── os-custom.yaml       # Custom mixin fragments
└── components/
    └── my-operator/
        └── values.yaml      # Component values

Usage:

# Recipe generation
$ aicr recipe --service eks --accelerator gb200 --data ./my-data --output recipe.yaml

# Bundle generation
$ aicr bundle --recipe recipe.yaml --data ./my-data --deployer argocd --output ./bundle

# Debug loading
$ aicr --debug recipe --service eks --data ./my-data

Precedence: Embedded data (lowest) → External data (highest)

Behavior:

  • Overlays: Same metadata.name replaces embedded
  • Registry: Merged; same-named components replaced
  • Values: External valuesFile references take precedence
  • Criteria values: External overlays’ spec.criteria values become valid CLI / API inputs at runtime via the criteria registry; --criteria-strict (or AICR_CRITERIA_STRICT=1) rejects external-only values for OSS CI gates
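For overlays, "replaces" means the external document wins outright rather than being merged into the embedded one. The sketch below models that as a dictionary keyed by metadata.name; `apply_external` and the `source` field are hypothetical and exist only to make the precedence visible.

```python
def apply_external(embedded, external):
    """Layer external overlays over embedded ones, keyed by metadata.name.

    A same-named external overlay replaces the embedded overlay entirely;
    new names are added alongside the embedded set.
    """
    catalog = dict(embedded)
    catalog.update(external)
    return catalog


embedded = {
    "eks-training": {"source": "embedded"},
    "gb200-eks-ubuntu-training": {"source": "embedded"},
}
external = {
    "gb200-eks-ubuntu-training": {"source": "external"},  # overrides the embedded overlay
    "custom-recipe": {"source": "external"},              # new overlay
}

catalog = apply_external(embedded, external)
for name, overlay in catalog.items():
    print(name, overlay["source"])
```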

Validation:

$ aicr --debug recipe --service eks --data ./my-data --dry-run
$ aicr recipe --service eks --data ./my-data --output /dev/stdout

Regional registry overrides

A handful of components ship images from regional, account-scoped container registries rather than a single public URI. The clearest example today is the AWS EFA device plugin, whose canonical home is <account>.dkr.ecr.<region>.amazonaws.com/eks/aws-efa-k8s-device-plugin — a per-region private ECR that every EKS node is auto-authorized to pull from. AWS publishes these add-ons regionally for three reasons: pulls go over the AWS internal backbone (no NAT egress), no Docker Hub / public-registry rate limits, and the image stays available even when the public internet or another region is degraded.

AICR ships a sensible default for each such image (e.g., us-west-2 for aws-efa), but customers deploying in a different region need to override the registry’s region segment. Two override paths cover the common cases:

Bundle-time override (single region per bundle). Use --set to bake a specific region into the bundle:

$ aicr bundle --recipe recipe.yaml \
    --set awsefa:image.repository=602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/aws-efa-k8s-device-plugin \
    -o ./bundle

Install-time override (one bundle, many regions). Use --dynamic to declare the path as install-time-fillable, then provide the value via helm install --set (or your GitOps tool):

$ aicr bundle --recipe recipe.yaml \
    --dynamic awsefa:image.repository \
    --deployer helm \
    -o ./bundle

# Per-cluster install
$ helm install ... --set image.repository=602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/aws-efa-k8s-device-plugin

--dynamic is supported with helm, argocd-helm, and flux deployers; argocd does not support it (use argocd-helm instead). See Dynamic Install-Time Values for the broader pattern.

Partition-aware variants. Standard AWS uses account ID 602401143452. GovCloud and China use different accounts and URI suffixes:

| Partition | Account ID | URI shape |
| --- | --- | --- |
| aws (standard) | 602401143452 | <account>.dkr.ecr.<region>.amazonaws.com |
| aws-us-gov (GovCloud) | 013241004608 | <account>.dkr.ecr.<region>.amazonaws.com |
| aws-cn (China) | 961992271922 | <account>.dkr.ecr.<region>.amazonaws.com.cn |

Substitute the appropriate account and suffix in the --set / install-time value.
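When many regions are involved, the substitution can be scripted. The helper below is a hypothetical sketch that builds the EFA device plugin repository from the partition table above; the account IDs and suffixes are copied from that table, and the function name is not part of AICR.

```python
# Account ID and registry domain suffix per AWS partition (from the table above).
PARTITIONS = {
    "aws": ("602401143452", "amazonaws.com"),
    "aws-us-gov": ("013241004608", "amazonaws.com"),
    "aws-cn": ("961992271922", "amazonaws.com.cn"),
}


def efa_repository(partition, region):
    """Return the image.repository value for the given partition and region."""
    account, suffix = PARTITIONS[partition]
    return f"{account}.dkr.ecr.{region}.{suffix}/eks/aws-efa-k8s-device-plugin"


print(efa_repository("aws", "us-east-1"))
print(efa_repository("aws-cn", "cn-north-1"))
```

The returned string is what would follow `awsefa:image.repository=` in a bundle-time `--set`, or `image.repository=` at install time.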

Troubleshooting

Debug overlay matching:

$ aicr recipe --service eks --accelerator gb200 --format json | jq '.metadata.appliedOverlays'
$ aicr recipe --service eks --accelerator gb200 --format json | jq '.componentRefs[].version'

Common issues:

| Issue | Solution |
| --- | --- |
| Test: “duplicate criteria” | Combine overlays or differentiate criteria |
| Test: “valuesFile not found” | Create file or fix path in recipe |
| Test: “unknown component” | Use registered bundler name |
| Recipe returns empty | Check criteria fields match query |
| Wrong values in bundle | Verify merge precedence (base → valuesFile → overrides) |

Validation:

$ make qualify                                                 # Full qualification
$ make test                                                    # All tests
$ aicr recipe --service eks --accelerator gb200 --format yaml  # Test generation

Submitting Your Recipe

Recipes that target hardware AICR maintainers cannot independently re-run require an evidence bundle so a reviewer can verify the recipe without owning the hardware. The bundle is a signed, OCI-distributed artifact that captures the resolved recipe, the cluster snapshot, the validator phase results, a CycloneDX BOM, and a manifest of per-file hashes. It is produced by adding two flags to the same aicr validate invocation you already use to check the recipe — no separate build step.

When You Need Evidence

You need an evidence bundle when your PR adds or changes a recipe whose criteria reach hardware or a service that AICR maintainers cannot independently re-run — most non-H100 GPUs, non-EKS services, and specialty fabrics fall into this bucket. The recipe-evidence CI gate posts a sticky Markdown comment on every PR touching recipes/** and fails closed when a touched recipe has no matching recipes/evidence/<recipe>.yaml pointer.

Non-material edits (comments, formatting, displayName, description, key-order) produce the same material-slice digest and do not require a fresh bundle — the existing pointer stays valid. The CI gate’s canonicalizer collapses these to the same digest, so the gate passes without re-attestation. See ADR-007 § Material-slice canonicalization for the slice definition.

Producing the Bundle

Run aicr validate against the cluster that exercises your recipe and add --emit-attestation (writes the bundle to disk) and --push (signs and uploads the OCI artifact):

# 1. Capture snapshot and resolve the recipe you're contributing.
$ aicr snapshot --output snapshot.yaml
$ aicr recipe --service eks --accelerator gb200 --os ubuntu \
    --intent training --output recipe.yaml

# 2. Validate with attestation emission. Replace the OCI ref with a
#    registry you control (GHCR, GitLab Container Registry, Harbor,
#    AWS ECR, Google Artifact Registry, Azure Container Registry,
#    or JFrog Artifactory; any OCI 1.1 registry with Referrers API
#    support).
$ aicr validate \
    --recipe recipe.yaml \
    --snapshot snapshot.yaml \
    --emit-attestation ./out \
    --push ghcr.io/<owner>/aicr-evidence

# 3. Commit the pointer. The bundle bytes live in OCI; the repo
#    only stores the locator.
$ mkdir -p recipes/evidence
$ cp ./out/pointer.yaml recipes/evidence/<recipe-name>.yaml
$ git add recipes/evidence/<recipe-name>.yaml

--push triggers cosign keyless signing through Sigstore’s public-good infrastructure. The CLI resolves an OIDC token through the precedence chain documented under --identity-token; if no pre-fetched token, ambient GitHub Actions OIDC, or --oidc-device-flow is available, it opens a browser. The signed attestation.intoto.jsonl is attached to the OCI artifact as a Sigstore Bundle referrer.

For the full bundle layout, flag reference, and registry compatibility notes, see Emitting recipe evidence for a PR. For the producer-and-consumer walkthrough end-to-end, see Recipe Evidence Demo.

Self-Verifying Before You Open the PR

Run the verifier locally — it is the same code the CI gate runs against the committed pointer, so failures here will block merge:

$ aicr evidence verify recipes/evidence/<recipe-name>.yaml

Exit 0 means signature, schema, inventory, manifest hashes, fingerprint match against the recipe’s criteria, and BOM cross-reference all passed. A non-zero exit writes a structured Markdown report describing the specific check that failed. See aicr evidence verify for the full check list and exit-code semantics.

What to Include in the PR

The recipe-evidence CI gate posts a Markdown summary as a sticky comment, so you do not need to inline the verifier output. The PR template asks for three additional pieces of context the verifier cannot infer:

  • The OCI ref of the pushed bundle, digest-pinned, so a maintainer can audit it directly: ghcr.io/<owner>/aicr-evidence@sha256:<digest>.
  • The cluster you attested from — cloud, accelerator SKU, OS, Kubernetes version, node count. The fingerprint dimensions are in the predicate, but the human description is what the maintainer reads first.
  • Evidence disposition. If aicr evidence verify reported a non-zero exit with a 1 in the JSON output’s exit field (signature valid, recorded phase results show failures), include a short justification in the PR template’s “Evidence disposition” section. The maintainer either applies the evidence/known-failure label and merges, or requests changes. See Exit-1 Review Process for what counts as an acceptable reason — broadly: optional check not applicable to your hardware, performance ceiling limited by your test bed, or a validator under known active rework.

If You Cannot Push to a Registry

You can still produce a bundle locally without --push. The resulting ./out/summary-bundle/ directory is unsigned but otherwise complete:

$ aicr validate --recipe recipe.yaml --snapshot snapshot.yaml \
    --emit-attestation ./out
$ aicr evidence verify ./out/summary-bundle

The verifier records the signature step as “skipped (unsigned)” and the manifest-hash chain becomes self-consistency only — useful for catching accidental corruption during development, but not acceptable for the CI gate, which requires a signed bundle bound to a committed pointer.

  • For mechanical changes that touch recipes/** but carry no recipe semantics (file renames, comment-only changes, license header sweeps, self-bootstrapping evidence-pipeline changes), ask a maintainer to apply evidence/exempt per the bypass policy. Self-applying that label is not appropriate.

“I don’t have the hardware right now, please merge” is not a valid exempt path — see the bypass policy’s “Inappropriate uses.”

Reference


See Also