Recipe Development Guide


This guide covers how to create, modify, and validate recipe metadata.

Quick Start: Contributing a Recipe

New to recipe development? Follow these minimal steps to contribute:

1. Copy an existing overlay (details)

$ cp recipes/overlays/h100-eks-ubuntu-training.yaml recipes/overlays/gb200-eks-ubuntu-training.yaml

2. Edit criteria and components (criteria, components)

# recipes/overlays/gb200-eks-ubuntu-training.yaml
spec:
  base: eks-training  # Inherit from intermediate recipe
  criteria:
    service: eks
    accelerator: gb200  # Changed from h100
    os: ubuntu
    intent: training
  componentRefs:
    - name: gpu-operator
      version: v25.3.4
      valuesFile: components/gpu-operator/eks-gb200-training.yaml
      overrides:
        driver:
          version: "580.82.07"  # GB200-specific driver

3. Run tests (details)

$ make test     # Validates schema, criteria, references, constraints
$ make qualify  # Also runs end-to-end tests; run before submitting

4. Open PR (best practices)

  • Include test output showing recipe generation works
  • Explain why the recipe is needed (new hardware, workload, platform)

Overview

Recipe metadata files define component configurations for GPU-accelerated Kubernetes deployments using a base-plus-overlay architecture with three composition mechanisms — single-parent inheritance, explicit mixin composition, and criteria-wildcard matching:

  • Base values (overlays/base.yaml) - universal defaults
  • Intermediate recipes (eks.yaml, eks-training.yaml) - shared configurations for categories
  • Leaf recipes (gb200-eks-ubuntu-training.yaml) - hardware/workload-specific overrides
  • Mixins (mixins/*.yaml) - composable fragments (OS constraints, platform components) that leaf overlays reference via spec.mixins instead of duplicating content
  • Criteria-wildcard overlays (gb200-any-training.yaml) - cross-cutting overlays picked up automatically by the resolver when their wildcard criteria match the query, without being referenced via spec.base or spec.mixins
  • Inline overrides - per-recipe customization without new files

Recipe files in recipes/ are embedded at compile time. Integrators can extend or override using the --data flag (see Advanced Topics).

For query matching and overlay merging internals, see Data Architecture.

Recipe Structure

Multi-Level Inheritance

Recipes use spec.base to inherit configurations. Chains progress from general (base) to specific (leaf):

base.yaml → eks.yaml → eks-training.yaml → gb200-eks-ubuntu-training.yaml
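The chain above can be reproduced by following spec.base from a leaf back to the root. The following is a minimal sketch, not the resolver's actual code: `resolve_chain` and the `BASES` mapping are hypothetical names, and the mapping simply mirrors the example chain.

```python
def resolve_chain(name, bases):
    """Return overlay names ordered from the most general ancestor to `name`."""
    chain = []
    seen = set()
    current = name
    while current is not None:
        if current in seen:
            raise ValueError(f"inheritance cycle at {current!r}")
        seen.add(current)
        chain.append(current)
        current = bases.get(current)  # None once the root is reached
    chain.reverse()
    return chain


# Hypothetical overlay-name -> spec.base mapping mirroring the example chain.
BASES = {
    "base": None,
    "eks": "base",
    "eks-training": "eks",
    "gb200-eks-ubuntu-training": "eks-training",
}

chain = resolve_chain("gb200-eks-ubuntu-training", BASES)
print(chain)
```

Walking upward and then reversing keeps the lookup simple while still yielding the general-to-specific order in which the overlays are merged.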

Intermediate recipes (partial criteria) capture shared configs:

# eks-training.yaml
spec:
  base: eks
  criteria:
    service: eks
    intent: training  # Partial - no accelerator/OS
  componentRefs:
    - name: gpu-operator
      valuesFile: components/gpu-operator/values-eks-training.yaml

Leaf recipes (complete criteria) match user queries:

# gb200-eks-ubuntu-training.yaml
spec:
  base: eks-training  # Inherits from intermediate
  criteria:
    service: eks
    accelerator: gb200
    os: ubuntu
    intent: training  # Complete
  componentRefs:
    - name: gpu-operator
      overrides:
        driver:
          version: "580.82.07"  # Hardware-specific override

Leaf recipes with mixins compose shared fragments:

# h100-eks-ubuntu-training-kubeflow.yaml
spec:
  base: h100-eks-ubuntu-training
  mixins:
    - os-ubuntu          # Shared Ubuntu constraints (from recipes/mixins/)
    - platform-kubeflow  # Kubeflow trainer component (from recipes/mixins/)
  criteria:
    service: eks
    accelerator: h100
    os: ubuntu
    intent: training
    platform: kubeflow

Mixins use kind: RecipeMixin and carry only constraints and componentRefs. They live in recipes/mixins/ and are applied after inheritance chain merging. See Data Architecture for details.

Cross-cutting overlays with wildcard criteria apply across one criteria dimension without being referenced via spec.base or listed in spec.mixins. The resolver can return multiple independent maximal-leaf overlays for a single query, so a service: any overlay is picked up alongside the service-specific maximal leaf and its inheritance chain:

# gb200-any-training.yaml: applies to every GB200+training query
spec:
  base: base
  criteria:
    service: any  # Wildcard: matches eks, oke, gke, etc.
    accelerator: gb200
    intent: training
  validation:
    performance:
      checks:
        - nccl-all-reduce-bw  # Required: selects which validators run
      constraints:
        - name: nccl-all-reduce-bw
          value: ">= 720"

Only use this pattern when the content is truly uniform across the wildcard dimension — if values diverge per service, keep them inline in each service-specific overlay. See Data Architecture for when to use wildcard overlays vs mixins.
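To make the wildcard behaviour concrete, here is a rough sketch of how an overlay's criteria might be compared with a query. It is an illustration only: `criteria_match` is a hypothetical helper, and it assumes that a field the overlay omits places no restriction on the query. The resolver's real rules are described in Data Architecture.

```python
def criteria_match(overlay_criteria, query):
    """True when every field the overlay sets is 'any' or equals the query value."""
    for field, wanted in overlay_criteria.items():
        if wanted == "any":
            continue  # wildcard: accept whatever the query asks for
        if query.get(field) != wanted:
            return False
    return True


# Criteria copied from the gb200-any-training.yaml example above.
wildcard_overlay = {"service": "any", "accelerator": "gb200", "intent": "training"}

eks_query = {"service": "eks", "accelerator": "gb200", "os": "ubuntu", "intent": "training"}
gke_query = {"service": "gke", "accelerator": "gb200", "os": "ubuntu", "intent": "training"}
h100_query = {"service": "eks", "accelerator": "h100", "os": "ubuntu", "intent": "training"}

print(criteria_match(wildcard_overlay, eks_query))
print(criteria_match(wildcard_overlay, gke_query))
print(criteria_match(wildcard_overlay, h100_query))
```

Under these assumptions the overlay matches both GB200 training queries regardless of service, and does not match the H100 query.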

Merge order: base.yaml (lowest) → intermediate → leaf → mixins (highest)

Merge rules:

  • Constraints: same-named overridden, new added
  • ComponentRefs: same-named merged field-by-field, new added
  • Criteria: not inherited (each recipe defines its own)
  • Mixin constraints/components must not conflict with the inheritance chain or other mixins
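The first two merge rules can be sketched as follows. This is an assumption-laden illustration rather than the project's implementation: `merge_constraints` and `merge_component_refs` are hypothetical helpers, and "field-by-field" is interpreted here as replacing top-level fields of a same-named componentRef.

```python
import copy


def merge_constraints(lower, higher):
    """Same-named constraints are overridden; new ones are added."""
    merged = {c["name"]: c for c in lower}
    for constraint in higher:
        merged[constraint["name"]] = constraint
    return list(merged.values())


def merge_component_refs(lower, higher):
    """Same-named componentRefs are merged per top-level field; new ones are added."""
    merged = {c["name"]: copy.deepcopy(c) for c in lower}
    for ref in higher:
        if ref["name"] in merged:
            merged[ref["name"]].update(copy.deepcopy(ref))
        else:
            merged[ref["name"]] = copy.deepcopy(ref)
    return list(merged.values())


# Values taken from the eks-training and gb200-eks-ubuntu-training examples.
intermediate = [{"name": "gpu-operator",
                 "valuesFile": "components/gpu-operator/values-eks-training.yaml"}]
leaf = [{"name": "gpu-operator",
         "overrides": {"driver": {"version": "580.82.07"}}}]

refs = merge_component_refs(intermediate, leaf)
print(refs)
```

In this sketch the leaf keeps the valuesFile inherited from the intermediate recipe and adds its own overrides on top.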

Component Types

Helm components (most common):

componentRefs:
  - name: gpu-operator
    type: Helm
    version: v25.3.4
    valuesFile: components/gpu-operator/values.yaml
    overrides:
      driver:
        version: "580.82.07"

Kustomize components:

componentRefs:
  - name: my-app
    type: Kustomize
    source: https://github.com/example/my-app
    tag: v1.0.0
    path: deploy/production

A component must have either helm OR kustomize configuration, not both.

Component Configuration

Configuration Patterns

Pattern 1: ValuesFile only (large, reusable configs)

componentRefs:
  - name: cert-manager
    valuesFile: components/cert-manager/eks-values.yaml

Pattern 2: Overrides only (small, recipe-specific configs)

componentRefs:
  - name: nvsentinel
    overrides:
      namespace: nvsentinel
      sentinel:
        enabled: true

Pattern 3: Hybrid (shared base + recipe tweaks)

componentRefs:
  - name: gpu-operator
    valuesFile: components/gpu-operator/eks-gb200-training.yaml
    overrides:
      driver:
        version: "580.82.07"  # Override just this field

Value Merge Precedence

Values merge from lowest to highest precedence:

Base → ValuesFile → Overrides → CLI --set flags

Merging is deep: only the fields you specify are replaced, and unspecified fields are preserved. Arrays are replaced entirely, not merged element by element.

Example:

# Base:       driver.version="550.54.15", driver.repository="nvcr.io/nvidia"
# ValuesFile: driver.version="570.86.16"
# Override:   driver.version="580.13.01"
# Result:     driver.version="580.13.01", driver.repository="nvcr.io/nvidia" (preserved)
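The same example can be worked through in code. This is a minimal sketch of a deep merge with whole-array replacement, assuming plain dictionaries stand in for the parsed YAML; `deep_merge` and the `tolerations` key are hypothetical and only illustrate the rule.

```python
def deep_merge(base, override):
    """Return a new dict: nested dicts merge recursively, everything else is replaced."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value  # scalars and lists are replaced wholesale
    return result


base = {
    "driver": {"version": "550.54.15", "repository": "nvcr.io/nvidia"},
    "tolerations": ["a", "b"],  # hypothetical list to show array replacement
}
values_file = {"driver": {"version": "570.86.16"}}
overrides = {"driver": {"version": "580.13.01"}, "tolerations": ["c"]}

# Apply layers from lowest to highest precedence.
result = deep_merge(deep_merge(base, values_file), overrides)
print(result)
```

The final driver.version comes from the override, driver.repository survives from the base, and the list is replaced rather than concatenated.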

File Naming Conventions

File names are for human readability—matching uses spec.criteria, not file names.

Overlay naming: {accelerator}-{service}-{os}-{intent}-{platform}.yaml (platform always last)

| File Type | Pattern | Example |
| --- | --- | --- |
| Service | {service}.yaml | eks.yaml |
| Service + intent | {service}-{intent}.yaml | eks-training.yaml |
| Full criteria | {accel}-{service}-{os}-{intent}.yaml | gb200-eks-ubuntu-training.yaml |
| + platform | {accel}-{service}-{os}-{intent}-{platform}.yaml | gb200-eks-ubuntu-training-kubeflow.yaml |
| Mixin (OS) | os-{os}.yaml | os-ubuntu.yaml |
| Mixin (platform) | platform-{platform}.yaml | platform-kubeflow.yaml |
| Component values | values-{service}-{intent}.yaml | values-eks-training.yaml |
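As an illustration of the overlay naming pattern, the sketch below derives a file name from a criteria mapping by joining whichever fields are present in the documented order. `overlay_filename` is a hypothetical helper; nothing in the tooling requires file names to be generated this way, since matching uses spec.criteria.

```python
# Field order from the overlay naming convention (platform always last).
FIELD_ORDER = ("accelerator", "service", "os", "intent", "platform")


def overlay_filename(criteria):
    """Join the criteria fields that are present, in convention order."""
    parts = [criteria[field] for field in FIELD_ORDER if field in criteria]
    return "-".join(parts) + ".yaml"


print(overlay_filename({"service": "eks"}))
print(overlay_filename({"service": "eks", "intent": "training"}))
print(overlay_filename({"service": "eks", "accelerator": "gb200",
                        "os": "ubuntu", "intent": "training",
                        "platform": "kubeflow"}))
```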

Constraints and Validation

Constraints

Constraints validate deployment requirements against cluster snapshots:

constraints:
  - name: K8s.server.version
    value: ">= 1.32.4"
  - name: OS.release.ID
    value: ubuntu
  - name: OS.release.VERSION_ID
    value: "24.04"

Common measurement paths

| Path | Example |
| --- | --- |
| K8s.server.version | 1.32.4 |
| OS.release.ID | ubuntu, rhel |
| OS.release.VERSION_ID | 24.04 |
| GPU.smi.driver-version | 580.82.07 |

Operators: >=, <=, >, <, ==, !=, or exact match (no operator)

Add constraints when a recipe needs specific K8s features, driver versions, OS capabilities, or hardware. Skip them when the requirement is universal or redundant with component self-checks.
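To show how the operators in constraint values might be applied to a measured value, here is a simplified sketch. It is not the validator's actual logic: `evaluate` and `parse_version` are hypothetical, and the sketch assumes dotted numeric versions compare part by part while anything non-numeric (including suffixed versions) falls back to plain string comparison.

```python
import operator
import re

OPERATORS = {
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt,
    "==": operator.eq, "!=": operator.ne,
}

# Two-character operators come first so ">=" is not read as ">".
PATTERN = re.compile(r"^\s*(>=|<=|==|!=|>|<)\s*(.+?)\s*$")


def parse_version(text):
    """Turn '1.32.4' into (1, 32, 4); raises ValueError for non-numeric parts."""
    return tuple(int(part) for part in text.split("."))


def evaluate(expression, actual):
    """Check a measured value against a constraint value such as '>= 1.32.4'."""
    match = PATTERN.match(expression)
    if match is None:
        return actual == expression.strip()  # no operator: exact match
    op, expected = match.groups()
    try:
        left, right = parse_version(actual), parse_version(expected)
    except ValueError:
        left, right = actual, expected  # fall back to string comparison
    return OPERATORS[op](left, right)


print(evaluate(">= 1.32.4", "1.32.10"))
print(evaluate("ubuntu", "ubuntu"))
print(evaluate("!= rhel", "ubuntu"))
```

Comparing parsed tuples rather than raw strings matters for cases like 1.32.10 versus 1.32.4, where a string comparison would give the wrong order.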

Validation Phases

Optional multi-phase validation beyond basic constraints:

# expectedResources are declared on componentRefs, not under validation
componentRefs:
  - name: gpu-operator
    type: Helm
    expectedResources:
      - kind: Deployment
        name: gpu-operator
        namespace: gpu-operator
      - kind: DaemonSet
        name: nvidia-driver-daemonset
        namespace: gpu-operator

validation:
  # Readiness phase has no checks; constraints are evaluated inline from snapshot.
  deployment:
    checks: [expected-resources]
  performance:
    infrastructure: nccl-doctor
    checks: [nccl-bandwidth-test]

Phases: deployment, performance, conformance (readiness constraints are evaluated implicitly)

Testing

# Validate constraints
$ aicr validate --recipe recipe.yaml --snapshot snapshot.yaml

# Phase-specific
$ aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --phase deployment

# Run validation tests
$ go test -v ./pkg/recipe/... -run TestConstraintPathsUseValidMeasurementTypes

Working with Recipes

Adding a New Recipe

When: new platform, hardware, workload type, or combined criteria

Steps:

  1. Create overlay in recipes/overlays/ with criteria and componentRefs
  2. If the recipe shares OS constraints or platform components with other overlays, reference existing mixins via spec.mixins instead of duplicating (or create new mixins in recipes/mixins/)
  3. Create component values files if using valuesFile
  4. Run tests: make test
  5. Test generation: aicr recipe --service eks --accelerator gb200 --format yaml

Example:

# recipes/overlays/gb200-eks-ubuntu-training.yaml
apiVersion: aicr.nvidia.com/v1alpha1
kind: RecipeMetadata
metadata:
  name: gb200-eks-ubuntu-training
spec:
  base: eks-training
  criteria:
    service: eks
    accelerator: gb200
    os: ubuntu
    intent: training
  componentRefs:
    - name: gpu-operator
      version: v25.3.4
      valuesFile: components/gpu-operator/eks-gb200-training.yaml

Updating Recipes

Updating versions:

# Update component version
componentRefs:
  - name: gpu-operator
    version: v25.3.4  # Changed from v25.3.3

Adding components:

componentRefs:
  - name: new-component
    version: v1.0.0
    valuesFile: components/new-component/values.yaml
    dependencyRefs: [existing-component]  # Optional

Test changes: aicr recipe --service eks --accelerator gb200 --format yaml

Best Practices

Do:

  • Use minimum criteria fields needed for matching
  • Keep base recipe universal and conservative
  • Use mixins for shared OS constraints or platform components instead of duplicating across leaf overlays
  • Always explain why settings exist (1-2 sentences)
  • Follow naming conventions ({accel}-{service}-{os}-{intent}-{platform})
  • Run make test before committing
  • Test recipe generation after changes

Don’t:

  • Add environment-specific settings to base
  • Over-specify criteria (too narrow = fewer matches)
  • Create duplicate criteria combinations
  • Duplicate OS or platform content across leaf overlays (use mixins instead)
  • Skip validation tests
  • Forget to update context when values change

Testing and Validation

Automated Tests

Tests in pkg/recipe/yaml_test.go validate:

  • Schema conformance (YAML structure)
  • Criteria enum values (service, accelerator, intent, OS, platform)
  • File references (valuesFile, dependencyRefs)
  • Constraint syntax (measurement paths, operators)
  • No duplicate criteria
  • Merge consistency
  • No dependency cycles
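The dependency-cycle check, for instance, can be pictured as a depth-first search over dependencyRefs. The sketch below is an assumption about the general technique, not a copy of the test in pkg/recipe/yaml_test.go; `find_cycle` and the component names in the example are hypothetical.

```python
def find_cycle(dependencies):
    """Return one dependency cycle as a list of names, or None if the graph is acyclic."""
    unvisited, in_progress, done = 0, 1, 2
    state = {name: unvisited for name in dependencies}
    stack = []

    def visit(name):
        state[name] = in_progress
        stack.append(name)
        for dep in dependencies.get(name, []):
            dep_state = state.get(dep, unvisited)
            if dep_state == in_progress:
                # dep is already on the current path: that closes a cycle
                return stack[stack.index(dep):] + [dep]
            if dep_state == unvisited:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        state[name] = done
        return None

    for name in list(dependencies):
        if state[name] == unvisited:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# Hypothetical component -> dependencyRefs mappings.
acyclic = {"new-component": ["existing-component"], "existing-component": []}
cyclic = {"component-a": ["component-b"], "component-b": ["component-a"]}

print(find_cycle(acyclic))
print(find_cycle(cyclic))
```

Returning the offending path, rather than just a boolean, makes a failing test message point directly at the componentRefs that need fixing.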

Running Tests

$ make test                                                               # All tests
$ go test -v ./pkg/recipe/...                                             # Recipe tests only
$ go test -v ./pkg/recipe/... -run TestAllMetadataFilesConformToSchema    # Specific test

Test Workflow

  1. Create recipe file in recipes/overlays/
  2. Run make test to validate
  3. Test generation: aicr recipe --service eks --accelerator gb200 --format yaml
  4. Inspect bundle: aicr bundle -r recipe.yaml -o ./test-bundles

Tests run automatically on PRs, main pushes, and release builds.

Advanced Topics

External Data Sources

Integrators can extend or override embedded recipe data using the --data flag without modifying the OSS codebase. This enables:

  • Custom recipes for proprietary hardware
  • Private component values with organization-specific settings
  • Extended registries with internal Helm charts
  • Rapid iteration without rebuilding binaries

Directory structure

./my-data/
├── registry.yaml             # Extends/overrides component registry
├── overlays/
│   └── custom-recipe.yaml    # New or override existing recipe
├── mixins/
│   └── os-custom.yaml        # Custom mixin fragments
└── components/
    └── my-operator/
        └── values.yaml       # Component values

Usage:

# Recipe generation
$ aicr recipe --service eks --accelerator gb200 --data ./my-data --output recipe.yaml

# Bundle generation
$ aicr bundle --recipe recipe.yaml --data ./my-data --deployer argocd --output ./bundle

# Debug loading
$ aicr --debug recipe --service eks --data ./my-data

Precedence: Embedded data (lowest) → External data (highest)

Behavior:

  • Overlays: Same metadata.name replaces embedded
  • Registry: Merged; same-named components replaced
  • Values: External valuesFile references take precedence

Validation:

$ aicr --debug recipe --service eks --data ./my-data --dry-run
$ aicr recipe --service eks --data ./my-data --output /dev/stdout

Regional registry overrides

A handful of components ship images from regional, account-scoped container registries rather than a single public URI. The clearest example today is the AWS EFA device plugin, whose canonical home is <account>.dkr.ecr.<region>.amazonaws.com/eks/aws-efa-k8s-device-plugin — a per-region private ECR that every EKS node is auto-authorized to pull from. AWS publishes these add-ons regionally for three reasons: pulls go over the AWS internal backbone (no NAT egress), no Docker Hub / public-registry rate limits, and the image stays available even when the public internet or another region is degraded.

AICR ships a sensible default for each such image (e.g., us-west-2 for aws-efa), but customers deploying in a different region need to override the registry’s region segment. Two override paths cover the common cases:

Bundle-time override (single region per bundle). Use --set to bake a specific region into the bundle:

$ aicr bundle --recipe recipe.yaml \
    --set awsefa:image.repository=602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/aws-efa-k8s-device-plugin \
    -o ./bundle

Install-time override (one bundle, many regions). Use --dynamic to declare the path as install-time-fillable, then provide the value via helm install --set (or your GitOps tool):

$ aicr bundle --recipe recipe.yaml \
    --dynamic awsefa:image.repository \
    --deployer helm \
    -o ./bundle

# Per-cluster install
$ helm install ... --set image.repository=602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/aws-efa-k8s-device-plugin

--dynamic is supported with helm and argocd-helm deployers; argocd does not support it (use argocd-helm instead). See Dynamic Install-Time Values for the broader pattern.

Partition-aware variants. Standard AWS uses account ID 602401143452. GovCloud and China use different accounts and URI suffixes:

| Partition | Account ID | URI shape |
| --- | --- | --- |
| aws (standard) | 602401143452 | <account>.dkr.ecr.<region>.amazonaws.com |
| aws-us-gov (GovCloud) | 013241004608 | <account>.dkr.ecr.<region>.amazonaws.com |
| aws-cn (China) | 961992271922 | <account>.dkr.ecr.<region>.amazonaws.com.cn |

Substitute the appropriate account and suffix in the --set / install-time value.
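For teams scripting these overrides, the substitution can be sketched as a small lookup. The account IDs and suffixes come from the partition table above; `efa_repository` and the `PARTITIONS` mapping are hypothetical names, and the regions used are only examples.

```python
# Partition -> (account ID, registry domain suffix), from the partition table.
PARTITIONS = {
    "aws": ("602401143452", "amazonaws.com"),
    "aws-us-gov": ("013241004608", "amazonaws.com"),
    "aws-cn": ("961992271922", "amazonaws.com.cn"),
}


def efa_repository(partition, region):
    """Build the image.repository value for the AWS EFA device plugin."""
    account, suffix = PARTITIONS[partition]
    return f"{account}.dkr.ecr.{region}.{suffix}/eks/aws-efa-k8s-device-plugin"


print(efa_repository("aws", "us-east-1"))
print(efa_repository("aws-cn", "cn-north-1"))
```

The resulting string is what would be passed as the value of awsefa:image.repository at bundle time, or of image.repository at install time.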

Troubleshooting

Debug overlay matching:

$ aicr recipe --service eks --accelerator gb200 --format json | jq '.metadata.appliedOverlays'
$ aicr recipe --service eks --accelerator gb200 --format json | jq '.componentRefs[].version'

Common issues:

| Issue | Solution |
| --- | --- |
| Test: “duplicate criteria” | Combine overlays or differentiate criteria |
| Test: “valuesFile not found” | Create file or fix path in recipe |
| Test: “unknown component” | Use registered bundler name |
| Recipe returns empty | Check criteria fields match query |
| Wrong values in bundle | Verify merge precedence (base → valuesFile → overrides) |

Validation:

$ make qualify                                                  # Full qualification
$ make test                                                     # All tests
$ aicr recipe --service eks --accelerator gb200 --format yaml   # Test generation

See Also