Recipes, Overlays, and Mixins

The recipe data layer is the rule-based engine that turns a Criteria query ({service, accelerator, intent, os, platform, nodes}) into a resolved RecipeResult — the merged spec, component refs, deployment order, and validation phases that aicr bundle consumes.

This page covers everything related to AICR recipes for contributors: the three layers that contribute data (registry, overlay, mixin), the on-disk schemas for each, the resolver’s merge algorithm, and the invariants the resolver enforces. End-user recipe authoring lives in recipe-development.md; this page is for contributors changing recipe content or extending the resolver in pkg/recipe.

Where does my change go? Most changes touch exactly one of three files. Skim the Decision Matrix below before editing: picking the wrong layer leaks defaults across recipes or duplicates content across overlays.

Layered Model

┌───────────────┐   ┌───────────────────┐   ┌───────────────┐
│   Registry    │   │       Mixin       │   │    Overlay    │
│   recipes/    │   │     recipes/      │   │   recipes/    │
│ registry.yaml │   │   mixins/*.yaml   │   │   overlays/   │
│               │   │                   │   │               │
│ Component     │   │ Composable        │   │ Criteria-     │
│ catalog +     │   │ fragment          │   │ matched       │
│ defaults      │   │ (constraints +    │   │ recipe with   │
│ (chart, ns,   │   │ componentRefs)    │   │ spec.base     │
│ scheduling)   │   │                   │   │ inheritance   │
└───────┬───────┘   └─────────┬─────────┘   └───────┬───────┘
        │                     │                     │
        └─────────────────────┼─────────────────────┘
                              ▼
                    ┌───────────────────┐
                    │   RecipeResult    │
                    │ (merged, ordered, │
                    │    validated)     │
                    └───────────────────┘

Resolution: the resolver loads the base spec (overlays/base.yaml) as the merge seed, then merges each matching overlay’s inheritance chain on top (base → … → leaf), then applies the leaf’s mixins, then finally injects registry defaults for any component field the chain left unset. Per-component values files (recipes/components/<name>/) are pulled in at bundle time, not at recipe resolution.

Decision Matrix

| | Registry entry | Overlay | Mixin |
|---|---|---|---|
| Purpose | Make a chart/kustomization available to recipes; set component-wide defaults | Pin versions, set values, scope by criteria | Share constraints or componentRefs across overlays |
| Authority | Component-wide (one entry per component) | Criteria-matched (selected by query) | Opt-in (referenced via spec.mixins) |
| File | recipes/registry.yaml (one entry) | recipes/overlays/<name>.yaml (one file per shape) | recipes/mixins/<name>.yaml (one file per fragment) |
| Lifecycle | Add once; bump defaultVersion on chart upgrade | Add per cluster shape; cull when shape retires | Stable; new mixin only when ≥ 2 leaves duplicate the same block |
| Kind | ComponentRegistry | RecipeMetadata | RecipeMixin |
| Carries criteria? | No | Yes (spec.criteria) | No (rejected at load) |
| Carries base? | No | Yes (single-parent chain) | No |
| Example | "make gpu-operator available, default to chart v25.10.1" | "for eks + gb200 + training + ubuntu, pin K8s ≥ 1.32.4" | "for any overlay opting in via mixins: [os-ubuntu], require Ubuntu 24.04" |

Rule of thumb: a change targeting all recipes goes in registry; a change targeting one cluster shape goes in an overlay; a change shared by ≥ 2 overlays as an opt-in fragment goes in a mixin.

Registry (recipes/registry.yaml)

The registry is the component catalog. Each entry declares a chart or kustomization the resolver can reference and supplies defaults the resolver injects into any ComponentRef that leaves the field unset.

Top-level schema (ComponentRegistry):

apiVersion: aicr.nvidia.com/v1alpha1
kind: ComponentRegistry
components:
  - name: <component-id>
    ...

ComponentConfig fields (see pkg/recipe/components.go):

| Field | Type | Required | Purpose |
|---|---|---|---|
| name | string | yes | Component identifier (matches componentRefs[].name in overlays) |
| displayName | string | yes | Human label used in CLI output and bundle templates |
| valueOverrideKeys | []string | no | Alt keys for --set <key>:path=value (e.g., gpuoperator) |
| helm | HelmConfig | one of helm/kustomize | Chart defaults (see below) |
| kustomize | KustomizeConfig | one of helm/kustomize | Kustomization defaults (see below) |
| nodeScheduling | NodeSchedulingConfig | no | Helm value paths for injecting selectors/tolerations/taints (system, accelerated, plus nodeCountPaths) |
| podScheduling | PodSchedulingConfig | no | Helm value paths for workload pod scheduling injection |
| storageClassPaths | []string | no | Helm value paths where --storage-class is written |
| validations | []ComponentValidationConfig | no | Bundle-time validation checks (function, severity, conditions, message) |
| healthCheck.assertFile | string | yes | Chainsaw assert YAML (relative to data dir) consumed by aicr validate --phase deployment (runtime, #1220) and by make check-health locally. Content is restricted to the read-only assert / error operation allowlist. Enforced at PR time by pkg/recipe.TestComponentRegistry_RequiresHealthCheck (every component must declare a path) and validators/chainsaw.TestValidateTestReadOnly_RegistryContent (every declared path must pass the allowlist); see #1223. |
| gkeCriticalPriority | bool | no | Synthesize ResourceQuota on GKE so system-*-critical pods admit |
| hasSelfRefCRDs | bool | no | Tells helmfile to emit disableValidation: true (chart ships CRD + CR in same release) |

HelmConfig: defaultRepository, defaultChart, defaultVersion, defaultNamespace. KustomizeConfig: defaultSource, defaultPath, defaultTag. A component must have either helm or kustomize, not both.

pkg/component/generic.go also defines a ComponentConfig, marked Deprecated:. That is a separate legacy type, unused in production; the live ComponentConfig is the one in pkg/recipe/components.go.

Defaults flow into a ComponentRef only when the field is empty — see applyRegistryDefaults below.
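The fill-only-when-empty rule can be pictured with a minimal sketch. The ComponentRef and HelmDefaults structs and the fillDefaults helper below are simplified, hypothetical stand-ins for illustration; they are not the real types or the real applyRegistryDefaults in pkg/recipe.

```go
package main

import "fmt"

// ComponentRef is a simplified, hypothetical stand-in for the real type.
type ComponentRef struct {
	Name, Chart, Version, Namespace string
}

// HelmDefaults mirrors a few of the HelmConfig default fields named above.
type HelmDefaults struct {
	DefaultChart, DefaultVersion, DefaultNamespace string
}

// fillDefaults copies a registry default into ref only where the
// overlay chain left the field empty; explicit values always survive.
func fillDefaults(ref ComponentRef, d HelmDefaults) ComponentRef {
	if ref.Chart == "" {
		ref.Chart = d.DefaultChart
	}
	if ref.Version == "" {
		ref.Version = d.DefaultVersion
	}
	if ref.Namespace == "" {
		ref.Namespace = d.DefaultNamespace
	}
	return ref
}

func main() {
	defaults := HelmDefaults{
		DefaultChart:     "gpu-operator",
		DefaultVersion:   "v25.10.1",
		DefaultNamespace: "gpu-operator",
	}
	// The overlay pinned a version but left chart and namespace unset.
	ref := ComponentRef{Name: "gpu-operator", Version: "v25.10.0"}
	fmt.Printf("%+v\n", fillDefaults(ref, defaults))
}
```

The pinned version is kept while chart and namespace come from the registry, which is why an overlay only needs to declare what differs from the component-wide defaults.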

Overlay (recipes/overlays/)

An overlay is a RecipeMetadata document with a spec.criteria block that selects it for matching queries. Overlays live in recipes/overlays/ and inherit single-parent via spec.base.

kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: gb200-eks-ubuntu-training
spec:
  base: gb200-eks-training   # inheritance chain root → leaf
  mixins:                    # composed AFTER inheritance merge
    - os-ubuntu
  criteria:
    service: eks             # query → overlay selection
    accelerator: gb200
    os: ubuntu
    intent: training
    # platform: kubeflow     # optional 6th dimension
  constraints:               # OS/K8s/GPU/SystemD constraints
    - name: K8s.server.version
      value: ">= 1.32.4"
  componentRefs: []          # overrides on inherited components
  validation:                # per-phase validation config
    readiness: { ... }
    deployment: { ... }
    performance: { ... }
    conformance: { ... }

Criteria fields (see pkg/recipe/criteria.go type Criteria):

| Field | Type | Wildcard | Static OSS values |
|---|---|---|---|
| service | CriteriaServiceType | any or empty | eks, gke, aks, oke, kind, lke, bcm |
| accelerator | CriteriaAcceleratorType | any or empty | h100, h200, gb200, b200, a100, l40, rtx-pro-6000 |
| intent | CriteriaIntentType | any or empty | training, inference |
| os | CriteriaOSType | any or empty | ubuntu, rhel, cos, amazonlinux, talos |
| platform | CriteriaPlatformType | any or empty | dynamo, kubeflow, nim, runai, slurm |
| nodes | int | 0 | any positive int |

--data overlays may contribute additional values via the criteria registry — Has(FieldX, ...) is consulted when a value misses the fast-path switch in Parse<X>. Adding a new value to a Go enum (e.g., a new accelerator) is multi-file work; audit CriteriaAccelerator* callers as listed in CLAUDE.md before merging.

Specificity. Each Criteria value carries a specificity score equal to the count of its non-any, non-empty fields. The current Specificity() in criteria.go counts six fields: service, accelerator, intent, os, platform, nodes. Overlays are sorted by specificity ascending, so less-specific overlays merge first and more-specific overlays override them.
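A minimal sketch of the scoring and ordering, using hypothetical simplified types rather than the real ones in criteria.go. It assumes nodes contributes to the score when it is greater than zero, since the table above lists 0 as the nodes wildcard.

```go
package main

import (
	"fmt"
	"sort"
)

// Criteria is a simplified stand-in using plain strings instead of the typed enums.
type Criteria struct {
	Service, Accelerator, Intent, OS, Platform string
	Nodes                                      int
}

// Overlay pairs a name with its criteria (hypothetical type for this sketch).
type Overlay struct {
	Name     string
	Criteria Criteria
}

// specificity counts the fields that are neither empty nor "any".
// Assumption: Nodes counts when it is greater than zero.
func specificity(c Criteria) int {
	n := 0
	for _, v := range []string{c.Service, c.Accelerator, c.Intent, c.OS, c.Platform} {
		if v != "" && v != "any" {
			n++
		}
	}
	if c.Nodes > 0 {
		n++
	}
	return n
}

// sortBySpecificity orders overlays so the least specific merges first.
func sortBySpecificity(overlays []Overlay) {
	sort.SliceStable(overlays, func(i, j int) bool {
		return specificity(overlays[i].Criteria) < specificity(overlays[j].Criteria)
	})
}

func main() {
	overlays := []Overlay{
		{Name: "gb200-eks-training", Criteria: Criteria{Service: "eks", Accelerator: "gb200", Intent: "training"}},
		{Name: "gb200-any", Criteria: Criteria{Service: "any", Accelerator: "gb200"}},
	}
	sortBySpecificity(overlays)
	for _, o := range overlays {
		fmt.Println(specificity(o.Criteria), o.Name)
	}
}
```

A stable sort keeps overlays with equal scores in their incoming order, which helps keep the merge order reproducible.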

Matching is asymmetric. Recipe-side any is a wildcard (matches anything in the query); query-side any is not a wildcard (matches only recipe-side any). A generic query never resolves to a hardware-specific recipe. See MatchesCriteriaField in criteria.go.
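The asymmetry is easiest to see per field. The helper below is an illustrative sketch of the rule as stated here, not the actual MatchesCriteriaField implementation; the name matchesField and the string-typed arguments are assumptions.

```go
package main

import "fmt"

// isWildcard treats both "any" and the empty string as unset.
func isWildcard(v string) bool { return v == "" || v == "any" }

// matchesField applies the asymmetric rule for one criteria dimension:
// a wildcard on the recipe side accepts every query value, while a
// wildcard on the query side is accepted only by a recipe-side wildcard.
func matchesField(recipe, query string) bool {
	if isWildcard(recipe) {
		return true
	}
	if isWildcard(query) {
		return false
	}
	return recipe == query
}

func main() {
	fmt.Println(matchesField("any", "eks"))     // recipe wildcard, concrete query
	fmt.Println(matchesField("gb200", "any"))   // concrete recipe, generic query
	fmt.Println(matchesField("gb200", "gb200")) // exact match
}
```

The second case is the one that keeps a generic query from resolving to a hardware-specific recipe.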

Inheritance. spec.base walks a single-parent chain from leaf → … → base (the root spec, held separately on the metadata store). Cycles are detected at catalog load. Per-field merge: constraints merge by name (later wins on same name; new appended); componentRefs merge by name field-by-field; criteria are not inherited (each recipe declares its own).
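The merge-by-name rule for constraints (later wins on the same name, new names are appended) can be sketched as follows. Constraint and mergeConstraints are hypothetical names for illustration and do not reproduce the real Merge method.

```go
package main

import "fmt"

// Constraint is a simplified stand-in for a recipe constraint entry.
type Constraint struct {
	Name, Value string
}

// mergeConstraints returns a new slice in which entries from child replace
// same-named entries from parent in place, and unseen names are appended.
// Neither input slice is modified.
func mergeConstraints(parent, child []Constraint) []Constraint {
	out := make([]Constraint, len(parent), len(parent)+len(child))
	copy(out, parent)
	index := make(map[string]int, len(out))
	for i, c := range out {
		index[c.Name] = i
	}
	for _, c := range child {
		if i, ok := index[c.Name]; ok {
			out[i] = c // later wins on the same name
			continue
		}
		index[c.Name] = len(out)
		out = append(out, c)
	}
	return out
}

func main() {
	parent := []Constraint{{"K8s.server.version", ">= 1.30"}, {"OS.release.ID", "ubuntu"}}
	child := []Constraint{{"K8s.server.version", ">= 1.32.4"}, {"OS.release.VERSION_ID", "24.04"}}
	for _, c := range mergeConstraints(parent, child) {
		fmt.Printf("%s %s\n", c.Name, c.Value)
	}
}
```

Replacing in place preserves the position of the inherited entry, so the merged list stays in a stable order across the chain.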

Leaf. A leaf is the most specific overlay in a chain — the terminal node carrying fully-qualified criteria (every relevant dimension set, e.g. service + accelerator + os + intent + platform) that an end-user query actually resolves to. A leaf usually adds little of its own (often componentRefs: []); its job is to bind one inheritance chain plus its mixins under a single criteria fingerprint. “Base → … → leaf” throughout this page refers to walking from the root spec down to this node. Leaf is a role, not a distinct kind — every overlay is a RecipeMetadata; “leaf” just names the ones at the end of a chain.

Mixin Composition

Inheritance is single-parent, which means cross-cutting concerns (OS constraints, platform components) would otherwise duplicate across every leaf. Mixins are composable fragments referenced via spec.mixins. They live in recipes/mixins/ and use kind RecipeMixin.

# recipes/mixins/os-ubuntu.yaml
kind: RecipeMixin
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: os-ubuntu
spec:
  constraints:
    - name: OS.release.ID
      value: ubuntu
    - name: OS.release.VERSION_ID
      value: "24.04"
  componentRefs: []   # optional

Mixin files currently in the tree: os-ubuntu, os-talos, platform-inference, platform-kubeflow.

Mixin rules:

  • A mixin carries only constraints and componentRefs. Setting criteria, base, mixins, or validation is rejected at load.
  • Resolution order: base chain merged first, then mixins applied to the merged result. A leaf adopts a mixin by listing its file basename in spec.mixins.
  • Mixin componentRefs are restricted to additive merges via mixinComponentRefSafeForMerge (see pkg/recipe/metadata_store.go). A mixin componentRef may only set name, namespace, manifestFiles, preManifestFiles. Setting any of chart, type, source, version, tag, path, valuesFile, overrides, patches, dependencyRefs, cleanup, expectedResources, healthCheckAsserts is rejected at compose time — those fields silently override the chain’s chosen chart, so the resolver names the offending field and refuses to merge (see ADR-005 “Silent constraint override” mitigation).
  • When a snapshot evaluator is wired in, mixin constraints are evaluated against it after merging; failure invalidates the entire composed candidate. In plain query mode mixin constraints are merged but not evaluated.
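The additive-only restriction on mixin componentRefs amounts to an allowlist check. The sketch below models a componentRef as a generic map and uses a hypothetical checkMixinRef helper; the real mixinComponentRefSafeForMerge operates on the typed struct and may differ in detail.

```go
package main

import (
	"fmt"
	"sort"
)

// mixinAllowed lists the only fields a mixin componentRef may set,
// taken from the rule stated above.
var mixinAllowed = map[string]bool{
	"name":             true,
	"namespace":        true,
	"manifestFiles":    true,
	"preManifestFiles": true,
}

// checkMixinRef returns an error naming the first disallowed field.
// Keys are sorted so the reported field is the same on every run.
func checkMixinRef(ref map[string]any) error {
	keys := make([]string, 0, len(ref))
	for k := range ref {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		if !mixinAllowed[k] {
			return fmt.Errorf("mixin componentRef %v sets disallowed field %q", ref["name"], k)
		}
	}
	return nil
}

func main() {
	ok := map[string]any{"name": "gpu-operator", "manifestFiles": []any{"extra.yaml"}}
	bad := map[string]any{"name": "gpu-operator", "version": "v25.10.1"}
	fmt.Println(checkMixinRef(ok))
	fmt.Println(checkMixinRef(bad))
}
```

Failing with the field name, rather than dropping the field, keeps a mixin from silently changing which chart a chain selected.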

Criteria Wildcard Overlays

Some overlays apply across an entire criteria dimension without being referenced via spec.base or spec.mixins. The resolver picks them up automatically because FindMatchingOverlays returns all maximal matches, not just the most specific one. Two wildcard patterns in the tree today: gb200-any.yaml (matches service: any) and monitoring-hpa.yaml (matches intent: any).

# recipes/overlays/gb200-any.yaml
spec:
  base: base
  criteria:
    service: any        # wildcard: matches eks, oke, gke, ...
    accelerator: gb200
  validation:
    deployment:
      constraints:
        - name: Deployment.gpu-operator.version
          value: ">= v25.10.0"

For a query {service: eks, accelerator: gb200, intent: training}, the resolver returns three independent maximal leaves — gb200-eks-training (matched by explicit criteria), gb200-any (matched by service: any), and monitoring-hpa (matched by intent: any). Each leaf’s inheritance chain is resolved separately and merged onto the base spec in specificity order.

Maximal-leaf filter. filterToMaximalLeaves (in metadata_store.go) drops any match that is a transitive spec.base ancestor of another match — ancestors re-enter the output via chain resolution, so keeping them as separate matches would double-count their contributions. Independent leaves on unrelated chains (wildcard + explicit) are kept; one is not an ancestor of the other.
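A minimal sketch of the ancestor-dropping idea, assuming overlays are identified by name and that a parent map records each overlay's spec.base. The dropAncestors helper and the parent relationships in main are illustrative, not the real filterToMaximalLeaves.

```go
package main

import "fmt"

// dropAncestors removes every match that is a transitive spec.base
// ancestor of another match. parent maps overlay name -> its spec.base
// (absent means no parent). It assumes the chain has no cycles, which
// the document says is checked at catalog load.
func dropAncestors(matches []string, parent map[string]string) []string {
	ancestors := make(map[string]bool)
	for _, m := range matches {
		for p := parent[m]; p != ""; p = parent[p] {
			ancestors[p] = true
		}
	}
	kept := make([]string, 0, len(matches))
	for _, m := range matches {
		if !ancestors[m] {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	// Assumed parent links for illustration only.
	parent := map[string]string{
		"gb200-eks-ubuntu-training": "gb200-eks-training",
		"gb200-eks-training":        "base",
		"gb200-any":                 "base",
	}
	matches := []string{"gb200-eks-training", "gb200-eks-ubuntu-training", "gb200-any"}
	fmt.Println(dropAncestors(matches, parent))
}
```

Here gb200-eks-training is dropped because it re-enters through the chain of gb200-eks-ubuntu-training, while gb200-any survives as an independent leaf on an unrelated chain.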

When to use a wildcard overlay vs a mixin:

| Use a criteria-wildcard overlay when… | Use a mixin when… |
|---|---|
| Content applies based on query criteria | Content applies based on explicit opt-in |
| Consumer set is determined by matching | Consumer set is an enumerated list of leaves |
| Adopt-by-default is desired for new matching overlays | Each consumer should reference it explicitly |
| You need a validation block (mixins can’t carry one) | You only need constraints / componentRefs |

Precedence. Leaves merge in specificity-ascending order, so a service-specific leaf overrides the wildcard on same-named constraints. spec.validation.<phase> blocks merge per-field: checks and constraints union (nil = inherit, [] = clear, non-empty = union); nodeSelection and infrastructure are wholesale-replace. Don’t carry per-fabric values in a wildcard (NCCL bandwidth thresholds differ per service); reserve wildcards for content genuinely uniform across the wildcard dimension.
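The three-state list rule (nil = inherit, [] = clear, non-empty = union) depends on Go distinguishing a nil slice from an empty one. The sketch below illustrates it with plain strings; mergeList is a hypothetical helper, and the real check and constraint entries are structured values rather than strings.

```go
package main

import "fmt"

// mergeList applies the three-state rule to one validation list:
//   - override == nil      -> keep what was inherited
//   - override is empty    -> clear the list
//   - override has entries -> union, inherited entries first, no duplicates
func mergeList(inherited, override []string) []string {
	if override == nil {
		return inherited
	}
	if len(override) == 0 {
		return []string{}
	}
	seen := make(map[string]bool)
	out := make([]string, 0, len(inherited)+len(override))
	for _, v := range append(append([]string{}, inherited...), override...) {
		if !seen[v] {
			seen[v] = true
			out = append(out, v)
		}
	}
	return out
}

func main() {
	inherited := []string{"gpu-operator-ready", "nccl-bandwidth"}
	fmt.Println(mergeList(inherited, nil))                          // inherit
	fmt.Println(mergeList(inherited, []string{}))                   // clear
	fmt.Println(mergeList(inherited, []string{"nccl-bandwidth", "hpa-ready"})) // union
}
```

In YAML terms, omitting the key yields the nil case and writing an explicit empty list yields the clear case, so the two must not be conflated when parsing. The check names used in main are invented for the example.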

Merge Algorithm

The resolver lives in pkg/recipe/metadata_store.go. The merge proceeds in fixed precedence (low → high):

registry defaults → mixin → base chain → overlay leaf → CLI/API --set
(lowest priority)                                  (highest priority)

Each step wins over everything to its left — --set overrides the overlay leaf, the leaf overrides the base chain, and so on. Read as priority, not as temporal order.

Implementation notes:

  1. Seed. initBaseMergedSpec() clones s.Base (parsed from overlays/base.yaml) into the merge target. The base spec is held separately on the metadata store; it is not an overlay candidate in FindMatchingOverlays.
  2. Chain merge. For each maximal leaf, the inheritance chain is walked root → leaf and mergedSpec.Merge(&recipe.Spec) is called for each. Same-named constraints/componentRefs override; new entries append.
  3. Mixin merge. mergeMixins(mergedSpec) walks spec.mixins on the leaf, loads each from recipes/mixins/, and appends. mixinComponentRefSafeForMerge rejects mixin componentRefs that touch identity/sourcing fields.
  4. Registry defaults. applyRegistryDefaults(provider, refs) fills in chart/version/namespace/source/tag/path defaults for any ComponentRef field still empty after the chain merge. Failure to load the registry is propagated, not swallowed — partial refs would fail downstream far from the root cause.
  5. Topological sort. TopologicalSort() orders components by dependencyRefs for the final DeploymentOrder. Cycles produce ErrCodeInvalidRequest.
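The final ordering step can be sketched with Kahn's algorithm. This is an illustration under assumptions, not the real TopologicalSort(): the deploymentOrder name, the map-based input, the plain error values, and every component name other than gpu-operator are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// deploymentOrder returns component names so that every component comes
// after the components it depends on. deps maps name -> dependency names.
// Ready components are processed in sorted order so the result is stable.
func deploymentOrder(deps map[string][]string) ([]string, error) {
	indegree := make(map[string]int, len(deps))
	dependents := make(map[string][]string)
	for name, ds := range deps {
		indegree[name] = len(ds)
		for _, d := range ds {
			if _, ok := deps[d]; !ok {
				return nil, fmt.Errorf("component %q depends on unknown %q", name, d)
			}
			dependents[d] = append(dependents[d], name)
		}
	}
	var ready []string
	for name, n := range indegree {
		if n == 0 {
			ready = append(ready, name)
		}
	}
	sort.Strings(ready)
	order := make([]string, 0, len(deps))
	for len(ready) > 0 {
		next := ready[0]
		ready = ready[1:]
		order = append(order, next)
		for _, m := range dependents[next] {
			indegree[m]--
			if indegree[m] == 0 {
				ready = append(ready, m)
			}
		}
		sort.Strings(ready)
	}
	if len(order) != len(deps) {
		return nil, errors.New("dependency cycle detected")
	}
	return order, nil
}

func main() {
	order, err := deploymentOrder(map[string][]string{
		"gpu-operator":     {"cert-manager"},
		"network-operator": {"cert-manager"},
		"cert-manager":     {},
	})
	fmt.Println(order, err)
}
```

Any component left with unresolved dependencies after the queue drains must be part of a cycle, which is how the sketch detects one without a separate traversal.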

Deep-copy semantics. deepMergeMap (metadata.go) recurses into nested map[string]any. Non-map values (scalars and []any) are deep-copied via serializer.DeepCopyAny so dst never aliases src’s slice values. This matters: copying []any by reference during overlay merge would let a downstream mutation (e.g., bundler appending a toleration) leak back into the cached source map and corrupt subsequent queries. The CLAUDE.md anti-patterns list calls this out — any new helper that touches overlay-derived maps must follow the same rule.
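A minimal sketch of a merge that follows this rule. The deepCopy and deepMerge helpers below are illustrative stand-ins; they are not the real deepMergeMap or serializer.DeepCopyAny.

```go
package main

import "fmt"

// deepCopy recursively copies maps and slices so the result shares no
// mutable state with the input. Scalars are returned as-is.
func deepCopy(v any) any {
	switch t := v.(type) {
	case map[string]any:
		m := make(map[string]any, len(t))
		for k, e := range t {
			m[k] = deepCopy(e)
		}
		return m
	case []any:
		s := make([]any, len(t))
		for i, e := range t {
			s[i] = deepCopy(e)
		}
		return s
	default:
		return v
	}
}

// deepMerge merges src into dst. Nested maps are merged recursively;
// every other value is deep-copied so dst never aliases src.
func deepMerge(dst, src map[string]any) {
	for k, sv := range src {
		srcMap, srcIsMap := sv.(map[string]any)
		dstMap, dstIsMap := dst[k].(map[string]any)
		if srcIsMap && dstIsMap {
			deepMerge(dstMap, srcMap)
			continue
		}
		dst[k] = deepCopy(sv)
	}
}

func main() {
	cached := map[string]any{"tolerations": []any{"nvidia.com/gpu"}}
	merged := map[string]any{}
	deepMerge(merged, cached)

	// A downstream mutation of the merged copy must not reach the cache.
	merged["tolerations"].([]any)[0] = "changed"
	fmt.Println(cached["tolerations"], merged["tolerations"])
}
```

Had the slice been assigned by reference, the element write in main would have changed the cached map as well, which is the corruption described above.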

Determinism

Recipe output is reproducible: same inputs → same bytes. The data layer enforces this via two rules.

Use serializer.MarshalYAMLDeterministic for any output that feeds a digest, signature, OCI manifest, or fingerprint. yaml.v3 walks Go maps in randomized order, so two consecutive marshals of the same map[string]any produce different byte sequences. Plain yaml.Marshal is fine for human-readable scratch output but is a correctness bug anywhere a downstream consumer hashes the bytes.

Per-dimension ordered lists, not unordered maps. RecipeResult fields like appliedOverlays, componentRefs, deploymentOrder, and the per-dimension fingerprint diff are ordered slices, not maps, so iteration is deterministic.
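Both rules come down to never letting map iteration order reach hashed bytes, since Go leaves the iteration order of a map unspecified. The sketch below shows the general technique with the standard library only; canonicalBytes and digest are hypothetical helpers and do not reproduce serializer.MarshalYAMLDeterministic.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// canonicalBytes renders a flat map with keys in sorted order, so the
// same contents always produce the same byte sequence.
func canonicalBytes(m map[string]string) []byte {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s: %q\n", k, m[k])
	}
	return []byte(b.String())
}

// digest hashes the canonical rendering.
func digest(m map[string]string) string {
	sum := sha256.Sum256(canonicalBytes(m))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := map[string]string{"service": "eks", "accelerator": "gb200"}
	b := map[string]string{"accelerator": "gb200", "service": "eks"}
	fmt.Println(digest(a) == digest(b))
}
```

A serializer that ranges over the map directly would give no such guarantee, so two builds of the same recipe could yield different digests.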

Recipe Store Immutability

The metadata store is read-only after init. LoadMetadataStoreFor(dp) returns a sync.Once-cached *MetadataStore per DataProvider identity, so concurrent recipe builds against the same provider share the store without locks. Per-request mutations (chain resolution, constraint evaluation, registry defaulting) happen on clones, never on the cached spec.

Deferred registration. pendingRegistryEntry stages each overlay’s criteria for the per-provider criteria registry before registration. The actual Register(field, value, origin) calls only fire after every overlay parses cleanly, the base recipe is present, and dependency validation passes. Partial catalog loads never leak into the registry; a malformed overlay does not poison criteria validation for the next process.

Eviction. EvictCachedStore(provider) and EvictCachedRegistry(provider) drop a single provider’s cache entry without disturbing other providers. Use after rewriting a --data overlay on disk.
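The load-once and evict behavior described in this section can be sketched as a per-key sync.Once cache. Everything named below (Store, loadStoreFor, evictStore, the string provider key) is a hypothetical simplification; the real LoadMetadataStoreFor and EvictCachedStore key on DataProvider identity and may be structured differently.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a stand-in for the read-only metadata store.
type Store struct {
	Overlays []string
}

// entry guards a single provider's load with its own sync.Once.
type entry struct {
	once  sync.Once
	store *Store
	err   error
}

var (
	mu    sync.Mutex
	cache = map[string]*entry{}
)

// loadStoreFor runs load at most once per provider key; concurrent
// callers for the same key block on the Once and share the result.
func loadStoreFor(providerID string, load func() (*Store, error)) (*Store, error) {
	mu.Lock()
	e, ok := cache[providerID]
	if !ok {
		e = &entry{}
		cache[providerID] = e
	}
	mu.Unlock()
	e.once.Do(func() { e.store, e.err = load() })
	return e.store, e.err
}

// evictStore drops one provider's entry; other providers are untouched.
func evictStore(providerID string) {
	mu.Lock()
	delete(cache, providerID)
	mu.Unlock()
}

func main() {
	loads := 0
	load := func() (*Store, error) {
		loads++
		return &Store{Overlays: []string{"base"}}, nil
	}
	loadStoreFor("embedded", load)
	loadStoreFor("embedded", load)
	fmt.Println("loads after two calls:", loads)
	evictStore("embedded")
	loadStoreFor("embedded", load)
	fmt.Println("loads after evict:", loads)
}
```

The mutex only protects the map lookup; the expensive load runs outside it, so loading one provider does not block lookups for another. Callers that already hold a pointer to an evicted store keep a consistent view because the store itself is never mutated.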

Observable RecipeResult Surfaces

RecipeResult (in pkg/recipe/metadata.go) is the resolver’s externally-visible product. Fields beyond ComponentRefs and DeploymentOrder that contributors should be aware of:

| Field | Purpose |
|---|---|
| Metadata.AppliedOverlays | Ordered list of overlays merged into this result (base first, leaf last). |
| Metadata.ExcludedOverlays | Overlays that matched criteria but were dropped (e.g., a mixin constraint failed against the snapshot). Each carries a typed Reason (constraint-failed, mixin-constraint-failed). |
| Metadata.ConstraintWarnings | Per-constraint detail for excluded overlays (overlay, constraint name, expected vs actual, reason text). |
| Validation | Multi-phase config (readiness, deployment, performance, conformance) inherited from overlay metadata. |
| owner (unexported) | *Builder that produced this result. AssertOwnedBy(b) enforces ownership: two builders bound to different DataProviders must not cross-read each other’s results. |
| provider (unexported) | DataProvider that produced this result; accessed via (*RecipeResult).DataProvider(). Lets GetValuesForComponent route file reads through the originating provider even after the package-global has rotated. |

ComponentRef extras beyond the chart-identity fields:

| Field | Purpose |
|---|---|
| ManifestFiles | Extra manifest files to bundle at sync-wave N+1 (after primary chart). Additive merge, dedup. |
| PreManifestFiles | Manifest files to bundle at sync-wave N-1 (before primary chart), e.g., a Namespace with PSS labels the chart pods need. Additive merge, dedup; .. segments rejected at load. |
| ExpectedResources | List of {Kind, Name, Namespace} the deployment phase validator asserts exist. Overlay wholesale-replaces. |
| HealthCheckAsserts | Raw Chainsaw assert YAML loaded from the registry’s healthCheck.assertFile; overlay wins if set. |
| Cleanup | Bundler uninstalls this component after validation (used for ephemeral validators like nccl-doctor). |

Adding a Recipe

  1. Decide registry vs overlay vs mixin (decision matrix).
  2. Write the YAML in the correct directory. For an overlay, set spec.base to the most specific shared ancestor and let the chain carry shared constraints; only declare what differs.
  3. Ship the chainsaw health check (registry entries only). Every new component in recipes/registry.yaml MUST declare healthCheck.assertFile pointing at recipes/checks/<name>/health-check.yaml, and that file MUST use only the read-only assert / error operation allowlist (no script, apply, wait, command, etc. — see validators/chainsaw/allowlist.go). The contract is enforced at PR time by pkg/recipe.TestComponentRegistry_RequiresHealthCheck and validators/chainsaw.TestValidateTestReadOnly_RegistryContent — both gate make qualify. See #1223 and the chainsaw health check section in /aicr/contributor-guide/validators for the assertion patterns currently in use (DaemonSet numberReady == desiredNumberScheduled, Deployment Available=True, CRD Established=True).
  4. Run make bom-docs and commit docs/user/container-images.md if your change touches registry.yaml, a component’s values.yaml, or a chart version pin (see BOM regeneration).
  5. Unit tests. make test runs the recipe-resolution suite — pkg/recipe/yaml_test.go (static catalog: parse, refs, enum values, inheritance depth, no cycles) and pkg/recipe/metadata_test.go (runtime merge, topological sort). Both gate make qualify. If your change adds a registry entry, a new overlay file, or a mixin, the static suite typically picks it up without new test code.
  6. Integration validation. For a new chart pin, run make qualify and let the e2e pipeline render the bundle. KWOK simulated clusters (make kwok-e2e RECIPE=<name>) catch most resolution regressions without GPU hardware.

BOM Regeneration

docs/user/container-images.md is auto-generated from the actual rendered Helm templates of every chart referenced by the registry. It is regenerated by make bom-docs.

Run make bom-docs and commit the regenerated docs/user/container-images.md in the same PR whenever you:

  • Add or remove a component in recipes/registry.yaml
  • Bump a chart version (in registry.yaml, an overlay, or a mixin)
  • Modify a component’s values.yaml in a way that changes which images render (image repo override, subchart enable/disable, etc.)

The regen can also surface drift from upstream chart updates — when a chart bumps an image inside its own templates without a registry pin change on our side. That drift will appear in the BOM diff whether you expected it or not.

Freshness is not gated at merge time. make bom-check verifies the committed BOM matches a fresh regen, but it is opt-in only — not wired into make qualify, make lint, or the PR gate. Do not rely on local qualify or CI to catch a missed regen. Wiring bom-check into the gate is a desirable follow-up.

Common Pitfalls

  • Skipping make bom-docs after a chart pin or values change. The diff doesn’t surface in qualify; the BOM goes stale silently.
  • Mutating in place during merge. Overlay-derived map[string]any and []any must be deep-copied, not aliased. deepMergeMap does this for you; a bespoke helper that recurses into maps but copies []any by reference will alias and corrupt the cached source map.
  • Plain yaml.Marshal on output that feeds a digest. Use serializer.MarshalYAMLDeterministic for any byte sequence a downstream consumer hashes (evidence predicate body, OCI manifest, signature input, fingerprint).
  • Adding a new criteria value to the Go enum but missing call sites. A new accelerator, OS, intent, or platform value is enumerated in many files — the criteria registry, OpenAPI spec, every docs page that lists current values, issue templates, the Specificity() helper. Start from the Go type in criteria.go and follow the audit list in CLAUDE.md.
  • Setting identity fields in a mixin componentRef. A mixin may not set chart, version, valuesFile, etc. — the resolver rejects with the offending field name. Move chart-changing logic to an overlay.
  • Assuming the cluster fingerprint is trustworthy. The fingerprint block persisted in aicr snapshot output is advisory; trust-bearing consumers recompute via fingerprint.FromMeasurements(...) before acting. See the collector docs and ADR-007 for details.

See Also