Data Architecture

This document describes the recipe metadata system used by the CLI and API to generate optimized system configuration recommendations (i.e. recipes) based on environment parameters.

Overview

The recipe system is a rule-based configuration engine that generates tailored system configurations by:

  1. Starting with a base recipe - Universal component definitions and constraints applicable to every recipe
  2. Matching environment-specific overlays - Targeted configurations based on query criteria (service, accelerator, OS, intent)
  3. Resolving inheritance chains - Overlays can inherit from intermediate recipes to reduce duplication
  4. Merging configurations - Components, constraints, and values are merged with overlay precedence
  5. Computing deployment order - Topological sort of components based on dependency references

The recipe data is organized in recipes/ as multiple YAML files:

recipes/
├── registry.yaml                          # Component registry (Helm & Kustomize configs)
├── overlays/                              # Recipe overlays (including base)
│   ├── base.yaml                          # Root recipe - all recipes inherit from this
│   ├── eks.yaml                           # EKS-specific settings
│   ├── eks-training.yaml                  # EKS + training workloads (inherits from eks)
│   ├── gb200-eks-ubuntu-training.yaml     # GB200/EKS/Ubuntu/training (inherits from eks-training)
│   └── h100-ubuntu-inference.yaml         # H100/Ubuntu/inference
├── mixins/                                # Composable mixin fragments (kind: RecipeMixin)
│   ├── os-ubuntu.yaml                     # Ubuntu OS constraints (shared by leaf overlays)
│   ├── platform-inference.yaml            # Inference gateway components (shared by service-inference overlays)
│   └── platform-kubeflow.yaml             # Kubeflow trainer component (shared by leaf overlays)
└── components/                            # Component values files
    ├── cert-manager/
    │   └── values.yaml
    ├── gpu-operator/
    │   ├── values.yaml                    # Base GPU Operator values
    │   └── values-eks-training.yaml       # EKS training-optimized values
    ├── network-operator/
    │   └── values.yaml
    ├── nvidia-dra-driver-gpu/
    │   └── values.yaml
    ├── nvsentinel/
    │   └── values.yaml
    └── nodewright-operator/
        └── values.yaml

Note: These files are embedded into both the CLI binary and API server at compile time, making the system fully self-contained with no external dependencies.

Extensibility: The embedded data can be extended or overridden using the --data flag. See External Data Provider for details.

Recipe Usage Patterns:

  1. CLI Query Mode - Direct recipe generation from criteria parameters:

    $ aicr recipe --os ubuntu --accelerator h100 --service eks --intent training

  2. CLI Snapshot Mode - Analyze captured system state to infer criteria:

    $ aicr snapshot --output system.yaml
    $ aicr recipe --snapshot system.yaml --intent training

  3. API Server - HTTP endpoint (query mode only):

    $ curl "http://localhost:8080/v1/recipe?os=ubuntu&accelerator=h100&service=eks&intent=training"

Data Structure

Recipe Metadata Format

Each recipe file follows this structure:

kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: <recipe-name>  # Unique identifier (e.g., "eks-training", "gb200-eks-ubuntu-training")

spec:
  base: <parent-recipe>  # Optional - inherits from another recipe
  mixins:  # Optional - composable mixin fragments
    - os-ubuntu  # OS constraints (from recipes/mixins/)
    - platform-kubeflow  # Platform components (from recipes/mixins/)

  criteria:  # When this recipe/overlay applies
    service: eks  # Kubernetes platform
    accelerator: gb200  # GPU type
    os: ubuntu  # Operating system
    intent: training  # Workload purpose
    platform: kubeflow  # Platform/framework (optional)

  constraints:  # Deployment requirements
    - name: K8s.server.version
      value: ">= 1.32"

  componentRefs:  # Components to deploy
    - name: gpu-operator
      type: Helm
      source: https://helm.ngc.nvidia.com/nvidia
      version: v25.3.3
      valuesFile: components/gpu-operator/values.yaml
      dependencyRefs:
        - cert-manager

Top-Level Fields

| Field | Description |
| --- | --- |
| kind | Always RecipeMetadata |
| apiVersion | Always aicr.nvidia.com/v1alpha1 |
| metadata.name | Unique recipe identifier |
| spec.base | Parent recipe to inherit from (empty = inherits from overlays/base.yaml) |
| spec.mixins | List of mixin names to compose (e.g., ["os-ubuntu", "platform-kubeflow"]) |
| spec.criteria | Query parameters that select this recipe |
| spec.constraints | Pre-flight validation rules |
| spec.componentRefs | List of components to deploy |

Criteria Fields

Criteria define when a recipe matches a user query:

| Field | Type | Description | Example Values |
| --- | --- | --- | --- |
| service | String | Kubernetes platform | eks, gke, aks, oke, kind, lke |
| accelerator | String | GPU hardware type | h100, gb200, b200, a100, l40, rtx-pro-6000 |
| os | String | Operating system | ubuntu, rhel, cos, amazonlinux |
| intent | String | Workload purpose | training, inference |
| platform | String | Platform/framework type | kubeflow |
| nodes | Integer | Node count (0 = any) | 8, 16 |

All fields are optional. Unpopulated fields act as wildcards (match any value).

Constraint Format

Constraints use fully qualified measurement paths:

constraints:
  - name: K8s.server.version  # {type}.{subtype}.{key}
    value: ">= 1.32"  # Expression or exact value

  - name: OS.release.ID
    value: ubuntu  # Exact match

  - name: OS.release.VERSION_ID
    value: "24.04"

  - name: OS.sysctl./proc/sys/kernel/osrelease
    value: ">= 6.8"

Constraint Path Format: {MeasurementType}.{Subtype}.{Key}

| Measurement Type | Common Subtypes |
| --- | --- |
| K8s | server, image, config |
| OS | release, sysctl, kmod, grub |
| GPU | smi, driver, device |
| SystemD | containerd.service, kubelet.service |

Supported Operators: >=, <=, >, <, ==, !=, or exact match (no operator)
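
As a rough illustration of how such expressions could be evaluated, the sketch below splits an optional operator from the expected value and compares dotted numeric versions segment by segment. The helper names (splitConstraint, compareVersions, evaluate) are hypothetical and are not taken from the project's code; real measurements such as kernel release strings with suffixes would need more tolerant parsing than this.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitConstraint separates an optional leading operator from the expected
// value. With no operator, the expression is treated as an exact match ("==").
func splitConstraint(expr string) (op, want string) {
	expr = strings.TrimSpace(expr)
	// Two-character operators come first so ">=" is not read as ">".
	for _, candidate := range []string{">=", "<=", "==", "!=", ">", "<"} {
		if strings.HasPrefix(expr, candidate) {
			return candidate, strings.TrimSpace(strings.TrimPrefix(expr, candidate))
		}
	}
	return "==", expr
}

// compareVersions compares dotted numeric versions and returns -1, 0, or 1.
// Missing segments count as 0, so "1.32" equals "1.32.0".
func compareVersions(a, b string) (int, error) {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var x, y int
		var err error
		if i < len(as) {
			if x, err = strconv.Atoi(as[i]); err != nil {
				return 0, err
			}
		}
		if i < len(bs) {
			if y, err = strconv.Atoi(bs[i]); err != nil {
				return 0, err
			}
		}
		if x < y {
			return -1, nil
		}
		if x > y {
			return 1, nil
		}
	}
	return 0, nil
}

// evaluate reports whether the measured value satisfies the expression.
func evaluate(expr, measured string) (bool, error) {
	op, want := splitConstraint(expr)
	cmp, err := compareVersions(measured, want)
	if err != nil {
		// Non-numeric values (e.g. "ubuntu") only support equality checks.
		switch op {
		case "==":
			return measured == want, nil
		case "!=":
			return measured != want, nil
		}
		return false, fmt.Errorf("operator %q needs numeric versions: %w", op, err)
	}
	switch op {
	case ">=":
		return cmp >= 0, nil
	case "<=":
		return cmp <= 0, nil
	case ">":
		return cmp > 0, nil
	case "<":
		return cmp < 0, nil
	case "==":
		return cmp == 0, nil
	case "!=":
		return cmp != 0, nil
	}
	return false, fmt.Errorf("unsupported operator %q", op)
}

func main() {
	ok, err := evaluate(">= 1.32", "1.33.1")
	fmt.Println(ok, err)
}
```

Treating a bare value as an equality check keeps exact-match constraints such as OS.release.ID on the same code path as operator expressions.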

Component Reference Structure

Each component in componentRefs defines a deployable unit. Components can be either Helm or Kustomize based.

Helm Component Example:

componentRefs:
  - name: gpu-operator  # Component identifier (must match registry name)
    type: Helm  # Deployment type
    source: https://helm.ngc.nvidia.com/nvidia  # Helm repository URL
    version: v25.3.3  # Chart version
    valuesFile: components/gpu-operator/values.yaml  # Path to values file
    overrides:  # Inline value overrides
      driver:
        version: 580.82.07
      cdi:
        enabled: true
    dependencyRefs:  # Components this depends on
      - cert-manager

Kustomize Component Example:

componentRefs:
  - name: my-kustomize-app  # Component identifier (must match registry name)
    type: Kustomize  # Deployment type
    source: https://github.com/example/my-app  # Git repository or OCI reference
    tag: v1.0.0  # Git tag, branch, or commit
    path: deploy/production  # Path to kustomization within repo
    patches:  # Patch files to apply
      - patches/custom-patch.yaml
    dependencyRefs:
      - cert-manager

Component Fields

| Field | Required | Description |
| --- | --- | --- |
| name | Yes | Unique component identifier (matches registry name) |
| type | Yes | Helm or Kustomize |
| source | Yes | Repository URL, OCI reference, or Git URL |
| version | No | Chart version (for Helm) |
| tag | No | Git tag, branch, or commit (for Kustomize) |
| path | No | Path to kustomization within repository (for Kustomize) |
| valuesFile | No | Path to values file (relative to data directory, for Helm) |
| overrides | No | Inline values that override valuesFile (for Helm) |
| patches | No | Patch files to apply (for Kustomize) |
| dependencyRefs | No | List of component names this depends on |

Multi-Level Inheritance

Recipe files support multi-level inheritance through the spec.base field. This enables building inheritance chains where intermediate recipes capture shared configurations, reducing duplication and improving maintainability.

Inheritance Mechanism

Each recipe can specify a parent recipe via spec.base:

kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: gb200-eks-ubuntu-training

spec:
  base: eks-training  # Inherits from eks-training recipe

  criteria:
    service: eks
    accelerator: gb200
    os: ubuntu
    intent: training

  # Only GB200-specific overrides here
  componentRefs:
    - name: gpu-operator
      version: v25.3.3
      overrides:
        driver:
          version: 580.82.07

Inheritance Chain Example

The system supports inheritance chains of arbitrary depth:

overlays/base.yaml
├── overlays/eks.yaml (spec.base: empty → inherits from base)
│   │
│   └── overlays/eks-training.yaml (spec.base: eks)
│       │
│       └── overlays/gb200-eks-training.yaml (spec.base: eks-training)
│           │
│           └── overlays/gb200-eks-ubuntu-training.yaml (spec.base: gb200-eks-training)
└── overlays/h100-ubuntu-inference.yaml (spec.base: empty → inherits from base)

Resolution Order: When resolving gb200-eks-ubuntu-training:

  1. Start with overlays/base.yaml (root)
  2. Merge overlays/eks.yaml (EKS-specific settings)
  3. Merge overlays/eks-training.yaml (training optimizations)
  4. Merge overlays/gb200-eks-training.yaml (GB200 + training-specific overrides)
  5. Merge overlays/gb200-eks-ubuntu-training.yaml (Ubuntu + full-spec overrides)

Inheritance Rules

1. Base Resolution

  • spec.base: "" or omitted → Inherits directly from overlays/base.yaml
  • spec.base: "eks" → Inherits from the recipe named “eks”
  • The root overlays/base.yaml has no parent (it’s the foundation)

2. Merge Precedence

Later recipes in the chain override earlier ones:

base → eks → eks-training → gb200-eks-training → gb200-eks-ubuntu-training
(lowest priority)                                        (highest priority)

3. Field Merging

  • Constraints: Same-named constraints are overridden; new constraints are added
  • ComponentRefs: Same-named components are merged field-by-field; new components are added
  • Criteria: Each recipe defines its own criteria (not inherited)
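
The constraint rule above (same name overrides, new name appends) can be sketched in a few lines. This is an illustrative sketch only: the Constraint type and mergeConstraints function are assumptions made for the example, not the project's actual definitions.

```go
package main

import "fmt"

// Constraint is a hypothetical stand-in for a recipe constraint entry.
type Constraint struct {
	Name  string
	Value string
}

// mergeConstraints applies overlay on top of base: an overlay entry replaces
// the value of a same-named base entry in place, and new names are appended.
// The base slice itself is not modified.
func mergeConstraints(base, overlay []Constraint) []Constraint {
	merged := make([]Constraint, len(base), len(base)+len(overlay))
	copy(merged, base)

	index := make(map[string]int, len(merged))
	for i, c := range merged {
		index[c.Name] = i
	}
	for _, c := range overlay {
		if i, ok := index[c.Name]; ok {
			merged[i].Value = c.Value
			continue
		}
		index[c.Name] = len(merged)
		merged = append(merged, c)
	}
	return merged
}

func main() {
	// Layers listed from lowest to highest priority.
	chain := [][]Constraint{
		{{Name: "K8s.server.version", Value: ">= 1.25"}}, // base
		{{Name: "K8s.server.version", Value: ">= 1.28"}}, // eks
		{{Name: "K8s.server.version", Value: ">= 1.30"}}, // eks-training
		{{Name: "OS.release.ID", Value: "ubuntu"}},       // hypothetical leaf
	}
	var merged []Constraint
	for _, layer := range chain {
		merged = mergeConstraints(merged, layer)
	}
	fmt.Println(merged)
}
```

Applying layers in chain order means the last writer for a given name wins, which matches the lowest-to-highest precedence described above.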

Intermediate vs Leaf Recipes

Intermediate Recipes (e.g., eks.yaml, eks-training.yaml):

  • Have partial criteria (not all fields specified)
  • Capture shared configurations for a category
  • Can be matched by user queries (but typically less specific)

Leaf Recipes (e.g., gb200-eks-ubuntu-training.yaml):

  • Have complete criteria (all required fields)
  • Matched by specific user queries
  • Contain final, hardware-specific overrides

Example: Inheritance Chain

# overlays/base.yaml - Foundation for all recipes
kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: base

spec:
  constraints:
    - name: K8s.server.version
      value: ">= 1.25"

  componentRefs:
    - name: cert-manager
      type: Helm
      source: https://charts.jetstack.io
      version: v1.20.2
      valuesFile: components/cert-manager/values.yaml

    - name: gpu-operator
      type: Helm
      source: https://helm.ngc.nvidia.com/nvidia
      version: v25.10.1
      valuesFile: components/gpu-operator/values.yaml
      dependencyRefs:
        - cert-manager

# eks.yaml - EKS-specific settings
kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: eks

spec:
  # Implicit base (inherits from overlays/base.yaml)

  criteria:
    service: eks  # Only service specified (partial criteria)

  constraints:
    - name: K8s.server.version
      value: ">= 1.28"  # EKS minimum version

# eks-training.yaml - EKS training workloads
kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: eks-training

spec:
  base: eks  # Inherits EKS settings

  criteria:
    service: eks
    intent: training  # Added training intent (still partial)

  constraints:
    - name: K8s.server.version
      value: ">= 1.30"  # Training requires newer K8s

  componentRefs:
    # Training workloads use training-optimized values
    - name: gpu-operator
      valuesFile: components/gpu-operator/values-eks-training.yaml

Benefits of Multi-Level Inheritance

| Benefit | Description |
| --- | --- |
| Reduced Duplication | Shared settings defined once in intermediate recipes |
| Easier Maintenance | Update EKS settings in one place, all EKS recipes inherit |
| Clear Organization | Hierarchy reflects logical relationships |
| Flexible Extension | Add new leaf recipes without duplicating parent configs |
| Testable | Each level can be validated independently |

Mixin Composition

Inheritance is single-parent (spec.base), which means cross-cutting concerns like OS constraints or platform components would need to be duplicated across leaf overlays. Mixins solve this by providing composable fragments that leaf overlays reference via spec.mixins.

Mixin files live in recipes/mixins/ and use kind: RecipeMixin:

# recipes/mixins/os-ubuntu.yaml
kind: RecipeMixin
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: os-ubuntu

spec:
  constraints:
    - name: OS.release.ID
      value: ubuntu
    - name: OS.release.VERSION_ID
      value: "24.04"
    - name: OS.sysctl./proc/sys/kernel/osrelease
      value: ">= 6.8"

Leaf overlays compose mixins alongside inheritance:

# recipes/overlays/h100-eks-ubuntu-training-kubeflow.yaml
spec:
  base: h100-eks-training
  mixins:
    - os-ubuntu  # Ubuntu constraints
    - platform-kubeflow  # Kubeflow trainer component
  criteria:
    service: eks
    accelerator: h100
    os: ubuntu
    intent: training
    platform: kubeflow

Mixin rules:

  • Mixins carry only constraints and componentRefs — no criteria, base, mixins, or validation
  • Mixins are applied after inheritance chain merging but before constraint evaluation
  • Conflict detection: a mixin constraint or component that conflicts with the inheritance chain or a previously applied mixin produces an error
  • When a snapshot is provided, mixin constraints are evaluated against it after merging; if any fail, the entire composed candidate is invalid and falls back to base-only output. In plain query mode (no snapshot), mixin constraints are merged but not evaluated

Criteria-Wildcard Overlays

Some overlays apply across a criteria dimension without being referenced via spec.base or included via spec.mixins. The resolver picks them up automatically because FindMatchingOverlays can return multiple independent maximal-leaf overlays for a single query, not just one. Ancestors of a matched leaf are filtered out of the candidate set, but sibling leaves whose criteria independently match are kept and their inheritance chains are resolved and merged in parallel. See Criteria Matching Algorithm and Recipe Generation Process for details.

This is useful for content that cross-cuts one criteria dimension but must stay tied to others — for example, a GB200 NCCL bandwidth target that applies to every service (EKS, OKE, etc.) but only for GB200 + training.

# recipes/overlays/gb200-any-training.yaml
spec:
  base: base
  criteria:
    service: any  # Wildcard: matches eks, oke, gke, etc.
    accelerator: gb200
    intent: training
  validation:
    performance:
      checks:
        - nccl-all-reduce-bw
      constraints:
        - name: nccl-all-reduce-bw
          value: ">= 720"

When a query specifies {service: eks, accelerator: gb200, intent: training}, the resolver returns three maximal leaves — gb200-eks-training (matched by explicit criteria), gb200-any-training (matched by wildcard service: any), and monitoring-hpa (matched by wildcard intent: any). Their inheritance chains are resolved and merged with the base spec:

appliedOverlays:
  - base
  - monitoring-hpa
  - gb200-any-training  # matched by wildcard criteria, not via base:
  - eks
  - eks-training
  - gb200-eks-training
The nccl-all-reduce-bw constraint from gb200-any-training lands in the hydrated recipe without being duplicated in each service-specific overlay. (Adding os: ubuntu to the query would extend the chain with gb200-eks-ubuntu-training as the maximal leaf in place of gb200-eks-training; gb200-any-training would still match independently.)

Naming convention. The -any- segment signals this pattern: the static segments indicate the fixed criteria dimensions (accelerator, intent), and any marks the wildcard dimension. Examples: gb200-any-training.yaml, b200-any-training.yaml.

When to use a criteria-wildcard overlay vs a mixin:

| Use a criteria-wildcard overlay when… | Use a mixin when… |
| --- | --- |
| Content applies based on query criteria | Content applies based on explicit opt-in |
| The set of consumers is determined by criteria matching | The set of consumers is an enumerated list of overlays |
| Adopt-by-default is desired for new matching overlays | Each consumer should reference it explicitly |
| You want to add validation blocks (mixins don’t carry validation) | You only need constraints / componentRefs |

Precedence when a wildcard overlay and a service-specific leaf collide. FindMatchingOverlays sorts its returned leaves by Criteria.Specificity() ascending, so less-specific overlays merge first and more-specific overlays merge last. Two different merge rules apply — they are not the same:

  • Top-level spec.constraints merge by name. A same-named constraint from the more-specific leaf overrides the wildcard’s value (the “overridden, new added” rule from the merge algorithm).
  • spec.validation.<phase> blocks (deployment, performance, conformance) are replaced wholesale when a later overlay defines the same phase — no field-level merge. The leaf’s checks and constraints replace the wildcard’s entire block.

This distinction matters. To override only the threshold in the wildcard example above, a service-specific leaf must restate both checks and constraints:

# recipes/overlays/gb200-eks-training.yaml
spec:
  validation:
    performance:
      checks:  # Must restate, else the phase is dropped
        - nccl-all-reduce-bw
      constraints:
        - name: nccl-all-reduce-bw
          value: ">= 650"  # EKS-specific threshold

Setting only constraints drops the wildcard’s checks, which causes filterEntriesByRecipe to return zero entries and the performance phase to be skipped entirely — the opposite of the “lower the threshold” intent.

Criteria-wildcard overlays are only appropriate when the content is genuinely uniform across the wildcard dimension. If the value diverges (e.g., H100 NCCL targets differ by cloud: AKS ≥ 100, EKS ≥ 300, GKE ≥ 250), keep it inline in each service-specific overlay — collapsing divergent values to a lowest-common-denominator wildcard silently weakens validation.

See also: ADR-005: Overlay Refactoring — rationale for the maximal-leaf resolver semantics (Phase 2) and why wildcard overlays are preferred over multi-parent inheritance or intermediate-reparenting approaches that were prototyped and rejected.

Cycle Detection

The system detects circular inheritance to prevent infinite loops:

# INVALID: Would create cycle
# a.yaml: spec.base: b
# b.yaml: spec.base: c
# c.yaml: spec.base: a ← Cycle detected!

Tests in pkg/recipe/yaml_test.go automatically validate:

  • All spec.base references point to existing recipes
  • No circular inheritance chains exist
  • Inheritance depth is reasonable (max 10 levels)
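
The checks listed above can be pictured as a walk up the spec.base links that records visited names. The following is a minimal sketch under stated assumptions: the parents map (recipe name to spec.base value), the resolveChain function, and the way the depth limit is counted are all hypothetical and only illustrate the idea.

```go
package main

import "fmt"

const (
	rootName = "base" // the root recipe has no parent
	maxDepth = 10     // depth limit mentioned in the tests above
)

// resolveChain walks spec.base links from name up to the root and returns the
// chain in root-first order. An empty parent means "inherit from base".
func resolveChain(parents map[string]string, name string) ([]string, error) {
	var chain []string
	seen := map[string]bool{}

	current := name
	for {
		if seen[current] {
			return nil, fmt.Errorf("cycle detected at %q", current)
		}
		seen[current] = true
		chain = append(chain, current)
		if current == rootName {
			break
		}
		if len(chain) > maxDepth {
			return nil, fmt.Errorf("inheritance chain for %q exceeds %d levels", name, maxDepth)
		}
		parent, ok := parents[current]
		if !ok {
			return nil, fmt.Errorf("recipe %q does not exist", current)
		}
		if parent == "" {
			parent = rootName
		}
		current = parent
	}

	// The walk collected leaf-first; reverse so the root comes first.
	for i, j := 0, len(chain)-1; i < j; i, j = i+1, j-1 {
		chain[i], chain[j] = chain[j], chain[i]
	}
	return chain, nil
}

func main() {
	parents := map[string]string{
		"eks":          "",
		"eks-training": "eks",
	}
	fmt.Println(resolveChain(parents, "eks-training"))
	fmt.Println(resolveChain(map[string]string{"a": "b", "b": "c", "c": "a"}, "a"))
}
```

Recording each visited name is enough to catch a cycle on the first repeated recipe, and the same walk surfaces references to recipes that do not exist.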

Component Configuration

Components are configured through a three-tier system with clear precedence.

Configuration Patterns

Pattern 1: ValuesFile Only. Traditional approach, with all values in a separate file:

componentRefs:
  - name: gpu-operator
    valuesFile: components/gpu-operator/values.yaml

Pattern 2: Overrides Only. Fully self-contained recipe with inline values:

componentRefs:
  - name: gpu-operator
    overrides:
      driver:
        version: 580.82.07
      cdi:
        enabled: true

Pattern 3: ValuesFile + Overrides (Hybrid). Reusable base with recipe-specific tweaks:

componentRefs:
  - name: gpu-operator
    valuesFile: components/gpu-operator/values.yaml  # Base configuration
    overrides:  # Recipe-specific tweaks
      driver:
        version: 580.82.07

Value Merge Precedence

When values are resolved, merge order is:

Base ValuesFile → Overlay ValuesFile → Overlay Overrides → CLI --set flags
(lowest)                                                          (highest)

  1. Base ValuesFile: Values from inherited recipes
  2. Overlay ValuesFile: Values file specified in the matching overlay
  3. Overlay Overrides: Inline overrides in the overlay’s componentRef
  4. CLI --set flags: Runtime overrides from aicr bundle --set
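
One way to picture this precedence is a recursive map merge applied layer by layer, lowest priority first. The sketch below is an assumption-laden illustration: this deepMerge is written for the example and may differ from the project's own helper of the same name, and the sample values (including the final layer standing in for a --set flag) are hypothetical.

```go
package main

import "fmt"

// deepMerge returns a new map in which entries from src override entries from
// dst. When both sides hold a nested map under the same key, the nested maps
// are merged recursively instead of being replaced. Nested maps present on
// only one side are shared with the input rather than copied.
func deepMerge(dst, src map[string]any) map[string]any {
	out := make(map[string]any, len(dst)+len(src))
	for k, v := range dst {
		out[k] = v
	}
	for k, v := range src {
		srcMap, srcIsMap := v.(map[string]any)
		dstMap, dstIsMap := out[k].(map[string]any)
		if srcIsMap && dstIsMap {
			out[k] = deepMerge(dstMap, srcMap)
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	// Layers ordered from lowest to highest precedence.
	layers := []map[string]any{
		{"driver": map[string]any{"version": "580.105.08", "enabled": true}}, // base values file
		{"driver": map[string]any{"useOpenKernelModules": true}},            // overlay values file
		{"driver": map[string]any{"version": "580.82.07"}},                   // overlay overrides
		{"cdi": map[string]any{"enabled": true}},                             // hypothetical --set value
	}
	merged := map[string]any{}
	for _, layer := range layers {
		merged = deepMerge(merged, layer)
	}
	fmt.Println(merged)
}
```

Because each layer only replaces the keys it names, an override of driver.version leaves sibling keys such as driver.enabled untouched.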

Component Values Files

Values files are stored in recipes/components/{component}/:

# components/gpu-operator/values.yaml
operator:
  upgradeCRD: true
  resources:
    limits:
      cpu: 500m
      memory: 700Mi

driver:
  version: 580.105.08
  enabled: true
  useOpenKernelModules: true
  rdma:
    enabled: true

devicePlugin:
  enabled: true

Dependency Management

Components can declare dependencies via dependencyRefs:

componentRefs:
  - name: cert-manager
    type: Helm
    version: v1.20.2

  - name: gpu-operator
    type: Helm
    version: v25.10.1
    dependencyRefs:
      - cert-manager  # Deploy cert-manager first

The system performs topological sort to compute deployment order, ensuring dependencies are deployed before dependents. The resulting order is exposed in RecipeResult.DeploymentOrder.
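
For readers unfamiliar with the technique, the sketch below shows one common way to compute such an order, Kahn's algorithm, with alphabetical tie-breaking so the result is deterministic. The Component type and deploymentOrder function are hypothetical; the project's TopologicalSort may use a different algorithm or tie-breaking rule, and component names are assumed to be unique.

```go
package main

import (
	"fmt"
	"sort"
)

// Component is a hypothetical, reduced view of a componentRef.
type Component struct {
	Name           string
	DependencyRefs []string
}

// deploymentOrder returns component names so that every dependency appears
// before the components that depend on it (Kahn's algorithm).
func deploymentOrder(components []Component) ([]string, error) {
	indegree := make(map[string]int, len(components))
	dependents := make(map[string][]string)
	for _, c := range components {
		indegree[c.Name] = 0
	}
	for _, c := range components {
		for _, dep := range c.DependencyRefs {
			if _, ok := indegree[dep]; !ok {
				return nil, fmt.Errorf("%s depends on unknown component %s", c.Name, dep)
			}
			indegree[c.Name]++
			dependents[dep] = append(dependents[dep], c.Name)
		}
	}

	// Start with components that have no dependencies.
	var ready []string
	for name, n := range indegree {
		if n == 0 {
			ready = append(ready, name)
		}
	}
	sort.Strings(ready)

	order := make([]string, 0, len(components))
	for len(ready) > 0 {
		name := ready[0]
		ready = ready[1:]
		order = append(order, name)
		for _, next := range dependents[name] {
			indegree[next]--
			if indegree[next] == 0 {
				ready = append(ready, next)
			}
		}
		sort.Strings(ready) // keep tie-breaking deterministic
	}

	// Any component left unplaced is part of a dependency cycle.
	if len(order) != len(components) {
		return nil, fmt.Errorf("circular dependency among components")
	}
	return order, nil
}

func main() {
	order, err := deploymentOrder([]Component{
		{Name: "gpu-operator", DependencyRefs: []string{"cert-manager"}},
		{Name: "cert-manager"},
	})
	fmt.Println(order, err)
}
```

The same pass doubles as validation: an unknown dependency name or a cycle is reported as an error instead of producing a partial order.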

Criteria Matching Algorithm

The recipe system uses an asymmetric rule matching algorithm where recipe criteria (rules) match against user queries (candidates).

Matching Rules

A recipe’s criteria matches a user query when every non-"any" field in the criteria is satisfied by the query:

  1. Empty/unpopulated fields in recipe criteria = Wildcard (matches any query value)
  2. Populated fields must match exactly (case-insensitive)
  3. Matching is asymmetric: A recipe with specific fields (e.g., accelerator: h100) will NOT match a generic query (e.g., accelerator: any)

Asymmetric Matching Explained

The key insight is that matching is one-directional:

  • Recipe “any” (or empty) → Matches ANY query value (acts as wildcard)
  • Query “any” → Only matches recipe “any” (does NOT match specific recipes)

This prevents overly specific recipes from being selected when the user hasn’t specified those criteria.

Matching Logic

// Asymmetric matching: recipe criteria as receiver, query as parameter
func (c *Criteria) Matches(other *Criteria) bool {
	// If recipe (c) is "any" (or empty), it matches any query value (wildcard).
	// If query (other) is "any" but recipe is specific, it does NOT match.
	// If both have specific values, they must match exactly.

	// For each field, call matchesCriteriaField(recipeValue, queryValue)
	// ...
	return true
}

// matchesCriteriaField implements asymmetric matching for a single field.
func matchesCriteriaField(recipeValue, queryValue string) bool {
	recipeIsAny := recipeValue == "any" || recipeValue == ""
	queryIsAny := queryValue == "any" || queryValue == ""

	// If recipe is "any", it matches any query value (recipe is generic)
	if recipeIsAny {
		return true
	}

	// Recipe has a specific value
	// Query must also have that specific value (not "any")
	if queryIsAny {
		// Query is generic but recipe is specific - no match
		return false
	}

	// Both have specific values - must match exactly
	return recipeValue == queryValue
}
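
The elided per-field loop can be filled in as follows. This is a self-contained sketch rather than the project's code: the Criteria struct here is a reduced, hypothetical version with plain string fields (the nodes field is omitted), and strings.EqualFold is an assumption used to reflect the case-insensitive rule stated earlier.

```go
package main

import (
	"fmt"
	"strings"
)

// Criteria is a reduced, hypothetical version of the recipe criteria.
type Criteria struct {
	Service     string
	Accelerator string
	OS          string
	Intent      string
	Platform    string
}

// isAny reports whether a field is a wildcard (empty or "any").
func isAny(v string) bool {
	return v == "" || strings.EqualFold(v, "any")
}

// matchesField applies the asymmetric rule to one field: a wildcard recipe
// value matches anything, a specific recipe value needs the same query value.
func matchesField(recipeValue, queryValue string) bool {
	if isAny(recipeValue) {
		return true
	}
	if isAny(queryValue) {
		return false
	}
	return strings.EqualFold(recipeValue, queryValue)
}

// Matches reports whether the recipe criteria (receiver) accept the query.
func (c *Criteria) Matches(query *Criteria) bool {
	return matchesField(c.Service, query.Service) &&
		matchesField(c.Accelerator, query.Accelerator) &&
		matchesField(c.OS, query.OS) &&
		matchesField(c.Intent, query.Intent) &&
		matchesField(c.Platform, query.Platform)
}

func main() {
	recipe := &Criteria{Service: "eks", Accelerator: "gb200", Intent: "training"}
	generic := &Criteria{Service: "eks", Intent: "training"}
	specific := &Criteria{Service: "eks", Accelerator: "gb200", OS: "ubuntu", Intent: "training"}

	fmt.Println(recipe.Matches(generic))  // query leaves accelerator unspecified
	fmt.Println(recipe.Matches(specific)) // query supplies every field the recipe needs
}
```

Swapping receiver and argument changes the answer, which is the asymmetry in practice: the recipe side decides which fields are required.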

Specificity Scoring

When multiple recipes match, they are sorted by specificity (number of non-"any" fields). More specific recipes are applied later, giving them higher precedence:

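// Specificity counts the fields that carry a specific (non-wildcard) value.
// Note: an empty string also differs from "any", so this presumably relies on
// unset fields having been normalized to "any" before scoring.
func (c *Criteria) Specificity() int {
	score := 0
	if c.Service != "any" { score++ }
	if c.Accelerator != "any" { score++ }
	if c.Intent != "any" { score++ }
	if c.OS != "any" { score++ }
	if c.Nodes != 0 { score++ }
	return score
}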

Matching Examples

Example 1: Broad Recipe Matches Specific Query

Recipe Criteria: { service: "eks" }
User Query:      { service: "eks", os: "ubuntu", accelerator: "h100" }
Result:          ✅ MATCH - Recipe only requires service=eks, other fields are wildcards
Specificity:     1

Example 2: Specific Recipe Doesn’t Match Different Specific Query

Recipe Criteria: { service: "eks", accelerator: "gb200", intent: "training" }
User Query:      { service: "eks", os: "ubuntu", accelerator: "h100" }
Result:          ❌ NO MATCH - Accelerator doesn't match (gb200 ≠ h100)

Example 3: Specific Recipe Doesn’t Match Generic Query (Asymmetric)

Recipe Criteria: { service: "eks", accelerator: "gb200", intent: "training" }
User Query:      { service: "eks", intent: "training" }  # accelerator unspecified = "any"
Result:          ❌ NO MATCH - Recipe requires gb200, query has "any" (wildcard doesn't match specific)

This asymmetric behavior ensures that a generic query like --service eks --intent training only matches generic recipes, not hardware-specific ones like gb200-eks-training.yaml.

Example 4: Multiple Maximal Matches (Fully Specific Query)

User Query: { service: "eks", os: "ubuntu", accelerator: "gb200", intent: "training" }

Overlay criteria matches (pre-filter):
  1. overlays/monitoring-hpa.yaml             { intent: any }                                                    Specificity: 0
  2. overlays/eks.yaml                        { service: eks }                                                   Specificity: 1
  3. overlays/eks-training.yaml               { service: eks, intent: training }                                 Specificity: 2
  4. overlays/gb200-any-training.yaml         { service: any, accelerator: gb200, intent: training }             Specificity: 2
  5. overlays/gb200-eks-training.yaml         { service: eks, accelerator: gb200, intent: training }             Specificity: 3
  6. overlays/gb200-eks-ubuntu-training.yaml  { service: eks, accelerator: gb200, os: ubuntu, intent: training } Specificity: 4

(base.yaml is the root spec, not an overlay candidate: FindMatchingOverlays
iterates s.Overlays only. The base spec is always applied as the seed for
the merged output; it is not selected by criteria matching.)

Maximal leaves (after filterToMaximalLeaves):
  - monitoring-hpa (no matching descendant)
  - gb200-any-training (no matching descendant)
  - gb200-eks-ubuntu-training (most-specific overlay; eks, eks-training,
    gb200-eks-training are ancestors and are filtered out)

Result: Each maximal leaf's inheritance chain is resolved and merged onto
the base spec. Ancestors removed by the filter re-enter the output via
chain resolution (step 3), so the final appliedOverlays is
[base, monitoring-hpa, gb200-any-training, eks, eks-training,
gb200-eks-training, gb200-eks-ubuntu-training].

Note that multiple maximal leaves can coexist when their inheritance chains are independent — gb200-any-training (via wildcard service: any) and gb200-eks-ubuntu-training (via explicit criteria) are both kept because neither is an ancestor of the other. This is what enables the criteria-wildcard overlay pattern.

Recipe Generation Process

The recipe builder (pkg/recipe/metadata_store.go) generates recipes through the following steps:

Step 1: Load Metadata Store

store, err := loadMetadataStore(ctx)

  • Embedded YAML files are parsed into Go structs
  • Cached in memory on first access (singleton pattern with sync.Once)
  • Contains base recipe, all overlays, mixins, and component values files

Step 2: Find Matching Overlays

overlays := store.FindMatchingOverlays(criteria)

  • Iterate all overlays in s.Overlays (the base recipe is held separately in s.Base and is not a candidate here — it is injected as the merge seed by initBaseMergedSpec() in Step 4)
  • Check if each overlay’s criteria matches the user query
  • Filter to maximal leaves via filterToMaximalLeaves(): drop any match that is an ancestor (via spec.base) of another match. Ancestors are re-added later via chain resolution; this filter ensures that a matched descendant isn’t double-counted with its own chain
  • Sort maximal-leaf matches by specificity (least specific first)

Multiple maximal leaves can be returned for one query when they sit on independent inheritance chains — for example, a service: any wildcard overlay and the most-specific service-specific leaf are both kept (see Criteria-Wildcard Overlays).
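
The filtering and sorting described in this step can be sketched as below. The Overlay struct and maximalLeaves function are hypothetical simplifications: specificity is stored as a precomputed number, and the ancestor walk only follows spec.base links through overlays that are themselves in the match set, which is an assumption made to keep the example short.

```go
package main

import (
	"fmt"
	"sort"
)

// Overlay is a hypothetical, reduced view of a matched overlay.
type Overlay struct {
	Name        string
	Base        string // spec.base; empty means it inherits from the root
	Specificity int
}

// maximalLeaves drops every match that is an ancestor (via Base) of another
// match, then sorts the survivors from least to most specific.
func maximalLeaves(matches []Overlay) []Overlay {
	byName := make(map[string]Overlay, len(matches))
	for _, o := range matches {
		byName[o.Name] = o
	}

	// Mark every overlay reachable by walking up from some match.
	// The step bound guards against malformed cyclic input.
	ancestor := map[string]bool{}
	for _, o := range matches {
		for parent, steps := o.Base, 0; parent != "" && steps < len(matches); steps++ {
			ancestor[parent] = true
			next, ok := byName[parent]
			if !ok {
				break
			}
			parent = next.Base
		}
	}

	var leaves []Overlay
	for _, o := range matches {
		if !ancestor[o.Name] {
			leaves = append(leaves, o)
		}
	}
	sort.SliceStable(leaves, func(i, j int) bool {
		return leaves[i].Specificity < leaves[j].Specificity
	})
	return leaves
}

func main() {
	matches := []Overlay{
		{Name: "eks", Specificity: 1},
		{Name: "eks-training", Base: "eks", Specificity: 2},
		{Name: "gb200-any-training", Base: "base", Specificity: 2},
		{Name: "gb200-eks-training", Base: "eks-training", Specificity: 3},
	}
	for _, leaf := range maximalLeaves(matches) {
		fmt.Println(leaf.Name)
	}
}
```

A stable sort keeps leaves with equal specificity in their input order, so the merge order stays reproducible across runs.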

Step 3: Resolve Inheritance Chains

For each maximal-leaf match from step 2:

chain, err := store.resolveInheritanceChain(overlay.Metadata.Name)

  • Build the chain from root (base) to the target overlay by walking spec.base
  • Detect cycles to prevent infinite loops
  • Example: ["base", "eks", "eks-training", "gb200-eks-training", "gb200-eks-ubuntu-training"]

Ancestors filtered out in step 2 re-enter the output here as part of their descendant’s chain.

Step 4: Merge Specifications

The merge starts from a seed containing the base spec, then applies each resolved chain on top:

mergedSpec, appliedOverlays := s.initBaseMergedSpec() // seed with s.Base
// ... then for each chain from Step 3:
for _, recipe := range chain {
	mergedSpec.Merge(&recipe.Spec)
}

This is why base always appears first in appliedOverlays even though it is not returned by FindMatchingOverlays.

Merge Algorithm

  • Constraints: Same-named constraints are overridden; new constraints are added
  • ComponentRefs: Same-named components are merged field-by-field using mergeComponentRef()

func mergeComponentRef(base, overlay ComponentRef) ComponentRef {
	result := base
	if overlay.Type != "" { result.Type = overlay.Type }
	if overlay.Source != "" { result.Source = overlay.Source }
	if overlay.Version != "" { result.Version = overlay.Version }
	if overlay.ValuesFile != "" { result.ValuesFile = overlay.ValuesFile }
	// Merge overrides maps
	if overlay.Overrides != nil {
		result.Overrides = deepMerge(base.Overrides, overlay.Overrides)
	}
	// Merge dependency refs
	if len(overlay.DependencyRefs) > 0 {
		result.DependencyRefs = mergeDependencyRefs(base.DependencyRefs, overlay.DependencyRefs)
	}
	return result
}

Step 5: Apply Mixins

mixinConstraintNames, err := store.mergeMixins(mergedSpec)

  • If the leaf overlay declares spec.mixins, each named mixin is loaded from recipes/mixins/
  • Mixin constraints and componentRefs are appended to the merged spec
  • Conflict detection prevents duplicates between the inheritance chain, previously applied mixins, and the current mixin
  • When a snapshot evaluator is provided, mixin constraints are evaluated against it after merging; failure invalidates the entire composed candidate. In plain query mode (no snapshot), mixin constraints are merged but not evaluated

Step 6: Validate Dependencies

if err := mergedSpec.ValidateDependencies(); err != nil {
	return nil, err
}

  • Verify all dependencyRefs reference existing components
  • Detect circular dependencies

Step 7: Compute Deployment Order

deployOrder, err := mergedSpec.TopologicalSort()

  • Topologically sort components based on dependencyRefs
  • Ensures dependencies are deployed before dependents

Step 8: Build RecipeResult

return &RecipeResult{
	Kind:            "RecipeResult",
	APIVersion:      "aicr.nvidia.com/v1alpha1",
	Metadata:        metadata,
	Criteria:        criteria,
	Constraints:     mergedSpec.Constraints,
	ComponentRefs:   mergedSpec.ComponentRefs,
	DeploymentOrder: deployOrder,
}, nil

Complete Flow Diagram

Usage Examples

CLI Usage

Basic recipe generation (query mode):

$ aicr recipe --os ubuntu --service eks --accelerator h100 --intent training

Full specification:

$ aicr recipe \
    --os ubuntu \
    --service eks \
    --accelerator gb200 \
    --intent training \
    --nodes 8 \
    --format yaml \
    --output recipe.yaml

From snapshot (snapshot mode):

$ aicr snapshot --output snapshot.yaml
$ aicr recipe --snapshot snapshot.yaml --intent training --output recipe.yaml

API Usage

Basic query:

$ curl "http://localhost:8080/v1/recipe?os=ubuntu&service=eks&accelerator=h100"

Full specification:

$ curl "http://localhost:8080/v1/recipe?os=ubuntu&service=eks&accelerator=gb200&intent=training&nodes=8"
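
When calling the endpoint from code rather than curl, building the query string with net/url avoids manual escaping. The helper below is a hypothetical sketch: recipeURL is not part of the project, and the host and port are simply the ones used in the curl examples above.

```go
package main

import (
	"fmt"
	"net/url"
)

// recipeURL builds a /v1/recipe request URL from criteria key/value pairs.
// Empty values are skipped so unspecified criteria stay unset.
func recipeURL(baseURL string, criteria map[string]string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	u.Path = "/v1/recipe"

	q := url.Values{}
	for key, value := range criteria {
		if value != "" {
			q.Set(key, value)
		}
	}
	u.RawQuery = q.Encode() // Encode sorts parameters by key
	return u.String(), nil
}

func main() {
	target, err := recipeURL("http://localhost:8080", map[string]string{
		"os":          "ubuntu",
		"service":     "eks",
		"accelerator": "h100",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(target)
}
```

The resulting string can be passed to any HTTP client; parameter order differs from the curl examples because Encode sorts keys, which does not change the query's meaning.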

Example Response (RecipeResult)

{
  "kind": "RecipeResult",
  "apiVersion": "aicr.nvidia.com/v1alpha1",
  "metadata": {
    "version": "v0.8.0",
    "appliedOverlays": [
      "base",
      "eks",
      "eks-training",
      "gb200-eks-ubuntu-training"
    ]
  },
  "criteria": {
    "service": "eks",
    "accelerator": "gb200",
    "os": "ubuntu",
    "intent": "training"
  },
  "constraints": [
    {
      "name": "K8s.server.version",
      "value": ">= 1.32.4"
    },
    {
      "name": "OS.release.ID",
      "value": "ubuntu"
    },
    {
      "name": "OS.release.VERSION_ID",
      "value": "24.04"
    }
  ],
  "componentRefs": [
    {
      "name": "cert-manager",
      "type": "Helm",
      "source": "https://charts.jetstack.io",
      "version": "v1.20.2",
      "valuesFile": "components/cert-manager/values.yaml"
    },
    {
      "name": "gpu-operator",
      "type": "Helm",
      "source": "https://helm.ngc.nvidia.com/nvidia",
      "version": "v25.3.3",
      "valuesFile": "components/gpu-operator/values-eks-training.yaml",
      "overrides": {
        "driver": {
          "version": "580.82.07"
        },
        "cdi": {
          "enabled": true
        }
      },
      "dependencyRefs": ["cert-manager"]
    },
    {
      "name": "nvsentinel",
      "type": "Helm",
      "source": "oci://ghcr.io/nvidia/nvsentinel",
      "version": "v0.6.0",
      "valuesFile": "components/nvsentinel/values.yaml",
      "dependencyRefs": ["cert-manager"]
    },
    {
      "name": "nodewright-operator",
      "type": "Helm",
      "source": "oci://ghcr.io/nvidia/skyhook",
      "version": "v0.15.0",
      "valuesFile": "components/nodewright-operator/values.yaml",
      "overrides": {
        "customization": "ubuntu"
      }
    }
  ],
  "deploymentOrder": [
    "cert-manager",
    "gpu-operator",
    "nvsentinel",
    "nodewright-operator"
  ]
}

Maintenance Guide

Adding a New Recipe

  1. Create the recipe file in recipes/overlays/:

    kind: RecipeMetadata
    apiVersion: aicr.nvidia.com/v1alpha1
    metadata:
      name: l40-gke-ubuntu-inference  # Unique name

    spec:
      base: gke-inference  # Inherit from appropriate parent

      criteria:
        service: gke
        accelerator: l40
        os: ubuntu
        intent: inference

      constraints:
        - name: K8s.server.version
          value: ">= 1.29"

      componentRefs:
        - name: gpu-operator
          version: v25.3.3
          overrides:
            driver:
              version: 560.35.03

  2. Create intermediate recipes if needed (e.g., gke.yaml, gke-inference.yaml)

  3. Add component values files if using new configurations:

    # components/gpu-operator/values-gke-inference.yaml
    driver:
      enabled: true
      version: 560.35.03

  4. Run tests to validate:

    $ go test -v ./pkg/recipe/... -run TestAllMetadataFilesParseCorrectly

Modifying Existing Recipes

  1. Update constraints - Change version requirements:

    constraints:
      - name: K8s.server.version
        value: ">= 1.33"  # Updated from 1.32

  2. Update component versions - Bump chart versions:

    componentRefs:
      - name: gpu-operator
        version: v25.4.0  # Updated from v25.3.3

  3. Add inline overrides - Recipe-specific tweaks:

    componentRefs:
      - name: gpu-operator
        overrides:
          newFeature:
            enabled: true

Updating Component Values

  1. Modify values file in recipes/components/{component}/values.yaml

  2. Create variant values file for specific environments:

    • values.yaml - Base configuration
    • values-eks-training.yaml - EKS training optimization
    • values-gke-inference.yaml - GKE inference optimization
  3. Reference in recipe:

    componentRefs:
      - name: gpu-operator
        valuesFile: components/gpu-operator/values-gke-inference.yaml

Automated Validation

The recipe data system includes comprehensive automated tests to ensure data integrity. These tests run automatically as part of make test and validate all recipe metadata files and component values.

Test Suite Overview

The test suite is organized in pkg/recipe/:

| File | Responsibility |
|------|----------------|
| yaml_test.go | Static YAML file validation (parsing, references, enums, inheritance) |
| metadata_test.go | Runtime behavior tests (Merge, TopologicalSort, inheritance resolution) |
| recipe_test.go | Recipe struct validation (Validate, ValidateStructure) |

Test Categories

| Test Category | What It Validates |
|---------------|-------------------|
| Schema Conformance | All YAML files parse correctly with expected structure |
| Criteria Validation | Valid enum values for service, accelerator, intent, OS |
| Reference Validation | valuesFile paths exist, dependencyRefs resolve, component names valid |
| Constraint Syntax | Measurement paths use valid types, operators are valid |
| Dependency Cycles | No circular dependencies in componentRefs |
| Inheritance Chains | Base references valid, no circular inheritance, reasonable depth |
| Values Files | Component values files parse as valid YAML |

Inheritance-Specific Tests

| Test | What It Validates |
|------|-------------------|
| TestAllBaseReferencesPointToExistingRecipes | All spec.base references resolve to existing recipes |
| TestNoCircularBaseReferences | No circular inheritance chains (a→b→c→a) |
| TestInheritanceChainDepthReasonable | Inheritance depth ≤ 10 levels |

Running Tests

# Run all recipe data tests
$ make test

# Run only recipe package tests
$ go test -v ./pkg/recipe/... -count=1

# Run specific test patterns
$ go test -v ./pkg/recipe/... -run TestAllMetadataFilesParseCorrectly
$ go test -v ./pkg/recipe/... -run TestAllBaseReferencesPointToExistingRecipes
$ go test -v ./pkg/recipe/... -run TestAllOverlayCriteriaUseValidEnums

CI/CD Integration

Tests run automatically on:

  • Pull Requests: All tests must pass before merge
  • Push to main: Validates no regressions
  • Release builds: Ensures data integrity in released binaries

# GitHub Actions workflow snippet
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - uses: ./.github/actions/go-ci
        with:
          go_version: '1.26'
          golangci_lint_version: 'v2.11.3'

Adding New Tests

When adding new recipe metadata or component configurations:

  1. Create the new file in recipes/
  2. Run tests to verify the file is valid:
    $ go test -v ./pkg/recipe/... -run TestAllMetadataFilesParseCorrectly
  3. Check for conflicts with existing recipes:
    $ go test -v ./pkg/recipe/... -run TestNoDuplicateCriteria
  4. Verify references if using valuesFile or dependencyRefs:
    $ go test -v ./pkg/recipe/... -run TestValuesFileReferencesExist
    $ go test -v ./pkg/recipe/... -run TestDependencyRefsResolve

External Data Provider

The recipe system supports extending or overriding embedded data with external files via the --data CLI flag. This enables customization without rebuilding the CLI binary.

Architecture Overview

Data Provider Interface

The system uses a DataProvider interface to abstract file access:

import "io/fs"

type DataProvider interface {
    // ReadFile reads a file by path (relative to data directory)
    ReadFile(path string) ([]byte, error)

    // WalkDir walks the directory tree rooted at root
    WalkDir(root string, fn fs.WalkDirFunc) error

    // Source returns where data came from (for debugging)
    Source(path string) string
}

Provider Types:

  • EmbeddedDataProvider: Wraps Go’s embed.FS for compile-time embedded data
  • LayeredDataProvider: Overlays external directory on top of embedded data

Merge Behavior

| File Type | Behavior | Example |
|-----------|----------|---------|
| registry.yaml | Merged by component name | External adds/replaces components |
| overlays/base.yaml | Replaced if exists externally | External completely overrides embedded |
| overlays/*.yaml | Replaced if same path | External overlay replaces embedded |
| components/*/values.yaml | Replaced if same path | External values override embedded |

Registry Merge Algorithm

When merging registry.yaml, components are matched by their name field:

// Simplified: only the Components slice is merged; other registry fields are omitted.
func mergeRegistries(embedded, external *ComponentRegistry) *ComponentRegistry {
    result := &ComponentRegistry{}

    // 1. Index external components by name
    externalByName := make(map[string]*ComponentConfig, len(external.Components))
    for i := range external.Components {
        comp := &external.Components[i]
        externalByName[comp.Name] = comp
    }

    // 2. Add embedded components, replacing with external if present
    addedNames := make(map[string]bool, len(embedded.Components))
    for _, comp := range embedded.Components {
        if ext, found := externalByName[comp.Name]; found {
            result.Components = append(result.Components, *ext) // External wins
        } else {
            result.Components = append(result.Components, comp) // Keep embedded
        }
        addedNames[comp.Name] = true
    }

    // 3. Add new components from external (not in embedded)
    for _, comp := range external.Components {
        if !addedNames[comp.Name] {
            result.Components = append(result.Components, comp)
        }
    }

    return result
}

Merge Order:

  1. Start with all embedded components
  2. Replace any that have same name in external
  3. Add any new components from external

Security Validations

The LayeredDataProvider enforces security constraints:

| Validation | Behavior |
|------------|----------|
| Directory exists | External directory must exist and be a directory |
| registry.yaml required | External directory must contain registry.yaml |
| No path traversal | Paths containing .. are rejected |
| No symlinks | Symlinks are rejected by default (AllowSymlinks: false) |
| File size limit | Files exceeding 10MB are rejected (configurable) |

Configuration Options

type LayeredProviderConfig struct {
    // ExternalDir is the path to the external data directory
    ExternalDir string

    // MaxFileSize is the maximum allowed file size in bytes (default: 10MB)
    MaxFileSize int64

    // AllowSymlinks allows symlinks in the external directory (default: false)
    AllowSymlinks bool
}

Usage Example

Creating an external data directory:

my-data/
├── registry.yaml                  # Required - merged with embedded
├── overlays/
│   └── my-custom-overlay.yaml     # Adds new overlay
└── components/
    └── gpu-operator/
        └── values.yaml            # Replaces embedded gpu-operator values

External registry.yaml (adds custom Helm component):

apiVersion: aicr.nvidia.com/v1alpha1
kind: ComponentRegistry
components:
  - name: my-custom-operator
    displayName: My Custom Operator
    helm:
      defaultRepository: https://my-charts.example.com
      defaultChart: my-custom-operator
      defaultVersion: v1.0.0

External registry.yaml (adds custom Kustomize component):

apiVersion: aicr.nvidia.com/v1alpha1
kind: ComponentRegistry
components:
  - name: my-kustomize-app
    displayName: My Kustomize App
    valueOverrideKeys:
      - mykustomize
    kustomize:
      defaultSource: https://github.com/example/my-app
      defaultPath: deploy/production
      defaultTag: v1.0.0

Note: A component must have either helm OR kustomize configuration, not both.
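The helm/kustomize rule can be checked with a small validation function. The sketch below is illustrative: the struct and function names are hypothetical stand-ins that mirror the YAML keys shown above, not the project's actual ComponentConfig type.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical types mirroring the registry YAML keys.
type helmConfig struct {
	DefaultRepository, DefaultChart, DefaultVersion string
}

type kustomizeConfig struct {
	DefaultSource, DefaultPath, DefaultTag string
}

type componentConfig struct {
	Name      string
	Helm      *helmConfig      // nil when the helm section is absent
	Kustomize *kustomizeConfig // nil when the kustomize section is absent
}

// validateComponent enforces that exactly one of helm or kustomize is set.
func validateComponent(c componentConfig) error {
	switch {
	case c.Name == "":
		return errors.New("component name is required")
	case c.Helm != nil && c.Kustomize != nil:
		return fmt.Errorf("component %q: helm and kustomize are mutually exclusive", c.Name)
	case c.Helm == nil && c.Kustomize == nil:
		return fmt.Errorf("component %q: one of helm or kustomize is required", c.Name)
	}
	return nil
}

func main() {
	ok := componentConfig{
		Name: "my-custom-operator",
		Helm: &helmConfig{
			DefaultRepository: "https://my-charts.example.com",
			DefaultChart:      "my-custom-operator",
			DefaultVersion:    "v1.0.0",
		},
	}
	both := ok
	both.Kustomize = &kustomizeConfig{DefaultSource: "https://github.com/example/my-app"}

	fmt.Println(validateComponent(ok))
	fmt.Println(validateComponent(both))
}
```

Modeling the two sections as pointers makes "absent" distinguishable from "present but empty", which a plain struct value would not.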

CLI usage:

# Generate recipe with external data
$ aicr recipe --service eks --accelerator h100 --data ./my-data

# Bundle with external data
$ aicr bundle --recipe recipe.yaml --data ./my-data --output ./bundles

Debugging

Use the --debug flag to see detailed logging about external data loading:

$ aicr --debug recipe --service eks --data ./my-data

Debug logs include:

  • Files discovered in external directory
  • Source resolution for each file (embedded vs external vs merged)
  • Component merge details (added, overridden, retained)

Implementation Details

The data provider is initialized early in CLI command execution:

// pkg/cli/root.go
func initDataProvider(cmd *cli.Command) error {
    dataDir := cmd.String("data")
    if dataDir == "" {
        return nil // Use default embedded provider
    }

    embedded := recipe.NewEmbeddedDataProvider(recipe.GetEmbeddedFS(), "data")
    layered, err := recipe.NewLayeredDataProvider(embedded, recipe.LayeredProviderConfig{
        ExternalDir:   dataDir,
        AllowSymlinks: false,
    })
    if err != nil {
        return err
    }

    recipe.SetDataProvider(layered)
    return nil
}

Global Provider Pattern:

  • SetDataProvider() sets the global data provider
  • GetDataProvider() returns the current provider (defaults to embedded)
  • GetDataProviderGeneration() returns a counter for cache invalidation
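One way such a pattern can work is shown below: a mutex-guarded global plus an atomic counter that caches compare against. This is a hedged sketch, not the code in pkg/recipe; the lowercase names (`setProvider`, `getProvider`, `providerGeneration`, `registryCache`) are hypothetical analogues of the functions listed above, and the one-method `provider` interface is reduced for brevity.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// provider is a reduced stand-in for the DataProvider interface.
type provider interface {
	Source(path string) string
}

type namedProvider string

func (n namedProvider) Source(string) string { return string(n) }

var (
	mu         sync.RWMutex
	current    provider = namedProvider("embedded") // default provider
	generation atomic.Uint64
)

// setProvider swaps the global provider and bumps the generation counter.
func setProvider(p provider) {
	mu.Lock()
	current = p
	mu.Unlock()
	generation.Add(1)
}

func getProvider() provider {
	mu.RLock()
	defer mu.RUnlock()
	return current
}

func providerGeneration() uint64 { return generation.Load() }

// registryCache reloads its value whenever the provider generation changes.
// It is not safe for concurrent use; a real cache would need its own lock.
type registryCache struct {
	gen   uint64
	value string
	valid bool
}

func (c *registryCache) get(load func(provider) string) string {
	gen := providerGeneration()
	if !c.valid || c.gen != gen {
		c.value = load(getProvider())
		c.gen = gen
		c.valid = true
	}
	return c.value
}

func main() {
	cache := &registryCache{}
	load := func(p provider) string { return "registry from " + p.Source("registry.yaml") }

	fmt.Println(cache.get(load))
	setProvider(namedProvider("layered"))
	fmt.Println(cache.get(load)) // reloaded because the generation changed
}
```

Comparing a stored generation against the current counter lets any number of caches invalidate themselves lazily, without the setter needing to know which caches exist.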

See Also