Data Architecture | NVIDIA AI Cluster Runtime

This document describes the recipe metadata system that the CLI and API use to generate optimized system configuration recommendations (i.e., recipes) from environment parameters.

Overview

The recipe system is a rule-based configuration engine that generates tailored system configurations by:

1. Starting with a base recipe - Universal component definitions and constraints applicable to every recipe
2. Matching environment-specific overlays - Targeted configurations based on query criteria (service, accelerator, OS, intent)
3. Resolving inheritance chains - Overlays can inherit from intermediate recipes to reduce duplication
4. Merging configurations - Components, constraints, and values are merged with overlay precedence
5. Computing deployment order - Topological sort of components based on dependency references

The recipe data is organized in recipes/ as multiple YAML files:

recipes/
├── registry.yaml                  # Component registry (Helm & Kustomize configs)
├── overlays/                      # Recipe overlays (including base)
│   ├── base.yaml                  # Root recipe - all recipes inherit from this
│   ├── eks.yaml                   # EKS-specific settings
│   ├── eks-training.yaml          # EKS + training workloads (inherits from eks)
│   ├── gb200-eks-ubuntu-training.yaml # GB200/EKS/Ubuntu/training (inherits from eks-training)
│   └── h100-ubuntu-inference.yaml # H100/Ubuntu/inference
├── mixins/                        # Composable mixin fragments (kind: RecipeMixin)
│   ├── os-ubuntu.yaml             # Ubuntu OS constraints (shared by leaf overlays)
│   ├── platform-inference.yaml    # Inference gateway components (shared by service-inference overlays)
│   └── platform-kubeflow.yaml     # Kubeflow trainer component (shared by leaf overlays)
└── components/                    # Component values files
    ├── cert-manager/
    │   └── values.yaml
    ├── gpu-operator/
    │   ├── values.yaml            # Base GPU Operator values
    │   └── values-eks-training.yaml # EKS training-optimized values
    ├── network-operator/
    │   └── values.yaml
    ├── nvidia-dra-driver-gpu/
    │   └── values.yaml
    ├── nvsentinel/
    │   └── values.yaml
    └── nodewright-operator/
        └── values.yaml

Note: These files are embedded into both the CLI binary and API server at compile time, making the system fully self-contained with no external dependencies.

Extensibility: The embedded data can be extended or overridden using the --data flag. See External Data Provider for details.

Recipe Usage Patterns:

CLI Query Mode - Direct recipe generation from criteria parameters:

$ aicr recipe --os ubuntu --accelerator h100 --service eks --intent training

CLI Snapshot Mode - Analyze captured system state to infer criteria:

$ aicr snapshot --output system.yaml
$ aicr recipe --snapshot system.yaml --intent training

API Server - HTTP endpoint (query mode only):

$ curl "http://localhost:8080/v1/recipe?os=ubuntu&accelerator=h100&service=eks&intent=training"
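A Go client can assemble the same query string programmatically. The following is a minimal sketch: `recipeURL` is a hypothetical helper, and dropping empty values (so that they are left unspecified) is an assumption about how a caller would express wildcards. Only the endpoint path and parameter names come from the curl example above.

```go
package main

import (
	"fmt"
	"net/url"
)

// recipeURL builds a query-mode URL for the /v1/recipe endpoint.
// Empty criteria values are skipped so they stay unspecified.
func recipeURL(base string, criteria map[string]string) string {
	q := url.Values{}
	for k, v := range criteria {
		if v != "" {
			q.Set(k, v)
		}
	}
	// Encode sorts parameters by key, so the result is deterministic.
	return base + "/v1/recipe?" + q.Encode()
}

func main() {
	fmt.Println(recipeURL("http://localhost:8080", map[string]string{
		"os":          "ubuntu",
		"accelerator": "h100",
		"service":     "eks",
		"intent":      "training",
	}))
}
```

The parameters come out in alphabetical order rather than the order shown in the curl example; query parameter order does not change the meaning of the request.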

Data Structure

Recipe Metadata Format

Each recipe file follows this structure:

kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: <recipe-name>  # Unique identifier (e.g., "eks-training", "gb200-eks-ubuntu-training")

spec:
  base: <parent-recipe>  # Optional - inherits from another recipe
  mixins:                # Optional - composable mixin fragments
    - os-ubuntu          #   OS constraints (from recipes/mixins/)
    - platform-kubeflow  #   Platform components (from recipes/mixins/)

  criteria:              # When this recipe/overlay applies
    service: eks         # Kubernetes platform
    accelerator: gb200   # GPU type
    os: ubuntu           # Operating system
    intent: training     # Workload purpose
    platform: kubeflow   # Platform/framework (optional)

  constraints:           # Deployment requirements
    - name: K8s.server.version
      value: ">= 1.32"

  componentRefs:         # Components to deploy
    - name: gpu-operator
      type: Helm
      source: https://helm.ngc.nvidia.com/nvidia
      version: v25.3.3
      valuesFile: components/gpu-operator/values.yaml
      dependencyRefs:
        - cert-manager

Top-Level Fields

Field	Description
`kind`	Always `RecipeMetadata`
`apiVersion`	Always `aicr.nvidia.com/v1alpha1`
`metadata.name`	Unique recipe identifier
`spec.base`	Parent recipe to inherit from (empty = inherits from `overlays/base.yaml`)
`spec.mixins`	List of mixin names to compose (e.g., `["os-ubuntu", "platform-kubeflow"]`)
`spec.criteria`	Query parameters that select this recipe
`spec.constraints`	Pre-flight validation rules
`spec.componentRefs`	List of components to deploy

Criteria Fields

Criteria define when a recipe matches a user query:

Field	Type	Description	Example Values
`service`	String	Kubernetes platform	`eks`, `gke`, `aks`, `oke`, `kind`, `lke`, `bcm`
`accelerator`	String	GPU hardware type	`h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `rtx-pro-6000`
`os`	String	Operating system	`ubuntu`, `rhel`, `cos`, `amazonlinux`
`intent`	String	Workload purpose	`training`, `inference`
`platform`	String	Platform/framework type	`dynamo`, `kubeflow`, `nim`, `runai`, `slurm`
`nodes`	Integer	Node count (0 = any)	`8`, `16`

All fields are optional. Unpopulated fields act as wildcards (match any value).

Constraint Format

Constraints use fully qualified measurement paths:

constraints:
  - name: K8s.server.version    # {type}.{subtype}.{key}
    value: ">= 1.32"            # Expression or exact value

  - name: OS.release.ID
    value: ubuntu               # Exact match

  - name: OS.release.VERSION_ID
    value: "24.04"

  - name: OS.sysctl./proc/sys/kernel/osrelease
    value: ">= 6.8"

Constraint Path Format: {MeasurementType}.{Subtype}.{Key}

Measurement Type	Common Subtypes
`K8s`	`server`, `image`, `config`
`OS`	`release`, `sysctl`, `kmod`, `grub`
`GPU`	`smi`, `driver`, `device`
`SystemD`	`containerd.service`, `kubelet.service`

Supported Operators: >=, <=, >, <, ==, !=, or exact match (no operator)
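As an illustration of how such expressions could be evaluated, the sketch below splits a constraint value into an operator and an operand, then compares dotted numeric versions. All function names are hypothetical, and the comparison is deliberately simplified (missing or non-numeric segments count as 0); it is not the project's evaluator.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Two-character operators come first so ">=" is not mistaken for ">".
var operators = []string{">=", "<=", "==", "!=", ">", "<"}

// splitExpr separates a constraint value into operator and operand.
// An empty operator means "exact match".
func splitExpr(expr string) (op, operand string) {
	expr = strings.TrimSpace(expr)
	for _, o := range operators {
		if strings.HasPrefix(expr, o) {
			return o, strings.TrimSpace(expr[len(o):])
		}
	}
	return "", expr
}

// compareVersions compares dotted numeric versions and returns -1, 0, or 1.
// Missing or non-numeric segments count as 0 (a simplification).
func compareVersions(a, b string) int {
	as := strings.Split(strings.TrimPrefix(a, "v"), ".")
	bs := strings.Split(strings.TrimPrefix(b, "v"), ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var x, y int
		if i < len(as) {
			x, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			y, _ = strconv.Atoi(bs[i])
		}
		if x < y {
			return -1
		}
		if x > y {
			return 1
		}
	}
	return 0
}

// satisfies reports whether a measured value meets a constraint expression.
func satisfies(actual, expr string) bool {
	op, want := splitExpr(expr)
	if op == "" {
		return actual == want // exact match, no operator
	}
	c := compareVersions(actual, want)
	switch op {
	case ">=":
		return c >= 0
	case "<=":
		return c <= 0
	case ">":
		return c > 0
	case "<":
		return c < 0
	case "==":
		return c == 0
	default: // "!="
		return c != 0
	}
}

func main() {
	fmt.Println(satisfies("1.33.4", ">= 1.32"))
	fmt.Println(satisfies("ubuntu", "ubuntu"))
	fmt.Println(satisfies("24.04", "22.04"))
}
```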

Component Reference Structure

Each component in componentRefs defines a deployable unit. Components can be either Helm or Kustomize based.

Helm Component Example:

componentRefs:
  - name: gpu-operator           # Component identifier (must match registry name)
    type: Helm                   # Deployment type
    source: https://helm.ngc.nvidia.com/nvidia  # Helm repository URL
    version: v25.3.3             # Chart version
    valuesFile: components/gpu-operator/values.yaml  # Path to values file
    overrides:                   # Inline value overrides
      driver:
        version: 580.82.07
      cdi:
        enabled: true
    dependencyRefs:              # Components this depends on
      - cert-manager

Kustomize Component Example:

componentRefs:
  - name: my-kustomize-app       # Component identifier (must match registry name)
    type: Kustomize              # Deployment type
    source: https://github.com/example/my-app  # Git repository or OCI reference
    tag: v1.0.0                  # Git tag, branch, or commit
    path: deploy/production      # Path to kustomization within repo
    patches:                     # Patch files to apply
      - patches/custom-patch.yaml
    dependencyRefs:
      - cert-manager

Component Fields

Field	Required	Description
`name`	Yes	Unique component identifier (matches registry name)
`type`	Yes	`Helm` or `Kustomize`
`source`	Yes	Repository URL, OCI reference, or Git URL
`version`	No	Chart version (for Helm)
`tag`	No	Git tag, branch, or commit (for Kustomize)
`path`	No	Path to kustomization within repository (for Kustomize)
`valuesFile`	No	Path to values file (relative to data directory, for Helm)
`overrides`	No	Inline values that override valuesFile (for Helm)
`patches`	No	Patch files to apply (for Kustomize)
`dependencyRefs`	No	List of component names this depends on

Multi-Level Inheritance

Recipe files support multi-level inheritance through the spec.base field. This enables building inheritance chains where intermediate recipes capture shared configurations, reducing duplication and improving maintainability.

Inheritance Mechanism

Each recipe can specify a parent recipe via spec.base:

kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: gb200-eks-ubuntu-training

spec:
  base: eks-training  # Inherits from eks-training recipe

  criteria:
    service: eks
    accelerator: gb200
    os: ubuntu
    intent: training

  # Only GB200-specific overrides here
  componentRefs:
    - name: gpu-operator
      version: v25.3.3
      overrides:
        driver:
          version: 580.82.07

Inheritance Chain Example

The system supports multi-level inheritance chains (the repository tests cap chain depth at 10 levels):

overlays/base.yaml
    │
    ├── overlays/eks.yaml (spec.base: empty → inherits from base)
    │       │
    │       └── overlays/eks-training.yaml (spec.base: eks)
    │               │
    │               └── overlays/gb200-eks-training.yaml (spec.base: eks-training)
    │                       │
    │                       └── overlays/gb200-eks-ubuntu-training.yaml (spec.base: gb200-eks-training)
    │
    └── overlays/h100-ubuntu-inference.yaml (spec.base: empty → inherits from base)

Resolution Order: When resolving gb200-eks-ubuntu-training:

1. Start with overlays/base.yaml (root)
2. Merge overlays/eks.yaml (EKS-specific settings)
3. Merge overlays/eks-training.yaml (training optimizations)
4. Merge overlays/gb200-eks-training.yaml (GB200 + training-specific overrides)
5. Merge overlays/gb200-eks-ubuntu-training.yaml (Ubuntu + full-spec overrides)
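The resolution order above can be derived by following `spec.base` pointers from the leaf up to the root and then reversing the path. The sketch below assumes recipes are indexed by name in a map; `resolveChain` and that map are hypothetical and do not reflect the project's internal types.

```go
package main

import "fmt"

// resolveChain returns the recipe names to merge, root first.
// bases maps recipe name -> spec.base ("" means "inherit from base").
func resolveChain(name string, bases map[string]string) ([]string, error) {
	var chain []string
	seen := map[string]bool{}
	for cur := name; cur != "base"; {
		if seen[cur] {
			return nil, fmt.Errorf("cycle detected at %q", cur)
		}
		parent, ok := bases[cur]
		if !ok {
			return nil, fmt.Errorf("unknown recipe %q", cur)
		}
		seen[cur] = true
		chain = append(chain, cur)
		if parent == "" {
			parent = "base" // an empty spec.base falls back to the root
		}
		cur = parent
	}
	chain = append(chain, "base")
	// Reverse so the root comes first and the leaf last.
	for i, j := 0, len(chain)-1; i < j; i, j = i+1, j-1 {
		chain[i], chain[j] = chain[j], chain[i]
	}
	return chain, nil
}

func main() {
	bases := map[string]string{
		"eks":                       "",
		"eks-training":              "eks",
		"gb200-eks-training":        "eks-training",
		"gb200-eks-ubuntu-training": "gb200-eks-training",
	}
	chain, err := resolveChain("gb200-eks-ubuntu-training", bases)
	fmt.Println(chain, err)
}
```

Merging then proceeds left to right over the returned slice, so entries later in the chain take precedence.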

Inheritance Rules

1. Base Resolution

spec.base: "" or omitted → Inherits directly from overlays/base.yaml
spec.base: "eks" → Inherits from the recipe named “eks”
The root overlays/base.yaml has no parent (it’s the foundation)

2. Merge Precedence

Later recipes in the chain override earlier ones:

base → eks → eks-training → gb200-eks-training → gb200-eks-ubuntu-training
(lowest)                                            (highest priority)

3. Field Merging

Constraints: Same-named constraints are overridden; new constraints are added
ComponentRefs: Same-named components are merged field-by-field; new components are added
Criteria: Each recipe defines its own criteria (not inherited)
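A minimal sketch of the first two merge rules, assuming simplified `Constraint` and `ComponentRef` types with string fields where an empty string means "not set". The types and helper names are illustrative and cover only a subset of the documented fields.

```go
package main

import "fmt"

type Constraint struct{ Name, Value string }

// ComponentRef carries a subset of the documented fields; "" means "not set".
type ComponentRef struct {
	Name, Type, Source, Version, ValuesFile string
}

// mergeConstraints overrides same-named constraints and appends new ones.
func mergeConstraints(base, overlay []Constraint) []Constraint {
	out := append([]Constraint(nil), base...)
	index := map[string]int{}
	for i, c := range out {
		index[c.Name] = i
	}
	for _, c := range overlay {
		if i, ok := index[c.Name]; ok {
			out[i].Value = c.Value
			continue
		}
		index[c.Name] = len(out)
		out = append(out, c)
	}
	return out
}

// pick returns the overlay value when it is set, otherwise the base value.
func pick(base, overlay string) string {
	if overlay != "" {
		return overlay
	}
	return base
}

// mergeComponents merges same-named components field by field and appends new ones.
func mergeComponents(base, overlay []ComponentRef) []ComponentRef {
	out := append([]ComponentRef(nil), base...)
	index := map[string]int{}
	for i, c := range out {
		index[c.Name] = i
	}
	for _, o := range overlay {
		i, ok := index[o.Name]
		if !ok {
			index[o.Name] = len(out)
			out = append(out, o)
			continue
		}
		b := out[i]
		out[i] = ComponentRef{
			Name:       b.Name,
			Type:       pick(b.Type, o.Type),
			Source:     pick(b.Source, o.Source),
			Version:    pick(b.Version, o.Version),
			ValuesFile: pick(b.ValuesFile, o.ValuesFile),
		}
	}
	return out
}

func main() {
	base := []ComponentRef{{Name: "gpu-operator", Type: "Helm", Version: "v25.10.1",
		ValuesFile: "components/gpu-operator/values.yaml"}}
	overlay := []ComponentRef{{Name: "gpu-operator",
		ValuesFile: "components/gpu-operator/values-eks-training.yaml"}}
	fmt.Printf("%+v\n", mergeComponents(base, overlay))
}
```

In this sketch an overlay that sets only `valuesFile` keeps the type, source, and version it inherited from the parent.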

Intermediate vs Leaf Recipes

Intermediate Recipes (e.g., eks.yaml, eks-training.yaml):

Have partial criteria (not all fields specified)
Capture shared configurations for a category
Can be matched by user queries (but typically less specific)

Leaf Recipes (e.g., gb200-eks-ubuntu-training.yaml):

Have complete criteria (all required fields)
Matched by specific user queries
Contain final, hardware-specific overrides

Example: Inheritance Chain

# overlays/base.yaml - Foundation for all recipes
kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: base

spec:
  constraints:
    - name: K8s.server.version
      value: ">= 1.25"

  componentRefs:
    - name: cert-manager
      type: Helm
      source: https://charts.jetstack.io
      version: v1.20.2
      valuesFile: components/cert-manager/values.yaml

    - name: gpu-operator
      type: Helm
      source: https://helm.ngc.nvidia.com/nvidia
      version: v25.10.1
      valuesFile: components/gpu-operator/values.yaml
      dependencyRefs:
        - cert-manager

# eks.yaml - EKS-specific settings
kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: eks

spec:
  # Implicit base (inherits from overlays/base.yaml)

  criteria:
    service: eks  # Only service specified (partial criteria)

  constraints:
    - name: K8s.server.version
      value: ">= 1.28"  # EKS minimum version

# eks-training.yaml - EKS training workloads
kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: eks-training

spec:
  base: eks  # Inherits EKS settings

  criteria:
    service: eks
    intent: training  # Added training intent (still partial)

  constraints:
    - name: K8s.server.version
      value: ">= 1.30"  # Training requires newer K8s

  componentRefs:
    # Training workloads use training-optimized values
    - name: gpu-operator
      valuesFile: components/gpu-operator/values-eks-training.yaml

Benefits of Multi-Level Inheritance

Benefit	Description
Reduced Duplication	Shared settings defined once in intermediate recipes
Easier Maintenance	Update EKS settings in one place, all EKS recipes inherit
Clear Organization	Hierarchy reflects logical relationships
Flexible Extension	Add new leaf recipes without duplicating parent configs
Testable	Each level can be validated independently

Mixin Composition

Inheritance is single-parent (spec.base), which means cross-cutting concerns like OS constraints or platform components would need to be duplicated across leaf overlays. Mixins solve this by providing composable fragments that leaf overlays reference via spec.mixins.

Mixin files live in recipes/mixins/ and use kind: RecipeMixin:

# recipes/mixins/os-ubuntu.yaml
kind: RecipeMixin
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: os-ubuntu

spec:
  constraints:
    - name: OS.release.ID
      value: ubuntu
    - name: OS.release.VERSION_ID
      value: "24.04"
    - name: OS.sysctl./proc/sys/kernel/osrelease
      value: ">= 6.8"

Leaf overlays compose mixins alongside inheritance:

# recipes/overlays/h100-eks-ubuntu-training-kubeflow.yaml
spec:
  base: h100-eks-training
  mixins:
    - os-ubuntu          # Ubuntu constraints
    - platform-kubeflow  # Kubeflow trainer component
  criteria:
    service: eks
    accelerator: h100
    os: ubuntu
    intent: training
    platform: kubeflow

Mixin rules:

Mixins carry only constraints and componentRefs — no criteria, base, mixins, or validation
Mixins are applied after inheritance chain merging but before constraint evaluation
Conflict detection: a mixin constraint or component that conflicts with the inheritance chain or a previously applied mixin produces an error
When a snapshot is provided, mixin constraints are evaluated against it after merging; if any fail, the entire composed candidate is invalid and falls back to base-only output. In plain query mode (no snapshot), mixin constraints are merged but not evaluated
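The conflict-detection rule can be pictured with the sketch below. It assumes that a "conflict" means a constraint name that is already present in the merged result; the text does not spell out the exact definition, so treat that, along with the `applyMixin` helper, as an illustrative assumption.

```go
package main

import "fmt"

type Constraint struct{ Name, Value string }

// applyMixin appends a mixin's constraints to the already-merged list.
// Assumption: any constraint name that is already present is a conflict.
func applyMixin(merged []Constraint, mixinName string, mixin []Constraint) ([]Constraint, error) {
	seen := map[string]bool{}
	for _, c := range merged {
		seen[c.Name] = true
	}
	out := append([]Constraint(nil), merged...)
	for _, c := range mixin {
		if seen[c.Name] {
			return nil, fmt.Errorf("mixin %q: constraint %q conflicts with an earlier definition",
				mixinName, c.Name)
		}
		seen[c.Name] = true
		out = append(out, c)
	}
	return out, nil
}

func main() {
	chain := []Constraint{{"K8s.server.version", ">= 1.32"}}
	osUbuntu := []Constraint{{"OS.release.ID", "ubuntu"}, {"OS.release.VERSION_ID", "24.04"}}

	merged, err := applyMixin(chain, "os-ubuntu", osUbuntu)
	fmt.Println(len(merged), err)

	// Applying a second mixin that repeats a name is rejected.
	_, err = applyMixin(merged, "os-ubuntu-duplicate", osUbuntu)
	fmt.Println(err)
}
```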

Criteria-Wildcard Overlays

Some overlays apply across a criteria dimension without being referenced via spec.base or included via spec.mixins. The resolver picks them up automatically because FindMatchingOverlays can return multiple independent maximal-leaf overlays for a single query, not just one. Ancestors of a matched leaf are filtered out of the candidate set, but sibling leaves whose criteria independently match are kept and their inheritance chains are resolved and merged in parallel. See Criteria Matching Algorithm and Recipe Generation Process for details.

This is useful for content that cross-cuts one criteria dimension but must stay tied to others — for example, a GB200 deployment-phase floor (gpu-operator version pin + standard health checks) that applies to every service (EKS, OKE, etc.) and every intent for the accelerator.

# recipes/overlays/gb200-any.yaml
spec:
  base: base
  criteria:
    service: any         # Wildcard: matches eks, oke, gke, etc.
    accelerator: gb200
  validation:
    deployment:
      checks:
        - operator-health
        - expected-resources
        - gpu-operator-version
        - check-nvidia-smi
      constraints:
        - name: Deployment.gpu-operator.version
          value: ">= v25.10.0"

When a query specifies {service: eks, accelerator: gb200, intent: training}, the resolver returns three maximal leaves: gb200-eks-training (matched by explicit criteria), gb200-any (matched by wildcard service: any), and monitoring-hpa (matched by wildcard intent: any). Their inheritance chains are resolved and merged with the base spec:

appliedOverlays:
  - base
  - monitoring-hpa
  - gb200-any              # matched by wildcard criteria, not via base:
  - eks
  - eks-training
  - gb200-eks-training

The gpu-operator version pin from gb200-any lands in the hydrated recipe without being duplicated in each service-specific overlay. (Adding os: ubuntu to the query would extend the chain with gb200-eks-ubuntu-training as the maximal leaf in place of gb200-eks-training; gb200-any would still match independently.)

Naming convention. The -any (or -any-<intent>) segment signals this pattern: the static segments indicate the fixed criteria dimensions (accelerator, optionally intent), and any marks the wildcard dimension. Examples: gb200-any.yaml, h100-any.yaml, rtx-pro-6000-any.yaml.

Don’t carry per-fabric values here. Cross-service-uniform content (gpu-operator version pin, standard health checks) is a good fit. Per-fabric content (NCCL bandwidth thresholds across services with different network fabrics — EFA, TCPXO, RoCE) is not — declare those in each service-specific leaf instead. The intent-scoped gb200-any-training.yaml and b200-any-training.yaml overlays that previously carried cross-service NCCL thresholds were retired for this reason (gb200-any-training in #1052, b200-any-training in #1053).

When to use a criteria-wildcard overlay vs a mixin:

Use a criteria-wildcard overlay when…	Use a mixin when…
Content applies based on query criteria	Content applies based on explicit opt-in
The set of consumers is determined by criteria matching	The set of consumers is an enumerated list of overlays
Adopt-by-default is desired for new matching overlays	Each consumer should reference it explicitly
You want to add `validation` blocks (mixins don’t carry validation)	You only need `constraints` / `componentRefs`

Precedence when a wildcard overlay and a service-specific leaf collide. FindMatchingOverlays sorts its returned leaves by Criteria.Specificity() ascending, so less-specific overlays merge first and more-specific overlays merge last. Both top-level constraints and spec.validation.<phase> blocks merge per-field — the more-specific leaf adds to or overrides the wildcard’s block without replacing it wholesale:

Top-level spec.constraints merge by name. A same-named constraint from the more-specific leaf overrides the wildcard’s value (the “overridden, new added” rule from the merge algorithm).
spec.validation.<phase> blocks (readiness, deployment, performance, conformance) also merge per-field:
- checks: an omitted field (nil) inherits the parent’s list; an explicit empty list (checks: []) clears the inherited list for that phase; a non-empty list unions with the parent’s, deduplicated and preserving order (parent entries first, then leaf-only entries appended).
- constraints: same nil-vs-empty rule as checks; a non-empty list unions by name with the leaf winning on same-name (analogous to the top-level constraint merge).
- nodeSelection: overlay replaces wholesale when non-nil (full struct replace).
- timeout, infrastructure: overlay-wins-if-non-empty.

A leaf that only needs to tighten one constraint can declare just that constraint — the wildcard’s checks and other constraints are inherited automatically:

# recipes/overlays/gb200-eks-training.yaml (illustrative; the real leaf has
# the same shape with the project's current GB200 floor value)
spec:
  validation:
    deployment:
      # `checks` omitted: the wildcard's 4 standard checks are inherited.
      constraints:
        - name: Deployment.gpu-operator.version
          value: ">= v25.10.1"           # Tighten past the wildcard's floor.

Leaves may still restate the inherited list when it documents intent (e.g., recording which checks the leaf depends on). With per-field union, restating is a no-op: duplicates dedupe against the wildcard’s entries.

To intentionally drop an inherited check or constraint, declare the field as an explicit empty list (checks: [] / constraints: []) rather than omitting it. Omission means “inherit”; explicit empty means “clear”:

# recipes/overlays/h100-eks-ubuntu-training-slurm.yaml: Slurm-managed
# clusters bypass slurmd for K8s-native NCCL tests, so the leaf clears
# the inherited performance phase rather than running it.
spec:
  validation:
    performance:
      checks: []        # Explicit clear: drops inherited nccl-all-reduce-bw.
      constraints: []   # Same: clears the inherited >= 300 floor.

This distinction is preserved by yaml.v3: a YAML [] decodes as a non-nil empty slice (clear), while an omitted or null field decodes as nil (inherit).
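The nil-versus-empty rule for a phase's `checks` list maps directly onto Go slices. The sketch below is illustrative; `mergeChecks` is a hypothetical helper, not the project's function name.

```go
package main

import "fmt"

// mergeChecks applies the rule for a validation phase's checks list:
// a nil leaf inherits, an empty non-nil leaf clears, and anything else
// is an ordered union with parent entries first.
func mergeChecks(parent, leaf []string) []string {
	if leaf == nil {
		return parent // field omitted: inherit
	}
	if len(leaf) == 0 {
		return []string{} // explicit []: clear
	}
	seen := map[string]bool{}
	out := []string{}
	for _, list := range [][]string{parent, leaf} {
		for _, check := range list {
			if !seen[check] {
				seen[check] = true
				out = append(out, check)
			}
		}
	}
	return out
}

func main() {
	parent := []string{"operator-health", "expected-resources"}
	fmt.Println(mergeChecks(parent, nil))
	fmt.Println(mergeChecks(parent, []string{}))
	fmt.Println(mergeChecks(parent, []string{"expected-resources", "check-nvidia-smi"}))
}
```

Restating an inherited check in the leaf leaves the result unchanged, because the duplicate is dropped during the union.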

Criteria-wildcard overlays are only appropriate when the content is genuinely uniform across the wildcard dimension. If the value diverges (e.g., H100 NCCL targets differ by cloud: AKS ≥ 100, EKS ≥ 300, GKE ≥ 250), keep it inline in each service-specific overlay — collapsing divergent values to a lowest-common-denominator wildcard silently weakens validation.

See also: ADR-005: Overlay Refactoring — rationale for the maximal-leaf resolver semantics (Phase 2) and why wildcard overlays are preferred over multi-parent inheritance or intermediate-reparenting approaches that were prototyped and rejected.

Cycle Detection

The system detects circular inheritance to prevent infinite loops:

# INVALID: Would create cycle
# a.yaml: spec.base: b
# b.yaml: spec.base: c
# c.yaml: spec.base: a  ← Cycle detected!

Tests in pkg/recipe/yaml_test.go automatically validate:

All spec.base references point to existing recipes
No circular inheritance chains exist
Inheritance depth is reasonable (max 10 levels)
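The three checks can be expressed as a single walk over each recipe's parent pointers. This is a sketch under stated assumptions, not the contents of pkg/recipe/yaml_test.go: the `validateBases` helper, the map-based index, and the way depth is counted (the starting recipe counts as level 1) are all illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

const maxDepth = 10 // depth limit mentioned in the text

// validateBases checks every spec.base reference in a recipe set.
// bases maps recipe name -> spec.base ("" means the implicit root).
func validateBases(bases map[string]string) []error {
	names := make([]string, 0, len(bases))
	for name := range bases {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic error order

	var errs []error
	for _, start := range names {
		visited := map[string]bool{}
		depth := 0
		for cur := start; cur != ""; cur = bases[cur] {
			if _, ok := bases[cur]; !ok {
				errs = append(errs, fmt.Errorf("%s: base %q does not exist", start, cur))
				break
			}
			if visited[cur] {
				errs = append(errs, fmt.Errorf("%s: circular inheritance through %q", start, cur))
				break
			}
			visited[cur] = true
			depth++
			if depth > maxDepth {
				errs = append(errs, fmt.Errorf("%s: chain deeper than %d levels", start, maxDepth))
				break
			}
		}
	}
	return errs
}

func main() {
	// The invalid example from the text: a -> b -> c -> a.
	for _, err := range validateBases(map[string]string{"a": "b", "b": "c", "c": "a"}) {
		fmt.Println(err)
	}
}
```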

Component Configuration

Components are configured through a three-tier system with clear precedence.

Configuration Patterns

Pattern 1: ValuesFile Only

Traditional approach, with all values in a separate file:

componentRefs:
  - name: gpu-operator
    valuesFile: components/gpu-operator/values.yaml

Pattern 2: Overrides Only

Fully self-contained recipe with inline values:

componentRefs:
  - name: gpu-operator
    overrides:
      driver:
        version: 580.82.07
      cdi:
        enabled: true

Pattern 3: ValuesFile + Overrides (Hybrid)

Reusable base with recipe-specific tweaks:

componentRefs:
  - name: gpu-operator
    valuesFile: components/gpu-operator/values.yaml  # Base configuration
    overrides:                                        # Recipe-specific tweaks
      driver:
        version: 580.82.07

Value Merge Precedence

When values are resolved, merge order is:

Base ValuesFile → Overlay ValuesFile → Overlay Overrides → CLI --set flags
     (lowest)                                                   (highest)

Base ValuesFile: Values from inherited recipes
Overlay ValuesFile: Values file specified in the matching overlay
Overlay Overrides: Inline overrides in the overlay’s componentRef
CLI --set flags: Runtime overrides from aicr bundle --set
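One way to picture this precedence is a deep merge applied layer by layer, lowest priority first. The sketch below assumes values are held as nested `map[string]any` trees; `deepMerge` and `resolveValues` are hypothetical helpers and not the project's API.

```go
package main

import "fmt"

// deepMerge overlays src onto dst in place. Nested maps merge key by key;
// any other value from src replaces the one in dst. Maps taken from src are
// copied so later layers never modify earlier ones.
func deepMerge(dst, src map[string]any) map[string]any {
	for k, v := range src {
		if sv, ok := v.(map[string]any); ok {
			dv, _ := dst[k].(map[string]any)
			if dv == nil {
				dv = map[string]any{}
			}
			dst[k] = deepMerge(dv, sv)
			continue
		}
		dst[k] = v
	}
	return dst
}

// resolveValues applies the layers from lowest to highest precedence.
func resolveValues(layers ...map[string]any) map[string]any {
	out := map[string]any{}
	for _, layer := range layers {
		deepMerge(out, layer)
	}
	return out
}

func main() {
	baseValues := map[string]any{
		"driver": map[string]any{"version": "580.105.08", "enabled": true},
	}
	overlayOverrides := map[string]any{
		"driver": map[string]any{"version": "580.82.07"},
	}
	fmt.Println(resolveValues(baseValues, overlayOverrides))
}
```

Because the merge is per key, the override replaces only `driver.version`; `driver.enabled` survives from the lower-precedence layer.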

Component Values Files

Values files are stored in recipes/components/{component}/:

# components/gpu-operator/values.yaml
operator:
  upgradeCRD: true
  resources:
    limits:
      cpu: 500m
      memory: 700Mi

driver:
  version: 580.105.08
  enabled: true
  useOpenKernelModules: true
  rdma:
    enabled: true

devicePlugin:
  enabled: true

Dependency Management

Components can declare dependencies via dependencyRefs:

componentRefs:
  - name: cert-manager
    type: Helm
    version: v1.20.2

  - name: gpu-operator
    type: Helm
    version: v25.10.1
    dependencyRefs:
      - cert-manager  # Deploy cert-manager first

The system performs topological sort to compute deployment order, ensuring dependencies are deployed before dependents. The resulting order is exposed in RecipeResult.DeploymentOrder.
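A topological sort over `dependencyRefs` can be sketched with Kahn's algorithm. The code below is an illustration only: the `deploymentOrder` helper and the alphabetical tie-break are assumptions, and the project's actual ordering among independent components may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// deploymentOrder returns component names so that every dependency
// precedes its dependents. deps maps component -> dependencyRefs.
func deploymentOrder(deps map[string][]string) ([]string, error) {
	indegree := map[string]int{}
	dependents := map[string][]string{}
	for name, refs := range deps {
		indegree[name] = len(refs)
		for _, ref := range refs {
			if _, ok := deps[ref]; !ok {
				return nil, fmt.Errorf("%s depends on unknown component %s", name, ref)
			}
			dependents[ref] = append(dependents[ref], name)
		}
	}

	var ready []string
	for name, n := range indegree {
		if n == 0 {
			ready = append(ready, name)
		}
	}

	var order []string
	for len(ready) > 0 {
		sort.Strings(ready) // deterministic tie-break (an assumption)
		next := ready[0]
		ready = ready[1:]
		order = append(order, next)
		for _, d := range dependents[next] {
			indegree[d]--
			if indegree[d] == 0 {
				ready = append(ready, d)
			}
		}
	}
	if len(order) != len(deps) {
		return nil, fmt.Errorf("dependency cycle among %d components", len(deps)-len(order))
	}
	return order, nil
}

func main() {
	order, err := deploymentOrder(map[string][]string{
		"cert-manager": nil,
		"gpu-operator": {"cert-manager"},
	})
	fmt.Println(order, err)
}
```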

Regenerating the BOM

docs/user/container-images.md is an auto-generated bill of materials listing every container image AICR pulls across all components. It is regenerated by make bom-docs, which renders each Helm chart against its live OCI source and extracts image references from the rendered templates.

Run make bom-docs and commit the regenerated docs/user/container-images.md in the same PR whenever you:

Add or remove a component in recipes/registry.yaml
Bump a chart version (in registry.yaml, an overlay, or a mixin)
Modify a component’s values.yaml in a way that changes which images render (image repo override, subchart enable/disable, etc.)

The regen can also surface drift from upstream chart updates — when a chart bumps an image inside its own templates without a registry pin change on our side. That drift will appear in the BOM diff whether you expected it or not. Land it as part of the same PR that triggered the regen, or split it out as a separate “BOM catch-up” PR if the unrelated diff would obscure the primary change.

Freshness is not gated. make bom-check verifies the committed docs/user/container-images.md matches a fresh regen, but it is opt-in — neither make qualify nor make lint runs it today, and the merge gate has no PR-time BOM-staleness check (it only runs bom-pinning-check, which is the chart-pin verification per ADR-006). Run make bom-docs explicitly whenever you touch a component; do not rely on local qualify or CI to catch a missed regen. Wiring bom-check into the gate is a desirable follow-up.

Criteria Matching Algorithm

The recipe system uses an asymmetric rule matching algorithm where recipe criteria (rules) match against user queries (candidates).

Matching Rules

A recipe’s criteria matches a user query when every non-“any” field in the criteria is satisfied by the query:

Empty/unpopulated fields in recipe criteria = Wildcard (matches any query value)
Populated fields must match exactly (case-insensitive)
Matching is asymmetric: A recipe with specific fields (e.g., accelerator: h100) will NOT match a generic query (e.g., accelerator: any)

Asymmetric Matching Explained

The key insight is that matching is one-directional:

Recipe “any” (or empty) → Matches ANY query value (acts as wildcard)
Query “any” → Only matches recipe “any” (does NOT match specific recipes)

This prevents overly specific recipes from being selected when the user hasn’t specified those criteria.

Matching Logic

// Asymmetric matching: recipe criteria as receiver, query as parameter
func (c *Criteria) Matches(other *Criteria) bool {
    // If recipe (c) is "any" (or empty), it matches any query value (wildcard).
    // If query (other) is "any" but recipe is specific, it does NOT match.
    // If both have specific values, they must match exactly.

    // For each field, call matchesCriteriaField(recipeValue, queryValue)
    // ...
    return true
}

// matchesCriteriaField implements asymmetric matching for a single field.
func matchesCriteriaField(recipeValue, queryValue string) bool {
    recipeIsAny := recipeValue == "any" || recipeValue == ""
    queryIsAny := queryValue == "any" || queryValue == ""

    // If recipe is "any", it matches any query value (recipe is generic)
    if recipeIsAny {
        return true
    }

    // Recipe has a specific value
    // Query must also have that specific value (not "any")
    if queryIsAny {
        // Query is generic but recipe is specific - no match
        return false
    }

    // Both have specific values - must match exactly
    return recipeValue == queryValue
}

Specificity Scoring

When multiple recipes match, they are sorted by specificity (number of non-“any” fields). More specific recipes are applied later, giving them higher precedence:

func (c *Criteria) Specificity() int {
    score := 0
    if c.Service != "any" { score++ }
    if c.Accelerator != "any" { score++ }
    if c.Intent != "any" { score++ }
    if c.OS != "any" { score++ }
    if c.Nodes != 0 { score++ }
    return score
}

Matching Examples

Example 1: Broad Recipe Matches Specific Query

Recipe Criteria: { service: "eks" }
User Query:      { service: "eks", os: "ubuntu", accelerator: "h100" }
Result:          ✅ MATCH - Recipe only requires service=eks, other fields are wildcards
Specificity:     1

Example 2: Specific Recipe Doesn’t Match Different Specific Query

Recipe Criteria: { service: "eks", accelerator: "gb200", intent: "training" }
User Query:      { service: "eks", os: "ubuntu", accelerator: "h100" }
Result:          ❌ NO MATCH - Accelerator doesn't match (gb200 ≠ h100)

Example 3: Specific Recipe Doesn’t Match Generic Query (Asymmetric)

Recipe Criteria: { service: "eks", accelerator: "gb200", intent: "training" }
User Query:      { service: "eks", intent: "training" }  # accelerator unspecified = "any"
Result:          ❌ NO MATCH - Recipe requires gb200, query has "any" (wildcard doesn't match specific)

This asymmetric behavior ensures that a generic query like --service eks --intent training only matches generic recipes, not hardware-specific ones like gb200-eks-training.yaml.
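Examples 1 to 3 can be reproduced with a small self-contained program. The sketch below assumes `Criteria` holds plain string fields (the real field types are not shown in this document), omits the `nodes` field, and uses `strings.EqualFold` to get the case-insensitive comparison described under Matching Rules.

```go
package main

import (
	"fmt"
	"strings"
)

// Criteria uses plain strings here; the field types are an assumption.
type Criteria struct {
	Service, Accelerator, OS, Intent, Platform string
}

// isAny reports whether a value is a wildcard (empty or "any").
func isAny(v string) bool { return v == "" || strings.EqualFold(v, "any") }

// matchesField: a wildcard recipe value matches anything; a specific recipe
// value needs the same specific value in the query (case-insensitive).
func matchesField(recipe, query string) bool {
	if isAny(recipe) {
		return true
	}
	if isAny(query) {
		return false
	}
	return strings.EqualFold(recipe, query)
}

// Matches treats the receiver as the recipe rule and q as the user query.
func (c Criteria) Matches(q Criteria) bool {
	return matchesField(c.Service, q.Service) &&
		matchesField(c.Accelerator, q.Accelerator) &&
		matchesField(c.OS, q.OS) &&
		matchesField(c.Intent, q.Intent) &&
		matchesField(c.Platform, q.Platform)
}

func main() {
	h100Query := Criteria{Service: "eks", OS: "ubuntu", Accelerator: "h100"}
	genericQuery := Criteria{Service: "eks", Intent: "training"}

	eks := Criteria{Service: "eks"}
	gb200 := Criteria{Service: "eks", Accelerator: "gb200", Intent: "training"}

	fmt.Println("example 1:", eks.Matches(h100Query))
	fmt.Println("example 2:", gb200.Matches(h100Query))
	fmt.Println("example 3:", gb200.Matches(genericQuery))
}
```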

Example 4: Multiple Maximal Matches (Fully Specific Query)

User Query: { service: "eks", os: "ubuntu", accelerator: "gb200", intent: "training" }

Overlay criteria matches (pre-filter):
  1. overlays/monitoring-hpa.yaml             { intent: any }                                           Specificity: 0
  2. overlays/eks.yaml                        { service: eks }                                          Specificity: 1
  3. overlays/eks-training.yaml               { service: eks, intent: training }                        Specificity: 2
  4. overlays/gb200-any.yaml                  { service: any, accelerator: gb200 }                      Specificity: 1
  5. overlays/gb200-eks-training.yaml         { service: eks, accelerator: gb200, intent: training }    Specificity: 3
  6. overlays/gb200-eks-ubuntu-training.yaml  { service: eks, accelerator: gb200, os: ubuntu, intent: training }  Specificity: 4

(base.yaml is the root spec, not an overlay candidate: FindMatchingOverlays
iterates s.Overlays only. The base spec is always applied as the seed for
the merged output; it is not selected by criteria matching.)

Maximal leaves (after filterToMaximalLeaves):
  - monitoring-hpa             (no matching descendant)
  - gb200-any                  (no matching descendant)
  - gb200-eks-ubuntu-training  (most-specific overlay; eks, eks-training,
                                gb200-eks-training are ancestors and are filtered out)

Result: Each maximal leaf's inheritance chain is resolved and merged onto
the base spec. Ancestors removed by the filter re-enter the output via
chain resolution (step 3), so the final appliedOverlays is
[base, monitoring-hpa, gb200-any, eks, eks-training,
gb200-eks-training, gb200-eks-ubuntu-training].

Note that multiple maximal leaves can coexist when their inheritance chains are independent — gb200-any (via wildcard service: any) and gb200-eks-ubuntu-training (via explicit criteria) are both kept because neither is an ancestor of the other. This is what enables the criteria-wildcard overlay pattern.
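The filtering step in Example 4 can be sketched as follows. The `Overlay` struct and `maximalLeaves` function are hypothetical stand-ins for the internal types behind filterToMaximalLeaves; specificity values are supplied directly rather than computed, and the hop limit is only a guard for this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

type Overlay struct {
	Name        string
	Base        string // spec.base; "" means the implicit root
	Specificity int
}

// maximalLeaves drops every matched overlay that is an ancestor of another
// matched overlay, then sorts the survivors by ascending specificity.
func maximalLeaves(matched []Overlay, all map[string]Overlay) []Overlay {
	ancestor := map[string]bool{}
	for _, o := range matched {
		// The hop limit guards this sketch against cyclic input.
		for p, hops := o.Base, 0; p != "" && hops <= len(all); p, hops = all[p].Base, hops+1 {
			ancestor[p] = true
		}
	}
	var leaves []Overlay
	for _, o := range matched {
		if !ancestor[o.Name] {
			leaves = append(leaves, o)
		}
	}
	sort.SliceStable(leaves, func(i, j int) bool {
		return leaves[i].Specificity < leaves[j].Specificity
	})
	return leaves
}

func main() {
	matched := []Overlay{
		{"monitoring-hpa", "", 0},
		{"eks", "", 1},
		{"eks-training", "eks", 2},
		{"gb200-any", "base", 1},
		{"gb200-eks-training", "eks-training", 3},
		{"gb200-eks-ubuntu-training", "gb200-eks-training", 4},
	}
	all := map[string]Overlay{}
	for _, o := range matched {
		all[o.Name] = o
	}
	for _, leaf := range maximalLeaves(matched, all) {
		fmt.Println(leaf.Name)
	}
}
```

Sorting by ascending specificity means the wildcard overlays merge first and the most specific leaf merges last, which is what gives the leaf the final say on same-named fields.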

Cluster Fingerprint

aicr snapshot emits a structured fingerprint: block alongside the raw measurements. The fingerprint is a normalized, schema-stable view of the cluster-identity dimensions used to bind a snapshot to a recipe — service, accelerator, OS, Kubernetes server version, region, total node count, and GPU node count — so an evidence bundle (per ADR-007) can prove the recipe was tested on hardware matching its declared criteria.

The fingerprint is derived from the same collector outputs that populate the measurements: block; it is not a separate collection pass. Dimensions whose source signal is missing surface as zero-value entries, which the verifier treats as “unknown” rather than fabricating a match.

The persisted fingerprint: block is advisory only. It is a convenience for humans reading the snapshot YAML, not a trust-bearing claim. The snapshot file is not signed at this layer — an attacker controlling it could swap the embedded fingerprint without touching the measurements that back it. Trust-bearing consumers — the evidence bundler (#754), the verifier (#753), and any downstream policy gate — MUST recompute via fingerprint.FromMeasurements(snap.Measurements) before acting on the result, and treat zero-value entries as “unknown” per the match semantics below.

Fingerprint Schema

fingerprint:
  service:
    value: eks                       # eks | gke | aks | oke | kind | lke | bcm
    source: k8s.node.provider
  accelerator:
    value: h100                      # h100 | h200 | gb200 | b200 | a100 | l40 | rtx-pro-6000
    source: gpu.smi.gpu.model
  os:
    value: ubuntu                    # ubuntu | rhel | cos | amazonlinux | talos
    version: "22.04"                 # raw VERSION_ID for audit; not in criteria
    source: os.release
  k8sVersion:
    value: "1.33.4"                  # leading "v" stripped
    source: k8s.server.version
  region:                            # value empty when multi-region or no label
    value: us-west-2
    source: nodeTopology.label.topology.kubernetes.io/region
  nodeCount:                         # all nodes including control plane
    value: 12
    source: nodeTopology.summary.node-count
  gpuNodeCount:                      # nodes carrying the GPU operator label
    value: 8
    source: nodeTopology.label.nvidia.com/gpu.product

Heterogeneous and stale-registry dimensions

When accelerator or region cannot be collapsed to a single Value, the fingerprint surfaces the reason via an optional note: field instead of fabricating one. The verifier renders this distinct from “value not captured” in its Markdown output. Three notes are emitted today:

multi-region — nodes carry different topology.kubernetes.io/region labels
multi-gpu — nodes carry different nvidia.com/gpu.product labels
unknown-sku — nvidia-smi or the GPU operator reported a product string that is not in the recipe accelerator registry (likely registry staleness; the raw model is still recoverable from the underlying measurement)

fingerprint:
  accelerator:                       # nodes disagree on GPU SKU
    value: ""
    source: nodeTopology.label.nvidia.com/gpu.product
    note: multi-gpu
  region:                            # nodes disagree on region
    value: ""
    source: nodeTopology.label.topology.kubernetes.io/region
    note: multi-region
  # Or, for an unrecognized SKU:
  # accelerator:
  #   value: ""
  #   source: gpu.smi.gpu.model
  #   note: unknown-sku

Every dimension carries a value (the resolved, normalized string the recipe criteria block can be compared against), a source string identifying which collector signal produced it, and an optional note string carrying a short audit hint when Value is empty for a reason other than missing data (the cases above). ADR-007 reserves additional optional fields (signals[], confidence) for a future multi-signal corroboration extension; V1 records source and note only.

Detection Sources

Dimension	Source	Normalization
`service`	`k8s.node.provider` (parsed from `spec.providerID`)	`aws → eks`, `gce → gke`, `azure → aks`, `oci → oke`, else passthrough
`accelerator`	`gpu.smi.gpu.model` (nvidia-smi `ProductName`)	Substring match against the recipe accelerator enum (`GB200` matched before `B200`)
`os.value`	`/etc/os-release` `ID`	Mapped to the `oskind` enum; aliases like `redhat → rhel` and `al2 → amazonlinux` are recognized
`os.version`	`/etc/os-release` `VERSION_ID`	Retained verbatim for audit
`k8sVersion`	`k8s.server.version`	Leading `v` stripped
`region`	`nodeTopology.label.topology.kubernetes.io/region`	Single-region clusters surface the value; multi-region clusters surface `note: multi-region` with empty value
`nodeCount`	`nodeTopology.summary.node-count`	All nodes, control plane included
`gpuNodeCount`	`nodeTopology.label.nvidia.com/gpu.product`	Union of nodes across one or more GPU-product label entries (the canonical GPU-operator presence signal); zero when no GPU operator labels are present
`accelerator` (cluster-wide override)	same as `gpuNodeCount`	When the topology label data shows multiple distinct GPU SKUs across nodes, accelerator surfaces `note: multi-gpu` with empty value — preferring honesty over the smi reading from a single node

A dimension whose source signal is missing keeps its zero value. The verifier reports it as unknown rather than mismatched.
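
The normalization rules in the table can be sketched in a few lines of Go. This is an illustration under stated assumptions, not the project's detection code: the function names (`normalizeService`, `normalizeAccelerator`, `normalizeK8sVersion`) and the `acceleratorOrder` list are hypothetical, and the list covers only a subset of the accelerator enum. The point it shows is the ordering concern: `gb200` has to be tested before `b200`, otherwise a GB200 product string would match the shorter name first.

```go
package main

import (
	"fmt"
	"strings"
)

// acceleratorOrder is a hypothetical subset of the recipe accelerator enum,
// listed in match order. "gb200" must precede "b200" because every GB200
// product string also contains the substring "B200".
var acceleratorOrder = []string{"gb200", "b200", "h100", "h200", "a100", "l40"}

// normalizeService maps a provider parsed from spec.providerID to a
// service value; unrecognized providers pass through unchanged.
func normalizeService(provider string) string {
	switch provider {
	case "aws":
		return "eks"
	case "gce":
		return "gke"
	case "azure":
		return "aks"
	case "oci":
		return "oke"
	default:
		return provider
	}
}

// normalizeAccelerator does a case-insensitive substring match of the
// product name against acceleratorOrder. It returns "" when nothing
// matches, which corresponds to the unknown-sku case.
func normalizeAccelerator(productName string) string {
	upper := strings.ToUpper(productName)
	for _, name := range acceleratorOrder {
		if strings.Contains(upper, strings.ToUpper(name)) {
			return name
		}
	}
	return ""
}

// normalizeK8sVersion strips a single leading "v".
func normalizeK8sVersion(v string) string {
	return strings.TrimPrefix(v, "v")
}

func main() {
	fmt.Println(normalizeService("aws"))
	fmt.Println(normalizeAccelerator("NVIDIA GB200"))
	fmt.Println(normalizeAccelerator("NVIDIA B200"))
	fmt.Println(normalizeK8sVersion("v1.33.4"))
}
```

An empty return from `normalizeAccelerator` leaves the dimension at its zero value, so the verifier would report it as unknown rather than mismatched.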

Match Semantics

fingerprint.Fingerprint.Match compares a fingerprint against a recipe’s criteria and returns a per-dimension diff plus an overall matched flag. Each criteria dimension resolves to one of three outcomes:

matched — the recipe is generic (any / empty) for this dimension, OR the fingerprint captured the same value the recipe requires.
mismatched — the recipe requires a specific value and the fingerprint captured a different specific value.
unknown — the recipe requires a specific value but the fingerprint cannot prove or disprove it. Two cases produce unknown: a dimension the cluster does not reveal (intent, platform — recipe-author choices) and a dimension the fingerprint failed to detect (e.g., no GPU collector output).

The overall matched flag is true when no dimension is mismatched. Unknowns surface in the per-dimension diff for human review without flipping the overall outcome — the fingerprint cannot disprove a match it does not capture.
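
The three outcomes and the overall flag can be expressed compactly. The sketch below is a simplified model of the rules above, not the `fingerprint.Fingerprint.Match` implementation; `MatchState`, `matchDimension`, and `overallMatched` are hypothetical names, and dimensions are reduced to plain strings where an empty fingerprint value means "not captured".

```go
package main

import "fmt"

// MatchState is the per-dimension outcome described in the text.
type MatchState string

const (
	Matched    MatchState = "matched"
	Mismatched MatchState = "mismatched"
	Unknown    MatchState = "unknown"
)

// matchDimension compares what the recipe requires with what the
// fingerprint provides for a single dimension.
func matchDimension(required, provided string) MatchState {
	switch {
	case required == "" || required == "any":
		return Matched // recipe is generic for this dimension
	case provided == "":
		return Unknown // fingerprint can neither prove nor disprove
	case required == provided:
		return Matched
	default:
		return Mismatched
	}
}

// overallMatched is true when no dimension is mismatched; unknowns do
// not flip the result.
func overallMatched(states []MatchState) bool {
	for _, s := range states {
		if s == Mismatched {
			return false
		}
	}
	return true
}

func main() {
	states := []MatchState{
		matchDimension("eks", "eks"),    // service
		matchDimension("h100", "h100"),  // accelerator
		matchDimension("training", ""),  // intent: not revealed by the cluster
		matchDimension("kubeflow", ""),  // platform: not revealed by the cluster
	}
	fmt.Println(states, overallMatched(states))
}
```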

Worked Example

Matching the recipe criteria service=eks, accelerator=h100, intent=training, os=ubuntu, platform=kubeflow against the fingerprint shown under Fingerprint Schema yields:

matched: true
perDimension:
- {dimension: service,     recipeRequires: eks,      fingerprintProvides: eks,    match: matched}
- {dimension: accelerator, recipeRequires: h100,     fingerprintProvides: h100,   match: matched}
- {dimension: os,          recipeRequires: ubuntu,   fingerprintProvides: ubuntu, match: matched}
- {dimension: intent,      recipeRequires: training,                              match: unknown}
- {dimension: platform,    recipeRequires: kubeflow,                              match: unknown}
- {dimension: nodes,                                 fingerprintProvides: 12,     match: matched}

perDimension is an ordered list so iteration is deterministic and serialization is byte-stable; consumers needing lookup by name use MatchResult.Find rather than indexing.

The bundle’s predicate body (per ADR-007 PR-A / #754) records this diff as criteriaMatch.perDimension; the verifier (#753) renders it in a Markdown summary so the maintainer sees exactly which dimensions the fingerprint corroborated.

The predicate body preserves the three-way match: state verbatim (matched | mismatched | unknown) rather than collapsing to a bool. The ADR-007 example shows match: true for the happy-path case where every dimension is matched, but the schema must keep unknown distinguishable from matched — a maintainer reviewing a bundle needs to see “intent and platform were not corroborated by the fingerprint” rather than “everything matched.” A CI gate keyed on criteriaMatch.matched: true alone gives unknown dimensions a free pass; gates that need stronger guarantees should also assert that no per-dimension entry has match: unknown.
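
A stricter gate of the kind described above could look like the following sketch. The types are hypothetical stand-ins for the predicate body's `criteriaMatch` block (only the `matched` flag and the per-dimension `match` state are modeled), and `strictGate` is an illustrative name rather than an existing function.

```go
package main

import "fmt"

// DimensionMatch is a simplified, hypothetical view of one
// criteriaMatch.perDimension entry.
type DimensionMatch struct {
	Dimension string
	Match     string // "matched" | "mismatched" | "unknown"
}

// CriteriaMatch is a simplified, hypothetical view of criteriaMatch.
type CriteriaMatch struct {
	Matched      bool
	PerDimension []DimensionMatch
}

// strictGate passes only when the overall flag is true AND no dimension
// is unknown. It also returns the unknown dimensions for reporting.
func strictGate(cm CriteriaMatch) (bool, []string) {
	var unknown []string
	for _, d := range cm.PerDimension {
		if d.Match == "unknown" {
			unknown = append(unknown, d.Dimension)
		}
	}
	return cm.Matched && len(unknown) == 0, unknown
}

func main() {
	cm := CriteriaMatch{
		Matched: true,
		PerDimension: []DimensionMatch{
			{Dimension: "service", Match: "matched"},
			{Dimension: "intent", Match: "unknown"},
			{Dimension: "platform", Match: "unknown"},
		},
	}
	ok, unknown := strictGate(cm)
	fmt.Println(ok, unknown)
}
```

For the worked example this gate fails even though `matched` is true, because intent and platform were never corroborated.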

Recipe Generation Process

The recipe builder (pkg/recipe/metadata_store.go) generates recipes through the following steps:

Step 1: Load Metadata Store

store, err := loadMetadataStore(ctx)

Embedded YAML files are parsed into Go structs
Cached in memory on first access (singleton pattern with sync.Once)
Contains base recipe, all overlays, mixins, and component values files

Step 2: Find Matching Overlays

overlays := store.FindMatchingOverlays(criteria)

Iterate all overlays in s.Overlays (the base recipe is held separately in s.Base and is not a candidate here — it is injected as the merge seed by initBaseMergedSpec() in Step 4)
Check if each overlay’s criteria matches the user query
Filter to maximal leaves via filterToMaximalLeaves(): drop any match that is an ancestor (via spec.base) of another match. Ancestors are re-added later via chain resolution; this filter ensures that a matched descendant isn’t double-counted with its own chain
Sort maximal-leaf matches by specificity (least specific first)

Multiple maximal leaves can be returned for one query when they sit on independent inheritance chains — for example, a service: any wildcard overlay and the most-specific service-specific leaf are both kept (see Criteria-Wildcard Overlays).
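
The filter-then-sort behavior can be sketched as follows. This is a minimal model, not the real `filterToMaximalLeaves()`: the `overlay` struct, `isAncestor`, and `maximalLeaves` are hypothetical, overlays are keyed by name, and the specificity values assigned to `monitoring-hpa`, `eks`, and `eks-training` in `main` are assumed for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// overlay is a hypothetical, stripped-down overlay record.
type overlay struct {
	Base        string // spec.base; "" when the overlay has no parent overlay
	Specificity int
}

// isAncestor reports whether candidate appears on desc's spec.base chain.
func isAncestor(all map[string]overlay, candidate, desc string) bool {
	seen := map[string]bool{}
	for cur := all[desc].Base; cur != "" && !seen[cur]; cur = all[cur].Base {
		if cur == candidate {
			return true
		}
		seen[cur] = true
	}
	return false
}

// maximalLeaves drops every match that is an ancestor of another match,
// then sorts the survivors by specificity, least specific first.
func maximalLeaves(all map[string]overlay, matches []string) []string {
	var leaves []string
	for _, m := range matches {
		ancestor := false
		for _, other := range matches {
			if other != m && isAncestor(all, m, other) {
				ancestor = true
				break
			}
		}
		if !ancestor {
			leaves = append(leaves, m)
		}
	}
	sort.SliceStable(leaves, func(i, j int) bool {
		return all[leaves[i]].Specificity < all[leaves[j]].Specificity
	})
	return leaves
}

func main() {
	all := map[string]overlay{
		"monitoring-hpa":            {Specificity: 1}, // assumed value
		"eks":                       {Specificity: 1}, // assumed value
		"eks-training":              {Base: "eks", Specificity: 2}, // assumed value
		"gb200-any":                 {Specificity: 1},
		"gb200-eks-training":        {Base: "eks-training", Specificity: 3},
		"gb200-eks-ubuntu-training": {Base: "gb200-eks-training", Specificity: 4},
	}
	matches := []string{"monitoring-hpa", "eks", "eks-training", "gb200-any",
		"gb200-eks-training", "gb200-eks-ubuntu-training"}
	fmt.Println(maximalLeaves(all, matches))
}
```

The wildcard overlay and the most-specific leaf both survive because neither sits on the other's `spec.base` chain.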

Step 3: Resolve Inheritance Chains

For each maximal-leaf match from step 2:

chain, err := store.resolveInheritanceChain(overlay.Metadata.Name)

Build the chain from root (base) to the target overlay by walking spec.base
Detect cycles to prevent infinite loops
Example: ["base", "eks", "eks-training", "gb200-eks-ubuntu-training"]

Ancestors filtered out in step 2 re-enter the output here as part of their descendant’s chain.
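
Chain resolution with cycle detection can be sketched as a walk up `spec.base` followed by a reversal. The code below is an assumption-laden illustration, not the body of `resolveInheritanceChain`: `recipeMeta` and `resolveChain` are hypothetical names, and the model assumes every overlay names its parent explicitly, with `base` itself having no parent.

```go
package main

import "fmt"

// recipeMeta is a hypothetical record holding only what the walk needs.
type recipeMeta struct {
	Base string // spec.base; "" for the root
}

// resolveChain walks spec.base from target up to the root, failing on an
// unknown name or a cycle, and returns the chain root-first.
func resolveChain(byName map[string]recipeMeta, target string) ([]string, error) {
	var chain []string
	visited := map[string]bool{}
	for cur := target; cur != ""; {
		if visited[cur] {
			return nil, fmt.Errorf("circular inheritance at %q", cur)
		}
		r, ok := byName[cur]
		if !ok {
			return nil, fmt.Errorf("unknown recipe %q", cur)
		}
		visited[cur] = true
		chain = append(chain, cur)
		cur = r.Base
	}
	// The walk collected leaf-first; reverse to get root-first order.
	for i, j := 0, len(chain)-1; i < j; i, j = i+1, j-1 {
		chain[i], chain[j] = chain[j], chain[i]
	}
	return chain, nil
}

func main() {
	byName := map[string]recipeMeta{
		"base":                      {},
		"eks":                       {Base: "base"},
		"eks-training":              {Base: "eks"},
		"gb200-eks-ubuntu-training": {Base: "eks-training"},
	}
	fmt.Println(resolveChain(byName, "gb200-eks-ubuntu-training"))
}
```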

Step 4: Merge Specifications

The merge starts from a seed containing the base spec, then applies each resolved chain on top:

mergedSpec, appliedOverlays := s.initBaseMergedSpec()  // seed with s.Base
// ... then for each chain from Step 3:
for _, recipe := range chain {
    mergedSpec.Merge(&recipe.Spec)
}

This is why base always appears first in appliedOverlays even though it is not returned by FindMatchingOverlays.

Merge Algorithm

Constraints: Same-named constraints are overridden; new constraints are added
ComponentRefs: Same-named components are merged field-by-field using mergeComponentRef()

func mergeComponentRef(base, overlay ComponentRef) ComponentRef {
    result := base
    if overlay.Type != "" { result.Type = overlay.Type }
    if overlay.Source != "" { result.Source = overlay.Source }
    if overlay.Version != "" { result.Version = overlay.Version }
    if overlay.ValuesFile != "" { result.ValuesFile = overlay.ValuesFile }
    // Merge overrides maps
    if overlay.Overrides != nil {
        result.Overrides = deepMerge(base.Overrides, overlay.Overrides)
    }
    // Merge dependency refs
    if len(overlay.DependencyRefs) > 0 {
        result.DependencyRefs = mergeDependencyRefs(base.DependencyRefs, overlay.DependencyRefs)
    }
    return result
}
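
The `deepMerge` helper is referenced above but not shown. One plausible shape for it, assuming overrides are `map[string]any` and that nested maps merge key by key while every other value is replaced by the overlay, is sketched below. The actual helper may differ, for example in how it treats slices or nil values.

```go
package main

import "fmt"

// deepMerge returns a new map holding base's entries with overlay's
// entries applied on top. When both sides hold a nested map under the
// same key the maps are merged recursively; otherwise overlay wins.
// Nested maps that only exist in base are shared, not copied.
func deepMerge(base, overlay map[string]any) map[string]any {
	out := make(map[string]any, len(base)+len(overlay))
	for k, v := range base {
		out[k] = v
	}
	for k, ov := range overlay {
		baseMap, baseIsMap := out[k].(map[string]any)
		overlayMap, overlayIsMap := ov.(map[string]any)
		if baseIsMap && overlayIsMap {
			out[k] = deepMerge(baseMap, overlayMap)
		} else {
			out[k] = ov
		}
	}
	return out
}

func main() {
	base := map[string]any{
		"driver": map[string]any{"enabled": true, "version": "570.0.0"}, // illustrative values
	}
	overlay := map[string]any{
		"driver": map[string]any{"version": "580.82.07"},
		"cdi":    map[string]any{"enabled": true},
	}
	fmt.Println(deepMerge(base, overlay))
}
```

Here the overlay changes `driver.version` and adds `cdi`, while `driver.enabled` from the base survives.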

Step 5: Apply Mixins

mixinConstraintNames, err := store.mergeMixins(mergedSpec)

If the leaf overlay declares spec.mixins, each named mixin is loaded from recipes/mixins/
Mixin constraints and componentRefs are appended to the merged spec
Conflict detection prevents duplicates between the inheritance chain, previously applied mixins, and the current mixin
When a snapshot evaluator is provided, mixin constraints are evaluated against it after merging; failure invalidates the entire composed candidate. In plain query mode (no snapshot), mixin constraints are merged but not evaluated

Step 6: Validate Dependencies

if err := mergedSpec.ValidateDependencies(); err != nil {
    return nil, err
}

Verify all dependencyRefs reference existing components
Detect circular dependencies

Step 7: Compute Deployment Order

deployOrder, err := mergedSpec.TopologicalSort()

Topologically sort components based on dependencyRefs
Ensures dependencies are deployed before dependents
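
Steps 6 and 7 together amount to a dependency check plus a topological sort. The sketch below uses Kahn's algorithm over a plain `map[string][]string` of component name to `dependencyRefs`. It is not the project's `TopologicalSort`: the function name `topoSort` is hypothetical, and the alphabetical tie-break among independent components is an assumption made only to keep the output deterministic, so the order may differ from what the real builder emits.

```go
package main

import (
	"fmt"
	"sort"
)

// topoSort orders components so every dependency precedes its dependents.
// It fails when a dependencyRef names a missing component or when the
// graph contains a cycle.
func topoSort(deps map[string][]string) ([]string, error) {
	indegree := make(map[string]int, len(deps))
	dependents := make(map[string][]string)
	for name := range deps {
		indegree[name] = 0
	}
	for name, refs := range deps {
		for _, ref := range refs {
			if _, ok := deps[ref]; !ok {
				return nil, fmt.Errorf("%s references unknown component %s", name, ref)
			}
			indegree[name]++
			dependents[ref] = append(dependents[ref], name)
		}
	}

	var ready []string
	for name, n := range indegree {
		if n == 0 {
			ready = append(ready, name)
		}
	}

	order := make([]string, 0, len(deps))
	for len(ready) > 0 {
		sort.Strings(ready) // assumed tie-break: alphabetical
		next := ready[0]
		ready = ready[1:]
		order = append(order, next)
		for _, d := range dependents[next] {
			indegree[d]--
			if indegree[d] == 0 {
				ready = append(ready, d)
			}
		}
	}
	if len(order) != len(deps) {
		return nil, fmt.Errorf("circular dependency detected")
	}
	return order, nil
}

func main() {
	deps := map[string][]string{
		"cert-manager":        nil,
		"gpu-operator":        {"cert-manager"},
		"nvsentinel":          {"cert-manager"},
		"nodewright-operator": nil,
	}
	fmt.Println(topoSort(deps))
}
```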

Step 8: Build RecipeResult

return &RecipeResult{
    Kind:            "RecipeResult",
    APIVersion:      "aicr.nvidia.com/v1alpha1",
    Metadata:        metadata,
    Criteria:        criteria,
    Constraints:     mergedSpec.Constraints,
    ComponentRefs:   mergedSpec.ComponentRefs,
    DeploymentOrder: deployOrder,
}, nil

Complete Flow Diagram

Usage Examples

CLI Usage

Basic recipe generation (query mode):

$ aicr recipe --os ubuntu --service eks --accelerator h100 --intent training

Full specification:

$ aicr recipe \
>   --os ubuntu \
>   --service eks \
>   --accelerator gb200 \
>   --intent training \
>   --nodes 8 \
>   --format yaml \
>   --output recipe.yaml

From snapshot (snapshot mode):

$ aicr snapshot --output snapshot.yaml
$ aicr recipe --snapshot snapshot.yaml --intent training --output recipe.yaml

API Usage

Basic query:

$ curl "http://localhost:8080/v1/recipe?os=ubuntu&service=eks&accelerator=h100"

Full specification:

$ curl "http://localhost:8080/v1/recipe?os=ubuntu&service=eks&accelerator=gb200&intent=training&nodes=8"

Example Response (RecipeResult)

{
  "kind": "RecipeResult",
  "apiVersion": "aicr.nvidia.com/v1alpha1",
  "metadata": {
    "version": "v0.8.0",
    "appliedOverlays": [
      "base",
      "eks",
      "eks-training",
      "gb200-eks-ubuntu-training"
    ]
  },
  "criteria": {
    "service": "eks",
    "accelerator": "gb200",
    "os": "ubuntu",
    "intent": "training"
  },
  "constraints": [
    {
      "name": "K8s.server.version",
      "value": ">= 1.32.4"
    },
    {
      "name": "OS.release.ID",
      "value": "ubuntu"
    },
    {
      "name": "OS.release.VERSION_ID",
      "value": "24.04"
    }
  ],
  "componentRefs": [
    {
      "name": "cert-manager",
      "type": "Helm",
      "source": "https://charts.jetstack.io",
      "version": "v1.20.2",
      "valuesFile": "components/cert-manager/values.yaml"
    },
    {
      "name": "gpu-operator",
      "type": "Helm",
      "source": "https://helm.ngc.nvidia.com/nvidia",
      "version": "v25.3.3",
      "valuesFile": "components/gpu-operator/values-eks-training.yaml",
      "overrides": {
        "driver": {
          "version": "580.82.07"
        },
        "cdi": {
          "enabled": true
        }
      },
      "dependencyRefs": ["cert-manager"]
    },
    {
      "name": "nvsentinel",
      "type": "Helm",
      "source": "oci://ghcr.io/nvidia/nvsentinel",
      "version": "v0.6.0",
      "valuesFile": "components/nvsentinel/values.yaml",
      "dependencyRefs": ["cert-manager"]
    },
    {
      "name": "nodewright-operator",
      "type": "Helm",
      "source": "oci://ghcr.io/nvidia/skyhook",
      "version": "v0.15.0",
      "valuesFile": "components/nodewright-operator/values.yaml",
      "overrides": {
        "customization": "ubuntu"
      }
    }
  ],
  "deploymentOrder": [
    "cert-manager",
    "gpu-operator",
    "nvsentinel",
    "nodewright-operator"
  ]
}

Maintenance Guide

Adding a New Recipe

Create the recipe file in recipes/overlays/:

kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: l40-gke-ubuntu-inference  # Unique name

spec:
  base: gke-inference  # Inherit from appropriate parent

  criteria:
    service: gke
    accelerator: l40
    os: ubuntu
    intent: inference

  constraints:
    - name: K8s.server.version
      value: ">= 1.29"

  componentRefs:
    - name: gpu-operator
      version: v25.3.3
      overrides:
        driver:
          version: 560.35.03

Create intermediate recipes if needed (e.g., gke.yaml, gke-inference.yaml)

Add component values files if using new configurations:

# components/gpu-operator/values-gke-inference.yaml
driver:
  enabled: true
  version: 560.35.03

Run tests to validate:

$ go test -v ./pkg/recipe/... -run TestAllMetadataFilesParseCorrectly

Modifying Existing Recipes

Update constraints - Change version requirements:

constraints:
  - name: K8s.server.version
    value: ">= 1.33"  # Updated from 1.32

Update component versions - Bump chart versions:

componentRefs:
  - name: gpu-operator
    version: v25.4.0  # Updated from v25.3.3

Add inline overrides - Recipe-specific tweaks:

componentRefs:
  - name: gpu-operator
    overrides:
      newFeature:
        enabled: true

Updating Component Values

Modify the values file in recipes/components/{component}/values.yaml
Create variant values file for specific environments:
- values.yaml - Base configuration
- values-eks-training.yaml - EKS training optimization
- values-gke-inference.yaml - GKE inference optimization

Reference in recipe:

componentRefs:
  - name: gpu-operator
    valuesFile: components/gpu-operator/values-gke-inference.yaml

Automated Validation

The recipe data system includes comprehensive automated tests to ensure data integrity. These tests run automatically as part of make test and validate all recipe metadata files and component values.

Test Suite Overview

The test suite is organized in pkg/recipe/:

File	Responsibility
`yaml_test.go`	Static YAML file validation (parsing, references, enums, inheritance)
`metadata_test.go`	Runtime behavior tests (Merge, TopologicalSort, inheritance resolution)
`recipe_test.go`	Recipe struct validation (Validate, ValidateStructure)

Test Categories

Test Category	What It Validates
Schema Conformance	All YAML files parse correctly with expected structure
Criteria Validation	Valid enum values for service, accelerator, intent, OS
Reference Validation	valuesFile paths exist, dependencyRefs resolve, component names valid
Constraint Syntax	Measurement paths use valid types, operators are valid
Dependency Cycles	No circular dependencies in componentRefs
Inheritance Chains	Base references valid, no circular inheritance, reasonable depth
Values Files	Component values files parse as valid YAML

Inheritance-Specific Tests

Test	What It Validates
`TestAllBaseReferencesPointToExistingRecipes`	All `spec.base` references resolve to existing recipes
`TestNoCircularBaseReferences`	No circular inheritance chains (a→b→c→a)
`TestInheritanceChainDepthReasonable`	Inheritance depth ≤ 10 levels

Running Tests

# Run all recipe data tests
$ make test

# Run only recipe package tests
$ go test -v ./pkg/recipe/... -count=1

# Run specific test patterns
$ go test -v ./pkg/recipe/... -run TestAllMetadataFilesParseCorrectly
$ go test -v ./pkg/recipe/... -run TestAllBaseReferencesPointToExistingRecipes
$ go test -v ./pkg/recipe/... -run TestAllOverlayCriteriaUseValidEnums

CI/CD Integration

Tests run automatically on:

Pull Requests: All tests must pass before merge
Push to main: Validates no regressions
Release builds: Ensures data integrity in released binaries

# GitHub Actions workflow snippet
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - uses: ./.github/actions/go-ci
        with:
          go_version: '1.26'
          golangci_lint_version: 'v2.11.3'

Adding New Tests

When adding new recipe metadata or component configurations:

Create the new file in recipes/

Run tests to verify the file is valid:

$ go test -v ./pkg/recipe/... -run TestAllMetadataFilesParseCorrectly

Check for conflicts with existing recipes:

$ go test -v ./pkg/recipe/... -run TestNoDuplicateCriteria

Verify references if using valuesFile or dependencyRefs:

$ go test -v ./pkg/recipe/... -run TestValuesFileReferencesExist
$ go test -v ./pkg/recipe/... -run TestDependencyRefsResolve

External Data Provider

The recipe system supports extending or overriding embedded data with external files via the --data CLI flag. This enables customization without rebuilding the CLI binary.

Architecture Overview

Data Provider Interface

The system uses a DataProvider interface to abstract file access:

type DataProvider interface {
    // ReadFile reads a file by path (relative to data directory).
    // Returns ctx's error if it is canceled before the read completes.
    ReadFile(ctx context.Context, path string) ([]byte, error)

    // WalkDir walks the directory tree rooted at root.
    // Returns ctx's error if it is canceled mid-walk.
    WalkDir(ctx context.Context, root string, fn fs.WalkDirFunc) error

    // Source returns where data came from (for debugging)
    Source(path string) string
}

Cancellation is honored at I/O boundaries — before each file open and between WalkDir entries — not mid-syscall on an in-flight read. LayeredDataProvider reads external files via os.Open + io.ReadAll, which are not cancelable once started; a hung NFS / sshfs mount blocked mid-readv will see cancellation honored on the next file the walk touches, not the one currently blocked.

Provider Types:

EmbeddedDataProvider: Wraps Go’s embed.FS for compile-time embedded data
LayeredDataProvider: Overlays external directory on top of embedded data

Merge Behavior

File Type	Behavior	Example
`registry.yaml`	Merged by component name	External adds/replaces components
`overlays/base.yaml`	Replaced if exists externally	External completely overrides embedded
`overlays/*.yaml`	Replaced if same path	External overlay replaces embedded
`components/*/values.yaml`	Replaced if same path	External values override embedded

Registry Merge Algorithm

When merging registry.yaml, components are matched by their name field:

func mergeRegistries(embedded, external *ComponentRegistry) *ComponentRegistry {
    result := &ComponentRegistry{}
    addedNames := make(map[string]bool)

    // 1. Index external components by name
    externalByName := make(map[string]*ComponentConfig)
    for i := range external.Components {
        comp := &external.Components[i]
        externalByName[comp.Name] = comp
    }

    // 2. Add embedded components, replacing with external if present
    for _, comp := range embedded.Components {
        if ext, found := externalByName[comp.Name]; found {
            result.Components = append(result.Components, *ext)  // External wins
        } else {
            result.Components = append(result.Components, comp)  // Keep embedded
        }
        addedNames[comp.Name] = true  // Remember names already emitted
    }

    // 3. Add new components from external (not in embedded)
    for _, comp := range external.Components {
        if !addedNames[comp.Name] {
            result.Components = append(result.Components, comp)
        }
    }

    return result
}

Merge Order:

Start with all embedded components
Replace any that have same name in external
Add any new components from external

Security Validations

The LayeredDataProvider enforces security constraints:

Validation	Behavior
Directory exists	External directory must exist and be a directory
registry.yaml required	External directory must contain `registry.yaml`
No path traversal	Paths containing `..` are rejected
No symlinks	Symlinks are rejected by default (`AllowSymlinks: false`)
File size limit	Files exceeding 10MB are rejected (configurable)
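
The per-file checks in the table could be combined along the lines of the sketch below. It is an approximation written for illustration, not the `LayeredDataProvider` code: `validateExternalPath` and `defaultMaxFileSize` are hypothetical names, the traversal check is applied per path segment, and the symlink check inspects only the final path element via `os.Lstat`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// defaultMaxFileSize mirrors the documented 10MB default.
const defaultMaxFileSize int64 = 10 << 20

// validateExternalPath applies the traversal, symlink, and size checks
// to one file under the external data directory.
func validateExternalPath(root, rel string, maxSize int64, allowSymlinks bool) error {
	for _, part := range strings.Split(filepath.ToSlash(rel), "/") {
		if part == ".." {
			return fmt.Errorf("path traversal rejected: %q", rel)
		}
	}
	// Lstat does not follow symlinks, so the link itself is inspected.
	info, err := os.Lstat(filepath.Join(root, rel))
	if err != nil {
		return err
	}
	if info.Mode()&os.ModeSymlink != 0 && !allowSymlinks {
		return fmt.Errorf("symlink rejected: %q", rel)
	}
	if info.Mode().IsRegular() && info.Size() > maxSize {
		return fmt.Errorf("file %q is %d bytes, limit is %d", rel, info.Size(), maxSize)
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "aicr-data")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	content := []byte("components: []\n")
	if err := os.WriteFile(filepath.Join(dir, "registry.yaml"), content, 0o600); err != nil {
		panic(err)
	}

	fmt.Println(validateExternalPath(dir, "registry.yaml", defaultMaxFileSize, false))
	fmt.Println(validateExternalPath(dir, "../etc/passwd", defaultMaxFileSize, false))
	fmt.Println(validateExternalPath(dir, "registry.yaml", 4, false)) // tiny limit to trigger the size check
}
```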

Configuration Options

type LayeredProviderConfig struct {
    // ExternalDir is the path to the external data directory
    ExternalDir string

    // MaxFileSize is the maximum allowed file size in bytes (default: 10MB)
    MaxFileSize int64

    // AllowSymlinks allows symlinks in the external directory (default: false)
    AllowSymlinks bool
}

Usage Example

Creating an external data directory:

my-data/
├── registry.yaml              # Required - merged with embedded
├── overlays/
│   └── my-custom-overlay.yaml # Adds new overlay
└── components/
    └── gpu-operator/
        └── values.yaml        # Replaces embedded gpu-operator values

External registry.yaml (adds custom Helm component):

apiVersion: aicr.nvidia.com/v1alpha1
kind: ComponentRegistry
components:
  - name: my-custom-operator
    displayName: My Custom Operator
    helm:
      defaultRepository: https://my-charts.example.com
      defaultChart: my-custom-operator
      defaultVersion: v1.0.0

External registry.yaml (adds custom Kustomize component):

apiVersion: aicr.nvidia.com/v1alpha1
kind: ComponentRegistry
components:
  - name: my-kustomize-app
    displayName: My Kustomize App
    valueOverrideKeys:
      - mykustomize
    kustomize:
      defaultSource: https://github.com/example/my-app
      defaultPath: deploy/production
      defaultTag: v1.0.0

Note: A component must have either helm OR kustomize configuration, not both.

CLI usage:

# Generate recipe with external data
$ aicr recipe --service eks --accelerator h100 --data ./my-data

# Bundle with external data
$ aicr bundle --recipe recipe.yaml --data ./my-data --output ./bundles

Debugging

Use the --debug flag to see detailed logging about external data loading:

$ aicr --debug recipe --service eks --data ./my-data

Debug logs include:

Files discovered in external directory
Source resolution for each file (embedded vs external vs merged)
Component merge details (added, overridden, retained)

Builder-bound providers

WithDataProvider is the canonical way to attach a DataProvider to a recipe build. The returned Builder resolves its metadata store, component registry, and per-component values files through the bound provider — never through the process-global one. Each call site that builds recipes should construct its own provider and pass it through WithDataProvider:

embedded := recipe.NewEmbeddedDataProvider(recipe.GetEmbeddedFS(), "")
builder := recipe.NewBuilder(
    recipe.WithVersion(version),
    recipe.WithDataProvider(embedded),
)
result, err := builder.BuildFromCriteria(ctx, criteria)

The resulting *RecipeResult carries the same provider so downstream consumers (GetValuesForComponent, GetManifestContentWithProvider) resolve files against the build’s provider rather than whatever global is currently installed. Recover it via result.DataProvider() — nil-safe on the receiver and returns nil when the result was built against the package-global fallback.

Per-Builder isolation

Builders with distinct providers do not share cache state. LoadMetadataStoreFor and GetComponentRegistryFor key their caches by DataProvider identity, so a process can host multiple tenants concurrently without one tenant’s --data overlay leaking into another’s catalog. Use this pattern in multi-tenant servers, test harnesses that need clean state per case, or anywhere two Builder instances may evaluate different inputs at the same time:

// Each tenant gets its own provider: no cache cross-pollution
embedded := recipe.NewEmbeddedDataProvider(recipe.GetEmbeddedFS(), "")
tenantA := recipe.NewBuilder(recipe.WithDataProvider(embedded))
tenantB := recipe.NewBuilder(recipe.WithDataProvider(otherProvider))

resA, _ := tenantA.BuildFromCriteria(ctx, criteriaA)
resB, _ := tenantB.BuildFromCriteria(ctx, criteriaB)
// resA.DataProvider() != resB.DataProvider()
// resA.GetValuesForComponent("gpu-operator") reads from embedded
// resB.GetValuesForComponent("gpu-operator") reads from otherProvider

To force a rebuild on the next read (e.g., after rewriting an external overlay on disk), drop the entries for that provider:

recipe.EvictCachedStore(provider)
recipe.EvictCachedRegistry(provider)

Evict* is a no-op on a nil receiver, and concurrent builders against other providers are unaffected — eviction is scoped to the supplied provider only.

CLI initialization

Each CLI command owns a per-command aicr.Client whose own DataProvider backs recipe resolution and the per-provider criteria registry — no process-global provider is installed:

// pkg/cli/root.go
func recipeClientFromCmd(cmd *cli.Command, cfg *config.AICRConfig) (*aicr.Client, error) {
    dataDir := cmd.String("data")
    if dataDir == "" {
        dataDir = cfg.Recipe().DataDir()
    }
    source := aicr.EmbeddedSource()
    if dataDir != "" {
        source = aicr.FilesystemSource(dataDir) // external dir layered over embedded
    }
    return aicr.NewClient(
        aicr.WithRecipeSource(source),
        aicr.WithVersion(version),
    )
    // Caller MUST defer client.Close().
}

The Client binds its provider through recipe.WithDataProvider internally, so concurrent commands (and multi-tenant servers) stay isolated. Library callers construct their own aicr.Client, or call recipe.NewBuilder(recipe.WithDataProvider(dp)) directly.

The former package-global accessors (SetDataProvider, GetDataProvider, GetDataProviderGeneration) have been removed — bind a provider via WithDataProvider and recover it with (*RecipeResult).DataProvider(). Provider-bound helpers that need a default fall back to a single shared embedded provider internally; callers no longer pass the global.

Criteria Registry

The criteria registry is a per-DataProvider cache of valid criteria values (service, accelerator, intent, os, platform) populated from loaded overlays. It is the mechanism by which a --data overlay can introduce a new criteria value (e.g., service: ncp-internal) and have it admitted by (*CriteriaRegistry).ParseService without a code change.

Each aicr.Client (and so each Builder constructed with WithDataProvider) owns its own registry, scoped by DataProvider identity. Concurrent clients with different --data directories do not share registry state; closing the client evicts its entry.

Why a registry

Before the registry existed, each criteria parser had a hardcoded switch of valid string values; an unknown value returned ErrCodeInvalidRequest before the overlay catalog was even consulted. That made it impossible to add a new criteria value via --data — internal/proprietary values required either a fork or an upstream contribution, neither of which scales for undisclosed NCPs or proprietary product overlays.

The registry decouples which values are valid from what the OSS binary knows about. The fast-path switch arms remain for canonical aliases (self-managed → any, al2 → amazonlinux), and any value not matched there falls through to the registry, which is seeded by the overlay loader on catalog load.

Origin tracking

Each registered value carries a CriteriaOrigin:

OriginEmbedded — declared in an overlay loaded from the binary’s embedded data filesystem (the OSS catalog).
OriginExternal — declared in an overlay loaded from --data.

When the same value is registered from both sources, embedded wins — Register never downgrades an embedded value to external, so strict mode lookups remain stable across reloads.

Strict mode

Strict mode hides external-origin entries from registry lookups, restoring the pre-registry behavior in which only OSS canonical values validate. Three sources can enable it (logical OR):

--criteria-strict CLI flag (added on aicr recipe).
spec.recipe.criteriaStrict: true in --config.
AICR_CRITERIA_STRICT=1 env var (read at registry construction — NewCriteriaRegistry / GetCriteriaRegistryFor).

Strict mode is intended for OSS CI gates — make qualify exports AICR_CRITERIA_STRICT=1 for the unit-test step so the upstream catalog cannot accidentally start depending on internal-only values that only exist in someone’s --data directory.
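
The interaction between origin tracking and strict mode can be modeled in a few lines. The sketch below is a toy registry written for illustration: it borrows the method names `Register`, `Has`, and `SetStrict` from the API surface, but the `registry` type, its fields, and the `Origin` constants are hypothetical, and it omits locking, the fast-path aliases, and the environment variable.

```go
package main

import "fmt"

// Origin records where a criteria value was declared.
type Origin int

const (
	OriginExternal Origin = iota // declared in an overlay loaded from --data
	OriginEmbedded               // declared in the embedded OSS catalog
)

// registry is a toy stand-in for the criteria registry.
type registry struct {
	strict bool
	values map[string]map[string]Origin // field -> value -> origin
}

func newRegistry() *registry {
	return &registry{values: map[string]map[string]Origin{}}
}

// Register records a value. An embedded entry is never downgraded to
// external, so registration order does not affect strict-mode lookups.
func (r *registry) Register(field, value string, origin Origin) {
	if r.values[field] == nil {
		r.values[field] = map[string]Origin{}
	}
	if cur, ok := r.values[field][value]; ok && cur == OriginEmbedded {
		return
	}
	r.values[field][value] = origin
}

// SetStrict toggles strict mode.
func (r *registry) SetStrict(strict bool) { r.strict = strict }

// Has reports whether the value is known; strict mode hides
// external-origin entries.
func (r *registry) Has(field, value string) bool {
	origin, ok := r.values[field][value]
	if !ok {
		return false
	}
	return !r.strict || origin == OriginEmbedded
}

func main() {
	r := newRegistry()
	r.Register("service", "eks", OriginEmbedded)
	r.Register("service", "eks", OriginExternal) // ignored: embedded wins
	r.Register("service", "ncp-internal", OriginExternal)

	fmt.Println(r.Has("service", "eks"), r.Has("service", "ncp-internal"))
	r.SetStrict(true)
	fmt.Println(r.Has("service", "eks"), r.Has("service", "ncp-internal"))
}
```

With strict mode on, `eks` stays visible because its embedded origin was preserved, while the `--data`-only value `ncp-internal` is hidden.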

Seeding the registry

The metadata-store loader (pkg/recipe/metadata_store.go) walks every overlay during catalog load and stages each overlay’s criteria for registration. The provider’s Source(path) returns "embedded" / "external" / "merged"; the seed helper maps "embedded" to OriginEmbedded and every non-embedded source (including "merged" and any unknown future category) to OriginExternal. The registration is deferred until after all overlays parse cleanly, the base recipe is present, and dependency validation passes — partial catalog loads never leak into the registry.

Eager load via `LoadCatalog`

The metadata store loads lazily on first read. Because criteria validation runs before the recipe build pulls the catalog, a fresh process with --data would otherwise reject a custom criteria value on the very first call — the registry would still be empty.

(*aicr.Client).LoadCatalog(ctx) forces an eager catalog parse, seeding the Client’s per-provider registry. The CLI recipe Action calls it right after constructing the Client so the registry is populated before any criteria lookup. Any caller that wires its own --data provider must likewise call LoadCatalog before validating criteria for the same reason. (The package-level recipe.LoadCatalog(ctx) seeds the default embedded provider’s registry and is used by the embedded-only path.)

API surface

Function	Purpose
`NewCriteriaRegistry()`	Constructs an empty ephemeral registry (OSS fast-path values only; honors `AICR_CRITERIA_STRICT`).
`GetCriteriaRegistryFor(dp)`	Returns the per-`DataProvider` registry, seeded from that provider’s overlays (the common case).
`(*CriteriaRegistry).Register(field, value, origin)`	Records a value; embedded never downgrades.
`(*CriteriaRegistry).Has(field, value)`	Lookup; respects strict mode (external hidden when strict).
`(*CriteriaRegistry).HasEmbedded(field, value)`	Embedded-only lookup, regardless of strict.
`(*CriteriaRegistry).Values(field)`	Sorted union of known values for help / autocomplete.
`(*CriteriaRegistry).SetStrict(bool)`	Toggle strict mode. Composes with `AICR_CRITERIA_STRICT`.
`(*CriteriaRegistry).Reset()`	Test helper; re-reads `AICR_CRITERIA_STRICT` from env.
`(*aicr.Client).LoadCatalog(ctx)`	Eager catalog load — seeds the Client’s per-provider registry.
`(*CriteriaRegistry).All{Service,Accelerator,Intent,OS,Platform}Types()`	Union of static OSS list + values registered in this registry.
`GetCriteria{Service,Accelerator,Intent,OS,Platform}Types()`	Static OSS list only (stable; not affected by `--data`).