Component Catalog
AICR recipes are composed of components — the individual software packages that make up a GPU-accelerated Kubernetes runtime. This page lists every component that can appear in a recipe.
Note: Recipes include components as appropriate. Not every component listed here appears in every recipe.
The source of truth is recipes/registry.yaml. Each entry in the registry defines the component’s Helm chart (or Kustomize source), default version, namespace, and node scheduling configuration. If a component is not listed there, it cannot appear in a recipe.
See also: Recipe Health reports the structural health of every recipe these components compose into — resolvability and chart-pin hygiene across the whole criteria matrix.
Components
How Components Are Selected
Not every component appears in every recipe. The recipe engine selects components based on the overlay chain for your environment:
- Base components (cert-manager, kube-prometheus-stack) appear in most recipes.
- Cloud-specific components (aws-efa, aws-ebs-csi-driver) are added when the service matches.
- Intent-specific components (agentgateway, agentgateway-crds) are added based on workload intent (e.g., inference recipes include the inference gateway).
- Platform-specific components (slinky-slurm-operator, slinky-slurm, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching --platform. For --platform slurm, all three Slinky pieces (slinky-slurm-operator-crds, slinky-slurm-operator, slinky-slurm) are declared inline per slurm leaf overlay, the same shape dynamo-platform uses across *-inference-dynamo leaves. Leaves that want only the operator inline the CRDs and the operator and omit the slinky-slurm componentRef.
- Accelerator/OS-specific tuning (nodewright-customizations, nvidia-dra-driver-gpu) varies by hardware and OS combination.
NFD Topology Updater
Production GPU leaf recipes (H100, GB200, RTX Pro 6000 on EKS / AKS / GKE / OKE / LKE) enable the NFD Topology Updater. It publishes per-node NodeResourceTopology CRDs that describe NUMA zones, GPU-to-NUMA affinity, and NIC-to-NUMA affinity. Runtime consumers (NUMA-aware schedulers, debugging via kubectl get noderesourcetopologies) can read these CRDs without further configuration.
The Topology Updater requires the kubelet podResources gRPC socket. The KubeletPodResources feature gate has been on by default since Kubernetes 1.15 (Beta) and reached GA in Kubernetes 1.28; AICR’s recipe constraints on the affected leaves require Kubernetes ≥ 1.30, so this is satisfied in practice. Recipes targeting Kubernetes < 1.15 must enable the feature gate explicitly. Kind / KWOK simulated clusters do not run a real kubelet and therefore leave the Topology Updater disabled, so kind-based recipes will not see NodeResourceTopology CRDs.
See the upstream Topology Updater docs for runtime consumer examples.
To see exactly which components appear in a given recipe, generate one:
The output lists every component with its pinned version and configuration values.
Inference Gateway Network Exposure
Inference recipes include the agentgateway component, which deploys an inference-gateway Gateway. The agentgateway controller materializes that Gateway into a Service of type LoadBalancer, so on every cloud the platform provisions an internet-facing load balancer for the (plaintext HTTP, unauthenticated) inference endpoint. By default that load balancer accepts traffic from any source (0.0.0.0/0).
To restrict it to trusted networks, set agentgateway.allowedSourceRanges to a list of CIDR (Classless Inter-Domain Routing) blocks. The values are rendered into the generated Service’s spec.loadBalancerSourceRanges, which the AWS, GCP, Azure, and OCI cloud load balancers all honor — so one setting locks the gateway down on every platform.
Do not use plain --set for this key. --set agentgateway:allowedSourceRanges=<cidr> exits 0 but renders loadBalancerSourceRanges as a bare string instead of a list, producing a type-invalid Service (the gateway may stay open to 0.0.0.0/0, or the CR apply is rejected). Use the list-aware --set-json / --set-file flags from the CLI:
or scope the gateway through a recipe overlay or componentRef override:
The default is intentionally empty rather than a fixed CIDR: a baked-in range would firewall every downstream deployment to one network and lock other operators out of their own gateway. Each operator should scope this to their own trusted networks. An empty list leaves the load balancer open to 0.0.0.0/0.
This setting filters by source IP only; it does not add TLS or authentication to the gateway listener.
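The list-versus-string distinction above can be illustrated with a small helper. This is a minimal sketch, not part of AICR: the function name source_ranges_json is hypothetical, and it only shows how a list of CIDR blocks might be validated and serialized as a JSON array before being handed to a list-aware flag.

```python
import ipaddress
import json


def source_ranges_json(cidrs):
    """Validate CIDR blocks and serialize them as a JSON array string.

    Hypothetical helper for illustration only; it is not an AICR API.
    """
    # A bare string is the failure mode described above: the value must be a list.
    if isinstance(cidrs, str):
        raise TypeError("allowedSourceRanges must be a list of CIDRs, not a bare string")
    # strict=True rejects entries with host bits set, such as "10.0.0.1/8".
    networks = [ipaddress.ip_network(c, strict=True) for c in cidrs]
    return json.dumps([str(n) for n in networks])


if __name__ == "__main__":
    print(source_ranges_json(["10.0.0.0/8", "192.168.1.0/24"]))
```

The resulting string is a JSON list rather than a scalar, which is the shape spec.loadBalancerSourceRanges expects. The exact flag syntax for passing it is defined by the AICR CLI and is not assumed here.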
Exposure guardrails
Because the open-by-default load balancer is otherwise silent, AICR surfaces it in two places:
- Bundle-time warning. When a bundle includes agentgateway with an unscoped allowedSourceRanges (empty, or including an any-source CIDR such as 0.0.0.0/0 or ::/0), aicr bundle prints a warning that the inference-gateway will be provisioned open to 0.0.0.0/0, with the remediation above. This mirrors the existing storage-class PVC warning and does not block bundle generation.
- Conformance check. The inference-gateway conformance check (run during aicr validate --phase conformance on a live cluster) inspects the gateway’s LoadBalancer Service and records its exposure as evidence: the source ranges if scoped, or an explicit “open to 0.0.0.0/0” finding if not. By default an open gateway is a non-fatal warning (open-by-default is intentional). Set AICR_REQUIRE_SCOPED_INFERENCE_GATEWAY=true on the validator environment to escalate an open gateway to a check failure (fail-closed policy).
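The decision logic behind these two guardrails can be sketched as follows. This is an illustrative model under stated assumptions, not AICR's implementation: the function name gateway_verdict and the "pass" / "warn" / "fail" labels are hypothetical, the environment variable is assumed to be compared case-insensitively against "true", and a list containing an any-source CIDR is assumed to count as open in the conformance check just as it does in the bundle-time warning.

```python
import os

# Any-source CIDRs named in the bundle-time warning description.
ANY_SOURCE = ("0.0.0.0/0", "::/0")


def gateway_verdict(source_ranges, env=None):
    """Classify gateway exposure as "pass", "warn", or "fail".

    Hypothetical sketch; the labels and parsing rules are assumptions.
    """
    env = os.environ if env is None else env
    # Unscoped means: no ranges at all, or at least one any-source range.
    unscoped = not source_ranges or any(r in ANY_SOURCE for r in source_ranges)
    if not unscoped:
        return "pass"
    # Assumed parsing: only the value "true" (any case) enables fail-closed.
    strict = env.get("AICR_REQUIRE_SCOPED_INFERENCE_GATEWAY", "").lower() == "true"
    return "fail" if strict else "warn"
```

Passing the environment as a parameter keeps the function easy to exercise without mutating the process environment.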
Adding Components
New components are added declaratively in recipes/registry.yaml — no Go code required. See the Contributing Guide and Components docs for details.
Upgrade Notes
This section covers migration steps for upgrading from a prior AICR-generated bundle to a newer one that changes how a component delivers its Kubernetes resources.
A generated recipe is a point-in-time artifact of the AICR binary that produced it: the embedded registry, overlays, manifest paths, and chart pins are part of that binary’s surface. When upgrading AICR, regenerate the recipe from scratch with the new binary (aicr recipe ...) before re-bundling. aicr bundle --recipe <old-file> against a newer binary may fail if the saved recipe references manifest paths the new release has moved or removed (see Bundle Generation Fails for the specific error).
gpu-operator: dcgm-exporter ConfigMap moved into the main release
Earlier bundles shipped the dcgm-exporter ConfigMap as a post-manifest in a separate Helm release named gpu-operator-post. The in-cluster ConfigMap therefore carries ownership annotations pointing at that release:
Newer bundles render the ConfigMap directly from the main gpu-operator chart’s dcgmExporter.config.data values. On upgrade, Helm 3 refuses to claim the existing ConfigMap because its annotations point at a different release:
Fresh installs are not affected. To migrate an existing cluster, remove the stale gpu-operator-post release before applying the new bundle.
Raw Helm (per-component bundle / deploy.sh):
helm uninstall removes the ConfigMap it owns; the next gpu-operator upgrade re-creates it from values.
Helmfile — the new bundle no longer references gpu-operator-post, so helmfile apply will not prune it on its own. Run the helm uninstall above first, then helmfile apply.
Argo CD — delete the stale Application (it will not self-prune unless an ApplicationSet was managing it), then sync the updated gpu-operator application:
Flux — delete the stale HelmRelease so Flux uninstalls the release and removes the ConfigMap, then reconcile the updated gpu-operator HelmRelease. The example below assumes the Flux control plane runs in flux-system; substitute the namespace where your Flux installation lives:
After migration, confirm the ConfigMap is owned by the gpu-operator release:
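One way to script this ownership check is sketched below in Python. It reads the standard Helm 3 ownership metadata (the app.kubernetes.io/managed-by label and the meta.helm.sh/release-name and meta.helm.sh/release-namespace annotations) from an object's metadata, for example as parsed from kubectl get configmap <name> -n <namespace> -o json. The function names are hypothetical, and the gpu-operator namespace in the usage line is an assumption; substitute the namespace your release is installed in.

```python
def helm_owner(metadata):
    """Return (release_name, release_namespace) for a Helm-managed object, else None.

    `metadata` is the object's .metadata mapping. Hypothetical helper.
    """
    labels = metadata.get("labels") or {}
    annotations = metadata.get("annotations") or {}
    # Helm marks resources it manages with this label.
    if labels.get("app.kubernetes.io/managed-by") != "Helm":
        return None
    name = annotations.get("meta.helm.sh/release-name")
    namespace = annotations.get("meta.helm.sh/release-namespace")
    if name is None or namespace is None:
        return None
    return (name, namespace)


def owned_by(metadata, release, namespace):
    """True when the object's Helm ownership metadata matches the given release."""
    return helm_owner(metadata) == (release, namespace)


if __name__ == "__main__":
    # Example metadata as it might look after migration (namespace is assumed).
    migrated = {
        "labels": {"app.kubernetes.io/managed-by": "Helm"},
        "annotations": {
            "meta.helm.sh/release-name": "gpu-operator",
            "meta.helm.sh/release-namespace": "gpu-operator",
        },
    }
    print(owned_by(migrated, "gpu-operator", "gpu-operator"))
```

A ConfigMap still annotated with gpu-operator-post would make owned_by return False, which corresponds to the state in which Helm refuses to adopt the resource.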