Component Catalog

View as Markdown

AICR recipes are composed of components — the individual software packages that make up a GPU-accelerated Kubernetes runtime. This page lists every component that can appear in a recipe.

Note: Components are included as appropriate in recipes. Not every component listed here will appear in a recipe.

The source of truth is recipes/registry.yaml. Each entry in the registry defines the component’s Helm chart (or Kustomize source), default version, namespace, and node scheduling configuration. If a component is not listed there, it cannot appear in a recipe.

See also: Recipe Health reports the structural health of every recipe these components compose into — resolvability and chart-pin hygiene across the whole criteria matrix.

Components

ComponentDescriptionSource
gpu-operatorManages the GPU driver and runtime lifecycle on Kubernetes nodes. Handles driver installation, container runtime configuration, device plugin, and GPU feature discovery.NVIDIA GPU Operator
network-operatorManages high-performance networking for GPU workloads. Configures RDMA, SR-IOV, and host networking for multi-node communication.NVIDIA Network Operator
nfdNode Feature Discovery — labels nodes with hardware features (PCI device IDs, kernel modules, CPU capabilities). Both gpu-operator and network-operator consume these labels. On production GPU recipes, the Topology Updater publishes per-node NodeResourceTopology CRDs describing NUMA zones and GPU/NIC affinity for downstream NUMA-aware schedulers.Node Feature Discovery
gke-nccl-tcpxoNCCL TCPxO network plugin for GKE. Provides optimized collective communication for multi-node GPU workloads on Google Kubernetes Engine. GKE-specific.
aws-efaDevice plugin for AWS Elastic Fabric Adapter. Enables low-latency networking on EKS clusters with EFA-capable instances. EKS-specific.AWS EFA K8s Device Plugin
cert-managerAutomates TLS certificate management. Required by several operators for webhook and API server certificates.cert-manager
gatekeeperAdmission controller for Kubernetes. Enforces policies and governance across the cluster using OPA (Open Policy Agent) ConstraintTemplates and Constraints.Open Policy Agent Gatekeeper
nodewright-operatorOS-level node tuning and configuration management. Applies kernel parameters, sysctl settings, and system-level optimizations to nodes.Nodewright
nodewright-customizationsEnvironment-specific node tuning profiles applied via Nodewright. Extends the operator with kernel params, hugepages, and other host-level configurations.
nvsentinelGPU health monitoring and automated remediation. Detects GPU errors and can cordon or drain affected nodes.NVSentinel
nvidia-dra-driver-gpuDynamic Resource Allocation (DRA) driver for GPUs. Advertises GPUs via the Kubernetes resource.k8s.io/v1 API instead of the legacy device plugin. Requires Kubernetes 1.34+ (DRA is GA in 1.34). See AKS GPU Setup for details. CLI alias: dradriver.NVIDIA DRA Driver
prometheus-operator-crdsCustom Resource Definitions for the prometheus-operator (Alertmanager, AlertmanagerConfig, PodMonitor, Probe, Prometheus, PrometheusRule, ServiceMonitor, ThanosRuler). Shipped as a separate release so the CRDs land before any chart that creates monitoring CRs; this breaks the helm-diff self-reference that otherwise blocks helmfile apply on a fresh cluster.prometheus-operator-crds
kube-prometheus-stackCluster monitoring: Prometheus, Grafana, Alertmanager, and node exporters. Provides GPU and cluster metrics collection and dashboards. CRDs are installed by the sibling prometheus-operator-crds release (this chart runs with crds.enabled: false).kube-prometheus-stack
prometheus-adapterExposes custom metrics from Prometheus to the Kubernetes metrics API. Enables HPA scaling based on GPU utilization and other custom metrics.prometheus-adapter
aws-ebs-csi-driverCSI driver for Amazon EBS volumes. Provides persistent storage for workloads on EKS. EKS-specific. Cluster-wide default StorageClass: AICR enables defaultStorageClass.enabled, so this component provisions a cluster-default gp3 StorageClass (ebs-csi-default-sc) on every EKS cluster that includes it — not just inference recipes; training overlays inherit it too. EKS ships no default SC of its own, so this makes dynamic provisioning (e.g. the inference-perf model cache) work zero-config. Two consequences to note: (1) if the cluster already has a default SC, Kubernetes treats multiple defaults as ambiguous — unset the other; (2) a PVC that previously failed-fast on “no default SC” will now silently bind gp3, which can mask a misconfiguration.AWS EBS CSI Driver
k8s-ephemeral-storage-metricsExports ephemeral storage usage metrics per pod. Useful for monitoring scratch space consumption on GPU nodes.k8s-ephemeral-storage-metrics
kai-schedulerDRA-aware gang scheduler with hierarchical queues and topology-aware placement. Ensures distributed training jobs land on nodes with optimal interconnect topology.KAI Scheduler
grovePod lifecycle management for Dynamo inference platform. Installed as a standalone component.Grove
dynamo-platformNVIDIA Dynamo inference serving platform with bundled CRDs. Distributed inference with KV-cache-aware routing, Dynamo request-plane traffic, a NATS-backed Kubernetes event plane for KV-cache events, and disaggregated prefill/decode.Dynamo
agentgateway-crdsCustom Resource Definitions for agentgateway (Kubernetes Gateway API implementation for AI/ML inference).agentgateway
agentgatewayKubernetes Gateway API implementation for AI/ML inference. Implements the Gateway API Inference Extension for model-aware ingress routing to InferencePool backends.agentgateway
k8s-nim-operatorNVIDIA NIM Operator for managing NIM (NVIDIA Inference Microservices) deployments on Kubernetes.K8s NIM Operator
kueueKubernetes-native job queuing system. Manages quotas and admits jobs for batch and AI workloads.Kueue
kubeflow-trainerKubeflow Training Operator for distributed training jobs (PyTorch, etc.). Manages multi-node training job lifecycle with JobSet integration.Kubeflow Trainer
slinky-slurm-operator-crdsCustom Resource Definitions for the SchedMD Slinky Slurm operator. Installs the slinky.slurm.net CRDs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). Installed separately to support CRD lifecycle management.Slinky Slurm Operator
slinky-slurm-operatorSchedMD Slinky Slurm operator and admission webhook. Manages the lifecycle of Slurm clusters declared via Slinky CRs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). Known limitation: chart v1.1.0 silently ignores operator.nodeSelector and webhook.nodeSelector (current chart behavior, not a planned feature); tracking SlinkyProject/slurm-operator#187 for the upstream fix.Slinky Slurm Operator
slinky-slurmSlinky-managed Slurm cluster instance: Controller (slurmctld) + LoginSet (sackd/sshd) + NodeSet (slurmd) + RestApi (slurmrestd). Reconciled by slinky-slurm-operator. Declared inline per slurm leaf overlay alongside slinky-slurm-operator-crds and slinky-slurm-operator (matching the dynamo-platform pattern) so each leaf can carry its own GPU/GRES tuning. Accounting (slurmdbd) requires an external MariaDB and is disabled in defaults — see recipes/components/slinky-slurm/values.yaml.Slinky Slurm Cluster Chart

How Components Are Selected

Not every component appears in every recipe. The recipe engine selects components based on the overlay chain for your environment:

  • Base components (cert-manager, kube-prometheus-stack) appear in most recipes.
  • Cloud-specific components (aws-efa, aws-ebs-csi-driver) are added when the service matches.
  • Intent-specific components (agentgateway, agentgateway-crds) are added based on workload intent (e.g., inference recipes include the inference gateway).
  • Platform-specific components (slinky-slurm-operator, slinky-slurm, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching --platform. For --platform slurm, all three Slinky pieces (slinky-slurm-operator-crds, slinky-slurm-operator, slinky-slurm) are declared inline per slurm leaf overlay — the same shape dynamo-platform uses across *-inference-dynamo leaves. Leaves that want the operator only inline the CRDs + operator and omit the slinky-slurm componentRef.
  • Accelerator/OS-specific tuning (nodewright-customizations, nvidia-dra-driver-gpu) varies by hardware and OS combination.

NFD Topology Updater

Production GPU leaf recipes (H100, GB200, RTX Pro 6000 on EKS / AKS / GKE / OKE / LKE) enable the NFD Topology Updater. It publishes per-node NodeResourceTopology CRDs that describe NUMA zones, GPU-to-NUMA affinity, and NIC-to-NUMA affinity. Runtime consumers (NUMA-aware schedulers, debugging via kubectl get noderesourcetopologies) can read these CRDs without further configuration.

The Topology Updater requires the kubelet podResources gRPC socket. The KubeletPodResources feature gate has been on by default since Kubernetes 1.15 (Beta) and reached GA in Kubernetes 1.28; AICR’s recipe constraints on the affected leaves require K8s ≥ 1.30 or higher, so this is satisfied in practice. Recipes targeting Kubernetes < 1.15 must enable the feature gate explicitly. Kind / KWOK simulated clusters do not run a real kubelet and therefore leave the Topology Updater disabled — kind-based recipes will not see NodeResourceTopology CRDs.

See the upstream Topology Updater docs for runtime consumer examples.

To see exactly which components appear in a given recipe, generate one:

$aicr recipe --service eks --accelerator h100 --os ubuntu --intent training -o recipe.yaml

The output lists every component with its pinned version and configuration values.

Inference Gateway Network Exposure

Inference recipes include the agentgateway component, which deploys an inference-gateway Gateway. The agentgateway controller materializes that Gateway into a Service of type LoadBalancer, so on every cloud the platform provisions an internet-facing load balancer for the (plaintext HTTP, unauthenticated) inference endpoint. By default that load balancer accepts traffic from any source (0.0.0.0/0).

To restrict it to trusted networks, set agentgateway.allowedSourceRanges to a list of CIDR (Classless Inter-Domain Routing) blocks. The values are rendered into the generated Service’s spec.loadBalancerSourceRanges, which the AWS, GCP, Azure, and OCI cloud load balancers all honor — so one setting locks the gateway down on every platform.

Do not use plain --set for this key. --set agentgateway:allowedSourceRanges=<cidr> exits 0 but renders loadBalancerSourceRanges as a bare string instead of a list, producing a type-invalid Service (the gateway may stay open to 0.0.0.0/0, or the CR apply is rejected). Use the list-aware --set-json / --set-file flags from the CLI:

$aicr bundle -r recipe.yaml \
> --set-json agentgateway:allowedSourceRanges='["216.228.127.128/30"]'

or scope the gateway through a recipe overlay or componentRef override:

1componentRefs:
2 - name: agentgateway
3 type: Helm
4 overrides:
5 allowedSourceRanges:
6 - 216.228.127.128/30 # e.g. corporate egress

The default is intentionally empty rather than a fixed CIDR: a baked-in range would firewall every downstream deployment to one network and lock other operators out of their own gateway. Each operator should scope this to their own trusted networks. An empty list leaves the load balancer open to 0.0.0.0/0.

This setting filters by source IP only; it does not add TLS or authentication to the gateway listener.

Exposure guardrails

Because the open-by-default load balancer is otherwise silent, AICR surfaces it in two places:

  • Bundle-time warning. When a bundle includes agentgateway with an unscoped allowedSourceRanges — empty, or including an any-source CIDR such as 0.0.0.0/0 or ::/0aicr bundle prints a warning that the inference-gateway will be provisioned open to 0.0.0.0/0, with the remediation above. This mirrors the existing storage-class PVC warning and does not block bundle generation.
  • Conformance check. The inference-gateway conformance check (run during aicr validate --phase conformance on a live cluster) inspects the gateway’s LoadBalancer Service and records its exposure as evidence — the source ranges if scoped, or an explicit “open to 0.0.0.0/0” finding if not. By default an open gateway is a non-fatal warning (open-by-default is intentional). Set AICR_REQUIRE_SCOPED_INFERENCE_GATEWAY=true on the validator environment to escalate an open gateway to a check failure (fail-closed policy).

Adding Components

New components are added declaratively in recipes/registry.yaml — no Go code required. See the Contributing Guide and Components docs for details.

Upgrade Notes

Migration steps when upgrading from a prior AICR-generated bundle to a newer one that changes how a component delivers its Kubernetes resources.

A generated recipe is a point-in-time artifact of the AICR binary that produced it: the embedded registry, overlays, manifest paths, and chart pins are part of that binary’s surface. When upgrading AICR, regenerate the recipe from scratch with the new binary (aicr recipe ...) before re-bundling. aicr bundle --recipe <old-file> against a newer binary may fail if the saved recipe references manifest paths the new release has moved or removed (see Bundle Generation Fails for the specific error).

gpu-operator: dcgm-exporter ConfigMap moved into the main release

Earlier bundles shipped the dcgm-exporter ConfigMap as a post-manifest in a separate Helm release named gpu-operator-post. The in-cluster ConfigMap therefore carries ownership annotations pointing at that release:

1meta.helm.sh/release-name: gpu-operator-post
2meta.helm.sh/release-namespace: gpu-operator

Newer bundles render the ConfigMap directly from the main gpu-operator chart’s dcgmExporter.config.data values. On upgrade, Helm 3 refuses to claim the existing ConfigMap because its annotations point at a different release:

Error: ConfigMap "dcgm-exporter" in namespace "gpu-operator" exists and cannot be
imported into the current release: invalid ownership metadata; annotation
validation error: key "meta.helm.sh/release-name" must equal "gpu-operator":
current value is "gpu-operator-post"

Fresh installs are not affected. To migrate an existing cluster, remove the stale gpu-operator-post release before applying the new bundle.

Raw Helm (per-component bundle / deploy.sh):

$helm uninstall gpu-operator-post --namespace gpu-operator

helm uninstall removes the ConfigMap it owns; the next gpu-operator upgrade re-creates it from values.

Helmfile — the new bundle no longer references gpu-operator-post, so helmfile apply will not prune it on its own. Run the helm uninstall above first, then helmfile apply.

Argo CD — delete the stale Application (it will not self-prune unless an ApplicationSet was managing it), then sync the updated gpu-operator application:

$argocd app delete gpu-operator-post --cascade

Flux — delete the stale HelmRelease so Flux uninstalls the release and removes the ConfigMap, then reconcile the updated gpu-operator HelmRelease. The example below assumes the Flux control plane runs in flux-system; substitute the namespace where your Flux installation lives:

$kubectl delete helmrelease gpu-operator-post --namespace flux-system

After migration, confirm the ConfigMap is owned by the gpu-operator release:

$kubectl get configmap dcgm-exporter -n gpu-operator \
> -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-name}'
$# Expected: gpu-operator