AI Cluster Runtime (AICR)

NVIDIA AI Cluster Runtime (AICR) generates validated, reproducible configuration artifacts for GPU-accelerated Kubernetes clusters. Given a description of your environment — cloud, accelerator, OS, intent — AICR emits the Helm, Argo CD, Flux, or Helmfile artifacts your deployment tool consumes. The output is hardware-aware, version-locked, and backed by SLSA Level 3 provenance.

For the project pitch, supported environments, and a feature overview, see the repository README.

Find Your Path

| If you are a… | Start here |
| --- | --- |
| User: operator deploying AICR to provision or validate a cluster | User Guide |
| Integrator: engineer embedding AICR in a CI/CD pipeline, GitOps flow, or larger platform | Integrator Guide |
| Contributor: developer extending AICR or shipping recipes | Contributor Guide |

User Guide

For operators running aicr against real clusters.

| Topic | Doc |
| --- | --- |
| Install the CLI | Installation |
| Full workflow, start to finish | End-to-End Tutorial |
| Every command and flag | CLI Reference |
| Render a recipe into deployment artifacts | Generating Bundles |
| REST API for aicrd | API Reference |
| Run the snapshot agent in-cluster | Agent Deployment |
| Validate a recipe against a live cluster | Validation |
| Components that can appear in a recipe | Component Catalog |
| Air-gapped mirroring | Air-Gap Mirror |

Integrator Guide

For pipelines and platforms that call AICR programmatically or host aicrd.

| Topic | Doc |
| --- | --- |
| CI/CD integration patterns | Automation |
| Self-host aicrd on Kubernetes | Kubernetes Deployment |
| Add or modify recipe metadata | Recipe Development |
| Verify artifacts (SLSA, SBOM, attestations) | Supply Chain Verification |
| Ship custom validators via --data | Validator Extension |
| Cloud-specific GPU setup | AKS, EKS networking, GKE networking, Talos |

Contributor Guide

For developers working on AICR itself.

| Topic | Doc |
| --- | --- |
| Architecture, boundaries, package map | Architecture Overview |
| Recipes, overlays, mixins | Recipes |
| Adding a component | Components |
| Adding a snapshot collector | Collectors |
| All four validation surfaces | Validators |
| CLI internals | CLI |
| API server internals | API Server |
| Testing surfaces and the make qualify gate | Testing |
| Release runbook | Maintaining AICR |

The Four-Stage Workflow

┌──────────┐    ┌────────┐    ┌──────────┐    ┌────────┐
│ Snapshot │───▶│ Recipe │───▶│ Validate │───▶│ Bundle │
└──────────┘    └────────┘    └──────────┘    └────────┘
  capture       generate      check           emit
  cluster       optimized     constraints     deployment
  state         config        vs. actual      artifacts

Each stage produces a serializable artifact and can be invoked independently, so stages can be chained or run standalone. Inputs and outputs flow through files, stdout, or Kubernetes ConfigMaps (cm://namespace/name). For the CLI walkthrough, see CLI Reference; for the architecture, see Architecture Overview (/aicr/contributor-guide/architecture-overview).
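
To make the cm://namespace/name form concrete, the following is a minimal Python sketch of how a caller might split such a URI into its two parts before reading or writing a ConfigMap. It is an illustration only, not AICR's own parser: the function name parse_configmap_uri, the ConfigMapRef type, and the example namespace and name are hypothetical, and the strict two-segment rule is an assumption based on the URI shape shown above.

```python
from typing import NamedTuple

CM_SCHEME = "cm://"  # scheme taken from the cm://namespace/name form in the docs


class ConfigMapRef(NamedTuple):
    """Hypothetical holder for the two parts of a ConfigMap URI."""
    namespace: str
    name: str


def parse_configmap_uri(uri: str) -> ConfigMapRef:
    """Split 'cm://namespace/name' into (namespace, name).

    Assumption: exactly two non-empty segments are required. AICR's real
    parser may be more lenient or accept additional forms.
    """
    if not uri.startswith(CM_SCHEME):
        raise ValueError(f"not a ConfigMap URI: {uri!r}")
    rest = uri[len(CM_SCHEME):]
    namespace, sep, name = rest.partition("/")
    if not sep or not namespace or not name or "/" in name:
        raise ValueError(f"expected cm://namespace/name, got {uri!r}")
    return ConfigMapRef(namespace, name)


if __name__ == "__main__":
    # Hypothetical namespace and ConfigMap name, for illustration only.
    print(parse_configmap_uri("cm://gpu-operator/aicr-snapshot"))
```

Rejecting malformed URIs early keeps a chained pipeline from silently falling back to a file path or stdout when a ConfigMap was intended.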

Glossary

Reference for the terms used across the docs site.

| Term | Definition |
| --- | --- |
| Snapshot | Captured state of a target system (OS, kernel, Kubernetes, GPU, SystemD). Produced by aicr snapshot or the in-cluster snapshot Job. |
| Recipe | Resolved configuration spec (component refs, constraints, deployment order) produced by aicr recipe from criteria or from a snapshot. |
| Criteria | Query parameters that select a recipe: service, accelerator, intent, os, platform, nodes. |
| Overlay | A recipe metadata file (kind: RecipeMetadata) under recipes/overlays/ matched by criteria. Composes via single-parent inheritance (spec.base). |
| Mixin | A composable fragment (kind: RecipeMixin) under recipes/mixins/ carrying only constraints and componentRefs, referenced via spec.mixins. |
| Bundle | Deployment artifacts emitted by aicr bundle: Helm values, manifests, install scripts, checksums. |
| Bundler | A per-component generator that emits the bundle inputs (e.g., GPU Operator bundler). |
| Deployer | An output adapter that serializes a bundle in a tool-specific format: helm, helmfile, argocd, argocdhelm, flux. |
| Component | A deployable software package (e.g., GPU Operator, Network Operator). Lives in recipes/registry.yaml. |
| ComponentRef | A reference to a component inside a recipe: version, source, values file, dependencies. |
| Constraint | A declarative validation rule on a recipe (e.g., K8s.server.version >= 1.32.4). |
| Validation Phase | A stage of aicr validate: readiness (always implicit), deployment, performance, conformance. |
| Measurement | A snapshot data point keyed by type (K8s, OS, GPU, SystemD), subtype, and reading. |
| Specificity | A score counting non-any criteria fields. More-specific overlays merge later. |
| Asymmetric matching | Criteria-matching rule: recipe any is a wildcard; query any does not match a specific recipe. |
| ConfigMap URI | cm://namespace/name. Reads or writes snapshots and recipes directly in Kubernetes ConfigMaps. |
| SLSA / SBOM | Supply-chain Levels for Software Artifacts (releases reach Build Level 3) and Software Bill of Materials shipped with binaries and images. |
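
The Specificity and Asymmetric matching entries describe rules that are easier to follow with a small worked example. The Python sketch below models them under stated assumptions: the Criteria class, the matches, specificity, and merge_order helpers, and the sample values (eks, gke, h100, training) are hypothetical, and every field is treated as a plain string even though the real schema may type fields such as nodes differently. It is not AICR's implementation.

```python
from dataclasses import dataclass, fields

ANY = "any"  # wildcard value named in the glossary


@dataclass(frozen=True)
class Criteria:
    """Hypothetical model of the six criteria fields listed in the glossary."""
    service: str = ANY
    accelerator: str = ANY
    intent: str = ANY
    os: str = ANY
    platform: str = ANY
    nodes: str = ANY  # assumption: modeled as a string for simplicity


def matches(recipe: Criteria, query: Criteria) -> bool:
    """Asymmetric match: a recipe field of 'any' accepts every query value,
    but a query field of 'any' does not satisfy a specific recipe value."""
    for f in fields(Criteria):
        want = getattr(recipe, f.name)
        got = getattr(query, f.name)
        if want != ANY and want != got:
            return False
    return True


def specificity(criteria: Criteria) -> int:
    """Count the fields that are set to something other than 'any'."""
    return sum(1 for f in fields(Criteria) if getattr(criteria, f.name) != ANY)


def merge_order(overlays: list[Criteria], query: Criteria) -> list[Criteria]:
    """Return matching overlays, least specific first, so that more-specific
    overlays merge later and can override earlier values."""
    return sorted((o for o in overlays if matches(o, query)), key=specificity)
```

With a query of Criteria(service="eks", accelerator="h100", intent="training"), an overlay that sets only service="eks" matches, while an overlay with service="gke" does not. An all-any query, in turn, matches only overlays whose fields are all any, which is what makes the rule asymmetric.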