This page is the architecture entry point for AICR contributors and maintainers. Its primary job is to set boundaries: what AICR is, what it isn’t, and what kinds of contributions belong here.
For task-level guides, see the per-component pages: CLI, API server, validator, component validations, bundlers, recipes and data.
AICR is a design-time tool. Given a description of a target environment, it generates validated GPU-cluster configuration artifacts that an established deployment tool — Helm, Argo CD, Flux — consumes.
Core artifacts:
Each stage produces a serializable artifact (file or ConfigMap) and is independently invocable. Reproducibility — same inputs, same outputs — is non-negotiable.
AICR is not a deployment engine. It does not:
kubectl apply
These responsibilities belong to the deployment tool that consumes AICR’s artifacts (e.g. Helm, Argo CD, Flux). These tools own release reconciliation and lifecycle.
A note on terminology: code under pkg/bundler includes components called
deployers. Despite the name, they are output adapters that emit artifacts in
a tool-specific format (Helm bundle, Argo CD Application). They do
not perform deployment.
The only in-cluster component is the snapshot agent — a one-shot Kubernetes Job that captures cluster state into a ConfigMap and exits. It is an input collector, not a long-lived runtime component, and is not part of the deployed system.
These boundaries are the primary review criterion for contributions. A change that crosses them is a re-architecture, not a feature, and should be discussed in an issue or ADR before code is written.
In scope — artifact generation:
Out of scope — deployment-time concerns:
kubectl apply plus bespoke wait scripts and custom uninstall
logic)
If a feature requires AICR to keep running after artifact generation, or
to drive kubectl or make direct API calls to deploy what it produced, it
belongs in a deployment tool, not AICR.
AICR emits artifacts in formats consumed by community-standard deployment tools. We target what the community already uses rather than rolling our own:
values.yaml and an install script
Application manifests with sync-wave ordering
We are open to adding support for additional community-standard targets (Flux, Helmfile, Kustomize) when there is demonstrated demand. We do not add custom or proprietary deployment mechanisms: they create an unsustainable maintenance burden without serving the broader community, and they pull deployment-time orchestration into AICR, crossing the boundary we are explicitly maintaining.
Validated configuration exists independent of how it is rendered, packaged, or deployed. Correctness must not be coupled to a specific tool, workflow, or delivery mechanism.
Given the same inputs, the same system version must always produce the same result. This rules out hidden state, implicit defaults, and non-deterministic behavior — all of which would break the trust model that downstream consumers rely on.
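One way to make the determinism requirement checkable is to compare digests of serialized artifacts. The following is a minimal sketch, not AICR's implementation: the `Artifact` type and `Digest` function are hypothetical names introduced for illustration. It relies on the fact that Go's `encoding/json` sorts map keys when marshaling, so map insertion order cannot leak into the output.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Artifact is a hypothetical stand-in for any serializable stage output.
type Artifact struct {
	Version string            `json:"version"`
	Values  map[string]string `json:"values"`
}

// Digest serializes the artifact and returns its SHA-256 digest in hex.
// encoding/json writes struct fields in declaration order and sorts map keys,
// so equal inputs always serialize to identical bytes.
func Digest(a Artifact) (string, error) {
	b, err := json.Marshal(a)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a := Artifact{Version: "v1", Values: map[string]string{"b": "2", "a": "1"}}
	b := Artifact{Version: "v1", Values: map[string]string{"a": "1", "b": "2"}}
	da, _ := Digest(a)
	db, _ := Digest(b)
	fmt.Println(da == db) // true: map insertion order does not affect the bytes
}
```

A check like this can run in CI by generating the same artifact twice and failing if the digests differ.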
More specific recipes must never be matched unless explicitly requested. Generic intent cannot silently resolve to specialized configurations. This preserves user control and prevents accidental misconfiguration.
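The matching rule above can be read as a subset check: a recipe applies only if every dimension it constrains was set explicitly in the query. The sketch below illustrates that reading under assumptions. The `Criteria` type, the `Matches` function, and the dimension names `service` and `accelerator` are hypothetical and are not taken from the AICR codebase.

```go
package main

import "fmt"

// Criteria maps a dimension (for example "service") to a required value.
// Hypothetical type for illustration only.
type Criteria map[string]string

// Matches reports whether a recipe with the given criteria applies to a query.
// Every dimension the recipe constrains must be present in the query with the
// same value. A dimension missing from the query never matches, so a generic
// query cannot silently resolve to a more specialized recipe.
func Matches(recipe, query Criteria) bool {
	for dim, want := range recipe {
		got, ok := query[dim]
		if !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	generic := Criteria{"service": "eks"}
	specialized := Criteria{"service": "eks", "accelerator": "h100"}

	query := Criteria{"service": "eks"}
	fmt.Println(Matches(generic, query))     // true
	fmt.Println(Matches(specialized, query)) // false: accelerator was not requested

	explicit := Criteria{"service": "eks", "accelerator": "h100"}
	fmt.Println(Matches(specialized, explicit)) // true: explicitly requested
}
```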
Trust is established through evidence, not assertions. Every released artifact must carry verifiable, non-falsifiable proof of where it came from and how it was produced. See SECURITY.md for SLSA, SBOM, and attestation details.
The system must integrate into how users already work. AICR provides validated configuration, not a new operational model. If adoption requires retraining users on “the right way,” the design has failed.
Stages can be invoked individually or chained. Inputs and outputs flow
through files, stdout, or Kubernetes ConfigMaps (cm://namespace/name
URI), which lets the snapshot agent hand off to a CLI or API server
running outside the cluster. Detail per stage lives in the
CLI and API server pages.
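As an illustration of the cm://namespace/name form, the sketch below splits such a URI with the standard library's `net/url`. The function name `ParseConfigMapURI` and the example namespace and ConfigMap names are assumptions made for this example. AICR's own parser may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// ParseConfigMapURI splits a cm://namespace/name URI into its two parts.
// Hypothetical helper for illustration only.
func ParseConfigMapURI(raw string) (namespace, name string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", fmt.Errorf("parse %q: %w", raw, err)
	}
	if u.Scheme != "cm" {
		return "", "", fmt.Errorf("unsupported scheme %q, want cm", u.Scheme)
	}
	// For cm://namespace/name, url.Parse puts the namespace in Host and
	// "/name" in Path.
	namespace = u.Host
	name = strings.TrimPrefix(u.Path, "/")
	if namespace == "" || name == "" || strings.Contains(name, "/") {
		return "", "", fmt.Errorf("malformed ConfigMap URI %q, want cm://namespace/name", raw)
	}
	return namespace, name, nil
}

func main() {
	ns, name, err := ParseConfigMapURI("cm://gpu-operator/aicr-snapshot")
	fmt.Println(ns, name, err) // gpu-operator aicr-snapshot <nil>
}
```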
Critical separation: pkg/cli and pkg/server are user-interaction
packages — they capture intent, validate input, and format output. All
business logic lives in functional packages (composed by the
pkg/client/v1 facade) so both entry points share it. Adding business
logic to pkg/cli or pkg/server handlers is a boundary violation.
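The separation can be pictured as two thin adapters around one shared function. This is a sketch under assumptions: `GenerateRecipe`, `runCLI`, `handleRecipe`, and `callHTTP` are hypothetical names, and the real facade in pkg/client/v1 is not shown here.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// GenerateRecipe stands in for business logic that would live in a
// functional package, not in an entry point.
func GenerateRecipe(service string) (string, error) {
	if service == "" {
		return "", fmt.Errorf("service is required")
	}
	return "recipe for " + service, nil
}

// runCLI is the CLI adapter: parse arguments, delegate, format the result.
func runCLI(args []string) (output string, exitCode int) {
	if len(args) != 1 {
		return "usage: recipe <service>", 2
	}
	out, err := GenerateRecipe(args[0])
	if err != nil {
		return "error: " + err.Error(), 1
	}
	return out, 0
}

// handleRecipe is the HTTP adapter: read the query, delegate, write the result.
func handleRecipe(w http.ResponseWriter, req *http.Request) {
	out, err := GenerateRecipe(req.URL.Query().Get("service"))
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	fmt.Fprintln(w, out)
}

// callHTTP exercises the handler in-process for demonstration.
func callHTTP(target string) (status int, body string) {
	rec := httptest.NewRecorder()
	handleRecipe(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	out, code := runCLI([]string{"eks"})
	fmt.Println(code, out) // 0 recipe for eks
	status, body := callHTTP("/recipe?service=eks")
	fmt.Println(status, body) // 200 recipe for eks
}
```

Neither adapter contains a rule about what a valid recipe is. Both would pick up a change to `GenerateRecipe` without being edited.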
errgroup
Collectors run in parallel under errgroup.WithContext. Failure of any
collector cancels the rest via context. Fail-fast is the default;
best-effort partial collection would hide systemic problems behind
partial data and is intentionally not supported.
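The fail-fast behavior described above can be sketched with the standard library alone. The code below is a stand-in that mimics what `errgroup.WithContext` from golang.org/x/sync provides. It is written with `sync.WaitGroup` and `context.WithCancel` only so the example has no external dependency. The `Collector` signature and the sample collectors are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Collector is a hypothetical signature for one state source.
type Collector func(ctx context.Context) error

// CollectAll runs every collector concurrently. The first failure cancels the
// shared context so the remaining collectors can stop early, and that first
// error is returned once all goroutines have finished.
func CollectAll(ctx context.Context, collectors ...Collector) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	var (
		wg    sync.WaitGroup
		once  sync.Once
		first error
	)
	for _, c := range collectors {
		wg.Add(1)
		go func(c Collector) {
			defer wg.Done()
			if err := c(ctx); err != nil {
				once.Do(func() {
					first = err
					cancel() // tell sibling collectors to stop
				})
			}
		}(c)
	}
	wg.Wait()
	return first
}

var errGPU = errors.New("gpu collector failed")

// failing returns an error immediately.
func failing(ctx context.Context) error { return errGPU }

// blocking waits until the context is canceled, as a slow collector would.
func blocking(ctx context.Context) error { <-ctx.Done(); return ctx.Err() }

// succeeding finishes without error.
func succeeding(ctx context.Context) error { return nil }

// runCollectors is a convenience wrapper using a background context.
func runCollectors(collectors ...Collector) error {
	return CollectAll(context.Background(), collectors...)
}

func main() {
	fmt.Println(runCollectors(blocking, failing)) // gpu collector failed
}
```

Because `blocking` only returns after cancellation, the program terminates only if the failure really cancels its sibling.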
Collectors implement a common interface and self-register. Adding a new state source does not modify existing collectors.
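A common Go shape for this is a package-level registry that each collector adds itself to from its own `init` function. The sketch below assumes that shape. The `Collector` interface, `Register`, `Names`, and the two sample collectors are hypothetical and may not match AICR's actual definitions.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Collector is a hypothetical common interface for state sources.
type Collector interface {
	Name() string
	Collect() (map[string]string, error)
}

var (
	mu       sync.RWMutex
	registry = map[string]Collector{}
)

// Register adds a collector to the registry. Registering the same name twice
// is a programming error, so it panics at startup.
func Register(c Collector) {
	mu.Lock()
	defer mu.Unlock()
	if _, dup := registry[c.Name()]; dup {
		panic("collector registered twice: " + c.Name())
	}
	registry[c.Name()] = c
}

// Names returns the registered collector names in sorted order.
func Names() []string {
	mu.RLock()
	defer mu.RUnlock()
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

type osCollector struct{}

func (osCollector) Name() string { return "os" }
func (osCollector) Collect() (map[string]string, error) {
	return map[string]string{"kernel": "example"}, nil
}

type gpuCollector struct{}

func (gpuCollector) Name() string { return "gpu" }
func (gpuCollector) Collect() (map[string]string, error) {
	return map[string]string{"driver": "example"}, nil
}

// In a real layout each collector would sit in its own file with its own
// init, so adding one never touches the others.
func init() { Register(osCollector{}) }
func init() { Register(gpuCollector{}) }

func main() {
	fmt.Println(Names()) // [gpu os]
}
```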
The recipe store is read-only after initialization (sync.Once).
Mutations happen on per-request clones. This avoids locks and makes
the API server safe for concurrent requests.
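The pattern can be sketched as follows, assuming a store that is populated once through `sync.Once` and then only read. The `Recipe` type, `Clone`, `Get`, and the sample data are hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Recipe is a hypothetical recipe record.
type Recipe struct {
	Name   string
	Values map[string]string
}

// Clone returns a deep copy so callers can mutate it freely.
func (r Recipe) Clone() Recipe {
	values := make(map[string]string, len(r.Values))
	for k, v := range r.Values {
		values[k] = v
	}
	return Recipe{Name: r.Name, Values: values}
}

var (
	loadOnce sync.Once
	store    map[string]Recipe
)

// loadStore populates the store. It runs exactly once.
func loadStore() {
	store = map[string]Recipe{
		"base": {Name: "base", Values: map[string]string{"driver": "default"}},
	}
}

// Get returns a private copy of a recipe. After initialization the shared
// store is only read, so concurrent callers need no lock.
func Get(name string) (Recipe, bool) {
	loadOnce.Do(loadStore)
	r, ok := store[name]
	if !ok {
		return Recipe{}, false
	}
	return r.Clone(), true
}

func main() {
	r, _ := Get("base")
	r.Values["driver"] = "custom" // mutates only this request's clone

	again, _ := Get("base")
	fmt.Println(again.Values["driver"]) // default
}
```

The deep copy in `Clone` matters: copying the struct alone would still share the underlying map between requests.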
pkg/k8s/client caches a single clientset across the process to avoid
connection exhaustion. Both in-cluster and out-of-cluster (kubeconfig)
authentication are supported transparently.
Long-running Kubernetes operations (such as waiting for a Job) use the watch
API rather than polling loops. See pkg/k8s/pod.
All errors flow through pkg/errors with a typed code. The HTTP
layer maps codes to status; the CLI maps codes to exit codes. Wrapping
rules and error code semantics live in
CLAUDE.md.
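To illustrate the idea of one typed code feeding two mappings, here is a minimal sketch. Everything in it is an assumption made for the example: the `Code` values, the `Error` type, and the specific HTTP statuses and exit codes are hypothetical and are not the actual contents of pkg/errors.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Code is a hypothetical typed error code.
type Code string

const (
	CodeInvalidRequest Code = "INVALID_REQUEST"
	CodeNotFound       Code = "NOT_FOUND"
	CodeInternal       Code = "INTERNAL"
)

// Error carries a code, a message, and an optional wrapped cause.
type Error struct {
	Code    Code
	Message string
	Cause   error
}

func (e *Error) Error() string { return fmt.Sprintf("%s: %s", e.Code, e.Message) }
func (e *Error) Unwrap() error { return e.Cause }

// CodeOf extracts the code from an error chain. Untyped errors are treated
// as internal.
func CodeOf(err error) Code {
	var e *Error
	if errors.As(err, &e) {
		return e.Code
	}
	return CodeInternal
}

// HTTPStatus is the mapping an HTTP layer would apply.
func HTTPStatus(err error) int {
	switch CodeOf(err) {
	case CodeInvalidRequest:
		return http.StatusBadRequest
	case CodeNotFound:
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}

// ExitCode is the mapping a CLI would apply. The numbers are illustrative.
func ExitCode(err error) int {
	if err == nil {
		return 0
	}
	switch CodeOf(err) {
	case CodeInvalidRequest:
		return 2
	case CodeNotFound:
		return 3
	default:
		return 1
	}
}

func main() {
	err := fmt.Errorf("generate: %w", &Error{Code: CodeNotFound, Message: "recipe missing"})
	fmt.Println(HTTPStatus(err), ExitCode(err)) // 404 3
}
```

Because both mappings go through `CodeOf`, wrapping an error with `%w` on the way up does not lose its code.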
AICR can be invoked in three shapes. None of them are runtime components in the deployed cluster — all are design-time tooling.
Single binary. Local development, CI pipelines, troubleshooting.
Stateless HTTP service for programmatic recipe and bundle generation. Multi-tenant deployments scale horizontally behind a load balancer. The server returns artifacts; it does not deploy them. Endpoints, middleware, and operational details live in /aicr/contributor-guide/api-server.
A Kubernetes Job that runs once on a target cluster, captures state
into a ConfigMap, and exits. The CLI or API server reads the ConfigMap
(cm://namespace/name URI) as input to recipe generation or
validation. The Job is not a controller, has no reconcile loop, and is
not part of the deployed system.
Reproducibility requires that failures during artifact generation be
explicit, not silent. See pkg/errors for code semantics.
errgroup. The whole snapshot
fails. Best-effort mode is intentionally not the default.
client-go/util/retry. After exhaustion, return a
structured error; do not synthesize fake measurements.
HTTP-layer failure handling (rate limiting, graceful shutdown, panic recovery) lives in /aicr/contributor-guide/api-server. Supply-chain and CI failure handling lives in CONTRIBUTING.md.
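The rule of retrying and then failing explicitly, which this section attributes to client-go/util/retry, can be illustrated with a standard-library stand-in. The sketch below does not use client-go. `Retry`, `ErrTransient`, and `ExhaustedError` are hypothetical names, and the doubling delay is an assumed backoff policy.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTransient marks failures that are worth retrying. Hypothetical sentinel.
var ErrTransient = errors.New("transient failure")

// ExhaustedError is the structured error returned when retries run out.
type ExhaustedError struct {
	Attempts int
	Last     error
}

func (e *ExhaustedError) Error() string {
	return fmt.Sprintf("gave up after %d attempts: %v", e.Attempts, e.Last)
}

func (e *ExhaustedError) Unwrap() error { return e.Last }

// Retry calls fn up to attempts times, doubling the delay between tries.
// Non-transient errors are returned immediately. On exhaustion it reports
// the failure explicitly and never substitutes a default value.
func Retry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var last error
	for i := 1; i <= attempts; i++ {
		last = fn()
		if last == nil {
			return nil
		}
		if !errors.Is(last, ErrTransient) {
			return last
		}
		if i < attempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return &ExhaustedError{Attempts: attempts, Last: last}
}

func main() {
	err := Retry(3, time.Millisecond, func() error { return ErrTransient })
	fmt.Println(err) // gave up after 3 attempts: transient failure
}
```

The caller receives either a real result or an error that records how many attempts were made, which keeps a failed measurement distinguishable from a measured value.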