CLI Reference
Complete reference for the aicr command-line interface.
Overview
AICR provides a four-step workflow for optimizing GPU infrastructure:
- Step 1: Capture system configuration
- Step 2: Generate optimization recipes
- Step 3: Validate constraints against cluster
- Step 4: Create deployment bundles
Global Flags
Available for all commands:
Logging Modes
AICR supports three logging modes:
- CLI Mode (default): Minimal user-friendly output
  - Just message text without timestamps or metadata
  - Error messages display in red (ANSI color)
  - Example: Snapshot captured successfully
- Text Mode (--debug): Debug output with full metadata
  - Key=value format with time, level, source location
  - Example: time=2025-01-06T10:30:00.123Z level=INFO module=aicr version=v1.0.0 msg="snapshot started"
- JSON Mode (--log-json): Structured JSON for automation
  - Machine-readable format for log aggregation
  - Example: {"time":"2025-01-06T10:30:00.123Z","level":"INFO","msg":"snapshot started"}
Examples:
Commands
aicr snapshot
Capture comprehensive system configuration including OS, GPU, Kubernetes, and SystemD settings.
Synopsis:
Flags:
Output Destinations:
- stdout: Default when no -o flag specified
- File: Local file path (/path/to/snapshot.yaml)
- ConfigMap: Kubernetes ConfigMap URI (cm://namespace/configmap-name)
What it captures:
- SystemD Services: containerd, docker, kubelet configurations
- OS Configuration: grub, kmod, sysctl, release info
- Kubernetes: server version, images, ClusterPolicy
- GPU: driver version, CUDA, MIG settings, hardware info
- NodeTopology: node topology (cluster-wide taints and labels across all nodes)
Examples:
Snapshot Config File Mode
Drive aicr snapshot from an AICRConfig document so the snapshot inputs version-control alongside the recipe, bundle, and validate steps in an end-to-end workflow.
Precedence: a CLI flag always wins over the matching config field. Selectors and tolerations omitted entirely inherit the snapshotter’s compiled-in defaults (tolerations defaults to tolerate all taints); an explicit empty list (tolerations: []) clears the tolerate-all default — the same nil-vs-empty semantics used by spec.validate.agent.
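The precedence and nil-vs-empty rules above can be sketched in Python. This is a minimal illustration under stated assumptions, not AICR's implementation: the function name resolve_tolerations and the dict shape of a toleration are hypothetical. Only the flag-over-config ordering and the tolerate-all default come from this reference.

```python
from typing import Optional

# Kubernetes toleration that matches every taint (no key, operator Exists).
TOLERATE_ALL = [{"operator": "Exists"}]


def resolve_tolerations(cli: Optional[list], config: Optional[list]) -> list:
    """Pick the effective tolerations (hypothetical helper).

    None means "omitted"; an empty list means "explicitly no tolerations".
    """
    if cli is not None:          # a CLI flag always wins over the config field
        return cli
    if config is not None:       # includes the explicit empty list
        return config
    return list(TOLERATE_ALL)    # omitted everywhere: inherit the default


print(resolve_tolerations(None, None))  # omitted: tolerate-all default applies
print(resolve_tolerations(None, []))    # explicit empty list clears the default
```

The key design point is that "omitted" and "empty" are different inputs, so a truthiness test such as `if config:` would wrongly treat `tolerations: []` as omitted.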
Custom Templates
The --template flag enables custom output formatting using Go templates with Sprig functions. Templates receive the full Snapshot struct:
Example template extracting key cluster info:
See examples/templates/snapshot-template.md.tmpl for a complete example template that generates a concise cluster report.
Agent Deployment Mode
When running against a cluster, AICR deploys a Kubernetes Job to capture the snapshot:
- Deploys RBAC: ServiceAccount, Role, RoleBinding, ClusterRole, ClusterRoleBinding
- Creates Job: Runs aicr snapshot as a container on the target node
- Waits for completion: Monitors Job status with configurable timeout
- Retrieves snapshot: Reads snapshot from ConfigMap after Job completes
- Writes output: Saves snapshot to specified output destination
- Cleanup: Deletes Job and RBAC resources (use --no-cleanup to keep for debugging)
Benefits of agent deployment:
- Capture configuration from actual cluster nodes (not local machine)
- No need to run kubectl manually
- Programmatic deployment for automation/CI/CD
- Reusable RBAC resources across multiple runs
Agent deployment requirements:
- Kubernetes cluster access (via kubeconfig)
- Cluster admin permissions (for RBAC creation)
- GPU nodes with nvidia-smi (for GPU metrics)
ConfigMap Output
When using ConfigMap URIs (cm://namespace/name), the snapshot is stored directly in Kubernetes:
Snapshot Structure:
aicr recipe
Generate optimized configuration recipes from query parameters or captured snapshots.
Synopsis:
Modes:
Config File Mode (Recommended)
Generate recipes using an AICRConfig document. The same file format also drives the bundle command, so a single file can describe an end-to-end recipe-to-bundle workflow.
Flags:
The config file uses a Kubernetes-style envelope:
Individual CLI flags always override config file values. For slice/map flags, presence on the CLI replaces the file’s value (no append).
--config accepts a local file path or an HTTP/HTTPS URL. ConfigMap (cm://) sources are not supported; export the data with kubectl get cm <name> -o yaml and pass the resulting file.
Query Mode
Generate recipes using direct system parameters:
Flags:
Service / Accelerator / OS / Intent / Platform value listings above are the OSS-embedded set. When --data registers additional values (e.g., undisclosed providers, proprietary platforms), the CLI admits them at runtime through the criteria registry; see Data Extension. The --criteria-strict flag restores the OSS-only set regardless of what --data contributes.
Examples:
Snapshot Mode
Generate recipes from captured snapshots:
Flags:
Snapshot Sources:
- File: Local file path (./snapshot.yaml)
- URL: HTTP/HTTPS URL (https://example.com/snapshot.yaml)
- ConfigMap: Kubernetes ConfigMap URI (cm://namespace/configmap-name)
Examples:
Output structure:
aicr query
Query a specific value from the fully hydrated recipe configuration. Resolves a recipe
from criteria (same as aicr recipe), merges all base, overlay, and inline value
overrides, then extracts the value at the given dot-path selector.
Synopsis:
Flags:
All aicr recipe flags are supported, plus:
Selector Syntax
Uses dot-delimited paths consistent with Helm --set and yq:
Leading dots are optional (yq-style): .components.gpu-operator.chart and
components.gpu-operator.chart are equivalent.
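As a rough illustration of the selector rules above, the following Python sketch walks a nested mapping along a dot-delimited path and treats a leading dot as optional. The function name select and the sample recipe fragment are hypothetical, and the sketch does not handle keys that themselves contain dots; how AICR treats those is not covered here.

```python
from typing import Any


def select(doc: dict, selector: str) -> Any:
    """Return the value at a dot-delimited path; a leading dot is optional."""
    node: Any = doc
    for key in selector.lstrip(".").split("."):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"no value at {selector!r} (stopped at {key!r})")
        node = node[key]
    return node


# Hypothetical hydrated recipe fragment; the chart value is illustrative only.
recipe = {"components": {"gpu-operator": {"chart": "example-chart"}}}

# Both spellings resolve to the same value.
assert select(recipe, ".components.gpu-operator.chart") == select(
    recipe, "components.gpu-operator.chart"
)
```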
Output:
- Scalar values (string, number, bool) are printed as plain text — no YAML wrapper
- Complex values (maps, lists) are printed as YAML (default) or JSON (--format json)
Examples:
Advanced Examples:
aicr validate
Validate a system snapshot against the constraints defined in a recipe to verify cluster compatibility. Supports multi-phase validation with different validation stages.
For a task-oriented walkthrough (capture snapshot → generate recipe → run each phase, with worked training and inference examples), see Validation.
Synopsis:
Flags:
Input Sources:
- File: Local file path (./recipe.yaml, ./snapshot.yaml)
- URL: HTTP/HTTPS URL (https://example.com/recipe.yaml)
- ConfigMap: Kubernetes ConfigMap URI (cm://namespace/configmap-name)
Validation Phases
Validation can be run in different phases to validate different aspects of the deployment:
Note: Readiness constraints (K8s version, OS, kernel) are always evaluated implicitly before any phase runs. If readiness fails, validation stops before deploying any Jobs.
Deployment phase checks:
The deployment phase verifies that the cluster is actually ready for GPU workloads — not just that install commands returned successfully. It covers:
- Enabled component namespaces are Active.
- Declared expectedResources (Deployments, DaemonSets, etc.) exist and are healthy.
- When nodewright-customizations is enabled: every Skyhook CR the recipe declares reports status.status == complete. The set of expected CR names is extracted from the recipe’s own ComponentRef.ManifestFiles for this component, so unrelated Skyhook CRs on the cluster (stale from prior deploys, or owned by another tenant) are ignored. If no Skyhook names can be extracted from those ManifestFiles, deployment validation fails closed as a recipe/configuration error instead of skipping.
- When gpu-operator is enabled: ClusterPolicy reports status.state == ready.
- When nvidia-dra-driver-gpu is enabled: the kubelet-plugin DaemonSet is ready. Discovery is by the upstream chart’s role-suffix convention: the validator finds the single DaemonSet in the component namespace whose name ends in -kubelet-plugin, so custom fullnameOverride values are handled automatically.
Graceful skip: If a component is declared in the recipe but its CRD is not yet registered on the cluster (e.g., fresh cluster, operator chart not installed), the corresponding readiness check is skipped rather than failing. Once the CRD is present, the check runs and a missing CR is treated as a real failure — for example, if the gpu-operator CRD is registered but no ClusterPolicy CR exists, deployment validation fails with a “CR missing” diagnostic rather than silently passing. Other errors still fail closed: an RBAC denial on skyhooks.skyhook.nvidia.com returns HTTP 403 (not a NoMatch), so the validator surfaces it as a failure instead of silently skipping the Skyhook check.
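The skip-versus-fail rules in the paragraph above amount to a small decision table. The Python sketch below encodes that table under stated assumptions: the function name, its boolean inputs, and the reason strings are all hypothetical, and the real validator derives these facts from Kubernetes API responses rather than from flags.

```python
def readiness_outcome(crd_registered: bool, forbidden: bool,
                      cr_exists: bool, cr_ready: bool) -> tuple[str, str]:
    """Map the observed cluster state to (outcome, reason)."""
    if forbidden:
        # An RBAC denial (HTTP 403) is not a NoMatch, so it fails closed.
        return "fail", "access denied"
    if not crd_registered:
        # CRD not yet registered: the check is skipped, not failed.
        return "skip", "CRD not registered"
    if not cr_exists:
        # CRD present but no CR: treated as a real failure.
        return "fail", "CR missing"
    return ("pass", "ready") if cr_ready else ("fail", "not ready")


# Fresh cluster, operator chart not installed yet.
print(readiness_outcome(False, False, False, False))
```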
Day-N re-verification: Because this is a read-only check against live cluster state, re-running aicr validate --phase deployment after scale-up, upgrade, or other runtime changes is safe and answers the same “is this cluster ready for GPU workloads now?” question.
Phase Dependencies:
- Phases run sequentially when using --phase all
- If a phase fails, subsequent phases are skipped
- Use individual phases for targeted validation during specific deployment stages
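The sequential, stop-on-failure behavior listed above can be sketched as a short loop. This is an illustration only: run_phases, the callable-per-phase shape, the status strings, and the phase name "later-phase" are hypothetical. Only "deployment" is a phase name taken from this reference.

```python
from typing import Callable


def run_phases(phases: list[tuple[str, Callable[[], bool]]]) -> dict[str, str]:
    """Run phases in order; once one fails, mark the rest as skipped."""
    results: dict[str, str] = {}
    failed = False
    for name, check in phases:
        if failed:
            results[name] = "skipped"   # a previous phase failed
            continue
        ok = check()
        results[name] = "passed" if ok else "failed"
        failed = not ok
    return results


# "later-phase" never runs because "deployment" fails first.
print(run_phases([("deployment", lambda: False), ("later-phase", lambda: True)]))
```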
Constraint Format
Constraints use fully qualified measurement paths: {Type}.{Subtype}.{Key}
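To make the three-part path shape concrete, here is a small Python sketch that splits a measurement path into its parts. It rests on assumptions: the helper name parse_path is hypothetical, the example paths are made up for illustration, and letting the Key part contain further dots is a guess rather than documented behavior.

```python
from typing import NamedTuple


class MeasurementPath(NamedTuple):
    type: str
    subtype: str
    key: str


def parse_path(path: str) -> MeasurementPath:
    """Split '{Type}.{Subtype}.{Key}' into its three parts."""
    # Split on the first two dots only, so the key may keep dots (assumption).
    parts = path.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Type.Subtype.Key, got {path!r}")
    return MeasurementPath(*parts)


# Hypothetical path; the subtype and key names are illustrative.
print(parse_path("K8s.server.version"))
```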
Supported Operators
Examples:
Validate Config File Mode
aicr validate --config <path> reads inputs from an AICRConfig YAML/JSON file
under spec.validate. CLI flags always override values loaded from --config;
override events are logged at INFO so users can see which input won. The OIDC
identity token used for --push signing stays out of the schema by design
(short-lived tokens must not be committed); the CLI resolves it at sign time
through the precedence chain described on --identity-token.
Supported schema:
Examples:
Workload Scheduling
The --node-selector and --toleration flags control scheduling for validation
workloads — the inner pods that validators create to test cluster functionality
(e.g., NCCL benchmark workers, conformance test pods). They do not affect the
validator orchestrator Job, which runs lightweight check logic and is placed on
CPU-preferred nodes automatically.
When --node-selector is provided, it replaces the platform-specific selectors
that validators use by default:
When --toleration is provided, it replaces the default tolerate-all policy
(operator: Exists) on workloads that need to land on tainted GPU nodes.
Validators that use nodeName pinning (nvidia-smi, DRA isolation test) or
DRA ResourceClaims for placement (gang scheduling) are not affected by these flags.
Output Structure (CTRF JSON):
Results are output in CTRF (Common Test Report Format) — an industry-standard schema for test reporting.
Note: The tests array above is truncated for brevity. A full validation run produces one entry per check across all phases. Each entry includes stdout with detailed diagnostic output.
Test Statuses:
Exit Codes:
aicr diff
Compare two snapshots field-by-field to surface configuration drift between cluster states. Reports added, removed, and modified readings across every measurement type (K8s, GPU, OS, SystemD, NodeTopology).
Synopsis:
Flags:
Inputs:
- File paths (./baseline.yaml, /tmp/snap.json)
- ConfigMap URIs (cm://gpu-operator/aicr-snapshot)
- Both inputs may mix freely; e.g., a local baseline file vs. a live ConfigMap target.
Output Semantics:
- A nil reading is rendered as the literal <nil> so it cannot be confused with an empty-string value (""). Both forms surface as drift when one side is nil and the other is a concrete value.
- Changes are emitted in deterministic order (sorted by Path) so the diff is reproducible across runs and machines.
- The Result envelope includes baselineSource and targetSource (the supplied paths), a changes array, and a summary with added, removed, modified, and total counts.
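The output semantics above (nil rendering, path-sorted changes, summary counts) can be illustrated with a toy diff over two flat path-to-value mappings. This sketch assumes a flattened input shape and hypothetical per-change field names (path, kind, baseline, target). Only the summary counters and the <nil> literal are taken from this reference.

```python
from typing import Any

NIL = "<nil>"


def render(value: Any) -> str:
    """Render None as the literal <nil> so it differs from an empty string."""
    return NIL if value is None else str(value)


def diff(baseline: dict, target: dict) -> dict:
    """Compare two flat {path: value} mappings (hypothetical input shape)."""
    changes = []
    for path in sorted(baseline.keys() | target.keys()):  # deterministic order
        if path not in baseline:
            kind = "added"
        elif path not in target:
            kind = "removed"
        elif baseline[path] != target[path]:
            kind = "modified"
        else:
            continue
        changes.append({
            "path": path,
            "kind": kind,
            "baseline": render(baseline.get(path)),
            "target": render(target.get(path)),
        })
    summary = {k: sum(c["kind"] == k for c in changes)
               for k in ("added", "removed", "modified")}
    summary["total"] = len(changes)
    return {"changes": changes, "summary": summary}
```

Note that a nil reading against an empty string counts as a modification here, which mirrors the rule that the two forms are never treated as equal.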
Examples:
Exit Codes:
Note on CI gating: A non-zero exit can indicate drift, but does not by itself distinguish drift from malformed input; both map to exit 2. To differentiate without relying on stderr format (text by default; JSON only with --log-json), inspect the diff payload directly: write the result with --output drift.json --format json and branch on the presence of the file plus its summary.total field. That signal is format-stable regardless of logging mode.
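A CI step could apply the gating advice above with a few lines of Python. This is a sketch under assumptions: the helper name classify_diff and its three return labels are hypothetical, and it assumes the result file was written with --output drift.json --format json as described in the note.

```python
import json
from pathlib import Path


def classify_diff(result_file: str) -> str:
    """Classify a diff run from its JSON payload instead of its exit code."""
    path = Path(result_file)
    if not path.is_file():
        # No payload was written, so the run did not get as far as a diff.
        return "no-result"
    total = json.loads(path.read_text())["summary"]["total"]
    return "drift" if total > 0 else "clean"


if __name__ == "__main__":
    print(classify_diff("drift.json"))
```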
aicr bundle
Generate deployment-ready bundles from recipes containing Helm values, manifests, scripts, and documentation.
Synopsis:
Flags:
Bundle Config File Mode
The bundle command accepts the same AICRConfig format used by aicr recipe. A single file can populate both spec.recipe and spec.bundle, capturing an end-to-end workflow that can be committed to git, fetched from CI, or shared across environments.
When both spec.recipe.output.path and spec.bundle.input.recipe are set, they must reference the same path; otherwise loading fails fast.
CLI flags always override values loaded from --config. For slice/map flags (--set, --dynamic, --system-node-selector, etc.), CLI presence replaces the config’s value rather than appending. Override events are logged at INFO so users can see which input won.
Secrets: the cosign identity token is never read from a config file; supply it via --identity-token or COSIGN_IDENTITY_TOKEN.
Node Scheduling
The --accelerated-node-selector and --accelerated-node-toleration flags control scheduling for GPU-specific components:
NFD (Node Feature Discovery) workers must run on all nodes (GPU, CPU, and system) to detect hardware features. This matches the gpu-operator default behavior where NFD workers also run on control-plane nodes. The --accelerated-node-selector is intentionally not applied to NFD workers so they are not restricted to GPU nodes.
Note: When no --accelerated-node-toleration is specified, a default toleration (operator: Exists) is applied to both GPU daemonsets and NFD workers, allowing them to run on nodes with any taint.
Example:
Cluster node requirements: This example assumes the cluster has nodes labeled nodeGroup=system-worker with taints dedicated=system-workload:NoSchedule,NoExecute for system infrastructure, and GPU nodes labeled nodeGroup=gpu-worker with taints dedicated=worker-workload:NoSchedule,NoExecute.
This results in:
- GPU daemonsets (driver, device-plugin, toolkit, dcgm): nodeSelector=nodeGroup=gpu-worker + tolerations for dedicated=worker-workload with both NoSchedule and NoExecute
- NFD workers: no nodeSelector (runs on all nodes) + tolerations for dedicated=worker-workload with both NoSchedule and NoExecute
- System components (gpu-operator controller, NFD gc/master, dynamo grove, agentgateway proxy): nodeSelector=nodeGroup=system-worker + tolerations for dedicated=system-workload with both NoSchedule and NoExecute
Behavior:
- All components from the recipe are bundled automatically
- Each component creates a subdirectory in the output directory
- Components are deployed in the order specified by deploymentOrder in the recipe
Storage Class
The --storage-class flag injects a Kubernetes StorageClass name into components at bundle time. StorageClass is a cluster infrastructure detail — the right value depends on what the target cluster has provisioned, not on the recipe.
When provided, the value is written to all Helm value paths declared in the component registry under storageClassPaths, overriding any storageClassName set in recipe overlays. If a per-component --set <component>:<path>=<value> explicitly targets the same path, that value takes precedence over --storage-class.
Example:
When --storage-class is not set, any storageClassName values already defined in the recipe overlays are preserved as defaults. When it is set, --set <component>:<path>=<value> on the same path still wins — --storage-class only fills in paths that were not explicitly overridden.
If a rendered component creates a PVC at a registry-declared storageClassPaths entry and no usable storageClassName is set after overlay, --storage-class, and --set precedence is resolved, aicr bundle emits a non-blocking warning. The bundle still relies on the target cluster’s default StorageClass in that case.
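The precedence described in this section (a per-component --set, then --storage-class, then the recipe overlay, with a warning when nothing usable remains) can be summarized in a short Python sketch. The function name, its parameters, and the class names used in the example are hypothetical.

```python
from typing import Optional


def resolve_storage_class(overlay: Optional[str],
                          storage_class_flag: Optional[str],
                          set_override: Optional[str],
                          creates_pvc: bool = True) -> tuple[Optional[str], bool]:
    """Return (effective storageClassName, emit_warning) for one value path."""
    # Highest precedence first: --set, then --storage-class, then the overlay.
    for candidate in (set_override, storage_class_flag, overlay):
        if candidate:
            return candidate, False
    # Nothing usable: the cluster default StorageClass applies. The warning
    # is non-blocking and only relevant when a PVC is actually created.
    return None, creates_pvc


# Hypothetical class names: the flag overrides the overlay value.
print(resolve_storage_class("overlay-class", "flag-class", None))
```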
Deployment Methods
The --deployer flag controls how deployment artifacts are generated:
Note: --dynamic is not supported with --deployer argocd. Use --deployer argocd-helm instead, which produces a Helm chart where all values are overridable at install time.
Deployment Order:
All deployers respect the deploymentOrder field from the recipe, ensuring components are installed in the correct sequence:
- Helm: Components listed in README in deployment order
- Argo CD: Uses argocd.argoproj.io/sync-wave annotation (0 = first, 1 = second, etc.)
- Flux: Uses dependsOn references in HelmRelease CRs (each component depends on the previous component’s terminal release: its <prev>-post release when post-manifests are present, otherwise <prev>). Components with pre-manifests insert a <name>-pre release that the primary HelmRelease depends on, so the chain becomes previous → <name>-pre → <name> → <name>-post → next. The bundle’s root kustomization.yaml is a plain Kustomize file (not a Flux Kustomization CR).
- Helmfile: Uses needs: references in each release (each component depends on its predecessor)
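The Flux dependency chain described above is easier to follow when written out. The Python sketch below derives the dependsOn edges from an ordered component list. The input shape (dicts with name, pre, and post keys) and the function name release_chain are assumptions made for illustration, not the bundler's data model.

```python
def release_chain(components: list[dict]) -> dict[str, list[str]]:
    """Map each release name to the releases it depends on."""
    depends_on: dict[str, list[str]] = {}
    previous_terminal = None  # last release of the preceding component
    for comp in components:
        name = comp["name"]
        upstream = [previous_terminal] if previous_terminal else []
        if comp.get("pre"):
            # <name>-pre takes over the link to the previous component.
            depends_on[f"{name}-pre"] = upstream
            upstream = [f"{name}-pre"]
        depends_on[name] = upstream
        previous_terminal = name
        if comp.get("post"):
            # <name>-post becomes the terminal release for this component.
            depends_on[f"{name}-post"] = [name]
            previous_terminal = f"{name}-post"
    return depends_on


# Hypothetical components "a", "b", "c" in deployment order.
chain = release_chain([{"name": "a", "post": True},
                       {"name": "b", "pre": True},
                       {"name": "c"}])
print(chain)
```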
Value Overrides
Override any value in the generated bundle files using dot notation:
Format: bundler:path=value where:
- bundler: Bundler name (e.g., gpuoperator, networkoperator, certmanager, nodewright-operator, nvsentinel)
- path: Dot-separated path to the field
- value: New value to set
Behavior:
- Duplicate keys: When the same bundler:path is specified multiple times, the last value wins
- Array values: Individual array elements cannot be overridden (no [0] index syntax). Arrays can only be replaced entirely via recipe overrides, not via --set flags. Use recipe-level overrides in componentRefs[].overrides if you need to replace an entire array.
- Type conversion: String values are automatically converted to appropriate types (true/false → bool, numeric strings → numbers)
- Component enable/disable: The special enabled key controls whether a component is included in the bundle. --set <component>:enabled=false excludes the component; --set <component>:enabled=true re-enables a recipe-disabled component. The enabled key is consumed by the bundler and not passed to Helm chart values.
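The override behaviors listed above (last value wins, type conversion, the special enabled key) can be sketched as a small parser. This is an approximation under assumptions: parse_sets and convert are hypothetical names, the example paths are made up, and the exact conversion rules AICR applies (for instance case sensitivity or float formats) may differ.

```python
from typing import Any


def convert(raw: str) -> Any:
    """Convert 'true'/'false' to bool and numeric strings to numbers."""
    if raw in ("true", "false"):
        return raw == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_sets(args: list[str]) -> tuple[dict, dict]:
    """Parse bundler:path=value items into (values, enabled) mappings."""
    values: dict = {}
    enabled: dict = {}
    for arg in args:
        target, eq, raw = arg.partition("=")
        bundler, colon, path = target.partition(":")
        if not (eq and colon and bundler and path):
            raise ValueError(f"expected bundler:path=value, got {arg!r}")
        if path == "enabled":
            enabled[bundler] = convert(raw)          # consumed by the bundler
        else:
            values[(bundler, path)] = convert(raw)   # later duplicates win
    return values, enabled


values, enabled = parse_sets([
    "gpuoperator:replicas=2",
    "gpuoperator:replicas=3",        # same key again: last value wins
    "nvsentinel:enabled=false",      # excludes the component
])
print(values, enabled)
```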
Examples:
Vendoring Charts for Air-Gap
The --vendor-charts flag pulls upstream Helm chart bytes into the bundle at bundle time. With the flag set, every Helm-typed component becomes a local chart inside the generated bundle and the resulting artifact deploys end-to-end with zero registry egress. Without the flag, deploy-time helm upgrade --install calls fetch from the upstream repository — which works for connected clusters but breaks in air-gapped environments.
Bundle-time requirement: the helm binary must be on $PATH when aicr bundle --vendor-charts runs. Authentication for private chart registries flows through Helm’s own conventions:
- HTTP(S) repositories: HELM_REPOSITORY_USERNAME / HELM_REPOSITORY_PASSWORD environment variables.
- OCI registries: standard docker config (~/.docker/config.json or $DOCKER_CONFIG); run docker login <registry> ahead of time.
Tradeoff: CVE-yank fail-loud signal is lost. Non-vendored bundles fail loudly when an upstream chart version is yanked at registry time, which prompts a rebundle with a fixed recipe. Vendored bundles freeze the chart bytes at bundle creation and silently install the frozen version even after upstream yank. Treat provenance.yaml (below) as the audit surface for cross-referencing yank lists.
Bundle-time costs. Vendoring adds bundle-time network egress (the chart pull), bundle-time auth surface (private registries need credentials at the bundle host), and bundle size (typically 0.5–5 MB unpacked per chart). Users who don’t need air-gap shouldn’t set --vendor-charts and shouldn’t pay these costs.
Bundle layout with --vendor-charts — every Helm component emits a single wrapper folder (mixed components no longer split into a primary + -post pair):
provenance.yaml sits at the bundle root and lists one entry per vendored chart, using the same K8s-style apiVersion/kind shape as the rest of AICR’s persisted formats:
The sha256 field is the digest of the bytes copied into charts/, suitable for yank-list lookups and cross-bundle drift comparisons. Pipe through yq -o=json provenance.yaml if your scanner expects JSON.
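For cross-checking a provenance entry by hand, a SHA-256 digest can be recomputed with the Python standard library. The sketch assumes the vendored chart is a single file on disk; whether AICR hashes an archive or some other byte stream is not specified here, and the helper name sha256_of is hypothetical.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hex string can then be compared with the sha256 value recorded for that chart.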
Examples:
Dynamic Install-Time Values
The --dynamic flag declares value paths that are cluster-specific and should be provided at install time rather than baked into the bundle at build time. This enables building a single bundle that can be deployed to multiple clusters with different configurations.
Use --dynamic for values that genuinely vary per cluster — cluster names, subnet IDs, endpoint URLs, region-specific settings. For values that are static per bundle but differ from the recipe default (e.g., a specific driver version), use --set instead.
Attestation scope: Dynamic values are supplied at install time and are not covered by --attest. Attestation binds the shipped bundle (defaults and stubs), not operator-provided overrides. If you need to constrain dynamic values at deploy time, use admission control or Argo sync hooks; see Attestation Scope.
Format: component:path where:
- component: Component name or override key (same keys as --set, e.g., gpuoperator, alloy)
- path: Dot-separated path to the value that varies per cluster
Helm deployer behavior:
Dynamic paths are removed from values.yaml and written to a separate cluster-values.yaml per component. The generated deploy.sh passes both files to Helm:
Before deploying, fill in cluster-values.yaml with cluster-specific values.
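The split between static and dynamic values described above can be sketched as a function that moves selected dot-paths out of one nested mapping into another. This is a minimal illustration, not the bundler's code: split_dynamic is a hypothetical name, the example keys are made up, and carrying the build-time value over as the stub (or None when the path is absent) is an assumption.

```python
import copy


def split_dynamic(values: dict, dynamic_paths: list[str]) -> tuple[dict, dict]:
    """Return (static, dynamic): dynamic paths are removed from static."""
    static = copy.deepcopy(values)
    dynamic: dict = {}
    for path in dynamic_paths:
        keys = path.split(".")
        # Walk to the parent of the leaf in the static tree, if it exists.
        parent = static
        for key in keys[:-1]:
            parent = parent.get(key) if isinstance(parent, dict) else None
            if parent is None:
                break
        leaf = parent.pop(keys[-1], None) if isinstance(parent, dict) else None
        # Recreate the same path in the dynamic tree with the stub value.
        node = dynamic
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = leaf
    return static, dynamic


static, dynamic = split_dynamic(
    {"cluster": {"name": "dev", "region": "us"}, "driver": {"version": "1"}},
    ["cluster.name", "endpoint.url"],
)
print(static)   # cluster.name is gone from the static values
print(dynamic)  # cluster.name and an empty endpoint.url stub
```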
Argo CD deployer behavior:
The --deployer argocd-helm generates a Helm chart app-of-apps where all values are overridable at install time. Static values are baked into the chart as files; dynamic overrides are merged on top at render time. Use --dynamic to pre-populate specific paths in the root values.yaml:
Examples:
Bundle structure with --dynamic (Helm deployer):
Bundle structure with --dynamic (Flux deployer):
The --deployer flux bundle uses Flux’s native spec.valuesFrom to reference ConfigMaps containing dynamic values. Dynamic paths are removed from the inline spec.values and placed into a ConfigMap per component. Flux merges valuesFrom first, then inline values on top — since dynamic paths are stripped from inline values, the ConfigMap values take effect without conflicts.
Before applying the bundle to your cluster, edit each configmap-values.yaml with the correct per-cluster values:
Bundle structure with --dynamic (Helmfile deployer):
The --deployer helmfile bundle references both values.yaml (static) and cluster-values.yaml (dynamic stubs) per release. helmfile merges value files in declaration order, so cluster-values.yaml overrides on top of the generated values.yaml. Edit cluster-values.yaml per component before helmfile apply:
Argo CD Helm chart structure with --dynamic:
The --deployer argocd-helm bundle is itself a Helm chart whose templates/ create per-component Argo Applications. Each application’s helm.values block merges static values (loaded via .Files.Get for upstream-helm components, or read from the wrapped chart’s own values.yaml for local-chart components) with dynamic overrides from the parent chart’s .Values.
The same uniform NNN-<component>/ folder layout used by --deployer argocd is included at the bundle root so that path-based Argo Applications (manifest-only, kustomize-wrapped, mixed -post) can resolve their path: references against the OCI-published bundle.
Manifest-only components and mixed-component raw manifests are supported by --deployer argocd-helm via the path-based Application shape.
The bundle is URL-portable. No --repo flag is needed (and is ignored if passed with --deployer argocd-helm). The same generated bundle bytes can be pushed to any chart-source backend the user chooses — Argo CD pulls from whichever URL the user supplies at install time via helm install --set repoURL=.... The publish location is not baked into the bundle artifact.
Recommended deploy flow:
The chart’s templates/aicr-stack.yaml renders the parent Argo Application with .Values.repoURL and .Values.targetRevision substituted in. The parent Application then triggers Argo to render the chart again from the OCI source, creating the per-component child Applications with sync-wave ordering preserved. Child Applications whose source is path-based (manifest-only and mixed-component -pre / -post folders) inherit .Values.repoURL and append .Chart.Name so they pull from the same published artifact as the parent.
Argo CD OCI prerequisites. Path-based child Applications use Argo CD’s generic OCI artifact source type (introduced in Argo CD v2.13). The argocd-helm bundle therefore requires:
- Argo CD ≥ v2.13 on the target cluster.
- A registry that serves Helm-pushed OCI artifacts through the generic OCI manifest fetch path (most modern registries, including ECR, GHCR, GAR, Harbor, Artifactory, and plain oras-compatible registries, support this).
If the recipe is pure-Helm (no manifest-only / mixed components), path-based children are not exercised and the bundle can work on Argo CD versions older than v2.13. If path-based children are present, Argo CD v2.13+ is required. See the troubleshooting section below if Failed to load target state appears on aicr-stack or any <component>-pre / <component>-post Application.
helm install ./bundle from a local directory also works, but with a caveat: child Applications whose source is path-based require Argo’s repo-server to fetch the bundle from a remote (git or OCI) — there is no local-filesystem source type for an Argo Application. Local helm install is therefore end-to-end only when the recipe contains pure-Helm components. For everything else, publish first.
Bundle structure (with default Helm deployer):
Folder layout rules:
- Folders are numbered NNN-<component>/ (1-based, zero-padded). Numbering is regenerated on every bundle.
- Each folder is one of two kinds, distinguished by the presence of Chart.yaml:
  - upstream-helm: no Chart.yaml; upstream.env carries CHART/REPO/VERSION; install.sh installs the upstream chart.
  - local-helm: Chart.yaml + templates/; install.sh installs the local chart (helm upgrade --install <name> ./).
- Mixed components (Helm chart + raw manifests) emit two adjacent folders: a primary upstream-helm NNN-<name>/ and an injected (NNN+1)-<name>-post/ local-helm wrapper carrying the raw manifests. Subsequent components shift by one.
- Manifest-only components (no upstream Helm chart, just raw manifests) become a single local-helm wrapped chart.
- Kustomize-typed components run kustomize build at bundle time; the output becomes a single templates/manifest.yaml inside a local-helm folder.
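The numbering rules above, including the extra -post folder for mixed components, can be sketched as follows. The input shape (dicts with name, helm, and manifests keys), the function name, and the component names are hypothetical, and the three-digit width is inferred from the NNN placeholder. The sketch covers the default layout only, not the --vendor-charts layout.

```python
def folder_names(components: list[dict]) -> list[str]:
    """Return bundle folder names in deployment order (1-based, zero-padded)."""
    names: list[str] = []
    for comp in components:
        names.append(f"{len(names) + 1:03d}-{comp['name']}")
        if comp.get("helm") and comp.get("manifests"):
            # Mixed component: inject an adjacent -post wrapper folder,
            # which shifts every subsequent component by one.
            names.append(f"{len(names) + 1:03d}-{comp['name']}-post")
    return names


print(folder_names([
    {"name": "first", "helm": True},
    {"name": "mixed", "helm": True, "manifests": True},
    {"name": "manifests-only", "manifests": True},
]))
```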
Breaking change vs. earlier releases:
Previous releases used a flat <component>/ layout with manifests/ siblings and a --deployer helm script that branched on component kind. The new format is uniform:
- All folders carry a rendered install.sh. The top-level deploy.sh is a generic loop with no per-component branching; name-matched special-case blocks (nodewright-operator taint cleanup, kai-scheduler async timeout, orphan-CRD scan, DRA kubelet-plugin restart) live around the loop, not inside it.
- Raw manifests for mixed components now apply post-install only, via the injected -post wrapped chart. The earlier pre-apply mechanism with a CRD-race retry wrapper is gone; Helm now owns CRD ordering for mixed components natively.
- Tooling that parsed bundle paths by bare component name must account for the NNN- prefix.
Argo CD bundle structure (with --deployer argocd):
The argocd deployer uses the same uniform NNN-<component>/ folder layout as --deployer helm. Each folder carries an application.yaml whose Application shape is decided by the folder kind:
- Chart.yaml absent (KindUpstreamHelm, pure Helm components): today’s multi-source Application pointing at the upstream Helm repository plus a values $ref to the user’s git repo. Unchanged for current users.
- Chart.yaml present (KindLocalHelm: manifest-only, kustomize-wrapped, mixed -post): single-source path-based Application with source.path: NNN-<name> against the user’s repo.
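Tooling that inspects a generated bundle can apply the same Chart.yaml test to tell the two folder kinds apart. The Python sketch below is an illustration only: folder_kind and application_shape are hypothetical helpers, and the returned strings are descriptive labels rather than values AICR emits.

```python
from pathlib import Path


def folder_kind(folder: Path) -> str:
    """Classify a bundle folder by the presence of Chart.yaml."""
    return "local-helm" if (folder / "Chart.yaml").is_file() else "upstream-helm"


def application_shape(folder: Path) -> str:
    """Describe the Argo CD Application shape used for the folder kind."""
    if folder_kind(folder) == "local-helm":
        return "single-source, path-based"
    return "multi-source, upstream chart plus values ref"
```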
The argocd deployer emits only what Argo CD’s repo-server consumes: application.yaml, values.yaml (multi-source helm.valueFiles for upstream-helm, or local-chart Helm rendering for KindLocalHelm), and Chart.yaml/templates/ for KindLocalHelm. The helm-deployer orchestration files (install.sh, upstream.env, cluster-values.yaml) are stripped — Argo doesn’t run shell scripts or source shell env, and --dynamic is rejected with --deployer argocd (use --deployer argocd-helm for install-time values).
Manifest-only components (e.g., nodewright-customizations) and mixed-component raw manifests (the -post injection) are now deployed by --deployer argocd. Previously they were silently dropped. Set --repo <user-git-or-oci> to populate the repoURL on path-based Applications so Argo can resolve them.
Day 2 Options:
The --workload-gate and --workload-selector flags are day 2 operational options for cluster scaling operations:
- --workload-gate: Specifies a taint for nodewright-operator’s runtime required feature. This ensures nodes are properly configured before workloads can schedule on them during cluster scaling. The taint is configured in the nodewright-operator Helm values file at controllerManager.manager.env.runtimeRequiredTaint. For more information about runtime required, see the Nodewright documentation.
- --workload-selector: Specifies a label selector for nodewright-customizations to prevent nodewright from evicting running training jobs. This is critical for training workloads where job eviction would cause significant disruption. The selector is set in the Skyhook CR manifest (tuning.yaml) in the spec.workloadSelector.matchLabels field.
Estimated node count (--nodes):
The --nodes flag is a bundle-time option: it is applied when you run aicr bundle, not when you run aicr recipe. The value is written to each component’s Helm values at the paths declared in the registry under nodeScheduling.nodeCountPaths.
- When to use: Pass the expected or typical number of GPU nodes (e.g. size of your node pool). Use 0 (default) to leave the value unset.
- Where it goes: Components that define nodeCountPaths in the registry receive the value at those paths in their generated values.yaml.
- Example: aicr bundle -r recipe.yaml --nodes 8 -o ./bundles writes 8 to every path listed in each component’s nodeScheduling.nodeCountPaths.
Component Validation System:
AICR includes a component-driven validation system that automatically checks bundle configuration and displays warnings or errors during bundle generation. Validations are defined in the component registry and run automatically when components are included in a recipe.
How Validations Work:
- Automatic Execution: When generating a bundle, validations are automatically executed for each component in the recipe
- Condition-Based: Validations can be configured to run only when specific conditions are met (e.g., intent, service, accelerator)
- Severity Levels: Each validation can be configured as a “warning” (non-blocking) or “error” (blocking)
- Custom Messages: Each validation can include an optional detail message that provides actionable guidance
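The four properties above (automatic execution, conditions, severity, detail messages) suggest a simple rule-evaluation loop, sketched below in Python. The Validation dataclass, its field names, and the evaluate function are hypothetical and do not reflect the component registry's actual schema. The example rule is modeled on the workload-selector warning described in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Validation:
    name: str
    severity: str                                   # "warning" or "error"
    conditions: dict = field(default_factory=dict)  # e.g. {"intent": ["training"]}
    required_flag: str = ""
    detail: str = ""


def evaluate(rules: list, criteria: dict, flags_set: set) -> tuple[list, list]:
    """Return (warnings, errors) for the rules whose conditions match."""
    warnings, errors = [], []
    for rule in rules:
        applies = all(criteria.get(key) in allowed
                      for key, allowed in rule.conditions.items())
        if not applies or rule.required_flag in flags_set:
            continue  # condition not met, or the flag was provided
        message = f"{rule.name}: {rule.detail}" if rule.detail else rule.name
        (errors if rule.severity == "error" else warnings).append(message)
    return warnings, errors


rule = Validation("workload-selector", "warning", {"intent": ["training"]},
                  "--workload-selector", "set --workload-selector")
print(evaluate([rule], {"intent": "training"}, set()))
```

In this sketch only "error" results would block bundle generation; warnings are collected and reported after a successful run.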
Validation Warnings:
When generating bundles with nodewright-customizations enabled, validation warnings are displayed for missing configuration:
- Workload Selector Warning: When nodewright-customizations is enabled with training intent, if --workload-selector is not set, a warning will be displayed:
- Accelerated Selector Warning: When nodewright-customizations is enabled with training or inference intent, if --accelerated-node-selector is not set, a warning will be displayed:
Viewing Validation Warnings:
Validation warnings are displayed in the bundle output after successful generation:
Resolving Validation Warnings:
To resolve the warnings, include the appropriate flags when generating the bundle:
Examples:
Argo CD Applications use multi-source to:
- Pull Helm charts from upstream repositories
- Apply values.yaml from your GitOps repository
- Deploy additional manifests from component’s manifests/ directory (if present)
Flux OCI Mode
When using --deployer flux with OCI output (--output oci://...), AICR generates ArtifactGenerator and ExternalArtifact CRs instead of GitRepository sources for local-chart components. This allows Flux to reconcile HelmReleases directly from OCI artifacts without a Git repository.
Prerequisites (Flux v2.7+):
- source-watcher controller must be deployed (source.extensions.fluxcd.io). This controller watches ArtifactGenerator CRs and creates ExternalArtifact objects.
- ExternalArtifact=true feature gate must be enabled on helm-controller. This allows HelmRelease CRs to reference ExternalArtifact objects via spec.chartRef.
Without both prerequisites, bundles generate successfully but HelmReleases will not reconcile at deploy time.
Configuration flags:
The generated ArtifactGenerator CRs extract per-component chart directories from the outer OCIRepository into ExternalArtifact objects. Each HelmRelease then references the ExternalArtifact via spec.chartRef instead of the traditional spec.chart.spec.sourceRef pointing at a GitRepository.
Bundle Attestation
Prerequisite: The --attest flag requires a binary installed using the install script, which includes a cryptographic attestation from NVIDIA. Binaries installed via go install or manual download do not include this file and cannot use --attest.
When --attest is passed, the bundle command performs five steps:
- Verifies the binary attestation file exists: The running aicr binary must have a valid SLSA provenance file (aicr-attestation.sigstore.json) alongside it, included by the install script from a release archive. If missing, the command fails immediately with guidance on how to install correctly.
- Acquires an OIDC token: see OIDC Token Sources below.
- Verifies the binary’s own attestation: Cryptographically verifies that the SLSA provenance binds to the running binary and was signed by NVIDIA CI. This ensures only NVIDIA-built binaries can produce attested bundles.
- Signs the bundle: Creates a SLSA Build Provenance v1 in-toto statement binding the creator’s identity to the bundle content (via checksums.txt digest) and the binary that produced it.
- Writes attestation files: attestation/bundle-attestation.sigstore.json and attestation/aicr-attestation.sigstore.json are added to the bundle output.
Attestation is opt-in; bundles are unsigned by default. Signing uses Sigstore keyless signing (Fulcio CA + Rekor transparency log). For verification, see aicr verify.
OIDC Token Sources
--attest resolves an OIDC identity token from the first matching source, in order:
- --identity-token flag (or COSIGN_IDENTITY_TOKEN env): a pre-fetched token. Use this when a token is obtained out of band (e.g., from a cloud workload-identity exchange or another cosign invocation). On shared hosts prefer the env var: a flag value is visible in ps and /proc/<pid>/cmdline to any user on the same machine.
- ACTIONS_ID_TOKEN_REQUEST_URL + ACTIONS_ID_TOKEN_REQUEST_TOKEN: the ambient GitHub Actions OIDC credential. Used automatically in CI.
- --oidc-device-flow flag (or AICR_OIDC_DEVICE_FLOW env): OAuth 2.0 Device Authorization Grant (RFC 8628). The CLI prints a verification URL and short code; the user enters the code in a browser on a separate device. Use on headless hosts (bastions, remote build boxes) where the default browser callback cannot reach the machine running aicr. The host still needs outbound network access to Sigstore’s OIDC and signing endpoints.
- Interactive browser flow: opens the default browser and listens on a random localhost port for the redirect. Default on workstations.
Both interactive flows time out after 5 minutes.
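The precedence above can be sketched as a small shell function. This is an illustration of the documented order only, not AICR's implementation: the function name and its two arguments are hypothetical, and it assumes that any non-empty AICR_OIDC_DEVICE_FLOW value counts as enabled.

```shell
#!/bin/sh
# Sketch of the documented precedence only; it does not fetch or validate tokens.
# Assumption: a non-empty AICR_OIDC_DEVICE_FLOW counts as "enabled".
# $1 = value passed to --identity-token (may be empty)
# $2 = "1" if --oidc-device-flow was passed
resolve_oidc_source() {
  flag_token=$1
  flag_device_flow=$2
  if [ -n "$flag_token" ] || [ -n "${COSIGN_IDENTITY_TOKEN:-}" ]; then
    echo "identity-token"
  elif [ -n "${ACTIONS_ID_TOKEN_REQUEST_URL:-}" ] && [ -n "${ACTIONS_ID_TOKEN_REQUEST_TOKEN:-}" ]; then
    echo "github-actions"
  elif [ "$flag_device_flow" = "1" ] || [ -n "${AICR_OIDC_DEVICE_FLOW:-}" ]; then
    echo "device-flow"
  else
    echo "browser"
  fi
}

# Report which source the current environment would select with no flags set.
resolve_oidc_source "" ""
```

Because the first match wins, a COSIGN_IDENTITY_TOKEN left in the environment takes priority over the GitHub Actions credential and over both interactive flows.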
Attestation works with all deployers (helm, argocd, argocd-helm, flux). External --data files are included in checksums.txt and listed as resolved dependencies in the attestation.
Attestation Scope
Attestation binds the shipped bundle — defaults, dynamic-value stubs, and any external --data files copied into the bundle. It does not bind install-time values supplied via helm --set, a user-provided -f extra.yaml, or Argo Application.spec.source.helm.parameters. That boundary is intentional: dynamic values are the operator’s domain by design.
If you need to enforce specific install-time values (e.g., pinning driver.version), that is a policy concern, not an attestation one. Use admission control (Kyverno, Gatekeeper) or Argo sync hooks to reject deployments that violate the policy. aicr verify checks bundle integrity and provenance; it does not evaluate install-time value constraints.
Deploying a bundle
Note: deploy.sh is a convenience script, not the only deployment path. Each NNN-<component>/ folder contains a rendered install.sh that runs the exact helm upgrade --install command for manual or pipeline-driven deployment. For teardown, bundles delegate to the deployer-native uninstall path (see Bundle Uninstall below).
Deploy Script Behavior (deploy.sh)
The deploy script installs components in the order specified by deploymentOrder in the recipe.
Flags:
Unknown flags are rejected with an error to catch typos (e.g., --bes-effort or --retires N).
Note on install completion vs. workload readiness. By default, deploy.sh waits on Helm chart readiness where AICR uses helm --wait. Some components are intentionally installed without Helm chart-level waiting, and the script does not wait for bundle-level workload readiness such as Nodewright node tuning, GPU operator operand rollout (driver, toolkit, device-plugin DaemonSets), or NVIDIA DRA kubelet plugin registration. Those continue asynchronously after the script exits. When --best-effort is used, the script may also finish with non-fatal component failures; check warning lines and logs before treating the install/apply pass as fully successful. --no-wait only skips the Helm chart-level wait where AICR uses it; it does not affect bundle-level convergence.
Retry behavior:
The deploy script retries failed helm upgrade --install and kubectl apply operations with an increasing backoff delay. By default, each operation is retried up to 5 times (6 total attempts). The delay between retries grows quadratically and is capped at 120s: 5s, 20s, 45s, 80s, 120s.
Use --retries 0 to disable retries (fail-fast behavior). When --best-effort is also set, retries are exhausted first before falling through to best-effort handling.
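The documented delay sequence is consistent with a delay of 5 × n² seconds before retry n, capped at 120 seconds. The sketch below reproduces that sequence. The formula is inferred from the listed values and the function name is hypothetical, so treat it as an illustration, not as the code in deploy.sh.

```shell
#!/bin/sh
# Sketch only: reproduces the documented delay sequence, not deploy.sh itself.
# Assumption: delay = 5 * attempt^2 seconds, capped at 120 seconds.
backoff_delay() {
  attempt=$1
  delay=$((5 * attempt * attempt))
  if [ "$delay" -gt 120 ]; then
    delay=120
  fi
  echo "$delay"
}

# Print the wait before each of the 5 default retries.
for n in 1 2 3 4 5; do
  echo "retry $n: wait $(backoff_delay "$n")s"
done
```

Under these assumptions the worst case for one operation with default settings is 5 + 20 + 45 + 80 + 120 = 270 seconds of waiting, in addition to the time the attempts themselves take.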
Pre-install manifests and CRD ordering:
Some components have pre-install manifests (CRDs, namespaces, ConfigMaps) that must exist before helm install. The script applies these with kubectl apply before the Helm install. On first deploy, CRD-dependent resources may produce no matches for kind warnings because the CRD hasn’t been registered yet — these warnings are suppressed. All other kubectl apply errors (auth failures, webhook denials, bad manifests) fail the script immediately.
After helm install, the same manifests are re-applied as post-install to ensure CRD-dependent resources are created.
Async components:
Components that use operator patterns with custom resources that reconcile asynchronously (e.g., kai-scheduler) are installed without --wait to avoid Helm timing out on CR readiness.
DRA kubelet plugin registration
After installing nvidia-dra-driver-gpu, the script automatically restarts the DRA kubelet plugin daemonset. This is a best-effort mitigation for a known issue: after uninstall/reinstall, the kubelet’s plugin watcher (fsnotify) may not detect new registration sockets, causing DRA driver gpu.nvidia.com is not registered errors.
If DRA pods fail with this error after redeployment, the daemonset restart alone may not be sufficient — a node reboot is required to reset the kubelet’s plugin registration state. To reboot GPU nodes:
Bundle Uninstall
AICR bundles do not ship a generated undeploy.sh. Teardown is delegated
to the deployer-native uninstall path; AICR’s role ends at design-time
generation. Pick the walkthrough that matches the deployer used to generate
your bundle.
helm
Uninstall releases in reverse deployment order — the same order the
generated README.md lists under ## Uninstall:
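A minimal sketch of such a reverse-order loop follows. The release names and namespaces are hypothetical placeholders; the generated README.md remains the authoritative list. The script defaults to a dry run that only prints the helm uninstall commands.

```shell
#!/bin/sh
# Sketch only: release names and namespaces below are hypothetical placeholders.
# Take the real list from the "## Uninstall" section of the generated README.md.
DRY_RUN=${DRY_RUN:-1}   # default: print commands instead of running them

# One "release namespace" pair per line, in deployment order.
DEPLOY_ORDER='component-a ns-a
component-b ns-b
component-c ns-c'

# Reverse the lines without relying on tac, which is not available everywhere.
reverse_lines() {
  awk '{ lines[NR] = $0 } END { for (i = NR; i >= 1; i--) print lines[i] }'
}

uninstall_all() {
  printf '%s\n' "$DEPLOY_ORDER" | reverse_lines | while read -r release namespace; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "helm uninstall $release -n $namespace"
    else
      helm uninstall "$release" -n "$namespace"
    fi
  done
}

uninstall_all
```

Set DRY_RUN=0 to execute the commands once the printed order matches the README.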
Helm intentionally does not delete CRDs (charts that declare them under
crds/ are left in place) or PVCs (StatefulSet-managed volumes are
preserved). Remove them only when you are sure no other release depends on
them:
If a release is stuck in pending-install or pending-upgrade (interrupted
deploy), retry with --no-hooks:
See Helm 3 uninstall docs for the full flag reference.
argocd
Delete the parent Application that owns the bundle’s child Applications
(app-of-apps). AICR does not set the
resources-finalizer.argocd.argoproj.io finalizer on generated
Applications, so a plain kubectl delete removes only the Application CR
and leaves the managed resources running. Use one of the cascade-aware
flows instead:
If you can only use kubectl, add the finalizer first so the controller
performs the cascade for you:
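As a hedged sketch of that kubectl-only flow: the Application name and the argocd namespace below are assumptions, so substitute the values from your cluster. The script prints the commands by default instead of running them.

```shell
#!/bin/sh
# Sketch only: the Application name and namespace are hypothetical placeholders.
DRY_RUN=${DRY_RUN:-1}   # default: print commands instead of running them

# Merge patch that sets the Argo CD cascade finalizer on the Application.
FINALIZER_PATCH='{"metadata":{"finalizers":["resources-finalizer.argocd.argoproj.io"]}}'

cascade_delete() {
  app=$1
  namespace=${2:-argocd}
  if [ "$DRY_RUN" = "1" ]; then
    echo "kubectl patch applications.argoproj.io $app -n $namespace --type merge -p '$FINALIZER_PATCH'"
    echo "kubectl delete applications.argoproj.io $app -n $namespace"
  else
    # A merge patch replaces the whole finalizers list, so check for existing
    # finalizers first if the Application was modified after generation.
    kubectl patch applications.argoproj.io "$app" -n "$namespace" --type merge -p "$FINALIZER_PATCH" &&
      kubectl delete applications.argoproj.io "$app" -n "$namespace"
  fi
}

cascade_delete my-parent-app
```

With the finalizer in place, the delete call blocks until the controller has pruned the managed resources.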
The CRD and PVC notes from the helm walkthrough above still apply:
Argo CD does not run helm uninstall for Helm-templated children — it
renders manifests with helm template and prunes the rendered resources
directly — so CRDs declared under crds/ and PVCs from StatefulSets are
not deleted by the cascade. Remove them by hand if needed.
See the Argo CD app deletion docs for finalizer behavior, cascade modes, and selective deletion.
argocd-helm
Same path as plain argocd: Argo CD uses Helm only to render charts into
Kubernetes manifests (via helm template) and then manages those resources
itself. Deleting the Application with cascade enabled prunes the resources
Argo CD tracks; it does not run helm uninstall, and helm ls will
not show the bundle’s releases.
The kubectl + finalizer-patch fallback from the argocd walkthrough applies here too, and CRD / PVC cleanup follows the helm notes above.
See the Argo CD Helm user guide
and the Argo CD FAQ entry on helm ls
for why Helm CLI tools don’t see Argo-deployed releases.
flux
AICR’s flux bundle emits one HelmRelease per component (plus the
HelmRepository / OCIRepository source objects). Deleting each
HelmRelease from the cluster triggers helm-controller to run
helm uninstall for the underlying release, honoring the HelmRelease’s
spec.uninstall settings (disableHooks, keepHistory, etc.):
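A minimal sketch of the deletion step follows, assuming hypothetical release names and a hypothetical namespace. It uses the fully qualified resource name to avoid ambiguity with other CRDs, and prints the commands by default.

```shell
#!/bin/sh
# Sketch only: the release names and namespace are hypothetical placeholders.
DRY_RUN=${DRY_RUN:-1}   # default: print commands instead of running them

# Delete each named HelmRelease; helm-controller then performs the Helm uninstall.
# $1 = namespace, remaining arguments = HelmRelease names
delete_helmreleases() {
  namespace=$1
  shift
  for release in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "kubectl delete helmreleases.helm.toolkit.fluxcd.io $release -n $namespace"
    else
      kubectl delete helmreleases.helm.toolkit.fluxcd.io "$release" -n "$namespace"
    fi
  done
}

delete_helmreleases example-ns component-a component-b
```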
Delete the bundle’s source objects (HelmRepository / OCIRepository)
after the releases are gone. The CRD / PVC notes from the helm
walkthrough above still apply — helm-controller follows the same
non-destructive defaults.
See the Flux helm-controller uninstall reference
for spec.uninstall field semantics.
helmfile
AICR’s helmfile bundle emits a single helmfile.yaml release graph.
The upstream helmfile CLI handles teardown:
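For illustration, a teardown wrapper might look like the sketch below. The bundle directory is a hypothetical path, and the wrapper assumes helmfile.yaml sits at the top level of that directory. It prints the command by default.

```shell
#!/bin/sh
# Sketch only: assumes the bundle's helmfile.yaml is in the given directory.
DRY_RUN=${DRY_RUN:-1}   # default: print the command instead of running it

helmfile_teardown() {
  bundle_dir=${1:-.}
  if [ "$DRY_RUN" = "1" ]; then
    echo "helmfile -f $bundle_dir/helmfile.yaml destroy"
  else
    helmfile -f "$bundle_dir/helmfile.yaml" destroy
  fi
}

helmfile_teardown ./bundle
```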
CRD / PVC cleanup follows the helm walkthrough above. See the
Helmfile destroy documentation
for flags and behavior.
aicr mirror list
Discover container images and Helm charts referenced by a recipe for air-gapped
mirroring. Renders each component’s Helm chart with recipe-resolved values and
scans referenced manifests to produce a deduplicated image and chart list. When
the recipe was resolved with --data <dir>, both values and manifests are read
through the overlay so overlay-shadowed paths take precedence over embedded.
For an end-to-end walkthrough covering Hauler and Zarf workflows, see Air-Gapped Mirroring.
Synopsis:
Flags:
Examples:
aicr verify
Verify the integrity and attestation chain of a bundle. Verification is fully offline — no network calls are made.
Synopsis:
Flags:
Trust Levels
Verification steps
- Checksums: verifies all content files match checksums.txt
- Bundle attestation: cryptographic signature verified against Sigstore trusted root
- Binary attestation: provenance chain verified with identity pinned to NVIDIA CI (on-tag.yaml workflow)
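The checksum step alone can be approximated with standard tools as an independent cross-check. This sketch assumes checksums.txt uses the line format that sha256sum emits, which is not confirmed here, and it covers only file integrity; it says nothing about either attestation.

```shell
#!/bin/sh
# Sketch only. Assumption: checksums.txt uses the "<sha256>  <relative path>"
# line format that sha256sum emits; the real bundle layout may differ.
# Returns 0 when every listed file matches, non-zero otherwise.
verify_checksums() {
  bundle_dir=$1
  ( cd "$bundle_dir" && sha256sum -c checksums.txt >/dev/null 2>&1 )
}

# Usage: set BUNDLE_DIR to an unpacked bundle to run the check.
if [ -n "${BUNDLE_DIR:-}" ]; then
  if verify_checksums "$BUNDLE_DIR"; then
    echo "checksums match"
  else
    echo "checksum mismatch"
  fi
fi
```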
Examples:
Stale root: If verification fails with certificate chain errors, run aicr trust update to refresh the Sigstore trusted root.
aicr evidence digest
Print the canonical sha256 of a resolved recipe — byte-for-byte the same value recorded in predicate.recipe.digest by aicr validate --emit-attestation. The input is resolved through the same recipe builder path as aicr validate -r, so overlays and mixins are hydrated before hashing.
Use this to detect drift between a signed evidence pointer and the current recipe on a PR branch without pulling the OCI artifact.
Synopsis:
Flags:
Exit codes:
Examples:
aicr evidence verify
Verify a recipe-evidence v1 bundle produced by aicr validate --emit-attestation. When the bundle carries a signature, verifies it against the Sigstore trusted root and extracts the cryptographically anchored predicate. Recomputes every file’s sha256 against manifest.json (which the predicate’s manifest.digest field anchors), and surfaces the predicate’s fingerprint, phase counts, and BOM info.
Inline constraint replay is reserved for a follow-up PR.
Synopsis:
The positional argument is auto-detected as one of:
- recipes/evidence/<recipe>.yaml, a pointer file (the verifier fetches the OCI artifact named inside).
- ghcr.io/<owner>/aicr-evidence@sha256:... or oci://..., an OCI reference.
- ./out/summary-bundle/ (or a parent containing it), an unpacked directory.
Flags:
Exit codes:
The JSON/Markdown output’s exit field (and VerifyResult.Exit from the library API) still distinguishes the two non-zero cases as 1 (recorded phase failures) vs 2 (bundle invalid). Shell consumers can branch via jq '.exit' on --format json output.
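A shell consumer could branch on that field roughly as follows. The function name and labels are hypothetical, and treating 0 as success is an assumption based on the two non-zero cases described above.

```shell
#!/bin/sh
# Sketch only: maps the documented "exit" field values to labels.
# Assumption: 0 means the bundle verified with no recorded phase failures.
classify_exit() {
  case "$1" in
    0) echo "ok" ;;
    1) echo "recorded phase failures" ;;
    2) echo "bundle invalid" ;;
    *) echo "unknown" ;;
  esac
}

# Usage (requires jq; the input path is a placeholder):
#   code=$(aicr evidence verify ./out/summary-bundle/ --format json | jq '.exit')
#   classify_exit "$code"
classify_exit 2
```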
Examples:
See demos/evidence.md for a full producer-and-consumer walkthrough.
Stale root: If verification fails with certificate chain errors, run aicr trust update to refresh the Sigstore trusted root.
aicr trust update
Fetch the latest Sigstore trusted root from the TUF CDN and update the local cache at ~/.sigstore/root/. This is needed when Sigstore rotates signing keys (a few times per year).
Synopsis:
No flags. This command contacts tuf-repo-cdn.sigstore.dev, verifies the update chain against the embedded TUF root, and writes the result to ~/.sigstore/root/.
When to run:
- After initial installation (the install script runs this automatically)
- When aicr verify reports a stale or expired trusted root
- When Sigstore announces key rotation
Example:
aicr skill
Generate an AI agent skill file that teaches a coding agent how to use the AICR CLI. The generated file is written to the agent’s standard configuration directory.
Synopsis:
Flags:
Install Locations:
Behavior:
- Without --stdout: writes the file to disk and prints the path
- With --stdout: prints the generated content to stdout
- If the target file already exists: prompts overwrite? [y/N] when stdin is a terminal; aborts on non-interactive stdin unless --force is set
- Creates parent directories as needed
Examples:
Complete Workflow Examples
File-Based Workflow
ConfigMap-Based Workflow (Kubernetes-Native)
E2E Testing
Validate the complete workflow:
Shell Completion
Generate shell completion scripts:
Installation:
Bash:
Zsh:
Environment Variables
AICR respects standard environment variables:
Exit Codes
Common Usage Patterns
Quick Recipe Generation
Save All Steps
JSON Processing
Multiple Environments
Troubleshooting
Snapshot Fails
Recipe Not Found
Bundle Generation Fails
“helm CLI not found on PATH” with --vendor-charts — the bundle-time vendoring path shells out to helm pull. Install Helm v3 or later (brew install helm / package manager) and re-run, or drop --vendor-charts for a registry-referencing bundle. See Vendoring Charts for Air-Gap.
“failed to load manifest <path> for component <name>” — the recipe references a manifest path that does not exist in the current AICR binary’s embedded data. This usually means the recipe was generated by an older binary and a referenced manifest has since been removed or relocated. Regenerate the recipe with the current binary (aicr recipe ...) and re-bundle. AICR recipes are a point-in-time artifact of the binary that produced them; bundling a stale recipe against a newer binary is not supported.
--deployer argocd-helm: aicr-stack or <component>-pre / <component>-post Application stuck at Unknown sync status / “Failed to load target state: … <registry>/<path>:<tag>: not found” — Argo CD cannot resolve the OCI artifact the parent or path-based child Application points at. Common causes:
- Chart name doubled in --set repoURL. Under the current contract, --set repoURL carries the parent namespace only (e.g., oci://ghcr.io/myorg). The parent Application appends .Chart.Name into its OCI source.repoURL, and path-based children append it directly into their rendered source.repoURL. For non-OCI Helm repositories, the parent uses source.chart instead. Passing --set repoURL=oci://ghcr.io/myorg/aicr-bundle produces a double-suffixed reference (.../aicr-bundle/aicr-bundle:<tag>) that does not exist. Drop the trailing chart segment.
- Argo CD older than v2.13. Path-based children rely on Argo CD’s generic OCI artifact source type, added in v2.13. Older Argo treats the source as Git and fails to resolve. Check with kubectl -n argocd get deploy argocd-repo-server -o jsonpath='{.spec.template.spec.containers[0].image}'. Upgrade Argo, or use --deployer helm if Argo upgrade is not an option.
- Tag missing from the registry. Verify the published artifact exists at the exact tag the parent expects: oras manifest fetch <registry>/<path>/<chart>:<tag>. If aicr bundle is invoked without a tag (oci://<registry>/<path>/<chart> with no :<tag> suffix), the CLI version is used as the default, so make sure --set targetRevision=<chart-version> at install time matches.
- Private registry credentials keyed to a different source URL. Problem: Argo CD matches repository credentials against the source URL it dereferences. Failure case: For this deployer, path-based OCI Applications render full oci://<registry>/<path>/<chart> source URLs even though --set repoURL is the parent namespace. A Secret keyed only to <registry>/<path> or to a scheme-less Helm-OCI URL may let local helm install succeed while Argo’s repo-server still returns 401. Solution: Key the Argo CD repository credential to the rendered oci://.../<chart> prefix, or to a broader matching prefix allowed by your cluster’s credential policy, such as oci://<registry>/ or oci://<registry>/<path>/.
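The doubled chart name from the first cause is easy to reproduce with plain string handling. The sketch below mimics the documented append behavior; the function names and the v1.0.0 tag are hypothetical, and the rendering is a simplification of what the Applications actually template.

```shell
#!/bin/sh
# Sketch only: mimics how the chart name is appended to the --set repoURL value.
# $1 = repoURL, $2 = chart name, $3 = tag (hypothetical example values below)
render_oci_ref() {
  repo_url=$1
  chart=$2
  tag=$3
  echo "${repo_url%/}/${chart}:${tag}"
}

# Succeeds when repoURL already ends in the chart name (the misconfiguration).
repo_url_has_chart_suffix() {
  case "${1%/}" in
    */"$2") return 0 ;;
    *) return 1 ;;
  esac
}

render_oci_ref "oci://ghcr.io/myorg" "aicr-bundle" "v1.0.0"
# → oci://ghcr.io/myorg/aicr-bundle:v1.0.0
render_oci_ref "oci://ghcr.io/myorg/aicr-bundle" "aicr-bundle" "v1.0.0"
# → oci://ghcr.io/myorg/aicr-bundle/aicr-bundle:v1.0.0
```

A check like repo_url_has_chart_suffix can run in a pipeline before helm install to catch the doubled segment early.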
External Data Directory
The --data flag enables extending or overriding the embedded recipe data with external files. This allows customization without rebuilding the CLI.
Overview
AICR embeds recipe data (overlays, component values, registry) at compile time. The --data flag layers an external directory on top, enabling:
- Custom components: Add new components to the registry
- Override values: Replace default component values files
- Custom overlays: Add new recipe overlays for specific environments
- Registry extensions: Add custom components while preserving embedded ones
Directory Structure
The external directory must mirror the embedded data structure:
Requirements
- registry.yaml is required: The external directory must contain a registry.yaml file
- Security validations: Symlinks are rejected, file size is limited (10MB default)
- No path traversal: Paths containing .. are rejected
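A pre-flight check along these lines can catch the most common problems before invoking aicr. It is an approximation under stated assumptions: the function name is hypothetical, the 10MB limit is treated as 10 × 1024 × 1024 bytes per file, and path-traversal checks are left to AICR itself.

```shell
#!/bin/sh
# Sketch only: approximates the documented requirements; AICR's own checks
# are authoritative. Assumption: the limit is 10 * 1024 * 1024 bytes per file.
check_data_dir() {
  dir=$1
  if [ ! -f "$dir/registry.yaml" ]; then
    echo "missing registry.yaml"
    return 1
  fi
  # Any symlink anywhere under the directory is treated as a failure.
  if [ -n "$(find "$dir" -type l)" ]; then
    echo "symlinks found"
    return 1
  fi
  # -size +10485760c matches regular files larger than 10 MiB.
  if [ -n "$(find "$dir" -type f -size +10485760c)" ]; then
    echo "file larger than 10MB"
    return 1
  fi
  echo "ok"
}

# Usage: set DATA_DIR to the directory you plan to pass to --data.
if [ -n "${DATA_DIR:-}" ]; then
  check_data_dir "$DATA_DIR"
fi
```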
Merge Behavior
Usage Examples
Example: Adding a Custom Component
- Create external data directory:
- Create registry.yaml with custom component:
- Create values file for the component:
- Create overlay that includes the component:
- Generate recipe with external data:
Debugging External Data
Use the --debug flag to see detailed logging about external data loading:
Debug logs include:
- External files discovered and registered
- File source resolution (embedded vs external)
- Registry merge details (components added/overridden)
Example Files
The examples/ directory contains reference files for testing and learning:
Recipes (examples/recipes/)
Usage:
Templates (examples/templates/)
Usage:
See Also
- Installation Guide - Install aicr
- Agent Deployment - Kubernetes agent setup
- API Reference - Programmatic access
- Architecture Docs - Internal architecture
- Data Architecture - Recipe data system details