CLI Reference

Complete reference for the aicr command-line interface.

Overview

AICR provides a four-step workflow for optimizing GPU infrastructure:

┌──────────────┐      ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Snapshot   │─────▶│    Recipe    │─────▶│   Validate   │─────▶│    Bundle    │
└──────────────┘      └──────────────┘      └──────────────┘      └──────────────┘

Step 1: Capture system configuration
Step 2: Generate optimization recipes
Step 3: Validate constraints against cluster
Step 4: Create deployment bundles

Global Flags

Available for all commands:

| Flag | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --debug | | bool | false | Enable debug logging (text mode with full metadata) |
| --log-json | | bool | false | Enable JSON logging (structured output for machine parsing) |
| --help | -h | bool | false | Show help |
| --version | -v | bool | false | Show version |

Logging Modes

AICR supports three logging modes:

  1. CLI Mode (default): Minimal user-friendly output

    • Just message text without timestamps or metadata
    • Error messages display in red (ANSI color)
    • Example: Snapshot captured successfully
  2. Text Mode (--debug): Debug output with full metadata

    • Key=value format with time, level, source location
    • Example: time=2025-01-06T10:30:00.123Z level=INFO module=aicr version=v1.0.0 msg="snapshot started"
  3. JSON Mode (--log-json): Structured JSON for automation

    • Machine-readable format for log aggregation
    • Example: {"time":"2025-01-06T10:30:00.123Z","level":"INFO","msg":"snapshot started"}
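
For automation, JSON mode pairs well with standard log tooling. A minimal sketch, assuming the log records are written to stderr (hence the 2>&1) and the snapshot itself is sent to a file so only JSON log lines reach jq:

$# Surface only ERROR records from a JSON-mode run
$aicr --log-json snapshot --output snapshot.yaml 2>&1 | jq -r 'select(.level == "ERROR") | .msg'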

Examples:

$# Default: Clean CLI output
$aicr snapshot
$
$# Debug mode: Full metadata
$aicr --debug snapshot
$
$# JSON mode: Structured logs
$aicr --log-json snapshot
$
$# Combine with other flags
$aicr --debug --output system.yaml snapshot

Commands

aicr snapshot

Capture comprehensive system configuration including OS, GPU, Kubernetes, and SystemD settings.

Synopsis:

$aicr snapshot [flags]

Flags:

| Flag | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --output | -o | string | stdout | Output destination: file path, ConfigMap URI (cm://namespace/name), or stdout |
| --format | | string | yaml | Output format: json, yaml, table |
| --kubeconfig | -k | string | ~/.kube/config | Path to kubeconfig file (overrides KUBECONFIG env) |
| --namespace | -n | string | default | Kubernetes namespace for agent deployment |
| --image | | string | ghcr.io/nvidia/aicr:latest | Container image for agent Job |
| --job-name | | string | aicr | Name for the agent Job |
| --service-account-name | | string | aicr | ServiceAccount name for agent Job |
| --node-selector | | string[] | | Node selector for agent scheduling (key=value, repeatable) |
| --toleration | | string[] | all taints | Tolerations for agent scheduling (key=value:effect, repeatable). Default: all taints tolerated (uses operator: Exists). Only specify to restrict which taints are tolerated. |
| --timeout | | duration | 5m | Timeout for agent Job completion |
| --no-cleanup | | bool | false | Skip removal of Job and RBAC resources on completion. Warning: leaves a cluster-admin ClusterRoleBinding active. |
| --privileged | | bool | true | Run agent in privileged mode (required for GPU/SystemD collectors). Set to false for PSS-restricted namespaces. |
| --image-pull-secret | | string[] | | Image pull secrets for private registries (repeatable) |
| --require-gpu | | bool | false | Require GPU resources on the agent pod (mutually exclusive with --runtime-class) |
| --runtime-class | | string | | Runtime class for GPU access without consuming a GPU allocation (e.g., nvidia). Mutually exclusive with --require-gpu. |
| --template | | string | | Path to Go template file for custom output formatting (requires YAML format) |
| --max-nodes-per-entry | | int | 0 | Maximum node names per taint/label entry in topology collection (0 = unlimited) |
| --os | | string | | Node OS family (ubuntu, rhel, cos, amazonlinux, talos). Selects the per-OS pod configuration and in-pod service collector backend. talos skips the /run/systemd and /etc/os-release hostPath mounts and uses the Kubernetes-API service backend. Reads AICR_OS env when unset. |

Output Destinations:

  • stdout: Default when no -o flag specified
  • File: Local file path (/path/to/snapshot.yaml)
  • ConfigMap: Kubernetes ConfigMap URI (cm://namespace/configmap-name)

What it captures:

  • SystemD Services: containerd, docker, kubelet configurations
  • OS Configuration: grub, kmod, sysctl, release info
  • Kubernetes: server version, images, ClusterPolicy
  • GPU: driver version, CUDA, MIG settings, hardware info
  • NodeTopology: cluster-wide taints and labels across all nodes

Examples:

$# Output to stdout (YAML)
$aicr snapshot
$
$# Save to file (JSON)
$aicr snapshot --output system.json --format json
$
$# Save to Kubernetes ConfigMap (requires cluster access)
$aicr snapshot --output cm://gpu-operator/aicr-snapshot
$
$# Debug mode
$aicr --debug snapshot
$
$# Table format (human-readable)
$aicr snapshot --format table
$
$# With custom kubeconfig
$aicr snapshot --kubeconfig ~/.kube/prod-cluster
$
$# Targeting specific nodes
$aicr snapshot \
> --namespace gpu-operator \
> --node-selector accelerator=nvidia-h100 \
> --node-selector zone=us-west1-a
$
$# With tolerations for tainted nodes
$# (By default all taints are tolerated - only needed to restrict tolerations)
$aicr snapshot \
> --toleration nvidia.com/gpu=present:NoSchedule
$
$# Full example with all options
$aicr snapshot \
> --kubeconfig ~/.kube/config \
> --namespace gpu-operator \
> --image ghcr.io/nvidia/aicr:v0.8.0 \
> --job-name snapshot-gpu-nodes \
> --service-account-name aicr \
> --node-selector accelerator=nvidia-h100 \
> --toleration nvidia.com/gpu:NoSchedule \
> --timeout 10m \
> --output cm://gpu-operator/aicr-snapshot \
> --no-cleanup
$
$# Custom template formatting
$aicr snapshot --template examples/templates/snapshot-template.md.tmpl
$
$# Template with file output
$aicr snapshot --template examples/templates/snapshot-template.md.tmpl --output report.md
$
$# With custom template
$aicr snapshot \
> --namespace gpu-operator \
> --template examples/templates/snapshot-template.md.tmpl \
> --output cluster-report.yaml

Custom Templates

The --template flag enables custom output formatting using Go templates with Sprig functions. Templates receive the full Snapshot struct:

# Available template data structure:
.Kind           # Resource kind ("Snapshot")
.APIVersion     # API version string
.Metadata       # Map of key-value pairs (timestamp, version, source-node)
.Measurements   # Array of Measurement objects
  .Type         # Measurement type (K8s, GPU, OS, SystemD, NodeTopology)
  .Subtypes     # Array of Subtype objects
    .Name       # Subtype name (e.g., "server", "smi", "grub")
    .Data       # Map of readings (key -> Reading with .String method)

# NodeTopology measurement type has subtypes: summary, taint, label
# Taint encoding: effect|value|node1,node2,... (parseable with Sprig splitList "|")
# Label encoding: value|node1,node2,...

Example template extracting key cluster info:

cluster:
  kubernetes: {{ with index .Measurements 0 }}{{ range .Subtypes }}{{ if eq .Name "server" }}
    version: {{ (index .Data "version").String }}{{ end }}{{ end }}{{ end }}
  gpu: {{ range .Measurements }}{{ if eq .Type.String "GPU" }}{{ range .Subtypes }}{{ if eq .Name "smi" }}
    model: {{ (index .Data "gpu.model").String }}
    count: {{ (index .Data "gpu-count").String }}{{ end }}{{ end }}{{ end }}{{ end }}

See examples/templates/snapshot-template.md.tmpl for a complete example template that generates a concise cluster report.

Agent Deployment Mode

When running against a cluster, AICR deploys a Kubernetes Job to capture the snapshot:

  1. Deploys RBAC: ServiceAccount, Role, RoleBinding, ClusterRole, ClusterRoleBinding
  2. Creates Job: Runs aicr snapshot as a container on the target node
  3. Waits for completion: Monitors Job status with configurable timeout
  4. Retrieves snapshot: Reads snapshot from ConfigMap after Job completes
  5. Writes output: Saves snapshot to specified output destination
  6. Cleanup: Deletes Job and RBAC resources (use --no-cleanup to keep for debugging)
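
When a run fails and --no-cleanup was set, the leftover resources can be inspected with ordinary kubectl commands. A minimal sketch, assuming the default --job-name aicr and --namespace default:

$# Inspect the agent Job and its pod logs after a failed run
$kubectl get jobs,pods -n default -l job-name=aicr
$kubectl logs -n default job/aicr
$
$# Delete the Job manually when finished (the RBAC objects, including the ClusterRoleBinding, also remain)
$kubectl delete job -n default aicr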

Benefits of agent deployment:

  • Capture configuration from actual cluster nodes (not local machine)
  • No need to run kubectl manually
  • Programmatic deployment for automation/CI/CD
  • Reusable RBAC resources across multiple runs

Agent deployment requirements:

  • Kubernetes cluster access (via kubeconfig)
  • Cluster admin permissions (for RBAC creation)
  • GPU nodes with nvidia-smi (for GPU metrics)

ConfigMap Output

When using ConfigMap URIs (cm://namespace/name), the snapshot is stored directly in Kubernetes:

apiVersion: v1
kind: ConfigMap
metadata:
  name: aicr-snapshot
  namespace: gpu-operator
  labels:
    app.kubernetes.io/name: aicr
    app.kubernetes.io/component: snapshot
    app.kubernetes.io/version: v0.17.0
data:
  snapshot.yaml: |
    # Full snapshot content
  format: yaml
  timestamp: "2025-12-31T10:30:00Z"

Snapshot Structure:

apiVersion: aicr.nvidia.com/v1alpha1
kind: Snapshot
metadata:
  created: "2025-12-31T10:30:00Z"
  hostname: gpu-node-1
measurements:
  - type: SystemD
    subtypes: [...]
  - type: OS
    subtypes: [...]
  - type: K8s
    subtypes: [...]
  - type: GPU
    subtypes: [...]

aicr recipe

Generate optimized configuration recipes from query parameters or captured snapshots.

Synopsis:

$aicr recipe [flags]

Modes:

aicr recipe accepts criteria from an AICRConfig file (Config File Mode), from direct query flags (Query Mode), or from a captured snapshot (Snapshot Mode).

Config File Mode

Generate recipes using an AICRConfig document. The same file format also drives the bundle command, so a single file can describe an end-to-end recipe-to-bundle workflow.

Flags:

| Flag | Short | Type | Description |
| --- | --- | --- | --- |
| --config | | string | Path or HTTP/HTTPS URL to an AICRConfig file (YAML/JSON) |
| --output | -o | string | Output file (default: stdout) |
| --format | -f | string | Format: json, yaml (default: yaml) |
| --data | | string | External data directory to overlay on embedded data (see External Data) |

The config file uses a Kubernetes-style envelope:

kind: AICRConfig
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: gb200-eks-ubuntu-training
spec:
  recipe:
    criteria:
      service: eks
      os: ubuntu
      accelerator: gb200
      intent: training
      nodes: 8
    output:
      path: recipe.yaml
      format: yaml

Individual CLI flags always override config file values. For slice/map flags, presence on the CLI replaces the file’s value (no append).

$# Load criteria from config file
$aicr recipe --config config.yaml
$
$# Override service from file
$aicr recipe --config config.yaml --service gke
$
$# Save output to file
$aicr recipe --config config.yaml -o recipe.yaml
$
$# Load config from a URL (e.g. CI shared template)
$aicr recipe --config https://team.example.com/configs/eks-h100-training.yaml

--config accepts a local file path or an HTTP/HTTPS URL. ConfigMap (cm://) sources are not supported; export the data with kubectl get cm <name> -o yaml and pass the resulting file.
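
A minimal sketch of that workaround, assuming the payload sits under a data key named config.yaml (the ConfigMap name, namespace, and key here are placeholders):

$# Extract the AICRConfig payload from a ConfigMap and feed it to --config
$kubectl get cm my-aicr-config -n default -o jsonpath='{.data.config\.yaml}' > config.yaml
$aicr recipe --config config.yaml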

Query Mode

Generate recipes using direct system parameters:

Flags:

| Flag | Short | Type | Description |
| --- | --- | --- | --- |
| --service | | string | K8s service: eks, gke, aks, oke, kind, lke |
| --accelerator | --gpu | string | Accelerator/GPU type: h100, gb200, b200, a100, l40, rtx-pro-6000 |
| --intent | | string | Workload intent: training, inference |
| --os | | string | OS family: ubuntu, rhel, cos, amazonlinux, talos |
| --platform | | string | Platform/framework type: dynamo, kubeflow, nim |
| --nodes | | int | Number of GPU nodes in the cluster |
| --output | -o | string | Output file (default: stdout) |
| --format | -f | string | Format: json, yaml (default: yaml) |
| --data | | string | External data directory to overlay on embedded data (see External Data) |

Examples:

$# Basic recipe for Ubuntu on EKS with H100
$aicr recipe --os ubuntu --service eks --accelerator h100
$
$# Training workload with multiple GPU nodes
$aicr recipe \
> --service eks \
> --accelerator gb200 \
> --intent training \
> --os ubuntu \
> --nodes 8 \
> --format yaml
$
$# Kubeflow training workload
$aicr recipe \
> --service eks \
> --accelerator h100 \
> --intent training \
> --os ubuntu \
> --platform kubeflow
$
$# Save to file (--gpu is an alias for --accelerator)
$aicr recipe --os ubuntu --gpu h100 --output recipe.yaml

Snapshot Mode

Generate recipes from captured snapshots:

Flags:

| Flag | Short | Type | Description |
| --- | --- | --- | --- |
| --snapshot | -s | string | Path/URI to snapshot (file path, URL, or cm://namespace/name) |
| --intent | -i | string | Workload intent: training, inference |
| --output | -o | string | Output destination (file, ConfigMap URI, or stdout) |
| --format | | string | Format: json, yaml (default: yaml) |
| --kubeconfig | -k | string | Path to kubeconfig file (for ConfigMap URIs, overrides KUBECONFIG env) |

Snapshot Sources:

  • File: Local file path (./snapshot.yaml)
  • URL: HTTP/HTTPS URL (https://example.com/snapshot.yaml)
  • ConfigMap: Kubernetes ConfigMap URI (cm://namespace/configmap-name)

Examples:

$# Generate recipe from local snapshot file
$aicr recipe --snapshot system.yaml --intent training
$
$# From ConfigMap (requires cluster access)
$aicr recipe --snapshot cm://gpu-operator/aicr-snapshot --intent training
$
$# From ConfigMap with custom kubeconfig
$aicr recipe \
> --snapshot cm://gpu-operator/aicr-snapshot \
> --kubeconfig ~/.kube/prod-cluster \
> --intent training
$
$# Output to ConfigMap
$aicr recipe -s system.yaml -o cm://gpu-operator/aicr-recipe
$
$# Chain snapshot → recipe with ConfigMaps
$aicr snapshot -o cm://default/snapshot
$aicr recipe -s cm://default/snapshot -o cm://default/recipe
$
$# With custom output
$aicr recipe -s system.yaml -i inference -o recipe.yaml --format yaml

Output structure:

apiVersion: aicr.nvidia.com/v1alpha1
kind: Recipe
metadata:
  version: v1.0.0
  created: "2025-12-31T10:30:00Z"
  appliedOverlays:
    - base
    - eks
    - eks-training
    - gb200-eks-training
criteria:
  service: eks
  accelerator: gb200
  intent: training
  os: any
componentRefs:
  - name: gpu-operator
    version: v25.3.3
    order: 1
    repository: https://helm.ngc.nvidia.com/nvidia
constraints:
  driver:
    version: "580.82.07"
    cudaVersion: "13.1"

aicr query

Query a specific value from the fully hydrated recipe configuration. Resolves a recipe from criteria (same as aicr recipe), merges all base, overlay, and inline value overrides, then extracts the value at the given dot-path selector.

Synopsis:

$aicr query --selector <path> [flags]

Flags:

All aicr recipe flags are supported, plus:

| Flag | Type | Description |
| --- | --- | --- |
| --selector | string | Required. Dot-path to the configuration value to extract |

Selector Syntax

Uses dot-delimited paths consistent with Helm --set and yq:

| Selector | Returns |
| --- | --- |
| components.<name>.values.<path> | Hydrated Helm value (scalar or subtree) |
| components.<name>.chart | Component metadata field |
| components.<name> | Entire hydrated component block |
| criteria.<field> | Recipe criteria field |
| deploymentOrder | Component deployment order list |
| constraints | Merged constraint list |
| . or empty | Entire hydrated recipe |
Leading dots are optional (yq-style): .components.gpu-operator.chart and components.gpu-operator.chart are equivalent.

Output:

  • Scalar values (string, number, bool) are printed as plain text — no YAML wrapper
  • Complex values (maps, lists) are printed as YAML (default) or JSON (--format json)

Examples:

$# Get a specific Helm value
$aicr query --service eks --accelerator h100 --intent training \
> --selector components.gpu-operator.values.driver.version
$# stdout: 570.86.16
$
$# Get a value subtree
$aicr query --service eks --accelerator h100 --intent training \
> --selector components.gpu-operator.values.driver
$# stdout:
$# version: "570.86.16"
$# repository: nvcr.io/nvidia
$
$# Get the full hydrated component
$aicr query --service eks --accelerator h100 --intent training \
> --selector components.gpu-operator
$
$# Get deployment order
$aicr query --service eks --accelerator h100 --intent training \
> --selector deploymentOrder
$
$# Use in shell scripts
$VERSION=$(aicr query --service eks --accelerator h100 --intent training \
> --selector components.gpu-operator.values.driver.version)
$echo "Driver version: $VERSION"
$
$# JSON output for complex values
$aicr query --service eks --accelerator h100 --intent training \
> --selector components.gpu-operator.values --format json
$
$# Query from snapshot
$aicr query --snapshot snapshot.yaml \
> --selector components.gpu-operator.values.driver.version
$
$# Full hydrated recipe
$aicr query --service eks --accelerator h100 --intent training --selector .

Advanced Examples:

$# Cross-cloud comparison: how Prometheus storage differs between EKS and GKE
$# EKS provisions a 50Gi persistent EBS volume (gp2)
$aicr query --service eks --intent training \
> --selector components.kube-prometheus-stack.values.prometheus.prometheusSpec.storageSpec
$# GKE uses a 10Gi ephemeral emptyDir (GMP handles long-term retention)
$aicr query --service gke --intent training \
> --selector components.kube-prometheus-stack.values.prometheus.prometheusSpec.storageSpec
$
$# Compare deployment order across clouds
$# EKS deploys 12 components (includes aws-ebs-csi-driver, aws-efa, nodewright-customizations)
$aicr query --service eks --accelerator h100 --intent training --selector deploymentOrder
$# GKE deploys 9 components (storage and networking are platform-managed)
$aicr query --service gke --accelerator h100 --intent training --selector deploymentOrder
$
$# Pin the exact driver version into Terraform/Pulumi variables
$DRIVER_VERSION=$(aicr query --service eks --accelerator h100 --intent training \
> --selector components.gpu-operator.values.driver.version)
$echo "gpu_driver_version = \"${DRIVER_VERSION}\""
$
$# Compare nodewright tuning parameters across accelerators
$# H100: real tuning packages (kernel setup, nvidia-tuned, full setup)
$aicr query --service eks --accelerator h100 --intent training \
> --selector components.nodewright-customizations.values
$# GB200: same value structure, but manifest renders a no-op (ARM64 packages pending)
$aicr query --service eks --accelerator gb200 --intent training \
> --selector components.nodewright-customizations.values
$
$# Watch constraints tighten as you add specificity
$# Just "EKS" → 1 constraint (K8s >= 1.28)
$aicr query --service eks --selector constraints
$# Add GPU + intent + OS → 4 constraints (K8s >= 1.32.4, Ubuntu 24.04, kernel >= 6.8)
$aicr query --service eks --accelerator h100 --intent training --os ubuntu \
> --selector constraints

aicr validate

Validate a system snapshot against the constraints defined in a recipe to verify cluster compatibility. Validation can run as a single pass or as individual phases that target different stages of deployment.

For a task-oriented walkthrough (capture snapshot → generate recipe → run each phase, with worked training and inference examples), see Validation.

Synopsis:

$aicr validate [flags]

Flags:

| Flag | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --recipe | -r | string | (required) | Path/URI to recipe file containing constraints |
| --snapshot | -s | string | | Path/URI to snapshot file containing measurements (omit to capture live) |
| --phase | | string[] | all | Validation phase to run: deployment, performance, conformance, all (repeatable) |
| --fail-on-error | | bool | true | Exit with non-zero status if any constraint fails |
| --output | -o | string | stdout | Output destination (file or stdout) |
| --kubeconfig | -k | string | ~/.kube/config | Path to kubeconfig file (for ConfigMap URIs) |
| --namespace | -n | string | aicr-validation | Kubernetes namespace for validation Job deployment |
| --image | | string | ghcr.io/nvidia/aicr:latest | Container image for validation Job |
| --image-pull-secret | | string[] | | Image pull secrets for private registries (repeatable) |
| --job-name | | string | aicr-validate | Name for the validation Job |
| --service-account-name | | string | aicr | ServiceAccount name for validation Job |
| --node-selector | | string[] | | Override GPU node selection for validation workloads. Replaces platform-specific selectors (e.g., cloud.google.com/gke-accelerator, node.kubernetes.io/instance-type) on inner workloads like NCCL benchmark pods. Use when GPU nodes have non-standard labels. Does not affect the validator orchestrator Job. (format: key=value, repeatable) |
| --toleration | | string[] | | Override tolerations for validation workloads. Replaces the default tolerate-all policy on inner workloads like NCCL benchmark pods and conformance test pods. Does not affect the validator orchestrator Job. (format: key=value:effect, repeatable) |
| --timeout | | duration | 5m | Timeout for validation Job completion |
| --no-cleanup | | bool | false | Skip removal of Job and RBAC resources on completion |
| --require-gpu | | bool | false | Require GPU resources on the validation pod |
| --no-cluster | | bool | false | Skip cluster access (test mode): skips RBAC and Job deployment, reports checks as skipped |
| --evidence-dir | | string | | Directory to write conformance evidence artifacts |
| --cncf-submission | | bool | false | Generate CNCF conformance submission artifacts |
| --feature | -f | string[] | | Feature flags for validation (repeatable) |
| --data | | string | | External data directory to overlay on embedded data |

Input Sources:

  • File: Local file path (./recipe.yaml, ./snapshot.yaml)
  • URL: HTTP/HTTPS URL (https://example.com/recipe.yaml)
  • ConfigMap: Kubernetes ConfigMap URI (cm://namespace/configmap-name)

Validation Phases

Validation can be run in separate phases, each covering a different aspect of the deployment:

| Phase | Description | When to Run |
| --- | --- | --- |
| deployment | Validates component deployment completeness plus post-install GPU readiness signals (see below) | After deploying components |
| performance | Validates system performance and network fabric health | After components are running |
| conformance | Validates workload-specific requirements and conformance | Before running production workloads |
| all | Runs all phases sequentially with dependency logic | Complete end-to-end validation |

Note: Readiness constraints (K8s version, OS, kernel) are always evaluated implicitly before any phase runs. If readiness fails, validation stops before deploying any Jobs.

Deployment phase checks:

The deployment phase verifies that the cluster is actually ready for GPU workloads — not just that install commands returned successfully. It covers:

  • Enabled component namespaces are Active.
  • Declared expectedResources (Deployments, DaemonSets, etc.) exist and are healthy.
  • When nodewright-customizations is enabled: every Skyhook CR the recipe declares reports status.status == complete. The set of expected CR names is extracted from the recipe’s own ComponentRef.ManifestFiles for this component, so unrelated Skyhook CRs on the cluster (stale from prior deploys, or owned by another tenant) are ignored. If no Skyhook names can be extracted from those ManifestFiles, deployment validation fails closed as a recipe/configuration error instead of skipping.
  • When gpu-operator is enabled: ClusterPolicy reports status.state == ready.
  • When nvidia-dra-driver-gpu is enabled: the kubelet-plugin DaemonSet is ready. Discovery is by the upstream chart’s role-suffix convention — the validator finds the single DaemonSet in the component namespace whose name ends in -kubelet-plugin, so custom fullnameOverride values are handled automatically.

Graceful skip: If a component is declared in the recipe but its CRD is not yet registered on the cluster (e.g., fresh cluster, operator chart not installed), the corresponding readiness check is skipped rather than failing. Once the CRD is present, the check runs and a missing CR is treated as a real failure — for example, if the gpu-operator CRD is registered but no ClusterPolicy CR exists, deployment validation fails with a “CR missing” diagnostic rather than silently passing. Other errors still fail closed: an RBAC denial on skyhooks.skyhook.nvidia.com returns HTTP 403 (not a NoMatch), so the validator surfaces it as a failure instead of silently skipping the Skyhook check.

Day-N re-verification: Because this is a read-only check against live cluster state, re-running aicr validate --phase deployment after scale-up, upgrade, or other runtime changes is safe and answers the same “is this cluster ready for GPU workloads now?” question.
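
Because the check is read-only and the command exits non-zero on failure (see Exit Codes below), it drops straight into a cron job or CI step. A minimal sketch; the notification hook is a placeholder:

$# Re-verify GPU readiness after scale-up or upgrade and alert on failure
$if ! aicr validate --recipe recipe.yaml --phase deployment --output deployment-check.json; then
>   echo "GPU readiness validation failed" | ./notify-oncall.sh   # placeholder for your alerting hook
> fi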

Phase Dependencies:

  • Phases run sequentially when using --phase all
  • If a phase fails, subsequent phases are skipped
  • Use individual phases for targeted validation during specific deployment stages

Constraint Format

Constraints use fully qualified measurement paths: {Type}.{Subtype}.{Key}

| Constraint Path | Description |
| --- | --- |
| K8s.server.version | Kubernetes server version |
| OS.release.ID | Operating system identifier (ubuntu, rhel) |
| OS.release.VERSION_ID | OS version (24.04, 22.04) |
| OS.sysctl./proc/sys/kernel/osrelease | Kernel version |
| GPU.info.type | GPU hardware type |

Supported Operators

| Operator | Example | Description |
| --- | --- | --- |
| >= | >= 1.30 | Greater than or equal (version comparison) |
| <= | <= 1.33 | Less than or equal (version comparison) |
| > | > 1.30 | Greater than (version comparison) |
| < | < 2.0 | Less than (version comparison) |
| == | == ubuntu | Explicit equality |
| != | != rhel | Not equal |
| (none) | ubuntu | Exact string match |
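
To illustrate how paths and operators pair up, a hypothetical constraint list might look like the sketch below. The exact field names in a generated recipe may differ; treat this as notation only and check the constraints block of your own recipe for the real schema:

constraints:
  - path: K8s.server.version
    value: ">= 1.32.4"
  - path: OS.release.ID
    value: ubuntu
  - path: OS.release.VERSION_ID
    value: ">= 24.04"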

Examples:

$# Validate snapshot against recipe (readiness constraints run implicitly)
$aicr validate --recipe recipe.yaml --snapshot snapshot.yaml
$
$# Validate specific phase
$aicr validate \
> --recipe recipe.yaml \
> --snapshot snapshot.yaml \
> --phase deployment
$
$# Run all validation phases
$aicr validate \
> --recipe recipe.yaml \
> --snapshot snapshot.yaml \
> --phase all
$
$# Load snapshot from ConfigMap
$aicr validate \
> --recipe recipe.yaml \
> --snapshot cm://gpu-operator/aicr-snapshot
$
$# Save results to file
$aicr validate \
> --recipe recipe.yaml \
> --snapshot cm://gpu-operator/aicr-snapshot \
> --output validation-results.json
$
$# Validate deployment phase after components are installed
$aicr validate \
> --recipe recipe.yaml \
> --snapshot snapshot.yaml \
> --phase deployment
$
$# Run performance validation
$aicr validate \
> --recipe recipe.yaml \
> --snapshot snapshot.yaml \
> --phase performance
$
$# With custom kubeconfig
$aicr validate \
> --recipe recipe.yaml \
> --snapshot cm://gpu-operator/aicr-snapshot \
> --kubeconfig ~/.kube/prod-cluster
$
$# Validate on a cluster with custom GPU node labels (non-standard labels that AICR doesn't
$# recognize by default, e.g., using a custom node pool label instead of cloud-provider defaults)
$aicr validate \
> --recipe recipe.yaml \
> --node-selector my-org/gpu-pool=true \
> --phase performance
$
$# Override both node selector and tolerations for a non-standard taint setup
$aicr validate \
> --recipe recipe.yaml \
> --node-selector gpu-type=h100 \
> --toleration gpu-type=h100:NoSchedule

Workload Scheduling

The --node-selector and --toleration flags control scheduling for validation workloads — the inner pods that validators create to test cluster functionality (e.g., NCCL benchmark workers, conformance test pods). They do not affect the validator orchestrator Job, which runs lightweight check logic and is placed on CPU-preferred nodes automatically.

When --node-selector is provided, it replaces the platform-specific selectors that validators use by default:

| Platform | Default Selector (replaced) | Use Case |
| --- | --- | --- |
| GKE | cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb | Non-standard GPU node pool labels |
| EKS | node.kubernetes.io/instance-type: <discovered> | Custom node pool labels |

When --toleration is provided, it replaces the default tolerate-all policy (operator: Exists) on workloads that need to land on tainted GPU nodes.

Validators that use nodeName pinning (nvidia-smi, DRA isolation test) or DRA ResourceClaims for placement (gang scheduling) are not affected by these flags.

Output Structure (CTRF JSON):

Results are output in CTRF (Common Test Report Format) — an industry-standard schema for test reporting.

{
  "reportFormat": "CTRF",
  "specVersion": "0.0.1",
  "timestamp": "2026-03-10T20:10:44Z",
  "generatedBy": "aicr",
  "results": {
    "tool": {
      "name": "aicr",
      "version": "v0.10.3-next"
    },
    "summary": {
      "tests": 16,
      "passed": 13,
      "failed": 0,
      "skipped": 3,
      "pending": 0,
      "other": 0,
      "start": 1773173400872,
      "stop": 1773173799002
    },
    "tests": [
      {
        "name": "operator-health",
        "status": "passed",
        "duration": 0,
        "suite": ["deployment"],
        "stdout": ["Found 1 gpu-operator pod(s)", "Running: 1/1"]
      },
      {
        "name": "expected-resources",
        "status": "passed",
        "duration": 0,
        "suite": ["deployment"],
        "stdout": ["All deployment resources and required readiness signals are healthy"]
      },
      {
        "name": "nccl-all-reduce-bw",
        "status": "passed",
        "duration": 234000,
        "suite": ["performance"],
        "stdout": ["NCCL All Reduce bandwidth: 488.37 GB/s", "Constraint: >= 100 → true"]
      },
      {
        "name": "inference-perf",
        "status": "passed",
        "duration": 612000,
        "suite": ["performance"],
        "stdout": [
          "RESULT: Inference throughput: 37961.24 tokens/sec",
          "RESULT: Inference TTFT p99: 146.30 ms",
          "Throughput constraint: >= 5000 → PASS",
          "TTFT p99 constraint: <= 200 → PASS"
        ]
      },
      {
        "name": "dra-support",
        "status": "passed",
        "duration": 8000,
        "suite": ["conformance"],
        "stdout": ["DRA GPU allocation successful"]
      },
      {
        "name": "cluster-autoscaling",
        "status": "skipped",
        "duration": 0,
        "suite": ["conformance"],
        "stdout": ["SKIP reason=\"Karpenter not found\""]
      }
    ]
  }
}

Note: The tests array above is truncated for brevity. A full validation run produces one entry per check across all phases. Each entry includes stdout with detailed diagnostic output.

Test Statuses:

| Status | Description |
| --- | --- |
| passed | Check or constraint passed |
| failed | Check or constraint failed |
| skipped | Check could not be evaluated (missing data, no-cluster mode) |
| other | Unexpected outcome (crash, OOM, timeout) |
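
Because the report is plain CTRF JSON, non-passing checks can be pulled out with standard tooling. A minimal jq sketch, assuming the results were saved with --output validation-results.json:

$# List every check that did not pass, with its phase and status
$jq -r '.results.tests[] | select(.status != "passed") | "\(.suite[0])/\(.name): \(.status)"' validation-results.json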

Exit Codes:

| Code | Description |
| --- | --- |
| 0 | All checks passed |
| 2 | Invalid input (bad flags, missing recipe) |
| 8 | One or more checks failed (when --fail-on-error is set) |

aicr bundle

Generate deployment-ready bundles from recipes containing Helm values, manifests, scripts, and documentation.

Synopsis:

$aicr bundle [flags]

Flags:

| Flag | Short | Type | Description |
| --- | --- | --- | --- |
| --recipe | -r | string | Path to recipe file (required, or via spec.bundle.input.recipe in --config) |
| --config | | string | Path or HTTP/HTTPS URL to an AICRConfig file (YAML/JSON). CLI flags override values from this file. See Bundle Config File Mode. |
| --output | -o | string | Output directory (default: current dir) |
| --deployer | -d | string | Deployment method: helm (default), argocd, or argocd-helm |
| --repo | | string | Git/OCI repository URL baked into Argo CD Application sources. Used with --deployer argocd. Ignored with --deployer argocd-helm (that bundle is URL-portable — the URL is supplied at helm install time via --set repoURL=...); a warning is logged if passed. |
| --set | | string[] | Override values in bundle files (repeatable). Use the enabled key to include/exclude components (e.g., --set awsebscsidriver:enabled=false) |
| --dynamic | | string[] | Declare value paths as install-time parameters (repeatable, format: component:path). Supported with helm and argocd-helm deployers. See Dynamic Install-Time Values. |
| --data | | string | External data directory to overlay on embedded data (see External Data) |
| --system-node-selector | | string[] | Node selector for system components (format: key=value, repeatable) |
| --system-node-toleration | | string[] | Toleration for system components (format: key=value:effect, repeatable) |
| --accelerated-node-selector | | string[] | Node selector for accelerated/GPU nodes (format: key=value, repeatable) |
| --accelerated-node-toleration | | string[] | Toleration for accelerated/GPU nodes (format: key=value:effect, repeatable) |
| --workload-gate | | string | Taint for nodewright-operator runtime required (format: key=value:effect or key:effect). This is a day 2 option for cluster scaling operations. |
| --workload-selector | | string[] | Label selector for nodewright-customizations to prevent eviction of running training jobs (format: key=value, repeatable). Required when nodewright-customizations is enabled with training intent. |
| --nodes | | int | Estimated number of GPU nodes (default: 0 = unset). At bundle time, written to Helm value paths declared in the registry under nodeScheduling.nodeCountPaths. |
| --storage-class | | string | Kubernetes StorageClass name to inject at bundle time. Written to registry-declared storageClassPaths for each component. Overrides any storageClassName set in recipe overlays. |
| --kubeconfig | -k | string | Path to kubeconfig file |
| --insecure-tls | | bool | Skip TLS verification for OCI registry connections |
| --plain-http | | bool | Use plain HTTP for OCI registry connections |
| --image-refs | | string | Path to image references file for OCI registry |
| --attest | | bool | Enable bundle attestation and binary provenance verification. Requires OIDC authentication. See Bundle Attestation. |
| --certificate-identity-regexp | | string | Override the certificate identity pattern for binary attestation verification. Must contain "NVIDIA/aicr". For testing only. |
| --identity-token | | string | Pre-fetched OIDC identity token for --attest keyless signing. Skips ambient/browser/device-code flows. Prefer COSIGN_IDENTITY_TOKEN on shared hosts — flag values are visible in ps and /proc/<pid>/cmdline. |
| --oidc-device-flow | | bool | Use the OAuth 2.0 device authorization grant for --attest instead of opening a browser callback. Useful on headless hosts that can still reach Sigstore (--identity-token and CI ambient OIDC are alternatives). Also reads AICR_OIDC_DEVICE_FLOW. |

Bundle Config File Mode

The bundle command accepts the same AICRConfig format used by aicr recipe. A single file can populate both spec.recipe and spec.bundle, capturing an end-to-end workflow that can be committed to git, fetched from CI, or shared across environments.

When both spec.recipe.output.path and spec.bundle.input.recipe are set, they must reference the same path; otherwise loading fails fast.

kind: AICRConfig
apiVersion: aicr.nvidia.com/v1alpha1
spec:
  bundle:
    input:
      recipe: ./recipe.yaml
    output:
      target: oci://ghcr.io/example/bundle:v1.0.0
    deployment:
      deployer: argocd
      repo: https://example.git/charts
      set:
        - gpuoperator:driver.version=570.86.16
    scheduling:
      systemNodeSelector:
        role: system
      acceleratedNodeTolerations:
        - "nvidia.com/gpu=present:NoSchedule"
    nodes: 8
    storageClass: gp3
    attestation:
      enabled: false
    registry:
      insecureTLS: false
      plainHTTP: false

$# Drive the bundle entirely from a config file
$aicr bundle --config bundle.yaml
$
$# Override the deployer for a one-off run
$aicr bundle --config bundle.yaml --deployer helm

CLI flags always override values loaded from --config. For slice/map flags (--set, --dynamic, --system-node-selector, etc.), CLI presence replaces the config’s value rather than appending. Override events are logged at INFO so users can see which input won.

Secrets: the cosign identity token is never read from a config file; supply it via --identity-token or COSIGN_IDENTITY_TOKEN.
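
A minimal sketch of the environment-variable route on a shared host; the token-fetch command is a placeholder for whatever your identity provider offers:

$# Keep the OIDC token out of the process table by passing it via the environment
$export COSIGN_IDENTITY_TOKEN="$(./fetch-oidc-token.sh)"
$aicr bundle --config bundle.yaml --attest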

Node Scheduling

The --accelerated-node-selector and --accelerated-node-toleration flags control scheduling for GPU-specific components:

| Flag | GPU Daemonsets | NFD Workers |
| --- | --- | --- |
| --accelerated-node-selector | Applied (restricts to GPU nodes) | Not applied (NFD runs on all nodes) |
| --accelerated-node-toleration | Applied | Applied |
| --system-node-selector | Not applied | Not applied |
| --system-node-toleration | Not applied | Not applied |

NFD (Node Feature Discovery) workers must run on all nodes (GPU, CPU, and system) to detect hardware features. This matches the gpu-operator default behavior where NFD workers also run on control-plane nodes. The --accelerated-node-selector is intentionally not applied to NFD workers so they are not restricted to GPU nodes.

Note: When no --accelerated-node-toleration is specified, a default toleration (operator: Exists) is applied to both GPU daemonsets and NFD workers, allowing them to run on nodes with any taint.
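
For reference, that tolerate-all default corresponds to a Kubernetes tolerations entry of roughly this shape (a sketch of the rendered values, not the exact key layout of any particular chart):

tolerations:
  - operator: Exists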

Example:

$aicr bundle --recipe recipe.yaml \
> --accelerated-node-selector nodeGroup=gpu-worker \
> --accelerated-node-toleration dedicated=worker-workload:NoSchedule \
> --accelerated-node-toleration dedicated=worker-workload:NoExecute \
> --system-node-selector nodeGroup=system-worker \
> --system-node-toleration dedicated=system-workload:NoSchedule \
> --system-node-toleration dedicated=system-workload:NoExecute \
> --output bundle

Cluster node requirements: This example assumes the cluster has nodes labeled nodeGroup=system-worker with taints dedicated=system-workload:NoSchedule,NoExecute for system infrastructure, and GPU nodes labeled nodeGroup=gpu-worker with taints dedicated=worker-workload:NoSchedule,NoExecute.

This results in:

  • GPU daemonsets (driver, device-plugin, toolkit, dcgm): nodeSelector=nodeGroup=gpu-worker + tolerations for dedicated=worker-workload with both NoSchedule and NoExecute
  • NFD workers: no nodeSelector (runs on all nodes) + tolerations for dedicated=worker-workload with both NoSchedule and NoExecute
  • System components (gpu-operator controller, NFD gc/master, dynamo grove, kgateway proxy): nodeSelector=nodeGroup=system-worker + tolerations for dedicated=system-workload with both NoSchedule and NoExecute

Behavior:

  • All components from the recipe are bundled automatically
  • Each component creates a subdirectory in the output directory
  • Components are deployed in the order specified by deploymentOrder in the recipe

Storage Class (--storage-class)

The --storage-class flag injects a Kubernetes StorageClass name into components at bundle time. StorageClass is a cluster infrastructure detail — the right value depends on what the target cluster has provisioned, not on the recipe.

When provided, the value is written to all Helm value paths declared in the component registry under storageClassPaths, overriding any storageClassName set in recipe overlays. If a per-component --set <component>:<path>=<value> explicitly targets the same path, that value takes precedence over --storage-class.

Example:

$# Use EBS gp3 instead of the overlay default gp2 on EKS
$aicr bundle --recipe recipe.yaml \
> --storage-class gp3 \
> --output bundle
$
$# Use a custom storage class on an on-prem cluster
$aicr bundle --recipe recipe.yaml \
> --storage-class local-path \
> --output bundle

When --storage-class is not set, any storageClassName values already defined in the recipe overlays are preserved as defaults. When it is set, --set <component>:<path>=<value> on the same path still wins — --storage-class only fills in paths that were not explicitly overridden.

Deployment Methods (--deployer)

The --deployer flag controls how deployment artifacts are generated:

| Method | Description |
| --- | --- |
| helm | (Default) Generates Helm charts with values for deployment. Supports --dynamic. |
| argocd | Generates Argo CD Application manifests for GitOps deployment. Does not support --dynamic. |
| argocd-helm | Generates a Helm chart app-of-apps for Argo CD. All values overridable at install time via helm --set. Use --dynamic to pre-populate specific paths. |

Note: --dynamic is not supported with --deployer argocd. Use --deployer argocd-helm instead, which produces a Helm chart where all values are overridable at install time.

Deployment Order:

All deployers respect the deploymentOrder field from the recipe, ensuring components are installed in the correct sequence:

  • Helm: Components are listed in the README in deployment order
  • Argo CD: Uses the argocd.argoproj.io/sync-wave annotation (0 = first, 1 = second, etc.)

Value Overrides (--set)

Override any value in the generated bundle files using dot notation:

$--set bundler:path.to.field=value

Format: bundler:path=value where:

  • bundler - Bundler name (e.g., gpuoperator, networkoperator, certmanager, nodewright-operator, nvsentinel)
  • path - Dot-separated path to the field
  • value - New value to set

Behavior:

  • Duplicate keys: When the same bundler:path is specified multiple times, the last value wins
  • Array values: Individual array elements cannot be overridden (no [0] index syntax). Arrays can only be replaced entirely via recipe overrides, not via --set flags. Use recipe-level overrides in componentRefs[].overrides if you need to replace an entire array.
  • Type conversion: String values are automatically converted to appropriate types (true/false → bool, numeric strings → numbers)
  • Component enable/disable: The special enabled key controls whether a component is included in the bundle. --set <component>:enabled=false excludes the component; --set <component>:enabled=true re-enables a recipe-disabled component. The enabled key is consumed by the bundler and not passed to Helm chart values.

Examples:

$# Generate all bundles
$aicr bundle --recipe recipe.yaml --output ./bundles
$
$# Override values in GPU Operator bundle
$aicr bundle -r recipe.yaml \
> --set gpuoperator:gds.enabled=true \
> --set gpuoperator:driver.version=570.86.16 \
> -o ./bundles
$
$# Override multiple components
$aicr bundle -r recipe.yaml \
> --set gpuoperator:mig.strategy=mixed \
> --set networkoperator:rdma.enabled=true \
> --set networkoperator:sriov.enabled=true \
> -o ./bundles
$
$# Override cert-manager resources
$aicr bundle -r recipe.yaml \
> --set certmanager:controller.resources.memory.limit=512Mi \
> --set certmanager:webhook.resources.cpu.limit=200m \
> -o ./bundles
$
$# Override Nodewright manager resources
$aicr bundle -r recipe.yaml \
> --set nodewright-operator:manager.resources.cpu.limit=500m \
> --set nodewright-operator:manager.resources.memory.limit=256Mi \
> -o ./bundles
$
$# Disable a component at bundle time (e.g., EBS CSI already installed as EKS addon)
$aicr bundle -r recipe.yaml \
> --set awsebscsidriver:enabled=false \
> -o ./bundles
$
$# Schedule system components on specific node pool
$aicr bundle -r recipe.yaml \
> --system-node-selector nodeGroup=system-pool \
> --system-node-toleration dedicated=system:NoSchedule \
> -o ./bundles
$
$# Schedule GPU workloads on labeled GPU nodes
$aicr bundle -r recipe.yaml \
> --accelerated-node-selector nvidia.com/gpu.present=true \
> --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule \
> -o ./bundles
$
$# Combined: separate system and GPU scheduling
$aicr bundle -r recipe.yaml \
> --system-node-selector nodeGroup=system-pool \
> --system-node-toleration dedicated=system:NoSchedule \
> --accelerated-node-selector accelerator=nvidia-h100 \
> --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule \
> -o ./bundles
$
$# Set estimated GPU node count (writes to nodeCountPaths in registry)
$aicr bundle -r recipe.yaml --nodes 8 -o ./bundles
$
$# Day 2 options: workload-gate and workload-selector for nodewright
$aicr bundle -r recipe.yaml \
> --workload-gate skyhook.nvidia.com/runtime-required=true:NoSchedule \
> --workload-selector workload-type=training \
> -o ./bundles
$
$# Generate an attested bundle (opens browser for OIDC auth)
$aicr bundle -r recipe.yaml --attest -o ./bundles
$
$# In GitHub Actions (OIDC token detected automatically)
$aicr bundle -r recipe.yaml --attest -o ./bundles
$
$# Generate Argo CD Application manifests for GitOps
$aicr bundle -r recipe.yaml --deployer argocd -o ./bundles
$
$# Argo CD with Git repository URL (avoids placeholder in app-of-apps.yaml)
$aicr bundle -r recipe.yaml --deployer argocd \
> --repo https://github.com/my-org/my-gitops-repo.git \
> -o ./bundles
$
$# Combine deployer with value overrides
$aicr bundle -r recipe.yaml \
> --deployer argocd \
> -o ./bundles

Dynamic Install-Time Values

The --dynamic flag declares value paths that are cluster-specific and should be provided at install time rather than baked into the bundle at build time. This enables building a single bundle that can be deployed to multiple clusters with different configurations.

Use --dynamic for values that genuinely vary per cluster — cluster names, subnet IDs, endpoint URLs, region-specific settings. For values that are static per bundle but differ from the recipe default (e.g., a specific driver version), use --set instead.

| Use case | Flag | Example |
| --- | --- | --- |
| Cluster-specific value (varies per deployment) | --dynamic | --dynamic alloy:clusterName |
| Static override (same for all deployments of this bundle) | --set | --set gpuoperator:driver.version=580.105.08 |

Attestation scope: Dynamic values are supplied at install time and are not covered by --attest. Attestation binds the shipped bundle (defaults and stubs), not operator-provided overrides. If you need to constrain dynamic values at deploy time, use admission control or Argo sync hooks — see Attestation Scope.

$--dynamic component:path.to.field

Format: component:path where:

  • component - Component name or override key (same keys as --set, e.g., gpuoperator, alloy)
  • path - Dot-separated path to the value that varies per cluster

Helm deployer behavior:

Dynamic paths are removed from values.yaml and written to a separate cluster-values.yaml per component. The generated deploy.sh passes both files to Helm:

$helm upgrade --install gpu-operator ... \
> -f values.yaml \
> -f cluster-values.yaml

Before deploying, fill in cluster-values.yaml with cluster-specific values.
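
A sketch of an operator-edited cluster-values.yaml for the alloy examples in this section; the keys mirror the declared --dynamic paths and the values are placeholders:

# bundles/alloy/cluster-values.yaml: edit per target cluster
clusterName: prod-east
subnetName: subnet-abc123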

Argo CD deployer behavior:

The --deployer argocd-helm generates a Helm chart app-of-apps where all values are overridable at install time. Static values are baked into the chart as files; dynamic overrides are merged on top at render time. Use --dynamic to pre-populate specific paths in the root values.yaml:

$helm install aicr-bundle ./bundle \
> --set alloy.clusterName=prod-east \
> --set alloy.subnetName=subnet-abc123

Examples:

$# Helm: declare cluster name as install-time parameter
$aicr bundle -r recipe.yaml \
> --dynamic alloy:clusterName \
> -o ./bundles
$
$# Helm: multiple dynamic paths across components
$aicr bundle -r recipe.yaml \
> --dynamic alloy:clusterName \
> --dynamic alloy:subnetName \
> -o ./bundles
$
$# Helm: combine with --set (static overrides + dynamic cluster-specific values)
$aicr bundle -r recipe.yaml \
> --set gpuoperator:driver.version=580.105.08 \
> --dynamic alloy:clusterName \
> -o ./bundles
$
$# Argo CD Helm chart: all values overridable, --dynamic pre-populates specific paths
$aicr bundle -r recipe.yaml \
> --deployer argocd-helm \
> --dynamic alloy:clusterName \
> -o ./bundles
$
$# Argo CD Helm chart: without --dynamic, still fully overridable via helm --set
$aicr bundle -r recipe.yaml \
> --deployer argocd-helm \
> -o ./bundles

Bundle structure with --dynamic (Helm deployer):

bundles/
├── alloy/
│ ├── values.yaml # Static values (clusterName removed)
│ └── cluster-values.yaml # Dynamic values (override before deploying)
├── gpu-operator/
│ └── values.yaml # No dynamic values, no cluster-values.yaml
├── deploy.sh # Passes -f cluster-values.yaml when present
└── ...

Argo CD Helm chart structure with --dynamic:

The --deployer argocd-helm bundle is itself a Helm chart whose templates/ create per-component Argo Applications. Each application’s helm.values block merges static values (loaded via .Files.Get for upstream-helm components, or read from the wrapped chart’s own values.yaml for local-chart components) with dynamic overrides from the parent chart’s .Values.

The same uniform NNN-<component>/ folder layout used by --deployer argocd is included at the bundle root so that path-based Argo Applications (manifest-only, kustomize-wrapped, mixed -post) can resolve their path: references against the OCI-published bundle.

bundles/
├── Chart.yaml # Parent chart metadata
├── values.yaml # Dynamic stubs only (per-cluster surface)
├── templates/
│ ├── aicr-stack.yaml # Parent Argo Application (renders all children)
│ ├── cert-manager.yaml # Argo App, multi-source (upstream-helm)
│ ├── gpu-operator.yaml # Argo App, multi-source
│ ├── gpu-operator-post.yaml # Argo App, path-based (mixed -post)
│ └── nodewright-customizations.yaml # Argo App, path-based (manifest-only)
├── static/
│ ├── cert-manager.yaml # Static values for upstream-helm Applications
│ └── gpu-operator.yaml
├── 001-cert-manager/ # NNN-folder content (KindUpstreamHelm)
│ └── values.yaml
├── 002-gpu-operator/ # KindUpstreamHelm (mixed primary)
│ └── values.yaml
├── 003-gpu-operator-post/ # KindLocalHelm (mixed -post)
│ ├── Chart.yaml
│ ├── templates/
│ └── values.yaml
├── 004-nodewright-customizations/ # KindLocalHelm (manifest-only)
│ ├── Chart.yaml
│ ├── templates/
│ └── values.yaml
└── README.md

Manifest-only components and mixed-component raw manifests are supported by --deployer argocd-helm via the path-based Application shape.

The bundle is URL-portable. No --repo flag is needed (and is ignored if passed with --deployer argocd-helm). The same generated bundle bytes can be pushed to any chart-source backend the user chooses — Argo CD pulls from whichever URL the user supplies at install time via helm install --set repoURL=.... The publish location is not baked into the bundle artifact.

Recommended deploy flow:

$# 1. Generate the bundle (URL-agnostic)
$aicr bundle -r recipe.yaml --deployer argocd-helm --dynamic gpuoperator:driver.version -o ./bundle
$
$# 2. Publish to your chart registry (any HTTPS-capable OCI / Helm chart repo)
$helm package ./bundle -d /tmp/
$helm push /tmp/aicr-bundle-*.tgz oci://<your-registry>/<path>
$
$# 3. Install — the URL is supplied here, not at bundle time
$helm install aicr-bundle oci://<your-registry>/<path>/aicr-bundle --version <chart-version> \
> -n argocd \
> --set repoURL=oci://<your-registry>/<path>/aicr-bundle \
> --set targetRevision=<chart-version>

The chart’s templates/aicr-stack.yaml renders the parent Argo Application with .Values.repoURL and .Values.targetRevision substituted in. The parent Application then triggers Argo to render the chart again from the OCI source, creating the per-component child Applications with sync-wave ordering preserved. Child Applications whose source is path-based (manifest-only and mixed-component -post folders) inherit .Values.repoURL so they too pull from the same published location.

helm install ./bundle from a local directory also works, but with a caveat: child Applications whose source is path-based require Argo’s repo-server to fetch the bundle from a remote (git or OCI) — there is no local-filesystem source type for an Argo Application. Local helm install is therefore end-to-end only when the recipe contains pure-Helm components. For everything else, publish first.

Bundle structure (with default Helm deployer):

bundles/
├── README.md # Deployment guide with ordered steps
├── deploy.sh # Generic install loop + name-matched blocks
├── undeploy.sh # Generic reverse loop
├── recipe.yaml # Recipe used to generate bundle
├── checksums.txt # SHA256 checksums
├── attestation/ # Present when --attest is used
│ ├── bundle-attestation.sigstore.json # SLSA Build Provenance v1
│ └── aicr-attestation.sigstore.json # Binary SLSA provenance chain
├── 001-cert-manager/ # Upstream-helm folder: no Chart.yaml
│ ├── install.sh # Rendered: helm upgrade --install ... --repo ${REPO}
│ ├── values.yaml
│ ├── cluster-values.yaml # Dynamic-path overrides (operator-edited)
│ └── upstream.env # CHART, REPO, VERSION (sourced by install.sh)
├── 002-gpu-operator/ # Mixed component primary (upstream-helm)
│ ├── install.sh
│ ├── values.yaml
│ ├── cluster-values.yaml
│ └── upstream.env
└── 003-gpu-operator-post/ # Injected -post wrapped chart (mixed component's raw manifests)
    ├── Chart.yaml # Local-helm folder: Chart.yaml + templates/ present
    ├── install.sh # Rendered: helm upgrade --install ... ./
    ├── values.yaml
    ├── cluster-values.yaml
    └── templates/
        └── dcgm-exporter.yaml

Folder layout rules:

  • Folders are numbered NNN-<component>/ (1-based, zero-padded). Numbering is regenerated on every bundle.
  • Each folder is one of two kinds, distinguished by the presence of Chart.yaml:
    • upstream-helm — no Chart.yaml; upstream.env carries CHART/REPO/VERSION; install.sh installs the upstream chart.
    • local-helm — Chart.yaml + templates/; install.sh installs the local chart (helm upgrade --install <name> ./).
  • Mixed components (Helm chart + raw manifests) emit two adjacent folders: a primary upstream-helm NNN-<name>/ and an injected (NNN+1)-<name>-post/ local-helm wrapper carrying the raw manifests. Subsequent components shift by one.
  • Manifest-only components (no upstream Helm chart, just raw manifests) become a single local-helm wrapped chart.
  • Kustomize-typed components run kustomize build at bundle time; the output becomes a single templates/manifest.yaml inside a local-helm folder.

Breaking change vs. earlier releases:

Previous releases used a flat <component>/ layout with manifests/ siblings and a --deployer helm script that branched on component kind. The new format is uniform:

  • All folders carry a rendered install.sh. The top-level deploy.sh is a generic loop with no per-component branching — name-matched special-case blocks (nodewright-operator taint cleanup, kai-scheduler async timeout, orphan-CRD scan, DRA kubelet-plugin restart) live around the loop, not inside it.
  • Raw manifests for mixed components now apply post-install only, via the injected -post wrapped chart. The earlier pre-apply mechanism with a CRD-race retry wrapper is gone — Helm now owns CRD ordering for mixed components natively.
  • Tooling that parsed bundle paths by bare component name must account for the NNN- prefix.
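
For scripts that previously globbed on bare component names, a small adjustment absorbs the numeric prefix. A sketch, assuming the bundle was written to ./bundles:

$# Match a component folder regardless of its NNN- ordering prefix
$ls -d bundles/[0-9][0-9][0-9]-gpu-operator/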

Argo CD bundle structure (with --deployer argocd):

The argocd deployer uses the same uniform NNN-<component>/ folder layout as --deployer helm. Each folder carries an application.yaml whose Application shape is decided by the folder kind:

  • Chart.yaml absent (KindUpstreamHelm — pure Helm components): today’s multi-source Application pointing at the upstream Helm repository plus a values $ref to the user’s git repo. Unchanged for current users.
  • Chart.yaml present (KindLocalHelm — manifest-only, kustomize-wrapped, mixed -post): single-source path-based Application with source.path: NNN-<name> against the user’s repo.

The argocd deployer emits only what Argo CD’s repo-server consumes: application.yaml, values.yaml (multi-source helm.valueFiles for upstream-helm, or local-chart Helm rendering for KindLocalHelm), and Chart.yaml/templates/ for KindLocalHelm. The helm-deployer orchestration files (install.sh, upstream.env, cluster-values.yaml) are stripped — Argo doesn’t run shell scripts or source shell env, and --dynamic is rejected with --deployer argocd (use --deployer argocd-helm for install-time values).

bundles/
├── app-of-apps.yaml # Parent Application (recurses *.application.yaml)
├── 001-cert-manager/ # KindUpstreamHelm — no Chart.yaml
│ ├── values.yaml # Static Helm values (consumed via multi-source)
│ └── application.yaml # Multi-source Application (sync-wave 0)
├── 002-gpu-operator/ # KindUpstreamHelm — primary of mixed
│ ├── values.yaml
│ └── application.yaml
├── 003-gpu-operator-post/ # KindLocalHelm — injected mixed -post
│ ├── Chart.yaml # Synthesized wrapper for raw manifests
│ ├── templates/ # Rendered manifests
│ ├── values.yaml
│ └── application.yaml # Path-based Application (sync-wave 2)
├── 004-nodewright-customizations/ # KindLocalHelm — manifest-only
│ ├── Chart.yaml
│ ├── templates/
│ ├── values.yaml
│ └── application.yaml
└── README.md

Manifest-only components (e.g., nodewright-customizations) and mixed-component raw manifests (the -post injection) are now deployed by --deployer argocd. Previously they were silently dropped. Set --repo <user-git-or-oci> to populate the repoURL on path-based Applications so Argo can resolve them.

Day 2 Options:

The --workload-gate and --workload-selector flags are day 2 operational options for cluster scaling operations:

  • --workload-gate: Specifies a taint for nodewright-operator’s runtime required feature. This ensures nodes are properly configured before workloads can schedule on them during cluster scaling. The taint is configured in the nodewright-operator Helm values file at controllerManager.manager.env.runtimeRequiredTaint. For more information about runtime required, see the Nodewright documentation.

  • --workload-selector: Specifies a label selector for nodewright-customizations to prevent nodewright from evicting running training jobs. This is critical for training workloads where job eviction would cause significant disruption. The selector is set in the Skyhook CR manifest (tuning.yaml) in the spec.workloadSelector.matchLabels field.
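
A sketch of the resulting field in the Skyhook CR manifest (tuning.yaml), assuming --workload-selector workload-type=training was passed; surrounding fields are omitted:

spec:
  workloadSelector:
    matchLabels:
      workload-type: training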

Estimated node count (--nodes):

The --nodes flag is a bundle-time option: it is applied when you run aicr bundle, not when you run aicr recipe. The value is written to each component’s Helm values at the paths declared in the registry under nodeScheduling.nodeCountPaths.

  • When to use: Pass the expected or typical number of GPU nodes (e.g. size of your node pool). Use 0 (default) to leave the value unset.
  • Where it goes: Components that define nodeCountPaths in the registry receive the value at those paths in their generated values.yaml.
  • Example: aicr bundle -r recipe.yaml --nodes 8 -o ./bundles writes 8 to every path listed in each component’s nodeScheduling.nodeCountPaths.

Component Validation System:

AICR includes a component-driven validation system that automatically checks bundle configuration and displays warnings or errors during bundle generation. Validations are defined in the component registry and run automatically when components are included in a recipe.

How Validations Work:

  1. Automatic Execution: When generating a bundle, validations are automatically executed for each component in the recipe
  2. Condition-Based: Validations can be configured to run only when specific conditions are met (e.g., intent, service, accelerator)
  3. Severity Levels: Each validation can be configured as a “warning” (non-blocking) or “error” (blocking)
  4. Custom Messages: Each validation can include an optional detail message that provides actionable guidance
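
A hedged sketch of how such a validation might be declared in the component registry follows; the field names are assumptions based on the points above, not the actual schema.

# registry.yaml fragment (illustrative; field names assumed)
components:
  - name: nodewright-customizations
    validations:
      - name: workload-selector-set
        severity: warning            # "warning" (non-blocking) or "error" (blocking)
        conditions:
          intent: training           # run only when this condition matches
        detail: >-
          Consider setting --workload-selector to prevent nodewright from
          evicting running training jobs.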

Validation Warnings:

When generating bundles with nodewright-customizations enabled, validation warnings are displayed for missing configuration:

  1. Workload Selector Warning: When nodewright-customizations is enabled with training intent, if --workload-selector is not set, a warning will be displayed:
Warning: nodewright-customizations is enabled but --workload-selector is not set.
This may cause nodewright to evict running training jobs. Consider setting --workload-selector to prevent eviction.
  2. Accelerated Selector Warning: When nodewright-customizations is enabled with training or inference intent, if --accelerated-node-selector is not set, a warning will be displayed:
Warning: nodewright-customizations is enabled but --accelerated-node-selector is not set.
Without this selector, the customization will run on all nodes. Consider setting --accelerated-node-selector to target specific nodes.

Viewing Validation Warnings:

Validation warnings are displayed in the bundle output after successful generation:

$Note:
$ ⚠ Warning: nodewright-customizations is enabled but --workload-selector is not set. This may cause nodewright to evict running training jobs. Consider setting --workload-selector to prevent eviction.
$ ⚠ Warning: nodewright-customizations is enabled but --accelerated-node-selector is not set. Without this selector, the customization will run on all nodes. Consider setting --accelerated-node-selector to target specific nodes.

Resolving Validation Warnings:

To resolve the warnings, include the appropriate flags when generating the bundle:

$# Resolve workload selector warning
$aicr bundle -r recipe.yaml \
> --workload-selector workload-type=training \
> -o ./bundle
$
$# Resolve accelerated selector warning
$aicr bundle -r recipe.yaml \
> --accelerated-node-selector nodeGroup=gpu-worker \
> -o ./bundle
$
$# Resolve both warnings
$aicr bundle -r recipe.yaml \
> --workload-selector workload-type=training \
> --accelerated-node-selector nodeGroup=gpu-worker \
> -o ./bundle

Examples:

$# Generate bundle with day 2 options for training workloads
$aicr bundle -r recipe.yaml \
> --workload-gate skyhook.nvidia.com/runtime-required=true:NoSchedule \
> --workload-selector workload-type=training \
> --workload-selector intent=training \
> --accelerated-node-selector accelerator=nvidia-h100 \
> -o ./bundles
$
$# Generate bundle for inference workloads with accelerated selector
$aicr bundle -r recipe.yaml \
> --accelerated-node-selector accelerator=nvidia-h100 \
> -o ./bundles

Argo CD Applications use multi-source to:

  1. Pull Helm charts from upstream repositories
  2. Apply values.yaml from your GitOps repository
  3. Deploy additional manifests from component’s manifests/ directory (if present)
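
For orientation, a multi-source Application for a KindUpstreamHelm component might be shaped roughly like this sketch; the chart coordinates, versions, and repository URLs are illustrative.

# application.yaml fragment (illustrative; values assumed)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
spec:
  project: default
  sources:
    - repoURL: https://charts.jetstack.io                    # 1. upstream Helm repository
      chart: cert-manager
      targetRevision: "<chart-version>"
      helm:
        valueFiles:
          - $values/bundles/001-cert-manager/values.yaml     # 2. values from your GitOps repo
    - repoURL: https://git.example.com/org/gitops.git        # GitOps repo referenced as $values
      targetRevision: main
      ref: values
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager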

Bundle Attestation

Prerequisite: The --attest flag requires a binary installed using the install script, which includes a cryptographic attestation from NVIDIA. Binaries installed via go install or manual download do not include this file and cannot use --attest.

When --attest is passed, the bundle command performs five steps:

  1. Verifies the binary attestation file exists — The running aicr binary must have a valid SLSA provenance file (aicr-attestation.sigstore.json) alongside it, included by the install script from a release archive. If missing, the command fails immediately with guidance on how to install correctly.
  2. Acquires an OIDC token — see OIDC Token Sources below.
  3. Verifies the binary’s own attestation — Cryptographically verifies the SLSA provenance binds to the running binary and was signed by NVIDIA CI. This ensures only NVIDIA-built binaries can produce attested bundles.
  4. Signs the bundle — Creates a SLSA Build Provenance v1 in-toto statement binding the creator’s identity to the bundle content (via checksums.txt digest) and the binary that produced it.
  5. Writes attestation filesattestation/bundle-attestation.sigstore.json and attestation/aicr-attestation.sigstore.json are added to the bundle output.

Attestation is opt-in; bundles are unsigned by default. Signing uses Sigstore keyless signing (Fulcio CA + Rekor transparency log). For verification, see aicr verify.
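
For reference, the signed payload inside bundle-attestation.sigstore.json is an in-toto statement along these lines; the sketch below follows the public in-toto Statement v1 and SLSA Provenance v1 formats, and the aicr-specific predicate contents are intentionally omitted rather than guessed.

# Decoded payload sketch (illustrative). The file on disk is a Sigstore bundle
# (JSON) wrapping a signed envelope; only the statement shape is shown here.
_type: https://in-toto.io/Statement/v1
subject:
  - name: checksums.txt
    digest:
      sha256: <digest of the bundle's checksums.txt>
predicateType: https://slsa.dev/provenance/v1
predicate:
  # SLSA Build Provenance v1 uses buildDefinition/runDetails; the creator
  # identity is also bound in the Fulcio-issued signing certificate.
  buildDefinition: {}
  runDetails: {}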

OIDC Token Sources

--attest resolves an OIDC identity token from the first matching source, in order:

  1. --identity-token flag (or COSIGN_IDENTITY_TOKEN env) — a pre-fetched token. Use this when a token is obtained out of band (e.g., from a cloud workload-identity exchange or another cosign invocation). On shared hosts prefer the env var: a flag value is visible in ps and /proc/<pid>/cmdline to any user on the same machine.
  2. ACTIONS_ID_TOKEN_REQUEST_URL + ACTIONS_ID_TOKEN_REQUEST_TOKEN — the ambient GitHub Actions OIDC credential. Used automatically in CI.
  3. --oidc-device-flow flag (or AICR_OIDC_DEVICE_FLOW env) — OAuth 2.0 Device Authorization Grant (RFC 8628). The CLI prints a verification URL and short code; the user enters the code in a browser on a separate device. Use on headless hosts (bastions, remote build boxes) where the default browser callback cannot reach the machine running aicr. The host still needs outbound network access to Sigstore’s OIDC and signing endpoints.
  4. Interactive browser flow — opens the default browser and listens on a random localhost port for the redirect. Default on workstations.

Both interactive flows time out after 5 minutes.

Attestation works with all deployers (helm, argocd, argocd-helm). External --data files are included in checksums.txt and listed as resolved dependencies in the attestation.

Attestation Scope

Attestation binds the shipped bundle — defaults, dynamic-value stubs, and any external --data files copied into the bundle. It does not bind install-time values supplied via helm --set, a user-provided -f extra.yaml, or Argo Application.spec.source.helm.parameters. That boundary is intentional: dynamic values are the operator’s domain by design.

If you need to enforce specific install-time values (e.g., pinning driver.version), that is a policy concern, not an attestation one. Use admission control (Kyverno, Gatekeeper) or Argo sync hooks to reject deployments that violate the policy. aicr verify checks bundle integrity and provenance; it does not evaluate install-time value constraints.
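
As one example of that approach, here is a minimal Kyverno sketch that rejects GPU Operator ClusterPolicy objects whose driver.version does not match the approved pin. The policy name, the nvidia.com/v1 ClusterPolicy target, and the placeholder version are assumptions; adapt it to your policy engine and CR layout.

# Illustrative Kyverno policy (assumptions noted above)
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: pin-gpu-driver-version
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-pinned-driver
      match:
        any:
          - resources:
              kinds:
                - nvidia.com/v1/ClusterPolicy   # GPU Operator CR, not Kyverno's own kind
      validate:
        message: "spec.driver.version must match the approved pin"
        pattern:
          spec:
            driver:
              version: "<approved-driver-version>"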

Deploying a bundle

$# Navigate to bundle
$cd bundles
$
$# Review root README and a component's values
$cat README.md
$cat 001-gpu-operator/values.yaml
$
$# Verify integrity
$sha256sum -c checksums.txt
$
$# Deploy to cluster
$chmod +x deploy.sh && ./deploy.sh

Note: deploy.sh and undeploy.sh are convenience scripts — not the only deployment path. Each NNN-<component>/ folder contains a rendered install.sh that runs the exact helm upgrade --install command for manual or pipeline-driven deployment.

Deploy Script Behavior (deploy.sh)

The deploy script installs components in the order specified by deploymentOrder in the recipe.

Flags:

FlagDescription
--no-waitSkip Helm chart-level wait (helm --wait) where AICR uses it. Keeps --timeout for hooks.
--best-effortContinue past individual component failures instead of exiting
--retries NRetry failed helm/kubectl operations N times with exponential backoff (default: 5)

Unknown flags are rejected with an error to catch typos (e.g., a misspelled --best-effort).

Note on install completion vs. workload readiness. By default, deploy.sh waits on Helm chart readiness where AICR uses helm --wait. Some components are intentionally installed without Helm chart-level waiting, and the script does not wait for bundle-level workload readiness such as Nodewright node tuning, GPU operator operand rollout (driver, toolkit, device-plugin DaemonSets), or NVIDIA DRA kubelet plugin registration. Those continue asynchronously after the script exits. When --best-effort is used, the script may also finish with non-fatal component failures; check warning lines and logs before treating the install/apply pass as fully successful. --no-wait only skips the Helm chart-level wait where AICR uses it; it does not affect bundle-level convergence.

Retry behavior:

The deploy script retries failed helm upgrade --install and kubectl apply operations with a backoff delay between attempts. By default, each operation is retried up to 5 times (6 total attempts). The delay grows quadratically between retries: 5s, 20s, 45s, 80s, 120s (capped at 120s).

Use --retries 0 to disable retries (fail-fast behavior). When --best-effort is also set, retries are exhausted first before falling through to best-effort handling.

Pre-install manifests and CRD ordering:

Some components have pre-install manifests (CRDs, namespaces, ConfigMaps) that must exist before helm install. The script applies these with kubectl apply before the Helm install. On first deploy, CRD-dependent resources may produce "no matches for kind" warnings because the CRD hasn’t been registered yet — these warnings are suppressed. All other kubectl apply errors (auth failures, webhook denials, bad manifests) fail the script immediately.

After helm install, the same manifests are re-applied as post-install to ensure CRD-dependent resources are created.

Async components:

Components that use operator patterns with custom resources that reconcile asynchronously (e.g., kai-scheduler) are installed without --wait to avoid Helm timing out on CR readiness.

DRA kubelet plugin registration

After installing nvidia-dra-driver-gpu, the script automatically restarts the DRA kubelet plugin daemonset. This is a best-effort mitigation for a known issue: after uninstall/reinstall, the kubelet’s plugin watcher (fsnotify) may not detect new registration sockets, causing "DRA driver gpu.nvidia.com is not registered" errors.

If DRA pods fail with this error after redeployment, the daemonset restart alone may not be sufficient — a node reboot is required to reset the kubelet’s plugin registration state. To reboot GPU nodes:

$# Cordon, drain, and reboot the affected node
$kubectl cordon <node-name>
$kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
$# Reboot via cloud provider (e.g., AWS EC2 console or CLI)
$aws ec2 reboot-instances --instance-ids <instance-id>
$# Uncordon after node returns
$kubectl uncordon <node-name>

Undeploy Script Behavior (undeploy.sh)

The undeploy script removes components in reverse deployment order.

Flags:

FlagDescription
--keep-namespacesSkip namespace deletion after component removal
--delete-pvcsDelete all PVCs in component namespaces (default: off)
--skip-preflightSkip pre-flight CRD/finalizer checks (use with caution)
--timeout SECONDSHelm uninstall timeout per component (default: 120)

PVC preservation (default):

PVCs are not deleted by default. This preserves historical data (Prometheus metrics, Alertmanager state, etcd data) across redeploys. If an EBS-backed PV has an AZ mismatch after redeployment, the PVC will stay Pending with a clear error — the operator can then decide to delete it manually.

Pass --delete-pvcs to delete all PVCs. Protected namespaces (kube-system, kube-public, kube-node-lease, default) are always excluded from PVC deletion to prevent accidental removal of non-bundle PVCs.

Shared namespace ordering:

When multiple components share a namespace (e.g., monitoring contains kube-prometheus-stack, prometheus-adapter, and k8s-ephemeral-storage-metrics), all components are uninstalled first, then PVC and namespace cleanup runs once. This prevents hangs caused by kubernetes.io/pvc-protection finalizers — if a StatefulSet owner is still running when PVC deletion is attempted, the delete blocks indefinitely.

Stuck release handling:

If a Helm release is in a pending-install or pending-upgrade state (from an interrupted deploy), the script retries with --no-hooks to force removal.

Orphaned webhook cleanup:

After uninstalling each component, the script checks for orphaned validating/mutating webhooks whose backing service no longer exists. Fail-closed webhooks with missing services block all pod creation, so these are deleted proactively.
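
For context, the kind of orphan this cleanup targets looks roughly like the sketch below (names are illustrative): a fail-closed webhook whose clientConfig.service points at a service that was removed along with the component.

# Illustrative only; names assumed
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-component-validating-webhook
webhooks:
  - name: validate.example.nvidia.com
    failurePolicy: Fail                      # fail-closed: blocks admission when the backend is gone
    clientConfig:
      service:
        name: example-component-webhook      # deleted together with the component
        namespace: example-component
        path: /validate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    admissionReviewVersions: ["v1"]
    sideEffects: None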


aicr verify

Verify the integrity and attestation chain of a bundle. Verification is fully offline — no network calls are made.

Synopsis:

$aicr verify <bundle-dir> [flags]

Flags:

FlagTypeDefaultDescription
--min-trust-levelstringmaxMinimum required trust level. max auto-detects the highest achievable level and verifies against it. Explicit levels: verified, attested, unverified, unknown.
--require-creatorstringRequire a specific creator identity, matched against the bundle attestation signing certificate.
--cli-version-constraintstringVersion constraint for the aicr CLI version in the attestation predicate. Supports >=, >, <=, <, ==, !=. A bare version (e.g. "0.8.0") defaults to >=.
--certificate-identity-regexpstringOverride the certificate identity pattern for binary attestation verification. Must contain "NVIDIA/aicr". For testing only.
--formatstringtextOutput format: text or json.

Trust Levels

LevelNameCriteria
4verifiedFull chain: checksums + bundle attestation + binary attestation pinned to NVIDIA CI
3attestedChain verified but binary attestation missing or external data (--data) was used
2unverifiedChecksums valid, --attest was not used when creating the bundle
1unknownMissing or invalid checksums

Verification steps

  1. Checksums — verifies all content files match checksums.txt
  2. Bundle attestation — cryptographic signature verified against Sigstore trusted root
  3. Binary attestation — provenance chain verified with identity pinned to NVIDIA CI (on-tag.yaml workflow)

Examples:

$# Auto-detect maximum trust level
$aicr verify ./my-bundle
$
$# Enforce a minimum trust level
$aicr verify ./my-bundle --min-trust-level verified
$
$# Require a specific bundle creator
$aicr verify ./my-bundle --require-creator jdoe@company.com
$
$# Require minimum CLI version used to create the bundle
$aicr verify ./my-bundle --cli-version-constraint ">= 0.8.0"
$
$# JSON output for CI pipelines
$aicr verify ./my-bundle --format json

Stale root: If verification fails with certificate chain errors, run aicr trust update to refresh the Sigstore trusted root.


aicr trust update

Fetch the latest Sigstore trusted root from the TUF CDN and update the local cache at ~/.sigstore/root/. This is needed when Sigstore rotates signing keys (a few times per year).

Synopsis:

$aicr trust update

No flags. This command contacts tuf-repo-cdn.sigstore.dev, verifies the update chain against the embedded TUF root, and writes the result to ~/.sigstore/root/.

When to run:

  • After initial installation (the install script runs this automatically)
  • When aicr verify reports a stale or expired trusted root
  • When Sigstore announces key rotation

Example:

$aicr trust update

aicr skill

Generate an AI agent skill file that teaches a coding agent how to use the AICR CLI. The generated file is written to the agent’s standard configuration directory.

Synopsis:

$aicr skill --agent <agent> [flags]

Flags:

FlagTypeDefaultDescription
--agentstring(required)Target coding agent: claude-code, codex
--stdoutboolfalsePrint to stdout instead of writing to disk
--forceboolfalseOverwrite an existing skill file without prompting

Install Locations:

AgentPath
claude-code~/.claude/skills/aicr/SKILL.md
codex~/.codex/skills/aicr/SKILL.md

Behavior:

  • Without --stdout: writes the file to disk and prints the path
  • With --stdout: prints the generated content to stdout
  • If the target file already exists: prompts overwrite? [y/N] when stdin is a terminal; aborts on non-interactive stdin unless --force is set
  • Creates parent directories as needed

Examples:

$# Install Claude Code skill file
$aicr skill --agent claude-code
$
$# Install Codex skill file
$aicr skill --agent codex
$
$# Overwrite an existing skill file without prompting (e.g., in CI)
$aicr skill --agent claude-code --force
$
$# Print to stdout (e.g., for review before installing)
$aicr skill --agent claude-code --stdout

Complete Workflow Examples

File-Based Workflow

$# Step 1: Capture system configuration
$aicr snapshot --output snapshot.yaml
$
$# Step 2: Generate optimized recipe for training workloads
$aicr recipe \
> --snapshot snapshot.yaml \
> --intent training \
> --output recipe.yaml
$
$# Step 3: Validate recipe constraints against snapshot
$aicr validate \
> --recipe recipe.yaml \
> --snapshot snapshot.yaml
$
$# Step 4: Create deployment bundle
$aicr bundle \
> --recipe recipe.yaml \
> --output ./deployment
$
$# Step 5: Deploy to cluster
$cd deployment && chmod +x deploy.sh && ./deploy.sh
$
$# Step 6: Verify deployment
$kubectl get pods -n gpu-operator
$kubectl logs -n gpu-operator -l app=nvidia-operator-validator

ConfigMap-Based Workflow (Kubernetes-Native)

$# Step 1: Agent captures snapshot to ConfigMap (using CLI deployment)
$aicr snapshot --output cm://gpu-operator/aicr-snapshot
$
$# The CLI handles agent deployment automatically
$# No manual kubectl steps needed
$
$# Step 2: Generate recipe from ConfigMap
$aicr recipe \
> --snapshot cm://gpu-operator/aicr-snapshot \
> --intent training \
> --output recipe.yaml
$
$# Alternative: Write recipe to ConfigMap as well
$aicr recipe \
> --snapshot cm://gpu-operator/aicr-snapshot \
> --intent training \
> --output cm://gpu-operator/aicr-recipe
$
$# With custom kubeconfig (if not using default)
$aicr recipe \
> --snapshot cm://gpu-operator/aicr-snapshot \
> --kubeconfig ~/.kube/prod-cluster \
> --intent training \
> --output recipe.yaml
$
$# Step 3: Validate recipe constraints against cluster snapshot
$aicr validate \
> --recipe recipe.yaml \
> --snapshot cm://gpu-operator/aicr-snapshot
$
$# For CI/CD pipelines: exit non-zero on validation failure
$aicr validate \
> --recipe recipe.yaml \
> --snapshot cm://gpu-operator/aicr-snapshot \
> --fail-on-error
$
$# Step 4: Create bundle from recipe
$aicr bundle \
> --recipe recipe.yaml \
> --output ./deployment
$
$# Step 5: Deploy to cluster
$cd deployment && chmod +x deploy.sh && ./deploy.sh
$
$# Step 6: Verify deployment
$kubectl get pods -n gpu-operator
$kubectl logs -n gpu-operator -l app=nvidia-operator-validator

E2E Testing

Validate the complete workflow:

$# Run all CLI integration tests (no cluster needed)
$make e2e
$
$# Run a single chainsaw test
$AICR_BIN=$(find dist -maxdepth 2 -type f -name aicr | head -n 1)
$chainsaw test --no-cluster --test-dir tests/chainsaw/cli/recipe-generation

Shell Completion

Generate shell completion scripts:

$# Bash
$aicr completion bash
$
$# Zsh
$aicr completion zsh
$
$# Fish
$aicr completion fish
$
$# PowerShell
$aicr completion pwsh

Installation:

Bash:

$source <(aicr completion bash)
$# Or add to ~/.bashrc for persistence
$echo 'source <(aicr completion bash)' >> ~/.bashrc

Zsh:

$source <(aicr completion zsh)
$# Or add to ~/.zshrc
$echo 'source <(aicr completion zsh)' >> ~/.zshrc

Environment Variables

AICR respects standard environment variables:

VariableDescriptionDefault
KUBECONFIGPath to Kubernetes config file~/.kube/config
AICR_LOG_LEVELLogging level: debug, info, warn, errorinfo
AICR_LOG_PREFIXOverride the CLI logger prefixcli
NO_COLORSuppress ANSI color codes in CLI logger output (de-facto standard, see no-color.org)unset

Exit Codes

CodeMeaning
0Success
1General error (unclassified)
2Invalid input (bad arguments, validation failure)
3Not found (requested resource does not exist)
4Unauthorized (authentication or authorization failure)
5Timeout (operation exceeded time limit)
6Unavailable (service temporarily unavailable)
7Rate limited (client exceeded rate limit)
8Internal error (unexpected failure)

Common Usage Patterns

Quick Recipe Generation

$aicr recipe --os ubuntu --accelerator h100 --format json | jq '.componentRefs[]'

Save All Steps

$aicr snapshot -o snapshot.yaml
$aicr recipe -s snapshot.yaml --intent training -o recipe.yaml
$aicr bundle -r recipe.yaml -o ./bundles

JSON Processing

$# Extract GPU Operator version from recipe
$aicr recipe --os ubuntu --accelerator h100 --format json | \
> jq -r '.componentRefs[] | select(.name=="gpu-operator") | .version'
$
$# Get all component versions
$aicr recipe --os ubuntu --accelerator h100 --format json | \
> jq -r '.componentRefs[] | "\(.name): \(.version)"'

Multiple Environments

$# Generate recipes for different cloud providers
$for service in eks gke aks; do
$ aicr recipe --os ubuntu --service $service --gpu h100 \
> --output recipe-${service}.yaml
$done

Troubleshooting

Snapshot Fails

$# Check GPU drivers
$nvidia-smi
$
$# Check Kubernetes access
$kubectl cluster-info
$
$# Run with debug
$aicr --debug snapshot

Recipe Not Found

$# Query parameters may not match any overlay
$# Try broader query:
$aicr recipe --os ubuntu --gpu h100

Bundle Generation Fails

$# Verify recipe file
$cat recipe.yaml
$
$# Check bundler is valid
$aicr bundle --help # Shows available bundlers
$
$# Run with debug
$aicr --debug bundle -r recipe.yaml

External Data Directory

The --data flag enables extending or overriding the embedded recipe data with external files. This allows customization without rebuilding the CLI.

Overview

AICR embeds recipe data (overlays, component values, registry) at compile time. The --data flag layers an external directory on top, enabling:

  • Custom components: Add new components to the registry
  • Override values: Replace default component values files
  • Custom overlays: Add new recipe overlays for specific environments
  • Registry extensions: Add custom components while preserving embedded ones

Directory Structure

The external directory must mirror the embedded data structure:

my-data/
├── registry.yaml # REQUIRED - merged with embedded registry
├── overlays/
│ ├── base.yaml # Optional - replaces embedded base.yaml
│ └── custom-overlay.yaml # Optional - adds new overlay
└── components/
    └── gpu-operator/
        └── values.yaml # Optional - replaces embedded values

Requirements

  1. registry.yaml is required: The external directory must contain a registry.yaml file
  2. Security validations: Symlinks are rejected, file size is limited (10MB default)
  3. No path traversal: Paths containing .. are rejected

Merge Behavior

File TypeBehavior
registry.yamlMerged - External components are added to embedded; same-named components are replaced
All other filesReplaced - External file completely replaces embedded if path matches

Usage Examples

$# Use external data directory for recipe generation
$aicr recipe --service eks --accelerator h100 --data ./my-data
$
$# Use external data directory for bundle generation
$aicr bundle --recipe recipe.yaml --data ./my-data --output ./bundles
$
$# Combine with other flags
$aicr recipe --service eks --gpu gb200 --intent training \
> --data ./custom-recipes \
> --output recipe.yaml

Example: Adding a Custom Component

  1. Create external data directory:
$mkdir -p my-data/components/my-operator
  2. Create registry.yaml with custom component:
# my-data/registry.yaml
apiVersion: aicr.nvidia.com/v1alpha1
kind: ComponentRegistry
components:
  - name: my-operator
    displayName: My Custom Operator
    helm:
      defaultRepository: https://my-charts.example.com
      defaultChart: my-operator
      defaultVersion: v1.0.0
  3. Create values file for the component:
# my-data/components/my-operator/values.yaml
replicaCount: 1
image:
  repository: my-registry/my-operator
  tag: v1.0.0
  4. Create overlay that includes the component:
# my-data/overlays/my-custom-overlay.yaml
kind: RecipeMetadata
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  name: my-custom-overlay
spec:
  criteria:
    service: eks
    intent: training
  componentRefs:
    - name: my-operator
      type: Helm
      valuesFile: components/my-operator/values.yaml
  5. Generate recipe with external data:
$aicr recipe --service eks --intent training --data ./my-data

Debugging External Data

Use --debug flag to see detailed logging about external data loading:

$aicr --debug recipe --service eks --data ./my-data

Debug logs include:

  • External files discovered and registered
  • File source resolution (embedded vs external)
  • Registry merge details (components added/overridden)

Example Files

The examples/ directory contains reference files for testing and learning:

Recipes (examples/recipes/)

FileDescription
kind.yamlRecipe for local Kind cluster with fake GPU
eks-training.yamlEKS recipe optimized for training workloads
eks-gb200-ubuntu-training-with-validation.yamlGB200 on EKS with Ubuntu and multi-phase validation

Usage:

$# Generate bundle from example recipe
$aicr bundle --recipe examples/recipes/eks-training.yaml --output ./bundles

Templates (examples/templates/)

FileDescription
snapshot-template.md.tmplGo template for custom snapshot report formatting

Usage:

$# Generate custom cluster report
$aicr snapshot --template examples/templates/snapshot-template.md.tmpl --output report.md

See Also