CLI | NVIDIA AI Cluster Runtime

The aicr CLI provides command-line access to AICR configuration management capabilities.

Overview

The CLI provides a four-step workflow for optimizing GPU infrastructure, plus a query command for inspecting hydrated recipe values:

┌──────────────┐      ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Snapshot   │─────▶│    Recipe    │─────▶│   Validate   │─────▶│    Bundle    │
└──────────────┘      └──────────────┘      └──────────────┘      └──────────────┘
   Capture system      Generate optimized    Check cluster         Create deployment
   configuration        recommendations       compatibility         artifacts
                              │
                        ┌─────┴──────┐
                        │   Query    │
                        └────────────┘
                        Extract hydrated
                        config values

Step 1: Snapshot Command

Captures system configuration:

Operating system: grub, kmod, sysctl, /etc/os-release
SystemD services: containerd, docker, kubelet (service state and configuration)
Kubernetes: API server version, container images, ClusterPolicy custom resource
GPU hardware: driver version, CUDA libraries, MIG configuration, device properties
Node topology: cluster-wide taints and labels

Output destinations:

File: --output system.yaml (local filesystem)
Stdout: Default (can be piped to other commands)
ConfigMap: --output cm://namespace/name (Kubernetes ConfigMap using Kubernetes API)

Agent deployment:

Kubernetes Job runs on GPU nodes
Writes snapshot to ConfigMap via Kubernetes API
Requires ServiceAccount with ConfigMap create/update permissions (Role in target namespace)
Does not require PersistentVolume

Step 2: Recipe Command

Generates optimized configuration recipes with two modes:

Query Mode: Direct recipe generation from system parameters (OS, GPU, K8s, etc.)
Snapshot Mode: Analyzes captured snapshots and generates tailored recipes based on workload intent (training/inference)

Input Options:

Query parameters: --os ubuntu --gpu gb200 --service eks (direct recipe generation)
Snapshot file: --snapshot system.yaml (analyze captured snapshot)
ConfigMap: --snapshot cm://namespace/name (read from Kubernetes)

Output Options:

File: --output recipe.yaml (write to file)
Stdout: Default behavior (pipe to bundle command)
ConfigMap: --output cm://namespace/name (store in Kubernetes)

Step 3: Validate Command

Validates recipe constraints against actual system measurements from a snapshot.

Input sources:

Recipe file: --recipe recipe.yaml (local filesystem)
Recipe URL: --recipe https://example.com/recipe.yaml (HTTP/HTTPS)
Recipe ConfigMap: --recipe cm://namespace/name (Kubernetes ConfigMap)
Snapshot file: --snapshot snapshot.yaml (local filesystem)
Snapshot ConfigMap: --snapshot cm://namespace/name (Kubernetes ConfigMap)

Constraint format:

Constraints use fully qualified measurement paths: {Type}.{Subtype}.{Key}

K8s.server.version - Kubernetes server version
OS.release.ID - Operating system identifier
OS.release.VERSION_ID - OS version
OS.sysctl./proc/sys/kernel/osrelease - Kernel version
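
One way to resolve such a path is to split on the first two dots only, so that a key such as /proc/sys/kernel/osrelease (or one containing further dots) stays intact. The following is a minimal sketch under that assumption; the `snapshot` map type and the `lookup` helper are hypothetical stand-ins, not the AICR data model:

```go
package main

import (
	"fmt"
	"strings"
)

// snapshot is a simplified stand-in: type -> subtype -> key -> value.
type snapshot map[string]map[string]map[string]string

// lookup resolves a {Type}.{Subtype}.{Key} path. Only the first two dots
// separate components, so the key may contain further dots or slashes.
func lookup(s snapshot, path string) (string, bool) {
	parts := strings.SplitN(path, ".", 3)
	if len(parts) != 3 {
		return "", false
	}
	// Reading from a nil inner map is safe in Go and yields the zero value.
	v, ok := s[parts[0]][parts[1]][parts[2]]
	return v, ok
}

func main() {
	s := snapshot{
		"K8s": {"server": {"version": "1.31.2"}},
		"OS":  {"sysctl": {"/proc/sys/kernel/osrelease": "6.8.0-1015-aws"}},
	}
	v, ok := lookup(s, "OS.sysctl./proc/sys/kernel/osrelease")
	fmt.Println(v, ok)
}
```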

Supported operators:

>= 1.30 - Greater than or equal (version comparison)
<= 1.33 - Less than or equal (version comparison)
> 1.30, < 2.0 - Strict comparison
== ubuntu, != rhel - Equality operators
ubuntu - Exact string match (no operator)
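
The operator semantics above can be sketched as a small evaluator. This is an illustration only: the `evaluate` and `compareVersions` helpers are hypothetical, ordering operators are assumed to compare dot-separated numeric components (missing components count as 0), and `==`/`!=` are assumed to compare strings. The actual implementation may differ on all three points.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersions returns -1, 0, or 1. Missing components count as 0,
// which is an assumption of this sketch.
func compareVersions(a, b string) (int, error) {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var x, y int
		var err error
		if i < len(as) {
			if x, err = strconv.Atoi(as[i]); err != nil {
				return 0, fmt.Errorf("invalid version %q: %w", a, err)
			}
		}
		if i < len(bs) {
			if y, err = strconv.Atoi(bs[i]); err != nil {
				return 0, fmt.Errorf("invalid version %q: %w", b, err)
			}
		}
		if x < y {
			return -1, nil
		}
		if x > y {
			return 1, nil
		}
	}
	return 0, nil
}

// evaluate checks an actual value against an expression such as ">= 1.30".
func evaluate(expr, actual string) (bool, error) {
	expr = strings.TrimSpace(expr)
	// Two-character operators come first so ">=" is not read as ">".
	for _, op := range []string{">=", "<=", "==", "!=", ">", "<"} {
		if !strings.HasPrefix(expr, op) {
			continue
		}
		want := strings.TrimSpace(strings.TrimPrefix(expr, op))
		switch op {
		case "==":
			return actual == want, nil
		case "!=":
			return actual != want, nil
		}
		c, err := compareVersions(actual, want)
		if err != nil {
			return false, err
		}
		switch op {
		case ">=":
			return c >= 0, nil
		case "<=":
			return c <= 0, nil
		case ">":
			return c > 0, nil
		default: // "<"
			return c < 0, nil
		}
	}
	return actual == expr, nil // no operator: exact string match
}

func main() {
	ok, err := evaluate(">= 1.30", "1.31.2")
	fmt.Println(ok, err) // true <nil>
}
```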

Output:

Validation result with summary (passed/failed/skipped counts)
Individual constraint results with expected vs actual values
Status: pass, fail, or partial (some skipped)

CI/CD integration:

By default, the command exits with non-zero status when constraints fail (ideal for CI/CD). To run in informational mode without failing:

$ aicr validate -r recipe.yaml -s cm://gpu-operator/aicr-snapshot --fail-on-error=false

Step 4: Bundle Command

Generates deployment artifacts from recipes:

Helm values files (values.yaml)
Kubernetes manifests (ClusterPolicy, NICClusterPolicy, etc.)
SHA256 checksum file
README documentation: root bundle/README.md is generated by the deployer; per-component bundle/<component>/README.md is generated by each component bundler

Input sources:

Recipe file: --recipe recipe.yaml (local filesystem)
ConfigMap: --recipe cm://namespace/name (Kubernetes ConfigMap)

Output: Local directory only. ConfigMap output is not supported for bundles.

Current bundlers:

GPU Operator: Generates GPU Operator Helm values and ClusterPolicy manifest
Network Operator: Generates Network Operator Helm values and NICClusterPolicy manifest
Cert-Manager: Generates cert-manager Helm values for certificate management
NVSentinel: Generates NVSentinel Helm values
Nodewright: Generates Nodewright Operator Helm values and Nodewright CR manifest for node optimization

Value overrides:

The --set flag allows runtime customization of generated bundle values:

$ aicr bundle -r recipe.yaml \
>   --set gpuoperator:gds.enabled=true \
>   --set gpuoperator:driver.version=570.86.16
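
To illustrate how a `bundler:path=value` argument could be turned into a nested values map, here is a minimal sketch. The `parseSet` and `setPath` helpers are hypothetical, and values are kept as plain strings; whether the real CLI coerces types (for example `true` to a boolean) is not covered here.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSet splits "bundler:path.to.key=value" into its three parts.
func parseSet(arg string) (string, string, string, error) {
	bundler, rest, ok := strings.Cut(arg, ":")
	if !ok || bundler == "" {
		return "", "", "", fmt.Errorf("missing bundler prefix in %q", arg)
	}
	path, value, ok := strings.Cut(rest, "=")
	if !ok || path == "" {
		return "", "", "", fmt.Errorf("expected path=value in %q", arg)
	}
	return bundler, path, value, nil
}

// setPath writes value at a dot-delimited path, creating nested maps
// along the way and replacing any non-map value it has to descend into.
func setPath(values map[string]any, path string, value any) {
	keys := strings.Split(path, ".")
	cur := values
	for _, k := range keys[:len(keys)-1] {
		next, ok := cur[k].(map[string]any)
		if !ok {
			next = map[string]any{}
			cur[k] = next
		}
		cur = next
	}
	cur[keys[len(keys)-1]] = value
}

func main() {
	values := map[string]any{}
	for _, arg := range []string{
		"gpuoperator:gds.enabled=true",
		"gpuoperator:driver.version=570.86.16",
	} {
		_, path, value, err := parseSet(arg)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		setPath(values, path, value)
	}
	fmt.Println(values)
}
```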

Node scheduling options:

The bundle command supports node selector and toleration flags for controlling workload placement:

$ # Schedule system components (operators, controllers) on specific nodes
$ aicr bundle -r recipe.yaml \
>   --system-node-selector nodeGroup=system-pool \
>   --system-node-toleration dedicated=system:NoSchedule
$ 
$ # Schedule GPU workloads (drivers, device plugins) on GPU nodes
$ aicr bundle -r recipe.yaml \
>   --accelerated-node-selector nvidia.com/gpu.present=true \
>   --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule

Flags:

--system-node-selector key=value – Node selector for system components (repeatable)
--system-node-toleration key=value:effect – Toleration for system components (repeatable)
--accelerated-node-selector key=value – Node selector for GPU nodes (repeatable)
--accelerated-node-toleration key=value:effect – Toleration for GPU nodes (repeatable)
--nodes N – Estimated number of GPU nodes (bundle-time only; written to paths in registry under nodeScheduling.nodeCountPaths)

These flags apply selectors/tolerations to bundler-specific paths (e.g., GPU Operator uses operator.nodeSelector and daemonsets.nodeSelector). The --nodes value is applied to paths listed in the registry under nodeScheduling.nodeCountPaths.

Air-gap vendoring:

--vendor-charts pulls upstream Helm chart bytes into the bundle at bundle time, producing a self-contained artifact that eliminates Helm chart registry egress during deployment (container-image pulls and other resources may still require network access). Each vendored chart is recorded in provenance.yaml at the bundle root with name, version, source URL, and SHA256. Requires the helm binary on $PATH at bundle time; see the CLI reference for the full tradeoff (CVE-yank signal loss, bundle-size cost, auth surface).

Execution model:

Bundlers run concurrently (parallel execution)
All components from the recipe are bundled automatically
Errors from any bundler cause immediate cancellation via context propagation

Testing: End-to-end workflow validated by Chainsaw tests in tests/chainsaw/cli/

ConfigMap Integration

The CLI supports Kubernetes-native ConfigMap storage using the cm://namespace/name URI scheme:

Benefits:

No file dependencies - Direct Kubernetes API integration
Agent-friendly - Jobs can write snapshots without volumes
Pipeline integration - CI/CD can read/write ConfigMaps
Multi-cluster - Share snapshots/recipes across clusters

RBAC Requirements:

ConfigMap read/write permissions in target namespace
ServiceAccount with appropriate Role/RoleBinding
See Agent Deployment for details
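
For illustration, a `cm://namespace/name` URI can be split into its parts as follows. This is a minimal sketch; the `parseConfigMapURI` helper and its validation rules are assumptions, not the CLI's own parser.

```go
package main

import (
	"fmt"
	"strings"
)

const configMapScheme = "cm://"

// parseConfigMapURI splits "cm://namespace/name" into namespace and name.
func parseConfigMapURI(uri string) (string, string, error) {
	if !strings.HasPrefix(uri, configMapScheme) {
		return "", "", fmt.Errorf("not a ConfigMap URI: %q", uri)
	}
	ns, name, ok := strings.Cut(strings.TrimPrefix(uri, configMapScheme), "/")
	if !ok || ns == "" || name == "" || strings.Contains(name, "/") {
		return "", "", fmt.Errorf("expected cm://namespace/name, got %q", uri)
	}
	return ns, name, nil
}

func main() {
	ns, name, err := parseConfigMapURI("cm://gpu-operator/aicr-snapshot")
	fmt.Println(ns, name, err) // gpu-operator aicr-snapshot <nil>
}
```

A caller could use the prefix check to decide between file and ConfigMap handling for `--output`, `--snapshot`, and `--recipe` values.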

Component Details

Entry Point: `cmd/aicr/main.go`

Minimal entry point that delegates to the CLI package:

package main

import "github.com/NVIDIA/aicr/pkg/cli"

func main() {
    cli.Execute()
}

Root Command: `pkg/cli/root.go`

Responsibilities:

Command registration and routing
Version information injection (via ldflags)
Global flag handling (debug mode, log formatting)
Logging mode selection and initialization

Key Features:

Version info: version, commit, date (overridden at build time)
Three logging modes:
- CLI Mode (default): Minimal output for users (SetDefaultCLILogger)
- Text Mode (--debug): Full metadata for debugging (SetDefaultLoggerWithLevel)
- JSON Mode (--log-json): Structured logs for automation (SetDefaultStructuredLoggerWithLevel)

Logger selection logic:

switch {
case c.Bool("log-json"):
    logging.SetDefaultStructuredLoggerWithLevel(name, version, logLevel)
case isDebug:
    logging.SetDefaultLoggerWithLevel(name, version, logLevel)
default:
    logging.SetDefaultCLILogger(logLevel)
}

Shell completion support
Command listing for auto-completion

Snapshot Command: `pkg/cli/snapshot.go`

Captures comprehensive system configuration snapshots.

Detailed Data Flow

Snapshot measurement types: K8s, SystemD, OS, GPU, NodeTopology (cluster-wide node taints and labels — see pkg/measurement/types.go for the canonical constants).

Usage Examples

$ # Output to stdout in JSON format
$ aicr snapshot
$ 
$ # Save to file in YAML format
$ aicr snapshot --output system.yaml --format yaml
$ 
$ # Human-readable table format
$ aicr snapshot --format table
$ 
$ # ConfigMap output (Kubernetes-native)
$ aicr snapshot --output cm://gpu-operator/aicr-snapshot

Agent Deployment Pattern

The snapshot command can be deployed as a Kubernetes Job for automated cluster auditing:

Deployment:

apiVersion: batch/v1
kind: Job
metadata:
  name: aicr
  namespace: gpu-operator
spec:
  template:
    spec:
      serviceAccountName: aicr
      containers:
      - name: aicr
        image: ghcr.io/nvidia/aicr:latest
        command:
        - aicr
        - snapshot
        - --output
        - cm://gpu-operator/aicr-snapshot
      restartPolicy: Never

RBAC Requirements:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: aicr
  namespace: gpu-operator
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: aicr
  namespace: gpu-operator
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: aicr
  namespace: gpu-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: aicr
subjects:
- kind: ServiceAccount
  name: aicr
  namespace: gpu-operator  # Must match ServiceAccount namespace

Key Points:

No volumes needed - writes directly via Kubernetes API
RBAC RoleBinding must reference the correct namespace
ConfigMap is created automatically if it doesn’t exist
Supports update pattern (overwrite existing snapshots)
RBAC and Job resources are created programmatically by pkg/k8s/agent

Recipe Command: `pkg/cli/recipe.go`

Generates optimized configuration recipes based on environment parameters.

Recipe Matching Algorithm

Recipe matching is asymmetric and rule-based: overlay criteria act as rules that are checked against the user's query (the candidate):

# Overlay file (eks.yaml)
spec:
  criteria:
    service: eks          # Rule: query must have service=eks
                         # Other fields empty = wildcards (match any query value)

Asymmetric Matching Rules:

All non-empty fields in the overlay criteria must be satisfied by the query
Empty overlay field → Wildcard (matches any query value)
Query “any” field → Only matches overlay “any” (does NOT match specific overlays)
Version fields use semantic version equality with precision awareness

This asymmetric behavior ensures generic queries (e.g., --service eks --intent training) don’t match overly specific recipes (e.g., recipes requiring accelerator: gb200).
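
The asymmetric rules can be sketched in a few lines of Go. The `criteria` struct and the `fieldMatches`/`matches` helpers below are hypothetical simplifications: version fields are omitted, and an overlay value of "any" is assumed to behave like an empty (wildcard) field.

```go
package main

import "fmt"

// criteria is a simplified stand-in for recipe criteria.
// An empty string means the field is unset.
type criteria struct {
	Service, Accelerator, Intent, OS string
}

// fieldMatches applies the asymmetric rule to a single field.
func fieldMatches(overlay, query string) bool {
	if overlay == "" || overlay == "any" {
		return true // overlay wildcard matches every query value
	}
	// A specific overlay value requires the same specific query value,
	// so a query of "any" (or unset) never matches it.
	return overlay == query
}

// matches reports whether every overlay field is satisfied by the query.
func matches(overlay, query criteria) bool {
	return fieldMatches(overlay.Service, query.Service) &&
		fieldMatches(overlay.Accelerator, query.Accelerator) &&
		fieldMatches(overlay.Intent, query.Intent) &&
		fieldMatches(overlay.OS, query.OS)
}

func main() {
	query := criteria{Service: "eks", Intent: "training"}

	eks := criteria{Service: "eks"}
	gb200 := criteria{Service: "eks", Accelerator: "gb200", Intent: "training"}

	fmt.Println(matches(eks, query))   // true: unset overlay fields are wildcards
	fmt.Println(matches(gb200, query)) // false: query does not specify gb200
}
```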

Usage Examples

$ # Basic recipe for Ubuntu with gb200 GPU
$ aicr recipe --os ubuntu --gpu gb200
$ 
$ # Full specification with all parameters
$ aicr recipe \
>   --service eks \
>   --accelerator gb200 \
>   --intent training \
>   --os ubuntu \
>   --nodes 8 \
>   --format yaml \
>   --output recipe.yaml
$ 
$ # Inference workload on GKE  
$ aicr recipe --service gke --gpu gb200 --intent inference
$ 
$ # Snapshot mode - analyze captured snapshot for training
$ aicr recipe --snapshot system.yaml --intent training
$ 
$ # Snapshot mode - analyze for inference optimization
$ aicr recipe \
>   --snapshot cluster-snapshot.yaml \
>   --intent inference \
>   --format yaml \
>   --output recipe.yaml

Recipe Command Modes

The recipe command supports two modes of operation:

Query Mode (Default)

Generates a recipe directly from environment parameters supplied as flags.

Snapshot Mode

Analyzes a captured snapshot and generates a recipe tailored to it.

Query Extraction from Snapshot

When using snapshot mode, the recipe builder extracts environment parameters from the snapshot:

From OS Measurements:

release subtype → OS family (ubuntu, rhel, cos, amazonlinux, talos)

From Kubernetes Measurements:

server subtype → K8s service provider (eks, gke, aks) inferred from images

From GPU Measurements:

Product Name → GPU type detection (H100, GB200, B200, A100, L40, RTX PRO 6000)
Maps product names to normalized accelerator types for recipe matching

Intent Types:

training – Optimize for high throughput, batch processing, multi-GPU orchestration
inference – Optimize for low latency, single-request performance, efficient batching
any – Provides general-purpose recommendations applicable to both workloads

External Data Directory

The --data flag enables extending embedded recipe data with external files:

Requirements:

External directory must contain registry.yaml
No symlinks allowed (security)
Max file size: 10MB per file

Merge Rules:

registry.yaml: Components merged by name (external overrides embedded)
All other files: External replaces embedded if path matches
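
The registry merge rule can be illustrated with a short sketch. The `component` struct, its `Version` field, and the `mergeRegistry` helper are hypothetical, and the sketch assumes an external entry replaces the embedded entry of the same name as a whole; field-level merging, if the real implementation does it, is not shown.

```go
package main

import (
	"fmt"
	"sort"
)

// component is a simplified stand-in for a registry.yaml entry.
type component struct {
	Name, Version string
}

// mergeRegistry merges components by name; external entries win on conflict.
func mergeRegistry(embedded, external []component) []component {
	byName := make(map[string]component, len(embedded)+len(external))
	for _, c := range embedded {
		byName[c.Name] = c
	}
	for _, c := range external {
		byName[c.Name] = c // overrides an embedded entry with the same name
	}
	out := make([]component, 0, len(byName))
	for _, c := range byName {
		out = append(out, c)
	}
	// Sort for deterministic output; map iteration order is random.
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	embedded := []component{{"cert-manager", "v1"}, {"gpu-operator", "v1"}}
	external := []component{{"gpu-operator", "v2"}, {"custom", "v1"}}
	fmt.Println(mergeRegistry(embedded, external))
}
```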

Usage Examples

$ # Query mode - generate recipe from parameters
$ aicr recipe --os ubuntu --service eks --accelerator h100 --intent training
$ 
$ # Snapshot mode - analyze snapshot for training workloads
$ aicr recipe --snapshot system.yaml --intent training
$ 
$ # Snapshot mode with output file
$ aicr recipe -s system.yaml -i inference -o recipe.yaml
$ 
$ # Query mode with full specification
$ aicr recipe \
>   --service eks \
>   --accelerator gb200 \
>   --intent training \
>   --os ubuntu \
>   --platform kubeflow \
>   --nodes 8 \
>   --format yaml
$ 
$ # Use external data directory
$ aicr recipe --service eks --accelerator h100 --data ./my-custom-data
$ 
$ # Bundle with external data
$ aicr bundle --recipe recipe.yaml --data ./my-custom-data --output ./bundles

Recipe Output Structure

apiVersion: aicr.nvidia.com/v1alpha1
kind: Recipe
metadata:
  version: v1.0.0
  created: "2025-01-15T10:30:00Z"
  appliedOverlays:
    - base
    - eks
    - eks-training
    - gb200-eks-training
    - gb200-eks-ubuntu-training
criteria:
  service: eks
  accelerator: gb200
  intent: training
  os: ubuntu
  nodes: 8
componentRefs:
  - name: gpu-operator
    version: v25.3.3
    order: 1
    repository: https://helm.ngc.nvidia.com/nvidia
  - name: network-operator
    version: v25.4.0
    order: 2
    repository: https://helm.ngc.nvidia.com/nvidia
constraints:
  driver:
    version: "580.82.07"
    cudaVersion: "13.1"

Error Handling

Query Mode:
- Invalid parameter values: Returns error with supported options
- Missing required parameters: Allows “any” as default fallback
- No matching overlays: Returns recipe with base configuration
Snapshot Mode:
- Missing snapshot file: File not found error with path
- Invalid snapshot format: Parse error with details
- Invalid intent: Returns error with supported intent types (training, inference, any)
- Extraction failures: Best-effort extraction with partial criteria

Common Errors:

Unknown output format: Error with supported formats list (json, yaml)

Query Command: `pkg/cli/query.go`

Extracts specific values from the fully hydrated recipe configuration using dot-path selectors.

Hydration Process

The query command builds a fully hydrated map[string]any from the RecipeResult:

Recipe-level fields (criteria, metadata, deploymentOrder, constraints) are mapped directly
Each ComponentRef is expanded into a component map with metadata fields (name, chart, source, version, etc.)
GetValuesForComponent is called per component to merge base values, overlay values, and inline overrides
The merged values are inlined under each component’s values key

Selector Resolution

The selector uses dot-delimited path walking. Leading dots are stripped (yq-style), so .components.X and components.X are equivalent. An empty selector or . returns the entire hydrated map.
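
Dot-path walking over a hydrated map can be sketched as follows. The `selectPath` function is a hypothetical illustration of the behavior described above, not the `Select` implementation in pkg/recipe/query.go.

```go
package main

import (
	"fmt"
	"strings"
)

// selectPath walks a nested map using a dot-delimited selector.
// Leading dots are stripped; an empty selector returns the whole map.
func selectPath(root map[string]any, selector string) (any, error) {
	selector = strings.TrimLeft(selector, ".")
	if selector == "" {
		return root, nil
	}
	var cur any = root
	for _, key := range strings.Split(selector, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, fmt.Errorf("cannot descend into %q: parent is not a map", key)
		}
		next, ok := m[key]
		if !ok {
			return nil, fmt.Errorf("key %q not found", key)
		}
		cur = next
	}
	return cur, nil
}

func main() {
	hydrated := map[string]any{
		"components": map[string]any{
			"gpu-operator": map[string]any{
				"values": map[string]any{
					"driver": map[string]any{"version": "580.82.07"},
				},
			},
		},
	}
	v, err := selectPath(hydrated, ".components.gpu-operator.values.driver.version")
	fmt.Println(v, err) // 580.82.07 <nil>
}
```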

Usage Examples

$ # Scalar value — plain text output
$ aicr query --service eks --accelerator h100 --intent training \
>   --selector components.gpu-operator.values.driver.version
$ 
$ # Subtree — YAML output
$ aicr query --service eks --accelerator h100 --intent training \
>   --selector components.gpu-operator.values.driver
$ 
$ # Shell-friendly for scripting
$ VERSION=$(aicr query --service eks --accelerator h100 --intent training \
>   --selector components.gpu-operator.values.driver.version)

Implementation: pkg/recipe/query.go (HydrateResult, Select)

Bundle Command: `pkg/cli/bundle.go`

Generates deployment-ready bundles (Helm values, Kubernetes manifests, installation scripts) from recipes.

Bundler Data Flow

Simplified Architecture (RecipeResult-to-Template):

Key Simplification: Single RecipeResult path (no dual Recipe/RecipeResult routing)
Data Flow: RecipeResult → Values Map + ScriptData → Templates
Templates: Use index .Values "key" for config, .Script.* for metadata

Bundler Architecture

BaseBundler Helper Pattern

// Bundlers embed BaseBundler and override Make()
type Bundler struct {
    *bundler.BaseBundler  // Provides common functionality
}

func NewBundler() *Bundler {
    return &Bundler{
        BaseBundler: bundler.NewBaseBundler(bundlerType, templatesFS),
    }
}

// Self-register at init time using MustRegister
func init() {
    bundler.MustRegister("gpu-operator", NewBundler())
}

RecipeResult-Based Data Access

// Get component reference from RecipeResult
component := input.GetComponentRef(Name)
values := input.GetValuesForComponent(Name)

// Generate script metadata
scriptData := generateScriptData(component, values)

// Pass values map to templates (config values)
b.GenerateFileFromTemplate(ctx, GetTemplate, "values.yaml", path, values, 0644)

// Pass ScriptData to scripts (metadata)
b.GenerateFileFromTemplate(ctx, GetTemplate, "install.sh", path, scriptData, 0755)

// Pass combined data to README
readmeData := map[string]interface{}{"Values": values, "Script": scriptData}
b.GenerateFileFromTemplate(ctx, GetTemplate, "README.md", path, readmeData, 0644)

Data Flow: RecipeResult → Values/ScriptData → Template

RecipeResult → GetComponentRef(Name) → ComponentRef
             → GetValuesForComponent(Name) → values map
             → generateScriptData() → ScriptData struct
             → Template ({{ index .Values "key" }} or {{ .Script.Namespace }})

Registry Pattern

// Dynamic bundler discovery
all := defaultRegistry.GetAll()         // Returns all registered bundlers
one := defaultRegistry.Get(bundlerType) // Returns specific bundler

// MustRegister panics on duplicate types (fail-fast)
bundler.MustRegister("gpu-operator", NewBundler())

DefaultBundler Options:

WithBundlerTypes([]BundleType) – Specify bundler types (empty = all registered)
WithFailFast(bool) – Stop on first error (default: false/collect all)
WithConfig(*Config) – Provide bundler configuration
WithRegistry(*Registry) – Use custom bundler registry

Execution:

Parallel execution by default: Uses errgroup.WithContext for concurrent execution
- All bundlers run concurrently when no types specified
- Faster for multiple bundlers
- Context cancellation propagates to all bundlers
- Bundlers are stateless (thread-safe by design)
- BaseBundler provides thread-safe operations

Architecture Benefits:

75% less code per bundler (BaseBundler eliminates boilerplate)
34% less test code (TestHarness standardizes testing)
15+ internal helpers for recipe parsing
Automatic registration via init() functions
Fail-fast on duplicate bundler types

Usage Examples

$ # Generate all recipe components (parallel by default)
$ aicr bundle --recipe recipe.yaml --output ./bundles
$ 
$ # Use short flags
$ aicr bundle -r recipe.yaml -o ./bundles
$ 
$ # Override values at generation time
$ aicr bundle -r recipe.yaml \
>   --set gpuoperator:gds.enabled=true \
>   --set gpuoperator:driver.version=570.86.16 \
>   -o ./bundles
$ 
$ # Override values for multiple components
$ aicr bundle -r recipe.yaml \
>   --set gpuoperator:mig.strategy=mixed \
>   --set networkoperator:rdma.enabled=true \
>   -o ./bundles
$ 
$ # Schedule system components on system node pool
$ aicr bundle -r recipe.yaml \
>   --system-node-selector nodeGroup=system-pool \
>   --system-node-toleration dedicated=system:NoSchedule \
>   -o ./bundles
$ 
$ # Schedule GPU workloads on labeled GPU nodes
$ aicr bundle -r recipe.yaml \
>   --accelerated-node-selector nvidia.com/gpu.present=true \
>   --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule \
>   -o ./bundles

Bundle Output Structure

./bundles/
├── gpu-operator/
│   ├── values.yaml              # Helm chart values
│   ├── manifests/
│   │   └── clusterpolicy.yaml  # ClusterPolicy CR
│   ├── scripts/
│   │   └── install.sh          # Installation script (uninstall delegates to `helm uninstall`)
│   ├── README.md                # Deployment instructions
│   └── checksums.txt            # SHA256 verification
├── network-operator/
│   ├── values.yaml
│   ├── manifests/
│   │   └── nicclusterpolicy.yaml
│   ├── scripts/
│   ├── README.md
│   └── checksums.txt
├── cert-manager/
│   ├── values.yaml
│   ├── README.md
│   └── checksums.txt
├── nvsentinel/
│   ├── values.yaml
│   ├── README.md
│   └── checksums.txt
└── nodewright-operator/
    ├── values.yaml
    ├── manifests/
    │   └── nodewright.yaml
    ├── README.md
    └── checksums.txt

Error Handling

Validation Errors:

Missing recipe file: File not found error with path
Invalid recipe format: Parse error with details
Invalid bundler type: Error with list of supported types
Empty measurements: Recipe validation failure

Execution Errors:

FailFast=false (default): Collects all errors, continues execution
- Returns partial results with error list
- Exit code indicates failure count
FailFast=true: Stops on first bundler error
- Returns immediately with error
- Subsequent bundlers not executed

Common Error Scenarios:

# Missing recipe file
$ aicr bundle --output ./bundles
Error: required flag "recipe" not set

# Bundler failures (FailFast=false)
$ aicr bundle -r recipe.yaml
Error: bundle generation completed with errors: 1/2 bundlers failed

CLI Integration

The bundle command integrates with the CLI through:

Shared Serializer: Uses same serializer.FromFile for recipe loading
Structured Logging: Consistent slog structured logging
Context Propagation: Respects context cancellation
Error Patterns: Uses same error handling conventions

Log Output Example:

INFO  generating bundle recipeFilePath=recipe.yaml outputDir=./bundles bundlerTypes=[gpu-operator]
INFO  starting bundle generation bundler_count=1 output_dir=./bundles
INFO  bundler completed bundler_type=gpu-operator files=5 size_bytes=12458 duration=45ms
INFO  bundle generation complete summary="Generated 5 files (12 KB) in 45ms. Success: 1/1 bundlers."

Shared Infrastructure

Collector Factory Pattern

The CLI uses the Factory Pattern for collector instantiation, enabling:

Testability: Inject mock collectors for unit tests
Flexibility: Easy to add new collector types
Encapsulation: Hide collector creation complexity

type Factory interface {
    CreateSystemDCollector() Collector
    CreateOSCollector() Collector
    CreateKubernetesCollector() Collector
    CreateGPUCollector() Collector
    CreateNodeTopologyCollector() Collector
}

Serializer Abstraction

Output formatting is abstracted through the serializer.Serializer interface:

type Serializer interface {
    Serialize(data interface{}) error
}

Implementations:

JSON: encoding/json with 2-space indent
YAML: gopkg.in/yaml.v3
Table: text/tabwriter for columnar display

Measurement Data Model

All collected data uses a unified measurement.Measurement structure:

type Measurement struct {
    Type     Type      // K8s, SystemD, OS, GPU, NodeTopology
    Subtypes []Subtype // Named collections of readings
}

type Subtype struct {
    Name    string                // grub, kmod, sysctl, server, image, etc.
    Data    map[string]Reading    // Key-value readings
    Context map[string]string     // Human-readable descriptions
}

type Reading struct {
    Value interface{}  // Actual value (int, string, bool, float64)
}

Error Handling

CLI Error Strategy

Flag Validation: User-friendly error messages for invalid flags
Version Parsing: Specific error types (ErrNegativeComponent, etc.)
Collector Failures: Log errors, continue with partial data where possible
Serialization Errors: Fatal - abort and report
Exit Codes: Non-zero exit code on any failure

Example Error Messages

# Invalid accelerator type
$ aicr recipe --accelerator invalid-gpu
[cli] command failed: error=[INVALID_REQUEST] error parsing criteria: [INVALID_REQUEST] invalid accelerator type: invalid-gpu exitCode=2

# Unknown output format
$ aicr snapshot --format xml
Error: unknown output format: "xml"

# Missing required parameters
$ aicr recipe
# Still succeeds - generates base recipe with no overlays

Performance Characteristics

Snapshot Command

Parallel Collection: All collectors run concurrently via errgroup
Typical Duration: 100-500ms depending on cluster size
Memory Usage: ~10-50MB for typical workloads
Scalability: O(n) with number of pods/nodes for K8s collector

Recipe Command

Store Loading: Once per process (cached via sync.Once)
Typical Duration: <10ms after initial load
Memory Usage: ~5-10MB (embedded YAML + parsed structure)
Scalability: O(m) with number of overlays (typically <100)
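
The once-per-process loading pattern can be sketched with sync.Once. The `store` type, `loadStore` function, and `loadCount` counter below are hypothetical and exist only to make the caching visible.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a stand-in for the parsed recipe data.
type store struct {
	overlays []string
}

var (
	storeOnce sync.Once
	cached    *store
	loadCount int // counts how often the expensive load actually runs
)

// loadStore simulates parsing the embedded YAML.
func loadStore() *store {
	loadCount++
	return &store{overlays: []string{"base", "eks", "eks-training"}}
}

// getStore loads the store on first use and returns the cached copy afterwards.
// sync.Once also makes concurrent first calls safe.
func getStore() *store {
	storeOnce.Do(func() { cached = loadStore() })
	return cached
}

func main() {
	a, b := getStore(), getStore()
	fmt.Println(a == b, loadCount) // true 1
}
```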

Build Configuration

Version Injection via ldflags

Build-time version information injection:

VERSION ?= $(shell git describe --tags --always --dirty)
COMMIT ?= $(shell git rev-parse --short HEAD)
DATE ?= $(shell date -u +%Y-%m-%dT%H:%M:%SZ)

LDFLAGS := -X github.com/NVIDIA/aicr/pkg/cli.version=$(VERSION)
LDFLAGS += -X github.com/NVIDIA/aicr/pkg/cli.commit=$(COMMIT)
LDFLAGS += -X github.com/NVIDIA/aicr/pkg/cli.date=$(DATE)

go build -ldflags="$(LDFLAGS)" -o bin/aicr ./cmd/aicr

Testing Strategy

Unit Tests

Flag parsing and validation
Version parsing and error handling
Query building from command flags
Serializer format selection

Integration Tests

Mock collectors for deterministic output
Full command execution with fake factory
Output format validation

Example Test Structure

func TestSnapshotCommand(t *testing.T) {
    ctx := context.Background()

    // Create mock factory
    mockFactory := &MockFactory{
        k8s:     mockK8sCollector,
        systemd: mockSystemDCollector,
        os:      mockOSCollector,
        gpu:     mockGPUCollector,
    }

    // Execute snapshot with mock
    snapshotter := NodeSnapshotter{
        Factory:    mockFactory,
        Serializer: &bytes.Buffer{},
    }

    err := snapshotter.Measure(ctx)
    assert.NoError(t, err)
}

Dependencies

External Libraries

github.com/urfave/cli/v3 - CLI framework
golang.org/x/sync/errgroup - Concurrent error handling
gopkg.in/yaml.v3 - YAML parsing
log/slog - Structured logging

Internal Packages

pkg/collector - System data collection
pkg/measurement - Data model
pkg/recipe - Recipe building
pkg/version - Semantic versioning
pkg/serializer - Output formatting
pkg/logging - Logging configuration
pkg/snapshotter - Snapshot orchestration

Future Enhancements

Short-Term (< 3 months)

Caching Layer
- Rationale: Reduce latency for repeated aicr snapshot calls in scripts
- Implementation: sync.Map with TTL-based eviction using time.AfterFunc
- Trade-off: Stale data risk vs 5-10x performance improvement
- Reference: sync.Map
Differential Snapshots
- Use Case: CI/CD pipelines detecting configuration drift
- Implementation: github.com/google/go-cmp/cmp for deep comparison
- Output: JSON Patch (RFC 6902) format for machine consumption
- CLI: aicr diff baseline.yaml current.yaml --format patch
Measurement Filtering
- Use Case: Extract only GPU data without K8s overhead
- CLI: aicr snapshot --filter gpu,os --exclude k8s
- Implementation: Post-collection filtering before serialization
- Performance: Saves 60-70% execution time when K8s excluded
Batch Mode
- Use Case: Fleet-wide configuration auditing (100s of nodes)
- Implementation: Worker pool with errgroup.SetLimit()
- CLI: aicr snapshot --nodes nodes.txt --workers 10 --output results/
- Reference: errgroup Limits

Mid-Term (3-6 months)

Plugin System
- Rationale: Custom collectors without forking codebase
- Interface: type Collector interface { Collect(context.Context) (Measurement, error) }
- Options: Go plugins (unstable across versions) or WASM (safe, portable)
- Security: Sandboxed execution with restricted syscalls
- Reference: WebAssembly System Interface
Configuration Files
- Use Case: Avoid repeating --os, --gpu flags
- Format: YAML following XDG Base Directory spec
- Location: ~/.config/aicr/config.yaml (Linux/macOS), %APPDATA%\aicr\config.yaml (Windows)
- Example:
```
defaults:
  os: ubuntu
  gpu: h100
  format: yaml
server:
  url: https://recipe-api.example.com
```
Watch Mode
- Implementation: Hybrid of fsnotify + periodic polling
- CLI: aicr snapshot --watch --interval 30s --on-change ./alert.sh
- Output: Stream of JSON diffs to stdout
- Use Case: Real-time monitoring with alerting
Schema Validation
- Use Case: Ensure snapshots conform to API version spec
- Implementation: Embed JSON Schema in binary with go:embed
- Library: github.com/santhosh-tekuri/jsonschema/v5 (fastest Go validator)
- CLI: aicr validate --schema v1 snapshot.json

Long-Term (6-12 months)

gRPC Mode
- Rationale: Better streaming, 3-5x smaller payloads than JSON
- Implementation: Bi-directional streaming with protobuf
- Trade-off: Added complexity (proto definitions) vs performance gains
- Reference: gRPC Go
Distributed Tracing
- Use Case: Debug performance issues across collectors
- Implementation: OpenTelemetry SDK with span per collector
- Exporter: OTLP to Jaeger/Tempo
- CLI: aicr snapshot --trace --trace-endpoint localhost:4317
- Reference: OpenTelemetry Go
Policy Enforcement
- Use Case: Block non-compliant configs in CI/CD
- Implementation: Embed OPA (github.com/open-policy-agent/opa)
- CLI: aicr validate --policy policy.rego snapshot.yaml
- Exit Code: 0 = pass, 1 = policy violations
- Reference: OPA Go Integration
Cloud Storage Integration
- Use Case: Centralized storage for fleet management
- CLI: aicr snapshot --upload s3://bucket/snapshots/$(hostname).yaml
- Implementation: AWS SDK v2 with resumable uploads
- Authentication: IAM roles, service accounts, credential chain
- Reference: AWS SDK for Go V2

Production Deployment Patterns

Pattern 1: CI/CD Integration

Use Case: Automated configuration validation in build pipelines

GitLab CI Example:

validate_gpu_config:
  stage: test
  image: ghcr.io/nvidia/aicr:latest
  script:
    - aicr snapshot --format json > snapshot.json
    # Validate against known-good baseline
    - diff -u expected_snapshot.json snapshot.json
    # Or use OPA policy (future enhancement)
    # - aicr validate --policy policies/gpu_baseline.rego snapshot.json
  only:
    - merge_requests
  artifacts:
    when: on_failure
    paths:
      - snapshot.json

GitHub Actions Example:

name: Validate GPU Configuration
on:
  pull_request:
    paths:
      - 'ansible/**'
      - 'terraform/**'

jobs:
  validate:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4

      - name: Install aicr
        run: |
          curl -sfL https://raw.githubusercontent.com/.../installer | bash -s --
          echo "$HOME/.local/bin" >> "$GITHUB_PATH"

      - name: Capture snapshot
        run: aicr snapshot --format yaml --output snapshot.yaml

      - name: Generate recipe
        run: aicr recipe --os ubuntu --gpu h100 > recipe.yaml

      - name: Compare configurations
        run: |
          yq eval '.measurements[] | select(.type=="GPU")' snapshot.yaml > actual_gpu.yaml
          yq eval '.measurements[] | select(.type=="GPU")' recipe.yaml > expected_gpu.yaml
          diff -u expected_gpu.yaml actual_gpu.yaml || \
            (echo "::error::GPU configuration drift detected" && exit 1)

      - name: Upload artifact
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: configuration-drift
          path: |
            snapshot.yaml
            recipe.yaml

Jenkins Pipeline:

pipeline {
    agent { label 'gpu-node' }

    stages {
        stage('Snapshot') {
            steps {
                sh 'aicr snapshot --format json > snapshot.json'
            }
        }

        stage('Validate') {
            steps {
                script {
                    def snapshot = readJSON file: 'snapshot.json'
                    def gpuDriver = snapshot.measurements
                        .find { it.type == 'GPU' }
                        .subtypes.find { it.subtype == 'smi' }
                        .data.'driver-version'

                    if (gpuDriver != '570.158.01') {
                        error("Incorrect GPU driver: ${gpuDriver}")
                    }
                }
            }
        }
    }

    post {
        always {
            archiveArtifacts artifacts: 'snapshot.json', fingerprint: true
        }
    }
}
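
The same driver check can be done outside a CI system. The following Go sketch parses a snapshot and returns the value stored under the GPU/smi subtype. The `Snapshot` struct and the `DriverVersion` function are illustrative names, not part of the AICR packages, and the struct models only the fields that the pipeline examples above rely on (`measurements[].type`, `subtypes[].subtype`, `data`).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Snapshot mirrors only the fields used by the pipeline examples.
// The real schema may carry more fields; this struct is an assumption.
type Snapshot struct {
	Measurements []struct {
		Type     string `json:"type"`
		Subtypes []struct {
			Subtype string         `json:"subtype"`
			Data    map[string]any `json:"data"`
		} `json:"subtypes"`
	} `json:"measurements"`
}

// DriverVersion returns the GPU driver version recorded under the
// GPU/smi subtype, or an error if the snapshot does not contain one.
func DriverVersion(raw []byte) (string, error) {
	var s Snapshot
	if err := json.Unmarshal(raw, &s); err != nil {
		return "", fmt.Errorf("parse snapshot: %w", err)
	}
	for _, m := range s.Measurements {
		if m.Type != "GPU" {
			continue
		}
		for _, st := range m.Subtypes {
			if st.Subtype != "smi" {
				continue
			}
			if v, ok := st.Data["driver-version"]; ok {
				return fmt.Sprint(v), nil
			}
		}
	}
	return "", fmt.Errorf("no GPU/smi driver-version in snapshot")
}

func main() {
	raw := []byte(`{"measurements":[{"type":"GPU","subtypes":[{"subtype":"smi","data":{"driver-version":"570.158.01"}}]}]}`)
	v, err := DriverVersion(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(v)
}
```

Returning an error when the GPU/smi entry is missing keeps an empty snapshot from being mistaken for a passing check.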

Pattern 2: Scheduled Auditing

Use Case: Nightly configuration drift detection across fleet

Kubernetes CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: aicr-audit
  namespace: monitoring
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  concurrencyPolicy: Forbid  # Prevent overlapping runs
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: aicr-audit
        spec:
          serviceAccountName: aicr
          nodeSelector:
            node-role.kubernetes.io/gpu: "true"
          tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
          containers:
          - name: aicr
            image: ghcr.io/nvidia/aicr:<release-tag>  # replace with the AICR release you target
            command:
              - /bin/sh
              - -c
              - |
                set -e
                TIMESTAMP=$(date +%Y%m%d-%H%M%S)
                HOSTNAME=$(hostname)

                # Capture snapshot
                aicr snapshot --format yaml > /tmp/snapshot.yaml

                # Store as a labeled ConfigMap so the cleanup step can select it
                kubectl create configmap \
                  "aicr-snapshot-${HOSTNAME}-${TIMESTAMP}" \
                  --from-file=snapshot=/tmp/snapshot.yaml \
                  --dry-run=client -o yaml | \
                kubectl label --local -f - aicr-snapshot=true -o yaml | \
                kubectl apply -f -

                # Cleanup old snapshots (keep the 30 most recent)
                kubectl get configmaps -l aicr-snapshot=true \
                  --sort-by=.metadata.creationTimestamp -o name | \
                head -n -30 | \
                xargs -r kubectl delete
            resources:
              limits:
                memory: 256Mi
              requests:
                cpu: 100m
                memory: 128Mi
          restartPolicy: OnFailure

Systemd Timer (Bare Metal):

# /etc/systemd/system/aicr-audit.service
[Unit]
Description=AICR Configuration Audit
After=network.target

[Service]
Type=oneshot
# systemd does not expand strftime patterns, so the date is computed by a shell.
# In unit files "%%" is a literal "%" and "$$" is a literal "$".
ExecStart=/bin/sh -c 'exec /usr/local/bin/aicr snapshot --format json --output "/var/log/aicr/snapshot-$$(date +%%Y%%m%%d).json"'
User=aicr
Group=aicr

# Hardening
PrivateTmp=true
NoNewPrivileges=true
ReadOnlyPaths=/usr /etc
ReadWritePaths=/var/log/aicr

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/aicr-audit.timer
[Unit]
Description=AICR Audit Timer

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

Enable with:

$ sudo systemctl enable --now aicr-audit.timer
$ sudo systemctl list-timers aicr-audit.timer

Pattern 3: Fleet Management

Use Case: Collect snapshots from hundreds of GPU nodes in parallel

Ansible Playbook:

---
- name: Collect AICR Snapshots from GPU Fleet
  hosts: gpu_nodes
  gather_facts: yes
  serial: 10  # Process 10 nodes at a time
  tasks:
    - name: Ensure aicr is installed
      stat:
        path: /usr/local/bin/aicr
      register: aicr_binary
      failed_when: not aicr_binary.stat.exists

    - name: Collect snapshot
      shell: aicr snapshot --format json
      register: snapshot
      changed_when: false
      failed_when: snapshot.rc != 0

    - name: Upload to S3
      aws_s3:
        bucket: fleet-snapshots
        object: "{{ inventory_hostname }}/{{ ansible_date_time.iso8601 }}.json"
        content: "{{ snapshot.stdout }}"
        mode: put
      delegate_to: localhost
      run_once: false

    - name: Validate against baseline
      shell: |
        echo '{{ snapshot.stdout }}' | \
        jq '.measurements[] | select(.type=="GPU") | .subtypes[] | 
            select(.subtype=="smi") | .data."driver-version"'
      register: driver_version
      failed_when: driver_version.stdout != '"570.158.01"'
      changed_when: false

- name: Generate Fleet Report
  hosts: localhost
  tasks:
    - name: Download all snapshots
      aws_s3:
        bucket: fleet-snapshots
        mode: list
      register: s3_objects

    - name: Aggregate results
      script: scripts/aggregate_snapshots.py
      args:
        snapshots: "{{ s3_objects.s3_keys }}"

Terraform Provisioning:

resource "null_resource" "aicr_snapshot" {
  count = length(var.gpu_instance_ids)

  provisioner "remote-exec" {
    inline = [
      "aicr snapshot --format json > /tmp/snapshot.json",
      "aws s3 cp /tmp/snapshot.json s3://fleet-snapshots/${self.id}/"
    ]

    connection {
      type        = "ssh"
      host        = element(var.gpu_instance_ips, count.index)
      user        = "ubuntu"
      private_key = file("~/.ssh/id_rsa")
    }
  }

  triggers = {
    instance_id = element(var.gpu_instance_ids, count.index)
    timestamp   = timestamp()
  }
}

data "aws_s3_objects" "snapshots" {
  bucket     = "fleet-snapshots"
  depends_on = [null_resource.aicr_snapshot]
}

output "snapshot_count" {
  value = length(data.aws_s3_objects.snapshots.keys)
}

Pattern 4: Real-Time Monitoring

Use Case: Continuous configuration monitoring with Prometheus alerting

Prometheus Exporter (future enhancement):

package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    "os"
    "strconv"
    "strings"
    "time"

    "github.com/NVIDIA/aicr/pkg/snapshotter"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    gpuDriverVersion = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "aicr_gpu_driver_version",
            Help: "NVIDIA driver version (encoded as float)",
        },
        []string{"node", "gpu_model"},
    )

    k8sVersion = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "aicr_k8s_version",
            Help: "Kubernetes version (encoded)",
        },
        []string{"node"},
    )
)

func init() {
    prometheus.MustRegister(gpuDriverVersion, k8sVersion)
}

// encodeVersion packs up to three dotted components into one float,
// e.g. "570.158.01" -> 570158001. Illustrative helper for this sketch.
func encodeVersion(v string) float64 {
    parts := strings.Split(v, ".")
    var encoded float64
    for i := 0; i < 3; i++ {
        n := 0
        if i < len(parts) {
            n, _ = strconv.Atoi(parts[i])
        }
        encoded = encoded*1000 + float64(n)
    }
    return encoded
}

func collectMetrics() {
    hostname, err := os.Hostname()
    if err != nil {
        hostname = "unknown"
    }

    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        snapshot, err := snapshotter.Measure(ctx)
        cancel()

        if err != nil {
            log.Printf("Snapshot failed: %v", err)
            continue
        }

        // Extract and export GPU driver version
        for _, m := range snapshot.Measurements {
            if m.Type == "GPU" {
                for _, st := range m.Subtypes {
                    if st.Subtype == "smi" {
                        version := fmt.Sprint(st.Data["driver-version"])
                        gpuModel := fmt.Sprint(st.Data["gpu-name"])
                        gpuDriverVersion.WithLabelValues(hostname, gpuModel).Set(encodeVersion(version))
                    }
                }
            }
        }
    }
}

func main() {
    go collectMetrics()
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9090", nil))
}

Prometheus Alerting Rules:

groups:
- name: aicr_configuration
  interval: 60s
  rules:
  - alert: GPUDriverVersionMismatch
    # The version is the sample value, not a label, so group by value with count_values
    expr: |
      count(count_values("version", aicr_gpu_driver_version)) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Multiple GPU driver versions detected in cluster"
      description: "{{ $value }} different driver versions found"

  - alert: KubernetesVersionSkew
    expr: |
      abs(aicr_k8s_version - scalar(avg(aicr_k8s_version))) > 0.01
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Kubernetes version skew detected on {{ $labels.node }}"
      description: "Node version differs from cluster average"

Advanced Usage Patterns

Snapshot Diffing with jq

#!/bin/bash
# Capture baseline before changes
aicr snapshot --format json > baseline.json

# Apply configuration changes (Ansible, Terraform, etc.)
# ...

# Capture new snapshot
aicr snapshot --format json > current.json

# Diff specific sections
echo "=== GPU Configuration Changes ==="
diff -u \
  <(jq -S '.measurements[] | select(.type=="GPU")' baseline.json) \
  <(jq -S '.measurements[] | select(.type=="GPU")' current.json)

echo "=== Kernel Parameter Changes ==="
diff -u \
  <(jq -S '.measurements[] | select(.type=="os") | .subtypes[] |
           select(.subtype=="sysctl")' baseline.json) \
  <(jq -S '.measurements[] | select(.type=="os") | .subtypes[] |
           select(.subtype=="sysctl")' current.json)

# Count total changes
changes=$(diff <(jq -S . baseline.json) <(jq -S . current.json) | grep -c '^[<>]')
echo "Total configuration changes: $changes"
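
When a line-based diff is too noisy, a path-level comparison can be easier to read. The Go sketch below flattens two JSON documents into dotted paths and reports added, removed, and changed leaves. `Diff`, `flatten`, and the output format are illustrative choices, not an AICR feature, and the sample input is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// flatten walks a decoded JSON value and records every leaf under a
// dotted path such as "measurements.0.type".
func flatten(prefix string, v any, out map[string]string) {
	switch t := v.(type) {
	case map[string]any:
		for k, child := range t {
			flatten(join(prefix, k), child, out)
		}
	case []any:
		for i, child := range t {
			flatten(join(prefix, fmt.Sprint(i)), child, out)
		}
	default:
		out[prefix] = fmt.Sprint(t)
	}
}

func join(prefix, key string) string {
	if prefix == "" {
		return key
	}
	return prefix + "." + key
}

// Diff returns one sorted line per path whose value was added (+),
// removed (-), or changed (~) between the two JSON documents.
func Diff(baseline, current []byte) ([]string, error) {
	var a, b any
	if err := json.Unmarshal(baseline, &a); err != nil {
		return nil, fmt.Errorf("parse baseline: %w", err)
	}
	if err := json.Unmarshal(current, &b); err != nil {
		return nil, fmt.Errorf("parse current: %w", err)
	}
	before, after := map[string]string{}, map[string]string{}
	flatten("", a, before)
	flatten("", b, after)

	var changes []string
	for path, old := range before {
		now, ok := after[path]
		switch {
		case !ok:
			changes = append(changes, fmt.Sprintf("- %s=%s", path, old))
		case now != old:
			changes = append(changes, fmt.Sprintf("~ %s: %s -> %s", path, old, now))
		}
	}
	for path, now := range after {
		if _, ok := before[path]; !ok {
			changes = append(changes, fmt.Sprintf("+ %s=%s", path, now))
		}
	}
	sort.Strings(changes)
	return changes, nil
}

func main() {
	base := []byte(`{"data":{"driver-version":"570.158.01","mig":"off"}}`)
	cur := []byte(`{"data":{"driver-version":"580.82.07","cdi":"on"}}`)
	changes, err := Diff(base, cur)
	if err != nil {
		panic(err)
	}
	for _, c := range changes {
		fmt.Println(c)
	}
}
```

Array elements are compared by index, so reordering a list shows up as a set of changes; sort arrays first if order is not meaningful.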

Recipe Generation Pipeline

#!/bin/bash
# Generate recipes for all supported configurations

set -euo pipefail

OUTPUT_DIR="recipes"
mkdir -p "$OUTPUT_DIR"

# GPU types from NVIDIA product line
GPU_TYPES=("h100" "h200" "gb200" "b200" "a100" "l40" "rtx-pro-6000")

# Kubernetes services
K8S_SERVICES=("eks" "gke" "aks" "oke" "kind" "lke" "bcm")

# OS distributions
OS_TYPES=("ubuntu" "rhel" "cos")

total=0
for gpu in "${GPU_TYPES[@]}"; do
  for service in "${K8S_SERVICES[@]}"; do
    for os in "${OS_TYPES[@]}"; do
      output="${OUTPUT_DIR}/${os}-${service}-${gpu}.yaml"

      # Generate recipe
      if aicr recipe --os "$os" --service "$service" --gpu "$gpu" \
           --format yaml > "$output" 2>/dev/null; then
        echo "✓ Generated $output"
        # ((total++)) returns non-zero when total is 0, which aborts under set -e
        total=$((total + 1))
      else
        echo "✗ Failed: $os $service $gpu"
      fi
    done
  done
done

echo "Generated $total recipes"

# Validate all recipes
echo "Validating recipes..."
find "$OUTPUT_DIR" -name '*.yaml' -exec yq eval '.' {} \; > /dev/null
echo "All recipes valid"

# Create index
cat > "$OUTPUT_DIR/README.md" <<EOF
# Configuration Recipes

Generated on $(date -Iseconds)

Total recipes: $total

## Available Configurations

| OS | Service | GPU | File |
|----|---------|-----|------|
EOF

find "$OUTPUT_DIR" -name '*.yaml' -type f | sort | while read -r file; do
  base=$(basename "$file" .yaml)
  # Split on the first two hyphens only; GPU names such as rtx-pro-6000 contain hyphens
  IFS='-' read -r os service gpu <<< "$base"
  echo "| ${os} | ${service} | ${gpu} | $file |" >> "$OUTPUT_DIR/README.md"
done
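
File names in this pipeline follow the `<os>-<service>-<gpu>.yaml` pattern, and GPU names such as `rtx-pro-6000` contain hyphens themselves, so any tool that reads the index back needs to split on the first two hyphens only. A minimal Go sketch of that parsing, with `RecipeName` and `ParseRecipeName` as illustrative names:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// RecipeName holds the three criteria encoded in a recipe file name.
type RecipeName struct {
	OS, Service, GPU string
}

// ParseRecipeName splits "<os>-<service>-<gpu>.yaml" on the first two
// hyphens so that hyphenated GPU names stay intact.
func ParseRecipeName(path string) (RecipeName, error) {
	base := strings.TrimSuffix(filepath.Base(path), ".yaml")
	parts := strings.SplitN(base, "-", 3)
	if len(parts) != 3 || parts[0] == "" || parts[1] == "" || parts[2] == "" {
		return RecipeName{}, fmt.Errorf("unexpected recipe file name %q", path)
	}
	return RecipeName{OS: parts[0], Service: parts[1], GPU: parts[2]}, nil
}

func main() {
	name, err := ParseRecipeName("recipes/ubuntu-eks-rtx-pro-6000.yaml")
	if err != nil {
		panic(err)
	}
	fmt.Printf("os=%s service=%s gpu=%s\n", name.OS, name.Service, name.GPU)
}
```

This only works while OS and service names stay hyphen-free, which holds for the lists used in the script.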

Automated Remediation

#!/bin/bash
# Apply recommended configuration from recipe
# WARNING: Modifies system configuration - use with caution

set -euo pipefail

# Capture current state
current=$(aicr snapshot --format json)

# Generate recommended recipe
recipe=$(aicr recipe --os ubuntu --gpu h100 --format json)

# Extract recommended GRUB parameters
recommended_grub=$(echo "$recipe" | jq -r '
  .measurements[] |
  select(.type=="os") |
  .subtypes[] |
  select(.subtype=="grub") |
  .data |
  to_entries[] |
  "\(.key)=\(.value)"' | tr '\n' ' ')

# Extract current GRUB parameters
current_grub=$(echo "$current" | jq -r '
  .measurements[] |
  select(.type=="os") |
  .subtypes[] |
  select(.subtype=="grub") |
  .data |
  to_entries[] |
  "\(.key)=\(.value)"' | tr '\n' ' ')

# Show diff
echo "Current GRUB parameters:"
echo "$current_grub"
echo ""
echo "Recommended GRUB parameters:"
echo "$recommended_grub"
echo ""

# Prompt for confirmation
read -r -p "Apply changes? (yes/no): " confirm
if [[ "$confirm" != "yes" ]]; then
  echo "Aborted"
  exit 0
fi

# Apply GRUB changes (requires root)
# grubby ships with RHEL-family distributions; on Ubuntu, edit
# /etc/default/grub and run update-grub instead.
sudo grubby --update-kernel=ALL --args="$recommended_grub"
echo "GRUB configuration updated. Reboot required."

# Apply sysctl changes
echo "$recipe" | jq -r '
  .measurements[] |
  select(.type=="os") |
  .subtypes[] |
  select(.subtype=="sysctl") |
  .data |
  to_entries[] |
  "\(.key) = \(.value)"' | \
sudo tee /etc/sysctl.d/99-aicr-recommended.conf

sudo sysctl --system
echo "Sysctl parameters applied"

# Log changes
echo "$(date -Iseconds): Applied AICR recommendations" | \
sudo tee -a /var/log/aicr-remediation.log

Troubleshooting Guide

Issue: “nvidia-smi not found”

Symptoms: GPU measurements empty, error in logs
Root Cause: NVIDIA driver not installed or not in PATH

Diagnosis:

$ # Check if nvidia-smi exists
$ which nvidia-smi
$ # Expected: /usr/bin/nvidia-smi
$ 
$ # Verify driver installation
$ nvidia-smi --version
$ # Expected: NVIDIA-SMI 570.158.01
$ 
$ # Check kernel modules
$ lsmod | grep nvidia
$ # Expected: nvidia, nvidia_uvm, nvidia_modeset
$ 
$ # Verify device nodes
$ ls -l /dev/nvidia*
$ # Expected: /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm

Resolution:

$ # Ubuntu: Install NVIDIA driver
$ sudo apt-get update
$ sudo apt-get install -y nvidia-driver-570
$ 
$ # RHEL: Install from CUDA repo
$ sudo dnf config-manager --add-repo \
>   https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
$ sudo dnf module install -y nvidia-driver:570
$ 
$ # Verify installation
$ sudo nvidia-smi
$ 
$ # If PATH issue, add to shell profile
$ echo 'export PATH="/usr/bin:$PATH"' >> ~/.bashrc
$ source ~/.bashrc

Issue: “Kubernetes API server unreachable”

Symptoms: K8s measurements empty, “connection refused” error
Root Cause: Not running in cluster, or kubeconfig missing/invalid

Diagnosis:

$ # Verify cluster connectivity
$ kubectl cluster-info
$ # Expected: Kubernetes control plane is running at https://...
$ 
$ # Check kubeconfig
$ echo $KUBECONFIG
$ cat ~/.kube/config
$ 
$ # Test API access
$ kubectl get nodes
$ # Expected: List of nodes
$ 
$ # Check service account (in-cluster)
$ ls -l /var/run/secrets/kubernetes.io/serviceaccount/
$ # Expected: token, ca.crt, namespace

Resolution:

$ # Option 1: Set KUBECONFIG explicitly
$ export KUBECONFIG=~/.kube/config
$ aicr snapshot
$ 
$ # Option 2: Copy admin kubeconfig
$ sudo cp /etc/kubernetes/admin.conf ~/.kube/config
$ sudo chown $(id -u):$(id -g) ~/.kube/config
$ 
$ # Option 3: Use service account token (in-cluster)
$ kubectl create serviceaccount aicr
$ kubectl create clusterrolebinding aicr --clusterrole=view --serviceaccount=default:aicr
$ 
$ # Option 4: Debug with kubectl proxy
$ kubectl proxy &
$ export KUBERNETES_SERVICE_HOST=localhost
$ export KUBERNETES_SERVICE_PORT=8001
$ aicr snapshot

Issue: “Snapshot too slow (> 5s)”

Symptoms: Long execution time, timeouts in CI/CD
Root Cause: Large cluster (1000s of pods), slow API server, many GPUs

Diagnosis:

$ # Enable debug logging to identify slow collectors
$ aicr --debug snapshot 2>&1 | grep -E 'collector|duration'
$ # Expected output shows timing per collector:
$ # time="..." level=debug msg="k8s collector finished" duration=3.2s
$ # time="..." level=debug msg="gpu collector finished" duration=0.8s
$ 
$ # Check cluster size
$ kubectl get pods --all-namespaces --no-headers | wc -l
$ # Large: > 1000 pods
$ 
$ # Check GPU count
$ nvidia-smi --list-gpus | wc -l
$ # Many: > 8 GPUs
$ 
$ # Profile execution
$ time aicr snapshot > /dev/null

Resolution:

$ # Option 1: Filter to specific collectors (future enhancement)
$ aicr snapshot --filter gpu,os  # Skip K8s (saves 60-70% time)
$ 
$ # Option 2: Increase timeout (future enhancement)
$ aicr snapshot --timeout 30s
$ 
$ # Option 3: Use caching for repeated calls
$ aicr snapshot > /tmp/snapshot.json
$ # Reuse /tmp/snapshot.json for subsequent analysis
$ 
$ # Option 4: Optimize K8s collector
$ # Reduce API calls by using label selectors (code change):
$ # clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
$ #     LabelSelector: "app=gpu-operator",
$ # })
$ 
$ # Option 5: Run in parallel with errgroup limit
$ # Already implemented in code, but can tune:
$ # g.SetLimit(runtime.NumCPU())  // Current: 2

Issue: “Out of memory during snapshot”

Symptoms: Process killed, OOMKilled in K8s, segfault
Root Cause: Large measurement data (10k+ pods, many images)

Diagnosis:

$ # Check memory usage during snapshot
$ /usr/bin/time -v aicr snapshot > /dev/null 2>&1
$ # Look for "Maximum resident set size"
$ 
$ # Monitor memory in real-time
$ # Terminal 1:
$ watch -n 1 'ps aux | grep aicr'
$ # Terminal 2:
$ aicr snapshot
$ 
$ # In Kubernetes, check OOMKilled events
$ kubectl get events --field-selector reason=OOMKilling

Resolution:

$ # Option 1: Use streaming serialization (already implemented)
$ # Data never fully materialized in memory
$ aicr snapshot --format json > snapshot.json
$ 
$ # Option 2: Increase memory limit in Kubernetes
$ kubectl set resources deployment aicr-agent \
>   --limits=memory=1Gi \
>   --requests=memory=512Mi
$ 
$ # Option 3: Filter measurements (future enhancement)
$ aicr snapshot --filter gpu,os  # Exclude large K8s data
$ 
$ # Option 4: Optimize code to reduce allocations
$ # Use object pooling for repeated structs (Go):
$ #   var measurementPool = sync.Pool{
$ #       New: func() interface{} {
$ #           return &measurement.Measurement{}
$ #       },
$ #   }
$ 
$ # Option 5: Process in batches (code change needed)
$ # For K8s pods, paginate API calls (Go):
$ #   pods, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
$ #       Limit: 100,
$ #       Continue: continueToken,
$ #   })
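
The batching idea in Option 5 can be sketched without any Kubernetes dependency. In the Go example below, `FetchFunc` stands in for a paginated API call such as a pod list with `Limit` and `Continue` set; `Page`, `ForEachPage`, and `fakeFetch` are hypothetical names used only for illustration, and the in-memory data replaces a real API server.

```go
package main

import "fmt"

// Page is one batch of results plus the token for the next batch.
// An empty Next means there are no more pages.
type Page struct {
	Items []string
	Next  string
}

// FetchFunc retrieves one page. It stands in for a paginated API call.
type FetchFunc func(token string, limit int) (Page, error)

// ForEachPage calls handle once per page so that only one batch is
// held in memory at a time.
func ForEachPage(fetch FetchFunc, limit int, handle func([]string) error) error {
	token := ""
	for {
		page, err := fetch(token, limit)
		if err != nil {
			return fmt.Errorf("fetch page: %w", err)
		}
		if err := handle(page.Items); err != nil {
			return err
		}
		if page.Next == "" {
			return nil
		}
		token = page.Next
	}
}

// fakeFetch pages through an in-memory slice; the token is the next index.
func fakeFetch(all []string) FetchFunc {
	return func(token string, limit int) (Page, error) {
		start := 0
		if token != "" {
			if _, err := fmt.Sscanf(token, "%d", &start); err != nil {
				return Page{}, err
			}
		}
		end := start + limit
		if end >= len(all) {
			return Page{Items: all[start:]}, nil
		}
		return Page{Items: all[start:end], Next: fmt.Sprint(end)}, nil
	}
}

func main() {
	pods := []string{"a", "b", "c", "d", "e"}
	total := 0
	err := ForEachPage(fakeFetch(pods), 2, func(batch []string) error {
		total += len(batch)
		fmt.Println("batch:", batch)
		return nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("total:", total)
}
```

Processing each batch inside the callback, rather than appending every page to one slice, is what keeps peak memory bounded.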

Performance Tuning

CPU Profiling

$ # Build with profiling enabled
$ mkdir -p bin
$ go build -o bin/aicr cmd/aicr/main.go
$ 
$ # Capture CPU profile
$ ./bin/aicr snapshot --cpuprofile=cpu.prof
$ 
$ # Analyze profile
$ go tool pprof cpu.prof
$ (pprof) top10
$ # Shows top 10 functions by CPU time
$ 
$ (pprof) list collectContainerImages
$ # Shows line-by-line CPU usage in specific function
$ 
$ (pprof) web
$ # Opens interactive graph in browser (requires graphviz)
$ 
$ # Example output interpretation:
$ # If collectContainerImages is > 50% CPU:
$ # - Optimize pod iteration
$ # - Reduce string allocations
$ # - Cache image parsing results

Memory Profiling

$ # Capture memory profile
$ ./bin/aicr snapshot --memprofile=mem.prof
$ 
$ # Analyze allocations
$ go tool pprof -alloc_space mem.prof
$ (pprof) top10
$ # Shows top 10 functions by allocations
$ 
$ (pprof) list BuildRecipe
$ # Check for unnecessary allocations
$ 
$ # Example fixes:
$ # Before: strings.Split() allocates slice
$ # After: strings.Index() + slicing avoids allocation
$ 
$ # Before: fmt.Sprintf("%s:%s", name, tag)
$ # After: var b strings.Builder; b.WriteString(name); b.WriteString(":"); b.WriteString(tag)

Benchmarking

$ # Benchmark snapshot performance (10 iterations)
$ for i in {1..10}; do
$   time aicr snapshot --format json > /dev/null
$ done 2>&1 | grep real | awk '{print $2}' | \
> sed 's/0m//' | sed 's/s//' | \
> awk '{sum+=$1; count++} END {printf "Average: %.3fs\n", sum/count}'
$ 
$ # Compare formats
$ echo "JSON:"
$ time aicr snapshot --format json > /dev/null
$ echo "YAML:"
$ time aicr snapshot --format yaml > /dev/null
$ echo "Table:"
$ time aicr snapshot --format table > /dev/null
$ 
$ # Expected results:
$ # JSON:  ~50ms  (fastest, minimal processing)
$ # YAML:  ~80ms  (indentation overhead)
$ # Table: ~100ms (string formatting, column alignment)
$ 
$ # Benchmark with different cluster sizes
$ for pods in 10 100 1000 5000; do
$   # Scale test deployment
$   kubectl scale deployment test-app --replicas=$pods
$   kubectl wait --for=condition=ready pod -l app=test-app --timeout=5m
$   
$   echo "Cluster with $pods pods:"
$   time aicr snapshot --format json > /dev/null
$ done

Optimization Recommendations

Reduce String Allocations
Current: fmt.Sprintf("%s:%s", name, tag) allocates
Optimized: Use strings.Builder for concatenation
Savings: 20-30% fewer allocations in image collector

Preallocate Slices
Current: measurements := []Measurement\{\}
Optimized: measurements := make([]Measurement, 0, expectedSize)
Benefit: Avoids slice growth reallocations
When: Size predictable (e.g., GPU count known)
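
Both recommendations above can be combined in a few lines. This Go sketch builds `name:tag` references with a pre-sized `strings.Builder` and collects them into a slice whose capacity is set up front. `imageRef` and `imageRefs` are illustrative helpers, not functions from the AICR code base, and any gain should be confirmed with a profile before adopting the pattern.

```go
package main

import (
	"fmt"
	"strings"
)

// imageRef joins name and tag as "name:tag". Grow reserves the exact
// size so the builder allocates its buffer once.
func imageRef(name, tag string) string {
	var b strings.Builder
	b.Grow(len(name) + 1 + len(tag))
	b.WriteString(name)
	b.WriteByte(':')
	b.WriteString(tag)
	return b.String()
}

// imageRefs preallocates the result slice because its final length is
// known, which avoids growth reallocations during append.
func imageRefs(images [][2]string) []string {
	refs := make([]string, 0, len(images))
	for _, img := range images {
		refs = append(refs, imageRef(img[0], img[1]))
	}
	return refs
}

func main() {
	refs := imageRefs([][2]string{
		{"nvcr.io/nvidia/driver", "580.82.07"},
		{"nvcr.io/nvidia/gpu-operator", "v25.3.3"},
	})
	for _, r := range refs {
		fmt.Println(r)
	}
}
```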

Pool Large Objects
Use Case: Measurement structs allocated repeatedly
Implementation:

var measurementPool = sync.Pool{
    New: func() interface{} {
        return &measurement.Measurement{}
    },
}

m := measurementPool.Get().(*measurement.Measurement)
defer measurementPool.Put(m)

Reference: sync.Pool

Avoid Reflection
Current: encoding/json uses reflection
Optimized: Code-generated marshaling with easyjson
Benefit: 2-3x faster JSON serialization
Trade-off: Build complexity vs performance
Reference: easyjson

Batch API Operations
Current: Multiple API calls per collector
Optimized: Aggregate calls where possible
Example: List all pods once, filter in memory
Benefit: Reduces API server load, faster execution

Concurrent Collectors
Current: errgroup with limit
Tuning: Adjust limit based on collector type

g.SetLimit(runtime.NumCPU())  // CPU-bound collectors
g.SetLimit(runtime.NumCPU() * 2)  // I/O-bound collectors

Reference: errgroup SetLimit

Security Best Practices

Running as Non-Root

CLI:

$ # CLI runs as current user (no special privileges needed)
$ aicr snapshot  # Works as non-root
$ 
$ # Verify no setuid/setgid
$ ls -l $(which aicr)
$ # Expected: -rwxr-xr-x (not -rwsr-xr-x)
$ 
$ # Verify no capabilities
$ getcap $(which aicr)
$ # Expected: (no output)

Kubernetes Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: aicr
spec:
  template:
    spec:
      restartPolicy: Never  # Job pods must use Never or OnFailure
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: aicr
        image: ghcr.io/nvidia/aicr:latest
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        volumeMounts:
        - name: tmp
          mountPath: /tmp
      volumes:
      - name: tmp
        emptyDir: {}

Secrets Management

$ # Never log sensitive data
$ # aicr already filters passwords/tokens from output
$ 
$ # Verify no secrets in snapshot
$ aicr snapshot --format json | \
>   jq '[.measurements[].subtypes[].data | keys[] |
>        select(test("(?i)(password|token|key|secret)"))] | unique'
$ # Expected: [] (a single array of suspicious key names across all subtypes)
$ 
$ # Use environment variables for API credentials (future feature)
$ export AICR_API_TOKEN=$(vault kv get -field=token secret/aicr)
$ aicr recipe --os ubuntu --gpu h100
$ 
$ # Or use Kubernetes secrets
$ kubectl create secret generic aicr-api-creds \
>   --from-literal=token=$(vault kv get -field=token secret/aicr)
$ 
$ # Mount in pod (pod spec excerpt, YAML):
volumeMounts:
- name: api-creds
  mountPath: /var/run/secrets/aicr
  readOnly: true
volumes:
- name: api-creds
  secret:
    secretName: aicr-api-creds

Input Validation

CLI validates all inputs before processing:

$ # Invalid OS type
$ aicr recipe --os invalid_os
$ # Error: invalid os type "invalid_os", must be one of: ubuntu, rhel, cos, amazonlinux, talos
$ 
$ # Invalid version format
$ aicr recipe --osv -1.0
$ # Error: invalid version "-1.0": negative version components not allowed
$ 
$ # Invalid GPU type
$ aicr recipe --gpu h100@latest
$ # Error: invalid gpu type "h100@latest": special characters not allowed
$ 
$ # Invalid format
$ aicr snapshot --format xml
$ # Error: invalid format "xml", must be one of: json, yaml, table
$ 
$ # Path traversal prevention
$ aicr snapshot --output ../../etc/passwd
$ # Error: output path escapes current directory
$ 
$ # Verify validation in code (pkg/cli/recipe.go, Go):
$ #   if !isValidOS(os) {
$ #       return fmt.Errorf("invalid os type %q", os)
$ #   }
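
Two of the checks listed above, path traversal and special characters in the GPU type, can be approximated as follows. This Go sketch is an assumption about how such validation might look, not the CLI's actual implementation: `ValidateOutputPath`, `ValidateGPU`, and the allowed character set are hypothetical, and absolute paths are accepted because other examples in this document write to `/var/log/aicr`.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
	"strings"
)

// ValidateOutputPath rejects relative paths that climb out of the
// current directory once "." and ".." segments are resolved.
func ValidateOutputPath(p string) error {
	clean := filepath.Clean(p)
	if filepath.IsAbs(clean) {
		return nil
	}
	if clean == ".." || strings.HasPrefix(clean, ".."+string(filepath.Separator)) {
		return fmt.Errorf("output path escapes current directory")
	}
	return nil
}

// gpuPattern allows lowercase letters, digits, and hyphens (assumed).
var gpuPattern = regexp.MustCompile(`^[a-z0-9-]+$`)

// ValidateGPU rejects GPU type strings containing other characters.
func ValidateGPU(gpu string) error {
	if !gpuPattern.MatchString(gpu) {
		return fmt.Errorf("invalid gpu type %q: special characters not allowed", gpu)
	}
	return nil
}

func main() {
	fmt.Println(ValidateOutputPath("../../etc/passwd"))
	fmt.Println(ValidateOutputPath("out/snapshot.yaml"))
	fmt.Println(ValidateGPU("h100@latest"))
	fmt.Println(ValidateGPU("rtx-pro-6000"))
}
```

Cleaning the path before the prefix check matters: `a/../../b` looks local at first glance but resolves to `../b`.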

Network Security

$ # Verify TLS for API calls (future feature)
$ aicr recipe --os ubuntu --gpu h100 --debug 2>&1 | grep -i tls
$ # Expected: "Using TLS 1.3"
$ 
$ # Certificate pinning (future enhancement)
$ export AICR_API_CERT_FINGERPRINT="sha256:abc123..."
$ aicr recipe --os ubuntu --gpu h100
$ 
$ # Use corporate proxy with authentication
$ export HTTPS_PROXY=https://proxy.corp.com:8080
$ export AICR_PROXY_CA_CERT=/etc/ssl/certs/corp-ca.pem
$ aicr recipe --os ubuntu --gpu h100

Bundler Framework: Components and Extension

The bundler framework documented under Bundle Command defines how individual components are turned into deployment artifacts. This section drills into the architecture diagrams, a worked example (GPU Operator), observability surfaces, the add-a-component workflow, and conventions for new bundlers. For command flow, flags, and usage examples, see the Bundle Command section above.

Component Diagram

The Generate README node here is the per-component bundle/<component>/README.md. The root bundle/README.md is generated by the deployer (see Deployer Framework below).

Sequence Diagram

Worked Example: GPU Operator Bundler

The GPU Operator bundler generates a complete deployment bundle for NVIDIA GPU Operator, extracting configuration from recipe measurements.

Recipe Data Extraction

K8s Measurements (measurement.TypeK8s):

Image Subtype — Component versions:

- subtype: image
  data:
    gpu-operator: v25.3.3
    driver: 580.82.07
    container-toolkit: v1.17.8
    k8s-device-plugin: v0.17.4
    dcgm: 4.3.1-1
    dcgm-exporter: 4.3.1

Config Subtype — Boolean flags:

- subtype: config
  data:
    cdi: true
    mig: false
    rdma: true
    useOpenKernelModule: true

GPU Measurements (measurement.TypeGPU):

- subtype: smi
  data:
    driver-version: 580.82.07
    cuda-version: "13.1"

Template Files

values.yaml.tmpl — Helm chart values:

# Generated: {{ .Timestamp }}
# GPU Operator Helm Values

operator:
  version: {{ .GPUOperatorVersion }}

driver:
  enabled: {{ .EnableDriver }}
  version: {{ .DriverVersion }}
  useOpenKernelModule: {{ .UseOpenKernelModule }}
  repository: {{ .DriverRegistry }}

toolkit:
  version: {{ .NvidiaContainerToolkitVersion }}

devicePlugin:
  version: {{ .DevicePluginVersion }}

dcgm:
  version: {{ .DCGMVersion }}

dcgmExporter:
  version: {{ .DCGMExporterVersion }}

mig:
  strategy: {{ .MIGStrategy }}

gds:
  enabled: {{ .EnableGDS }}
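
A template like this is rendered with Go's `text/template`. The sketch below renders a three-field excerpt to show the mechanics. The `Values` struct, the `Render` function, and the sample values are assumptions for illustration; the bundler's real data type and rendering code may differ.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// valuesTmpl is a shortened excerpt of values.yaml.tmpl.
const valuesTmpl = `driver:
  enabled: {{ .EnableDriver }}
  version: {{ .DriverVersion }}
mig:
  strategy: {{ .MIGStrategy }}
`

// Values carries the fields the excerpt references (assumed shape).
type Values struct {
	EnableDriver  bool
	DriverVersion string
	MIGStrategy   string
}

// Render executes the template against v and returns the YAML text.
func Render(v Values) (string, error) {
	t, err := template.New("values").Parse(valuesTmpl)
	if err != nil {
		return "", fmt.Errorf("parse template: %w", err)
	}
	var b strings.Builder
	if err := t.Execute(&b, v); err != nil {
		return "", fmt.Errorf("render template: %w", err)
	}
	return b.String(), nil
}

func main() {
	out, err := Render(Values{
		EnableDriver:  true,
		DriverVersion: "580.82.07",
		MIGStrategy:   "single",
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Because the data is a struct, a template that references a field the struct lacks fails at execution time, which surfaces typos in tests rather than in generated bundles.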

install.sh.tmpl — Installation script:

#!/bin/bash
# Generated: {{ .Timestamp }}
# GPU Operator Installation Script

set -euo pipefail

NAMESPACE="{{ .Namespace }}"
HELM_REPO="{{ .HelmRepository }}"
HELM_CHART="{{ .HelmChart }}"

echo "Adding Helm repository..."
helm repo add nvidia "$HELM_REPO"
helm repo update

echo "Installing GPU Operator..."
helm install gpu-operator nvidia/gpu-operator \
  --namespace "$NAMESPACE" \
  --create-namespace \
  --values values.yaml \
  --wait

echo "Applying ClusterPolicy..."
kubectl apply -f manifests/clusterpolicy.yaml

echo "Installation complete!"

Observability

Metrics

Prometheus metrics exposed by the bundler framework:

# Duration histogram
bundler_make_duration_seconds{bundler_type="gpu-operator"} 0.245
# Total operations counter
bundler_make_total{bundler_type="gpu-operator",result="success"} 42
bundler_make_total{bundler_type="gpu-operator",result="error"} 3
# Files generated gauge
bundler_files_generated_total{bundler_type="gpu-operator"} 6
# Bytes generated gauge
bundler_bytes_generated_total{bundler_type="gpu-operator"} 15360
# Validation failures counter
bundler_validation_failures_total{bundler_type="gpu-operator"} 2

Structured Logging

slog integration for structured log output:

// Bundle generation start
slog.Debug("generating bundle",
    "bundler_type", bundlerType,
    "output_dir", outputDir,
)

// Bundle generation complete
slog.Debug("bundle generated successfully",
    "bundler_type", bundlerType,
    "files", len(result.Files),
    "bytes", result.TotalBytes,
    "duration", result.Duration,
)

Adding New Components

Adding a new component requires no Go code. Components are configured declaratively:

Add to Component Registry (recipes/registry.yaml):

components:
  - name: my-operator
    displayName: My Operator
    valueOverrideKeys:
      - myoperator
    helm:
      defaultRepository: https://charts.example.com
      defaultChart: example/my-operator
      defaultVersion: v1.0.0
    nodeScheduling:
      system:
        nodeSelectorPaths:
          - operator.nodeSelector
        tolerationPaths:
          - operator.tolerations

Create Values File (recipes/components/my-operator/values.yaml):

# My Operator Helm values
operator:
  replicas: 1
  image:
    repository: example/my-operator
    tag: v1.0.0

Add to Recipe Overlay (recipes/overlays/<overlay>.yaml):

componentRefs:
  - name: my-operator
    type: Helm
    version: v1.0.0
    source: https://charts.example.com
    valuesFile: components/my-operator/values.yaml

Test the Component:

# Generate recipe with new component
aicr recipe --service eks --accelerator h100 -o recipe.yaml

# Generate bundle
aicr bundle -r recipe.yaml -o ./bundles

# Verify output (the default Helm deployer writes values per component)
cat ./bundles/my-operator/values.yaml

See Bundler Development Guide for detailed documentation.

Best Practices

Template Design:

Keep templates simple and focused
Use descriptive variable names
Add comments for complex logic
Validate template rendering in tests
Don’t put business logic in templates
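
A minimal sketch of two of these points, assuming Go's text/template: the template only substitutes values, and rendering fails loudly when the data omits a key the template references. The template text, the render helper, and the data keys are hypothetical and not taken from the aicr source.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// valuesTmpl is a hypothetical template; it only substitutes values and
// contains no branching or computation.
const valuesTmpl = `operator:
  replicas: {{ .Replicas }}
  image:
    repository: {{ .Repository }}
    tag: {{ .Tag }}
`

// render executes the template against a map and fails on any key the
// template references but the data does not provide.
func render(data map[string]any) (string, error) {
	t, err := template.New("values").Option("missingkey=error").Parse(valuesTmpl)
	if err != nil {
		return "", fmt.Errorf("parse template: %w", err)
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("render template: %w", err)
	}
	return buf.String(), nil
}

func main() {
	out, err := render(map[string]any{
		"Replicas":   1,
		"Repository": "example/my-operator",
		"Tag":        "v1.0.0",
	})
	fmt.Print(out)
	fmt.Println("error:", err)

	// Incomplete data is rejected instead of rendering "<no value>".
	_, err = render(map[string]any{"Replicas": 1})
	fmt.Println("error:", err)
}
```

Without the missingkey=error option, a missing map key renders as the literal text "<no value>", which would silently produce a broken values file; a test that renders with incomplete data catches this early.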

Error Handling:

Use structured errors with context (pkg/errors)
Wrap errors with meaningful messages
Validate early (before starting generation)
Clean up resources on error
Don’t swallow errors silently

Testing:

Test with realistic recipe data
Use table-driven tests for coverage
Test error paths explicitly
Verify generated file content
Don’t skip integration tests

Performance:

Use parallel generation for multiple files
Stream large files instead of buffering
Reuse template instances when possible
Profile bundle generation for bottlenecks
Don’t generate synchronously without reason

Deployer Framework: GitOps Integration

The bundle command integrates with GitOps tools through the Deployer Framework, which generates deployment-specific artifacts alongside the standard bundle files.

Overview

Purpose: Generate GitOps-ready deployment artifacts that integrate with popular continuous delivery tools.

Supported Deployers:

Type	Description	Output
`helm`	(Default) Helm per-component bundle	`deploy.sh`, `<component>/values.yaml`, `<component>/README.md`
`argocd`	Argo CD Application manifests	`app-of-apps.yaml`, `<component>/application.yaml`

Key Feature: Deployment Order

All deployers respect the deploymentOrder field from the recipe, ensuring components are installed in the correct sequence:

# Recipe excerpt
deploymentOrder:
  - gpu-operator      # First
  - network-operator  # Second
  - nvsentinel        # Third

Deployer Architecture

Argo CD Deployer

Generates Argo CD Application manifests with proper sync ordering using multi-source Applications.

Ordering Mechanism: Uses argocd.argoproj.io/sync-wave annotation.

# gpu-operator/argocd/application.yaml (sync-wave: 0 = first)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "0"
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  sources:
    # Helm chart from upstream
    - repoURL: https://helm.ngc.nvidia.com/nvidia
      chart: gpu-operator
      targetRevision: v25.3.3
      helm:
        valueFiles:
          - $values/gpu-operator/values.yaml
    # Values from GitOps repo
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      ref: values
    # Additional manifests (if present)
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: gpu-operator/manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true

Output Structure:

bundles/
├── app-of-apps.yaml               # Parent Application (bundle root)
├── recipe.yaml                    # Recipe used to generate bundle
├── gpu-operator/
│   ├── values.yaml
│   ├── manifests/
│   └── argocd/
│       └── application.yaml       # sync-wave: 0
├── network-operator/
│   ├── values.yaml
│   └── argocd/
│       └── application.yaml       # sync-wave: 1
├── nvsentinel/
│   ├── values.yaml
│   └── argocd/
│       └── application.yaml       # sync-wave: 2
└── README.md                      # Argo CD deployment guide
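
The wave numbers in the tree above match each component's position in deploymentOrder. A minimal sketch of that mapping follows, assuming the wave is simply the zero-based index; the source shows the resulting values (0, 1, 2) but not the implementation, and syncWaveAnnotations is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strconv"
)

const syncWaveAnnotation = "argocd.argoproj.io/sync-wave"

// syncWaveAnnotations maps each component to the annotation set for its
// Application, using the component's index in deploymentOrder as the wave.
func syncWaveAnnotations(deploymentOrder []string) map[string]map[string]string {
	out := make(map[string]map[string]string, len(deploymentOrder))
	for i, name := range deploymentOrder {
		// Annotation values are strings, hence the quoted "0" in the manifest.
		out[name] = map[string]string{syncWaveAnnotation: strconv.Itoa(i)}
	}
	return out
}

func main() {
	order := []string{"gpu-operator", "network-operator", "nvsentinel"}
	waves := syncWaveAnnotations(order)
	for _, name := range order {
		fmt.Printf("%s: %s=%q\n", name, syncWaveAnnotation, waves[name][syncWaveAnnotation])
	}
}
```

Argo CD applies lower waves before higher ones, so gpu-operator (wave 0) is synced before network-operator (wave 1) and nvsentinel (wave 2).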

Helm Deployer (Default)

Generates a Helm per-component bundle with individual component directories.

Ordering Mechanism: Dependencies listed in Chart.yaml are deployed in order by Helm.

Output Structure:

bundles/
├── gpu-operator/
│   ├── values.yaml      # Component-specific Helm values
│   ├── scripts/
│   │   └── install.sh   # Installation script
│   ├── README.md        # Deployment instructions
│   └── checksums.txt    # SHA256 checksums
├── recipe.yaml          # Input recipe reference
└── deploy.sh            # Top-level deployment script

Deployer Data Flow

Usage Examples

# Default: Helm per-component bundle
aicr bundle -r recipe.yaml -o ./bundles

# Generate bundle with Argo CD Applications
aicr bundle -r recipe.yaml --deployer argocd -o ./bundles

# Argo CD with Git repository URL (sets repoURL in app-of-apps.yaml)
aicr bundle -r recipe.yaml --deployer argocd \
  --repo https://github.com/my-org/my-gitops-repo.git \
  -o ./bundles

# Combine with deployer
aicr bundle -r recipe.yaml \
  --deployer argocd \
  -o ./bundles

Deployment Order Implementation

The orderComponentsByDeployment function ensures components are processed in the correct sequence:

// orderComponentsByDeployment sorts components according to deploymentOrder.
// Components not in deploymentOrder are appended at the end in their original order.
func orderComponentsByDeployment(components []recipe.ComponentRef,
    order []string) []recipe.ComponentRef {

    if len(order) == 0 {
        return components
    }

    // Map each component name to its position in deploymentOrder
    orderMap := make(map[string]int, len(order))
    for i, name := range order {
        orderMap[name] = i
    }

    // Separate ordered and unordered components
    ordered := make([]recipe.ComponentRef, 0)
    unordered := make([]recipe.ComponentRef, 0)

    for _, c := range components {
        if _, exists := orderMap[c.Name]; exists {
            ordered = append(ordered, c)
        } else {
            unordered = append(unordered, c)
        }
    }

    // Sort ordered components by their position in deploymentOrder
    sort.SliceStable(ordered, func(i, j int) bool {
        return orderMap[ordered[i].Name] < orderMap[ordered[j].Name]
    })

    return append(ordered, unordered...)
}

Testing Deployers

Each deployer has tests verifying deployment order correctness:

func TestDeployer_Generate_DeploymentOrder(t *testing.T) {
    ctx := context.Background()
    tmpDir := t.TempDir() // removed automatically when the test ends

    // ComponentRefs are deliberately listed in the reverse of DeploymentOrder
    recipeResult := &recipe.RecipeResult{
        DeploymentOrder: []string{"gpu-operator", "network-operator"},
        ComponentRefs: []recipe.ComponentRef{
            {Name: "network-operator", Version: "v25.4.0"},
            {Name: "gpu-operator", Version: "v25.3.3"},
        },
    }

    d := NewDeployer()
    artifacts, err := d.Generate(ctx, recipeResult, tmpDir)
    require.NoError(t, err)

    // Verify ordering mechanism (sync-wave/dependsOn/README order)
    // ...
}

References

Official Documentation

urfave/cli Framework - CLI framework used by aicr
errgroup Patterns - Concurrent error handling
YAML v3 Library - YAML parsing and serialization
Structured Logging (slog) - Standard library logging
Context Package - Cancellation and deadlines

Kubernetes Integration

client-go Documentation - Official K8s client
Dynamic Client - Unstructured resource access
CronJob Best Practices - Scheduled job patterns
RBAC Authorization - Permission model

NVIDIA Tools

NVIDIA SMI - GPU management
NVML Library - Programmatic GPU access
CUDA Toolkit - GPU computing platform
GPU Operator - K8s GPU automation

Best Practices

Semantic Versioning - Version comparison algorithm
The Twelve-Factor App - Cloud-native application patterns
Release Engineering Best Practices - Google SRE
Go Code Review Comments - Idiomatic Go

Security

OWASP Secure Coding Practices
Kubernetes Pod Security Standards
NIST 800-190: Container Security
CIS Benchmarks - Security configuration baselines
