CLI Architecture


The aicr CLI provides command-line access to AICR configuration management capabilities.

Overview

The CLI provides a four-step workflow for optimizing GPU infrastructure, plus a query command for inspecting hydrated recipe values:

┌──────────────┐      ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Snapshot   │─────▶│    Recipe    │─────▶│   Validate   │─────▶│    Bundle    │
└──────────────┘      └──────────────┘      └──────────────┘      └──────────────┘
Capture system        Generate optimized    Check cluster         Create deployment
configuration         recommendations       compatibility         artifacts
                             │
                       ┌─────┴──────┐
                       │   Query    │
                       └────────────┘
                       Extract hydrated
                       config values

Step 1: Snapshot Command

Captures system configuration:

  • Operating system: grub, kmod, sysctl, /etc/os-release
  • SystemD services: containerd, docker, kubelet (service state and configuration)
  • Kubernetes: API server version, container images, ClusterPolicy custom resource
  • GPU hardware: driver version, CUDA libraries, MIG configuration, device properties
  • Node topology (cluster-wide taints and labels)

Output destinations:

  • File: --output system.yaml (local filesystem)
  • Stdout: Default (can be piped to other commands)
  • ConfigMap: --output cm://namespace/name (Kubernetes ConfigMap using Kubernetes API)

Agent deployment:

A Kubernetes Job runs on GPU nodes and writes the snapshot to a ConfigMap via the Kubernetes API. It requires a ServiceAccount with ConfigMap create/update permissions (a Role in the target namespace) and does not require a PersistentVolume.

Step 2: Recipe Command

Generates optimized configuration recipes with two modes:

  • Query Mode: Direct recipe generation from system parameters (OS, GPU, K8s, etc.)
  • Snapshot Mode: Analyzes captured snapshots and generates tailored recipes based on workload intent (training/inference)

Input Options:

  • Query parameters: --os ubuntu --gpu gb200 --service eks (direct recipe generation)
  • Snapshot file: --snapshot system.yaml (analyze captured snapshot)
  • ConfigMap: --snapshot cm://namespace/name (read from Kubernetes)

Output Options:

  • File: --output recipe.yaml (write to file)
  • Stdout: Default behavior (pipe to bundle command)
  • ConfigMap: --output cm://namespace/name (store in Kubernetes)

Step 3: Validate Command

Validates recipe constraints against actual system measurements from a snapshot.

Input sources:

  • Recipe file: --recipe recipe.yaml (local filesystem)
  • Recipe URL: --recipe https://example.com/recipe.yaml (HTTP/HTTPS)
  • Recipe ConfigMap: --recipe cm://namespace/name (Kubernetes ConfigMap)
  • Snapshot file: --snapshot snapshot.yaml (local filesystem)
  • Snapshot ConfigMap: --snapshot cm://namespace/name (Kubernetes ConfigMap)

Constraint format:

Constraints use fully qualified measurement paths: {Type}.{Subtype}.{Key}

  • K8s.server.version - Kubernetes server version
  • OS.release.ID - Operating system identifier
  • OS.release.VERSION_ID - OS version
  • OS.sysctl./proc/sys/kernel/osrelease - Kernel version

Supported operators:

  • >= 1.30 - Greater than or equal (version comparison)
  • <= 1.33 - Less than or equal (version comparison)
  • > 1.30, < 2.0 - Strict comparison
  • == ubuntu, != rhel - Equality operators
  • ubuntu - Exact string match (no operator)
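
For intuition, here is a minimal sketch of how a single constraint expression might be evaluated against a measured value; evalConstraint and its cmp parameter are illustrative helpers, not the validator's actual API:

package constraints

import "strings"

// evalConstraint checks one constraint expression (e.g. ">= 1.30" or "ubuntu")
// against the actual value taken from the snapshot. cmp compares two version
// strings and returns <0, 0, or >0. Illustrative sketch only.
func evalConstraint(expr, actual string, cmp func(a, b string) int) bool {
    expr = strings.TrimSpace(expr)
    switch {
    case strings.HasPrefix(expr, ">="):
        return cmp(actual, strings.TrimSpace(expr[2:])) >= 0
    case strings.HasPrefix(expr, "<="):
        return cmp(actual, strings.TrimSpace(expr[2:])) <= 0
    case strings.HasPrefix(expr, "=="):
        return actual == strings.TrimSpace(expr[2:])
    case strings.HasPrefix(expr, "!="):
        return actual != strings.TrimSpace(expr[2:])
    case strings.HasPrefix(expr, ">"):
        return cmp(actual, strings.TrimSpace(expr[1:])) > 0
    case strings.HasPrefix(expr, "<"):
        return cmp(actual, strings.TrimSpace(expr[1:])) < 0
    default:
        // Bare value: exact string match
        return actual == expr
    }
}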

Output:

  • Validation result with summary (passed/failed/skipped counts)
  • Individual constraint results with expected vs actual values
  • Status: pass, fail, or partial (some skipped)

CI/CD integration:

By default, the command exits with non-zero status when constraints fail (ideal for CI/CD). To run in informational mode without failing:

$aicr validate -r recipe.yaml -s cm://gpu-operator/aicr-snapshot --fail-on-error=false

Step 4: Bundle Command

Generates deployment artifacts from recipes:

  • Helm values files (values.yaml)
  • Kubernetes manifests (ClusterPolicy, NICClusterPolicy, etc.)
  • SHA256 checksum file
  • README documentation: root bundle/README.md is generated by the deployer; per-component bundle/<component>/README.md is generated by each component bundler

Input sources:

  • Recipe file: --recipe recipe.yaml (local filesystem)
  • ConfigMap: --recipe cm://namespace/name (Kubernetes ConfigMap)

Output: Local directory only. ConfigMap output is not supported for bundles.

Current bundlers:

  • GPU Operator: Generates GPU Operator Helm values and ClusterPolicy manifest
  • Network Operator: Generates Network Operator Helm values and NICClusterPolicy manifest
  • Cert-Manager: Generates cert-manager Helm values for certificate management
  • NVSentinel: Generates NVSentinel Helm values
  • Nodewright: Generates Nodewright Operator Helm values and Nodewright CR manifest for node optimization

Value overrides:

The --set flag allows runtime customization of generated bundle values:

$aicr bundle -r recipe.yaml \
> --set gpuoperator:gds.enabled=true \
> --set gpuoperator:driver.version=570.86.16
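
For intuition, a minimal sketch of how the path portion of a --set override (e.g. gds.enabled=true under the gpuoperator scope) could be written into a nested values map; applyOverride is an illustrative helper, not the CLI's actual implementation:

package overrides

import "strings"

// applyOverride sets a dotted path (e.g. "gds.enabled") to a raw string value
// inside a nested map, creating intermediate maps as needed. This sketch keeps
// the value as a string; the real CLI may coerce types.
func applyOverride(values map[string]any, path, value string) {
    keys := strings.Split(path, ".")
    cur := values
    for _, k := range keys[:len(keys)-1] {
        next, ok := cur[k].(map[string]any)
        if !ok {
            next = map[string]any{}
            cur[k] = next
        }
        cur = next
    }
    cur[keys[len(keys)-1]] = value
}

// Example: applyOverride(values, "gds.enabled", "true")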

Node scheduling options:

The bundle command supports node selector and toleration flags for controlling workload placement:

$# Schedule system components (operators, controllers) on specific nodes
$aicr bundle -r recipe.yaml \
> --system-node-selector nodeGroup=system-pool \
> --system-node-toleration dedicated=system:NoSchedule
$
$# Schedule GPU workloads (drivers, device plugins) on GPU nodes
$aicr bundle -r recipe.yaml \
> --accelerated-node-selector nvidia.com/gpu.present=true \
> --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule

Flags:

  • --system-node-selector key=value – Node selector for system components (repeatable)
  • --system-node-toleration key=value:effect – Toleration for system components (repeatable)
  • --accelerated-node-selector key=value – Node selector for GPU nodes (repeatable)
  • --accelerated-node-toleration key=value:effect – Toleration for GPU nodes (repeatable)
  • --nodes N – Estimated number of GPU nodes (bundle-time only; written to paths in registry under nodeScheduling.nodeCountPaths)

These flags apply selectors/tolerations to bundler-specific paths (e.g., GPU Operator uses operator.nodeSelector and daemonsets.nodeSelector). The --nodes value is applied to paths listed in the registry under nodeScheduling.nodeCountPaths.

Execution model:

  • Bundlers run concurrently (parallel execution)
  • All components from the recipe are bundled automatically
  • Errors from any bundler cause immediate cancellation via context propagation

Testing: End-to-end workflow validated by Chainsaw tests in tests/chainsaw/cli/

Architecture Diagram

ConfigMap Integration

The CLI supports Kubernetes-native ConfigMap storage using the cm://namespace/name URI scheme:

Benefits:

  • No file dependencies - Direct Kubernetes API integration
  • Agent-friendly - Jobs can write snapshots without volumes
  • Pipeline integration - CI/CD can read/write ConfigMaps
  • Multi-cluster - Share snapshots/recipes across clusters
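
A minimal sketch of parsing a cm://namespace/name reference into its parts, assuming the simple two-segment form shown above (parseConfigMapRef is a hypothetical helper):

package cmref

import (
    "fmt"
    "strings"
)

// parseConfigMapRef splits "cm://namespace/name" into namespace and name.
func parseConfigMapRef(ref string) (namespace, name string, err error) {
    const prefix = "cm://"
    if !strings.HasPrefix(ref, prefix) {
        return "", "", fmt.Errorf("not a ConfigMap reference: %q", ref)
    }
    parts := strings.SplitN(strings.TrimPrefix(ref, prefix), "/", 2)
    if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
        return "", "", fmt.Errorf("expected cm://namespace/name, got %q", ref)
    }
    return parts[0], parts[1], nil
}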

RBAC Requirements:

  • ConfigMap read/write permissions in target namespace
  • ServiceAccount with appropriate Role/RoleBinding
  • See Agent Deployment for details

Component Details

Entry Point: cmd/aicr/main.go

Minimal entry point that delegates to the CLI package:

package main

import "github.com/NVIDIA/aicr/pkg/cli"

func main() {
    cli.Execute()
}

Root Command: pkg/cli/root.go

Responsibilities:

  • Command registration and routing
  • Version information injection (via ldflags)
  • Global flag handling (debug mode, log formatting)
  • Logging mode selection and initialization

Key Features:

  • Version info: version, commit, date (overridden at build time)
  • Three logging modes:
    • CLI Mode (default): Minimal output for users (SetDefaultCLILogger)
    • Text Mode (--debug): Full metadata for debugging (SetDefaultLoggerWithLevel)
    • JSON Mode (--log-json): Structured logs for automation (SetDefaultStructuredLoggerWithLevel)
  • Logger selection logic:
    switch {
    case c.Bool("log-json"):
        logging.SetDefaultStructuredLoggerWithLevel(name, version, logLevel)
    case isDebug:
        logging.SetDefaultLoggerWithLevel(name, version, logLevel)
    default:
        logging.SetDefaultCLILogger(logLevel)
    }
  • Shell completion support
  • Command listing for auto-completion

Snapshot Command: pkg/cli/snapshot.go

Captures comprehensive system configuration snapshots.

Command Flow

Detailed Data Flow

Usage Examples

$# Output to stdout in JSON format
$aicr snapshot
$
$# Save to file in YAML format
$aicr snapshot --output system.yaml --format yaml
$
$# Human-readable table format
$aicr snapshot --format table
$
$# ConfigMap output (Kubernetes-native)
$aicr snapshot --output cm://gpu-operator/aicr-snapshot

Agent Deployment Pattern

The snapshot command can be deployed as a Kubernetes Job for automated cluster auditing:

Deployment:

apiVersion: batch/v1
kind: Job
metadata:
  name: aicr
  namespace: gpu-operator
spec:
  template:
    spec:
      serviceAccountName: aicr
      containers:
      - name: aicr
        image: ghcr.io/nvidia/aicr:latest
        command:
        - aicr
        - snapshot
        - --output
        - cm://gpu-operator/aicr-snapshot
      restartPolicy: Never

RBAC Requirements:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: aicr
  namespace: gpu-operator
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: aicr
  namespace: gpu-operator
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: aicr
  namespace: gpu-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: aicr
subjects:
- kind: ServiceAccount
  name: aicr
  namespace: gpu-operator # Must match ServiceAccount namespace

Key Points:

  • No volumes needed - writes directly via Kubernetes API
  • RBAC RoleBinding must reference correct namespace
  • ConfigMap is created automatically if it doesn’t exist
  • Supports update pattern (overwrite existing snapshots)
  • RBAC and Job resources are created programmatically by pkg/k8s/agent

Recipe Command: pkg/cli/recipe.go

Generates optimized configuration recipes based on environment parameters.

Command Flow

Detailed Data Flow

Recipe Matching Algorithm

The recipe matching uses an asymmetric rule-based query system where overlay criteria (rules) match against user queries (candidates):

# Overlay file (eks.yaml)
spec:
  criteria:
    service: eks   # Rule: query must have service=eks
    # Other fields empty = wildcards (match any query value)

Asymmetric Matching Rules:

  1. All non-empty fields in the overlay criteria must be satisfied by the query
  2. Empty overlay field → Wildcard (matches any query value)
  3. Query “any” field → Only matches overlay “any” (does NOT match specific overlays)
  4. Version fields use semantic version equality with precision awareness

This asymmetric behavior ensures generic queries (e.g., --service eks --intent training) don’t match overly specific recipes (e.g., recipes requiring accelerator: gb200).
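
A minimal sketch of the asymmetric check for one criteria field, assuming plain string fields; the precedence of the wildcard rule over the "any" rule and the helper name are assumptions for illustration:

package match

// fieldMatches applies the asymmetric rules for a single criteria field:
//   overlay == ""  -> wildcard, matches any query value (rule 2)
//   query == "any" -> only matches an overlay value of "any" (rule 3)
//   otherwise      -> exact match required (rule 1)
func fieldMatches(overlay, query string) bool {
    if overlay == "" {
        return true
    }
    if query == "any" {
        return overlay == "any"
    }
    return overlay == query
}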

Usage Examples

$# Basic recipe for Ubuntu with gb200 GPU
$aicr recipe --os ubuntu --gpu gb200
$
$# Full specification with all parameters
$aicr recipe \
> --service eks \
> --accelerator gb200 \
> --intent training \
> --os ubuntu \
> --nodes 8 \
> --format yaml \
> --output recipe.yaml
$
$# Inference workload on GKE
$aicr recipe --service gke --gpu gb200 --intent inference
$
$# Snapshot mode - analyze captured snapshot for training
$aicr recipe --snapshot system.yaml --intent training
$
$# Snapshot mode - analyze for inference optimization
$aicr recipe \
> --snapshot cluster-snapshot.yaml \
> --intent inference \
> --format yaml \
> --output recipe.yaml

Recipe Command Modes

The recipe command supports two modes of operation:

Query Mode (Default)

Direct recipe generation from environment parameters:

Snapshot Mode

Analyze captured snapshots and generate tailored recipes:

Query Extraction from Snapshot

When using snapshot mode, the recipe builder extracts environment parameters from the snapshot:

From OS Measurements:

  • release subtype → OS family (ubuntu, rhel, cos, amazonlinux, talos)

From Kubernetes Measurements:

  • server subtype → K8s service provider (eks, gke, aks) inferred from images

From GPU Measurements:

  • Product Name → GPU type detection (H100, GB200, B200, A100, L40, RTX PRO 6000)
  • Maps product names to normalized accelerator types for recipe matching
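
A minimal sketch of normalizing a reported product name to one of the accelerator types listed above, using simple substring matching (the mapping and helper are illustrative, not the builder's actual logic):

package accel

import "strings"

// normalizeAccelerator maps a GPU product name (e.g. "NVIDIA H100 80GB HBM3")
// to a recipe accelerator type. GB200 is checked before B200 because the
// former contains the latter as a substring.
func normalizeAccelerator(productName string) string {
    p := strings.ToUpper(productName)
    switch {
    case strings.Contains(p, "GB200"):
        return "gb200"
    case strings.Contains(p, "B200"):
        return "b200"
    case strings.Contains(p, "H100"):
        return "h100"
    case strings.Contains(p, "A100"):
        return "a100"
    case strings.Contains(p, "L40"):
        return "l40"
    case strings.Contains(p, "RTX PRO 6000"):
        return "rtx-pro-6000"
    default:
        return "any" // assumed fallback when no known product is detected
    }
}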

Intent Types:

  • training – Optimize for high throughput, batch processing, multi-GPU orchestration
  • inference – Optimize for low latency, single-request performance, efficient batching
  • any – Provides general-purpose recommendations applicable to both workloads

External Data Directory

The --data flag enables extending embedded recipe data with external files:

Requirements:

  • External directory must contain registry.yaml
  • No symlinks allowed (security)
  • Max file size: 10MB per file

Merge Rules:

  • registry.yaml: Components merged by name (external overrides embedded)
  • All other files: External replaces embedded if path matches
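
A minimal sketch of the registry merge rule (external components override embedded ones by name, new names are appended); the Component type and mergeComponents helper are simplified stand-ins:

package data

// Component is a simplified stand-in for a registry.yaml entry.
type Component struct {
    Name string
    // ... other registry fields
}

// mergeComponents overlays external components onto embedded ones by name.
func mergeComponents(embedded, external []Component) []Component {
    merged := append([]Component(nil), embedded...)
    byName := make(map[string]int, len(merged))
    for i, c := range merged {
        byName[c.Name] = i
    }
    for _, c := range external {
        if i, ok := byName[c.Name]; ok {
            merged[i] = c // external overrides embedded
        } else {
            merged = append(merged, c)
        }
    }
    return merged
}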

Usage Examples

$# Query mode - generate recipe from parameters
$aicr recipe --os ubuntu --service eks --accelerator h100 --intent training
$
$# Snapshot mode - analyze snapshot for training workloads
$aicr recipe --snapshot system.yaml --intent training
$
$# Snapshot mode with output file
$aicr recipe -s system.yaml -i inference -o recipe.yaml
$
$# Query mode with full specification
$aicr recipe \
> --service eks \
> --accelerator gb200 \
> --intent training \
> --os ubuntu \
> --platform kubeflow \
> --nodes 8 \
> --format yaml
$
$# Use external data directory
$aicr recipe --service eks --accelerator h100 --data ./my-custom-data
$
$# Bundle with external data
$aicr bundle --recipe recipe.yaml --data ./my-custom-data --output ./bundles

Recipe Output Structure

apiVersion: aicr.nvidia.com/v1alpha1
kind: Recipe
metadata:
  version: v1.0.0
  created: "2025-01-15T10:30:00Z"
  appliedOverlays:
  - base
  - eks
  - eks-training
  - gb200-eks-training
  - gb200-eks-ubuntu-training
criteria:
  service: eks
  accelerator: gb200
  intent: training
  os: ubuntu
  nodes: 8
componentRefs:
  - name: gpu-operator
    version: v25.3.3
    order: 1
    repository: https://helm.ngc.nvidia.com/nvidia
  - name: network-operator
    version: v25.4.0
    order: 2
    repository: https://helm.ngc.nvidia.com/nvidia
constraints:
  driver:
    version: "580.82.07"
    cudaVersion: "13.1"

Error Handling

  • Query Mode:

    • Invalid parameter values: Returns error with supported options
    • Missing required parameters: Allows “any” as default fallback
    • No matching overlays: Returns recipe with base configuration
  • Snapshot Mode:

    • Missing snapshot file: File not found error with path
    • Invalid snapshot format: Parse error with details
    • Invalid intent: Returns error with supported intent types (training, inference, any)
    • Extraction failures: Best-effort extraction with partial criteria

Common Errors:

  • Unknown output format: Error with supported formats list (json, yaml)

Query Command: pkg/cli/query.go

Extracts specific values from the fully hydrated recipe configuration using dot-path selectors.

Command Flow

Hydration Process

The query command builds a fully hydrated map[string]any from the RecipeResult:

  1. Recipe-level fields (criteria, metadata, deploymentOrder, constraints) are mapped directly
  2. Each ComponentRef is expanded into a component map with metadata fields (name, chart, source, version, etc.)
  3. GetValuesForComponent is called per component to merge base values, overlay values, and inline overrides
  4. The merged values are inlined under each component’s values key

Selector Resolution

The selector uses dot-delimited path walking. Leading dots are stripped (yq-style), so .components.X and components.X are equivalent. An empty selector or . returns the entire hydrated map.
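
A minimal sketch of that dot-path walk over the hydrated map; selectPath is an illustrative helper, whereas the real implementation lives in pkg/recipe/query.go:

package query

import (
    "fmt"
    "strings"
)

// selectPath walks a dot-delimited selector through nested maps. A leading dot
// is stripped, and an empty selector returns the whole document.
func selectPath(doc map[string]any, selector string) (any, error) {
    selector = strings.TrimPrefix(selector, ".")
    if selector == "" {
        return doc, nil
    }
    var cur any = doc
    for _, key := range strings.Split(selector, ".") {
        m, ok := cur.(map[string]any)
        if !ok {
            return nil, fmt.Errorf("selector %q: %q is not a map", selector, key)
        }
        if cur, ok = m[key]; !ok {
            return nil, fmt.Errorf("selector %q: key %q not found", selector, key)
        }
    }
    return cur, nil
}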

Usage Examples

$# Scalar value — plain text output
$aicr query --service eks --accelerator h100 --intent training \
> --selector components.gpu-operator.values.driver.version
$
$# Subtree — YAML output
$aicr query --service eks --accelerator h100 --intent training \
> --selector components.gpu-operator.values.driver
$
$# Shell-friendly for scripting
$VERSION=$(aicr query --service eks --accelerator h100 --intent training \
> --selector components.gpu-operator.values.driver.version)

Implementation: pkg/recipe/query.go (HydrateResult, Select)

Bundle Command: pkg/cli/bundle.go

Generates deployment-ready bundles (Helm values, Kubernetes manifests, installation scripts) from recipes.

Command Flow

Detailed Data Flow

Bundler Data Flow

Simplified Architecture (RecipeResult-to-Template):

Key Simplification: Single RecipeResult path (no dual Recipe/RecipeResult routing)
Data Flow: RecipeResult → Values Map + ScriptData → Templates
Templates: Use index .Values "key" for config, .Script.* for metadata

Bundler Architecture

BaseBundler Helper Pattern
// Bundlers embed BaseBundler and override Make()
type Bundler struct {
    *bundler.BaseBundler // Provides common functionality
}

func NewBundler() *Bundler {
    return &Bundler{
        BaseBundler: bundler.NewBaseBundler(bundlerType, templatesFS),
    }
}

// Self-register at init time using MustRegister
func init() {
    bundler.MustRegister("gpu-operator", NewBundler())
}
RecipeResult-Based Data Access
// Get component reference from RecipeResult
component := input.GetComponentRef(Name)
values := input.GetValuesForComponent(Name)

// Generate script metadata
scriptData := generateScriptData(component, values)

// Pass values map to templates (config values)
b.GenerateFileFromTemplate(ctx, GetTemplate, "values.yaml", path, values, 0644)

// Pass ScriptData to scripts (metadata)
b.GenerateFileFromTemplate(ctx, GetTemplate, "install.sh", path, scriptData, 0755)

// Pass combined data to README
readmeData := map[string]interface{}{"Values": values, "Script": scriptData}
b.GenerateFileFromTemplate(ctx, GetTemplate, "README.md", path, readmeData, 0644)

Data Flow: RecipeResult → Values/ScriptData → Template

RecipeResult → GetComponentRef(Name) → ComponentRef
→ GetValuesForComponent(Name) → values map
→ generateScriptData() → ScriptData struct
→ Template ({{ index .Values "key" }} or {{ .Script.Namespace }})
Registry Pattern
// Dynamic bundler discovery
bundlers := defaultRegistry.GetAll()  // Returns all registered bundlers
b := defaultRegistry.Get(bundlerType) // Returns a specific bundler

// MustRegister panics on duplicate types (fail-fast)
bundler.MustRegister("gpu-operator", NewBundler())

DefaultBundler Options:

  • WithBundlerTypes([]BundleType) – Specify bundler types (empty = all registered)
  • WithFailFast(bool) – Stop on first error (default: false/collect all)
  • WithConfig(*Config) – Provide bundler configuration
  • WithRegistry(*Registry) – Use custom bundler registry

Execution:

  • Parallel execution by default: Uses errgroup.WithContext for concurrent execution
    • All bundlers run concurrently when no types specified
    • Faster for multiple bundlers
    • Context cancellation propagates to all bundlers
    • Bundlers are stateless (thread-safe by design)
    • BaseBundler provides thread-safe operations
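
A minimal sketch of that execution model using errgroup, with context cancellation propagating to the remaining bundlers when one fails; the Bundler interface shown here is a simplified stand-in:

package bundle

import (
    "context"

    "golang.org/x/sync/errgroup"
)

// Bundler is a simplified stand-in for the real bundler interface.
type Bundler interface {
    Make(ctx context.Context) error
}

// runAll executes every bundler concurrently; the first error cancels the
// shared context so the other bundlers can stop early.
func runAll(ctx context.Context, bundlers []Bundler) error {
    g, ctx := errgroup.WithContext(ctx)
    for _, b := range bundlers {
        b := b // capture loop variable
        g.Go(func() error {
            return b.Make(ctx)
        })
    }
    return g.Wait()
}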

Architecture Benefits:

  • 75% less code per bundler (BaseBundler eliminates boilerplate)
  • 34% less test code (TestHarness standardizes testing)
  • 15+ internal helpers for recipe parsing
  • Automatic registration via init() functions
  • Fail-fast on duplicate bundler types

Usage Examples

$# Generate all recipe components (parallel by default)
$aicr bundle --recipe recipe.yaml --output ./bundles
$
$# Use short flags
$aicr bundle -r recipe.yaml -o ./bundles
$
$# Override values at generation time
$aicr bundle -r recipe.yaml \
> --set gpuoperator:gds.enabled=true \
> --set gpuoperator:driver.version=570.86.16 \
> -o ./bundles
$
$# Override values for multiple components
$aicr bundle -r recipe.yaml \
> --set gpuoperator:mig.strategy=mixed \
> --set networkoperator:rdma.enabled=true \
> -o ./bundles
$
$# Schedule system components on system node pool
$aicr bundle -r recipe.yaml \
> --system-node-selector nodeGroup=system-pool \
> --system-node-toleration dedicated=system:NoSchedule \
> -o ./bundles
$
$# Schedule GPU workloads on labeled GPU nodes
$aicr bundle -r recipe.yaml \
> --accelerated-node-selector nvidia.com/gpu.present=true \
> --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule \
> -o ./bundles

Bundle Output Structure

./bundles/
├── gpu-operator/
│ ├── values.yaml # Helm chart values
│ ├── manifests/
│ │ └── clusterpolicy.yaml # ClusterPolicy CR
│ ├── scripts/
│ │ ├── install.sh # Installation script
│ │ └── uninstall.sh # Cleanup script
│ ├── README.md # Deployment instructions
│ └── checksums.txt # SHA256 verification
├── network-operator/
│ ├── values.yaml
│ ├── manifests/
│ │ └── nicclusterpolicy.yaml
│ ├── scripts/
│ ├── README.md
│ └── checksums.txt
├── cert-manager/
│ ├── values.yaml
│ ├── README.md
│ └── checksums.txt
├── nvsentinel/
│ ├── values.yaml
│ ├── README.md
│ └── checksums.txt
└── nodewright-operator/
├── values.yaml
├── manifests/
│ └── nodewright.yaml
├── README.md
└── checksums.txt
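
For orientation, a minimal sketch of how a checksums.txt for a component directory could be produced (SHA-256 in the common "<hash>  <path>" text format); this is illustrative, not necessarily how the bundlers write theirs:

package checksum

import (
    "crypto/sha256"
    "fmt"
    "io"
    "os"
    "path/filepath"
    "strings"
)

// writeChecksums hashes every regular file under dir and writes
// "<sha256>  <relative path>" lines to dir/checksums.txt.
func writeChecksums(dir string) error {
    var b strings.Builder
    err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
        if err != nil || info.IsDir() || info.Name() == "checksums.txt" {
            return err
        }
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        h := sha256.New()
        if _, err := io.Copy(h, f); err != nil {
            return err
        }
        rel, _ := filepath.Rel(dir, path)
        fmt.Fprintf(&b, "%x  %s\n", h.Sum(nil), rel)
        return nil
    })
    if err != nil {
        return err
    }
    return os.WriteFile(filepath.Join(dir, "checksums.txt"), []byte(b.String()), 0o644)
}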

Error Handling

Validation Errors:

  • Missing recipe file: File not found error with path
  • Invalid recipe format: Parse error with details
  • Invalid bundler type: Error with list of supported types
  • Empty measurements: Recipe validation failure

Execution Errors:

  • FailFast=false (default): Collects all errors, continues execution
    • Returns partial results with error list
    • Exit code indicates failure count
  • FailFast=true: Stops on first bundler error
    • Returns immediately with error
    • Subsequent bundlers not executed

Common Error Scenarios:

$# Missing recipe file
$$ aicr bundle --output ./bundles
$Error: required flag "recipe" not set
$
$# Bundler failures (FailFast=false)
$$ aicr bundle -r recipe.yaml
$Error: bundle generation completed with errors: 1/2 bundlers failed

CLI Integration

The bundle command integrates with the CLI through:

  1. Shared Serializer: Uses same serializer.FromFile for recipe loading
  2. Structured Logging: Consistent slog structured logging
  3. Context Propagation: Respects context cancellation
  4. Error Patterns: Uses same error handling conventions

Log Output Example:

INFO generating bundle recipeFilePath=recipe.yaml outputDir=./bundles bundlerTypes=[gpu-operator]
INFO starting bundle generation bundler_count=1 output_dir=./bundles
INFO bundler completed bundler_type=gpu-operator files=5 size_bytes=12458 duration=45ms
INFO bundle generation complete summary="Generated 5 files (12 KB) in 45ms. Success: 1/1 bundlers."
INFO bundle generation completed success=1 errors=0 duration_sec=0.045 summary="Generated 5 files (12 KB) in 45ms. Success: 1/1 bundlers."

Shared Infrastructure

Collector Factory Pattern

The CLI uses the Factory Pattern for collector instantiation, enabling:

  • Testability: Inject mock collectors for unit tests
  • Flexibility: Easy to add new collector types
  • Encapsulation: Hide collector creation complexity

type Factory interface {
    CreateSystemDCollector() Collector
    CreateOSCollector() Collector
    CreateKubernetesCollector() Collector
    CreateGPUCollector() Collector
}

Serializer Abstraction

Output formatting is abstracted through the serializer.Serializer interface:

type Serializer interface {
    Serialize(data interface{}) error
}

Implementations:

  • JSON: encoding/json with 2-space indent
  • YAML: gopkg.in/yaml.v3
  • Table: text/tabwriter for columnar display
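
A minimal sketch of one implementation of that interface, writing YAML to an io.Writer; the type and constructor names here are illustrative, not the package's actual API:

package serializer

import (
    "io"

    "gopkg.in/yaml.v3"
)

// YAMLSerializer writes serialized data to w as YAML.
type YAMLSerializer struct {
    w io.Writer
}

func NewYAMLSerializer(w io.Writer) *YAMLSerializer {
    return &YAMLSerializer{w: w}
}

// Serialize implements the Serializer interface shown above.
func (s *YAMLSerializer) Serialize(data interface{}) error {
    enc := yaml.NewEncoder(s.w)
    defer enc.Close()
    return enc.Encode(data)
}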

Measurement Data Model

All collected data uses a unified measurement.Measurement structure:

type Measurement struct {
    Type     Type      // os, k8s, systemd, gpu
    Subtypes []Subtype // Named collections of readings
}

type Subtype struct {
    Name    string             // grub, kmod, sysctl, server, image, etc.
    Data    map[string]Reading // Key-value readings
    Context map[string]string  // Human-readable descriptions
}

type Reading struct {
    Value interface{} // Actual value (int, string, bool, float64)
}
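
For orientation, an OS sysctl reading expressed with the types above; the values are made up and the snippet assumes Type accepts a string literal:

m := Measurement{
    Type: "os",
    Subtypes: []Subtype{{
        Name: "sysctl",
        Data: map[string]Reading{
            "/proc/sys/kernel/osrelease": {Value: "6.8.0-49-generic"},
        },
        Context: map[string]string{
            "/proc/sys/kernel/osrelease": "kernel release string",
        },
    }},
}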

Error Handling

CLI Error Strategy

  1. Flag Validation: User-friendly error messages for invalid flags
  2. Version Parsing: Specific error types (ErrNegativeComponent, etc.)
  3. Collector Failures: Log errors, continue with partial data where possible
  4. Serialization Errors: Fatal - abort and report
  5. Exit Codes: Non-zero exit code on any failure

Example Error Messages

$# Invalid accelerator type
$$ aicr recipe --accelerator invalid-gpu
$[cli] command failed: error=[INTERNAL] error building recipe: [INVALID_REQUEST] error parsing criteria: [INVALID_REQUEST] failed to apply criteria option: [INVALID_REQUEST] failed to parse accelerator type: [INVALID_REQUEST] invalid accelerator type: invalid-gpu exitCode=8
$
$# Unknown output format
$$ aicr snapshot --format xml
$Error: unknown output format: "xml"
$
$# Missing required parameters
$$ aicr recipe
$# Still succeeds - generates base recipe with no overlays

Performance Characteristics

Snapshot Command

  • Parallel Collection: All collectors run concurrently via errgroup
  • Typical Duration: 100-500ms depending on cluster size
  • Memory Usage: ~10-50MB for typical workloads
  • Scalability: O(n) with number of pods/nodes for K8s collector

Recipe Command

  • Store Loading: Once per process (cached via sync.Once)
  • Typical Duration: <10ms after initial load
  • Memory Usage: ~5-10MB (embedded YAML + parsed structure)
  • Scalability: O(m) with number of overlays (typically <100)

Build Configuration

Version Injection via ldflags

Build-time version information injection:

VERSION ?= $(shell git describe --tags --always --dirty)
COMMIT  ?= $(shell git rev-parse --short HEAD)
DATE    ?= $(shell date -u +%Y-%m-%dT%H:%M:%SZ)

LDFLAGS := -X github.com/NVIDIA/aicr/pkg/cli.version=$(VERSION)
LDFLAGS += -X github.com/NVIDIA/aicr/pkg/cli.commit=$(COMMIT)
LDFLAGS += -X github.com/NVIDIA/aicr/pkg/cli.date=$(DATE)

go build -ldflags="$(LDFLAGS)" -o bin/aicr ./cmd/aicr
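
The -X flags above target package-level variables in pkg/cli; a sketch of what those variables likely look like (the default values here are assumptions):

// pkg/cli (sketch): build-time metadata overridden via -ldflags "-X ...".
package cli

var (
    version = "dev"     // set from pkg/cli.version
    commit  = "none"    // set from pkg/cli.commit
    date    = "unknown" // set from pkg/cli.date
)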

Testing Strategy

Unit Tests

  • Flag parsing and validation
  • Version parsing and error handling
  • Query building from command flags
  • Serializer format selection

Integration Tests

  • Mock collectors for deterministic output
  • Full command execution with fake factory
  • Output format validation

Example Test Structure

func TestSnapshotCommand(t *testing.T) {
    // Create mock factory
    mockFactory := &MockFactory{
        k8s:     mockK8sCollector,
        systemd: mockSystemDCollector,
        os:      mockOSCollector,
        gpu:     mockGPUCollector,
    }

    // Execute snapshot with mock
    snapshotter := NodeSnapshotter{
        Factory:    mockFactory,
        Serializer: &bytes.Buffer{},
    }

    err := snapshotter.Measure(ctx)
    assert.NoError(t, err)
}

Dependencies

External Libraries

  • github.com/urfave/cli/v3 - CLI framework
  • golang.org/x/sync/errgroup - Concurrent error handling
  • gopkg.in/yaml.v3 - YAML parsing
  • log/slog - Structured logging

Internal Packages

  • pkg/collector - System data collection
  • pkg/measurement - Data model
  • pkg/recipe - Recipe building
  • pkg/version - Semantic versioning
  • pkg/serializer - Output formatting
  • pkg/logging - Logging configuration
  • pkg/snapshotter - Snapshot orchestration

Future Enhancements

Short-Term (< 3 months)

  1. Caching Layer
    Rationale: Reduce latency for repeated aicr snapshot calls in scripts
    Implementation: sync.Map with TTL-based eviction using time.AfterFunc
    Trade-off: Stale data risk vs 5-10x performance improvement
    Reference: sync.Map

  2. Differential Snapshots
    Use Case: CI/CD pipelines detecting configuration drift
    Implementation: github.com/google/go-cmp/cmp for deep comparison
    Output: JSON Patch (RFC 6902) format for machine consumption
    CLI: aicr diff baseline.yaml current.yaml --format patch

  3. Measurement Filtering
    Use Case: Extract only GPU data without K8s overhead
    CLI: aicr snapshot --filter gpu,os --exclude k8s
    Implementation: Post-collection filtering before serialization
    Performance: Saves 60-70% execution time when K8s excluded

  4. Batch Mode
    Use Case: Fleet-wide configuration auditing (100s of nodes)
    Implementation: Worker pool with errgroup.SetLimit()
    CLI: aicr snapshot --nodes nodes.txt --workers 10 --output results/
    Reference: errgroup Limits

Mid-Term (3-6 months)

  1. Plugin System
    Rationale: Custom collectors without forking codebase
    Interface: type Collector interface { Collect(context.Context) (Measurement, error) }
    Options: Go plugins (unstable across versions) or WASM (safe, portable)
    Security: Sandboxed execution with restricted syscalls
    Reference: WebAssembly System Interface

  2. Configuration Files
    Use Case: Avoid repeating --os, --gpu flags
    Format: YAML following XDG Base Directory spec
    Location: ~/.config/aicr/config.yaml (Linux/macOS), %APPDATA%\aicr\config.yaml (Windows)
    Example:

    defaults:
      os: ubuntu
      gpu: h100
      format: yaml
    server:
      url: https://recipe-api.example.com
  3. Watch Mode
    Implementation: Hybrid of fsnotify + periodic polling
    CLI: aicr snapshot --watch --interval 30s --on-change ./alert.sh
    Output: Stream of JSON diffs to stdout
    Use Case: Real-time monitoring with alerting

  4. Schema Validation
    Use Case: Ensure snapshots conform to API version spec
    Implementation: Embed JSON Schema in binary with go:embed
    Library: github.com/santhosh-tekuri/jsonschema/v5 (fastest Go validator)
    CLI: aicr validate --schema v1 snapshot.json

Long-Term (6-12 months)

  1. gRPC Mode
    Rationale: Better streaming, 3-5x smaller payloads than JSON
    Implementation: Bi-directional streaming with protobuf
    Trade-off: Added complexity (proto definitions) vs performance gains
    Reference: gRPC Go

  2. Distributed Tracing
    Use Case: Debug performance issues across collectors
    Implementation: OpenTelemetry SDK with span per collector
    Exporter: OTLP to Jaeger/Tempo
    CLI: aicr snapshot --trace --trace-endpoint localhost:4317
    Reference: OpenTelemetry Go

  3. Policy Enforcement
    Use Case: Block non-compliant configs in CI/CD
    Implementation: Embed OPA (github.com/open-policy-agent/opa)
    CLI: aicr validate --policy policy.rego snapshot.yaml
    Exit Code: 0 = pass, 1 = policy violations
    Reference: OPA Go Integration

  4. Cloud Storage Integration
    Use Case: Centralized storage for fleet management
    CLI: aicr snapshot --upload s3://bucket/snapshots/$(hostname).yaml
    Implementation: AWS SDK v2 with resumable uploads
    Authentication: IAM roles, service accounts, credential chain
    Reference: AWS SDK for Go V2

Production Deployment Patterns

Pattern 1: CI/CD Integration

Use Case: Automated configuration validation in build pipelines

GitLab CI Example:

validate_gpu_config:
  stage: test
  image: ghcr.io/nvidia/aicr:latest
  script:
    - aicr snapshot --format json > snapshot.json
    # Validate against known-good baseline
    - diff -u expected_snapshot.json snapshot.json
    # Or use OPA policy (future enhancement)
    # - aicr validate --policy policies/gpu_baseline.rego snapshot.json
  only:
    - merge_requests
  artifacts:
    when: on_failure
    paths:
      - snapshot.json

GitHub Actions Example:

name: Validate GPU Configuration
on:
  pull_request:
    paths:
      - 'ansible/**'
      - 'terraform/**'

jobs:
  validate:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4

      - name: Install aicr
        run: |
          curl -sfL https://raw.githubusercontent.com/.../installer | bash -s --
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Capture snapshot
        run: aicr snapshot --format yaml --output snapshot.yaml

      - name: Generate recipe
        run: aicr recipe --os ubuntu --gpu h100 > recipe.yaml

      - name: Compare configurations
        run: |
          yq eval '.measurements[] | select(.type=="GPU")' snapshot.yaml > actual_gpu.yaml
          yq eval '.measurements[] | select(.type=="GPU")' recipe.yaml > expected_gpu.yaml
          diff -u expected_gpu.yaml actual_gpu.yaml || \
            (echo "::error::GPU configuration drift detected" && exit 1)

      - name: Upload artifact
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: configuration-drift
          path: |
            snapshot.yaml
            recipe.yaml

Jenkins Pipeline:

pipeline {
    agent { label 'gpu-node' }

    stages {
        stage('Snapshot') {
            steps {
                sh 'aicr snapshot --format json > snapshot.json'
            }
        }

        stage('Validate') {
            steps {
                script {
                    def snapshot = readJSON file: 'snapshot.json'
                    def gpuDriver = snapshot.measurements
                        .find { it.type == 'GPU' }
                        .subtypes.find { it.subtype == 'smi' }
                        .data.'driver-version'

                    if (gpuDriver != '570.158.01') {
                        error("Incorrect GPU driver: ${gpuDriver}")
                    }
                }
            }
        }
    }

    post {
        always {
            archiveArtifacts artifacts: 'snapshot.json', fingerprint: true
        }
    }
}

Pattern 2: Scheduled Auditing

Use Case: Nightly configuration drift detection across fleet

Kubernetes CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: aicr-audit
  namespace: monitoring
spec:
  schedule: "0 2 * * *"       # 2 AM daily
  concurrencyPolicy: Forbid   # Prevent overlapping runs
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: aicr-audit
        spec:
          serviceAccountName: aicr
          nodeSelector:
            node-role.kubernetes.io/gpu: "true"
          tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
          containers:
          - name: aicr
            image: ghcr.io/nvidia/aicr:v0.6.4
            command:
            - /bin/sh
            - -c
            - |
              set -e
              TIMESTAMP=$(date +%Y%m%d-%H%M%S)
              HOSTNAME=$(hostname)

              # Capture snapshot
              aicr snapshot --format yaml > /tmp/snapshot.yaml

              # Store as ConfigMap with retention
              kubectl create configmap \
                "aicr-snapshot-${HOSTNAME}-${TIMESTAMP}" \
                --from-file=snapshot=/tmp/snapshot.yaml \
                --dry-run=client -o yaml | \
                kubectl apply -f -

              # Cleanup old snapshots (keep last 30 days)
              kubectl get configmaps -l aicr-snapshot=true \
                --sort-by=.metadata.creationTimestamp | \
                head -n -30 | \
                xargs -r kubectl delete configmap
            resources:
              limits:
                memory: 256Mi
              requests:
                cpu: 100m
                memory: 128Mi
          restartPolicy: OnFailure

Systemd Timer (Bare Metal):

# /etc/systemd/system/aicr-audit.service
[Unit]
Description=AICR Configuration Audit
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/aicr snapshot --format json --output /var/log/aicr/snapshot-%Y%m%d.json
User=aicr
Group=aicr

# Hardening
PrivateTmp=true
NoNewPrivileges=true
ReadOnlyPaths=/usr /etc
ReadWritePaths=/var/log/aicr

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/aicr-audit.timer
[Unit]
Description=AICR Audit Timer

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

Enable with:

$sudo systemctl enable --now aicr-audit.timer
$sudo systemctl list-timers aicr-audit.timer

Pattern 3: Fleet Management

Use Case: Collect snapshots from 100s of GPU nodes in parallel

Ansible Playbook:

---
- name: Collect AICR Snapshots from GPU Fleet
  hosts: gpu_nodes
  gather_facts: yes
  serial: 10   # Process 10 nodes at a time
  tasks:
    - name: Ensure aicr is installed
      stat:
        path: /usr/local/bin/aicr
      register: aicr_binary
      failed_when: not aicr_binary.stat.exists

    - name: Collect snapshot
      shell: aicr snapshot --format json
      register: snapshot
      changed_when: false
      failed_when: snapshot.rc != 0

    - name: Upload to S3
      aws_s3:
        bucket: fleet-snapshots
        object: "{{ inventory_hostname }}/{{ ansible_date_time.iso8601 }}.json"
        content: "{{ snapshot.stdout }}"
        mode: put
      delegate_to: localhost
      run_once: false

    - name: Validate against baseline
      shell: |
        echo '{{ snapshot.stdout }}' | \
          jq '.measurements[] | select(.type=="GPU") | .subtypes[] |
              select(.subtype=="smi") | .data."driver-version"'
      register: driver_version
      failed_when: driver_version.stdout != '"570.158.01"'
      changed_when: false

- name: Generate Fleet Report
  hosts: localhost
  tasks:
    - name: Download all snapshots
      aws_s3:
        bucket: fleet-snapshots
        mode: list
      register: s3_objects

    - name: Aggregate results
      script: scripts/aggregate_snapshots.py
      args:
        snapshots: "{{ s3_objects.s3_keys }}"

Terraform Provisioning:

resource "null_resource" "aicr_snapshot" {
  count = length(var.gpu_instance_ids)

  provisioner "remote-exec" {
    inline = [
      "aicr snapshot --format json > /tmp/snapshot.json",
      "aws s3 cp /tmp/snapshot.json s3://fleet-snapshots/${self.id}/"
    ]

    connection {
      type        = "ssh"
      host        = element(var.gpu_instance_ips, count.index)
      user        = "ubuntu"
      private_key = file("~/.ssh/id_rsa")
    }
  }

  triggers = {
    instance_id = element(var.gpu_instance_ids, count.index)
    timestamp   = timestamp()
  }
}

data "aws_s3_objects" "snapshots" {
  bucket     = "fleet-snapshots"
  depends_on = [null_resource.aicr_snapshot]
}

output "snapshot_count" {
  value = length(data.aws_s3_objects.snapshots.keys)
}

Pattern 4: Real-Time Monitoring

Use Case: Continuous configuration monitoring with Prometheus alerting

Prometheus Exporter (future enhancement):

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"

    "github.com/NVIDIA/aicr/pkg/snapshotter"
)

var (
    gpuDriverVersion = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "aicr_gpu_driver_version",
            Help: "NVIDIA driver version (encoded as float)",
        },
        []string{"node", "gpu_model"},
    )

    k8sVersion = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "aicr_k8s_version",
            Help: "Kubernetes version (encoded)",
        },
        []string{"node"},
    )
)

func init() {
    prometheus.MustRegister(gpuDriverVersion, k8sVersion)
}

func collectMetrics() {
    hostname, _ := os.Hostname()

    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        snapshot, err := snapshotter.Measure(ctx)
        cancel()

        if err != nil {
            log.Printf("Snapshot failed: %v", err)
            continue
        }

        // Extract and export GPU driver version
        for _, m := range snapshot.Measurements {
            if m.Type == "GPU" {
                for _, st := range m.Subtypes {
                    if st.Subtype == "smi" {
                        version := st.Data["driver-version"]
                        encoded := encodeVersion(version) // encodeVersion: helper mapping a version string to a float
                        gpuModel := st.Data["gpu-name"]
                        gpuDriverVersion.WithLabelValues(hostname, gpuModel).Set(encoded)
                    }
                }
            }
        }
    }
}

func main() {
    go collectMetrics()
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":9090", nil)
}

Prometheus Alerting Rules:

groups:
- name: aicr_configuration
  interval: 60s
  rules:
  - alert: GPUDriverVersionMismatch
    expr: |
      count(count by (aicr_gpu_driver_version) (aicr_gpu_driver_version)) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Multiple GPU driver versions detected in cluster"
      description: "{{ $value }} different driver versions found"

  - alert: KubernetesVersionSkew
    expr: |
      abs(aicr_k8s_version - scalar(avg(aicr_k8s_version))) > 0.01
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Kubernetes version skew detected on {{ $labels.node }}"
      description: "Node version differs from cluster average"

Advanced Usage Patterns

Snapshot Diffing with jq

$#!/bin/bash
$# Capture baseline before changes
$aicr snapshot --format json > baseline.json
$
$# Apply configuration changes (Ansible, Terraform, etc.)
$# ...
$
$# Capture new snapshot
$aicr snapshot --format json > current.json
$
$# Diff specific sections
$echo "=== GPU Configuration Changes ==="
$diff -u \
> <(jq -S '.measurements[] | select(.type=="GPU")' baseline.json) \
> <(jq -S '.measurements[] | select(.type=="GPU")' current.json)
$
$echo "=== Kernel Parameter Changes ==="
$diff -u \
> <(jq -S '.measurements[] | select(.type=="os") | .subtypes[] |
> select(.subtype=="sysctl")' baseline.json) \
> <(jq -S '.measurements[] | select(.type=="os") | .subtypes[] |
> select(.subtype=="sysctl")' current.json)
$
$# Count total changes
$changes=$(diff <(jq -S . baseline.json) <(jq -S . current.json) | grep -c '^[<>]')
$echo "Total configuration changes: $changes"

Recipe Generation Pipeline

$#!/bin/bash
$# Generate recipes for all supported configurations
$
$set -euo pipefail
$
$OUTPUT_DIR="recipes"
$mkdir -p "$OUTPUT_DIR"
$
$# GPU types from NVIDIA product line
$GPU_TYPES=("h100" "gb200" "b200" "a100" "l40" "rtx-pro-6000")
$
$# Kubernetes services
$K8S_SERVICES=("eks" "gke" "aks" "oke" "kind" "lke")
$
$# OS distributions
$OS_TYPES=("ubuntu" "rhel" "cos")
$
$total=0
$for gpu in "${GPU_TYPES[@]}"; do
$ for service in "${K8S_SERVICES[@]}"; do
$ for os in "${OS_TYPES[@]}"; do
$ output="${OUTPUT_DIR}/${os}-${service}-${gpu}.yaml"
$
$ # Generate recipe
$ if aicr recipe --os "$os" --service "$service" --gpu "$gpu" \
> --format yaml > "$output" 2>/dev/null; then
$ echo "✓ Generated $output"
$ ((total++))
$ else
$ echo "✗ Failed: $os $service $gpu"
$ fi
$ done
$ done
$done
$
$echo "Generated $total recipes"
$
$# Validate all recipes
$echo "Validating recipes..."
$find "$OUTPUT_DIR" -name '*.yaml' -exec yq eval '.' {} \; > /dev/null
$echo "All recipes valid"
$
$# Create index
$cat > "$OUTPUT_DIR/README.md" <<EOF
$# Configuration Recipes
$
$Generated on $(date -Iseconds)
$
$Total recipes: $total
$
$## Available Configurations
$
$| OS | Service | GPU | File |
$|----|---------|-----|------|
$EOF
$
$find "$OUTPUT_DIR" -name '*.yaml' -type f | sort | while read -r file; do
$ base=$(basename "$file" .yaml)
$ IFS='-' read -ra parts <<< "$base"
$ echo "| ${parts[0]} | ${parts[1]} | ${parts[2]} | $file |" >> "$OUTPUT_DIR/README.md"
$done

Automated Remediation

$#!/bin/bash
$# Apply recommended configuration from recipe
$# WARNING: Modifies system configuration - use with caution
$
$set -euo pipefail
$
$# Capture current state
$current=$(aicr snapshot --format json)
$
$# Generate recommended recipe
$recipe=$(aicr recipe --os ubuntu --gpu h100 --format json)
$
$# Extract recommended GRUB parameters
$recommended_grub=$(echo "$recipe" | jq -r '
> .measurements[] |
> select(.type=="os") |
> .subtypes[] |
> select(.subtype=="grub") |
> .data |
> to_entries[] |
> "\(.key)=\(.value)"' | tr '\n' ' ')
$
$# Extract current GRUB parameters
$current_grub=$(echo "$current" | jq -r '
> .measurements[] |
> select(.type=="os") |
> .subtypes[] |
> select(.subtype=="grub") |
> .data |
> to_entries[] |
> "\(.key)=\(.value)"' | tr '\n' ' ')
$
$# Show diff
$echo "Current GRUB parameters:"
$echo "$current_grub"
$echo ""
$echo "Recommended GRUB parameters:"
$echo "$recommended_grub"
$echo ""
$
$# Prompt for confirmation
$read -p "Apply changes? (yes/no): " confirm
$if [[ "$confirm" != "yes" ]]; then
$ echo "Aborted"
$ exit 0
$fi
$
$# Apply GRUB changes (requires root)
$sudo grubby --update-kernel=ALL --args="$recommended_grub"
$echo "GRUB configuration updated. Reboot required."
$
$# Apply sysctl changes
$echo "$recipe" | jq -r '
> .measurements[] |
> select(.type=="os") |
> .subtypes[] |
> select(.subtype=="sysctl") |
> .data |
> to_entries[] |
> "\(.key) = \(.value)"' | \
>sudo tee /etc/sysctl.d/99-aicr-recommended.conf
$
$sudo sysctl --system
$echo "Sysctl parameters applied"
$
$# Log changes
$echo "$(date -Iseconds): Applied AICR recommendations" | \
>sudo tee -a /var/log/aicr-remediation.log

Troubleshooting Guide

Issue: “nvidia-smi not found”

Symptoms: GPU measurements empty, error in logs
Root Cause: NVIDIA driver not installed or not in PATH

Diagnosis:

$# Check if nvidia-smi exists
$which nvidia-smi
$# Expected: /usr/bin/nvidia-smi
$
$# Verify driver installation
$nvidia-smi --version
$# Expected: NVIDIA-SMI 570.158.01
$
$# Check kernel modules
$lsmod | grep nvidia
$# Expected: nvidia, nvidia_uvm, nvidia_modeset
$
$# Verify device nodes
$ls -l /dev/nvidia*
$# Expected: /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm

Resolution:

$# Ubuntu: Install NVIDIA driver
$sudo apt-get update
$sudo apt-get install -y nvidia-driver-570
$
$# RHEL: Install from CUDA repo
$sudo dnf config-manager --add-repo \
> https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
$sudo dnf install -y nvidia-driver:570
$
$# Verify installation
$sudo nvidia-smi
$
$# If PATH issue, add to shell profile
$echo 'export PATH="/usr/bin:$PATH"' >> ~/.bashrc
$source ~/.bashrc

Issue: “Kubernetes API server unreachable”

Symptoms: K8s measurements empty, “connection refused” error
Root Cause: Not running in cluster, or kubeconfig missing/invalid

Diagnosis:

$# Verify cluster connectivity
$kubectl cluster-info
$# Expected: Kubernetes control plane is running at https://...
$
$# Check kubeconfig
$echo $KUBECONFIG
$cat ~/.kube/config
$
$# Test API access
$kubectl get nodes
$# Expected: List of nodes
$
$# Check service account (in-cluster)
$ls -l /var/run/secrets/kubernetes.io/serviceaccount/
$# Expected: token, ca.crt, namespace

Resolution:

$# Option 1: Set KUBECONFIG explicitly
$export KUBECONFIG=~/.kube/config
$aicr snapshot
$
$# Option 2: Copy admin kubeconfig
$sudo cp /etc/kubernetes/admin.conf ~/.kube/config
$sudo chown $(id -u):$(id -g) ~/.kube/config
$
$# Option 3: Use service account token (in-cluster)
$kubectl create serviceaccount aicr
$kubectl create clusterrolebinding aicr --clusterrole=view --serviceaccount=default:aicr
$
$# Option 4: Debug with kubectl proxy
$kubectl proxy &
$export KUBERNETES_SERVICE_HOST=localhost
$export KUBERNETES_SERVICE_PORT=8001
$aicr snapshot

Issue: “Snapshot too slow (> 5s)”

Symptoms: Long execution time, timeouts in CI/CD
Root Cause: Large cluster (1000s of pods), slow API server, many GPUs

Diagnosis:

$# Enable debug logging to identify slow collectors
$aicr --debug snapshot 2>&1 | grep -E 'collector|duration'
$# Expected output shows timing per collector:
$# time="..." level=debug msg="k8s collector finished" duration=3.2s
$# time="..." level=debug msg="gpu collector finished" duration=0.8s
$
$# Check cluster size
$kubectl get pods --all-namespaces --no-headers | wc -l
$# Large: > 1000 pods
$
$# Check GPU count
$nvidia-smi --list-gpus | wc -l
$# Many: > 8 GPUs
$
$# Profile execution
$time aicr snapshot > /dev/null

Resolution:

$# Option 1: Filter to specific collectors (future enhancement)
$aicr snapshot --filter gpu,os # Skip K8s (saves 60-70% time)
$
$# Option 2: Increase timeout (future enhancement)
$aicr snapshot --timeout 30s
$
$# Option 3: Use caching for repeated calls
$aicr snapshot > /tmp/snapshot.json
$# Reuse /tmp/snapshot.json for subsequent analysis
$
$# Option 4: Optimize K8s collector
$# Reduce API calls by using label selectors (code change):
$# clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
># LabelSelector: "app=gpu-operator",
># })
$
$# Option 5: Run in parallel with errgroup limit
$# Already implemented in code, but can tune:
$# g.SetLimit(runtime.NumCPU()) // Current: 2

Issue: “Out of memory during snapshot”

Symptoms: Process killed, OOMKilled in K8s, segfault
Root Cause: Large measurement data (10k+ pods, many images)

Diagnosis:

$# Check memory usage during snapshot
$/usr/bin/time -v aicr snapshot > /dev/null 2>&1
$# Look for "Maximum resident set size"
$
$# Monitor memory in real-time
$# Terminal 1:
$watch -n 1 'ps aux | grep aicr'
$# Terminal 2:
$aicr snapshot
$
$# In Kubernetes, check OOMKilled events
$kubectl get events --field-selector reason=OOMKilling

Resolution:

$# Option 1: Use streaming serialization (already implemented)
$# Data never fully materialized in memory
$aicr snapshot --format json > snapshot.json
$
$# Option 2: Increase memory limit in Kubernetes
$kubectl set resources deployment aicr-agent \
> --limits=memory=1Gi \
> --requests=memory=512Mi
$
$# Option 3: Filter measurements (future enhancement)
$aicr snapshot --filter gpu,os # Exclude large K8s data
$
$# Option 4: Optimize code to reduce allocations
$# Use object pooling for repeated structs:
$var measurementPool = sync.Pool{
> New: func() interface{} {
> return &measurement.Measurement{}
> },
>}
$
$# Option 5: Process in batches (code change needed)
$# For K8s pods, paginate API calls:
$pods, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
> Limit: 100,
> Continue: continueToken,
>})

Performance Tuning

CPU Profiling

$# Build with profiling enabled
$mkdir -p bin
$go build -o bin/aicr cmd/aicr/main.go
$
$# Capture CPU profile
$./bin/aicr snapshot --cpuprofile=cpu.prof
$
$# Analyze profile
$go tool pprof cpu.prof
$(pprof) top10
$# Shows top 10 functions by CPU time
$
$(pprof) list collectContainerImages
$# Shows line-by-line CPU usage in specific function
$
$(pprof) web
$# Opens interactive graph in browser (requires graphviz)
$
$# Example output interpretation:
$# If collectContainerImages is > 50% CPU:
$# - Optimize pod iteration
$# - Reduce string allocations
$# - Cache image parsing results

Memory Profiling

$# Capture memory profile
$./bin/aicr snapshot --memprofile=mem.prof
$
$# Analyze allocations
$go tool pprof -alloc_space mem.prof
$(pprof) top10
$# Shows top 10 functions by allocations
$
$(pprof) list BuildRecipe
$# Check for unnecessary allocations
$
$# Example fixes:
$# Before: strings.Split() allocates slice
$# After: strings.Index() + slicing avoids allocation
$
$# Before: fmt.Sprintf("%s:%s", name, tag)
$# After: var b strings.Builder; b.WriteString(name); b.WriteString(":");

Benchmarking

$# Benchmark snapshot performance (10 iterations)
$for i in {1..10}; do
$ time aicr snapshot --format json > /dev/null
$done 2>&1 | grep real | awk '{print $2}' | \
>sed 's/0m//' | sed 's/s//' | \
>awk '{sum+=$1; count++} END {printf "Average: %.3fs\n", sum/count}'
$
$# Compare formats
$echo "JSON:"
$time aicr snapshot --format json > /dev/null
$echo "YAML:"
$time aicr snapshot --format yaml > /dev/null
$echo "Table:"
$time aicr snapshot --format table > /dev/null
$
$# Expected results:
$# JSON: ~50ms (fastest, minimal processing)
$# YAML: ~80ms (indentation overhead)
$# Table: ~100ms (string formatting, column alignment)
$
$# Benchmark with different cluster sizes
$for pods in 10 100 1000 5000; do
$ # Scale test deployment
$ kubectl scale deployment test-app --replicas=$pods
$ kubectl wait --for=condition=ready pod -l app=test-app --timeout=5m
$
$ echo "Cluster with $pods pods:"
$ time aicr snapshot --format json > /dev/null
$done

Optimization Recommendations

  1. Reduce String Allocations
    Current: fmt.Sprintf("%s:%s", name, tag) allocates
    Optimized: Use strings.Builder for concatenation
    Savings: 20-30% fewer allocations in image collector

  2. Preallocate Slices
    Current: measurements := []Measurement{}
    Optimized: measurements := make([]Measurement, 0, expectedSize)
    Benefit: Avoids slice growth reallocations
    When: Size predictable (e.g., GPU count known)

  3. Pool Large Objects
    Use Case: Measurement structs allocated repeatedly
    Implementation:

    var measurementPool = sync.Pool{
        New: func() interface{} {
            return &measurement.Measurement{}
        },
    }

    m := measurementPool.Get().(*measurement.Measurement)
    defer measurementPool.Put(m)

    Reference: sync.Pool

  4. Avoid Reflection
    Current: encoding/json uses reflection
    Optimized: Code-generated marshaling with easyjson
    Benefit: 2-3x faster JSON serialization
    Trade-off: Build complexity vs performance
    Reference: easyjson

  5. Batch API Operations
    Current: Multiple API calls per collector
    Optimized: Aggregate calls where possible
    Example: List all pods once, filter in memory
    Benefit: Reduces API server load, faster execution

  6. Concurrent Collectors
    Current: errgroup with limit
    Tuning: Adjust limit based on collector type

    g.SetLimit(runtime.NumCPU())     // CPU-bound collectors
    g.SetLimit(runtime.NumCPU() * 2) // I/O-bound collectors

    Reference: errgroup SetLimit

Security Best Practices

Running as Non-Root

CLI:

$# CLI runs as current user (no special privileges needed)
$aicr snapshot # Works as non-root
$
$# Verify no setuid/setgid
$ls -l $(which aicr)
$# Expected: -rwxr-xr-x (not -rwsr-xr-x)
$
$# Verify no capabilities
$getcap $(which aicr)
$# Expected: (no output)

Kubernetes Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: aicr
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: aicr
        image: ghcr.io/nvidia/aicr:latest
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        volumeMounts:
        - name: tmp
          mountPath: /tmp
      volumes:
      - name: tmp
        emptyDir: {}

Secrets Management

$# Never log sensitive data
$# aicr already filters passwords/tokens from output
$
$# Verify no secrets in snapshot
$aicr snapshot --format json | \
> jq '.measurements[].subtypes[].data |
> keys | map(select(test("(?i)(password|token|key|secret)"))) |
> unique'
$# Expected: []
$
$# Use environment variables for API credentials (future feature)
$export AICR_API_TOKEN=$(vault kv get -field=token secret/aicr)
$aicr recipe --os ubuntu --gpu h100
$
$# Or use Kubernetes secrets
$kubectl create secret generic aicr-api-creds \
> --from-literal=token=$(vault kv get -field=token secret/aicr)
$
$# Mount in pod:
$volumeMounts:
$- name: api-creds
$ mountPath: /var/run/secrets/aicr
$ readOnly: true
$volumes:
$- name: api-creds
$ secret:
$ secretName: aicr-api-creds

Input Validation

CLI validates all inputs before processing:

$# Invalid OS type
$aicr recipe --os invalid_os
$# Error: invalid os type "invalid_os", must be one of: ubuntu, rhel, cos, amazonlinux, talos
$
$# Invalid version format
$aicr recipe --osv -1.0
$# Error: invalid version "-1.0": negative version components not allowed
$
$# Invalid GPU type
$aicr recipe --gpu h100@latest
$# Error: invalid gpu type "h100@latest": special characters not allowed
$
$# Invalid format
$aicr snapshot --format xml
$# Error: invalid format "xml", must be one of: json, yaml, table
$
$# Path traversal prevention
$aicr snapshot --output ../../etc/passwd
$# Error: output path escapes current directory
$
$# Verify validation in code:
$# pkg/cli/recipe.go:
$if !isValidOS(os) {
> return fmt.Errorf("invalid os type %q", os)
>}

Network Security

$# Verify TLS for API calls (future feature)
$aicr recipe --os ubuntu --gpu h100 --debug 2>&1 | grep -i tls
$# Expected: "Using TLS 1.3"
$
$# Certificate pinning (future enhancement)
$export AICR_API_CERT_FINGERPRINT="sha256:abc123..."
$aicr recipe --os ubuntu --gpu h100
$
$# Use corporate proxy with authentication
$export HTTPS_PROXY=https://proxy.corp.com:8080
$export AICR_PROXY_CA_CERT=/etc/ssl/certs/corp-ca.pem
$aicr recipe --os ubuntu --gpu h100

Bundler Framework: Components and Extension

The bundler framework documented under Bundle Command defines how individual components are turned into deployment artifacts. This section drills into the architecture diagrams, a worked example (GPU Operator), observability surfaces, the add-a-component workflow, and conventions for new bundlers. For command flow, flags, and usage examples, see the Bundle Command section above.

Component Diagram

The Generate README node here is the per-component bundle/<component>/README.md. The root bundle/README.md is generated by the deployer (see Deployer Framework below).

Sequence Diagram

Worked Example: GPU Operator Bundler

The GPU Operator bundler generates a complete deployment bundle for NVIDIA GPU Operator, extracting configuration from recipe measurements.

Recipe Data Extraction

K8s Measurements (measurement.TypeK8s):

  1. Image Subtype — Component versions:

    - subtype: image
      data:
        gpu-operator: v25.3.3
        driver: 580.82.07
        container-toolkit: v1.17.8
        k8s-device-plugin: v0.17.4
        dcgm: 4.3.1-1
        dcgm-exporter: 4.3.1
  2. Config Subtype — Boolean flags:

    - subtype: config
      data:
        cdi: true
        mig: false
        rdma: true
        useOpenKernelModule: true

GPU Measurements (measurement.TypeGPU):

- subtype: smi
  data:
    driver-version: 580.82.07
    cuda-version: "13.1"

Template Files

values.yaml.tmpl — Helm chart values:

# Generated: {{ .Timestamp }}
# GPU Operator Helm Values

operator:
  version: {{ .GPUOperatorVersion }}

driver:
  enabled: {{ .EnableDriver }}
  version: {{ .DriverVersion }}
  useOpenKernelModule: {{ .UseOpenKernelModule }}
  repository: {{ .DriverRegistry }}

toolkit:
  version: {{ .NvidiaContainerToolkitVersion }}

devicePlugin:
  version: {{ .DevicePluginVersion }}

dcgm:
  version: {{ .DCGMVersion }}

dcgmExporter:
  version: {{ .DCGMExporterVersion }}

mig:
  strategy: {{ .MIGStrategy }}

gds:
  enabled: {{ .EnableGDS }}

install.sh.tmpl — Installation script:

$#!/bin/bash
$# Generated: {{ .Timestamp }}
$# GPU Operator Installation Script
$
$set -euo pipefail
$
$NAMESPACE="{{ .Namespace }}"
$HELM_REPO="{{ .HelmRepository }}"
$HELM_CHART="{{ .HelmChart }}"
$
$echo "Adding Helm repository..."
$helm repo add nvidia "$HELM_REPO"
$helm repo update
$
$echo "Installing GPU Operator..."
$helm install gpu-operator nvidia/gpu-operator \
> --namespace "$NAMESPACE" \
> --create-namespace \
> --values values.yaml \
> --wait
$
$echo "Applying ClusterPolicy..."
$kubectl apply -f manifests/clusterpolicy.yaml
$
$echo "Installation complete!"

Observability

Metrics

Prometheus metrics exposed by the bundler framework:

# Duration histogram
bundler_make_duration_seconds{bundler_type="gpu-operator"} 0.245

# Total operations counter
bundler_make_total{bundler_type="gpu-operator",result="success"} 42
bundler_make_total{bundler_type="gpu-operator",result="error"} 3

# Files generated gauge
bundler_files_generated_total{bundler_type="gpu-operator"} 6

# Bytes generated gauge
bundler_bytes_generated_total{bundler_type="gpu-operator"} 15360

# Validation failures counter
bundler_validation_failures_total{bundler_type="gpu-operator"} 2
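
A hedged sketch of how metrics like these could be registered and recorded with the Prometheus Go client (metric names match the listing above; the wiring and the observeMake helper are illustrative, not the actual instrumentation):

// Illustrative sketch using prometheus/client_golang; not the actual wiring.
package bundler

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	makeDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "bundler_make_duration_seconds",
		Help: "Time spent generating a bundle.",
	}, []string{"bundler_type"})

	makeTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "bundler_make_total",
		Help: "Bundle generation attempts by result.",
	}, []string{"bundler_type", "result"})
)

func init() {
	prometheus.MustRegister(makeDuration, makeTotal)
}

// observeMake records one bundler run; hypothetical helper.
func observeMake(bundlerType string, start time.Time, err error) {
	makeDuration.WithLabelValues(bundlerType).Observe(time.Since(start).Seconds())
	result := "success"
	if err != nil {
		result = "error"
	}
	makeTotal.WithLabelValues(bundlerType, result).Inc()
}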

Structured Logging

slog integration for structured log output:

// Bundle generation start
slog.Debug("generating bundle",
	"bundler_type", bundlerType,
	"output_dir", outputDir,
)

// Bundle generation complete
slog.Debug("bundle generated successfully",
	"bundler_type", bundlerType,
	"files", len(result.Files),
	"bytes", result.TotalBytes,
	"duration", result.Duration,
)

Adding New Components

Adding a new component requires no Go code. Components are configured declaratively:

  1. Add to Component Registry (recipes/registry.yaml):

    components:
      - name: my-operator
        displayName: My Operator
        valueOverrideKeys:
          - myoperator
        helm:
          defaultRepository: https://charts.example.com
          defaultChart: example/my-operator
          defaultVersion: v1.0.0
        nodeScheduling:
          system:
            nodeSelectorPaths:
              - operator.nodeSelector
            tolerationPaths:
              - operator.tolerations
  2. Create Values File (recipes/components/my-operator/values.yaml):

    # My Operator Helm values
    operator:
      replicas: 1
      image:
        repository: example/my-operator
        tag: v1.0.0
  3. Add to Recipe Overlay (recipes/overlays/<overlay>.yaml):

    componentRefs:
      - name: my-operator
        type: Helm
        version: v1.0.0
        source: https://charts.example.com
        valuesFile: components/my-operator/values.yaml
  4. Test the Component:

    $# Generate recipe with new component
    $aicr recipe --service eks --accelerator h100 -o recipe.yaml
    $
    $# Generate bundle
    $aicr bundle -r recipe.yaml -o ./bundles
    $
    $# Verify output
    $cat ./bundles/values.yaml

See Bundler Development Guide for detailed documentation.

Best Practices

Template Design:

  • Keep templates simple and focused
  • Use descriptive variable names
  • Add comments for complex logic
  • Validate template rendering in tests
  • Don’t put business logic in templates

Error Handling:

  • Use structured errors with context (pkg/errors)
  • Wrap errors with meaningful messages
  • Validate early (before starting generation)
  • Clean up resources on error
  • Don’t swallow errors silently

Testing:

  • Test with realistic recipe data
  • Use table-driven tests for coverage
  • Test error paths explicitly
  • Verify generated file content
  • Don’t skip integration tests

Performance:

  • Use parallel generation for multiple files
  • Stream large files instead of buffering
  • Reuse template instances when possible
  • Profile bundle generation for bottlenecks
  • Don’t generate synchronously without reason

Deployer Framework: GitOps Integration

The bundle command integrates with GitOps tools through the Deployer Framework, which generates deployment-specific artifacts alongside the standard bundle files.

Overview

Purpose: Generate GitOps-ready deployment artifacts that integrate with popular continuous delivery tools.

Supported Deployers:

| Type   | Description                         | Output                                                     |
|--------|-------------------------------------|------------------------------------------------------------|
| helm   | (Default) Helm per-component bundle | deploy.sh, <component>/values.yaml, <component>/README.md  |
| argocd | Argo CD Application manifests       | app-of-apps.yaml, <component>/application.yaml             |
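
Both deployers plug into the bundle command through a common contract. A hedged sketch of what that contract could look like is below; the NewDeployer constructor and Generate signature are consistent with the test shown at the end of this section, while the remaining names are illustrative:

// Sketch of a deployer contract; field, type, and import names are illustrative.
package deployer

import (
	"context"

	"example.com/aicr/pkg/recipe" // hypothetical import path
)

// Artifact is one generated deployment file (e.g. app-of-apps.yaml or deploy.sh).
type Artifact struct {
	Path    string
	Content []byte
}

// Deployer renders GitOps artifacts for an ordered set of components.
type Deployer interface {
	// Generate writes deployer-specific artifacts for the recipe into outputDir,
	// respecting recipeResult.DeploymentOrder.
	Generate(ctx context.Context, recipeResult *recipe.RecipeResult, outputDir string) ([]Artifact, error)
}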

Key Feature: Deployment Order

All deployers respect the deploymentOrder field from the recipe, ensuring components are installed in the correct sequence:

# Recipe excerpt
deploymentOrder:
  - gpu-operator      # First
  - network-operator  # Second
  - nvsentinel        # Third

Deployer Architecture

Argo CD Deployer

Generates Argo CD Application manifests with proper sync ordering using multi-source Applications.

Ordering Mechanism: Uses argocd.argoproj.io/sync-wave annotation.

# gpu-operator/argocd/application.yaml (sync-wave: 0 = first)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "0"
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  sources:
    # Helm chart from upstream
    - repoURL: https://helm.ngc.nvidia.com/nvidia
      chart: gpu-operator
      targetRevision: v25.3.3
      helm:
        valueFiles:
          - $values/gpu-operator/values.yaml
    # Values from GitOps repo
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      ref: values
    # Additional manifests (if present)
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: gpu-operator/manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
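
As the comments in the manifest indicate, the sync-wave value is simply the component's position in deploymentOrder. A hedged sketch of how the annotation value might be derived (syncWaveFor is a hypothetical helper, not the actual implementation):

// Sketch: derive the Argo CD sync-wave annotation from deploymentOrder.
package deployer

import "strconv"

func syncWaveFor(component string, deploymentOrder []string) string {
	for i, name := range deploymentOrder {
		if name == component {
			return strconv.Itoa(i) // "0" for the first component, "1" for the second, ...
		}
	}
	// Assumption: components not listed in deploymentOrder default to wave 0.
	return "0"
}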

Output Structure:

bundles/
├── app-of-apps.yaml # Parent Application (bundle root)
├── recipe.yaml # Recipe used to generate bundle
├── gpu-operator/
│ ├── values.yaml
│ ├── manifests/
│ └── argocd/
│ └── application.yaml # sync-wave: 0
├── network-operator/
│ ├── values.yaml
│ └── argocd/
│ └── application.yaml # sync-wave: 1
├── nvsentinel/
│ ├── values.yaml
│ └── argocd/
│ └── application.yaml # sync-wave: 2
└── README.md # Argo CD deployment guide

Helm Deployer (Default)

Generates a Helm per-component bundle with individual component directories.

Ordering Mechanism: Dependencies listed in Chart.yaml are deployed in order by Helm.

Output Structure:

bundles/
├── gpu-operator/
│ ├── values.yaml # Component-specific Helm values
│ ├── scripts/
│ │ └── install.sh # Installation script
│ ├── README.md # Deployment instructions
│ └── checksums.txt # SHA256 checksums
├── recipe.yaml # Input recipe reference
└── deploy.sh # Top-level deployment script

Deployer Data Flow

Usage Examples

$# Default: Helm per-component bundle
$aicr bundle -r recipe.yaml -o ./bundles
$
$# Generate bundle with Argo CD Applications
$aicr bundle -r recipe.yaml --deployer argocd -o ./bundles
$
$# Argo CD with Git repository URL (sets repoURL in app-of-apps.yaml)
$aicr bundle -r recipe.yaml --deployer argocd \
> --repo https://github.com/my-org/my-gitops-repo.git \
> -o ./bundles

Deployment Order Implementation

The orderComponentsByDeployment function ensures components are processed in the correct sequence:

// orderComponentsByDeployment sorts components according to deploymentOrder.
// Components not in deploymentOrder are appended at the end in their original order.
func orderComponentsByDeployment(components []recipe.ComponentRef,
	order []string) []recipe.ComponentRef {

	if len(order) == 0 {
		return components
	}

	orderMap := make(map[string]int)
	for i, name := range order {
		orderMap[name] = i
	}

	// Separate ordered and unordered components
	ordered := make([]recipe.ComponentRef, 0)
	unordered := make([]recipe.ComponentRef, 0)

	for _, c := range components {
		if _, exists := orderMap[c.Name]; exists {
			ordered = append(ordered, c)
		} else {
			unordered = append(unordered, c)
		}
	}

	// Sort ordered components by their position in deploymentOrder
	sort.SliceStable(ordered, func(i, j int) bool {
		return orderMap[ordered[i].Name] < orderMap[ordered[j].Name]
	})

	return append(ordered, unordered...)
}

Testing Deployers

Each deployer has tests verifying deployment order correctness:

func TestDeployer_Generate_DeploymentOrder(t *testing.T) {
	recipeResult := &recipe.RecipeResult{
		DeploymentOrder: []string{"gpu-operator", "network-operator"},
		ComponentRefs: []recipe.ComponentRef{
			{Name: "network-operator", Version: "v25.4.0"},
			{Name: "gpu-operator", Version: "v25.3.3"},
		},
	}

	d := NewDeployer()
	artifacts, err := d.Generate(ctx, recipeResult, tmpDir)
	require.NoError(t, err)

	// Verify ordering mechanism (sync-wave/dependsOn/README order)
	// ...
}
