CLI Architecture
The aicr CLI provides command-line access to AICR configuration management capabilities.
Overview
The CLI provides a four-step workflow for optimizing GPU infrastructure, plus a query command for inspecting hydrated recipe values:
Step 1: Snapshot Command
Captures system configuration:
- Operating system: grub, kmod, sysctl, /etc/os-release
- SystemD services: containerd, docker, kubelet (service state and configuration)
- Kubernetes: API server version, container images, ClusterPolicy custom resource
- GPU hardware: driver version, CUDA libraries, MIG configuration, device properties
- Node topology (cluster-wide taints and labels)
Output destinations:
- File: --output system.yaml (local filesystem)
- Stdout: Default (can be piped to other commands)
- ConfigMap: --output cm://namespace/name (Kubernetes ConfigMap using Kubernetes API)
Agent deployment:
Kubernetes Job runs on GPU nodes. Writes snapshot to ConfigMap via Kubernetes API. Requires ServiceAccount with ConfigMap create/update permissions (Role in target namespace). Does not require PersistentVolume.
Step 2: Recipe Command
Generates optimized configuration recipes with two modes:
- Query Mode: Direct recipe generation from system parameters (OS, GPU, K8s, etc.)
- Snapshot Mode: Analyzes captured snapshots and generates tailored recipes based on workload intent (training/inference)
Input Options:
- Query parameters: --os ubuntu --gpu gb200 --service eks (direct recipe generation)
- Snapshot file: --snapshot system.yaml (analyze captured snapshot)
- ConfigMap: --snapshot cm://namespace/name (read from Kubernetes)
Output Options:
- File: --output recipe.yaml (write to file)
- Stdout: Default behavior (pipe to bundle command)
- ConfigMap: --output cm://namespace/name (store in Kubernetes)
Step 3: Validate Command
Validates recipe constraints against actual system measurements from a snapshot.
Input sources:
- Recipe file: --recipe recipe.yaml (local filesystem)
- Recipe URL: --recipe https://example.com/recipe.yaml (HTTP/HTTPS)
- Recipe ConfigMap: --recipe cm://namespace/name (Kubernetes ConfigMap)
- Snapshot file: --snapshot snapshot.yaml (local filesystem)
- Snapshot ConfigMap: --snapshot cm://namespace/name (Kubernetes ConfigMap)
Constraint format:
Constraints use fully qualified measurement paths: {Type}.{Subtype}.{Key}
- K8s.server.version - Kubernetes server version
- OS.release.ID - Operating system identifier
- OS.release.VERSION_ID - OS version
- OS.sysctl./proc/sys/kernel/osrelease - Kernel version
Supported operators:
- >= 1.30 - Greater than or equal (version comparison)
- <= 1.33 - Less than or equal (version comparison)
- > 1.30, < 2.0 - Strict comparison
- == ubuntu, != rhel - Equality operators
- ubuntu - Exact string match (no operator)
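The operator list above can be illustrated with a small parser and evaluator. This is a minimal sketch, not the project's validator: the Constraint type, ParseConstraint, and compareVersions are hypothetical names, and it assumes versions are plain dotted integers (a real Kubernetes version string with a "v" prefix or build suffix would need normalizing first).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Constraint is a hypothetical parsed form of an expression such as ">= 1.30".
type Constraint struct {
	Op    string // ">=", "<=", ">", "<", "==", "!=", or "" for an exact match
	Value string
}

// ParseConstraint splits an expression into operator and value.
// Two-character operators are tried first so ">=" is not read as ">".
func ParseConstraint(expr string) Constraint {
	expr = strings.TrimSpace(expr)
	for _, op := range []string{">=", "<=", "==", "!=", ">", "<"} {
		if strings.HasPrefix(expr, op) {
			return Constraint{Op: op, Value: strings.TrimSpace(expr[len(op):])}
		}
	}
	return Constraint{Value: expr}
}

// compareVersions compares dotted numeric versions; missing components count as 0.
func compareVersions(a, b string) (int, error) {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var x, y int
		var err error
		if i < len(as) {
			if x, err = strconv.Atoi(as[i]); err != nil {
				return 0, fmt.Errorf("invalid version %q: %w", a, err)
			}
		}
		if i < len(bs) {
			if y, err = strconv.Atoi(bs[i]); err != nil {
				return 0, fmt.Errorf("invalid version %q: %w", b, err)
			}
		}
		if x < y {
			return -1, nil
		}
		if x > y {
			return 1, nil
		}
	}
	return 0, nil
}

// Check evaluates the constraint against an actual measured value.
func (c Constraint) Check(actual string) (bool, error) {
	switch c.Op {
	case "", "==":
		return actual == c.Value, nil
	case "!=":
		return actual != c.Value, nil
	}
	cmp, err := compareVersions(actual, c.Value)
	if err != nil {
		return false, err
	}
	switch c.Op {
	case ">=":
		return cmp >= 0, nil
	case "<=":
		return cmp <= 0, nil
	case ">":
		return cmp > 0, nil
	default: // "<"
		return cmp < 0, nil
	}
}

func main() {
	ok, err := ParseConstraint(">= 1.30").Check("1.31.2")
	fmt.Println(ok, err)
}
```

Equality operators compare strings verbatim here, while ordering operators go through the version comparison; whether the real validator draws the line in the same place is an assumption.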
Output:
- Validation result with summary (passed/failed/skipped counts)
- Individual constraint results with expected vs actual values
- Status: pass, fail, or partial (some skipped)
CI/CD integration:
By default, the command exits with non-zero status when constraints fail (ideal for CI/CD). To run in informational mode without failing:
Step 4: Bundle Command
Generates deployment artifacts from recipes:
- Helm values files (values.yaml)
- Kubernetes manifests (ClusterPolicy, NICClusterPolicy, etc.)
- SHA256 checksum file
- README documentation: root bundle/README.md is generated by the deployer; per-component bundle/<component>/README.md is generated by each component bundler
Input sources:
- Recipe file: --recipe recipe.yaml (local filesystem)
- ConfigMap: --recipe cm://namespace/name (Kubernetes ConfigMap)
Output: Local directory only. ConfigMap output is not supported for bundles.
Current bundlers:
- GPU Operator: Generates GPU Operator Helm values and ClusterPolicy manifest
- Network Operator: Generates Network Operator Helm values and NICClusterPolicy manifest
- Cert-Manager: Generates cert-manager Helm values for certificate management
- NVSentinel: Generates NVSentinel Helm values
- Nodewright: Generates Nodewright Operator Helm values and Nodewright CR manifest for node optimization
Value overrides:
The --set flag allows runtime customization of generated bundle values:
Node scheduling options:
The bundle command supports node selector and toleration flags for controlling workload placement:
Flags:
- --system-node-selector key=value: Node selector for system components (repeatable)
- --system-node-toleration key=value:effect: Toleration for system components (repeatable)
- --accelerated-node-selector key=value: Node selector for GPU nodes (repeatable)
- --accelerated-node-toleration key=value:effect: Toleration for GPU nodes (repeatable)
- --nodes N: Estimated number of GPU nodes (bundle-time only; written to paths in registry under nodeScheduling.nodeCountPaths)
These flags apply selectors/tolerations to bundler-specific paths (e.g., GPU Operator uses operator.nodeSelector and daemonsets.nodeSelector). The --nodes value is applied to paths listed in the registry under nodeScheduling.nodeCountPaths.
Air-gap vendoring:
--vendor-charts pulls upstream Helm chart bytes into the bundle at bundle time, producing a self-contained artifact that eliminates Helm chart registry egress during deployment (container-image pulls and other resources may still require network access). Each vendored chart is recorded in provenance.yaml at the bundle root with name, version, source URL, and SHA256. Requires the helm binary on $PATH at bundle time; see the CLI reference for the full tradeoff (CVE-yank signal loss, bundle-size cost, auth surface).
Execution model:
- Bundlers run concurrently (parallel execution)
- All components from the recipe are bundled automatically
- Errors from any bundler cause immediate cancellation via context propagation
Testing: End-to-end workflow validated by Chainsaw tests in tests/chainsaw/cli/
Architecture Diagram
ConfigMap Integration
The CLI supports Kubernetes-native ConfigMap storage using the cm://namespace/name URI scheme:
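As an illustration of the URI scheme, the following sketch splits a cm:// reference into namespace and name. ParseConfigMapURI is a hypothetical helper, not the CLI's own parser, and the namespace and name in the example are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// ParseConfigMapURI splits "cm://namespace/name" into its two parts.
func ParseConfigMapURI(uri string) (namespace, name string, err error) {
	rest, ok := strings.CutPrefix(uri, "cm://")
	if !ok {
		return "", "", fmt.Errorf("not a ConfigMap URI: %q", uri)
	}
	namespace, name, ok = strings.Cut(rest, "/")
	if !ok || namespace == "" || name == "" || strings.Contains(name, "/") {
		return "", "", fmt.Errorf("expected cm://namespace/name, got %q", uri)
	}
	return namespace, name, nil
}

func main() {
	ns, name, err := ParseConfigMapURI("cm://gpu-operator/aicr-snapshot")
	fmt.Println(ns, name, err)
}
```

Any value without the cm:// prefix would fall through to ordinary file handling in a caller built around this helper.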
Benefits:
- No file dependencies - Direct Kubernetes API integration
- Agent-friendly - Jobs can write snapshots without volumes
- Pipeline integration - CI/CD can read/write ConfigMaps
- Multi-cluster - Share snapshots/recipes across clusters
RBAC Requirements:
- ConfigMap read/write permissions in target namespace
- ServiceAccount with appropriate Role/RoleBinding
- See Agent Deployment for details
Component Details
Entry Point: cmd/aicr/main.go
Minimal entry point that delegates to the CLI package:
Root Command: pkg/cli/root.go
Responsibilities:
- Command registration and routing
- Version information injection (via ldflags)
- Global flag handling (debug mode, log formatting)
- Logging mode selection and initialization
Key Features:
- Version info: version, commit, date (overridden at build time)
- Three logging modes:
  - CLI Mode (default): Minimal output for users (SetDefaultCLILogger)
  - Text Mode (--debug): Full metadata for debugging (SetDefaultLoggerWithLevel)
  - JSON Mode (--log-json): Structured logs for automation (SetDefaultStructuredLoggerWithLevel)
- Logger selection logic:
- Shell completion support
- Command listing for auto-completion
Snapshot Command: pkg/cli/snapshot.go
Captures comprehensive system configuration snapshots.
Command Flow
Detailed Data Flow
Snapshot measurement types: K8s, SystemD, OS, GPU, NodeTopology (cluster-wide node taints and labels — see pkg/measurement/types.go for the canonical constants).
Usage Examples
Agent Deployment Pattern
The snapshot command can be deployed as a Kubernetes Job for automated cluster auditing:
Deployment:
RBAC Requirements:
Key Points:
- No volumes needed - writes directly via Kubernetes API
- RBAC RoleBinding must reference the correct namespace
- ConfigMap is created automatically if it does not exist
- Supports update pattern (overwrite existing snapshots)
- RBAC and Job resources are created programmatically by pkg/k8s/agent
Recipe Command: pkg/cli/recipe.go
Generates optimized configuration recipes based on environment parameters.
Command Flow
Detailed Data Flow
Recipe Matching Algorithm
The recipe matching uses an asymmetric rule-based query system where overlay criteria (rules) match against user queries (candidates):
Asymmetric Matching Rules:
- All non-empty fields in the overlay criteria must be satisfied by the query
- Empty overlay field → Wildcard (matches any query value)
- Query “any” field → Only matches overlay “any” (does NOT match specific overlays)
- Version fields use semantic version equality with precision awareness
This asymmetric behavior ensures generic queries (e.g., --service eks --intent training) don’t match overly specific recipes (e.g., recipes requiring accelerator: gb200).
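The asymmetric rules can be sketched in a few lines. The Criteria struct below is a simplified, hypothetical stand-in for the real criteria type, version matching is left out, and treating an overlay value of "any" the same as an empty field is an assumption.

```go
package main

import "fmt"

// Criteria is a simplified, hypothetical stand-in for recipe criteria.
type Criteria struct {
	Service, Accelerator, Intent, OS string
}

// fieldMatches applies the asymmetric rule to a single field.
func fieldMatches(overlay, query string) bool {
	if overlay == "" || overlay == "any" {
		return true // wildcard overlay field matches every query value
	}
	// A specific overlay value needs the same specific query value,
	// so a query of "any" (or empty) does not match it.
	return overlay == query
}

// Matches reports whether an overlay applies to a query.
func Matches(overlay, query Criteria) bool {
	return fieldMatches(overlay.Service, query.Service) &&
		fieldMatches(overlay.Accelerator, query.Accelerator) &&
		fieldMatches(overlay.Intent, query.Intent) &&
		fieldMatches(overlay.OS, query.OS)
}

func main() {
	query := Criteria{Service: "eks", Intent: "training", Accelerator: "any"}

	generic := Criteria{Service: "eks", Intent: "training"}
	specific := Criteria{Service: "eks", Accelerator: "gb200"}

	fmt.Println(Matches(generic, query))  // generic overlay applies
	fmt.Println(Matches(specific, query)) // gb200-only overlay does not
}
```

Swapping the arguments changes the result, which is the asymmetry: the overlay is the rule and the query is the candidate.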
Usage Examples
Recipe Command Modes
The recipe command supports two modes of operation:
Query Mode (Default)
Direct recipe generation from environment parameters:
Snapshot Mode
Analyze captured snapshots and generate tailored recipes:
Query Extraction from Snapshot
When using snapshot mode, the recipe builder extracts environment parameters from the snapshot:
From OS Measurements:
- release subtype → OS family (ubuntu, rhel, cos, amazonlinux, talos)
From Kubernetes Measurements:
- server subtype → K8s service provider (eks, gke, aks) inferred from images
From GPU Measurements:
- Product Name → GPU type detection (H100, GB200, B200, A100, L40, RTX PRO 6000)
- Maps product names to normalized accelerator types for recipe matching
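A product-name mapping of this kind could look like the sketch below. The pattern table and the normalized names other than gb200 are assumptions made for illustration; the point worth noting is the ordering, since "GB200" contains "B200" as a substring.

```go
package main

import (
	"fmt"
	"strings"
)

// acceleratorPatterns is checked in order. "GB200" must come before "B200"
// because the former contains the latter as a substring.
var acceleratorPatterns = []struct{ substr, accel string }{
	{"GB200", "gb200"},
	{"B200", "b200"},
	{"H100", "h100"},
	{"A100", "a100"},
	{"L40", "l40"},
	{"RTX PRO 6000", "rtx-pro-6000"},
}

// NormalizeAccelerator maps a reported product name to a normalized
// accelerator type; the second result is false when nothing matches.
func NormalizeAccelerator(productName string) (string, bool) {
	upper := strings.ToUpper(productName)
	for _, p := range acceleratorPatterns {
		if strings.Contains(upper, p.substr) {
			return p.accel, true
		}
	}
	return "", false
}

func main() {
	for _, name := range []string{"NVIDIA H100 80GB HBM3", "NVIDIA GB200", "Unknown Device"} {
		accel, ok := NormalizeAccelerator(name)
		fmt.Printf("%q -> %q (%v)\n", name, accel, ok)
	}
}
```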
Intent Types:
- training – Optimize for high throughput, batch processing, multi-GPU orchestration
- inference – Optimize for low latency, single-request performance, efficient batching
- any – Provides general-purpose recommendations applicable to both workloads
External Data Directory
The --data flag enables extending embedded recipe data with external files:
Requirements:
- External directory must contain registry.yaml
- No symlinks allowed (security)
- Max file size: 10MB per file
Merge Rules:
- registry.yaml: Components merged by name (external overrides embedded)
- All other files: External replaces embedded if path matches
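The registry merge rule can be sketched as a merge keyed by component name. The Component struct and MergeComponents function are hypothetical, and replacing the whole entry (rather than merging individual fields) is an assumption about what "overrides" means here.

```go
package main

import "fmt"

// Component is a hypothetical, trimmed-down registry entry.
type Component struct {
	Name, Chart, Version string
}

// MergeComponents returns the embedded components with external entries
// applied: same-name entries are replaced in place, new names are appended.
func MergeComponents(embedded, external []Component) []Component {
	merged := make([]Component, len(embedded), len(embedded)+len(external))
	copy(merged, embedded)

	index := make(map[string]int, len(merged))
	for i, c := range merged {
		index[c.Name] = i
	}
	for _, c := range external {
		if i, ok := index[c.Name]; ok {
			merged[i] = c // external overrides embedded
			continue
		}
		index[c.Name] = len(merged)
		merged = append(merged, c)
	}
	return merged
}

func main() {
	embedded := []Component{{"gpu-operator", "gpu-operator", "1.0.0"}, {"cert-manager", "cert-manager", "1.0.0"}}
	external := []Component{{"gpu-operator", "gpu-operator", "2.0.0"}, {"my-operator", "my-operator", "0.1.0"}}
	for _, c := range MergeComponents(embedded, external) {
		fmt.Println(c.Name, c.Version)
	}
}
```

The version strings are placeholders; the embedded order is preserved so that overriding a component does not change where it appears.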
Usage Examples
Recipe Output Structure
Error Handling
- Query Mode:
  - Invalid parameter values: Returns error with supported options
  - Missing required parameters: Allows “any” as default fallback
  - No matching overlays: Returns recipe with base configuration
- Snapshot Mode:
  - Missing snapshot file: File not found error with path
  - Invalid snapshot format: Parse error with details
  - Invalid intent: Returns error with supported intent types (training, inference, any)
  - Extraction failures: Best-effort extraction with partial criteria
Common Errors:
- Unknown output format: Error with supported formats list (json, yaml)
Query Command: pkg/cli/query.go
Extracts specific values from the fully hydrated recipe configuration using dot-path selectors.
Command Flow
Hydration Process
The query command builds a fully hydrated map[string]any from the RecipeResult:
- Recipe-level fields (criteria, metadata, deploymentOrder, constraints) are mapped directly
- Each ComponentRef is expanded into a component map with metadata fields (name, chart, source, version, etc.)
- GetValuesForComponent is called per component to merge base values, overlay values, and inline overrides
- The merged values are inlined under each component’s values key
Selector Resolution
The selector uses dot-delimited path walking. Leading dots are stripped (yq-style), so .components.X and components.X are equivalent. An empty selector or . returns the entire hydrated map.
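Dot-path walking over a map[string]any can be sketched as follows. This is an illustration of the behavior described above, not the code in pkg/recipe/query.go; the function signature, the sample data, and the version value are placeholders, and keys that themselves contain dots are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// Select walks a dot-delimited path through nested maps.
// Leading dots are stripped; an empty selector or "." returns the whole map.
func Select(data map[string]any, selector string) (any, error) {
	selector = strings.TrimLeft(selector, ".")
	if selector == "" {
		return data, nil
	}
	var current any = data
	for _, key := range strings.Split(selector, ".") {
		m, ok := current.(map[string]any)
		if !ok {
			return nil, fmt.Errorf("cannot descend into %q: parent is not a map", key)
		}
		current, ok = m[key]
		if !ok {
			return nil, fmt.Errorf("key %q not found", key)
		}
	}
	return current, nil
}

func main() {
	hydrated := map[string]any{
		"components": map[string]any{
			"gpu-operator": map[string]any{
				"values": map[string]any{
					"driver": map[string]any{"version": "1.2.3"},
				},
			},
		},
	}
	v, err := Select(hydrated, ".components.gpu-operator.values.driver.version")
	fmt.Println(v, err)
}
```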
Usage Examples
Implementation: pkg/recipe/query.go (HydrateResult, Select)
Bundle Command: pkg/cli/bundle.go
Generates deployment-ready bundles (Helm values, Kubernetes manifests, installation scripts) from recipes.
Command Flow
Detailed Data Flow
Bundler Data Flow
Simplified Architecture (RecipeResult-to-Template):
Key Simplification: Single RecipeResult path (no dual Recipe/RecipeResult routing)
Data Flow: RecipeResult → Values Map + ScriptData → Templates
Templates: Use index .Values "key" for config, .Script.* for metadata
Bundler Architecture
BaseBundler Helper Pattern
RecipeResult-Based Data Access
Data Flow: RecipeResult → Values/ScriptData → Template
Registry Pattern
DefaultBundler Options:
- WithBundlerTypes([]BundleType): Specify bundler types (empty = all registered)
- WithFailFast(bool): Stop on first error (default: false/collect all)
- WithConfig(*Config): Provide bundler configuration
- WithRegistry(*Registry): Use custom bundler registry
Execution:
- Parallel execution by default: Uses errgroup.WithContext for concurrent execution
- All bundlers run concurrently when no types specified
- Faster for multiple bundlers
- Context cancellation propagates to all bundlers
- Bundlers are stateless (thread-safe by design)
- BaseBundler provides thread-safe operations
Architecture Benefits:
- 75% less code per bundler (BaseBundler eliminates boilerplate)
- 34% less test code (TestHarness standardizes testing)
- 15+ internal helpers for recipe parsing
- Automatic registration via init() functions
- Fail-fast on duplicate bundler types
Usage Examples
Bundle Output Structure
Error Handling
Validation Errors:
- Missing recipe file: File not found error with path
- Invalid recipe format: Parse error with details
- Invalid bundler type: Error with list of supported types
- Empty measurements: Recipe validation failure
Execution Errors:
- FailFast=false (default): Collects all errors, continues execution
  - Returns partial results with error list
  - Exit code indicates failure count
- FailFast=true: Stops on first bundler error
  - Returns immediately with error
  - Subsequent bundlers not executed
Common Error Scenarios:
CLI Integration
The bundle command integrates with the CLI through:
- Shared Serializer: Uses same serializer.FromFile for recipe loading
- Structured Logging: Consistent slog structured logging
- Context Propagation: Respects context cancellation
- Error Patterns: Uses same error handling conventions
Log Output Example:
Shared Infrastructure
Collector Factory Pattern
The CLI uses the Factory Pattern for collector instantiation, enabling:
- Testability: Inject mock collectors for unit tests
- Flexibility: Easy to add new collector types
- Encapsulation: Hide collector creation complexity
Serializer Abstraction
Output formatting is abstracted through the serializer.Serializer interface:
Implementations:
- JSON: encoding/json with 2-space indent
- YAML: gopkg.in/yaml.v3
- Table: text/tabwriter for columnar display
Measurement Data Model
All collected data uses a unified measurement.Measurement structure:
Error Handling
CLI Error Strategy
- Flag Validation: User-friendly error messages for invalid flags
- Version Parsing: Specific error types (ErrNegativeComponent, etc.)
- Collector Failures: Log errors, continue with partial data where possible
- Serialization Errors: Fatal - abort and report
- Exit Codes: Non-zero exit code on any failure
Example Error Messages
Performance Characteristics
Snapshot Command
- Parallel Collection: All collectors run concurrently via errgroup
- Typical Duration: 100-500ms depending on cluster size
- Memory Usage: ~10-50MB for typical workloads
- Scalability: O(n) with number of pods/nodes for K8s collector
Recipe Command
- Store Loading: Once per process (cached via sync.Once)
- Typical Duration: <10ms after initial load
- Memory Usage: ~5-10MB (embedded YAML + parsed structure)
- Scalability: O(m) with number of overlays (typically <100)
Build Configuration
Version Injection via ldflags
Build-time version information injection:
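The mechanism relies on the Go linker's -X flag, which overwrites package-level string variables at link time. The sketch below assumes the three variables live in package main; in the real tree they may sit in another package, in which case the -X path changes accordingly. The version values in the comment are placeholders.

```go
package main

import "fmt"

// Defaults for local builds. A release build could override them with:
//
//	go build -ldflags "-X main.version=v1.2.3 -X main.commit=abc1234 -X main.date=2025-01-01" ./cmd/aicr
//
// The -X flag only works on string variables, addressed as importpath.name.
var (
	version = "dev"
	commit  = "unknown"
	date    = "unknown"
)

// versionString formats the injected values for display.
func versionString() string {
	return fmt.Sprintf("aicr %s (commit %s, built %s)", version, commit, date)
}

func main() {
	fmt.Println(versionString())
}
```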
Testing Strategy
Unit Tests
- Flag parsing and validation
- Version parsing and error handling
- Query building from command flags
- Serializer format selection
Integration Tests
- Mock collectors for deterministic output
- Full command execution with fake factory
- Output format validation
Example Test Structure
Dependencies
External Libraries
- github.com/urfave/cli/v3 - CLI framework
- golang.org/x/sync/errgroup - Concurrent error handling
- gopkg.in/yaml.v3 - YAML parsing
- log/slog - Structured logging
Internal Packages
- pkg/collector - System data collection
- pkg/measurement - Data model
- pkg/recipe - Recipe building
- pkg/version - Semantic versioning
- pkg/serializer - Output formatting
- pkg/logging - Logging configuration
- pkg/snapshotter - Snapshot orchestration
Future Enhancements
Short-Term (< 3 months)
- Caching Layer
  Rationale: Reduce latency for repeated aicr snapshot calls in scripts
  Implementation: sync.Map with TTL-based eviction using time.AfterFunc
  Trade-off: Stale data risk vs 5-10x performance improvement
  Reference: sync.Map
- Differential Snapshots
  Use Case: CI/CD pipelines detecting configuration drift
  Implementation: github.com/google/go-cmp/cmp for deep comparison
  Output: JSON Patch (RFC 6902) format for machine consumption
  CLI: aicr diff baseline.yaml current.yaml --format patch
- Measurement Filtering
  Use Case: Extract only GPU data without K8s overhead
  CLI: aicr snapshot --filter gpu,os --exclude k8s
  Implementation: Post-collection filtering before serialization
  Performance: Saves 60-70% execution time when K8s excluded
- Batch Mode
  Use Case: Fleet-wide configuration auditing (100s of nodes)
  Implementation: Worker pool with errgroup.SetLimit()
  CLI: aicr snapshot --nodes nodes.txt --workers 10 --output results/
  Reference: errgroup Limits
Mid-Term (3-6 months)
- Plugin System
  Rationale: Custom collectors without forking codebase
  Interface: type Collector interface { Collect(context.Context) (Measurement, error) }
  Options: Go plugins (unstable across versions) or WASM (safe, portable)
  Security: Sandboxed execution with restricted syscalls
  Reference: WebAssembly System Interface
- Configuration Files
  Use Case: Avoid repeating --os, --gpu flags
  Format: YAML following XDG Base Directory spec
  Location: ~/.config/aicr/config.yaml (Linux/macOS), %APPDATA%\aicr\config.yaml (Windows)
  Example:
- Watch Mode
  Implementation: Hybrid of fsnotify + periodic polling
  CLI: aicr snapshot --watch --interval 30s --on-change ./alert.sh
  Output: Stream of JSON diffs to stdout
  Use Case: Real-time monitoring with alerting
- Schema Validation
  Use Case: Ensure snapshots conform to API version spec
  Implementation: Embed JSON Schema in binary with go:embed
  Library: github.com/santhosh-tekuri/jsonschema/v5 (fastest Go validator)
  CLI: aicr validate --schema v1 snapshot.json
Long-Term (6-12 months)
- gRPC Mode
  Rationale: Better streaming, 3-5x smaller payloads than JSON
  Implementation: Bi-directional streaming with protobuf
  Trade-off: Added complexity (proto definitions) vs performance gains
  Reference: gRPC Go
- Distributed Tracing
  Use Case: Debug performance issues across collectors
  Implementation: OpenTelemetry SDK with span per collector
  Exporter: OTLP to Jaeger/Tempo
  CLI: aicr snapshot --trace --trace-endpoint localhost:4317
  Reference: OpenTelemetry Go
- Policy Enforcement
  Use Case: Block non-compliant configs in CI/CD
  Implementation: Embed OPA (github.com/open-policy-agent/opa)
  CLI: aicr validate --policy policy.rego snapshot.yaml
  Exit Code: 0 = pass, 1 = policy violations
  Reference: OPA Go Integration
- Cloud Storage Integration
  Use Case: Centralized storage for fleet management
  CLI: aicr snapshot --upload s3://bucket/snapshots/$(hostname).yaml
  Implementation: AWS SDK v2 with resumable uploads
  Authentication: IAM roles, service accounts, credential chain
  Reference: AWS SDK for Go V2
Production Deployment Patterns
Pattern 1: CI/CD Integration
Use Case: Automated configuration validation in build pipelines
GitLab CI Example:
GitHub Actions Example:
Jenkins Pipeline:
Pattern 2: Scheduled Auditing
Use Case: Nightly configuration drift detection across fleet
Kubernetes CronJob:
Systemd Timer (Bare Metal):
Enable with:
Pattern 3: Fleet Management
Use Case: Collect snapshots from 100s of GPU nodes in parallel
Ansible Playbook:
Terraform Provisioning:
Pattern 4: Real-Time Monitoring
Use Case: Continuous configuration monitoring with Prometheus alerting
Prometheus Exporter (future enhancement):
Prometheus Alerting Rules:
Advanced Usage Patterns
Snapshot Diffing with jq
Recipe Generation Pipeline
Automated Remediation
Troubleshooting Guide
Issue: “nvidia-smi not found”
Symptoms: GPU measurements empty, error in logs
Root Cause: NVIDIA driver not installed or not in PATH
Diagnosis:
Resolution:
Issue: “Kubernetes API server unreachable”
Symptoms: K8s measurements empty, “connection refused” error
Root Cause: Not running in cluster, or kubeconfig missing/invalid
Diagnosis:
Resolution:
Issue: “Snapshot too slow (> 5s)”
Symptoms: Long execution time, timeouts in CI/CD
Root Cause: Large cluster (1000s of pods), slow API server, many GPUs
Diagnosis:
Resolution:
Issue: “Out of memory during snapshot”
Symptoms: Process killed, OOMKilled in K8s, segfault
Root Cause: Large measurement data (10k+ pods, many images)
Diagnosis:
Resolution:
Performance Tuning
CPU Profiling
Memory Profiling
Benchmarking
Optimization Recommendations
- Reduce String Allocations
  Current: fmt.Sprintf("%s:%s", name, tag) allocates
  Optimized: Use strings.Builder for concatenation
  Savings: 20-30% fewer allocations in image collector
- Preallocate Slices
  Current: measurements := []Measurement{}
  Optimized: measurements := make([]Measurement, 0, expectedSize)
  Benefit: Avoids slice growth reallocations
  When: Size predictable (e.g., GPU count known)
- Pool Large Objects
  Use Case: Measurement structs allocated repeatedly
  Implementation:
  Reference: sync.Pool
- Avoid Reflection
  Current: encoding/json uses reflection
  Optimized: Code-generated marshaling with easyjson
  Benefit: 2-3x faster JSON serialization
  Trade-off: Build complexity vs performance
  Reference: easyjson
- Batch API Operations
  Current: Multiple API calls per collector
  Optimized: Aggregate calls where possible
  Example: List all pods once, filter in memory
  Benefit: Reduces API server load, faster execution
- Concurrent Collectors
  Current: errgroup with limit
  Tuning: Adjust limit based on collector type
  Reference: errgroup SetLimit
Security Best Practices
Running as Non-Root
CLI:
Kubernetes Job:
Secrets Management
Input Validation
CLI validates all inputs before processing:
Network Security
Bundler Framework: Components and Extension
The bundler framework documented under Bundle Command defines how individual components are turned into deployment artifacts. This section drills into the architecture diagrams, a worked example (GPU Operator), observability surfaces, the add-a-component workflow, and conventions for new bundlers. For command flow, flags, and usage examples, see the Bundle Command section above.
Component Diagram
The Generate README node here is the per-component bundle/<component>/README.md. The root bundle/README.md is generated by the deployer (see Deployer Framework below).
Sequence Diagram
Worked Example: GPU Operator Bundler
The GPU Operator bundler generates a complete deployment bundle for NVIDIA GPU Operator, extracting configuration from recipe measurements.
Recipe Data Extraction
K8s Measurements (measurement.TypeK8s):
- Image subtype (component versions):
- Config subtype (Boolean flags):
GPU Measurements (measurement.TypeGPU):
Template Files
values.yaml.tmpl — Helm chart values:
install.sh.tmpl — Installation script:
Observability
Metrics
Prometheus metrics exposed by the bundler framework:
Structured Logging
slog integration for structured log output:
Adding New Components
Adding a new component requires no Go code. Components are configured declaratively:
- Add to Component Registry (recipes/registry.yaml):
- Create Values File (recipes/components/my-operator/values.yaml):
- Add to Recipe Overlay (recipes/overlays/<overlay>.yaml):
- Test the Component:
See Bundler Development Guide for detailed documentation.
Best Practices
Template Design:
- Keep templates simple and focused
- Use descriptive variable names
- Add comments for complex logic
- Validate template rendering in tests
- Don’t put business logic in templates
Error Handling:
- Use structured errors with context (pkg/errors)
- Wrap errors with meaningful messages
- Validate early (before starting generation)
- Clean up resources on error
- Don’t swallow errors silently
Testing:
- Test with realistic recipe data
- Use table-driven tests for coverage
- Test error paths explicitly
- Verify generated file content
- Don’t skip integration tests
Performance:
- Use parallel generation for multiple files
- Stream large files instead of buffering
- Reuse template instances when possible
- Profile bundle generation for bottlenecks
- Don’t generate synchronously without reason
Deployer Framework: GitOps Integration
The bundle command integrates with GitOps tools through the Deployer Framework, which generates deployment-specific artifacts alongside the standard bundle files.
Overview
Purpose: Generate GitOps-ready deployment artifacts that integrate with popular continuous delivery tools.
Supported Deployers:
Key Feature: Deployment Order
All deployers respect the deploymentOrder field from the recipe, ensuring components are installed in the correct sequence:
Deployer Architecture
Argo CD Deployer
Generates Argo CD Application manifests with proper sync ordering using multi-source Applications.
Ordering Mechanism: Uses argocd.argoproj.io/sync-wave annotation.
Output Structure:
Helm Deployer (Default)
Generates a per-component Helm bundle, with one directory for each component.
Ordering Mechanism: Dependencies listed in Chart.yaml are deployed in order by Helm.
Output Structure:
Deployer Data Flow
Usage Examples
Deployment Order Implementation
The orderComponentsByDeployment function ensures components are processed in the correct sequence:
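One way such ordering could work is sketched below. This is not the project's orderComponentsByDeployment: the function name, the string-slice signature, and the choice to place components missing from deploymentOrder at the end in their original order are all assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// orderByDeployment returns components sorted by their position in
// deploymentOrder. Components not listed there keep their relative
// order and are placed after all listed ones.
func orderByDeployment(components, deploymentOrder []string) []string {
	rank := make(map[string]int, len(deploymentOrder))
	for i, name := range deploymentOrder {
		rank[name] = i
	}

	ordered := append([]string(nil), components...) // do not mutate the input
	sort.SliceStable(ordered, func(i, j int) bool {
		ri, iListed := rank[ordered[i]]
		rj, jListed := rank[ordered[j]]
		switch {
		case iListed && jListed:
			return ri < rj
		case iListed:
			return true // listed components sort before unlisted ones
		default:
			return false
		}
	})
	return ordered
}

func main() {
	components := []string{"gpu-operator", "my-operator", "cert-manager"}
	order := []string{"cert-manager", "gpu-operator"}
	fmt.Println(orderByDeployment(components, order))
}
```

A deployer could then use each component's index in the result as its ordering key, for example as the value of the sync-wave annotation.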
Testing Deployers
Each deployer has tests verifying deployment order correctness: