Automation and CI/CD Integration
Integration patterns for using AICR in automated pipelines.
Overview
Typical integration workflows:
- Snapshot capture: Deploy agent Job to capture cluster configuration
- Recipe generation: Generate configuration recommendations from snapshot or query parameters
- Bundle creation: Create deployment artifacts (Helm values, manifests, scripts)
- Deployment: Apply generated configuration to cluster
- Validation: Verify deployment using test workloads
Supported CI/CD platforms: GitHub Actions, GitLab CI, Jenkins, Argo Workflows, Tekton
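The stages above chain together as plain CLI invocations, whichever platform drives them. As an illustrative sketch, the following Python builds that command sequence in dry-run form (nothing is executed); the flags mirror the CLI examples later on this page, and `build_pipeline` is a hypothetical helper, not part of the `aicr` CLI.

```python
# Illustrative sketch: assemble the snapshot -> recipe -> bundle command
# sequence as argument lists (dry run; nothing is executed here).
# Flags mirror the CLI examples on this page; build_pipeline is a
# hypothetical helper, not part of the aicr CLI.

def build_pipeline(intent: str, platform: str) -> list[list[str]]:
    return [
        ["aicr", "snapshot", "--output", "cm://gpu-operator/aicr-snapshot",
         "--timeout", "300s"],
        ["aicr", "recipe", "-s", "cm://gpu-operator/aicr-snapshot",
         "--intent", intent, "--platform", platform, "-o", "recipe.yaml"],
        ["aicr", "bundle", "--recipe", "recipe.yaml", "--output", "./bundles"],
    ]

commands = build_pipeline("training", "kubeflow")
for cmd in commands:
    print(" ".join(cmd))
```

A real pipeline would hand each list to its job runner (or `subprocess.run`) and gate the deploy stage on approval, as the patterns below do.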
Integration Patterns
Pattern 1: Configuration Snapshot + Drift Detection
Periodically capture snapshots and compare against baseline.
Use case: Detect unauthorized configuration changes
```yaml
# GitHub Actions
name: Configuration Drift Detection
on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours

jobs:
  snapshot:
    runs-on: ubuntu-latest
    steps:
      - name: Configure kubectl
        uses: azure/k8s-set-context@v4
        with:
          kubeconfig: ${{ secrets.KUBECONFIG }}

      - name: Deploy AICR Agent
        run: |
          aicr snapshot --output cm://gpu-operator/aicr-snapshot --timeout 300s

      - name: Wait for completion
        run: |
          kubectl wait --for=condition=complete --timeout=300s job/aicr -n gpu-operator

      - name: Capture snapshot from ConfigMap
        run: |
          kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > snapshot-$(date +%Y%m%d-%H%M%S).yaml

      - name: Compare with baseline
        run: |
          # Download baseline
          curl -O https://your-artifacts/baseline.yaml

          # Compare
          if ! diff -q baseline.yaml snapshot-*.yaml; then
            echo "::error::Configuration drift detected"
            diff baseline.yaml snapshot-*.yaml
            exit 1
          fi

      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: cluster-snapshots
          path: snapshot-*.yaml
```
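A textual `diff`, as in the workflow above, flags any byte-level change, including harmless key reordering. A structural comparison reports exactly which keys drifted and to what values. This is an illustrative sketch that assumes the snapshots have already been parsed into dictionaries (e.g. with `yaml.safe_load`); the sample values are made up.

```python
# Illustrative sketch: report which top-level keys differ between a
# baseline and a current snapshot, assuming both are already parsed
# into plain dicts (e.g. via yaml.safe_load). Sample values are made up.

def find_drift(baseline: dict, current: dict) -> dict:
    """Return {key: (baseline_value, current_value)} for drifted keys."""
    drift = {}
    for key in baseline.keys() | current.keys():
        old, new = baseline.get(key), current.get(key)
        if old != new:
            drift[key] = (old, new)
    return drift

baseline = {"driver": "550.54", "mig": "disabled", "gds": True}
current = {"driver": "550.90", "mig": "disabled"}
print(find_drift(baseline, current))
```

A missing key shows up with `None` on one side, which distinguishes removals and additions from value changes.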
Pattern 2: Recipe-Based Deployment
Generate optimized configuration and deploy operators.
Use case: Deploy GPU Operator with environment-specific settings
```yaml
# GitLab CI
stages:
  - snapshot
  - recipe
  - bundle
  - deploy

capture_snapshot:
  stage: snapshot
  image: bitnami/kubectl:latest
  script:
    - aicr snapshot --output snapshot.yaml --timeout 300s
  artifacts:
    paths:
      - snapshot.yaml

generate_recipe:
  stage: recipe
  image: ghcr.io/nvidia/aicr:latest
  script:
    # Option 1: Use ConfigMap directly (no artifact needed)
    - aicr recipe -s cm://gpu-operator/aicr-snapshot --intent training --platform kubeflow -o recipe.yaml
    # Option 2: Use snapshot file from previous stage
    # - aicr recipe --snapshot snapshot.yaml --intent training --platform kubeflow --output recipe.yaml
  artifacts:
    paths:
      - recipe.yaml
  dependencies:
    - capture_snapshot

create_bundle:
  stage: bundle
  image: ghcr.io/nvidia/aicr:latest
  script:
    - aicr bundle --recipe recipe.yaml --output ./bundles
    # Override values at bundle generation time
    # - aicr bundle -r recipe.yaml --set gpuoperator:gds.enabled=true -o ./bundles
  artifacts:
    paths:
      - bundles/
  dependencies:
    - generate_recipe

deploy_operators:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - cd bundles
    - sha256sum -c checksums.txt
    - chmod +x deploy.sh
    - ./deploy.sh
  dependencies:
    - create_bundle
  when: manual
```
Pattern 3: API-Driven Recipe Generation
Use the API for recipe generation without installing the CLI.
Use case: Lightweight recipe generation in containers
```yaml
# CircleCI
version: 2.1

jobs:
  generate_recipe:
    docker:
      - image: cimg/base:2025.01
    steps:
      - run:
          name: Generate recipe via API
          command: |
            # Detect environment
            OS="ubuntu"
            ACCELERATOR="h100"
            SERVICE="eks"

            # Generate recipe
            curl -s "http://localhost:8080/v1/recipe?os=${OS}&accelerator=${ACCELERATOR}&service=${SERVICE}&intent=training" \
              -o recipe.json

            # Validate
            jq -e '.measurements | length > 0' recipe.json

      - persist_to_workspace:
          root: .
          paths:
            - recipe.json

  extract_versions:
    docker:
      - image: cimg/base:2025.01
    steps:
      - attach_workspace:
          at: .

      - run:
          name: Extract component versions
          command: |
            # GPU Operator version from componentRefs
            GPU_OP_VERSION=$(jq -r '.componentRefs[] | select(.name=="gpu-operator") | .version' recipe.json)

            echo "GPU Operator: $GPU_OP_VERSION"

            # Save for deployment
            echo "export GPU_OP_VERSION=$GPU_OP_VERSION" >> $BASH_ENV

workflows:
  deploy:
    jobs:
      - generate_recipe
      - extract_versions:
          requires:
            - generate_recipe
```
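The `jq` extraction in `extract_versions` can also be done in a few lines of Python, which is handy when the consuming job is already Python-based. The recipe shape below (a `componentRefs` list of `name`/`version` objects) follows the jq query above; the sample payload is illustrative, not a real API response.

```python
import json

# Sample recipe payload shaped as the jq query above assumes
# (illustrative values, not a real API response).
recipe = json.loads("""
{
  "componentRefs": [
    {"name": "gpu-operator", "version": "v25.3.3"},
    {"name": "network-operator", "version": "v24.7.0"}
  ]
}
""")

def component_version(recipe: dict, name: str):
    """Mirror of: jq '.componentRefs[] | select(.name==NAME) | .version'."""
    for ref in recipe.get("componentRefs", []):
        if ref.get("name") == name:
            return ref.get("version")
    return None

print(component_version(recipe, "gpu-operator"))
```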
Pattern 4: Multi-Cluster Management
Deploy consistent configurations across multiple clusters.
Use case: Multi-region GPU clusters with unified configuration
```bash
#!/bin/bash
# multi-cluster-deploy.sh

# Define clusters
CLUSTERS=(
  "prod-us-east-1:eks:h100"
  "prod-eu-west-1:eks:h100"
  "staging-us-west-2:eks:gb200"
)

# Iterate clusters
for cluster_config in "${CLUSTERS[@]}"; do
  IFS=":" read -r CLUSTER SERVICE GPU <<< "$cluster_config"

  echo "Processing cluster: $CLUSTER"

  # Switch context
  kubectl config use-context "$CLUSTER"

  # Capture snapshot
  aicr snapshot --output "snapshot-${CLUSTER}.yaml" --timeout 300s

  # Generate recipe (can use ConfigMap directly or file)
  # Option 1: Use ConfigMap
  aicr recipe -s "cm://gpu-operator/aicr-snapshot" --intent training --platform kubeflow -o "recipe-${CLUSTER}.yaml"
  # Option 2: Use saved file
  # aicr recipe --snapshot "snapshot-${CLUSTER}.yaml" --intent training --platform kubeflow --output "recipe-${CLUSTER}.yaml"

  # Create bundle
  aicr bundle \
    --recipe "recipe-${CLUSTER}.yaml" \
    --output "./bundles/${CLUSTER}"

  # Or with value overrides for environment-specific customization
  # aicr bundle \
  #   --recipe "recipe-${CLUSTER}.yaml" \
  #   --set gpuoperator:gds.enabled=true \
  #   --set gpuoperator:mig.strategy=mixed \
  #   --output "./bundles/${CLUSTER}"

  # Deploy (with approval)
  echo "Deploy to $CLUSTER? [y/N]"
  read -r response
  if [[ "$response" =~ ^[Yy]$ ]]; then
    cd "bundles/${CLUSTER}"
    chmod +x deploy.sh && ./deploy.sh
    cd -
  fi

  # Clean up
  kubectl delete job aicr -n gpu-operator
done
```
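The `IFS=":" read` in the script splits each `cluster:service:gpu` triple. If the orchestration later moves out of bash, the same parsing is a one-liner in Python; the triple format itself comes from the script above, and `ClusterSpec` is an illustrative name.

```python
from typing import NamedTuple

class ClusterSpec(NamedTuple):
    """One entry of the CLUSTERS array: cluster:service:gpu."""
    cluster: str
    service: str
    gpu: str

def parse_clusters(entries: list) -> list:
    """Split "cluster:service:gpu" triples, as the bash IFS=":" read does."""
    return [ClusterSpec(*entry.split(":", 2)) for entry in entries]

clusters = parse_clusters([
    "prod-us-east-1:eks:h100",
    "prod-eu-west-1:eks:h100",
    "staging-us-west-2:eks:gb200",
])
for spec in clusters:
    print(f"{spec.cluster}: service={spec.service} gpu={spec.gpu}")
```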
Pattern 5: GitOps Deployment with Argo CD
Use Argo CD for declarative, GitOps-based deployments with automatic sync-wave ordering.
Use case: Automated deployment pipeline with Argo CD
```yaml
# GitHub Actions
name: GitOps Deploy with Argo CD
on:
  push:
    branches: [main]

jobs:
  generate-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup aicr
        run: |
          curl -sLO https://github.com/nvidia/aicr/releases/latest/download/aicr_linux_amd64.tar.gz
          tar -xzf aicr_linux_amd64.tar.gz
          sudo mv aicr /usr/local/bin/

      - name: Generate recipe
        run: |
          aicr recipe \
            --service eks \
            --accelerator h100 \
            --intent training \
            --os ubuntu \
            --output recipe.yaml

      - name: Generate Argo CD bundles
        run: |
          aicr bundle \
            --recipe recipe.yaml \
            --deployer argocd \
            --repo https://github.com/${{ github.repository }}.git \
            --output ./bundles

      - name: Commit to GitOps repo
        run: |
          # Copy entire bundle to GitOps repository
          # Argo CD apps are in <component>/argocd/ directories
          # app-of-apps.yaml is at bundle root
          cp -r bundles/* gitops-repo/

          cd gitops-repo
          git add .
          git commit -m "Update GPU stack components"
          git push
```
The generated Argo CD Application uses multiple sources:
```yaml
# bundles/gpu-operator/argocd/application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"  # Deployed after cert-manager (wave 0)
spec:
  project: default
  sources:
    # Helm chart from upstream
    - repoURL: https://helm.ngc.nvidia.com/nvidia
      chart: gpu-operator
      targetRevision: v25.3.3
      helm:
        valueFiles:
          - $values/gpu-operator/values.yaml
    # Values from GitOps repo
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      ref: values
    # Additional manifests (ClusterPolicy, etc.)
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: gpu-operator/manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```
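Argo CD applies Applications in ascending `sync-wave` order, which is how cert-manager (wave 0) lands before gpu-operator (wave 1) above. A minimal sketch of that ordering logic, with made-up application names and a missing annotation defaulting to wave 0 (Argo CD's behavior):

```python
# Illustrative: sort applications by their argocd.argoproj.io/sync-wave
# annotation, the way Argo CD orders deployment (missing wave -> 0).

WAVE_KEY = "argocd.argoproj.io/sync-wave"

def deploy_order(apps: list) -> list:
    def wave(app: dict) -> int:
        return int(app.get("metadata", {}).get("annotations", {}).get(WAVE_KEY, "0"))
    return [a["metadata"]["name"] for a in sorted(apps, key=wave)]

apps = [
    {"metadata": {"name": "gpu-operator", "annotations": {WAVE_KEY: "1"}}},
    {"metadata": {"name": "cert-manager", "annotations": {WAVE_KEY: "0"}}},
    {"metadata": {"name": "network-operator", "annotations": {WAVE_KEY: "2"}}},
]
print(deploy_order(apps))
```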
Pattern 6: Multi-Environment GitOps
Deploy to multiple environments with environment-specific deployers.
```bash
#!/bin/bash
# multi-env-gitops.sh

ENVIRONMENTS=(
  "staging:helm"       # Staging uses Helm per-component bundle
  "production:argocd"  # Production uses Argo CD
)

for env_config in "${ENVIRONMENTS[@]}"; do
  IFS=":" read -r ENV DEPLOYER <<< "$env_config"

  echo "Generating bundles for $ENV with $DEPLOYER deployer..."

  aicr bundle \
    --recipe "recipes/${ENV}.yaml" \
    --deployer "$DEPLOYER" \
    --output "./bundles/${ENV}"

  echo "Generated $DEPLOYER bundles in ./bundles/${ENV}/"
done
```
Terraform Integration
Module: AICR Agent Deployment
```hcl
# modules/aicr-agent/main.tf

# Deploy agent and capture snapshot using CLI
resource "null_resource" "capture_snapshot" {
  provisioner "local-exec" {
    command = <<-EOT
      aicr snapshot \
        --output ${var.snapshot_output} \
        --timeout 300s
    EOT
  }
}

# Generate recipe (can use ConfigMap directly)
resource "null_resource" "generate_recipe" {
  provisioner "local-exec" {
    command = <<-EOT
      aicr recipe \
        -s cm://gpu-operator/aicr-snapshot \
        --intent ${var.workload_intent} \
        -o ${var.recipe_output}
    EOT
  }

  depends_on = [null_resource.capture_snapshot]
}

# variables.tf
variable "node_selector" {
  description = "Node selector for agent pod"
  type        = map(string)
  default     = { "nvidia.com/gpu.present" = "true" }
}

variable "tolerations" {
  description = "Tolerations for agent pod"
  type = list(object({
    key    = string
    value  = string
    effect = string
  }))
  default = []
}

variable "image_version" {
  description = "AICR image version"
  type        = string
  default     = "latest"
}

variable "snapshot_output" {
  description = "Path to save snapshot"
  type        = string
  default     = "snapshot.yaml"
}

variable "recipe_output" {
  description = "Path to save recipe"
  type        = string
  default     = "recipe.yaml"
}

variable "workload_intent" {
  description = "Workload intent: training or inference"
  type        = string
  default     = "training"
}

# outputs.tf
output "snapshot_file" {
  value = var.snapshot_output
}

output "recipe_file" {
  value = var.recipe_output
}
```
Usage:
```hcl
# main.tf
module "aicr_agent" {
  source = "./modules/aicr-agent"

  node_selector = {
    "nodeGroup" = "gpu-nodes"
  }

  tolerations = [{
    key    = "nvidia.com/gpu"
    value  = ""
    effect = "NoSchedule"
  }]

  workload_intent = "training"
  snapshot_output = "cluster-${var.environment}-snapshot.yaml"
  recipe_output   = "cluster-${var.environment}-recipe.yaml"
}
```
Kubernetes Operators
Custom Operator: Configuration Drift Watcher
```go
// Watch for configuration changes and reconcile
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/kubernetes"
	ctrl "sigs.k8s.io/controller-runtime"
)

type ConfigReconciler struct {
	Client    kubernetes.Interface
	Namespace string
}

func (r *ConfigReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Deploy AICR agent
	if err := r.deployAgent(ctx); err != nil {
		return ctrl.Result{}, err
	}

	// 2. Wait for completion
	if err := r.waitForJob(ctx); err != nil {
		return ctrl.Result{}, err
	}

	// 3. Retrieve snapshot
	snapshot, err := r.getSnapshot(ctx)
	if err != nil {
		return ctrl.Result{}, err
	}

	// 4. Compare with baseline
	if r.hasConfigDrift(snapshot) {
		// Alert or auto-remediate
		fmt.Println("Configuration drift detected!")
	}

	// 5. Clean up
	if err := r.cleanupAgent(ctx); err != nil {
		return ctrl.Result{}, err
	}

	// Requeue after 6 hours
	return ctrl.Result{RequeueAfter: 6 * time.Hour}, nil
}

func (r *ConfigReconciler) deployAgent(ctx context.Context) error {
	// Apply RBAC and Job manifests
	return nil
}

func (r *ConfigReconciler) waitForJob(ctx context.Context) error {
	// Wait for job completion with timeout
	return nil
}

func (r *ConfigReconciler) getSnapshot(ctx context.Context) (string, error) {
	// Retrieve snapshot from ConfigMap
	return "", nil
}

func (r *ConfigReconciler) hasConfigDrift(snapshot string) bool {
	// Compare with baseline
	return false
}

func (r *ConfigReconciler) cleanupAgent(ctx context.Context) error {
	// Delete job
	return nil
}
```
Monitoring and Alerting
Prometheus Metrics
Scrape the AICR API server:
```yaml
# prometheus-config.yaml
scrape_configs:
  - job_name: 'aicrd'
    static_configs:
      - targets: ['aicrd.default.svc.cluster.local:8080']
    metrics_path: /metrics
```
Key metrics:
```promql
# Request rate
rate(aicr_http_requests_total[5m])

# Error rate
rate(aicr_http_requests_total{status=~"5.."}[5m])

# Latency (p95)
histogram_quantile(0.95,
  rate(aicr_http_request_duration_seconds_bucket[5m])
)

# Rate limit rejections
rate(aicr_rate_limit_rejects_total[5m])
```
Alerting Rules
```yaml
# prometheus-rules.yaml
groups:
  - name: aicr_alerts
    interval: 30s
    rules:
      - alert: AICRHighErrorRate
        expr: |
          rate(aicr_http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AICR API high error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: AICRHighLatency
        expr: |
          histogram_quantile(0.95,
            rate(aicr_http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AICR API high latency"
          description: "P95 latency is {{ $value }}s"

      - alert: AICRRateLimitHit
        expr: |
          rate(aicr_rate_limit_rejects_total[5m]) > 1
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "AICR API rate limit reached"
          description: "Rate limit rejections: {{ $value }}/s"
```
Best Practices
1. Caching Recipes
API responses are cacheable (`Cache-Control: max-age=300`):
```python
import requests
from cachetools import TTLCache

# Cache recipes for 5 minutes
recipe_cache = TTLCache(maxsize=100, ttl=300)

def get_recipe_cached(params):
    cache_key = frozenset(params.items())

    if cache_key not in recipe_cache:
        response = requests.get('http://localhost:8080/v1/recipe', params=params)
        recipe_cache[cache_key] = response.json()

    return recipe_cache[cache_key]
```
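If you prefer the cache TTL to track the server's header rather than hard-coding 300 seconds, you can parse `max-age` from the `Cache-Control` response header. A stdlib-only sketch; the 300-second fallback matches the documented default:

```python
import re

def max_age(cache_control, default: int = 300) -> int:
    """Extract max-age seconds from a Cache-Control header value,
    falling back to the documented 300s default."""
    if cache_control:
        match = re.search(r"max-age=(\d+)", cache_control)
        if match:
            return int(match.group(1))
    return default

print(max_age("public, max-age=300"))  # 300
print(max_age(None))                   # falls back to 300
```

In the caching example above, `response.headers.get('Cache-Control')` would supply the header value, and the result could seed the `TTLCache` ttl.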
2. Error Handling and Retries
```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def get_recipe_with_retry(params):
    response = requests.get('http://localhost:8080/v1/recipe', params=params)
    response.raise_for_status()
    return response.json()
```
3. Parallel Recipe Generation
```python
from concurrent.futures import ThreadPoolExecutor
import requests

def get_recipe(params):
    response = requests.get('http://localhost:8080/v1/recipe', params=params)
    return response.json()

# Generate recipes for multiple environments in parallel
environments = [
    {'os': 'ubuntu', 'accelerator': 'h100', 'service': 'eks'},
    {'os': 'ubuntu', 'accelerator': 'gb200', 'service': 'gke'},
    {'os': 'rhel', 'accelerator': 'a100', 'service': 'aks'},
]

with ThreadPoolExecutor(max_workers=3) as executor:
    recipes = list(executor.map(get_recipe, environments))
```
4. Structured Logging
```python
import logging
import json

# Configure structured logging
logging.basicConfig(
    level=logging.INFO,
    format='%(message)s'
)

def log_recipe_request(params, recipe, duration):
    logging.info(json.dumps({
        'event': 'recipe_generated',
        'params': params,
        'component_refs': len(recipe.get('componentRefs', [])),
        'applied_overlays': len(recipe.get('metadata', {}).get('appliedOverlays', [])),
        'duration_ms': duration * 1000
    }))
```
5. Snapshot Versioning
```bash
#!/bin/bash
# Save snapshots with metadata

CLUSTER="prod-us-east-1"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
OUTPUT="snapshot-${CLUSTER}-${TIMESTAMP}.yaml"

# Capture snapshot from ConfigMap
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > "$OUTPUT"

# Add metadata
cat << EOF > "${OUTPUT}.meta"
cluster: $CLUSTER
timestamp: $TIMESTAMP
git_commit: $(git rev-parse HEAD)
k8s_version: $(kubectl version -o json | jq -r '.serverVersion.gitVersion')
EOF

# Upload to artifact storage
aws s3 cp "$OUTPUT" "s3://my-bucket/snapshots/"
aws s3 cp "${OUTPUT}.meta" "s3://my-bucket/snapshots/"
```
Security Considerations
API Key Management (Future)
```python
import os
import uuid

import requests

API_KEY = os.environ.get('AICR_API_KEY')

headers = {
    'Authorization': f'Bearer {API_KEY}',
    'X-Request-Id': str(uuid.uuid4())
}

response = requests.get(
    'http://localhost:8080/v1/recipe',
    params={'os': 'ubuntu', 'gpu': 'h100'},
    headers=headers
)
```
Network Policies
Restrict AICR agent network access:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aicr-agent
  namespace: gpu-operator
spec:
  podSelector:
    matchLabels:
      job-name: aicr
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 443  # Kubernetes API
```
Secrets Management
```yaml
# kubernetes-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: aicr-credentials
  namespace: gpu-operator
type: Opaque
stringData:
  api-key: your-api-key-here
```
```yaml
# Reference in pod
env:
  - name: AICR_API_KEY
    valueFrom:
      secretKeyRef:
        name: aicr-credentials
        key: api-key
```
Troubleshooting
Debug API Calls
```bash
# Verbose curl
curl -v "http://localhost:8080/v1/recipe?os=ubuntu&accelerator=h100"

# With timing
curl -w "\nTime: %{time_total}s\n" \
  "http://localhost:8080/v1/recipe?os=ubuntu&accelerator=h100"

# Check headers
curl -I "http://localhost:8080/v1/recipe?os=ubuntu&accelerator=h100"
```
Validate Snapshots
```bash
# Check YAML syntax
yamllint snapshot.yaml

# Validate structure
yq eval '.measurements | length' snapshot.yaml

# Check for required measurements
yq eval '.measurements[] | .type' snapshot.yaml | sort -u
```
Test Recipe Generation
```bash
# Generate and validate
aicr recipe --os ubuntu --accelerator h100 --output recipe.yaml
yamllint recipe.yaml

# Check applied overlays
yq eval '.metadata.appliedOverlays' recipe.yaml

# Extract GPU Operator version from componentRefs
yq eval '.componentRefs[] | select(.name=="gpu-operator") | .version' recipe.yaml
```
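When recipes are consumed programmatically, the spot checks above can be collected into one validation function. An illustrative sketch: the fields it requires (`componentRefs` entries with `name`/`version`, `metadata.appliedOverlays`) follow the `yq` queries above, and the sample recipe is made up.

```python
# Illustrative: collect the yq-style checks into one function.
# Required fields follow the queries above; sample recipe is made up.

def validate_recipe(recipe: dict) -> list:
    """Return a list of problems; empty means the recipe passes these checks."""
    problems = []
    refs = recipe.get("componentRefs", [])
    if not refs:
        problems.append("componentRefs is empty or missing")
    for i, ref in enumerate(refs):
        for field in ("name", "version"):
            if not ref.get(field):
                problems.append(f"componentRefs[{i}] missing {field}")
    if "appliedOverlays" not in recipe.get("metadata", {}):
        problems.append("metadata.appliedOverlays missing")
    return problems

recipe = {
    "metadata": {"appliedOverlays": ["eks", "h100"]},
    "componentRefs": [{"name": "gpu-operator", "version": "v25.3.3"}],
}
print(validate_recipe(recipe))  # []
```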
See Also
- API Reference - API endpoint documentation
- Data Flow - Understanding data architecture
- Kubernetes Deployment - Self-hosted API server
- CLI Reference - CLI commands
- Agent Deployment - Kubernetes agent