Automation and CI/CD Integration

Integration patterns for using AICR in automated pipelines.

Overview

Typical integration workflows:

  1. Snapshot capture: Deploy agent Job to capture cluster configuration
  2. Recipe generation: Generate configuration recommendations from snapshot or query parameters
  3. Bundle creation: Create deployment artifacts (Helm values, manifests, scripts)
  4. Deployment: Apply generated configuration to cluster
  5. Validation: Verify deployment using test workloads

Supported CI/CD platforms: GitHub Actions, GitLab CI, Jenkins, Argo Workflows, Tekton
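The five-step flow above can be sketched as a small driver that shells out to the aicr CLI. The helper below is a hypothetical wrapper (not part of AICR) that only assembles the argument lists for the snapshot, recipe, and bundle stages, using the flags shown in the examples later on this page:

```python
import subprocess

def pipeline_commands(snapshot="snapshot.yaml", recipe="recipe.yaml",
                      bundles="./bundles", intent="training"):
    """Build the aicr invocations for snapshot -> recipe -> bundle."""
    return [
        ["aicr", "snapshot", "--output", snapshot, "--timeout", "300s"],
        ["aicr", "recipe", "--snapshot", snapshot,
         "--intent", intent, "--output", recipe],
        ["aicr", "bundle", "--recipe", recipe, "--output", bundles],
    ]

def run_pipeline(**kwargs):
    # Stop on the first failing stage; deployment and validation
    # are left to the platform-specific patterns below.
    for cmd in pipeline_commands(**kwargs):
        subprocess.run(cmd, check=True)
```

Deployment and validation (steps 4 and 5) are intentionally omitted here because they are platform-specific; the patterns below show them per CI system.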

Integration Patterns

Pattern 1: Configuration Snapshot + Drift Detection

Periodically capture snapshots and compare them against a baseline.

Use case: Detect unauthorized configuration changes

```yaml
# GitHub Actions
name: Configuration Drift Detection
on:
  schedule:
    - cron: '0 */6 * * *' # Every 6 hours

jobs:
  snapshot:
    runs-on: ubuntu-latest
    steps:
      - name: Configure kubectl
        uses: azure/k8s-set-context@v4
        with:
          kubeconfig: ${{ secrets.KUBECONFIG }}

      - name: Deploy AICR Agent
        run: |
          aicr snapshot --output cm://gpu-operator/aicr-snapshot --timeout 300s

      - name: Wait for completion
        run: |
          kubectl wait --for=condition=complete --timeout=300s job/aicr -n gpu-operator

      - name: Capture snapshot from ConfigMap
        run: |
          kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > snapshot-$(date +%Y%m%d-%H%M%S).yaml

      - name: Compare with baseline
        run: |
          # Download baseline
          curl -O https://your-artifacts/baseline.yaml

          # Compare
          if ! diff -q baseline.yaml snapshot-*.yaml; then
            echo "::error::Configuration drift detected"
            diff baseline.yaml snapshot-*.yaml
            exit 1
          fi

      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: cluster-snapshots
          path: snapshot-*.yaml
```
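A plain diff is sensitive to key ordering and whitespace, so the comparison step above can report drift that isn't real. Comparing the parsed structures avoids this. A minimal sketch, assuming both snapshots have already been parsed into nested dictionaries (for example with PyYAML):

```python
def dict_drift(baseline, current, path=""):
    """Return a list of (path, baseline_value, current_value) differences
    between two nested dicts, ignoring key order."""
    diffs = []
    for key in sorted(set(baseline) | set(current)):
        here = f"{path}.{key}" if path else str(key)
        if key not in baseline:
            diffs.append((here, None, current[key]))       # added key
        elif key not in current:
            diffs.append((here, baseline[key], None))      # removed key
        elif isinstance(baseline[key], dict) and isinstance(current[key], dict):
            diffs.extend(dict_drift(baseline[key], current[key], here))
        elif baseline[key] != current[key]:
            diffs.append((here, baseline[key], current[key]))
    return diffs
```

A drift-detection step would exit non-zero whenever the returned list is non-empty, mirroring the `diff` step in the workflow.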

Pattern 2: Recipe-Based Deployment

Generate an optimized configuration and deploy operators.

Use case: Deploy GPU Operator with environment-specific settings

```yaml
# GitLab CI
stages:
  - snapshot
  - recipe
  - bundle
  - deploy

capture_snapshot:
  stage: snapshot
  image: bitnami/kubectl:latest
  script:
    - aicr snapshot --output snapshot.yaml --timeout 300s
  artifacts:
    paths:
      - snapshot.yaml

generate_recipe:
  stage: recipe
  image: ghcr.io/nvidia/aicr:latest
  script:
    # Option 1: Use ConfigMap directly (no artifact needed)
    - aicr recipe -s cm://gpu-operator/aicr-snapshot --intent training --platform kubeflow -o recipe.yaml
    # Option 2: Use snapshot file from previous stage
    # - aicr recipe --snapshot snapshot.yaml --intent training --platform kubeflow --output recipe.yaml
  artifacts:
    paths:
      - recipe.yaml
  dependencies:
    - capture_snapshot

create_bundle:
  stage: bundle
  image: ghcr.io/nvidia/aicr:latest
  script:
    - aicr bundle --recipe recipe.yaml --output ./bundles
    # Override values at bundle generation time
    # - aicr bundle -r recipe.yaml --set gpuoperator:gds.enabled=true -o ./bundles
  artifacts:
    paths:
      - bundles/
  dependencies:
    - generate_recipe

deploy_operators:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - cd bundles
    - sha256sum -c checksums.txt
    - chmod +x deploy.sh
    - ./deploy.sh
  dependencies:
    - create_bundle
  when: manual
```

Pattern 3: API-Driven Recipe Generation

Use the API for recipe generation without installing the CLI.

Use case: Lightweight recipe generation in containers

```yaml
# CircleCI
version: 2.1

jobs:
  generate_recipe:
    docker:
      - image: cimg/base:2025.01
    steps:
      - run:
          name: Generate recipe via API
          command: |
            # Detect environment
            OS="ubuntu"
            ACCELERATOR="h100"
            SERVICE="eks"

            # Generate recipe
            curl -s "http://localhost:8080/v1/recipe?os=${OS}&accelerator=${ACCELERATOR}&service=${SERVICE}&intent=training" \
              -o recipe.json

            # Validate
            jq -e '.measurements | length > 0' recipe.json

      - persist_to_workspace:
          root: .
          paths:
            - recipe.json

  extract_versions:
    docker:
      - image: cimg/base:2025.01
    steps:
      - attach_workspace:
          at: .

      - run:
          name: Extract component versions
          command: |
            # GPU Operator version from componentRefs
            GPU_OP_VERSION=$(jq -r '.componentRefs[] |
              select(.name=="gpu-operator") | .version' recipe.json)

            echo "GPU Operator: $GPU_OP_VERSION"

            # Save for deployment
            echo "export GPU_OP_VERSION=$GPU_OP_VERSION" >> $BASH_ENV

workflows:
  deploy:
    jobs:
      - generate_recipe
      - extract_versions:
          requires:
            - generate_recipe
```

Pattern 4: Multi-Cluster Management

Deploy consistent configurations across multiple clusters.

Use case: Multi-region GPU clusters with unified configuration

```bash
#!/bin/bash
# multi-cluster-deploy.sh

# Define clusters
CLUSTERS=(
  "prod-us-east-1:eks:h100"
  "prod-eu-west-1:eks:h100"
  "staging-us-west-2:eks:gb200"
)

# Iterate clusters
for cluster_config in "${CLUSTERS[@]}"; do
  IFS=":" read -r CLUSTER SERVICE GPU <<< "$cluster_config"

  echo "Processing cluster: $CLUSTER"

  # Switch context
  kubectl config use-context "$CLUSTER"

  # Capture snapshot
  aicr snapshot --output "snapshot-${CLUSTER}.yaml" --timeout 300s

  # Generate recipe (can use ConfigMap directly or file)
  # Option 1: Use ConfigMap
  aicr recipe -s "cm://gpu-operator/aicr-snapshot" --intent training --platform kubeflow -o "recipe-${CLUSTER}.yaml"
  # Option 2: Use saved file
  # aicr recipe --snapshot "snapshot-${CLUSTER}.yaml" --intent training --platform kubeflow --output "recipe-${CLUSTER}.yaml"

  # Create bundle
  aicr bundle \
    --recipe "recipe-${CLUSTER}.yaml" \
    --output "./bundles/${CLUSTER}"

  # Or with value overrides for environment-specific customization
  # aicr bundle \
  #   --recipe "recipe-${CLUSTER}.yaml" \
  #   --set gpuoperator:gds.enabled=true \
  #   --set gpuoperator:mig.strategy=mixed \
  #   --output "./bundles/${CLUSTER}"

  # Deploy (with approval)
  echo "Deploy to $CLUSTER? [y/N]"
  read -r response
  if [[ "$response" =~ ^[Yy]$ ]]; then
    cd "bundles/${CLUSTER}"
    chmod +x deploy.sh && ./deploy.sh
    cd -
  fi

  # Clean up
  kubectl delete job aicr -n gpu-operator
done
```

Pattern 5: GitOps Deployment with Argo CD

Use Argo CD for declarative, GitOps-based deployments with automatic sync-wave ordering.

Use case: Automated deployment pipeline with Argo CD

```yaml
# GitHub Actions
name: GitOps Deploy with Argo CD
on:
  push:
    branches: [main]

jobs:
  generate-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup aicr
        run: |
          curl -sLO https://github.com/nvidia/aicr/releases/latest/download/aicr_linux_amd64.tar.gz
          tar -xzf aicr_linux_amd64.tar.gz
          sudo mv aicr /usr/local/bin/

      - name: Generate recipe
        run: |
          aicr recipe \
            --service eks \
            --accelerator h100 \
            --intent training \
            --os ubuntu \
            --output recipe.yaml

      - name: Generate Argo CD bundles
        run: |
          aicr bundle \
            --recipe recipe.yaml \
            --deployer argocd \
            --repo https://github.com/${{ github.repository }}.git \
            --output ./bundles

      - name: Commit to GitOps repo
        run: |
          # Copy entire bundle to GitOps repository
          # Argo CD apps are in <component>/argocd/ directories
          # app-of-apps.yaml is at bundle root
          cp -r bundles/* gitops-repo/

          cd gitops-repo
          git add .
          git commit -m "Update GPU stack components"
          git push
```

Generated Argo CD Application with multi-source:

```yaml
# bundles/gpu-operator/argocd/application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1" # Deployed after cert-manager (wave 0)
spec:
  project: default
  sources:
    # Helm chart from upstream
    - repoURL: https://helm.ngc.nvidia.com/nvidia
      chart: gpu-operator
      targetRevision: v25.3.3
      helm:
        valueFiles:
          - $values/gpu-operator/values.yaml
    # Values from GitOps repo
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      ref: values
    # Additional manifests (ClusterPolicy, etc.)
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: gpu-operator/manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

Pattern 6: Multi-Environment GitOps

Deploy to multiple environments with environment-specific deployers.

```bash
#!/bin/bash
# multi-env-gitops.sh

ENVIRONMENTS=(
  "staging:helm"      # Staging uses Helm per-component bundle
  "production:argocd" # Production uses Argo CD
)

for env_config in "${ENVIRONMENTS[@]}"; do
  IFS=":" read -r ENV DEPLOYER <<< "$env_config"

  echo "Generating bundles for $ENV with $DEPLOYER deployer..."

  aicr bundle \
    --recipe "recipes/${ENV}.yaml" \
    --deployer "$DEPLOYER" \
    --output "./bundles/${ENV}"

  echo "Generated $DEPLOYER bundles in ./bundles/${ENV}/"
done
```

Terraform Integration

Module: AICR Agent Deployment

```hcl
# modules/aicr-agent/main.tf

# Deploy agent and capture snapshot using CLI
resource "null_resource" "capture_snapshot" {
  provisioner "local-exec" {
    command = <<-EOT
      aicr snapshot \
        --output ${var.snapshot_output} \
        --timeout 300s
    EOT
  }
}

# Generate recipe (can use ConfigMap directly)
resource "null_resource" "generate_recipe" {
  provisioner "local-exec" {
    command = <<-EOT
      aicr recipe \
        -s cm://gpu-operator/aicr-snapshot \
        --intent ${var.workload_intent} \
        -o ${var.recipe_output}
    EOT
  }

  depends_on = [null_resource.capture_snapshot]
}

# variables.tf
variable "node_selector" {
  description = "Node selector for agent pod"
  type        = map(string)
  default     = { "nvidia.com/gpu.present" = "true" }
}

variable "tolerations" {
  description = "Tolerations for agent pod"
  type = list(object({
    key    = string
    value  = string
    effect = string
  }))
  default = []
}

variable "image_version" {
  description = "AICR image version"
  type        = string
  default     = "latest"
}

variable "snapshot_output" {
  description = "Path to save snapshot"
  type        = string
  default     = "snapshot.yaml"
}

variable "recipe_output" {
  description = "Path to save recipe"
  type        = string
  default     = "recipe.yaml"
}

variable "workload_intent" {
  description = "Workload intent: training or inference"
  type        = string
  default     = "training"
}

# outputs.tf
output "snapshot_file" {
  value = var.snapshot_output
}

output "recipe_file" {
  value = var.recipe_output
}
```

Usage:

```hcl
# main.tf
module "aicr_agent" {
  source = "./modules/aicr-agent"

  node_selector = {
    "nodeGroup" = "gpu-nodes"
  }

  tolerations = [{
    key    = "nvidia.com/gpu"
    value  = ""
    effect = "NoSchedule"
  }]

  workload_intent = "training"
  snapshot_output = "cluster-${var.environment}-snapshot.yaml"
  recipe_output   = "cluster-${var.environment}-recipe.yaml"
}
```

Kubernetes Operators

Custom Operator: Configuration Drift Watcher

```go
// Watch for configuration changes and reconcile
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/kubernetes"
	ctrl "sigs.k8s.io/controller-runtime"
)

type ConfigReconciler struct {
	Client    kubernetes.Interface
	Namespace string
}

func (r *ConfigReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Deploy AICR agent
	if err := r.deployAgent(ctx); err != nil {
		return ctrl.Result{}, err
	}

	// 2. Wait for completion
	if err := r.waitForJob(ctx); err != nil {
		return ctrl.Result{}, err
	}

	// 3. Retrieve snapshot
	snapshot, err := r.getSnapshot(ctx)
	if err != nil {
		return ctrl.Result{}, err
	}

	// 4. Compare with baseline
	if r.hasConfigDrift(snapshot) {
		// Alert or auto-remediate
		fmt.Println("Configuration drift detected!")
	}

	// 5. Clean up
	if err := r.cleanupAgent(ctx); err != nil {
		return ctrl.Result{}, err
	}

	// Requeue after 6 hours
	return ctrl.Result{RequeueAfter: 6 * time.Hour}, nil
}

func (r *ConfigReconciler) deployAgent(ctx context.Context) error {
	// Apply RBAC and Job manifests
	return nil
}

func (r *ConfigReconciler) waitForJob(ctx context.Context) error {
	// Wait for job completion with timeout
	return nil
}

func (r *ConfigReconciler) getSnapshot(ctx context.Context) (string, error) {
	// Retrieve snapshot from ConfigMap
	return "", nil
}

func (r *ConfigReconciler) hasConfigDrift(snapshot string) bool {
	// Compare with baseline
	return false
}

func (r *ConfigReconciler) cleanupAgent(ctx context.Context) error {
	// Delete job
	return nil
}
```

Monitoring and Alerting

Prometheus Metrics

Scrape the AICR API server:

```yaml
# prometheus-config.yaml
scrape_configs:
  - job_name: 'aicrd'
    static_configs:
      - targets: ['aicrd.default.svc.cluster.local:8080']
    metrics_path: /metrics
```

Key metrics:

```promql
# Request rate
rate(aicr_http_requests_total[5m])

# Error rate
rate(aicr_http_requests_total{status=~"5.."}[5m])

# Latency (p95)
histogram_quantile(0.95,
  rate(aicr_http_request_duration_seconds_bucket[5m])
)

# Rate limit rejections
rate(aicr_rate_limit_rejects_total[5m])
```

Alerting Rules

```yaml
# prometheus-rules.yaml
groups:
  - name: aicr_alerts
    interval: 30s
    rules:
      - alert: AICRHighErrorRate
        expr: |
          rate(aicr_http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AICR API high error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: AICRHighLatency
        expr: |
          histogram_quantile(0.95,
            rate(aicr_http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AICR API high latency"
          description: "P95 latency is {{ $value }}s"

      - alert: AICRRateLimitHit
        expr: |
          rate(aicr_rate_limit_rejects_total[5m]) > 1
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "AICR API rate limit reached"
          description: "Rate limit rejections: {{ $value }}/s"
```

Best Practices

1. Caching Recipes

API responses are cacheable (Cache-Control: max-age=300):

```python
import requests
from cachetools import TTLCache

# Cache recipes for 5 minutes
recipe_cache = TTLCache(maxsize=100, ttl=300)

def get_recipe_cached(params):
    cache_key = frozenset(params.items())

    if cache_key not in recipe_cache:
        response = requests.get('http://localhost:8080/v1/recipe', params=params)
        recipe_cache[cache_key] = response.json()

    return recipe_cache[cache_key]
```

2. Error Handling and Retries

```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def get_recipe_with_retry(params):
    response = requests.get('http://localhost:8080/v1/recipe', params=params)
    response.raise_for_status()
    return response.json()
```

3. Parallel Recipe Generation

```python
from concurrent.futures import ThreadPoolExecutor
import requests

def get_recipe(params):
    response = requests.get('http://localhost:8080/v1/recipe', params=params)
    return response.json()

# Generate recipes for multiple environments in parallel
environments = [
    {'os': 'ubuntu', 'accelerator': 'h100', 'service': 'eks'},
    {'os': 'ubuntu', 'accelerator': 'gb200', 'service': 'gke'},
    {'os': 'rhel', 'accelerator': 'a100', 'service': 'aks'},
]

with ThreadPoolExecutor(max_workers=3) as executor:
    recipes = list(executor.map(get_recipe, environments))
```

4. Structured Logging

```python
import logging
import json

# Configure structured logging
logging.basicConfig(
    level=logging.INFO,
    format='%(message)s'
)

def log_recipe_request(params, recipe, duration):
    logging.info(json.dumps({
        'event': 'recipe_generated',
        'params': params,
        'component_refs': len(recipe.get('componentRefs', [])),
        'applied_overlays': len(recipe.get('metadata', {}).get('appliedOverlays', [])),
        'duration_ms': duration * 1000
    }))
```

5. Snapshot Versioning

```bash
#!/bin/bash
# Save snapshots with metadata

CLUSTER="prod-us-east-1"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
OUTPUT="snapshot-${CLUSTER}-${TIMESTAMP}.yaml"

# Capture snapshot from ConfigMap
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > "$OUTPUT"

# Add metadata
cat << EOF > "${OUTPUT}.meta"
cluster: $CLUSTER
timestamp: $TIMESTAMP
git_commit: $(git rev-parse HEAD)
k8s_version: $(kubectl version -o json | jq -r '.serverVersion.gitVersion')
EOF

# Upload to artifact storage
aws s3 cp "$OUTPUT" "s3://my-bucket/snapshots/"
aws s3 cp "${OUTPUT}.meta" "s3://my-bucket/snapshots/"
```

Security Considerations

API Key Management (Future)

```python
import os
import uuid

import requests

API_KEY = os.environ.get('AICR_API_KEY')

headers = {
    'Authorization': f'Bearer {API_KEY}',
    'X-Request-Id': str(uuid.uuid4())
}

response = requests.get(
    'http://localhost:8080/v1/recipe',
    params={'os': 'ubuntu', 'accelerator': 'h100'},
    headers=headers
)
```

Network Policies

Restrict AICR agent network access:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aicr-agent
  namespace: gpu-operator
spec:
  podSelector:
    matchLabels:
      job-name: aicr
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 443 # Kubernetes API
```

Secrets Management

```yaml
# kubernetes-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: aicr-credentials
  namespace: gpu-operator
type: Opaque
stringData:
  api-key: your-api-key-here
```

```yaml
# Reference in pod
env:
  - name: AICR_API_KEY
    valueFrom:
      secretKeyRef:
        name: aicr-credentials
        key: api-key
```

Troubleshooting

Debug API Calls

```bash
# Verbose curl
curl -v "http://localhost:8080/v1/recipe?os=ubuntu&accelerator=h100"

# With timing
curl -w "\nTime: %{time_total}s\n" \
  "http://localhost:8080/v1/recipe?os=ubuntu&accelerator=h100"

# Check headers
curl -I "http://localhost:8080/v1/recipe?os=ubuntu&accelerator=h100"
```

Validate Snapshots

```bash
# Check YAML syntax
yamllint snapshot.yaml

# Validate structure
yq eval '.measurements | length' snapshot.yaml

# Check for required measurements
yq eval '.measurements[] | .type' snapshot.yaml | sort -u
```

Test Recipe Generation

```bash
# Generate and validate
aicr recipe --os ubuntu --accelerator h100 --output recipe.yaml
yamllint recipe.yaml

# Check applied overlays
yq eval '.metadata.appliedOverlays' recipe.yaml

# Extract GPU Operator version from componentRefs
yq eval '.componentRefs[] | select(.name=="gpu-operator") | .version' recipe.yaml
```

See Also