Automation and CI/CD Integration

Integration patterns for using AICR in automated pipelines.

Overview

Typical integration workflows:

  1. Snapshot capture: Deploy agent Job to capture cluster configuration
  2. Recipe generation: Generate configuration recommendations from snapshot or query parameters
  3. Bundle creation: Create deployment artifacts (Helm values, manifests, scripts)
  4. Deployment: Apply generated configuration to cluster
  5. Validation: Verify deployment using test workloads

Supported CI/CD platforms: GitHub Actions, GitLab CI, Jenkins, Argo Workflows, Tekton
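
As a rough sketch, the three CLI-driven stages above can be written down as plain argument lists that any CI system could execute. The flags are the ones used in the pipeline examples later on this page; the function name `stage_commands` and its default values are illustrative assumptions, not part of AICR. Deployment and validation are left out because they run the generated script and test workloads rather than a fixed aicr command.

```python
import shlex
from typing import Dict, List

# Example location taken from the pipelines on this page, not a CLI default.
SNAPSHOT_URI = "cm://gpu-operator/aicr-snapshot"


def stage_commands(intent: str = "training",
                   snapshot_uri: str = SNAPSHOT_URI,
                   recipe_path: str = "recipe.yaml",
                   bundle_dir: str = "./bundles") -> Dict[str, List[str]]:
    """Return the argv for each CLI stage, in pipeline order (hypothetical helper)."""
    return {
        "snapshot": ["aicr", "snapshot", "--output", snapshot_uri,
                     "--timeout", "300s"],
        "recipe": ["aicr", "recipe", "--snapshot", snapshot_uri,
                   "--intent", intent, "--output", recipe_path],
        "bundle": ["aicr", "bundle", "--recipe", recipe_path,
                   "--output", bundle_dir],
    }


def render(commands: Dict[str, List[str]]) -> List[str]:
    """Format each stage as a shell-quoted line, e.g. for a CI log or dry run."""
    return [f"{stage}: {shlex.join(argv)}" for stage, argv in commands.items()]
```

Each list can be passed to `subprocess.run(argv, check=True)` so that a failing stage stops the pipeline.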

Integration Patterns

Pattern 1: Configuration Snapshot + Drift Detection

Periodically capture snapshots and compare against baseline.

Use case: Detect unauthorized configuration changes

# GitHub Actions
name: Configuration Drift Detection
on:
  schedule:
    - cron: '0 */6 * * *' # Every 6 hours

jobs:
  snapshot:
    runs-on: ubuntu-latest
    steps:
      - name: Configure kubectl
        uses: azure/k8s-set-context@v4
        with:
          kubeconfig: ${{ secrets.KUBECONFIG }}

      - name: Deploy AICR Agent
        run: |
          aicr snapshot --output cm://gpu-operator/aicr-snapshot --timeout 300s

      - name: Wait for completion
        run: |
          kubectl wait --for=condition=complete --timeout=300s job/aicr -n gpu-operator

      - name: Capture snapshot from ConfigMap
        run: |
          kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > snapshot-$(date +%Y%m%d-%H%M%S).yaml

      - name: Compare with baseline
        run: |
          # Download baseline
          curl -O https://your-artifacts/baseline.yaml

          # Compare
          if ! diff -q baseline.yaml snapshot-*.yaml; then
            echo "::error::Configuration drift detected"
            diff baseline.yaml snapshot-*.yaml
            exit 1
          fi

      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: cluster-snapshots
          path: snapshot-*.yaml
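
The comparison step can also be done in a script when the pipeline needs the diff as data, for example to attach it to an alert. The following is a minimal sketch using only the standard library; the function names are illustrative, and it compares the snapshot text line by line, so it assumes the snapshot contains no fields that legitimately change on every capture.

```python
import difflib
from typing import List


def snapshot_drift(baseline: str, current: str) -> List[str]:
    """Return unified-diff lines between two snapshot texts (empty if identical)."""
    return list(difflib.unified_diff(
        baseline.splitlines(),
        current.splitlines(),
        fromfile="baseline.yaml",
        tofile="snapshot.yaml",
        lineterm="",
    ))


def report_drift(baseline: str, current: str) -> int:
    """Print a GitHub Actions error annotation plus the diff; return an exit code."""
    drift = snapshot_drift(baseline, current)
    if not drift:
        return 0
    print("::error::Configuration drift detected")
    print("\n".join(drift))
    return 1
```

A CI step could read both files and call `sys.exit(report_drift(baseline_text, snapshot_text))` to fail the job on drift.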

Pattern 2: Canonical Snapshot to Bundle Pipeline

Generate optimized configuration and deploy operators. The pipeline below is the canonical reference: every stage uses the same aicr CLI invocations, so it translates directly to any CI system (see Translating to other CI systems below).

Use case: Deploy GPU Operator with environment-specific settings

# GitHub Actions
name: GPU Stack Deploy
on:
  workflow_dispatch:

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Configure kubectl
        uses: azure/k8s-set-context@v4
        with:
          kubeconfig: ${{ secrets.KUBECONFIG }}

      # 1. Snapshot: agent Job writes cluster state to a ConfigMap.
      - name: Capture snapshot
        run: |
          aicr snapshot --output cm://gpu-operator/aicr-snapshot --timeout 300s
          kubectl wait --for=condition=complete --timeout=300s \
            job/aicr -n gpu-operator

      # 2. Recipe: read the snapshot ConfigMap, emit an optimized recipe.
      #    Use --service/--accelerator/--intent flags for query mode instead.
      - name: Generate recipe
        run: |
          aicr recipe \
            --snapshot cm://gpu-operator/aicr-snapshot \
            --intent training \
            --platform kubeflow \
            --output recipe.yaml

      # 3. Bundle: render deployment artifacts. Add --set to override values,
      #    or --deployer argocd for GitOps output (see Pattern 3).
      - name: Create bundle
        run: aicr bundle --recipe recipe.yaml --output ./bundles

      # 4. Deploy: verify checksums, then run the generated script.
      - name: Deploy
        run: |
          cd bundles
          sha256sum -c checksums.txt
          chmod +x deploy.sh
          ./deploy.sh
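
On runners where `sha256sum` is unavailable, the checksum step in stage 4 can be approximated in a few lines. This sketch assumes `checksums.txt` uses the standard `sha256sum` line format (hex digest, separator, relative path), which the `sha256sum -c` call above implies; the function name is illustrative.

```python
import hashlib
from pathlib import Path
from typing import List


def verify_checksums(bundle_dir: str, manifest: str = "checksums.txt") -> List[str]:
    """Return the relative paths that are missing or whose SHA-256 digest differs."""
    root = Path(bundle_dir)
    failures = []
    for line in (root / manifest).read_text().splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition(" ")
        # sha256sum separates digest and path with two spaces or " *".
        name = name.lstrip(" *")
        target = root / name
        if not target.is_file():
            failures.append(name)
            continue
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        if actual != expected.lower():
            failures.append(name)
    return failures
```

An empty return value means every listed file matched; a deploy step should stop before running `deploy.sh` otherwise.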

Translating to other CI systems

The four stages above map one-to-one onto other platforms. Only the job/stage syntax and artifact passing differ — the aicr commands are identical.

| Stage | GitLab CI | CircleCI | Terraform |
| --- | --- | --- | --- |
| Snapshot | script: step running aicr snapshot, declare artifacts: paths: [snapshot.yaml] (or write to a ConfigMap to skip artifacts) | run: step; persist_to_workspace to pass output downstream | null_resource + local-exec provisioner running aicr snapshot |
| Recipe | script: step running aicr recipe, dependencies: on the snapshot job | run: step after attach_workspace | null_resource + local-exec, depends_on the snapshot resource |
| Bundle | script: step running aicr bundle, publish bundles/ as artifacts | run: step, persist_to_workspace | null_resource + local-exec running aicr bundle |
| Deploy | script: step with when: manual for approval gating | run: step inside a held workflow | local-exec running deploy.sh, gated by terraform apply approval |

Use a container image with the CLI preinstalled (ghcr.io/nvidia/aicr:latest) for the recipe/bundle stages, and a kubectl-capable image for snapshot/deploy.

Pattern 3: GitOps Deployment with Argo CD

Use Argo CD for declarative, GitOps-based deployments with automatic sync-wave ordering.

Use case: Automated deployment pipeline with Argo CD

# GitHub Actions
name: GitOps Deploy with Argo CD
on:
  push:
    branches: [main]

jobs:
  generate-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup aicr
        run: |
          curl -sLO https://github.com/nvidia/aicr/releases/latest/download/aicr_linux_amd64.tar.gz
          tar -xzf aicr_linux_amd64.tar.gz
          sudo mv aicr /usr/local/bin/

      - name: Generate recipe
        run: |
          aicr recipe \
            --service eks \
            --accelerator h100 \
            --intent training \
            --os ubuntu \
            --output recipe.yaml

      - name: Generate Argo CD bundles
        run: |
          aicr bundle \
            --recipe recipe.yaml \
            --deployer argocd \
            --repo https://github.com/${{ github.repository }}.git \
            --output ./bundles

      - name: Commit to GitOps repo
        run: |
          # Copy entire bundle to GitOps repository
          # Argo CD apps are in <component>/argocd/ directories
          # app-of-apps.yaml is at bundle root
          cp -r bundles/* gitops-repo/

          cd gitops-repo
          git add .
          git commit -m "Update GPU stack components"
          git push

Generated Argo CD Application with multi-source:

# bundles/gpu-operator/argocd/application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1" # Deployed after cert-manager (wave 0)
spec:
  project: default
  sources:
    # Helm chart from upstream
    - repoURL: https://helm.ngc.nvidia.com/nvidia
      chart: gpu-operator
      targetRevision: v26.3.2
      helm:
        valueFiles:
          - $values/gpu-operator/values.yaml
    # Values from GitOps repo
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      ref: values
    # Additional manifests (ClusterPolicy, etc.)
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: gpu-operator/manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
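
To review the order in which a generated bundle will roll out, the sync-wave annotations can be read back from the Application manifests. The sketch below works on manifests already parsed into dictionaries (for example with a YAML library) and treats a missing annotation as wave 0, which is Argo CD's default; the helper names are illustrative.

```python
from typing import Any, Dict, List

WAVE_KEY = "argocd.argoproj.io/sync-wave"


def sync_wave(app: Dict[str, Any]) -> int:
    """Read the sync-wave annotation as an integer; missing means wave 0."""
    annotations = app.get("metadata", {}).get("annotations") or {}
    return int(annotations.get(WAVE_KEY, "0"))


def deployment_order(apps: List[Dict[str, Any]]) -> List[str]:
    """Application names sorted by wave, lowest first (ties keep input order)."""
    return [a["metadata"]["name"] for a in sorted(apps, key=sync_wave)]
```

For the manifest above and a cert-manager Application in wave 0, `deployment_order` lists cert-manager before gpu-operator.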

Pattern 4: Multi-Environment GitOps

Deploy to multiple environments with environment-specific deployers.

#!/bin/bash
# multi-env-gitops.sh

ENVIRONMENTS=(
  "staging:helm"       # Staging uses Helm per-component bundle
  "production:argocd"  # Production uses Argo CD
)

for env_config in "${ENVIRONMENTS[@]}"; do
  IFS=":" read -r ENV DEPLOYER <<< "$env_config"

  echo "Generating bundles for $ENV with $DEPLOYER deployer..."

  aicr bundle \
    --recipe "recipes/${ENV}.yaml" \
    --deployer "$DEPLOYER" \
    --output "./bundles/${ENV}"

  echo "Generated $DEPLOYER bundles in ./bundles/${ENV}/"
done

Monitoring and Alerting

Prometheus Metrics

Scrape AICR API Server:

# prometheus-config.yaml
scrape_configs:
  - job_name: 'aicrd'
    static_configs:
      - targets: ['aicrd.default.svc.cluster.local:8080']
    metrics_path: /metrics

Key metrics:

# Request rate
rate(aicr_http_requests_total[5m])
# Error rate
rate(aicr_http_requests_total{status=~"5.."}[5m])
# Latency (p95)
histogram_quantile(0.95,
  rate(aicr_http_request_duration_seconds_bucket[5m])
)
# Rate limit rejections
rate(aicr_rate_limit_rejects_total[5m])

Alerting Rules

# prometheus-rules.yaml
groups:
  - name: aicr_alerts
    interval: 30s
    rules:
      - alert: AICRHighErrorRate
        # Ratio of 5xx responses to all responses, so that the 0.05 threshold
        # and humanizePercentage both read as 5%.
        expr: |
          sum(rate(aicr_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(aicr_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AICR API high error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: AICRHighLatency
        expr: |
          histogram_quantile(0.95,
            rate(aicr_http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AICR API high latency"
          description: "P95 latency is {{ $value }}s"

      - alert: AICRRateLimitHit
        expr: |
          rate(aicr_rate_limit_rejects_total[5m]) > 1
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "AICR API rate limit reached"
          description: "Rate limit rejections: {{ $value }}/s"

Best Practices

1. Caching Recipes

API responses are cacheable (Cache-Control: max-age=300):

import requests
from cachetools import TTLCache

# Cache recipes for 5 minutes (matches the server's max-age=300)
recipe_cache = TTLCache(maxsize=100, ttl=300)

def get_recipe_cached(params):
    # Query parameter values are strings, so a frozenset of the items
    # gives a hashable, order-independent cache key.
    cache_key = frozenset(params.items())

    if cache_key not in recipe_cache:
        response = requests.get('http://localhost:8080/v1/recipe', params=params, timeout=10)
        response.raise_for_status()  # do not cache error responses
        recipe_cache[cache_key] = response.json()

    return recipe_cache[cache_key]
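
Rather than hard-coding the 300-second lifetime, a client can derive it from the response header. The following sketch parses `max-age` out of a `Cache-Control` value with the standard library only; the function name and the handling of `no-store` and `no-cache` as "do not cache" are assumptions of this example, not documented AICR behavior.

```python
from typing import Optional


def cache_ttl(cache_control: Optional[str], default: int = 0) -> int:
    """Seconds a response may be cached according to its Cache-Control header."""
    if not cache_control:
        return default
    directives = [d.strip().lower() for d in cache_control.split(",")]
    if "no-store" in directives or "no-cache" in directives:
        return 0
    for directive in directives:
        name, _, value = directive.partition("=")
        value = value.strip().strip('"')
        if name.strip() == "max-age" and value.isdigit():
            return int(value)
    return default
```

With the header quoted above, `cache_ttl("max-age=300")` returns 300, which could be passed as the `ttl` when constructing the cache.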

2. Error Handling and Retries

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def get_recipe_with_retry(params):
    response = requests.get('http://localhost:8080/v1/recipe', params=params)
    response.raise_for_status()
    return response.json()
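
Where adding a dependency is not an option, the same retry-with-backoff idea can be sketched with the standard library. This is not a reimplementation of tenacity: the helper names, the doubling schedule, and the injectable `sleep` parameter (used so the logic can be exercised without waiting) are choices made for this example.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 10.0) -> List[float]:
    """Delays between attempts: base, 2*base, 4*base, ... each capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts - 1)]


def call_with_retry(fn: Callable[[], T],
                    attempts: int = 3,
                    base: float = 1.0,
                    cap: float = 10.0,
                    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds or `attempts` calls have failed."""
    delays = backoff_delays(attempts, base, cap)
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
    raise ValueError("attempts must be at least 1")
```

Passing a narrower `retry_on`, such as connection and timeout errors only, avoids retrying requests that fail for permanent reasons like invalid parameters.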

3. Parallel Recipe Generation

from concurrent.futures import ThreadPoolExecutor
import requests

def get_recipe(params):
    response = requests.get('http://localhost:8080/v1/recipe', params=params)
    return response.json()

# Generate recipes for multiple environments in parallel
environments = [
    {'os': 'ubuntu', 'accelerator': 'h100', 'service': 'eks'},
    {'os': 'ubuntu', 'accelerator': 'gb200', 'service': 'gke'},
    {'os': 'rhel', 'accelerator': 'a100', 'service': 'aks'},
]

with ThreadPoolExecutor(max_workers=3) as executor:
    recipes = list(executor.map(get_recipe, environments))
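
With `executor.map`, the first exception is raised while the results are collected, so one failing environment hides the results of the others. A possible variant that records failures per environment is sketched below; `fetch_all` is an illustrative name, and the fetch function is passed in so that the example does not depend on a running API server.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, List, Tuple

Params = Dict[str, str]


def fetch_all(fetch: Callable[[Params], Any],
              param_sets: List[Params],
              max_workers: int = 3) -> Tuple[List[Any], List[Tuple[Params, Exception]]]:
    """Run fetch over every parameter set; return (results in input order, failures)."""
    results: List[Any] = [None] * len(param_sets)
    failures: List[Tuple[Params, Exception]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch, p): i for i, p in enumerate(param_sets)}
        for future in as_completed(futures):
            i = futures[future]
            try:
                results[i] = future.result()
            except Exception as exc:  # keep going; report the failure afterwards
                failures.append((param_sets[i], exc))
    return results, failures
```

Calling `fetch_all(get_recipe, environments)` returns the recipes that succeeded, with `None` in the slots of those that failed, alongside the list of failures.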

4. Structured Logging

import logging
import json

# Configure structured logging
logging.basicConfig(
    level=logging.INFO,
    format='%(message)s'
)

def log_recipe_request(params, recipe, duration):
    logging.info(json.dumps({
        'event': 'recipe_generated',
        'params': params,
        'component_refs': len(recipe.get('componentRefs', [])),
        'applied_overlays': len(recipe.get('metadata', {}).get('appliedOverlays', [])),
        'duration_ms': duration * 1000
    }))

5. Snapshot Versioning

#!/bin/bash
# Save snapshots with metadata

CLUSTER="prod-us-east-1"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
OUTPUT="snapshot-${CLUSTER}-${TIMESTAMP}.yaml"

# Capture snapshot from ConfigMap
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > "$OUTPUT"

# Add metadata
cat << EOF > "${OUTPUT}.meta"
cluster: $CLUSTER
timestamp: $TIMESTAMP
git_commit: $(git rev-parse HEAD)
k8s_version: $(kubectl version -o json | jq -r '.serverVersion.gitVersion')
EOF

# Upload to artifact storage
aws s3 cp "$OUTPUT" "s3://my-bucket/snapshots/"
aws s3 cp "${OUTPUT}.meta" "s3://my-bucket/snapshots/"

Security Considerations

Note: The API server does not yet provide built-in authentication (API keys or Bearer tokens). Front it with an ingress, service mesh, or API gateway that enforces authn/authz, and restrict reachability with the network policy below.

Network Policies

Restrict AICR agent network access:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aicr-agent
  namespace: gpu-operator
spec:
  podSelector:
    matchLabels:
      job-name: aicr
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 443 # Kubernetes API

Troubleshooting

Debug API Calls

# Verbose curl
curl -v "http://localhost:8080/v1/recipe?os=ubuntu&accelerator=h100"

# With timing
curl -w "\nTime: %{time_total}s\n" \
  "http://localhost:8080/v1/recipe?os=ubuntu&accelerator=h100"

# Check headers
curl -I "http://localhost:8080/v1/recipe?os=ubuntu&accelerator=h100"

Validate Snapshots

# Check YAML syntax
yamllint snapshot.yaml

# Validate structure
yq eval '.measurements | length' snapshot.yaml

# Check for required measurements
yq eval '.measurements[] | .type' snapshot.yaml | sort -u
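
The same structural check can run inside a Python pipeline step once the snapshot has been parsed into a dictionary (for instance with a YAML library). This sketch relies only on the `measurements` list and per-entry `type` field that the yq queries above read; the helper names are illustrative, and which types count as required is left to the caller because this page does not list them.

```python
from typing import Any, Dict, Iterable, Set


def measurement_types(snapshot: Dict[str, Any]) -> Set[str]:
    """Distinct `type` values found under the top-level `measurements` list."""
    entries = snapshot.get("measurements") or []
    return {m["type"] for m in entries if isinstance(m, dict) and "type" in m}


def missing_types(snapshot: Dict[str, Any], required: Iterable[str]) -> Set[str]:
    """Required measurement types that the snapshot does not contain."""
    return set(required) - measurement_types(snapshot)
```

A non-empty result from `missing_types` is a reasonable signal to fail the job before recipe generation.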

Test Recipe Generation

# Generate and validate
aicr recipe --os ubuntu --accelerator h100 --output recipe.yaml
yamllint recipe.yaml

# Check applied overlays
yq eval '.metadata.appliedOverlays' recipe.yaml

# Extract GPU Operator version from componentRefs
yq eval '.componentRefs[] | select(.name=="gpu-operator") | .version' recipe.yaml
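
For automated checks, the last query can be mirrored in Python against a recipe that has already been parsed into a dictionary. The sketch assumes only the `componentRefs` entries with `name` and `version` fields that the yq expression above selects; `component_version` is an illustrative helper name.

```python
from typing import Any, Dict, Optional


def component_version(recipe: Dict[str, Any], name: str) -> Optional[str]:
    """Version of the named component in `componentRefs`, or None if absent."""
    for ref in recipe.get("componentRefs") or []:
        if ref.get("name") == name:
            return ref.get("version")
    return None
```

A pipeline could compare `component_version(recipe, "gpu-operator")` with the version it expects and stop before bundling on a mismatch.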

See Also