Automation and CI/CD Integration

Integration patterns for using AICR in automated pipelines.

Overview

Typical integration workflows:

  1. Snapshot capture: Deploy agent Job to capture cluster configuration
  2. Recipe generation: Generate configuration recommendations from snapshot or query parameters
  3. Bundle creation: Create deployment artifacts (Helm values, manifests, scripts)
  4. Deployment: Apply generated configuration to cluster
  5. Validation: Verify deployment using test workloads

Supported CI/CD platforms: GitHub Actions, GitLab CI, Jenkins, Argo Workflows, Tekton
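The five-step flow above can be sketched as a small driver that shells out to the aicr CLI. The helper below is a hypothetical wrapper (not part of AICR) that only assembles the argument lists for the snapshot, recipe, and bundle stages, using the flags shown in the examples later on this page:

```python
import subprocess

def pipeline_commands(snapshot="snapshot.yaml", recipe="recipe.yaml",
                      bundles="./bundles", intent="training"):
    """Build the aicr invocations for snapshot -> recipe -> bundle."""
    return [
        ["aicr", "snapshot", "--output", snapshot, "--timeout", "300s"],
        ["aicr", "recipe", "--snapshot", snapshot,
         "--intent", intent, "--output", recipe],
        ["aicr", "bundle", "--recipe", recipe, "--output", bundles],
    ]

def run_pipeline(**kwargs):
    # Stop on the first failing stage; deployment and validation
    # are left to the platform-specific patterns below.
    for cmd in pipeline_commands(**kwargs):
        subprocess.run(cmd, check=True)
```

Deployment and validation (steps 4 and 5) are intentionally omitted here because they are platform-specific; the patterns below show them per CI system.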

Integration Patterns

Pattern 1: Configuration Snapshot + Drift Detection

Periodically capture snapshots and compare them against a baseline.

Use case: Detect unauthorized configuration changes

```yaml
# GitHub Actions
name: Configuration Drift Detection
on:
  schedule:
    - cron: '0 */6 * * *' # Every 6 hours

jobs:
  snapshot:
    runs-on: ubuntu-latest
    steps:
      - name: Configure kubectl
        uses: azure/k8s-set-context@v4
        with:
          kubeconfig: ${{ secrets.KUBECONFIG }}

      - name: Deploy AICR Agent
        run: |
          aicr snapshot --output cm://gpu-operator/aicr-snapshot --timeout 300s

      - name: Wait for completion
        run: |
          kubectl wait --for=condition=complete --timeout=300s job/aicr -n gpu-operator

      - name: Capture snapshot from ConfigMap
        run: |
          kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > snapshot-$(date +%Y%m%d-%H%M%S).yaml

      - name: Compare with baseline
        run: |
          # Download baseline
          curl -O https://your-artifacts/baseline.yaml

          # Compare
          if ! diff -q baseline.yaml snapshot-*.yaml; then
            echo "::error::Configuration drift detected"
            diff baseline.yaml snapshot-*.yaml
            exit 1
          fi

      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: cluster-snapshots
          path: snapshot-*.yaml
```
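A plain diff is sensitive to key ordering and whitespace, so the comparison step above can report drift that isn't real. Comparing the parsed structures avoids this. A minimal sketch, assuming both snapshots have already been parsed into nested dictionaries (for example with PyYAML):

```python
def dict_drift(baseline, current, path=""):
    """Return a list of (path, baseline_value, current_value) differences
    between two nested dicts, ignoring key order."""
    diffs = []
    for key in sorted(set(baseline) | set(current)):
        here = f"{path}.{key}" if path else str(key)
        if key not in baseline:
            diffs.append((here, None, current[key]))       # added key
        elif key not in current:
            diffs.append((here, baseline[key], None))      # removed key
        elif isinstance(baseline[key], dict) and isinstance(current[key], dict):
            diffs.extend(dict_drift(baseline[key], current[key], here))
        elif baseline[key] != current[key]:
            diffs.append((here, baseline[key], current[key]))
    return diffs
```

A drift-detection step would exit non-zero whenever the returned list is non-empty, mirroring the `diff` step in the workflow.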

Pattern 2: Recipe-Based Deployment

Generate an optimized configuration and deploy operators.

Use case: Deploy GPU Operator with environment-specific settings

```yaml
# GitLab CI
stages:
  - snapshot
  - recipe
  - bundle
  - deploy

capture_snapshot:
  stage: snapshot
  image: bitnami/kubectl:latest
  script:
    - aicr snapshot --output snapshot.yaml --timeout 300s
  artifacts:
    paths:
      - snapshot.yaml

generate_recipe:
  stage: recipe
  image: ghcr.io/nvidia/aicr:latest
  script:
    # Option 1: Use ConfigMap directly (no artifact needed)
    - aicr recipe -s cm://gpu-operator/aicr-snapshot --intent training --platform kubeflow -o recipe.yaml
    # Option 2: Use snapshot file from previous stage
    # - aicr recipe --snapshot snapshot.yaml --intent training --platform kubeflow --output recipe.yaml
  artifacts:
    paths:
      - recipe.yaml
  dependencies:
    - capture_snapshot

create_bundle:
  stage: bundle
  image: ghcr.io/nvidia/aicr:latest
  script:
    - aicr bundle --recipe recipe.yaml --output ./bundles
    # Override values at bundle generation time
    # - aicr bundle -r recipe.yaml --set gpuoperator:gds.enabled=true -o ./bundles
  artifacts:
    paths:
      - bundles/
  dependencies:
    - generate_recipe

deploy_operators:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - cd bundles
    - sha256sum -c checksums.txt
    - chmod +x deploy.sh
    - ./deploy.sh
  dependencies:
    - create_bundle
  when: manual
```

Pattern 3: API-Driven Recipe Generation

Use the API for recipe generation without installing the CLI.

Use case: Lightweight recipe generation in containers

```yaml
# CircleCI
version: 2.1

jobs:
  generate_recipe:
    docker:
      - image: cimg/base:2025.01
    steps:
      - run:
          name: Generate recipe via API
          command: |
            # Detect environment
            OS="ubuntu"
            ACCELERATOR="h100"
            SERVICE="eks"

            # Generate recipe
            curl -s "http://localhost:8080/v1/recipe?os=${OS}&accelerator=${ACCELERATOR}&service=${SERVICE}&intent=training" \
              -o recipe.json

            # Validate
            jq -e '.measurements | length > 0' recipe.json

      - persist_to_workspace:
          root: .
          paths:
            - recipe.json

  extract_versions:
    docker:
      - image: cimg/base:2025.01
    steps:
      - attach_workspace:
          at: .

      - run:
          name: Extract component versions
          command: |
            # GPU Operator version from componentRefs
            GPU_OP_VERSION=$(jq -r '.componentRefs[] |
              select(.name=="gpu-operator") | .version' recipe.json)

            echo "GPU Operator: $GPU_OP_VERSION"

            # Save for deployment
            echo "export GPU_OP_VERSION=$GPU_OP_VERSION" >> $BASH_ENV

workflows:
  deploy:
    jobs:
      - generate_recipe
      - extract_versions:
          requires:
            - generate_recipe
```

Pattern 4: Multi-Cluster Management

Deploy consistent configurations across multiple clusters.

Use case: Multi-region GPU clusters with unified configuration

```bash
#!/bin/bash
# multi-cluster-deploy.sh

# Define clusters
CLUSTERS=(
  "prod-us-east-1:eks:h100"
  "prod-eu-west-1:eks:h100"
  "staging-us-west-2:eks:gb200"
)

# Iterate clusters
for cluster_config in "${CLUSTERS[@]}"; do
  IFS=":" read -r CLUSTER SERVICE GPU <<< "$cluster_config"

  echo "Processing cluster: $CLUSTER"

  # Switch context
  kubectl config use-context "$CLUSTER"

  # Capture snapshot
  aicr snapshot --output "snapshot-${CLUSTER}.yaml" --timeout 300s

  # Generate recipe (can use ConfigMap directly or file)
  # Option 1: Use ConfigMap
  aicr recipe -s "cm://gpu-operator/aicr-snapshot" --intent training --platform kubeflow -o "recipe-${CLUSTER}.yaml"
  # Option 2: Use saved file
  # aicr recipe --snapshot "snapshot-${CLUSTER}.yaml" --intent training --platform kubeflow --output "recipe-${CLUSTER}.yaml"

  # Create bundle
  aicr bundle \
    --recipe "recipe-${CLUSTER}.yaml" \
    --output "./bundles/${CLUSTER}"

  # Or with value overrides for environment-specific customization
  # aicr bundle \
  #   --recipe "recipe-${CLUSTER}.yaml" \
  #   --set gpuoperator:gds.enabled=true \
  #   --set gpuoperator:mig.strategy=mixed \
  #   --output "./bundles/${CLUSTER}"

  # Deploy (with approval)
  echo "Deploy to $CLUSTER? [y/N]"
  read -r response
  if [[ "$response" =~ ^[Yy]$ ]]; then
    cd "bundles/${CLUSTER}"
    chmod +x deploy.sh && ./deploy.sh
    cd -
  fi

  # Clean up
  kubectl delete job aicr -n gpu-operator
done
```

Pattern 5: GitOps Deployment with Argo CD

Use Argo CD for declarative, GitOps-based deployments with automatic sync-wave ordering.

Use case: Automated deployment pipeline with Argo CD

```yaml
# GitHub Actions
name: GitOps Deploy with Argo CD
on:
  push:
    branches: [main]

jobs:
  generate-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup aicr
        run: |
          curl -sLO https://github.com/nvidia/aicr/releases/latest/download/aicr_linux_amd64.tar.gz
          tar -xzf aicr_linux_amd64.tar.gz
          sudo mv aicr /usr/local/bin/

      - name: Generate recipe
        run: |
          aicr recipe \
            --service eks \
            --accelerator h100 \
            --intent training \
            --os ubuntu \
            --output recipe.yaml

      - name: Generate Argo CD bundles
        run: |
          aicr bundle \
            --recipe recipe.yaml \
            --deployer argocd \
            --repo https://github.com/${{ github.repository }}.git \
            --output ./bundles

      - name: Commit to GitOps repo
        run: |
          # Copy entire bundle to GitOps repository
          # Argo CD apps are in <component>/argocd/ directories
          # app-of-apps.yaml is at bundle root
          cp -r bundles/* gitops-repo/

          cd gitops-repo
          git add .
          git commit -m "Update GPU stack components"
          git push
```

Generated Argo CD Application with multi-source:

```yaml
# bundles/gpu-operator/argocd/application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1" # Deployed after cert-manager (wave 0)
spec:
  project: default
  sources:
    # Helm chart from upstream
    - repoURL: https://helm.ngc.nvidia.com/nvidia
      chart: gpu-operator
      targetRevision: v25.3.3
      helm:
        valueFiles:
          - $values/gpu-operator/values.yaml
    # Values from GitOps repo
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      ref: values
    # Additional manifests (ClusterPolicy, etc.)
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: gpu-operator/manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

Pattern 6: Multi-Environment GitOps

Deploy to multiple environments with environment-specific deployers.

```bash
#!/bin/bash
# multi-env-gitops.sh

ENVIRONMENTS=(
  "staging:helm"      # Staging uses Helm per-component bundle
  "production:argocd" # Production uses Argo CD
)

for env_config in "${ENVIRONMENTS[@]}"; do
  IFS=":" read -r ENV DEPLOYER <<< "$env_config"

  echo "Generating bundles for $ENV with $DEPLOYER deployer..."

  aicr bundle \
    --recipe "recipes/${ENV}.yaml" \
    --deployer "$DEPLOYER" \
    --output "./bundles/${ENV}"

  echo "Generated $DEPLOYER bundles in ./bundles/${ENV}/"
done
```

Terraform Integration

Module: AICR Agent Deployment

```hcl
# modules/aicr-agent/main.tf

# Deploy agent and capture snapshot using CLI
resource "null_resource" "capture_snapshot" {
  provisioner "local-exec" {
    command = <<-EOT
      aicr snapshot \
        --output ${var.snapshot_output} \
        --timeout 300s
    EOT
  }
}

# Generate recipe (can use ConfigMap directly)
resource "null_resource" "generate_recipe" {
  provisioner "local-exec" {
    command = <<-EOT
      aicr recipe \
        -s cm://gpu-operator/aicr-snapshot \
        --intent ${var.workload_intent} \
        -o ${var.recipe_output}
    EOT
  }

  depends_on = [null_resource.capture_snapshot]
}

# variables.tf
variable "node_selector" {
  description = "Node selector for agent pod"
  type        = map(string)
  default     = { "nvidia.com/gpu.present" = "true" }
}

variable "tolerations" {
  description = "Tolerations for agent pod"
  type = list(object({
    key    = string
    value  = string
    effect = string
  }))
  default = []
}

variable "image_version" {
  description = "AICR image version"
  type        = string
  default     = "latest"
}

variable "snapshot_output" {
  description = "Path to save snapshot"
  type        = string
  default     = "snapshot.yaml"
}

variable "recipe_output" {
  description = "Path to save recipe"
  type        = string
  default     = "recipe.yaml"
}

variable "workload_intent" {
  description = "Workload intent: training or inference"
  type        = string
  default     = "training"
}

# outputs.tf
output "snapshot_file" {
  value = var.snapshot_output
}

output "recipe_file" {
  value = var.recipe_output
}
```

Usage:

```hcl
# main.tf
module "aicr_agent" {
  source = "./modules/aicr-agent"

  node_selector = {
    "nodeGroup" = "gpu-nodes"
  }

  tolerations = [{
    key    = "nvidia.com/gpu"
    value  = ""
    effect = "NoSchedule"
  }]

  workload_intent = "training"
  snapshot_output = "cluster-${var.environment}-snapshot.yaml"
  recipe_output   = "cluster-${var.environment}-recipe.yaml"
}
```

Kubernetes Operators

Custom Operator: Configuration Drift Watcher

```go
// Watch for configuration changes and reconcile
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/kubernetes"
	ctrl "sigs.k8s.io/controller-runtime"
)

type ConfigReconciler struct {
	Client    kubernetes.Interface
	Namespace string
}

func (r *ConfigReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Deploy AICR agent
	if err := r.deployAgent(ctx); err != nil {
		return ctrl.Result{}, err
	}

	// 2. Wait for completion
	if err := r.waitForJob(ctx); err != nil {
		return ctrl.Result{}, err
	}

	// 3. Retrieve snapshot
	snapshot, err := r.getSnapshot(ctx)
	if err != nil {
		return ctrl.Result{}, err
	}

	// 4. Compare with baseline
	if r.hasConfigDrift(snapshot) {
		// Alert or auto-remediate
		fmt.Println("Configuration drift detected!")
	}

	// 5. Clean up
	if err := r.cleanupAgent(ctx); err != nil {
		return ctrl.Result{}, err
	}

	// Requeue after 6 hours
	return ctrl.Result{RequeueAfter: 6 * time.Hour}, nil
}

func (r *ConfigReconciler) deployAgent(ctx context.Context) error {
	// Apply RBAC and Job manifests
	return nil
}

func (r *ConfigReconciler) waitForJob(ctx context.Context) error {
	// Wait for job completion with timeout
	return nil
}

func (r *ConfigReconciler) getSnapshot(ctx context.Context) (string, error) {
	// Retrieve snapshot from ConfigMap
	return "", nil
}

func (r *ConfigReconciler) hasConfigDrift(snapshot string) bool {
	// Compare with baseline
	return false
}

func (r *ConfigReconciler) cleanupAgent(ctx context.Context) error {
	// Delete job
	return nil
}
```

Monitoring and Alerting

Prometheus Metrics

Scrape the AICR API server:

```yaml
# prometheus-config.yaml
scrape_configs:
  - job_name: 'aicrd'
    static_configs:
      - targets: ['aicrd.default.svc.cluster.local:8080']
    metrics_path: /metrics
```

Key metrics:

```promql
# Request rate
rate(aicr_http_requests_total[5m])

# Error rate
rate(aicr_http_requests_total{status=~"5.."}[5m])

# Latency (p95)
histogram_quantile(0.95,
  rate(aicr_http_request_duration_seconds_bucket[5m])
)

# Rate limit rejections
rate(aicr_rate_limit_rejects_total[5m])
```

Alerting Rules

```yaml
# prometheus-rules.yaml
groups:
  - name: aicr_alerts
    interval: 30s
    rules:
      - alert: AICRHighErrorRate
        expr: |
          rate(aicr_http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AICR API high error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: AICRHighLatency
        expr: |
          histogram_quantile(0.95,
            rate(aicr_http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AICR API high latency"
          description: "P95 latency is {{ $value }}s"

      - alert: AICRRateLimitHit
        expr: |
          rate(aicr_rate_limit_rejects_total[5m]) > 1
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "AICR API rate limit reached"
          description: "Rate limit rejections: {{ $value }}/s"
```

Best Practices

1. Caching Recipes

API responses are cacheable (Cache-Control: max-age=300):

```python
import requests
from cachetools import TTLCache

# Cache recipes for 5 minutes
recipe_cache = TTLCache(maxsize=100, ttl=300)

def get_recipe_cached(params):
    cache_key = frozenset(params.items())

    if cache_key not in recipe_cache:
        response = requests.get('http://localhost:8080/v1/recipe', params=params)
        recipe_cache[cache_key] = response.json()

    return recipe_cache[cache_key]
```

2. Error Handling and Retries

```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def get_recipe_with_retry(params):
    response = requests.get('http://localhost:8080/v1/recipe', params=params)
    response.raise_for_status()
    return response.json()
```

3. Parallel Recipe Generation

```python
from concurrent.futures import ThreadPoolExecutor
import requests

def get_recipe(params):
    response = requests.get('http://localhost:8080/v1/recipe', params=params)
    return response.json()

# Generate recipes for multiple environments in parallel
environments = [
    {'os': 'ubuntu', 'accelerator': 'h100', 'service': 'eks'},
    {'os': 'ubuntu', 'accelerator': 'gb200', 'service': 'gke'},
    {'os': 'rhel', 'accelerator': 'a100', 'service': 'aks'},
]

with ThreadPoolExecutor(max_workers=3) as executor:
    recipes = list(executor.map(get_recipe, environments))
```

4. Structured Logging

```python
import logging
import json

# Configure structured logging
logging.basicConfig(
    level=logging.INFO,
    format='%(message)s'
)

def log_recipe_request(params, recipe, duration):
    logging.info(json.dumps({
        'event': 'recipe_generated',
        'params': params,
        'component_refs': len(recipe.get('componentRefs', [])),
        'applied_overlays': len(recipe.get('metadata', {}).get('appliedOverlays', [])),
        'duration_ms': duration * 1000
    }))
```

5. Snapshot Versioning

```bash
#!/bin/bash
# Save snapshots with metadata

CLUSTER="prod-us-east-1"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
OUTPUT="snapshot-${CLUSTER}-${TIMESTAMP}.yaml"

# Capture snapshot from ConfigMap
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > "$OUTPUT"

# Add metadata
cat << EOF > "${OUTPUT}.meta"
cluster: $CLUSTER
timestamp: $TIMESTAMP
git_commit: $(git rev-parse HEAD)
k8s_version: $(kubectl version -o json | jq -r '.serverVersion.gitVersion')
EOF

# Upload to artifact storage
aws s3 cp "$OUTPUT" "s3://my-bucket/snapshots/"
aws s3 cp "${OUTPUT}.meta" "s3://my-bucket/snapshots/"
```

Security Considerations

API Key Management (Future)

```python
import os
import uuid

import requests

API_KEY = os.environ.get('AICR_API_KEY')

headers = {
    'Authorization': f'Bearer {API_KEY}',
    'X-Request-Id': str(uuid.uuid4())
}

response = requests.get(
    'http://localhost:8080/v1/recipe',
    params={'os': 'ubuntu', 'accelerator': 'h100'},
    headers=headers
)
```

Network Policies

Restrict AICR agent network access:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aicr-agent
  namespace: gpu-operator
spec:
  podSelector:
    matchLabels:
      job-name: aicr
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 443 # Kubernetes API
```

Secrets Management

```yaml
# kubernetes-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: aicr-credentials
  namespace: gpu-operator
type: Opaque
stringData:
  api-key: your-api-key-here
```

```yaml
# Reference in pod
env:
  - name: AICR_API_KEY
    valueFrom:
      secretKeyRef:
        name: aicr-credentials
        key: api-key
```

Troubleshooting

Debug API Calls

```bash
# Verbose curl
curl -v "http://localhost:8080/v1/recipe?os=ubuntu&accelerator=h100"

# With timing
curl -w "\nTime: %{time_total}s\n" \
  "http://localhost:8080/v1/recipe?os=ubuntu&accelerator=h100"

# Check headers
curl -I "http://localhost:8080/v1/recipe?os=ubuntu&accelerator=h100"
```

Validate Snapshots

```bash
# Check YAML syntax
yamllint snapshot.yaml

# Validate structure
yq eval '.measurements | length' snapshot.yaml

# Check for required measurements
yq eval '.measurements[] | .type' snapshot.yaml | sort -u
```

Test Recipe Generation

```bash
# Generate and validate
aicr recipe --os ubuntu --accelerator h100 --output recipe.yaml
yamllint recipe.yaml

# Check applied overlays
yq eval '.metadata.appliedOverlays' recipe.yaml

# Extract GPU Operator version from componentRefs
yq eval '.componentRefs[] | select(.name=="gpu-operator") | .version' recipe.yaml
```

See Also