Agent Deployment

Deploy AICR as a Kubernetes Job to automatically capture cluster configuration snapshots.

Overview

The agent is a Kubernetes Job that captures system configuration and writes output to a ConfigMap.

Deployment: Use aicr snapshot to deploy and manage the Job programmatically.

What it does:

  • Runs aicr snapshot --output cm://gpu-operator/aicr-snapshot on a GPU node
  • Writes snapshot to ConfigMap via Kubernetes API (no PersistentVolume required)
  • Exits after snapshot capture

What it does not do:

  • Recipe generation (use aicr recipe CLI or API server)
  • Bundle generation (use aicr bundle CLI)
  • Continuous monitoring (use CronJob for periodic snapshots)
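
For periodic snapshots, the CronJob route mentioned above can be sketched roughly as follows. The schedule, the image tag, and the assumption that the aicr ServiceAccount and RBAC already exist in the namespace (for example, left in place with --no-cleanup) are illustrative, not output of the CLI:

```yaml
# Hypothetical CronJob sketch: capture a snapshot hourly into the same
# ConfigMap. The security context and host mounts the agent needs
# (see Security Considerations) are omitted here for brevity.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: aicr-snapshot
  namespace: gpu-operator
spec:
  schedule: "0 * * * *"   # hourly; adjust as needed
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: aicr
          restartPolicy: Never
          containers:
            - name: aicr
              image: ghcr.io/nvidia/aicr:v0.8.0
              args: ["snapshot", "--output", "cm://gpu-operator/aicr-snapshot"]
```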

Use cases:

  • Cluster auditing and compliance
  • Multi-cluster configuration management
  • Drift detection (compare snapshots over time)
  • CI/CD integration (automated configuration validation)

ConfigMap storage

The agent uses the ConfigMap URI scheme (cm://namespace/name) to write snapshots:

$ aicr snapshot --output cm://gpu-operator/aicr-snapshot

This creates:

apiVersion: v1
kind: ConfigMap
metadata:
  name: aicr-snapshot
  namespace: gpu-operator
  labels:
    app.kubernetes.io/name: aicr
    app.kubernetes.io/component: snapshot
    app.kubernetes.io/version: v0.17.0
data:
  snapshot.yaml: |  # Complete snapshot YAML
    apiVersion: aicr.nvidia.com/v1alpha1
    kind: Snapshot
    measurements: [...]
  format: yaml
  timestamp: "2026-01-03T10:30:00Z"
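
The cm:// form encodes a namespace and a ConfigMap name separated by a slash. A minimal shell sketch of that split (illustrative only, not the CLI's actual implementation):

```shell
# Split a cm:// URI into namespace and ConfigMap name using parameter
# expansion. This mirrors the documented scheme, nothing more.
uri="cm://gpu-operator/aicr-snapshot"
rest="${uri#cm://}"        # strip the scheme  -> gpu-operator/aicr-snapshot
namespace="${rest%%/*}"    # before the first / -> gpu-operator
name="${rest#*/}"          # after the first /  -> aicr-snapshot
echo "$namespace/$name"
```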

Prerequisites

  • Kubernetes cluster with GPU nodes
  • aicr CLI installed
  • GPU Operator installed (or appropriate namespace configured via --namespace)
  • Cluster admin permissions (for RBAC setup)

Quick Start

1. Deploy Agent with Single Command

$ aicr snapshot

This single command:

  1. Creates RBAC resources (ServiceAccount, Role, RoleBinding, ClusterRole, ClusterRoleBinding)
  2. Deploys Job to capture snapshot
  3. Waits for Job completion (5m timeout by default)
  4. Retrieves snapshot from ConfigMap
  5. Writes snapshot to stdout (or specified output)
  6. Cleans up Job and RBAC resources (use --no-cleanup to keep for debugging)

2. View Snapshot Output

The snapshot is written to the specified output:

$ # Output to stdout (default)
$ aicr snapshot
$
$ # Save to file
$ aicr snapshot --output snapshot.yaml
$
$ # Keep in ConfigMap for later use
$ aicr snapshot --output cm://gpu-operator/aicr-snapshot
$
$ # Retrieve from ConfigMap later
$ kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}'

3. Customize Deployment

Target specific nodes and configure scheduling:

$ # Target GPU nodes with specific label
$ aicr snapshot \
> --node-selector accelerator=nvidia-h100
$
$ # Handle tainted nodes (by default all taints are tolerated)
$ # Only needed if you want to restrict which taints are tolerated
$ aicr snapshot \
> --toleration nvidia.com/gpu=present:NoSchedule
$
$ # Full customization
$ aicr snapshot \
> --namespace gpu-operator \
> --image ghcr.io/nvidia/aicr:v0.8.0 \
> --node-selector accelerator=nvidia-h100 \
> --toleration nvidia.com/gpu:NoSchedule \
> --timeout 10m \
> --output cm://gpu-operator/aicr-snapshot

Available flags:

  • --kubeconfig: Custom kubeconfig path (default: ~/.kube/config or $KUBECONFIG)
  • --namespace: Deployment namespace (default: default)
  • --image: Container image (default: ghcr.io/nvidia/aicr:latest)
  • --job-name: Job name (default: aicr)
  • --service-account-name: ServiceAccount name (default: aicr)
  • --node-selector: Node selector (format: key=value, repeatable)
  • --toleration: Toleration (format: key=value:effect, repeatable). Default: all taints are tolerated (uses operator: Exists without key). Only specify this flag if you want to restrict which taints the Job can tolerate.
  • --timeout: Wait timeout (default: 5m)
  • --no-cleanup: Skip removal of Job and RBAC resources on completion. Warning: leaves a cluster-admin ClusterRoleBinding active.

4. Check Agent Logs (Debugging)

If something goes wrong, check Job logs:

$ # Get Job status
$ kubectl get jobs -n gpu-operator
$
$ # View logs
$ kubectl logs -n gpu-operator job/aicr
$
$ # Describe Job for events
$ kubectl describe job aicr -n gpu-operator

Customization

Node Selection

Target specific GPU nodes using --node-selector:

$ aicr snapshot --node-selector nvidia.com/gpu.present=true

Common node selectors:

Selector                                            Purpose
nvidia.com/gpu.present=true                         Any node with GPU
nodeGroup=gpu-nodes                                 Specific node pool (EKS/GKE)
node.kubernetes.io/instance-type=p4d.24xlarge       AWS instance type
cloud.google.com/gke-accelerator=nvidia-tesla-h100  GKE GPU type

Tolerations

By default, the agent Job tolerates all taints using the universal toleration (operator: Exists without a key). Only specify --toleration flags to restrict which taints are tolerated.

Common tolerations:

Taint Key        Effect      Purpose
nvidia.com/gpu   NoSchedule  GPU Operator default
dedicated        NoSchedule  Dedicated GPU nodes
workload         NoSchedule  Workload-specific nodes
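
As a sketch of what this means in the generated pod spec (field layout follows standard Kubernetes toleration semantics, not the CLI's verbatim output):

```yaml
# Default: `operator: Exists` with no key tolerates every taint.
tolerations:
  - operator: Exists
---
# With --toleration nvidia.com/gpu=present:NoSchedule the entry narrows to:
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
```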

Image Version

Pin to a specific version:

$ aicr snapshot --image ghcr.io/nvidia/aicr:v0.8.0

Finding versions: list the published tags for ghcr.io/nvidia/aicr (for example, on the package's GitHub Container Registry page).

Post-Deployment

Retrieve Snapshot

$ # View snapshot from ConfigMap
$ kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}'
$
$ # Save to file
$ kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > snapshot-$(date +%Y%m%d).yaml

Generate Recipe from Snapshot

$ # Use ConfigMap directly (no file needed)
$ aicr recipe --snapshot cm://gpu-operator/aicr-snapshot --intent training --platform kubeflow --output recipe.yaml
$
$ # Generate bundle
$ aicr bundle --recipe recipe.yaml --output ./bundles

Complete Workflow

$ # Step 1: Capture snapshot to ConfigMap
$ aicr snapshot --output cm://gpu-operator/aicr-snapshot
$
$ # Step 2: Generate recipe from ConfigMap
$ aicr recipe \
> --snapshot cm://gpu-operator/aicr-snapshot \
> --intent training \
> --platform kubeflow \
> --output recipe.yaml
$
$ # Step 3: Create deployment bundle
$ aicr bundle \
> --recipe recipe.yaml \
> --output ./bundles
$
$ # Step 4: Deploy to cluster
$ cd bundles && chmod +x deploy.sh && ./deploy.sh
$
$ # Step 5: Verify deployment
$ kubectl get pods -n gpu-operator
$ kubectl logs -n gpu-operator -l app=nvidia-operator-validator

Integration Patterns

CI/CD Pipeline

# GitHub Actions example
- name: Capture snapshot using agent
  run: |
    aicr snapshot \
      --namespace gpu-operator \
      --output cm://gpu-operator/aicr-snapshot \
      --timeout 10m

- name: Generate recipe from ConfigMap
  run: |
    aicr recipe \
      --snapshot cm://gpu-operator/aicr-snapshot \
      --intent training \
      --output recipe.yaml

- name: Generate bundle
  run: |
    aicr bundle -r recipe.yaml -o ./bundles

- name: Upload artifacts
  uses: actions/upload-artifact@v4
  with:
    name: cluster-config
    path: |
      recipe.yaml
      bundles/

Multi-Cluster Auditing

#!/bin/bash
# Capture snapshots from multiple clusters

clusters=("prod-us-east" "prod-eu-west" "staging")

for cluster in "${clusters[@]}"; do
  echo "Capturing snapshot from $cluster..."

  # Switch context
  kubectl config use-context "$cluster"

  # Deploy agent and capture snapshot
  aicr snapshot \
    --namespace gpu-operator \
    --output "snapshot-${cluster}.yaml" \
    --timeout 10m
done
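
Once one file per cluster exists, a quick field-level summary can be sketched like this. The file contents and the "driver" key are stand-ins for real snapshot output, used so the sketch is self-contained:

```shell
# Sketch: summarize one field across per-cluster snapshot files.
# These two files are illustrative stand-ins for real output from
# `aicr snapshot --output snapshot-<cluster>.yaml`.
printf 'driver: "550.54"\n' > snapshot-prod-us-east.yaml
printf 'driver: "550.90"\n' > snapshot-staging.yaml

for f in snapshot-*.yaml; do
  cluster="${f#snapshot-}"        # strip the filename prefix
  cluster="${cluster%.yaml}"      # ...and the extension
  printf '%-16s %s\n' "$cluster" "$(grep -m1 '^driver:' "$f")"
done
```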

Drift Detection

#!/bin/bash
# Compare current snapshot with baseline

# Baseline (first snapshot)
aicr snapshot --output baseline.yaml

# Current (later snapshot)
aicr snapshot --output current.yaml

# Compare
diff baseline.yaml current.yaml || echo "Configuration drift detected!"
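
To surface only the lines that changed, the diff can be post-processed. The snapshot contents below are stand-ins for real aicr snapshot output, so the sketch runs on its own:

```shell
# Sketch: report only changed lines between two snapshots.
# File contents are illustrative, not real agent output.
cat > baseline.yaml <<'EOF'
driver: "550.54"
cuda: "12.4"
EOF
cat > current.yaml <<'EOF'
driver: "550.90"
cuda: "12.4"
EOF

if diff -u baseline.yaml current.yaml > drift.patch; then
  echo "No drift detected"
else
  echo "Configuration drift detected:"
  grep '^[+-][^+-]' drift.patch   # changed lines only; skips ---/+++ headers
fi
```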

Troubleshooting

Job Fails to Start

Check RBAC permissions:

$ kubectl auth can-i get nodes --as=system:serviceaccount:gpu-operator:aicr
$ kubectl auth can-i get pods --as=system:serviceaccount:gpu-operator:aicr

Job Pending

Check node selectors and tolerations:

$ # View pod events
$ kubectl describe pod -n gpu-operator -l job-name=aicr
$
$ # Check node labels
$ kubectl get nodes --show-labels
$
$ # Check node taints
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

Job Completes but No Output

Check ConfigMap and container logs:

$ # Check if ConfigMap was created
$ kubectl get configmap aicr-snapshot -n gpu-operator
$
$ # View ConfigMap contents
$ kubectl get configmap aicr-snapshot -n gpu-operator -o yaml
$
$ # View pod logs for errors
$ kubectl logs -n gpu-operator -l job-name=aicr

Permission Denied

Ensure RBAC is correctly deployed:

$ # Verify ClusterRole
$ kubectl get clusterrole aicr-node-reader
$
$ # Verify ClusterRoleBinding
$ kubectl get clusterrolebinding aicr-node-reader
$
$ # Verify Role and RoleBinding
$ kubectl get role aicr -n gpu-operator
$ kubectl get rolebinding aicr -n gpu-operator
$
$ # Verify ServiceAccount
$ kubectl get serviceaccount aicr -n gpu-operator

Security Considerations

RBAC Permissions

The agent requires these permissions (created automatically by the CLI):

  • ClusterRole (aicr-node-reader): Read access to nodes, pods, and ClusterPolicy CRDs (nvidia.com)
  • Role (aicr): Create/update ConfigMaps and list pods in the deployment namespace

Pod Security Context

The agent requires elevated privileges to collect system configuration from the host:

  • hostPID, hostNetwork, hostIPC: Required to read host system configuration
  • privileged + SYS_ADMIN: Required to access GPU configuration and kernel parameters
  • /run/systemd mount: Required to query systemd service states
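
Assembled from the bullets above, the security-relevant portion of the pod spec looks roughly like this (a sketch, not the CLI's exact manifest):

```yaml
# Sketch: host namespaces, privileged container, and the /run/systemd mount.
spec:
  hostPID: true
  hostNetwork: true
  hostIPC: true
  containers:
    - name: aicr
      securityContext:
        privileged: true
        capabilities:
          add: ["SYS_ADMIN"]
      volumeMounts:
        - name: run-systemd
          mountPath: /run/systemd
          readOnly: true
  volumes:
    - name: run-systemd
      hostPath:
        path: /run/systemd
```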

See Also