Agent Deployment
Deploy AICR as a Kubernetes Job to automatically capture cluster configuration snapshots.
Overview
The agent is a Kubernetes Job that captures system configuration and writes output to a ConfigMap.
Deployment: Use `aicr snapshot` to deploy and manage the Job programmatically.
What it does:
- Runs `aicr snapshot --output cm://gpu-operator/aicr-snapshot` on a GPU node
- Writes the snapshot to a ConfigMap via the Kubernetes API (no PersistentVolume required)
- Exits after snapshot capture
What it does not do:
- Recipe generation (use the `aicr recipe` CLI or the API server)
- Bundle generation (use the `aicr bundle` CLI)
- Continuous monitoring (use a CronJob for periodic snapshots)
Use cases:
- Cluster auditing and compliance
- Multi-cluster configuration management
- Drift detection (compare snapshots over time)
- CI/CD integration (automated configuration validation)
ConfigMap storage
The agent uses the ConfigMap URI scheme (`cm://namespace/name`) to write snapshots; the target ConfigMap is created in the given namespace if it does not already exist.
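A minimal invocation and a check of the resulting ConfigMap (the namespace and name follow the defaults used elsewhere on this page):

```shell
# Capture a snapshot and store it in a ConfigMap
aicr snapshot --output cm://gpu-operator/aicr-snapshot

# Verify the ConfigMap the agent created
kubectl get configmap aicr-snapshot -n gpu-operator
```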
Prerequisites
- Kubernetes cluster with GPU nodes
- aicr CLI installed
- GPU Operator installed (or an appropriate namespace configured via `--namespace`)
- Cluster admin permissions (for RBAC setup)
Quick Start
1. Deploy the Agent with a Single Command
This single command:
- Creates RBAC resources (ServiceAccount, Role, RoleBinding, ClusterRole, ClusterRoleBinding)
- Deploys Job to capture snapshot
- Waits for Job completion (5m timeout by default)
- Retrieves snapshot from ConfigMap
- Writes snapshot to stdout (or specified output)
- Cleans up Job and RBAC resources (use `--no-cleanup` to keep them for debugging)
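Assuming the defaults described below, the one-shot deployment can look like this (the namespace and ConfigMap name are illustrative):

```shell
# Deploy the Job, wait for completion, print the snapshot, then clean up
aicr snapshot \
  --namespace gpu-operator \
  --output cm://gpu-operator/aicr-snapshot
```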
2. View Snapshot Output
The snapshot is written to the specified output:
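For example, to read the snapshot back out of the ConfigMap directly (the data key layout is an assumption; inspect the ConfigMap to confirm):

```shell
# Print the snapshot data stored in the ConfigMap
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data}'
```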
3. Customize Deployment
Target specific nodes and configure scheduling:
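A sketch using the flags documented below (the label key and taint are illustrative; verify against your cluster):

```shell
aicr snapshot \
  --node-selector nvidia.com/gpu.present=true \
  --toleration nvidia.com/gpu=present:NoSchedule \
  --timeout 10m \
  --output cm://gpu-operator/aicr-snapshot
```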
Available flags:
- `--kubeconfig`: Custom kubeconfig path (default: `~/.kube/config` or `$KUBECONFIG`)
- `--namespace`: Deployment namespace (default: `default`)
- `--image`: Container image (default: `ghcr.io/nvidia/aicr:latest`)
- `--job-name`: Job name (default: `aicr`)
- `--service-account-name`: ServiceAccount name (default: `aicr`)
- `--node-selector`: Node selector (format: `key=value`, repeatable)
- `--toleration`: Toleration (format: `key=value:effect`, repeatable). Default: all taints are tolerated (uses `operator: Exists` without a key). Only specify this flag if you want to restrict which taints the Job can tolerate.
- `--timeout`: Wait timeout (default: `5m`)
- `--no-cleanup`: Skip removal of Job and RBAC resources on completion. Warning: leaves a cluster-admin ClusterRoleBinding active.
4. Check Agent Logs (Debugging)
If something goes wrong, check Job logs:
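With the default names (`aicr` in the `default` namespace; pass `--no-cleanup` so the Job survives completion):

```shell
# Inspect the Job's pod logs and events
kubectl logs job/aicr -n default
kubectl describe job aicr -n default
```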
Customization
Node Selection
Target specific GPU nodes using `--node-selector`:
Common node selectors:
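Label keys vary by cluster; these are typical examples (verify with `kubectl get nodes --show-labels`):

```shell
# Nodes labeled by GPU Feature Discovery / GPU Operator
--node-selector nvidia.com/gpu.present=true

# A specific node by hostname
--node-selector kubernetes.io/hostname=gpu-node-1
```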
Tolerations
By default, the agent Job tolerates all taints using the universal toleration (`operator: Exists` without a key). Only specify `--toleration` flags to restrict which taints are tolerated.
Common tolerations:
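Using the `key=value:effect` format described above (taint names are illustrative; check `kubectl describe node` for the taints actually set on your GPU nodes):

```shell
# Tolerate a GPU taint applied by the GPU Operator
--toleration nvidia.com/gpu=present:NoSchedule

# Tolerate a dedicated-pool taint
--toleration dedicated=gpu:NoSchedule
```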
Image Version
Pin to a specific version:
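For example (the tag shown is a placeholder; pick a real tag from the sources listed below):

```shell
aicr snapshot --image ghcr.io/nvidia/aicr:v0.1.0
```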
Finding versions:
- GitHub Releases
- Container registry: ghcr.io/nvidia/aicr
Post-Deployment
Retrieve Snapshot
Generate Recipe from Snapshot
Complete Workflow
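A possible end-to-end sequence under the defaults above. The `aicr recipe` and `aicr bundle` argument shapes are assumptions; consult the CLI Reference for the exact invocations:

```shell
# 1. Capture a snapshot from the cluster and save it locally
aicr snapshot --namespace gpu-operator \
  --output cm://gpu-operator/aicr-snapshot > snapshot.json

# 2. Generate a recipe from the snapshot (argument shape assumed)
aicr recipe snapshot.json > recipe.yaml

# 3. Optionally build a bundle from the recipe (argument shape assumed)
aicr bundle recipe.yaml
```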
Integration Patterns
CI/CD Pipeline
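One shape this can take, sketched as a GitHub Actions job (workflow structure, secret names, and cluster-access setup are assumptions, not part of the product):

```yaml
jobs:
  cluster-audit:
    runs-on: ubuntu-latest
    steps:
      - name: Capture cluster snapshot
        run: |
          aicr snapshot --namespace gpu-operator \
            --output cm://gpu-operator/aicr-snapshot > snapshot.json
      - name: Archive snapshot for review
        uses: actions/upload-artifact@v4
        with:
          name: cluster-snapshot
          path: snapshot.json
```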
Multi-Cluster Auditing
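Since `--kubeconfig` selects the target cluster, auditing many clusters is a loop over kubeconfig files (paths are illustrative):

```shell
# Capture one snapshot per cluster
for kc in ~/.kube/prod.yaml ~/.kube/staging.yaml; do
  name=$(basename "$kc" .yaml)
  aicr snapshot --kubeconfig "$kc" \
    --output cm://gpu-operator/aicr-snapshot > "snapshot-$name.json"
done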
Drift Detection
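Drift detection reduces to diffing snapshots captured at different times. A self-contained sketch with stand-in snapshot files (in practice these come from `aicr snapshot` runs):

```shell
# Stand-in snapshots; field names are illustrative
printf '{"driver": "535.104.05"}\n' > snapshot-old.json
printf '{"driver": "550.54.14"}\n' > snapshot-new.json

# diff exits non-zero when the files differ, i.e. configuration drifted
if ! diff -u snapshot-old.json snapshot-new.json > drift.patch; then
  echo "drift detected"
fi
```

The saved `drift.patch` records exactly which fields changed between captures.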
Troubleshooting
Job Fails to Start
Check RBAC permissions:
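Using the default ServiceAccount name (`aicr`) and namespace (`default`):

```shell
# Verify the agent's ServiceAccount can write ConfigMaps and read nodes
kubectl auth can-i create configmaps -n default \
  --as system:serviceaccount:default:aicr
kubectl auth can-i get nodes \
  --as system:serviceaccount:default:aicr
```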
Job Pending
Check node selectors and tolerations:
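A pending pod usually reports a `FailedScheduling` event naming the unsatisfied constraint:

```shell
# Inspect scheduling events and compare against node labels
kubectl describe pod -l job-name=aicr -n default
kubectl get nodes --show-labels
```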
Job Completes but No Output
Check ConfigMap and container logs:
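Re-run with `--no-cleanup` so the Job is retained, then inspect both sides (the ConfigMap name here matches the earlier examples):

```shell
# Confirm the snapshot landed in the ConfigMap, then check the container logs
kubectl get configmap aicr-snapshot -n gpu-operator -o yaml
kubectl logs job/aicr -n default
```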
Permission Denied
Ensure RBAC is correctly deployed:
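With the default resource names:

```shell
# All of these should exist while the Job is running
kubectl get serviceaccount aicr -n default
kubectl get role,rolebinding -n default | grep aicr
kubectl get clusterrole,clusterrolebinding | grep aicr
```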
Security Considerations
RBAC Permissions
The agent requires these permissions (created automatically by the CLI):
- ClusterRole (`aicr-node-reader`): Read access to nodes, pods, and ClusterPolicy CRDs (`nvidia.com`)
- Role (`aicr`): Create/update ConfigMaps and list pods in the deployment namespace
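The shape of these resources is roughly as follows (an illustrative sketch; the exact rules the CLI creates may differ):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aicr-node-reader
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods"]
    verbs: ["get", "list"]
  - apiGroups: ["nvidia.com"]
    resources: ["clusterpolicies"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: aicr
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "create", "update"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list"]
```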
Pod Security Context
The agent requires elevated privileges to collect system configuration from the host:
- `hostPID`, `hostNetwork`, `hostIPC`: Required to read host system configuration
- `privileged` + `SYS_ADMIN`: Required to access GPU configuration and kernel parameters
- `/run/systemd` mount: Required to query systemd service states
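In pod-spec terms, the security-relevant settings above correspond to something like this (an illustrative fragment, not the exact manifest the CLI generates):

```yaml
spec:
  hostPID: true
  hostNetwork: true
  hostIPC: true
  containers:
    - name: aicr
      securityContext:
        privileged: true
        capabilities:
          add: ["SYS_ADMIN"]
      volumeMounts:
        - name: run-systemd
          mountPath: /run/systemd
          readOnly: true
  volumes:
    - name: run-systemd
      hostPath:
        path: /run/systemd
```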
See Also
- CLI Reference - aicr CLI commands
- Installation Guide - Install CLI locally
- API Reference - REST API usage
- Kubernetes Deployment - API server deployment