Kubernetes Deployment


Deploy the AICR API Server in your Kubernetes cluster for self-hosted recipe generation.

Overview

API Server deployment enables self-hosted recipe generation:

  • Isolated deployment: Recipe data stays within your infrastructure
  • Custom recipes: Modify embedded recipe data (see recipes/)
  • High availability: Deploy multiple replicas with load balancing
  • Observability: Prometheus /metrics endpoint and structured logging

API Server scope:

  • Recipe generation from query parameters (query mode)
  • Does not capture snapshots (use agent Job or CLI)
  • Generates bundles via POST /v1/bundle (see the example after this list)
  • Does not analyze snapshots (query mode only)
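A quick way to confirm this scope from a workstation is to port-forward the Service and probe the endpoints. This is a hedged sketch: the /health endpoint and POST /v1/bundle path are documented above, but the request body (a recipe document, mirroring aicr bundle --recipe) and content type are assumptions; check the API reference for the actual schema.

$# Port-forward the Service deployed later on this page
$kubectl port-forward -n aicr svc/aicrd 8080:80 &
$
$# Health check
$curl http://localhost:8080/health
$
$# Bundle generation (request body shape is an assumption)
$curl -X POST http://localhost:8080/v1/bundle \
>   -H "Content-Type: application/yaml" \
>   --data-binary @recipe.yaml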

Agent deployment (separate component):

  • Kubernetes Job captures cluster configuration
  • Writes snapshot to ConfigMap via Kubernetes API
  • Requires RBAC: ServiceAccount with ConfigMap create/update permissions
  • See Agent Deployment

Typical workflow:

  1. Deploy agent Job → Captures snapshot → Writes to ConfigMap
  2. CLI reads ConfigMap → Generates recipe → Writes to file or ConfigMap
  3. CLI reads recipe → Generates bundle → Writes to filesystem
  4. Apply bundle to cluster (Helm install, kubectl apply)
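Condensed into commands (all covered in detail later on this page), the workflow looks roughly like this; whether a bundle directory can be applied directly with kubectl depends on the bundle's contents:

$# 1. Capture a snapshot with the agent Job (manifests under Agent Deployment below)
$kubectl apply -f agent-job.yaml
$kubectl wait --for=condition=complete job/aicr -n gpu-operator --timeout=5m
$
$# 2. Generate a recipe from the snapshot ConfigMap
$aicr recipe --snapshot cm://gpu-operator/aicr-snapshot \
>   --intent training --platform kubeflow --output recipe.yaml
$
$# 3. Generate a bundle from the recipe
$aicr bundle --recipe recipe.yaml --output ./bundles
$
$# 4. Apply the bundle (kubectl apply or helm install, depending on the bundle)
$kubectl apply -f ./bundles/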

Quick Start

$# Create namespace
$kubectl create namespace aicr
$
$# Deploy API server (save the manifest from the Deployment section below as aicrd-deployment.yaml)
$kubectl apply -f aicrd-deployment.yaml
$
$# Check deployment
$kubectl get pods -n aicr
$kubectl get svc -n aicr

Helm chart: Not yet available. Use the manual manifests below.

Manual Deployment

1. Create Namespace

# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: aicr
  labels:
    app: aicrd
$kubectl apply -f namespace.yaml

2. Create Deployment

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aicrd
  namespace: aicr
  labels:
    app: aicrd
spec:
  replicas: 3
  selector:
    matchLabels:
      app: aicrd
  template:
    metadata:
      labels:
        app: aicrd
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 65532
        fsGroup: 65532
        seccompProfile:
          type: RuntimeDefault  # required by the "restricted" Pod Security Standard applied under Security below

      containers:
      - name: api-server
        image: ghcr.io/nvidia/aicrd:latest
        imagePullPolicy: IfNotPresent

        ports:
        - name: http
          containerPort: 8080
          protocol: TCP

        env:
        - name: PORT
          value: "8080"
        - name: AICR_LOG_LEVEL
          value: "info"

        livenessProbe:
          httpGet:
            path: /health
            port: http
          initialDelaySeconds: 10
          periodSeconds: 30
          timeoutSeconds: 5
          failureThreshold: 3

        readinessProbe:
          httpGet:
            path: /ready
            port: http
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi

        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop: ["ALL"]
$kubectl apply -f deployment.yaml

3. Create Service

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: aicrd
  namespace: aicr
  labels:
    app: aicrd
spec:
  type: ClusterIP
  selector:
    app: aicrd
  ports:
  - name: http
    port: 80
    targetPort: http
    protocol: TCP
$kubectl apply -f service.yaml

4. Create Ingress (Optional)

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: aicrd
  namespace: aicr
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/limit-rps: "100"  # requests per second per client IP
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - aicr.yourdomain.com
    secretName: aicr-tls
  rules:
  - host: aicr.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: aicrd
            port:
              number: 80
$kubectl apply -f ingress.yaml

Agent Deployment

Deploy the AICR Agent as a Kubernetes Job to automatically capture cluster configuration.

1. Create RBAC Resources

# agent-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aicr
  namespace: gpu-operator
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: aicr
  namespace: gpu-operator
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: aicr
  namespace: gpu-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: aicr
subjects:
- kind: ServiceAccount
  name: aicr
  namespace: gpu-operator # Must match ServiceAccount namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aicr
rules:
- apiGroups: [""]
  resources: ["nodes", "pods"]
  verbs: ["get", "list"]
- apiGroups: ["nvidia.com"]
  resources: ["clusterpolicies"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aicr
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: aicr
subjects:
- kind: ServiceAccount
  name: aicr
  namespace: gpu-operator
$kubectl apply -f agent-rbac.yaml

2. Create Agent Job

# agent-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: aicr
  namespace: gpu-operator
  labels:
    app: aicr-agent
spec:
  template:
    metadata:
      labels:
        app: aicr-agent
    spec:
      serviceAccountName: aicr
      restartPolicy: Never
      # Host namespaces required by the GPU, SystemD, and OS collectors
      hostPID: true
      hostNetwork: true
      hostIPC: true

      containers:
      - name: aicr
        image: ghcr.io/nvidia/aicr:latest
        imagePullPolicy: IfNotPresent

        command:
        - aicr
        - snapshot
        - --output
        - cm://gpu-operator/aicr-snapshot

        securityContext:
          privileged: true
          runAsUser: 0
          runAsGroup: 0

        volumeMounts:
        - name: systemd
          mountPath: /run/systemd  # assumed mount path, mirroring the hostPath, for the SystemD collector
          readOnly: true

      volumes:
      - name: systemd
        hostPath:
          path: /run/systemd
          type: Directory

Note: The agent defaults to privileged mode, which is required for GPU, SystemD, and OS collectors. For PSS-restricted namespaces where only the Kubernetes collector is needed, use --privileged=false when deploying via the CLI. See Agent Deployment for details.

$kubectl apply -f agent-job.yaml
$
$# Wait for completion
$kubectl wait --for=condition=complete job/aicr -n gpu-operator --timeout=5m
$
$# Verify ConfigMap was created
$kubectl get configmap aicr-snapshot -n gpu-operator
$
$# View snapshot data
$kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}'

3. Generate Recipe from ConfigMap

$# Using CLI (local or in another Job)
$aicr recipe --snapshot cm://gpu-operator/aicr-snapshot \
> --intent training \
> --platform kubeflow \
> --output recipe.yaml
$
$# Or write recipe back to ConfigMap
$aicr recipe --snapshot cm://gpu-operator/aicr-snapshot \
> --intent training \
> --platform kubeflow \
> --output cm://gpu-operator/aicr-recipe

4. Generate Bundle

$# From file
$aicr bundle --recipe recipe.yaml --output ./bundles
$
$# From ConfigMap
$aicr bundle --recipe cm://gpu-operator/aicr-recipe --output ./bundles

E2E Testing

Validate the complete workflow:

$# Run all CLI integration tests (no cluster needed)
$make e2e
$
$# Run cluster-based E2E tests (requires Kind cluster)
$make e2e-tilt

CLI tests use Kyverno Chainsaw for declarative YAML assertions. See tests/chainsaw/README.md for details.
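For orientation, a minimal Chainsaw test has roughly this shape; this is an illustrative sketch, not one of the tests in tests/chainsaw/:

# chainsaw-test.yaml (illustrative)
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: example
spec:
  steps:
  - try:
    - apply:
        file: resource.yaml          # create a test resource
    - assert:
        file: resource-assert.yaml   # declaratively assert its expected state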

Configuration Options

Environment Variables

Variable         Default   Description
PORT             8080      HTTP server port
AICR_LOG_LEVEL   info      Logging level: debug, info, warn, error
RATE_LIMIT       100       Requests per second
RATE_BURST       200       Burst capacity
READ_TIMEOUT     30s       HTTP read timeout
WRITE_TIMEOUT    30s       HTTP write timeout
IDLE_TIMEOUT     60s       HTTP idle timeout

Note: The API server uses structured JSON logging to stderr. The CLI supports three logging modes (CLI/Text/JSON), but the API server always uses JSON for consistent log aggregation.
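For example, to raise the log level and relax timeouts on a running deployment without editing the manifest (the values here are illustrative):

$kubectl set env deployment/aicrd -n aicr \
>   AICR_LOG_LEVEL=debug READ_TIMEOUT=60s WRITE_TIMEOUT=60s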

ConfigMap for Custom Recipe Data (Advanced)

Note: This example shows the concept of mounting custom recipe data. The actual recipe format uses a base-plus-overlay architecture. See recipes/ for the current schema (overlays/*.yaml including base.yaml).

# configmap.yaml - Example showing custom recipe data mounting
# Note: ConfigMap keys cannot contain "/", so the overlay is stored under a flat
# key and mounted into an overlays/ directory in the Deployment snippet below.
apiVersion: v1
kind: ConfigMap
metadata:
  name: aicr-recipe-data
  namespace: aicr
data:
  base.yaml: |
    # Your custom base recipe
    apiVersion: aicr.nvidia.com/v1alpha1
    kind: RecipeMetadata
    # ... (see recipes/overlays/base.yaml for schema)

Mount in deployment:

spec:
  template:
    spec:
      volumes:
      - name: recipe-data
        configMap:
          name: aicr-recipe-data
      containers:
      - name: api-server
        volumeMounts:
        - name: recipe-data
          mountPath: /data/overlays  # places base.yaml at /data/overlays/base.yaml (assumes overlays live under RECIPE_DATA_PATH)
        env:
        - name: RECIPE_DATA_PATH
          value: /data

High Availability

Horizontal Pod Autoscaler

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aicrd
  namespace: aicr
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aicrd
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
$kubectl apply -f hpa.yaml

Pod Disruption Budget

# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: aicrd
  namespace: aicr
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: aicrd
$kubectl apply -f pdb.yaml

Monitoring

Prometheus ServiceMonitor

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aicrd
  namespace: aicr
  labels:
    app: aicrd
spec:
  selector:
    matchLabels:
      app: aicrd
  endpoints:
  - port: http
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
$kubectl apply -f servicemonitor.yaml

Grafana Dashboard

Key panels:

  • Request rate (by status code)
  • Request duration (p50, p95, p99)
  • Error rate
  • Rate limit rejections
  • Active connections

Security

Network Policies

# networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aicrd
  namespace: aicr
spec:
  podSelector:
    matchLabels:
      app: aicrd
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: UDP
      port: 53 # DNS
    - protocol: TCP
      port: 53 # DNS over TCP
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: TCP
      port: 443 # Kubernetes API

Pod Security Standards

# Add to namespace
apiVersion: v1
kind: Namespace
metadata:
  name: aicr
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

RBAC (If API server needs K8s access)

# serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aicrd
  namespace: aicr

---
# clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aicrd
rules:
- apiGroups: [""]
  resources: ["nodes", "pods"]
  verbs: ["get", "list"]

---
# clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aicrd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: aicrd
subjects:
- kind: ServiceAccount
  name: aicrd
  namespace: aicr
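
The base Deployment above does not set a ServiceAccount; if the API server needs these permissions, reference it in the pod spec:

spec:
  template:
    spec:
      serviceAccountName: aicrd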

Troubleshooting

Check Pod Status

$# Pod status
$kubectl get pods -n aicr
$
$# Describe pod
$kubectl describe pod -n aicr -l app=aicrd
$
$# View logs
$kubectl logs -n aicr -l app=aicrd
$
$# Follow logs
$kubectl logs -n aicr -l app=aicrd -f

Check Service

$# Service status
$kubectl get svc -n aicr
$
$# Endpoints
$kubectl get endpoints -n aicr
$
$# Test from within cluster
$kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
> curl http://aicrd.aicr.svc.cluster.local/health

Check Ingress

$# Ingress status
$kubectl get ingress -n aicr
$
$# Describe ingress
$kubectl describe ingress aicrd -n aicr
$
$# Check cert-manager certificate
$kubectl get certificate -n aicr

Performance Issues

$# Check resource usage
$kubectl top pods -n aicr
$
$# Check HPA status
$kubectl get hpa -n aicr
$
$# Check metrics
$kubectl exec -n aicr -it deploy/aicrd -- \
> wget -qO- http://localhost:8080/metrics

Connection Refused

  1. Check service exists: kubectl get svc -n aicr
  2. Check endpoints: kubectl get endpoints -n aicr
  3. Check pod is ready: kubectl get pods -n aicr
  4. Check readiness probe: kubectl describe pod -n aicr <pod-name>

Rate Limiting

Check rate limit settings:

$kubectl exec -n aicr deploy/aicrd -- env | grep RATE

Adjust via deployment:

env:
- name: RATE_LIMIT
  value: "200" # Increase limit
- name: RATE_BURST
  value: "400"

Upgrading

Rolling Update

$# Update image
$kubectl set image deployment/aicrd \
> api-server=ghcr.io/nvidia/aicrd:v0.8.0 \
> -n aicr
$
$# Watch rollout
$kubectl rollout status deployment/aicrd -n aicr
$
$# Rollback if needed
$kubectl rollout undo deployment/aicrd -n aicr

Blue-Green Deployment

$# Deploy new version
$kubectl apply -f deployment-v2.yaml
$
$# Switch service
$kubectl patch service aicrd -n aicr \
> -p '{"spec":{"selector":{"version":"v2"}}}'
$
$# Delete old deployment
$kubectl delete deployment aicrd-v1 -n aicr
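
Note that the selector patch assumes your Deployments carry a version label alongside app, which the base manifest above does not set; a hypothetical v2 Deployment would label its pods like this:

# deployment-v2.yaml (excerpt, hypothetical)
metadata:
  name: aicrd-v2
spec:
  selector:
    matchLabels:
      app: aicrd
      version: v2
  template:
    metadata:
      labels:
        app: aicrd
        version: v2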

Backup and Disaster Recovery

Export Configuration

$# Export all resources
$kubectl get all -n aicr -o yaml > aicr-backup.yaml
$
$# Export specific resources
$kubectl get deployment,service,ingress -n aicr -o yaml > aicr-config.yaml

Restore from Backup

$# Restore namespace and resources
$kubectl apply -f aicr-backup.yaml

Cost Optimization

Resource Limits

Start with minimal resources:

resources:
  requests:
    cpu: 50m
    memory: 64Mi
  limits:
    cpu: 200m
    memory: 256Mi

Monitor and adjust based on usage.
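
Adjustments can be applied in place once you have usage data, for example:

$# Observe actual usage, then resize requests/limits accordingly
$kubectl top pods -n aicr
$kubectl set resources deployment/aicrd -n aicr \
>   --requests=cpu=50m,memory=64Mi --limits=cpu=200m,memory=256Mi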

Vertical Pod Autoscaler (Optional)

# vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: aicrd
  namespace: aicr
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aicrd
  updatePolicy:
    updateMode: "Auto"

See Also