Talos integration

Talos Linux enforces stricter pod-security defaults than most managed Kubernetes distributions, and several AICR components need privileged access (host filesystem, host network, root user) to do their jobs. AICR ships an opt-in os-talos mixin that handles the namespace and Pod Security Admission (PSA) bookkeeping for the affected components so a recipe author doesn’t have to.

This page documents what the mixin does and why. It assumes you already have a Talos cluster reachable via kubectl and that you’ve read /aicr/integrator-guide/recipe-development for the general recipe / overlay / mixin model.

What happens when you select os=talos

A leaf overlay that targets Talos opts into the mixin:

spec:
  base: <some-base-overlay>
  mixins:
    - os-talos
  criteria:
    os: talos

When aicr bundle resolves that recipe, the os-talos mixin:

  1. Overrides the install namespace of five components to privileged-<component>.
  2. Attaches a per-component manifests/talos-namespace.yaml manifest (a Namespace resource with PSA-privileged labels) via the preManifestFiles field. The bundler therefore emits a -pre folder at sync-wave N-1, ahead of the corresponding chart at wave N.
  3. Adds an OS.release.ID == talos constraint to the recipe so the bundle won’t silently install on a non-Talos cluster.

No manual pre-cluster setup by a cluster operator is required. Either kubectl apply -f or an Argo CD GitOps sync creates the namespaces and installs the charts in the right order.

Namespaces the mixin creates

| Component | Namespace | Manifest source |
| --- | --- | --- |
| gpu-operator | privileged-gpu-operator | recipes/components/gpu-operator/manifests/talos-namespace.yaml |
| network-operator | privileged-network-operator | recipes/components/network-operator/manifests/talos-namespace.yaml |
| nvsentinel | privileged-nvsentinel | recipes/components/nvsentinel/manifests/talos-namespace.yaml |
| nvidia-dra-driver-gpu | privileged-nvidia-dra-driver-gpu | recipes/components/nvidia-dra-driver-gpu/manifests/talos-namespace.yaml |
| nodewright-operator | privileged-nodewright-operator | recipes/components/nodewright-operator/manifests/talos-namespace.yaml |

Why these components run privileged

  • gpu-operator: installs NVIDIA drivers via DaemonSets that need privileged: true, hostPath mounts into /sys and /dev, and hostPath access for kernel-module loading.
  • network-operator: installs RDMA/NIC drivers; the driver DaemonSet needs hostNetwork plus kernel-module privileges.
  • nvsentinel: a health and observability daemon that reads the kernel ring buffer, host log paths, and hardware sysfs entries.
  • nvidia-dra-driver-gpu: a Dynamic Resource Allocation (DRA) plugin that reads and writes CDI device manifests under /var/run/cdi, which requires hostPath and root.
  • nodewright-operator: a controller that manages kernel-tuning Customization CRDs. The operator itself is the gate for privileged actions on managed nodes.

Pod Security Standards label set

Each generated Namespace carries:

pod-security.kubernetes.io/enforce: privileged
pod-security.kubernetes.io/enforce-version: latest
pod-security.kubernetes.io/audit: privileged
pod-security.kubernetes.io/audit-version: latest
pod-security.kubernetes.io/warn: privileged
pod-security.kubernetes.io/warn-version: latest
app.kubernetes.io/managed-by: aicr
app.kubernetes.io/component: <component-name>
aicr.run/os: talos

Setting all three of enforce, audit, and warn to privileged keeps audit logs and API-server warnings consistent with what’s actually being enforced. The AICR-managed labels make these namespaces selectable for fleet-wide audits (“which privileged namespaces in this cluster are AICR-managed?”).
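
Putting the label set together, a generated Namespace for gpu-operator would look roughly like the sketch below. The name and component value follow the "Namespaces the mixin creates" section; the field order and the absence of any other metadata are assumptions, not a copy of the shipped talos-namespace.yaml.

```yaml
# Illustrative sketch of recipes/components/gpu-operator/manifests/talos-namespace.yaml.
# Labels are the ones listed above; layout details are assumed, not taken from the repo.
apiVersion: v1
kind: Namespace
metadata:
  name: privileged-gpu-operator
  labels:
    # Pod Security Admission: enforce, audit, and warn all set to the same level
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/audit-version: latest
    pod-security.kubernetes.io/warn: privileged
    pod-security.kubernetes.io/warn-version: latest
    # AICR bookkeeping labels, usable as selectors for fleet-wide audits
    app.kubernetes.io/managed-by: aicr
    app.kubernetes.io/component: gpu-operator
    aicr.run/os: talos
```

With labels like these in place, a selector query such as kubectl get namespaces -l app.kubernetes.io/managed-by=aicr,aicr.run/os=talos should list the AICR-managed privileged namespaces in a cluster.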

Apply ordering

For each affected component the bundle contains:

NNN-<component>-pre/ # Namespace + PSA labels (sync-wave N-1 in Argo CD)
(NNN+1)-<component>/ # the chart (sync-wave N in Argo CD)

The bundler emits the -pre folder ahead of the primary folder in the local directory layout, and Argo CD’s sync-wave is the folder index, so:

  • Helm deployer: the generated install.sh iterates folders in order, so helm install for the namespace happens before the chart install. No operator action needed.
  • Argo CD deployer: each folder becomes an Application with argocd.argoproj.io/sync-wave: "<index>". The pre-folder has the lowest wave, so Argo applies it first. No operator action needed.
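
As a rough illustration of the Argo CD case, the pair of Applications for one component might look like the sketch below. Only the argocd.argoproj.io/sync-wave annotation and the -pre / primary folder pairing come from this page. The folder indices (003 and 004), Application names, repoURL, project, and the use of a plain path source are hypothetical placeholders, and the bundler's real output may differ.

```yaml
# Hypothetical Applications for gpu-operator; not actual bundler output.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator-pre              # placeholder name
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "3" # lower wave: Namespace + PSA labels go first
spec:
  project: default
  source:
    repoURL: https://example.com/your-org/your-bundle.git  # placeholder repository
    targetRevision: HEAD
    path: 003-gpu-operator-pre        # the -pre folder
  destination:
    server: https://kubernetes.default.svc
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator                  # placeholder name
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "4" # next wave: the chart itself
spec:
  project: default
  source:
    repoURL: https://example.com/your-org/your-bundle.git  # placeholder repository
    targetRevision: HEAD
    path: 004-gpu-operator            # the primary folder
  destination:
    server: https://kubernetes.default.svc
    namespace: privileged-gpu-operator
```

Because Argo CD applies lower waves first, the privileged Namespace exists before any pod from the chart is admitted.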

What the mixin does NOT cover

The following privileged components are intentionally not in the mixin. If you hit a PSA rejection on one of them when deploying on Talos, please open an issue referencing #565:

  • aws-ebs-csi-driver, aws-efa: cloud-specific drivers that belong in a per-cloud mixin (future work).
  • gke-nccl-tcpxo: GKE-specific NIC tuning; same reasoning.
  • nodewright-customizations: the Customization CRDs that nodewright-operator manages; out of scope until their per-node privileged story is settled.
  • kube-prometheus-stack: only the node-exporter daemon needs privileged access. The right fix is a chart-level override (nodeExporter.enabled: false, or a custom Helm value pointing the daemon at a separate namespace), not a whole-chart namespace move.
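
For the kube-prometheus-stack case, the simpler of the two chart-level overrides is a one-key Helm values fragment. This is a minimal sketch; how the values are supplied to the component in an AICR recipe is not covered on this page and is left as an assumption.

```yaml
# Helm values fragment for kube-prometheus-stack (sketch).
# Turns off the node-exporter DaemonSet, the only part of the chart that needs privileged access.
nodeExporter:
  enabled: false
```

The trade-off is that host-level metrics from node-exporter are no longer collected, so this fits clusters where those metrics come from elsewhere or are not needed.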

Snapshot agent on Talos

The aicr snapshot command’s agent pod has its own Talos handling that is separate from this mixin. The agent’s OS=talos pod-shape branch in pkg/k8s/agent/job.go skips the /run/systemd and /etc/os-release hostPath mounts because Talos has no systemd D-Bus. See PR #714 for the agent-side history and tools/talos-test/ for the local-cluster test harness.

References