Testing AICR
AICR’s test pyramid has five layers. Unit tests are the broad base —
table-driven, hermetic, --no-cluster. Above them sit integration
tests against a real Kubernetes API (Kind), Chainsaw post-deploy
health checks, KWOK matrix tests that exercise scheduling shape and
deployer output without GPU hardware, and a thin top of E2E tests
against real cloud accounts.
What to use when. If the code path can be exercised without the Kubernetes API, write a unit test. If it cannot, prefer KWOK or Chainsaw over Kind, and Kind over an E2E.
The pre-push gate is make qualify. It runs tests with the race
detector and coverage threshold, lints (golangci-lint + yamllint),
the e2e pipeline, a vulnerability scan, and a license check. The BOM
regen check (make bom-check) is opt-in and is not part of this gate.
CI runs the equivalent: if make qualify passes locally, CI will pass.
Test Surfaces
The rest of this page covers each surface in the order a typical
change touches them — unit, integration, chainsaw, KWOK, E2E — plus
the make qualify gate and common gotchas.
Unit Tests
Unit tests are the default. Required patterns from CLAUDE.md: table-driven cases, race detector enabled, no live cluster access.
Table-driven (mandatory for multiple cases):
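A minimal sketch of the table-driven shape, not code from the repository. The helper validDeployer is hypothetical and exists only to give the table something to exercise; its allowlist mirrors the DEPLOYER values named later on this page.

```go
package main

import (
	"fmt"
	"testing"
)

// validDeployer is a hypothetical helper, not a function from the AICR
// codebase. The allowlist mirrors the DEPLOYER values named on this page.
func validDeployer(name string) bool {
	switch name {
	case "helm", "argocd-oci", "argocd-helm-oci", "flux-oci", "flux-git":
		return true
	}
	return false
}

// TestValidDeployer would normally live in a _test.go file and run under
// `go test -race`. Each case is one row; a new case is a new row.
func TestValidDeployer(t *testing.T) {
	tests := []struct {
		name  string
		input string
		want  bool
	}{
		{name: "helm is accepted", input: "helm", want: true},
		{name: "flux-git is accepted", input: "flux-git", want: true},
		{name: "unknown value is rejected", input: "argocd-git", want: false},
		{name: "empty value is rejected", input: "", want: false},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := validDeployer(tt.input); got != tt.want {
				t.Errorf("validDeployer(%q) = %v, want %v", tt.input, got, tt.want)
			}
		})
	}
}

// main exists only so the sketch also compiles as a standalone program.
func main() {
	fmt.Println(validDeployer("helm"))
}
```

Naming each row and running it through t.Run keeps failures attributable to a single case, which is the main reason to prefer the table over repeated assertions.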
Race detector is always on: make test-coverage runs go test -race ./... .
CLI tests capture cmd.Writer, not stdout. CLI commands write
through cmd.Root().Writer so tests can intercept output:
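A minimal sketch of the capture pattern, under stated assumptions: Command is a hypothetical stand-in that carries only a Writer field, not the real CLI library type, and runVersion with its output string is invented for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// Command is a hypothetical stand-in for the real CLI command type.
// Only the Writer field matters for this pattern.
type Command struct {
	Writer io.Writer
}

// runVersion is an illustrative action. It writes through the command's
// Writer rather than calling fmt.Println, so a test can intercept it.
func runVersion(cmd *Command) error {
	_, err := fmt.Fprintln(cmd.Writer, "aicr version dev")
	return err
}

// captureOutput runs an action with a buffer as the Writer and returns
// whatever the action wrote.
func captureOutput(run func(*Command) error) (string, error) {
	var buf bytes.Buffer
	if err := run(&Command{Writer: &buf}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := captureOutput(runVersion)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

An action that printed straight to os.Stdout would leave the buffer empty, which is why the pattern only works when every write goes through the Writer.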
Direct fmt.Println / fmt.Printf to stdout in pkg/cli breaks
this pattern and is a review-blocker.
Coverage floor: 75% (project-wide, from .settings.yaml
quality.coverage_threshold). make test-coverage enforces it.
Per-package decreases > 0.5% are flagged for justification.
Test Isolation (--no-cluster)
Any test that touches the validator or a collector must run with
--no-cluster set. Two reasons:
- Hermeticity. Without the flag, a test that happens to have a kubeconfig pointed at a real cluster will connect to it and create resources. This happens silently on both CI runners and laptops.
- Speed. No RBAC creation, no Job deployment, no waiting on pods.
How to set it: pass WithNoCluster(true) in Go, or --no-cluster on the CLI and in chainsaw scripts.
Behavior with NoCluster=true: the validator skips ServiceAccount /
Role / ClusterRole creation, skips Job deployment for container-per-validator
checks, and reports each check as "skipped — no-cluster mode (test mode)". Constraints are still evaluated inline, because constraint
evaluation reads the snapshot rather than the API server. A test that
asserts on validator output must therefore assert on the constraint
results, not on check results.
The option lives in pkg/validator/options.go
(WithNoCluster). Adding a new validator entry point that talks to
the API must respect this flag — the anti-patterns table treats live
cluster access in tests as a review-blocker.
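The behavior described above can be sketched with a functional option. This is an illustration under assumptions, not the pkg/validator API: only the name WithNoCluster and its boolean argument come from this page, while Options, Result, Validate, the snapshot key gpu.model, and the check and constraint names are hypothetical.

```go
package main

import "fmt"

// Options and Option are hypothetical stand-ins for the real option types.
type Options struct {
	NoCluster bool
}

type Option func(*Options)

// WithNoCluster mirrors the option name used on this page; the signature
// is assumed from the usage WithNoCluster(true).
func WithNoCluster(v bool) Option {
	return func(o *Options) { o.NoCluster = v }
}

// Result separates check outcomes from constraint outcomes.
type Result struct {
	Checks      map[string]string
	Constraints map[string]bool
}

// Validate is an illustrative entry point. Constraints read only the
// snapshot, so they are evaluated in both modes. Checks would need the
// API server, so they are reported as skipped in no-cluster mode.
func Validate(snapshot map[string]string, opts ...Option) Result {
	var o Options
	for _, opt := range opts {
		opt(&o)
	}
	res := Result{
		Checks:      map[string]string{},
		Constraints: map[string]bool{},
	}
	res.Constraints["gpu-model-present"] = snapshot["gpu.model"] != ""
	if o.NoCluster {
		res.Checks["example-check"] = "skipped"
		return res
	}
	// A real implementation would create RBAC and deploy a Job here.
	res.Checks["example-check"] = "requires cluster"
	return res
}

func main() {
	res := Validate(map[string]string{"gpu.model": "h100"}, WithNoCluster(true))
	fmt.Println(res.Constraints["gpu-model-present"], res.Checks["example-check"])
}
```

A test built on this shape asserts on res.Constraints, because every entry in res.Checks is a skip marker when the flag is set.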
Coverage Gate Workflow
Before pushing a Go change, verify the per-package coverage delta on
the narrowest directory root your change touches. $pkg/... includes
descendants — pick the narrowest root that covers the diff.
Writing the baseline profile to $TMPDIR/base.out is deliberate: a
profile inside the worktree disappears with git worktree remove --force.
Gates:
- Block if make test-coverage fails (project-wide 75% floor).
- Block if any new exported function or method has 0% coverage in the diff; add tests before pushing.
- Flag any per-package decrease > 0.5% and explain in the PR.
Report the delta in the PR Testing section, e.g., pkg/recipe: 90.4% → 90.3% (-0.1%). CI also posts per-package deltas via
go-coverage-report after push, but the local gate catches
regressions before push.
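The per-package delta rule can be sketched as follows. This is an illustration, not the project's tooling: the Delta type and compare function are hypothetical, the pkg/cli percentages are made-up sample data, and the coverage numbers are assumed to have been extracted from the two profiles already.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// maxDrop is the per-package decrease, in percentage points, above which
// a change is flagged for justification.
const maxDrop = 0.5

// Delta holds one package's coverage before and after a change.
type Delta struct {
	Pkg        string
	Base, Head float64
}

// Change returns Head minus Base, rounded to one decimal place so that
// floating-point noise does not tip a value over the threshold.
func (d Delta) Change() float64 {
	return math.Round((d.Head-d.Base)*10) / 10
}

// String renders the delta in the form used in the PR Testing section.
func (d Delta) String() string {
	return fmt.Sprintf("%s: %.1f%% → %.1f%% (%+.1f%%)", d.Pkg, d.Base, d.Head, d.Change())
}

// Flagged reports whether the decrease exceeds maxDrop.
func (d Delta) Flagged() bool {
	return d.Change() < -maxDrop
}

// compare pairs up packages present in both profiles, sorted by name.
func compare(base, head map[string]float64) []Delta {
	var out []Delta
	for pkg, h := range head {
		if b, ok := base[pkg]; ok {
			out = append(out, Delta{Pkg: pkg, Base: b, Head: h})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Pkg < out[j].Pkg })
	return out
}

func main() {
	// Sample data only; the pkg/cli numbers are invented.
	base := map[string]float64{"pkg/recipe": 90.4, "pkg/cli": 81.0}
	head := map[string]float64{"pkg/recipe": 90.3, "pkg/cli": 80.2}
	for _, d := range compare(base, head) {
		fmt.Println(d, "flagged:", d.Flagged())
	}
}
```

The comparison is strict, so a drop of exactly 0.5 points is not flagged, matching the "> 0.5%" wording of the gate.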
Integration Tests (Kind)
When a test needs a real Kubernetes API — controller logic, RBAC
behavior, watch semantics — use Kind. make dev-env spins up a Kind
cluster and starts Tilt; make dev-env-clean tears it down. Prefer
controller-runtime’s envtest (local apiserver/etcd, no Kind) for
unit-scope controller tests; reserve Kind for cross-package or
deployer-output flows that need a full cluster.
Live-cluster tests still set --no-cluster when the call path
includes the validator — the flag suppresses validator RBAC and Job
deployment; cluster-touching code under test is unaffected.
Chainsaw Health Checks
Each component in the registry can carry an optional Chainsaw assert
YAML that runs after the bundle deploys to confirm the component is
healthy: pods Ready, services resolve, CRDs Established, custom
resources reach a known phase. The asserts live alongside the
component:
Discover them and run locally against a Kind cluster:
When to add one: the contributor wants to verify, after deploy,
that the component’s pods reach Ready, its CRDs are Established,
its operator deploys its custom resources, or its services resolve.
Anything subtler than “is it alive” belongs in a container-per-validator
check (see /aicr/contributor-guide/validators) so the assertion runs as part
of a tracked validation phase rather than a one-shot Chainsaw step.
Always include --no-cluster on aicr invocations inside the
chainsaw script. A chainsaw test that runs aicr validate without
the flag will attempt to install RBAC into the test cluster, which
diverges from the assertion’s intent (the validator job is the unit
under test elsewhere).
Components with an assert file today include gpu-operator,
network-operator, nfd, cert-manager, nvsentinel, kueue,
kubeflow-trainer, slinky-slurm, and more — browse
recipes/checks/
for the full list.
KWOK Matrix Testing
KWOK (Kubernetes WithOut Kubelet) simulates a GPU cluster without real hardware. CI uses it to validate two things per recipe:
- Scheduling shape. Node selectors, tolerations, and resource requests render correctly and land pods on the simulated nodes the overlay expects.
- Deployer output correctness. The same recipe is re-rendered through every output adapter (helm, argocd-oci, argocd-helm-oci, flux-oci, flux-git), and each renders to a working bundle the GitOps controllers can reconcile.
KWOK nodes have no kubelet, so pods never actually run. KWOK testing
does not exercise runtime validators (NCCL, inference-perf) or
component health — for those, see /aicr/contributor-guide/validators. If a
recipe’s constraints reference dimensions the KWOK node profiles do
not provide (e.g., a GPU model no profile registers), the bundle
renders but pods stay Pending or land on the wrong nodes. Extend
kwok/profiles/ rather than relax the recipe — KWOK is the
simulated reflection of production shape, not a relaxed substitute.
For the design rationale and the spike findings that justify the
chart pin and Repository-secret shape, see
ADR-008;
for the Git-source lanes (in-cluster Gitea, flux-git), see
ADR-010.
For cluster-level KWOK setup (node profiles, recipe auto-discovery),
see kwok/README.md.
Deployer Coverage Matrix
Tier 2 stays helm-only because its job is to verify accelerator-specific
overlays still render correctly when their inputs change. The deployer
shape is orthogonal — re-running through Argo CD would only re-exercise
template rendering, which Tier 1 and Tier 3 already cover on the generic
overlays. The flux-git lane covers the filesystem (Git-source)
round-trip half of #963;
the argocd-git half is a follow-up on the same Gitea infrastructure.
Running KWOK Locally
Valid DEPLOYER values: helm, argocd-oci, argocd-helm-oci,
flux-oci, flux-git. The target invokes
kwok/scripts/run-all-recipes.sh --deployer <name> <recipe>, which
calls install-infra.sh once with DEPLOYER exported (in-cluster
registry:2 always; Argo CD for argocd-*; Flux 2 controllers for
flux-*; Gitea additionally for flux-git), then runs
validate-scheduling.sh for the recipe.
Registry host port. The Kind cluster exposes the in-cluster
registry:2 Service on host port 5500 (kwok/kind-config.yaml’s
extraPortMappings). 5500 avoids Apple ControlCenter
(AirPlay / Handoff) which listens on host port 5000 by default on
macOS. Linux runners have 5500 free too. The in-cluster NodePort
(30500) and Service containerPort (5000) are hardcoded and
independent — Argo CD’s repo-server reaches the registry via Service
DNS (registry.aicr-registry.svc.cluster.local:5000) regardless.
Gitea host port (flux-git). The same pattern exposes the
in-cluster Gitea on host port 3300 (NodePort 30300) so the
runner can git push the filesystem bundle; 3300 avoids Gitea’s
default 3000, commonly held by Grafana / local dev servers. Flux’s
source-controller clones via Service DNS
(gitea.aicr-registry.svc.cluster.local:3000). Clusters created
before the 3300 mapping existed must be recreated
(kind delete cluster --name aicr-kwok-test) — install-infra.sh
exit code 71 is the telltale.
Sweeping all deployers locally. make kwok-test-all defaults to
helm; there is no matrix-aware make target. Loop in shell:
Failure Modes and Exit Codes
The three scripts emit distinct exit codes so CI, the Make target, and local loops can branch on failure mode without parsing logs.
Exit code 50 is distinct so the 3-strike rule in run-all-recipes.sh
counts only sync-deadline strikes, not bundle-render or
scheduling-shape failures.
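The strike-counting idea can be sketched as below. This is a hedged illustration, not a port of run-all-recipes.sh: only codes 50 and 71 and their meanings come from this page, the assumption that strikes accumulate across recipes without resetting is mine, and the sample exit-code sequence (including code 1) is invented.

```go
package main

import "fmt"

const (
	exitSyncTimeout  = 50 // sync deadline exceeded
	exitStaleCluster = 71 // install-infra.sh: cluster predates the Gitea port mapping
)

// strikeCounter counts only sync-deadline exits. Assumption: strikes
// accumulate across recipes and are never reset.
type strikeCounter struct {
	limit   int
	strikes int
}

// record registers one exit code and reports whether the run should abort.
func (s *strikeCounter) record(exitCode int) bool {
	if exitCode == exitSyncTimeout {
		s.strikes++
	}
	return s.strikes >= s.limit
}

// hint maps an exit code to the next step this page suggests.
func hint(exitCode int) string {
	switch exitCode {
	case 0:
		return "ok"
	case exitSyncTimeout:
		return "sync deadline exceeded: raise the relevant timeout before blaming the recipe"
	case exitStaleCluster:
		return "recreate the cluster: kind delete cluster --name aicr-kwok-test"
	default:
		return "see the script logs"
	}
}

func main() {
	counter := &strikeCounter{limit: 3}
	// Invented sequence of per-recipe exit codes.
	for _, code := range []int{50, 1, 50, 71, 50} {
		abort := counter.record(code)
		fmt.Printf("exit %d: %s (abort=%v)\n", code, hint(code), abort)
	}
}
```

Because the counter ignores every code except 50, a run with many render or scheduling failures still reports each of them instead of aborting early.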
Tuning the Sync Deadline
Four environment variables shape how long the GitOps lanes wait before declaring a sync timeout. Argo CD and Flux pairs are independent.
The flux-git lane additionally honors KWOK_GITEA_HOST_PORT
(default 3300), KWOK_GITEA_USER (default aicr), and
KWOK_GITEA_PASSWORD (default aicr-kwok-ci) — shared between
install-infra.sh (Gitea install + admin bootstrap) and
validate-scheduling.sh (git push). The password is a CI-only
credential for the ephemeral in-cluster Gitea, not a secret.
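For tooling that needs the same Gitea settings outside the shell scripts, the defaults above can be mirrored as in this sketch. The giteaConfig type and loadGiteaConfig function are hypothetical, and treating an empty variable as unset is an assumption about how the scripts behave.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// giteaConfig is a hypothetical holder for the flux-git lane settings.
type giteaConfig struct {
	HostPort int
	User     string
	Password string
}

// envOr returns the variable's value, or def when it is empty or unset.
func envOr(getenv func(string) string, key, def string) string {
	if v := getenv(key); v != "" {
		return v
	}
	return def
}

// loadGiteaConfig reads the three variables with the defaults documented
// on this page. getenv is injected so tests need not touch the process
// environment.
func loadGiteaConfig(getenv func(string) string) (giteaConfig, error) {
	port, err := strconv.Atoi(envOr(getenv, "KWOK_GITEA_HOST_PORT", "3300"))
	if err != nil {
		return giteaConfig{}, fmt.Errorf("KWOK_GITEA_HOST_PORT: %w", err)
	}
	return giteaConfig{
		HostPort: port,
		User:     envOr(getenv, "KWOK_GITEA_USER", "aicr"),
		Password: envOr(getenv, "KWOK_GITEA_PASSWORD", "aicr-kwok-ci"),
	}, nil
}

func main() {
	cfg, err := loadGiteaConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("gitea on host port %d as user %s\n", cfg.HostPort, cfg.User)
}
```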
On a clean local Kind cluster Synced+Healthy lands in ~30 s; the
300-second default exists to absorb CI variance. If a local run trips
code 50 but the cluster is otherwise healthy, raise the relevant
timeout before assuming the recipe is broken — cold-cluster image
pulls are the most common cause.
Debugging CI Failures
When kwok-test fails, it uploads an artifact named
kwok-debug-<recipe>-<deployer>-<run_id> containing:
- <cluster>-resources.txt, <cluster>-nodes.txt, <cluster>-pods.txt, <cluster>-events.txt
- <cluster>-argo-apps.yaml plus the repo-server and application-controller logs (argocd lanes)
- <cluster>-flux-resources.yaml (OCIRepositories, GitRepositories, Kustomizations, HelmReleases, ArtifactGenerators, ExternalArtifacts) plus source-, kustomize-, and helm-controller logs (flux-* lanes)
- <cluster>-registry.log: last 200 lines of the in-cluster registry:2
- <cluster>-gitea.log: last 200 lines of the in-cluster Gitea (flux-git lane)
Start with the repo-server log (Argo CD) or source-controller log
(Flux) for OCI-pull failures. Application-controller / kustomize-controller
logs show reconciliation decisions and prune behavior;
helm-controller logs surface per-HelmRelease install outcomes.
Adding a New Deployer Value
The deployer set is finite and matches what pkg/bundler emits. To
add a new value (say, argocd-git):
- Add a case branch in kwok/scripts/validate-scheduling.sh’s resolve_argocd_root_app() (or resolve_flux_root_names() for Flux-reconciled lanes), plus branches in generate_bundle and deploy_bundle. Reuse the existing argocd-oci / flux-oci / flux-git branches as templates (flux-git is the closest model for Git-source lanes: Gitea install, dual-view URLs, push-to-create).
- Extend the DEPLOYER allowlist in kwok/scripts/run-all-recipes.sh (the case "$DEPLOYER" in statement in main()).
- Extend the case "${DEPLOYER}" branches in install-infra.sh’s main() so the right controller stack is installed.
- Extend the deployer: input description in .github/actions/kwok-test/action.yml.
- Add the value to the deployer: matrix in Tier 1 and Tier 3 of .github/workflows/kwok-recipes.yaml. Leave Tier 2 alone; the orthogonality rationale above still applies.
- Add a row to the Deployer Coverage Matrix above so contributors can discover the new lane.
If the new value requires changes to in-cluster infra (different
registry, different Argo CD chart, additional CRDs), update
install-infra.sh and pin any new versions in .settings.yaml. The
exit-code taxonomy in Failure Modes and Exit Codes
is contiguous — pick the next free code if a new distinct failure
mode appears.
E2E Tests
./tools/e2e is the end-to-end pipeline runner. It builds, snapshots,
generates recipes, validates, bundles, and (when credentials are
available) deploys against real cloud accounts.
make qualify invokes the e2e step as part of the pre-push gate.
CI runs the same script in the push workflow. Cloud credentials are
optional — without them, the e2e exercises the artifact-generation
half of the pipeline and skips deploy-side assertions.
The make qualify Gate
make qualify is the canonical pre-push command. It runs:
- test-coverage: go test -race ./... plus the 75% coverage floor.
- lint: golangci-lint with .golangci.yaml plus yamllint.
- e2e: the end-to-end pipeline runner.
- scan: Grype vulnerability scan.
- license-check: license header / dependency-license sweep.
CI runs the equivalent. If make qualify passes locally on the
current branch, push CI will pass.
Branch lint gate for Go changes. If a PR changes any .go file,
you must also run:
This applies even to PRs labeled documentation when they include
incidental Go changes. Do not rely on CI to surface lint failures —
the pre-push gate is local.
Common Gotchas
- goreleaser fails when both GITLAB_TOKEN and GITHUB_TOKEN are set. make build, make qualify, and ./tools/e2e all invoke goreleaser indirectly. Always unset GITLAB_TOKEN in the shell first. This is one of the most common failures that occur only locally while CI passes.
- Forgetting make bom-docs after a recipes/registry.yaml, component values, or chart-pin change. docs/user/container-images.md goes stale silently: make bom-check is opt-in only and not wired into make qualify, make lint, or the merge gate today, so CI does not catch this. Run make bom-docs locally any time the change touches charts.
- Per-package coverage decrease > 0.5%. The decrease is flagged and must be justified in the PR (see Coverage Gate Workflow). Add tests rather than reaching for // nolint or t.Skip; both are review-blockers under the no-skip-tests rule in CLAUDE.md.
- Live-cluster connections from unit tests. A test that forgets --no-cluster will attach to whichever kubeconfig is current and create RBAC against it. Always pass WithNoCluster(true) (Go) or --no-cluster (CLI / chainsaw) on the validator path.
- CLI tests asserting on stdout. pkg/cli writes through cmd.Root().Writer. A test that captures os.Stdout will see nothing. Use cmd.SetOut(buf) and assert on buf.String().
See Also
- contributor index — package layout and the scope boundary
- /aicr/contributor-guide/recipes-overlays-and-mixins — recipe-level constraints and merge tests
- /aicr/contributor-guide/validators — validator engine, chainsaw checks, container-per-validator pattern
- CLAUDE.md — coding rules and anti-patterns table
- ADR-008 — KWOK deployer matrix rationale
- ADR-010 — Git-source lanes (Gitea, flux-git)
- kwok/README.md — KWOK cluster setup and node profiles