Testing AICR

AICR’s test pyramid has five layers. Unit tests are the broad base — table-driven, hermetic, --no-cluster. Above them sit integration tests against a real Kubernetes API (Kind), Chainsaw post-deploy health checks, KWOK matrix tests that exercise scheduling shape and deployer output without GPU hardware, and a thin top of E2E tests against real cloud accounts.

What to use when. If the code path can be exercised without the Kubernetes API, write a unit test. If it cannot, prefer KWOK or Chainsaw over Kind, and Kind over an E2E.

The pre-push gate is make qualify. It runs tests with the race detector and the coverage threshold, lints (golangci-lint + yamllint), e2e, a vulnerability scan, and a license check. The BOM regen check (make bom-check) is opt-in and is not part of make qualify. CI runs the equivalent, so a local make qualify pass is a strong signal that CI will pass.

Test Surfaces

| Surface | When to use | Lives in | Run locally | Gated by |
|---|---|---|---|---|
| Unit tests (Go) | Logic exercisable without K8s API | *_test.go next to source | make test | make qualify, push CI |
| Integration tests (Go) | Logic touching the K8s API | *_test.go with envtest / fake client | make test (Kind for live cases) | make qualify, push CI |
| Chainsaw health checks | Component-level post-deploy health | recipes/checks/<name>/health-check.yaml | make check-health COMPONENT=<name> | Bundle-validate workflow |
| KWOK matrix tests | Recipe scheduling shape + deployer output without GPUs | kwok/scripts/*, recipes/overlays/* | make kwok-test-deployer RECIPE=… DEPLOYER=… | kwok-recipes.yaml workflow |
| E2E tests | Full pipeline against real cloud accounts | tools/e2e | unset GITLAB_TOKEN && ./tools/e2e | make qualify, e2e workflow |

The rest of this page covers each surface in the order a typical change touches them — unit, integration, chainsaw, KWOK, E2E — plus the make qualify gate and common gotchas.

Unit Tests

Unit tests are the default. Required patterns from CLAUDE.md: table-driven cases, race detector enabled, no live cluster access.

Table-driven (mandatory for multiple cases):

func TestParseCriteria(t *testing.T) {
	tests := []struct {
		name    string
		input   string
		want    string
		wantErr bool
	}{
		{"valid h100", "h100", "h100", false},
		{"empty rejected", "", "", true},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			got, err := Parse(tt.input)
			if (err != nil) != tt.wantErr {
				t.Fatalf("err=%v wantErr=%v", err, tt.wantErr)
			}
			if got != tt.want {
				t.Errorf("got=%q want=%q", got, tt.want)
			}
		})
	}
}

Race detector is always on:

$ go test -race ./...                        # full module
$ go test -race -v ./pkg/recipe/...          # single package
$ go test -race -v ./pkg/recipe -run TestX   # single test

CLI tests capture cmd.Writer, not stdout. CLI commands write through cmd.Root().Writer so tests can intercept output:

buf := &bytes.Buffer{}
cmd := newRecipeCmd(client)
cmd.SetOut(buf) // capture command output in the buffer instead of stdout
cmd.SetArgs([]string{"--service", "eks", "--accelerator", "h100"})
if err := cmd.Execute(); err != nil {
	t.Fatalf("execute: %v", err)
}
// Assert on buf.String() from here on.

Direct fmt.Println / fmt.Printf to stdout in pkg/cli breaks this pattern and is a review-blocker.

Coverage floor: 75% (project-wide, from .settings.yaml quality.coverage_threshold). make test-coverage enforces it. Per-package decreases > 0.5% are flagged for justification.

Test Isolation (--no-cluster)

Any test that touches the validator or a collector must run with --no-cluster set. Two reasons:

  1. Hermeticity. A test that happens to have a kubeconfig pointed at a real cluster will connect to it and create resources. Without the flag, this happens silently on CI runners and laptops alike.
  2. Speed. No RBAC creation, no Job deployment, no waiting on pods.

How to set it:

// Go unit/integration tests
v := validator.New(
	validator.WithNoCluster(true),
	validator.WithVersion(version),
)

# CLI invocations in tests, scripts, and chainsaw
$ aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --no-cluster

# Chainsaw scripts: always include --no-cluster on aicr invocations
- script:
    content: |
      ${AICR_BIN} validate -r recipe.yaml -s snapshot.yaml --no-cluster

Behavior with NoCluster=true: the validator skips ServiceAccount / Role / ClusterRole creation, skips Job deployment for container-per-validator checks, and reports each check as "skipped — no-cluster mode (test mode)". Constraints are still evaluated inline, because constraint evaluation reads the snapshot rather than the API server. A test that asserts on validator output must therefore assert on the constraint results, not on check results.

The option lives in pkg/validator/options.go (WithNoCluster). Adding a new validator entry point that talks to the API must respect this flag — the anti-patterns table treats live cluster access in tests as a review-blocker.
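To make the contract concrete for a new entry point, here is a small sketch of a functional option that gates cluster work. Only the names WithNoCluster and WithVersion come from this page; the Validator struct, Result type, RunCheck, and EvalConstraint are hypothetical and do not mirror the real pkg/validator internals.

```go
package main

import "fmt"

// Validator is a hypothetical stand-in for the real validator type.
type Validator struct {
	noCluster bool
	version   string
}

// Option mutates a Validator during construction (functional options).
type Option func(*Validator)

// WithNoCluster borrows the real option's name; the body is illustrative.
func WithNoCluster(on bool) Option { return func(v *Validator) { v.noCluster = on } }

// WithVersion records a version string on the validator.
func WithVersion(s string) Option { return func(v *Validator) { v.version = s } }

// New applies the options in order.
func New(opts ...Option) *Validator {
	v := &Validator{}
	for _, o := range opts {
		o(v)
	}
	return v
}

// Result is a hypothetical outcome record.
type Result struct {
	Name   string
	Status string
}

// RunCheck calls deploy (the cluster-touching part) only when cluster
// access is allowed. In no-cluster mode it reports the check as skipped.
func (v *Validator) RunCheck(name string, deploy func() error) Result {
	if v.noCluster {
		return Result{Name: name, Status: "skipped"}
	}
	if err := deploy(); err != nil {
		return Result{Name: name, Status: "failed"}
	}
	return Result{Name: name, Status: "passed"}
}

// EvalConstraint reads only the snapshot, so it runs in both modes.
func (v *Validator) EvalConstraint(name string, snapshot map[string]string, key, want string) Result {
	if snapshot[key] == want {
		return Result{Name: name, Status: "passed"}
	}
	return Result{Name: name, Status: "failed"}
}

func main() {
	v := New(WithNoCluster(true), WithVersion("dev"))
	check := v.RunCheck("gpu-health", func() error {
		panic("must not touch the cluster in no-cluster mode")
	})
	constraint := v.EvalConstraint("accelerator",
		map[string]string{"gpu.model": "h100"}, "gpu.model", "h100")
	fmt.Println(check.Status, constraint.Status)
}
```

The design point is that the flag is checked before any cluster call is made, which is why tests in this mode should assert on constraint results rather than on check results.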

Coverage Gate Workflow

Before pushing a Go change, verify the per-package coverage delta on the narrowest directory root your change touches. $pkg/... includes descendants — pick the narrowest root that covers the diff.

# Assumes TMPDIR is set; on Linux it is often empty, so export TMPDIR=/tmp first.

# 1. Profile the working tree (changes must be committed first).
$ GOFLAGS="-mod=vendor" go test -coverprofile=cover.out ./pkg/recipe/...

# 2. Profile origin/main from a clean worktree, outside the source tree.
$ git worktree add $TMPDIR/baseline origin/main \
>   && (cd $TMPDIR/baseline && GOFLAGS="-mod=vendor" go test \
>     -coverprofile=$TMPDIR/base.out ./pkg/recipe/...); \
>   rc=$?; git worktree remove --force $TMPDIR/baseline; \
>   (exit $rc)

# 3. Compare totals.
$ go tool cover -func=$TMPDIR/base.out | tail -1
$ go tool cover -func=cover.out | tail -1

Writing the baseline profile to $TMPDIR/base.out is deliberate: a profile inside the worktree disappears with worktree remove --force.

Gates:

  • Block if make test-coverage fails (project-wide 75% floor).
  • Block if any new exported function or method has 0% coverage in the diff — add tests before pushing.
  • Flag any per-package decrease > 0.5% and explain in the PR.

Report the delta in the PR Testing section, e.g., pkg/recipe: 90.4% → 90.3% (-0.1%). CI also posts per-package deltas via go-coverage-report after push, but the local gate catches regressions before push.
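The comparison in step 3 is simple arithmetic on the two total lines. As a rough sketch (the helper names parseTotal, needsJustification, and report are hypothetical, and the sample lines are made-up inputs in the shape go tool cover -func prints), the delta and the 0.5% flag could be computed like this:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTotal extracts the percentage from the final line printed by
// `go tool cover -func`, which starts with "total:" and ends with "NN.N%".
func parseTotal(line string) (float64, error) {
	fields := strings.Fields(line)
	if len(fields) < 2 || fields[0] != "total:" {
		return 0, fmt.Errorf("not a total line: %q", line)
	}
	pct := strings.TrimSuffix(fields[len(fields)-1], "%")
	return strconv.ParseFloat(pct, 64)
}

// needsJustification applies the per-package rule from this page:
// a decrease of more than 0.5 percentage points must be explained.
func needsJustification(base, head float64) bool {
	return base-head > 0.5
}

// report formats the delta in the style used in the PR Testing section.
func report(pkg string, base, head float64) string {
	return fmt.Sprintf("%s: %.1f%% → %.1f%% (%+.1f%%)", pkg, base, head, head-base)
}

func main() {
	base, err := parseTotal("total:\t\t\t\t(statements)\t\t90.4%")
	if err != nil {
		panic(err)
	}
	head, err := parseTotal("total:\t\t\t\t(statements)\t\t90.3%")
	if err != nil {
		panic(err)
	}
	fmt.Println(report("pkg/recipe", base, head))
	fmt.Println("needs justification:", needsJustification(base, head))
}
```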

Integration Tests (Kind)

When a test needs a real Kubernetes API — controller logic, RBAC behavior, watch semantics — use Kind. make dev-env spins up a Kind cluster and starts Tilt; make dev-env-clean tears it down. Prefer controller-runtime’s envtest (local apiserver/etcd, no Kind) for unit-scope controller tests; reserve Kind for cross-package or deployer-output flows that need a full cluster.

$ make dev-env         # Kind + Tilt running
$ make dev-env-clean   # delete cluster, stop Tilt

Live-cluster tests still set --no-cluster when the call path includes the validator — the flag suppresses validator RBAC and Job deployment; cluster-touching code under test is unaffected.

Chainsaw Health Checks

Each component in the registry can carry an optional Chainsaw assert YAML that runs after the bundle deploys to confirm the component is healthy: pods Ready, services resolve, CRDs Established, custom resources reach a known phase. The asserts live alongside the component:

recipes/checks/<component>/health-check.yaml

Discover them and run locally against a Kind cluster:

$ make check-health COMPONENT=gpu-operator   # single component
$ make check-health-all                      # all components
$ make validate-local RECIPE=recipe.yaml     # full pipeline

When to add one: the contributor wants to verify, after deploy, that the component’s pods reach Ready, its CRDs are Established, its operator deploys its custom resources, or its services resolve. Anything subtler than “is it alive” belongs in a container-per-validator check (see /aicr/contributor-guide/validators) so the assertion runs as part of a tracked validation phase rather than a one-shot Chainsaw step.

Always include --no-cluster on aicr invocations inside the chainsaw script. A chainsaw test that runs aicr validate without the flag will attempt to install RBAC into the test cluster, which diverges from the assertion’s intent (the validator job is the unit under test elsewhere).

Components with an assert file today include gpu-operator, network-operator, nfd, cert-manager, nvsentinel, kueue, kubeflow-trainer, slinky-slurm, and more — browse recipes/checks/ for the full list.

KWOK Matrix Testing

KWOK (Kubernetes WithOut Kubelet) simulates a GPU cluster without real hardware. CI uses it to validate two things per recipe:

  1. Scheduling shape. Node selectors, tolerations, and resource requests render correctly and land pods on the simulated nodes the overlay expects.
  2. Deployer output correctness. The same recipe is re-rendered through every output adapter — helm, argocd-oci, argocd-helm-oci, flux-oci, flux-git — and each renders to a working bundle the GitOps controllers can reconcile.

KWOK nodes have no kubelet, so pods never actually run. KWOK testing does not exercise runtime validators (NCCL, inference-perf) or component health — for those, see /aicr/contributor-guide/validators. If a recipe’s constraints reference dimensions the KWOK node profiles do not provide (e.g., a GPU model no profile registers), the bundle renders but pods stay Pending or land on the wrong nodes. Extend kwok/profiles/ rather than relax the recipe — KWOK is the simulated reflection of production shape, not a relaxed substitute.
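The Pending outcome follows from how node selectors match labels. The sketch below is a deliberately simplified illustration of that matching rule only: it ignores tolerations and resource requests, and the label key example.com/gpu-model and the node names are hypothetical, not taken from kwok/profiles/.

```go
package main

import (
	"fmt"
	"sort"
)

// selectorMatches reports whether every key/value pair in the pod's node
// selector is present with the same value in the node's labels.
func selectorMatches(selector, nodeLabels map[string]string) bool {
	for key, want := range selector {
		if got, ok := nodeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// schedulableNodes returns the sorted names of nodes that satisfy selector.
// An empty result corresponds to a pod that stays Pending.
func schedulableNodes(selector map[string]string, nodes map[string]map[string]string) []string {
	var names []string
	for name, labels := range nodes {
		if selectorMatches(selector, labels) {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// Hypothetical simulated nodes, as a node profile might register them.
	nodes := map[string]map[string]string{
		"kwok-gpu-0": {"example.com/gpu-model": "h100"},
		"kwok-cpu-0": {},
	}
	fmt.Println(schedulableNodes(map[string]string{"example.com/gpu-model": "h100"}, nodes))
	// A model that no profile registers matches nothing.
	fmt.Println(len(schedulableNodes(map[string]string{"example.com/gpu-model": "b200"}, nodes)))
}
```

In this toy model, adding a node with the missing label (the analogue of extending kwok/profiles/) makes the second query match, while loosening the selector would hide the mismatch.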

For the design rationale and the spike findings that justify the chart pin and Repository-secret shape, see ADR-008; for the Git-source lanes (in-cluster Gitea, flux-git), see ADR-010. For cluster-level KWOK setup (node profiles, recipe auto-discovery), see kwok/README.md.

Deployer Coverage Matrix

| Tier | Trigger | Deployers exercised |
|---|---|---|
| Tier 1: generic overlays | every PR + push | helm, argocd-oci, argocd-helm-oci, flux-oci, flux-git |
| Tier 2: diff-aware accelerator overlays | PR only, conditional on changed files | helm only |
| Tier 3: full overlay set | push to main + nightly schedule | helm, argocd-oci, argocd-helm-oci, flux-oci, flux-git |

| Lane | Pull artifact | Apply manifests | Reconcile to Ready |
|---|---|---|---|
| helm | n/a (filesystem) | helm install | pods scheduled |
| argocd-oci | repo-server OCI pull | Argo CD sync | Synced+Healthy |
| argocd-helm-oci | helm pull OCI | wrapper chart install | Synced+Healthy |
| flux-oci | source-controller OCI pull | kustomize-controller apply | all HelmReleases Ready=True + ArtifactGenerators Ready |
| flux-git | source-controller Git clone (in-cluster Gitea) | kustomize-controller apply | GitRepositories Ready + all HelmReleases Ready=True |

Tier 2 stays helm-only because its job is to verify accelerator-specific overlays still render correctly when their inputs change. The deployer shape is orthogonal — re-running through Argo CD would only re-exercise template rendering, which Tier 1 and Tier 3 already cover on the generic overlays. The flux-git lane covers the filesystem (Git-source) round-trip half of #963; the argocd-git half is a follow-up on the same Gitea infrastructure.

Running KWOK Locally

$ unset GITLAB_TOKEN
$ make build
$ make kwok-cluster

# Single recipe + single deployer
$ make kwok-test-deployer RECIPE=eks-training DEPLOYER=argocd-oci

Valid DEPLOYER values: helm, argocd-oci, argocd-helm-oci, flux-oci, flux-git. The target invokes kwok/scripts/run-all-recipes.sh --deployer <name> <recipe>, which calls install-infra.sh once with DEPLOYER exported (in-cluster registry:2 always; Argo CD for argocd-*; Flux 2 controllers for flux-*; Gitea additionally for flux-git), then runs validate-scheduling.sh for the recipe.

Registry host port. The Kind cluster exposes the in-cluster registry:2 Service on host port 5500 (kwok/kind-config.yaml’s extraPortMappings). 5500 avoids Apple ControlCenter (AirPlay / Handoff) which listens on host port 5000 by default on macOS. Linux runners have 5500 free too. The in-cluster NodePort (30500) and Service containerPort (5000) are hardcoded and independent — Argo CD’s repo-server reaches the registry via Service DNS (registry.aicr-registry.svc.cluster.local:5000) regardless.

Gitea host port (flux-git). The same pattern exposes the in-cluster Gitea on host port 3300 (NodePort 30300) so the runner can git push the filesystem bundle; 3300 avoids Gitea’s default 3000, commonly held by Grafana / local dev servers. Flux’s source-controller clones via Service DNS (gitea.aicr-registry.svc.cluster.local:3000). Clusters created before the 3300 mapping existed must be recreated (kind delete cluster --name aicr-kwok-test) — install-infra.sh exit code 71 is the telltale.

Sweeping all deployers locally. make kwok-test-all defaults to helm; there is no matrix-aware make target. Loop in shell:

$ for d in helm argocd-oci argocd-helm-oci flux-oci flux-git; do
>   make kwok-test-deployer RECIPE=eks-training DEPLOYER="$d" || break
> done

# Full recipe set under a single deployer:
$ bash kwok/scripts/run-all-recipes.sh --deployer argocd-oci

Failure Modes and Exit Codes

The three scripts emit distinct exit codes so CI, the Make target, and local loops can branch on failure mode without parsing logs.

| Script | Code | Meaning |
|---|---|---|
| install-infra.sh | 10 | yq missing or .settings.yaml field absent |
| install-infra.sh | 20 | Registry Deployment not Ready within 120 s |
| install-infra.sh | 21 | Registry not reachable on host port within 60 s |
| install-infra.sh | 30 | Argo CD Helm install failed |
| install-infra.sh | 31 | applications.argoproj.io CRD not Established within 120 s |
| install-infra.sh | 40 | Repository secret apply failed |
| install-infra.sh | 60 | Flux install manifest apply failed |
| install-infra.sh | 61 | Flux controller not Ready within 180 s |
| install-infra.sh | 62 | Flux CRDs not Established within 60 s |
| install-infra.sh | 70 | Gitea Deployment not Ready within 120 s |
| install-infra.sh | 71 | Gitea not reachable on host port within 60 s (cluster likely predates the 3300 port mapping) |
| install-infra.sh | 72 | Gitea admin user bootstrap failed |
| validate-scheduling.sh | 50 | GitOps sync deadline hit |
| run-all-recipes.sh | 50 | Three consecutive GitOps sync timeouts; ADR-008 3-strike rule tripped |

Exit code 50 is distinct so the 3-strike rule in run-all-recipes.sh counts only sync-deadline strikes, not bundle-render or scheduling-shape failures.
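The counting rule is easy to get subtly wrong, so here is a minimal sketch of it. It assumes that any outcome other than code 50 (a success or a different failure) resets the streak; that reset behavior is an assumption based on the word "consecutive", and strikeIndex is a hypothetical helper, not a function in run-all-recipes.sh.

```go
package main

import "fmt"

// exitSyncDeadline is the code validate-scheduling.sh uses for a GitOps
// sync deadline, per the table above.
const exitSyncDeadline = 50

// strikeIndex walks per-recipe exit codes in order and returns the index
// at which `limit` consecutive sync-deadline exits have been seen, or -1
// if the rule never trips. Assumption: any other code resets the streak.
func strikeIndex(codes []int, limit int) int {
	streak := 0
	for i, code := range codes {
		if code != exitSyncDeadline {
			streak = 0
			continue
		}
		streak++
		if streak == limit {
			return i
		}
	}
	return -1
}

func main() {
	// Two timeouts, a pass, then three timeouts in a row.
	codes := []int{50, 50, 0, 50, 50, 50, 0}
	fmt.Println("abort at recipe index:", strikeIndex(codes, 3))

	// Other failures (1 here is an arbitrary non-50 code) do not count.
	fmt.Println("abort at recipe index:", strikeIndex([]int{50, 1, 50, 1, 50}, 3))
}
```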

Tuning the Sync Deadline

Four environment variables shape how long the GitOps lanes wait before declaring a sync timeout. Argo CD and Flux pairs are independent.

| Variable | Default | Purpose |
|---|---|---|
| KWOK_ARGOCD_SYNC_TIMEOUT | 300 s | Deadline for all child Argo CD Applications to reach Synced+Healthy |
| KWOK_ARGOCD_ROOT_GRACE | 30 s | Grace period for the root Application before deadline counting starts |
| KWOK_FLUX_SYNC_TIMEOUT | 500 s | Deadline for source fetch (OCIRepository or GitRepository) + Kustomization apply + HelmReleases Ready=True + ArtifactGenerators Ready |
| KWOK_FLUX_ROOT_GRACE | 30 s | Grace period for the outer Kustomization before deadline counting starts |

The flux-git lane additionally honors KWOK_GITEA_HOST_PORT (default 3300), KWOK_GITEA_USER (default aicr), and KWOK_GITEA_PASSWORD (default aicr-kwok-ci) — shared between install-infra.sh (Gitea install + admin bootstrap) and validate-scheduling.sh (git push). The password is a CI-only credential for the ephemeral in-cluster Gitea, not a secret.

On a clean local Kind cluster Synced+Healthy lands in ~30 s; the 300-second default exists to absorb CI variance. If a local run trips code 50 but the cluster is otherwise healthy, raise the relevant timeout before assuming the recipe is broken — cold-cluster image pulls are the most common cause.

Debugging CI Failures

When kwok-test fails, it uploads an artifact named kwok-debug-<recipe>-<deployer>-<run_id> containing:

  • <cluster>-resources.txt, <cluster>-nodes.txt, <cluster>-pods.txt, <cluster>-events.txt
  • <cluster>-argo-apps.yaml plus the repo-server and application-controller logs (argocd lanes)
  • <cluster>-flux-resources.yaml (OCIRepositories, GitRepositories, Kustomizations, HelmReleases, ArtifactGenerators, ExternalArtifacts) plus source-, kustomize-, and helm-controller logs (flux-* lanes)
  • <cluster>-registry.log — last 200 lines of the in-cluster registry:2
  • <cluster>-gitea.log — last 200 lines of the in-cluster Gitea (flux-git lane)

Start with the repo-server log (Argo CD) or source-controller log (Flux) for OCI-pull failures. Application-controller / kustomize-controller logs show reconciliation decisions and prune behavior; helm-controller logs surface per-HelmRelease install outcomes.

Adding a New Deployer Value

The deployer set is finite and matches what pkg/bundler emits. To add a new value (say, argocd-git):

  1. Add a case branch in kwok/scripts/validate-scheduling.sh’s resolve_argocd_root_app() (or resolve_flux_root_names() for Flux-reconciled lanes), plus branches in generate_bundle and deploy_bundle. Reuse the existing argocd-oci / flux-oci / flux-git branches as templates (flux-git is the closest model for Git-source lanes — Gitea install, dual-view URLs, push-to-create).
  2. Extend the DEPLOYER allowlist in kwok/scripts/run-all-recipes.sh (case "$DEPLOYER" in in main()).
  3. Extend the case "${DEPLOYER}" branches in install-infra.sh’s main() so the right controller stack is installed.
  4. Extend the deployer: input description in .github/actions/kwok-test/action.yml.
  5. Add the value to the deployer: matrix in Tier 1 and Tier 3 of .github/workflows/kwok-recipes.yaml. Leave Tier 2 alone — the orthogonality rationale above still applies.
  6. Add a row to the Deployer Coverage Matrix above so contributors can discover the new lane.

If the new value requires changes to in-cluster infra (different registry, different Argo CD chart, additional CRDs), update install-infra.sh and pin any new versions in .settings.yaml. The exit-code taxonomy in Failure Modes and Exit Codes is grouped by tens rather than contiguous (20s for the registry, 30s for Argo CD, 60s for Flux, 70s for Gitea); if a new distinct failure mode appears, pick the next free code in the matching group.

E2E Tests

./tools/e2e is the end-to-end pipeline runner. It builds, snapshots, generates recipes, validates, bundles, and (when credentials are available) deploys against real cloud accounts.

$ unset GITLAB_TOKEN
$ ./tools/e2e

make qualify invokes the e2e step as part of the pre-push gate. CI runs the same script in the push workflow. Cloud credentials are optional — without them, the e2e exercises the artifact-generation half of the pipeline and skips deploy-side assertions.

The make qualify Gate

make qualify is the canonical pre-push command. It runs:

  • test-coverage: go test -race ./... plus the 75% coverage floor.
  • lint: golangci-lint with .golangci.yaml plus yamllint.
  • e2e: the end-to-end pipeline runner.
  • scan: Grype vulnerability scan.
  • license-check: license header / dependency-license sweep.

CI runs the equivalent, so if make qualify passes locally on the current branch, push CI should pass too.

Branch lint gate for Go changes. If a PR changes any .go file, you must also run:

$ golangci-lint run -c .golangci.yaml ./pkg/<affected>/...
$ golangci-lint run -c .golangci.yaml ./...   # full sweep

This applies even to PRs labeled documentation when they include incidental Go changes. Do not rely on CI to surface lint failures — the pre-push gate is local.

Common Gotchas

  • goreleaser fails when both GITLAB_TOKEN and GITHUB_TOKEN are set. make build, make qualify, and ./tools/e2e all invoke goreleaser indirectly. Always unset GITLAB_TOKEN in the shell first. This is one of the most common failures that shows up only locally while CI passes.
  • Forgetting make bom-docs after a recipes/registry.yaml, component values, or chart-pin change. docs/user/container-images.md goes stale silently — make bom-check is opt-in only and not wired into make qualify, make lint, or the merge gate today. CI does not catch this. Run make bom-docs locally any time the change touches charts.
  • A per-package coverage decrease > 0.5% is flagged and must be justified in the PR. Add tests rather than reaching for // nolint or t.Skip; both are review-blockers under the no-skip-tests rule in CLAUDE.md.
  • Live-cluster connections from unit tests. A test that forgets --no-cluster will attach to whichever kubeconfig is current and create RBAC against it. Always pass WithNoCluster(true) (Go) or --no-cluster (CLI / chainsaw) on the validator path.
  • CLI tests asserting on stdout. pkg/cli writes through cmd.Root().Writer. A test that captures os.Stdout will see nothing. Use cmd.SetOut(buf) and assert on buf.String().

See Also