KWOK Deployer Matrix Testing
KWOK (Kubernetes WithOut Kubelet) simulates a GPU cluster without real
hardware so CI can validate scheduling shape — node selectors,
tolerations, resource requests — for every recipe. The deployer
matrix extends that coverage by re-running the same recipes through
three additional output adapters (argocd-oci, argocd-helm-oci,
flux-oci) alongside the existing helm path. This catches
Argo CD / Flux template regressions and OCI-source compatibility
breaks that the helm-only lane cannot see.
For the design rationale and the spike findings that justify the chart pin and Repository-secret shape, see ADR-008. For the cluster-level KWOK setup (node profiles, recipe auto-discovery), see kwok/README.md.
Deployer Coverage Matrix
What each lane validates
flux-oci validates full reconciliation: OCIRepository pulled, Kustomization applied, all HelmReleases reach Ready=True, and when local-chart components are present, ArtifactGenerator CRs (source.extensions.fluxcd.io/v1beta1) reach Ready=True. ArtifactGenerators extract local-chart sub-directories from the outer OCIRepository into ExternalArtifact resources, which HelmReleases reference via spec.chartRef. This requires source-watcher (installed via flux install --components-extra=source-watcher) and the ExternalArtifact=true feature gate on helm-controller.
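The readiness conditions above can be spot-checked by hand with kubectl wait. The sketch below is not taken from validate-scheduling.sh; flux_ready_checks is a hypothetical helper that only prints the commands, and the 300s default is an assumption borrowed from the sync-deadline default described later on this page.

```shell
# Sketch only: flux_ready_checks is hypothetical and prints commands rather
# than running them. The resource kinds come from the text above.
flux_ready_checks() {
  local timeout="${1:-300s}" kind
  for kind in ocirepositories kustomizations helmreleases artifactgenerators; do
    echo "kubectl wait --for=condition=Ready ${kind} --all --all-namespaces --timeout=${timeout}"
  done
}
```

Review the printed commands and run the ones that apply. The artifactgenerators line is only meaningful when local-chart components are present, because kubectl wait reports an error when no matching resources exist.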
For filesystem (Git-source) round-trip coverage of argocd / flux, see #963.
Tier 2 stays helm-only because its job is to verify that an
accelerator-specific overlay still produces a correct bundle when the
diff touches that overlay’s inputs. The deployer shape is orthogonal
to that question — re-running the same recipe under Argo CD would only
re-exercise template rendering, which Tier 1 and Tier 3 already cover
on the generic overlays.
Running Locally
make kwok-test-deployer is the single entry point that mirrors what
CI runs per matrix cell. It expects a KWOK cluster to already exist.
Valid DEPLOYER values: helm, argocd-oci, argocd-helm-oci,
flux-oci. The target invokes kwok/scripts/run-all-recipes.sh --deployer <name> <recipe>, which calls install-infra.sh once with
DEPLOYER exported (in-cluster registry:2 always; Argo CD for
argocd-*; Flux 2 controllers for flux-oci), then runs
validate-scheduling.sh for the recipe.
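As a minimal sketch of a local invocation, the wrapper below checks the deployer against the four valid values before calling the Make target. The RECIPE variable name, the run_deployer_lane function, and the MAKE override are assumptions made for illustration; check the Makefile for the variable names the target actually reads.

```shell
#!/usr/bin/env bash
# Sketch only: RECIPE= and the MAKE override are assumed names, not confirmed
# by this page. Only the DEPLOYER values come from the text above.

run_deployer_lane() {
  local deployer="$1" recipe="$2"
  case "${deployer}" in
    helm|argocd-oci|argocd-helm-oci|flux-oci) ;;
    *)
      echo "unknown deployer: ${deployer}" >&2
      return 2
      ;;
  esac
  # ${MAKE:-make} lets a dry run substitute echo for make.
  "${MAKE:-make}" kwok-test-deployer DEPLOYER="${deployer}" RECIPE="${recipe}"
}
```

Calling run_deployer_lane argocd-oci <recipe> runs one lane; prefixing the call with MAKE=echo prints the command instead of executing it.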
Registry host port
The Kind cluster exposes the in-cluster registry:2 Service on host
port 5500 (kwok/kind-config.yaml’s extraPortMappings). The
unconventional choice avoids Apple ControlCenter (AirPlay / Handoff),
which listens on host port 5000 by default on macOS and would otherwise
fail kind create cluster with a port-bind error. Linux CI runners have
5500 free as well, so the same default works everywhere.
The in-cluster NodePort (30500) and the Service port (5000) are
hardcoded and independent of the host-side mapping: Argo CD’s
repo-server reaches the registry via Service DNS
(registry.aicr-registry.svc.cluster.local:5000) regardless of the host port.
Override KWOK_REGISTRY_HOST_PORT only when running against a
non-standard cluster topology (e.g., port 5500 is already in use on
your host). The variable changes only the port that the reachability
probe targets; it does not remap anything, so you must also update
hostPort in kwok/kind-config.yaml to match.
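To illustrate how the host port feeds a reachability check, here is a small sketch. The helper names are hypothetical and the probe in the real scripts may differ; the only facts relied on are the 5500 default, the KWOK_REGISTRY_HOST_PORT variable, and that registry:2 answers on the standard /v2/ base endpoint of the OCI distribution API.

```shell
# Sketch only: registry_probe_url and registry_reachable are hypothetical
# helpers, not functions from the kwok scripts.

registry_probe_url() {
  # Default matches the 5500 host port described above.
  local port="${KWOK_REGISTRY_HOST_PORT:-5500}"
  echo "http://localhost:${port}/v2/"
}

registry_reachable() {
  # -f: fail on HTTP errors, -s: quiet, -S: still print errors.
  curl -fsS --max-time 5 -o /dev/null "$(registry_probe_url)"
}
```

For example, KWOK_REGISTRY_HOST_PORT=5501 registry_reachable succeeds only if the Kind port mapping was changed to 5501 as well.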
Sweeping All Deployers Locally
make kwok-test-all still defaults to helm and there is no
matrix-aware make target — CI does the fan-out via the workflow
matrix. To reproduce the matrix locally, loop in shell:
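One possible shape for such a loop is sketched here. The RECIPE variable name, the sweep_deployers function, and the MAKE override are assumptions for illustration; the deployer list is the set of valid DEPLOYER values given under Running Locally.

```shell
# Sketch only: RECIPE= and the MAKE override are assumed names.
sweep_deployers() {
  local recipe="$1" deployer failed=0
  for deployer in helm argocd-oci argocd-helm-oci flux-oci; do
    echo "=== ${deployer} ==="
    if ! "${MAKE:-make}" kwok-test-deployer DEPLOYER="${deployer}" RECIPE="${recipe}"; then
      echo "FAILED: ${deployer}" >&2
      failed=$((failed + 1))
    fi
  done
  # Non-zero when any lane failed, so the loop can gate a larger script.
  [ "${failed}" -eq 0 ]
}
```

The loop keeps going after a failure so that one broken lane does not hide the result of the others.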
For the full recipe set under a single deployer, call the script directly:
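A minimal sketch of that direct call follows. The --deployer flag and script path come from the Running Locally section; omitting the trailing recipe argument to run every recipe is an assumption, and RUN_ALL_SCRIPT is a hypothetical override that exists only to make dry runs possible.

```shell
# Sketch only: RUN_ALL_SCRIPT is a hypothetical override used for dry runs.
# Omitting the <recipe> argument to cover all recipes is an assumption.
run_all_for_deployer() {
  local deployer="$1"
  local script="${RUN_ALL_SCRIPT:-kwok/scripts/run-all-recipes.sh}"
  "${script}" --deployer "${deployer}"
}
```

Run it from the repository root, for example run_all_for_deployer flux-oci.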
Failure Modes and Exit Codes
The three scripts emit distinct exit codes so callers (CI, the Make target, local loops) can branch on failure mode rather than parsing logs.
Exit code 50 from validate-scheduling.sh is intentionally distinct
from generic non-zero exits: a sync-deadline timeout is qualitatively
different from a bundle-render failure or a scheduling-shape mismatch,
and the 3-strike rule in run-all-recipes.sh only counts code-50
strikes.
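A caller can branch on these codes along the following lines. This is a sketch, not code from the scripts: the function names and labels are illustrative, and only code 50 is taken from the text above, so every other non-zero code is lumped into a single bucket.

```shell
# Sketch only: labels are illustrative. Only code 50 (sync-deadline timeout)
# is taken from the text; other non-zero codes are grouped together.
classify_exit() {
  case "$1" in
    0)  echo "ok" ;;
    50) echo "sync-timeout" ;;
    *)  echo "failure" ;;
  esac
}

# Count only code-50 results, mirroring the idea behind the 3-strike rule.
count_timeout_strikes() {
  local code strikes=0
  for code in "$@"; do
    if [ "${code}" -eq 50 ]; then
      strikes=$((strikes + 1))
    fi
  done
  echo "${strikes}"
}
```

After a run, classify_exit "$?" yields a label a local loop can act on, for instance retrying a sync-timeout with a longer deadline while treating any other failure as final.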
Tuning the Sync Deadline
Four environment variables shape how long the GitOps lanes wait before declaring a sync timeout. The Argo CD pair is independent of the Flux pair so the two GitOps lanes can be tuned separately when a recipe has deployer-specific reconciliation overhead.
On a clean local Kind cluster the Phase-0 spike observed
Synced+Healthy in roughly 30 seconds. CI runners are slower and
contended, so the 300-second default exists to absorb that variance.
If a local run trips code 50 but the cluster is otherwise healthy,
raise KWOK_ARGOCD_SYNC_TIMEOUT / KWOK_FLUX_SYNC_TIMEOUT before
assuming the recipe is broken — image pulls on a cold cluster are the
most common cause.
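Because the Argo CD and Flux timeouts are separate variables, it helps to pick the right one for the lane being run. The helper below is a hypothetical sketch; the two variable names come from the text, while treating the value as a number of seconds is an assumption based on the 300-second default.

```shell
# Sketch only: timeout_var_for is a hypothetical helper. Values are assumed
# to be seconds, inferred from the 300-second default.
timeout_var_for() {
  case "$1" in
    argocd-oci|argocd-helm-oci) echo "KWOK_ARGOCD_SYNC_TIMEOUT" ;;
    flux-oci)                   echo "KWOK_FLUX_SYNC_TIMEOUT" ;;
    *)                          return 1 ;;  # helm is not a GitOps lane
  esac
}
```

A one-off run with a doubled deadline could then look like env "$(timeout_var_for argocd-oci)=600" make kwok-test-deployer DEPLOYER=argocd-oci.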
Debugging CI Failures
When the kwok-test composite action fails, it uploads an artifact
named kwok-debug-<recipe>-<deployer>-<run_id> containing:
- <cluster>-resources.txt: kubectl get all --all-namespaces
- <cluster>-nodes.txt: kubectl get nodes -o wide
- <cluster>-events.txt: events sorted by lastTimestamp
- <cluster>-pods.txt: pod listing across all namespaces
- <cluster>-argo-apps.yaml: applications.argoproj.io YAML in the argocd namespace (argocd lanes only)
- <cluster>-argo-reposerver.log: last 500 lines of argocd-repo-server
- <cluster>-argo-appcontroller.log: last 500 lines of argocd-application-controller
- <cluster>-flux-resources.yaml: YAML for ocirepositories, kustomizations, helmreleases, artifactgenerators, and externalartifacts across all namespaces (flux-oci lane only)
- <cluster>-flux-source-controller.log: last 500 lines of source-controller
- <cluster>-flux-kustomize-controller.log: last 500 lines of kustomize-controller
- <cluster>-flux-helm-controller.log: last 500 lines of helm-controller
- <cluster>-registry.log: last 200 lines of the in-cluster registry:2 Deployment
The repo-server log (Argo CD) and source-controller log (Flux) are
the first places to look for OCI-pull failures (plain-HTTP scheme
errors, mediaType mismatches). The application-controller (Argo CD)
and kustomize-controller (Flux) logs show reconciliation decisions
and prune behavior; helm-controller logs surface per-HelmRelease
install and upgrade outcomes.
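To pull the debug artifact onto a workstation, something like the following sketch can be used with the GitHub CLI. The helper names are hypothetical; the artifact name pattern is the one given above, and gh run download with --name and --dir is the standard way to fetch a named artifact from a workflow run.

```shell
# Sketch only: both helpers are hypothetical. The name pattern
# kwok-debug-<recipe>-<deployer>-<run_id> comes from the text above.
debug_artifact_name() {
  local recipe="$1" deployer="$2" run_id="$3"
  echo "kwok-debug-${recipe}-${deployer}-${run_id}"
}

fetch_debug_artifact() {
  local name
  name="$(debug_artifact_name "$1" "$2" "$3")"
  # Downloads the named artifact of run $3 into a directory of the same name.
  gh run download "$3" --name "${name}" --dir "./${name}"
}
```

After fetch_debug_artifact <recipe> <deployer> <run_id>, open the repo-server or source-controller log first when an OCI pull is suspected.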
Adding a New Deployer Value
The deployer set is intentionally finite and matches what
pkg/bundler emits. To add a new value (say, argocd-git):
- Add a case branch in kwok/scripts/validate-scheduling.sh’s resolve_argocd_root_app() (or the equivalent resolve_flux_root_names() if the new lane reconciles via Flux) mapping the new deployer to its root-resource name, plus any new branches in generate_bundle and deploy_bundle. Reuse the existing argocd-oci / flux-oci branches as templates.
- Extend the DEPLOYER allowlist in kwok/scripts/run-all-recipes.sh (the case "$DEPLOYER" in block in main()).
- Extend the case "${DEPLOYER}" branches in kwok/scripts/install-infra.sh’s main() so the right controller stack is installed for the new deployer.
- Extend the deployer: input description in .github/actions/kwok-test/action.yml so callers see the new value in the input docs.
- Add the value to the deployer: matrix in Tier 1 and Tier 3 of .github/workflows/kwok-recipes.yaml. Leave Tier 2 alone; the orthogonality rationale above still applies.
- Add a row to the Deployer Coverage Matrix table on this page so contributors can discover the new lane.
If the new value requires changes to the in-cluster infra (a different
registry, a different Argo CD chart, additional CRDs), update
install-infra.sh and pin any new versions in .settings.yaml rather
than hardcoding. The exit-code taxonomy in
Failure Modes and Exit Codes is
contiguous — pick the next free code if a new distinct failure mode
appears.
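For orientation, the allowlist step in the list above might take roughly the following shape once argocd-git is added. This is an illustrative sketch and not the actual block from run-all-recipes.sh; validate_deployer and its error message are hypothetical.

```shell
# Sketch only: not the real code from run-all-recipes.sh. It shows the
# general shape of a case-based allowlist with argocd-git added.
validate_deployer() {
  local DEPLOYER="$1"
  case "$DEPLOYER" in
    helm|argocd-oci|argocd-helm-oci|flux-oci) return 0 ;;
    argocd-git) return 0 ;;  # hypothetical new value from the example above
    *)
      echo "unsupported deployer: $DEPLOYER" >&2
      return 1
      ;;
  esac
}
```

Keeping the new value on its own line makes the diff easy to review and easy to revert if the lane is dropped.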