KWOK (Kubernetes WithOut Kubelet) simulates a GPU cluster without real
hardware so CI can validate scheduling shape — node selectors,
tolerations, resource requests — for every recipe. The deployer
matrix extends that coverage by re-running the same recipes through
three additional output adapters (argocd-oci, argocd-helm-oci,
flux-oci) in addition to the existing helm path. This catches
Argo CD / Flux template regressions and OCI-source compatibility
breaks that the helm-only lane could never see.
For the design rationale and the spike findings that justify the chart pin and Repository-secret shape, see ADR-008. For the cluster-level KWOK setup (node profiles, recipe auto-discovery), see kwok/README.md.
flux-oci validates full reconciliation: OCIRepository pulled, Kustomization applied, all HelmReleases reach Ready=True, and when local-chart components are present, ArtifactGenerator CRs (source.extensions.fluxcd.io/v1beta1) reach Ready=True. ArtifactGenerators extract local-chart sub-directories from the outer OCIRepository into ExternalArtifact resources, which HelmReleases reference via spec.chartRef. This requires source-watcher (installed via flux install --components-extra=source-watcher) and the ExternalArtifact=true feature gate on helm-controller.
For filesystem (Git-source) round-trip coverage of argocd / flux, see #963.
Tier 2 stays helm-only because its job is to verify that an
accelerator-specific overlay still produces a correct bundle when the
diff touches that overlay’s inputs. The deployer shape is orthogonal
to that question — re-running the same recipe under Argo CD would only
re-exercise template rendering, which Tier 1 and Tier 3 already cover
on the generic overlays.
make kwok-test-deployer is the single entry point that mirrors what
CI runs per matrix cell. It expects a KWOK cluster to already exist.
Valid DEPLOYER values: helm, argocd-oci, argocd-helm-oci,
flux-oci. The target invokes kwok/scripts/run-all-recipes.sh --deployer <name> <recipe>, which calls install-infra.sh once with
DEPLOYER exported (in-cluster registry:2 always; Argo CD for
argocd-*; Flux 2 controllers for flux-oci), then runs
validate-scheduling.sh for the recipe.
The Kind cluster exposes the in-cluster registry:2 Service on host
port 5500 (kwok/kind-config.yaml’s extraPortMappings). The
unconventional choice avoids Apple ControlCenter (AirPlay / Handoff),
which listens on host port 5000 by default on macOS and would otherwise
fail kind create cluster with a port-bind error. Linux CI runners have
5500 free as well, so the same default works everywhere.
The in-cluster NodePort (30500) and the Service containerPort
(5000) are hardcoded and independent of the host-side mapping —
Argo CD’s repo-server reaches the registry via Service DNS
(registry.aicr-registry.svc.cluster.local:5000) regardless.
Override KWOK_REGISTRY_HOST_PORT only when running against a
non-standard cluster topology (e.g., port 5500 is already in use on
your host). The variable adjusts the reachability probe; you still need
to update kwok/kind-config.yaml’s hostPort to match.
make kwok-test-all still defaults to helm and there is no
matrix-aware make target — CI does the fan-out via the workflow
matrix. To reproduce the matrix locally, loop in shell:
For the full recipe set under a single deployer, call the script directly:
The three scripts emit distinct exit codes so callers (CI, the Make target, local loops) can branch on failure mode rather than parsing logs.
Exit code 50 from validate-scheduling.sh is intentionally distinct
from generic non-zero exits: a sync-deadline timeout is qualitatively
different from a bundle-render failure or a scheduling-shape mismatch,
and the 3-strike rule in run-all-recipes.sh only counts code-50
strikes.
Four environment variables shape how long the GitOps lanes wait before declaring a sync timeout. The Argo CD pair is independent of the Flux pair so the two GitOps lanes can be tuned separately when a recipe has deployer-specific reconciliation overhead.
On a clean local Kind cluster the Phase-0 spike observed
Synced+Healthy in roughly 30 seconds. CI runners are slower and
contended, so the 300-second default exists to absorb that variance.
If a local run trips code 50 but the cluster is otherwise healthy,
raise KWOK_ARGOCD_SYNC_TIMEOUT / KWOK_FLUX_SYNC_TIMEOUT before
assuming the recipe is broken — image pulls on a cold cluster are the
most common cause.
When the kwok-test composite action fails, it uploads an artifact
named kwok-debug-<recipe>-<deployer>-<run_id> containing:
<cluster>-resources.txt — kubectl get all --all-namespaces<cluster>-nodes.txt — kubectl get nodes -o wide<cluster>-events.txt — events sorted by lastTimestamp<cluster>-pods.txt — pod listing across all namespaces<cluster>-argo-apps.yaml — applications.argoproj.io YAML in argocd namespace (argocd lanes only)<cluster>-argo-reposerver.log — last 500 lines of argocd-repo-server<cluster>-argo-appcontroller.log — last 500 lines of argocd-application-controller<cluster>-flux-resources.yaml — YAML for ocirepositories, kustomizations, helmreleases, artifactgenerators, and externalartifacts across all namespaces (flux-oci lane only)<cluster>-flux-source-controller.log — last 500 lines of source-controller<cluster>-flux-kustomize-controller.log — last 500 lines of kustomize-controller<cluster>-flux-helm-controller.log — last 500 lines of helm-controller<cluster>-registry.log — last 200 lines of the in-cluster registry:2 DeploymentThe repo-server log (Argo CD) and source-controller log (Flux) are
the first places to look for OCI-pull failures (plain-HTTP scheme
errors, mediaType mismatches). The application-controller (Argo CD)
and kustomize-controller (Flux) logs show reconciliation decisions
and prune behavior; helm-controller logs surface per-HelmRelease
install and upgrade outcomes.
The deployer set is intentionally finite and matches what
pkg/bundler emits. To add a new value (say, argocd-git):
case branch in kwok/scripts/validate-scheduling.sh’s
resolve_argocd_root_app() (or the equivalent resolve_flux_root_names()
if the new lane reconciles via Flux) mapping the new deployer to its
root-resource name, plus any new branches in generate_bundle and
deploy_bundle. Reuse the existing argocd-oci / flux-oci branches
as templates.DEPLOYER allowlist in kwok/scripts/run-all-recipes.sh
(the case "$DEPLOYER" in block in main()).case "${DEPLOYER}" branches in
kwok/scripts/install-infra.sh’s main() so the right controller
stack is installed for the new deployer.deployer: input description in
.github/actions/kwok-test/action.yml so callers see the new
value in the input docs.deployer: matrix in Tier 1 and Tier 3 of
.github/workflows/kwok-recipes.yaml. Leave Tier 2 alone — the
orthogonality rationale above still applies.If the new value requires changes to the in-cluster infra (a different
registry, a different Argo CD chart, additional CRDs), update
install-infra.sh and pin any new versions in .settings.yaml rather
than hardcoding. The exit-code taxonomy in
Failure Modes and Exit Codes is
contiguous — pick the next free code if a new distinct failure mode
appears.