For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
  • Overview
    • Introduction
  • User Guide
    • Installation
    • CLI Reference
    • API Reference
    • Agent Deployment
    • Air-Gapped Mirroring
    • Component Catalog
    • Validation
  • Integrator Guide
    • Automation
    • Data Flow
    • Kubernetes Deployment
    • AKS GPU Setup
    • EKS Dynamo Networking
    • GKE TCPXO Networking
    • Recipe Development
    • Data Extension
    • Validator Extension
  • Contributor Guide
    • Architecture Overview
    • CLI
    • API Server
    • API Server Extension Patterns
    • Data Architecture
    • Component Development
    • Validations
    • Validator Development
    • Maintaining Recipe Contributions
    • KWOK Deployer Matrix Testing
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoAI Cluster Runtime
On this page
  • Deployer Coverage Matrix
  • What each lane validates
  • Running Locally
  • Registry host port
  • Sweeping All Deployers Locally
  • Failure Modes and Exit Codes
  • Tuning the Sync Deadline
  • Debugging CI Failures
  • Adding a New Deployer Value
Contributor Guide

KWOK Deployer Matrix Testing

||View as Markdown|
Previous

Maintaining Recipe Contributions

KWOK (Kubernetes WithOut Kubelet) simulates a GPU cluster without real hardware so CI can validate scheduling shape — node selectors, tolerations, resource requests — for every recipe. The deployer matrix extends that coverage by re-running the same recipes through three additional output adapters (argocd-oci, argocd-helm-oci, flux-oci) in addition to the existing helm path. This catches Argo CD / Flux template regressions and OCI-source compatibility breaks that the helm-only lane could never see.

For the design rationale and the spike findings that justify the chart pin and Repository-secret shape, see ADR-008. For the cluster-level KWOK setup (node profiles, recipe auto-discovery), see kwok/README.md.

Deployer Coverage Matrix

TierTriggerDeployers exercised
Tier 1 — generic overlaysevery PR + pushhelm, argocd-oci, argocd-helm-oci, flux-oci
Tier 2 — diff-aware accelerator overlaysPR only, conditional on changed fileshelm only
Tier 3 — full overlay setpush to main + nightly schedulehelm, argocd-oci, argocd-helm-oci, flux-oci

What each lane validates

LanePull artifactApply manifestsReconcile to Ready
helmn/a (filesystem)✅ (helm install)✅ pods scheduled
argocd-oci✅ (repo-server OCI pull)✅ (Argo CD sync)✅ Synced+Healthy (or documented partial-pass states)
argocd-helm-oci✅ (helm pull OCI)✅ (wrapper chart install)✅ Synced+Healthy (or documented partial-pass states)
flux-oci✅ (source-controller OCI pull)✅ (kustomize-controller apply)✅ all HelmReleases Ready=True + ArtifactGenerators Ready (when present)

flux-oci validates full reconciliation: OCIRepository pulled, Kustomization applied, all HelmReleases reach Ready=True, and when local-chart components are present, ArtifactGenerator CRs (source.extensions.fluxcd.io/v1beta1) reach Ready=True. ArtifactGenerators extract local-chart sub-directories from the outer OCIRepository into ExternalArtifact resources, which HelmReleases reference via spec.chartRef. This requires source-watcher (installed via flux install --components-extra=source-watcher) and the ExternalArtifact=true feature gate on helm-controller.

For filesystem (Git-source) round-trip coverage of argocd / flux, see #963.

Tier 2 stays helm-only because its job is to verify that an accelerator-specific overlay still produces a correct bundle when the diff touches that overlay’s inputs. The deployer shape is orthogonal to that question — re-running the same recipe under Argo CD would only re-exercise template rendering, which Tier 1 and Tier 3 already cover on the generic overlays.

Running Locally

make kwok-test-deployer is the single entry point that mirrors what CI runs per matrix cell. It expects a KWOK cluster to already exist.

$unset GITLAB_TOKEN
$make build
$make kwok-cluster
$
$# Single recipe + single deployer
$make kwok-test-deployer RECIPE=eks-training DEPLOYER=argocd-oci

Valid DEPLOYER values: helm, argocd-oci, argocd-helm-oci, flux-oci. The target invokes kwok/scripts/run-all-recipes.sh --deployer <name> <recipe>, which calls install-infra.sh once with DEPLOYER exported (in-cluster registry:2 always; Argo CD for argocd-*; Flux 2 controllers for flux-oci), then runs validate-scheduling.sh for the recipe.

Registry host port

The Kind cluster exposes the in-cluster registry:2 Service on host port 5500 (kwok/kind-config.yaml’s extraPortMappings). The unconventional choice avoids Apple ControlCenter (AirPlay / Handoff), which listens on host port 5000 by default on macOS and would otherwise fail kind create cluster with a port-bind error. Linux CI runners have 5500 free as well, so the same default works everywhere.

The in-cluster NodePort (30500) and the Service containerPort (5000) are hardcoded and independent of the host-side mapping — Argo CD’s repo-server reaches the registry via Service DNS (registry.aicr-registry.svc.cluster.local:5000) regardless.

Override KWOK_REGISTRY_HOST_PORT only when running against a non-standard cluster topology (e.g., port 5500 is already in use on your host). The variable adjusts the reachability probe; you still need to update kwok/kind-config.yaml’s hostPort to match.

Sweeping All Deployers Locally

make kwok-test-all still defaults to helm and there is no matrix-aware make target — CI does the fan-out via the workflow matrix. To reproduce the matrix locally, loop in shell:

$for d in helm argocd-oci argocd-helm-oci flux-oci; do
$ make kwok-test-deployer RECIPE=eks-training DEPLOYER="$d" || break
$done

For the full recipe set under a single deployer, call the script directly:

$bash kwok/scripts/run-all-recipes.sh --deployer argocd-oci

Failure Modes and Exit Codes

The three scripts emit distinct exit codes so callers (CI, the Make target, local loops) can branch on failure mode rather than parsing logs.

ScriptCodeMeaning
install-infra.sh10yq missing or required .settings.yaml field absent
install-infra.sh20Registry Deployment not Ready within 120 s
install-infra.sh21Registry not reachable on host port within 60 s
install-infra.sh30Argo CD Helm install failed
install-infra.sh31applications.argoproj.io CRD not Established within 120 s
install-infra.sh40Repository secret apply failed
install-infra.sh60Flux install manifest apply failed
install-infra.sh61Flux controller (source/kustomize/helm) not Ready within 180 s
install-infra.sh62Flux CRDs (OCIRepository/Kustomization/HelmRelease) not Established within 60 s
validate-scheduling.sh50GitOps sync deadline hit (KWOK_ARGOCD_SYNC_TIMEOUT for argocd-*, KWOK_FLUX_SYNC_TIMEOUT for flux-oci); only GitOps lanes can return 50
run-all-recipes.sh50Three consecutive GitOps sync timeouts; tripped the ADR-008 3-strike rule and bailed the whole job

Exit code 50 from validate-scheduling.sh is intentionally distinct from generic non-zero exits: a sync-deadline timeout is qualitatively different from a bundle-render failure or a scheduling-shape mismatch, and the 3-strike rule in run-all-recipes.sh only counts code-50 strikes.

Tuning the Sync Deadline

Four environment variables shape how long the GitOps lanes wait before declaring a sync timeout. The Argo CD pair is independent of the Flux pair so the two GitOps lanes can be tuned separately when a recipe has deployer-specific reconciliation overhead.

VariableDefaultPurpose
KWOK_ARGOCD_SYNC_TIMEOUT300 (seconds)Total deadline for all child Argo CD Applications to reach Synced + Healthy
KWOK_ARGOCD_ROOT_GRACE30 (seconds)Grace period for the root Application to appear before deadline accounting starts
KWOK_FLUX_SYNC_TIMEOUT300 (seconds)Total deadline for OCIRepository fetch + Kustomization apply + all HelmRelease CRs to reach Ready=True + ArtifactGenerator CRs Ready (when present)
KWOK_FLUX_ROOT_GRACE30 (seconds)Grace period for the outer Kustomization to appear before deadline accounting starts

On a clean local Kind cluster the Phase-0 spike observed Synced+Healthy in roughly 30 seconds. CI runners are slower and contended, so the 300-second default exists to absorb that variance. If a local run trips code 50 but the cluster is otherwise healthy, raise KWOK_ARGOCD_SYNC_TIMEOUT / KWOK_FLUX_SYNC_TIMEOUT before assuming the recipe is broken — image pulls on a cold cluster are the most common cause.

Debugging CI Failures

When the kwok-test composite action fails, it uploads an artifact named kwok-debug-<recipe>-<deployer>-<run_id> containing:

  • <cluster>-resources.txt — kubectl get all --all-namespaces
  • <cluster>-nodes.txt — kubectl get nodes -o wide
  • <cluster>-events.txt — events sorted by lastTimestamp
  • <cluster>-pods.txt — pod listing across all namespaces
  • <cluster>-argo-apps.yaml — applications.argoproj.io YAML in argocd namespace (argocd lanes only)
  • <cluster>-argo-reposerver.log — last 500 lines of argocd-repo-server
  • <cluster>-argo-appcontroller.log — last 500 lines of argocd-application-controller
  • <cluster>-flux-resources.yaml — YAML for ocirepositories, kustomizations, helmreleases, artifactgenerators, and externalartifacts across all namespaces (flux-oci lane only)
  • <cluster>-flux-source-controller.log — last 500 lines of source-controller
  • <cluster>-flux-kustomize-controller.log — last 500 lines of kustomize-controller
  • <cluster>-flux-helm-controller.log — last 500 lines of helm-controller
  • <cluster>-registry.log — last 200 lines of the in-cluster registry:2 Deployment

The repo-server log (Argo CD) and source-controller log (Flux) are the first places to look for OCI-pull failures (plain-HTTP scheme errors, mediaType mismatches). The application-controller (Argo CD) and kustomize-controller (Flux) logs show reconciliation decisions and prune behavior; helm-controller logs surface per-HelmRelease install and upgrade outcomes.

Adding a New Deployer Value

The deployer set is intentionally finite and matches what pkg/bundler emits. To add a new value (say, argocd-git):

  1. Add a case branch in kwok/scripts/validate-scheduling.sh’s resolve_argocd_root_app() (or the equivalent resolve_flux_root_names() if the new lane reconciles via Flux) mapping the new deployer to its root-resource name, plus any new branches in generate_bundle and deploy_bundle. Reuse the existing argocd-oci / flux-oci branches as templates.
  2. Extend the DEPLOYER allowlist in kwok/scripts/run-all-recipes.sh (the case "$DEPLOYER" in block in main()).
  3. Extend the case "${DEPLOYER}" branches in kwok/scripts/install-infra.sh’s main() so the right controller stack is installed for the new deployer.
  4. Extend the deployer: input description in .github/actions/kwok-test/action.yml so callers see the new value in the input docs.
  5. Add the value to the deployer: matrix in Tier 1 and Tier 3 of .github/workflows/kwok-recipes.yaml. Leave Tier 2 alone — the orthogonality rationale above still applies.
  6. Add a row to the Deployer Coverage Matrix table on this page so contributors can discover the new lane.

If the new value requires changes to the in-cluster infra (a different registry, a different Argo CD chart, additional CRDs), update install-infra.sh and pin any new versions in .settings.yaml rather than hardcoding. The exit-code taxonomy in Failure Modes and Exit Codes is contiguous — pick the next free code if a new distinct failure mode appears.