Multi-cluster Local Development with Helmfile
Install the NVCF self-hosted control plane on one local k3d cluster and the NVCA operator on a separately registered compute cluster, all driven by the documented Helmfile workflow.
This setup is for local development only. It uses fake GPUs, a single Cassandra replica, and ephemeral storage. Do not use this for production workloads.
Cross-cluster traffic from the compute cluster reaches the control-plane
load balancer via .test host aliases that
tools/ncp-local-cluster/scripts/configure-control-plane-endpoints.sh
provisions:
http://sis.nvcf-control-plane.test:8080
http://reval.nvcf-control-plane.test:8080
nats://nats.nvcf-control-plane.test:4222
Install the following tools:
Docker (running)
k3d v5.x or later
kubectl
helm >= 3.12
helmfile >= 1.1.0, < 1.2.0
helm-diff plugin: helm plugin install https://github.com/databus23/helm-diff
An NGC API key from ngc.nvidia.com with access to the NVCF chart and image registry.
The NGC organization and team slugs that hold the chart/image repository you have access to.
nvcf-cli built from this repo. Steps 9 and 10 pass
NVCF_CLI=$(pwd)/nvcf-cli to the make targets, so the binary must
exist on disk before those steps run:
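The build command itself is not reproduced here. As a minimal sketch, the prerequisites above could be sanity-checked with two small helpers; the function names missing_tools and cli_ready are illustrative and not part of the repo.

```shell
# Print the name of every listed tool that is not found on PATH.
# Prints nothing when all tools are present.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed only if the given path is an executable regular file.
cli_ready() {
  [ -f "$1" ] && [ -x "$1" ]
}

# Example calls (uncomment to run):
# missing_tools docker k3d kubectl helm helmfile
# cli_ready "$(pwd)/nvcf-cli" || echo "nvcf-cli is not built yet" >&2
```

This only checks presence, not versions, so the helm and helmfile version constraints still need to be confirmed by hand.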
Export the env vars used below:
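The exact variable names are not reproduced here. As a hedged sketch, a small guard can confirm that whatever variables the later steps need are actually set before continuing; require_env is an illustrative helper, and the names in the usage comment are hypothetical placeholders, not the names this guide uses.

```shell
# Fail if any of the named environment variables is unset or empty.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX sh has no ${!name}).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf 'required variable is not set: %s\n' "$name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call with HYPOTHETICAL variable names (uncomment to run):
# require_env NGC_API_KEY NGC_ORG NGC_TEAM
```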
The single-cluster (ncp-local) and multi-cluster
(ncp-local-cp + ncp-local-compute-N) topologies both claim host
ports 8080/8443/4222 and cannot coexist. If you already have the
single-cluster topology running:
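One way to free the ports, sketched under the assumption that the single-cluster topology is a plain k3d cluster named ncp-local, is to delete that cluster with k3d directly. If the repo provides its own teardown target, prefer that; remove_single_cluster is an illustrative wrapper only.

```shell
# Delete the single-cluster topology.
# ASSUMPTION: it is a k3d cluster named "ncp-local".
remove_single_cluster() {
  k3d cluster delete ncp-local
}

# Example call (uncomment to run):
# remove_single_cluster
```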
The values-driven Helmfile path has no control-plane profile, so you must author topology-correct URLs in the environment file yourself. Use the multi-cluster fixture (NOT the single-cluster one):
Substitute your NGC org and team:
The multi-cluster fixture’s global.nvcaOperator.selfManaged.* URLs use
.test hostnames. The single-cluster fixture’s in-cluster DNS names (for example
http://api.sis.svc.cluster.local:8080) resolve only inside the
control-plane cluster, so the NVCA agent on the compute cluster would fail
with 401 errors against ICMS at runtime. Use the multi-cluster fixture.
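A coarse way to confirm that the right fixture was copied is to look for the control-plane .test domain in the environment file. This is only a sketch: uses_test_hostnames is an illustrative helper, the file path is passed in by the caller, and the check does not validate individual URLs.

```shell
# Succeed if the given environment file mentions the control-plane
# .test domain used by the multi-cluster fixture.
uses_test_hostnames() {
  grep -q 'nvcf-control-plane\.test' "$1"
}

# Example call (uncomment and substitute your environment file path):
# uses_test_hostnames path/to/your-environment.yaml || echo "wrong fixture?" >&2
```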
Helmfile install runs against the ambient kubectl context. Switch to the control-plane cluster so the install lands there:
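The context switch can be sketched as follows, using the k3d-ncp-local-cp context name from this guide. The wrapper function use_control_plane_context is illustrative; it switches the context and then confirms that the switch took effect.

```shell
CP_CONTEXT="k3d-ncp-local-cp"

# Switch the ambient kubectl context to the control-plane cluster and
# verify that it is now the active context.
use_control_plane_context() {
  kubectl config use-context "$CP_CONTEXT" >/dev/null || return 1
  current=$(kubectl config current-context)
  [ "$current" = "$CP_CONTEXT" ]
}

# Example call (uncomment to run before the Helmfile install):
# use_control_plane_context || echo "context switch failed" >&2
```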
The 18 standard helm releases land on k3d-ncp-local-cp (see the
single-cluster Helmfile page for the full release list).
Switching kubectl to the compute cluster context before registration is the
most error-prone step in the multi-cluster flow. The next step’s
nvcf-cli cluster register (run internally by make register-cluster)
auto-discovers the target cluster’s OIDC issuer and JWKS by running a probe
Job in the current kubectl context. If you skip the switch, the control-plane
cluster’s JWKS is registered as the compute cluster’s identity, and ICMS
rejects the compute agent’s PSAT tokens with 401 at runtime.
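Because registration silently uses whatever context is active, a guard that refuses to continue on the wrong context can help. The sketch below assumes the compute cluster's context is named k3d-ncp-local-compute-1 (the k3d- prefix plus the cluster name); require_context is an illustrative helper.

```shell
# Fail unless the active kubectl context matches the expected name.
require_context() {
  expected=$1
  actual=$(kubectl config current-context 2>/dev/null) || actual=""
  if [ "$actual" != "$expected" ]; then
    echo "kubectl context is '${actual}', expected '${expected}'" >&2
    return 1
  fi
}

# Example call (uncomment to run before make register-cluster):
# require_context k3d-ncp-local-compute-1 || exit 1
```

Running the guard immediately before registration keeps the check and the action in the same shell session, where the context cannot have been changed by another terminal in between.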
make register-cluster runs nvcf-cli init internally before
cluster register, so this flow does not need a separate init step.
The target produces
deploy/stacks/self-managed/out/ncp-local-compute-1-register-values.yaml.
The NVCFBackend resource is created on the compute cluster, not the control-plane cluster. Use the compute cluster context for all verification:
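To avoid depending on the ambient context during verification, each command can name the compute cluster explicitly with kubectl's --context flag. This is a sketch: kc_compute is an illustrative wrapper, the context name k3d-ncp-local-compute-1 is assumed, and the resource name nvcfbackends in the usage comment is an assumption about how the NVCFBackend kind is exposed.

```shell
COMPUTE_CONTEXT="k3d-ncp-local-compute-1"

# Run any kubectl command against the compute cluster, regardless of
# which context is currently active.
kc_compute() {
  kubectl --context "$COMPUTE_CONTEXT" "$@"
}

# Example call (uncomment to run; resource name is an ASSUMPTION):
# kc_compute get nvcfbackends --all-namespaces
```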
Confirm the control-plane API is reachable (from the host):
Remove the helm releases on both clusters but keep the topology (stack-only):
Or destroy the whole topology: