Multi-cluster Local Development with the CLI
Install an NVCF self-hosted control plane on one local k3d cluster and a
separately registered compute plane on a second cluster, all using
nvcf-cli. Useful when you want to exercise the multi-cluster install and
registration paths before targeting real infrastructure.
This setup is for local development only. It uses fake GPUs, a single Cassandra replica, and ephemeral storage. Do not use this for production workloads.
The CLI writes .localhost URLs into the control-plane profile and
flows them through to the per-cluster register-values as-is. The NVCA
agent on the compute cluster uses those URLs at runtime to reach cp
services. The docker network shared between the two k3d clusters,
together with the install-time wiring that make build-and-deploy-multicluster
sets up, is what makes cross-cluster access work.
For users coming from the Helmfile install path: that flow is
values-driven and uses the .nvcf-control-plane.test aliases
provisioned by tools/ncp-local-cluster/scripts/configure-control-plane-endpoints.sh.
The CLI path does not depend on those aliases.
Install the following tools:
Docker (running)
k3d v5.x or later
kubectl
helm >= 3.12
An NGC API key from ngc.nvidia.com with access to the NVCF chart and image registry.
The NGC organization and team slugs that hold the chart and image
repository you have access to. make build-and-deploy-multicluster
reads these from SAMPLE_NGC_ORG / SAMPLE_NGC_TEAM during its
credential provider validation step; without them, the build target
fails and skips its final gateway-API setup.
nvcf-cli built from this repo:
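Before bootstrapping, it can help to confirm that the prerequisites are actually on PATH. The following is a minimal sketch and not part of the repo: the helper names `missing_tools` and `version_ge` are hypothetical, and the version comparison assumes a `sort` that supports `-V` (GNU coreutils or a recent BSD/macOS).

```shell
#!/bin/sh
# Hypothetical preflight helpers; not shipped with nvcf-cli.

# Print the name of each tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

missing="$(missing_tools docker k3d kubectl helm)"
if [ -n "$missing" ]; then
  echo "missing tools:" $missing >&2
fi

# Only check the helm version when helm is installed.
if command -v helm >/dev/null 2>&1; then
  helm_version="$(helm version --template '{{.Version}}' | sed 's/^v//')"
  version_ge "$helm_version" 3.12 || echo "helm $helm_version is older than 3.12" >&2
fi
```

The script only reports problems on stderr and never exits, so it is safe to paste into an interactive shell.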
Export the env vars used by the cluster bootstrap and the install steps:
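A minimal sketch of that export step. `SAMPLE_NGC_ORG` and `SAMPLE_NGC_TEAM` are the names the build target reads (see the prerequisites); the `NGC_API_KEY` variable name, the placeholder values, and the `unset_vars` helper are assumptions for illustration, so substitute whatever names your checkout of the repo documents.

```shell
# Uncomment and fill in your own values. NGC_API_KEY is an assumed name.
# export SAMPLE_NGC_ORG="<your-org-slug>"
# export SAMPLE_NGC_TEAM="<your-team-slug>"
# export NGC_API_KEY="<your-api-key>"

# Hypothetical helper: print every named variable that is unset or empty.
unset_vars() {
  for name in "$@"; do
    eval "value=\${$name-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

missing_env="$(unset_vars SAMPLE_NGC_ORG SAMPLE_NGC_TEAM NGC_API_KEY)"
[ -z "$missing_env" ] || echo "set these before running make: $missing_env" >&2
```

Checking up front avoids the failure mode described above, where credential provider validation fails and the gateway-API setup is skipped.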
This creates ncp-local-cp plus ncp-local-compute-1, installs the fake
GPU operator and CSI SMB driver on the compute cluster, configures DNS for
the .test aliases, and validates Envoy Gateway on the control-plane cluster.
The single-cluster (ncp-local) and multi-cluster
(ncp-local-cp + ncp-local-compute-N) topologies both claim host
ports 8080/8443/4222 and cannot coexist. If you already have the
single-cluster topology running:
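To detect that conflict before bootstrapping, something like the following could work. It is a sketch: `has_single_cluster` is a hypothetical helper, and it assumes the single-cluster topology was created under the k3d cluster name `ncp-local` exactly as named above.

```shell
# Hypothetical helper: read k3d cluster names (one per line) on stdin and
# succeed only if the single-cluster topology "ncp-local" is among them.
# -x matches whole lines, so "ncp-local-cp" does not count as a hit.
has_single_cluster() {
  grep -qx 'ncp-local'
}

if command -v k3d >/dev/null 2>&1; then
  # The first column of `k3d cluster list --no-headers` is the cluster name.
  if k3d cluster list --no-headers | awk '{print $1}' | has_single_cluster; then
    echo "ncp-local exists; tear it down before creating ncp-local-cp" >&2
  fi
fi
```

If the check fires, removing the old cluster first (for example with `k3d cluster delete ncp-local`, or the repo's own teardown target if it provides one) frees the host ports.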
build-and-deploy-multicluster runs setup-gateway-api,
check-gateway-api, and validate-gateway on the control-plane cluster
as its final steps. If any earlier step fails (for example, credential
provider validation when SAMPLE_NGC_ORG / SAMPLE_NGC_TEAM are not
set), gateway setup is skipped. After fixing the underlying issue,
re-run just the gateway-API setup on the cp cluster:
nvcf-cli self-hosted install renders helmfile manifests that reference
imagePullSecrets: [{name: nvcr-pull-secret}]. Create the secret in each
NVCF namespace on the control-plane cluster (k3d-ncp-local-cp) before
running install so pods can pull images from nvcr.io. Set the kubectl
context to the cp cluster first if you have not already:
The loop is idempotent (uses kubectl apply). Pull secrets for the compute
cluster (k3d-ncp-local-compute-1) are configured by compute-plane install
later in this flow.
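A sketch of what such an idempotent loop can look like. The namespace names are placeholders (the real NVCF namespace list is not reproduced here), `NGC_API_KEY` is an assumed variable name, and `ensure_pull_secret` is a hypothetical helper. The `$oauthtoken` username is the literal string nvcr.io expects alongside an NGC API key.

```shell
CP_CONTEXT="k3d-ncp-local-cp"

# Create or update nvcr-pull-secret in one namespace. Rendering the secret
# with --dry-run=client and piping it to `kubectl apply` makes re-runs safe.
ensure_pull_secret() {
  ns="$1"
  kubectl --context "$CP_CONTEXT" create secret docker-registry nvcr-pull-secret \
    --namespace "$ns" \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="$NGC_API_KEY" \
    --dry-run=client -o yaml |
    kubectl --context "$CP_CONTEXT" apply -f -
}

# Placeholder names; substitute the NVCF namespaces your install uses.
# for ns in example-ns-a example-ns-b; do ensure_pull_secret "$ns"; done
```

Passing `--context` explicitly keeps the loop pointed at the cp cluster even if the current kubectl context changes. Each namespace has to exist before the secret can be applied to it.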
The install command needs both contexts so it knows which cluster gets each plane:
--token DUMMY skips the install command’s check-cp auth gate. The
install path itself never consumes the token. See the single-cluster CLI
page for the full explanation.
When this completes, a control-plane profile is written to
deploy/stacks/self-managed/out/control-plane-profile.yaml. It carries both
URL blocks:
controlPlane.endpoints.inCluster.* - resolves only inside the
control-plane cluster (for example http://api.sis.svc.cluster.local:8080).
controlPlane.endpoints.computeReachable.* - the .localhost URLs
the CLI writes for cluster-external consumers. These flow through
to the register-values in Step 6 as-is; compute-plane register
does not rewrite them.
compute-plane register picks the right block by inspecting
--kube-context against the cp context.
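The selection rule can be pictured as a simple comparison. This is an illustrative sketch of the behaviour described above, not the CLI's actual code: `select_endpoint_block` is a hypothetical name, and it assumes that a context equal to the cp context selects the inCluster block.

```shell
# Illustration only: a --kube-context equal to the control-plane context
# gets the in-cluster URLs; any other context gets the compute-reachable
# (.localhost) URLs.
select_endpoint_block() {
  kube_context="$1"
  cp_context="$2"
  if [ "$kube_context" = "$cp_context" ]; then
    echo "inCluster"
  else
    echo "computeReachable"
  fi
}

# Registering the compute cluster against the cp cluster:
select_endpoint_block k3d-ncp-local-compute-1 k3d-ncp-local-cp
# prints "computeReachable"
```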
The --kube-context flag selects the compute cluster, which causes the CLI
to pick the computeReachable URL block from the profile and write those
URLs straight into the register-values file. The NVCA agent on the compute
cluster uses those URLs at runtime to reach cp services.
The output file’s selfManaged block contains the .localhost hostnames, not
the in-cluster service URLs.
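One quick way to sanity-check the rendered file is to search it for in-cluster service names. A sketch, assuming the caller passes in the file path (the actual output location is not shown here) and that in-cluster URLs contain `svc.cluster.local` as in the profile example earlier; `has_in_cluster_urls` is a hypothetical helper.

```shell
# Succeed if the given file still references in-cluster service URLs.
has_in_cluster_urls() {
  grep -q 'svc\.cluster\.local' "$1"
}

# Usage (the path is a placeholder):
# if has_in_cluster_urls path/to/register-values.yaml; then
#   echo "register-values still points at in-cluster URLs" >&2
# fi
```

A hit would suggest the register step ran with --kube-context pointing at the cp cluster rather than the compute cluster.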
nvcf-cli cluster register (run internally during this step) auto-discovers
the target cluster’s OIDC issuer and JWKS by running a probe Job in the
cluster identified by --kube-context. That identity is what ICMS validates
when the compute agent presents PSAT tokens at runtime. Always set
--kube-context to the COMPUTE cluster.
The NVCFBackend resource is created on the compute cluster, not the control-plane cluster.
Confirm the control-plane API is reachable (from the host, where
api.localhost resolves to 127.0.0.1):
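A hedged sketch of such a reachability probe. Port 8080 is assumed from the host ports listed earlier, and the root path is a placeholder because the specific health endpoint is not given here; `probe_status` is a hypothetical helper.

```shell
# Print the HTTP status code for a URL. curl reports 000 when it cannot
# connect; `|| true` keeps the calling shell going in that case.
probe_status() {
  curl -sS -o /dev/null --max-time 5 -w '%{http_code}\n' "$1" || true
}

# Usage (port and path are assumptions):
# probe_status http://api.localhost:8080/
```

Any HTTP status other than 000 indicates that the gateway on the control-plane cluster answered from the host, even if the path itself returns an error code.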
Remove the helm releases on both clusters but keep the topology (stack-only):
Or destroy the whole topology: