Local Development (k3d)

Run the full NVCF self-hosted control plane on your laptop using k3d for development, testing, or demos.

This setup is for local development only. It uses fake GPUs, a single Cassandra replica, and ephemeral storage. Do not use this for production workloads.

Assumptions

This guide assumes:

  • Helm charts are pulled from the NGC registry (nvcr.io/0833294136851237/nvcf-ncp-staging)
  • Container images are pulled from the same NGC registry
  • Image pull secrets are configured in the environment YAML using imagePullSecrets to authenticate with NGC

If you are using a different registry (e.g., Amazon ECR, a private Harbor instance, or a local mirror), update the helm.sources and image sections in the environment file and adjust the pull secret configuration accordingly. See self-hosted-image-mirroring for details on mirroring artifacts to other registries.

A ready-to-use k3d configuration and setup script is available in the nv-cloud-function-helpers repository. Clone it and run ./setup.sh to create the cluster with all prerequisites, then skip to Deploy the NVCF Stack below.
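
As a sketch, the flow looks like the following; the repository URL is assumed to be the public NVIDIA GitHub org, and the exact location of setup.sh inside the repository may differ (check its README):

$git clone https://github.com/NVIDIA/nv-cloud-function-helpers.git
$cd nv-cloud-function-helpers
$# Change into the directory containing the k3d setup script if it is not at the repository root
$./setup.sh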

Prerequisites

Install the following tools:

  • Docker (running)
  • k3d v5.x or later
  • kubectl
  • helm >= 3.12
  • helmfile >= 1.1.0, < 1.2.0
  • helm-diff plugin (helm plugin install https://github.com/databus23/helm-diff)
  • NGC API Key from ngc.nvidia.com with access to the NVCF chart/image registry
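
You can sanity-check the installed tools before creating the cluster, for example:

$docker info > /dev/null && echo "Docker is running"
$k3d version
$kubectl version --client
$helm version --short
$helmfile version
$helm plugin list
$# "diff" should appear in the plugin list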

Step 1: Create the k3d Cluster

Save the following configuration as k3d-config.yaml:

apiVersion: k3d.io/v1alpha5
kind: Simple
metadata:
  name: ncp-local

image: rancher/k3s:v1.30.2-k3s2
servers: 1
agents: 5

ports:
  - port: 8080:80
    nodeFilters:
      - loadbalancer
  - port: 8443:443
    nodeFilters:
      - loadbalancer

options:
  k3d:
    wait: true
  k3s:
    extraArgs:
      - arg: "--disable=traefik"
        nodeFilters:
          - server:*
    nodeLabels:
      - label: run.ai/simulated-gpu-node-pool=default
        nodeFilters:
          - agent:3
          - agent:4
      - label: nvidia.com/gpu.family=hopper
        nodeFilters:
          - agent:3
          - agent:4
      - label: nvidia.com/gpu.machine=NVIDIA-DGX-H100
        nodeFilters:
          - agent:3
          - agent:4
      - label: nvidia.com/cuda.driver.major=535
        nodeFilters:
          - agent:3
          - agent:4

This creates a 6-node cluster: 1 server (control plane) and 5 agents. Agents 3 and 4 are pre-labeled for the fake GPU operator. Traefik is disabled because NVCF uses Envoy Gateway.

Create the cluster:

$k3d cluster create --config k3d-config.yaml

Verify:

$kubectl get nodes
$# Expected: 6 nodes (1 server + 5 agents), all Ready
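
You can also confirm that the GPU node-pool label from the k3d config landed on the intended agents:

$kubectl get nodes -l run.ai/simulated-gpu-node-pool=default
$# Expected: the two agent nodes (agents 3 and 4) labeled for the fake GPU operator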

Step 2: Install the Fake GPU Operator

The fake GPU operator simulates GPU resources on the pre-labeled nodes so the NVCA agent can discover them. See fake-gpu-operator for full details.

$# Install KWOK (required by the fake GPU operator)
$kubectl apply -f https://github.com/kubernetes-sigs/kwok/releases/download/v0.7.0/kwok.yaml
$kubectl wait --for=condition=Available deployment/kwok-controller -n kube-system --timeout=60s
$
$# Install the fake GPU operator
$helm repo add fake-gpu-operator \
> https://runai.jfrog.io/artifactory/api/helm/fake-gpu-operator-charts-prod --force-update
$
$helm upgrade -i gpu-operator fake-gpu-operator/fake-gpu-operator \
> -n gpu-operator --create-namespace \
> --set 'topology.nodePools.default.gpuCount=8' \
> --set 'topology.nodePools.default.gpuProduct=NVIDIA-H100-80GB-HBM3' \
> --set 'topology.nodePools.default.gpuMemory=81559'

Verify fake GPUs appear on the labeled nodes:

$kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
$# Agents 3 and 4 should show GPU: 8
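
If the GPU count does not appear, checking the fake GPU operator pods is a reasonable first step:

$kubectl get pods -n gpu-operator
$# All fake GPU operator pods should be Running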

Step 3: Install CSI SMB Driver

The CSI SMB driver is required for NVCA shared model cache storage:

$helm repo add csi-driver-smb \
> https://raw.githubusercontent.com/kubernetes-csi/csi-driver-smb/master/charts
$
$helm install csi-driver-smb csi-driver-smb/csi-driver-smb \
> -n kube-system --version v1.17.0
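
To confirm the driver registered, you can check the CSIDriver object and the driver pods (the csi-smb-* pod naming is an assumption based on the upstream chart):

$kubectl get csidriver smb.csi.k8s.io
$kubectl get pods -n kube-system | grep csi-smb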

Deploy the NVCF Stack

With the cluster ready, follow the helmfile-installation guide. The steps below call out the local-specific differences for each step.

Step 1 (Ingress)

Follow as documented, but skip the cloud-provider annotations on the Gateway resource. k3d handles LoadBalancer services automatically via its built-in klipper-lb.
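
After completing Step 1, you can check that klipper-lb assigned an address to the Envoy proxy service; the namespace below is assumed from the controllerNamespace used in the environment file in Step 2:

$kubectl get svc -n envoy-gateway-system
$# The Envoy LoadBalancer service should show an EXTERNAL-IP rather than <pending>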

Step 2 (Environment file)

Create a local development environment file from the template below (local-dev-env.yaml). Save it as environments/<name>.yaml (e.g., environments/my-local.yaml) in your nvcf-self-managed-stack directory.

environments/my-local.yaml
# NVCF Self-Hosted Local Development Environment
# For use with k3d clusters. See the Local Development guide for setup instructions.
#
# Save this file as environments/<name>.yaml in your nvcf-self-managed-stack directory.
# Create a matching secrets/<name>-secrets.yaml file with your registry credentials.
# Deploy with: HELMFILE_ENV=<name> helmfile sync

global:
  # Domain for local access (routes use .localhost TLD)
  domain: "localhost"

  # Helm chart registry (where helmfile pulls OCI charts from)
  helm:
    sources:
      registry: nvcr.io
      repository: 0833294136851237/nvcf-ncp-staging

  # Container image registry (where Kubernetes pulls images from)
  image:
    registry: nvcr.io
    repository: 0833294136851237/nvcf-ncp-staging

  # Pull secret created by create-nvcr-pull-secrets.sh (run once before deploying)
  imagePullSecrets:
    - name: nvcr-pull-secret

  # Disable node selectors for local development (pods schedule on any node)
  nodeSelectors:
    enabled: false

  # k3d uses the local-path StorageClass by default
  storageClass: local-path
  storageSize: 2Gi

  observability:
    tracing:
      enabled: false
      collectorEndpoint: ""
      collectorPort: 4317
      collectorProtocol: http

# Single Cassandra replica for local development
cassandra:
  enabled: true
  replicaCount: 1
  jvm:
    # Fast startup options -- only safe with a single replica.
    # Do NOT use these settings with multiple replicas.
    extraOpts: "-Dcassandra.superuser_setup_delay_ms=100 -Dcassandra.gossip_settle_min_wait_ms=1000"

nats:
  enabled: true

openbao:
  enabled: true
  migrations:
    issuerDiscovery:
      enabled: true

# Gateway configuration matching the standard control plane installation Step 1
ingress:
  gatewayApi:
    enabled: true
    controllerNamespace: envoy-gateway-system
    routes:
      nvcfApi:
        routeAnnotations: {}
      apiKeys:
        routeAnnotations: {}
      invocation:
        routeAnnotations: {}
      grpc:
        routeAnnotations: {}
    gateways:
      shared:
        name: nvcf-gateway
        namespace: envoy-gateway
        listenerName: http
      grpc:
        name: nvcf-gateway
        namespace: envoy-gateway
        listenerName: tcp

This template is pre-configured for local development:

  • Storage: local-path (2Gi volumes, the default k3d StorageClass)
  • Cassandra: Single replica with fast startup JVM options
  • Node selectors: Disabled (pods schedule on any available node)
  • Registry: nvcr.io/0833294136851237/nvcf-ncp-staging
  • Gateway: nvcf-gateway in envoy-gateway namespace (matches Step 1)
  • Domain: localhost
  • imagePullSecrets: Pre-configured to reference nvcr-pull-secret (created in Step 4)

Step 3 (Secrets)

Create secrets/<name>-secrets.yaml (e.g., secrets/my-local-secrets.yaml) from the template in the control plane guide. The file name must match your environment name. Fill in your base64-encoded NGC credentials for the NGC org you will deploy function images from:

$echo -n '$oauthtoken:YOUR_NGC_API_KEY' | base64

Step 4 (Pull secrets)

Run the helper script to create the nvcr-pull-secret Kubernetes secret in all NVCF namespaces:

$export NGC_API_KEY="<your-ngc-api-key>"
$bash samples/scripts/create-nvcr-pull-secrets.sh

The environment file template from Step 2 already references this secret via imagePullSecrets.
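
To confirm the script created the secret everywhere it is needed, a field selector works across namespaces:

$kubectl get secrets -A --field-selector metadata.name=nvcr-pull-secret
$# The secret should be listed once per NVCF namespace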

Step 5 (Deploy)

Authenticate helm and deploy using your environment name:

$helm registry login nvcr.io -u '$oauthtoken' -p "$NGC_API_KEY"
$HELMFILE_ENV=<name> helmfile sync

Replace <name> with the name you chose for your environment file (e.g., my-local).
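
Optionally, preview what will change before syncing; this uses the helm-diff plugin installed as a prerequisite:

$HELMFILE_ENV=<name> helmfile diff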

Step 6 (Verify)

Check that all pods are running:

$kubectl get pods -A -o wide
$# All pods should be Running or Completed
$
$helm list -A
$# All releases should show STATUS: deployed

Verify the NVCA agent discovered the fake GPUs:

$kubectl get nvcfbackends -n nvca-operator
$# Expected: nvcf-default healthy
$
$kubectl get nvcfbackends -n nvca-operator -o jsonpath='{.items[0].status.gpuUsage}' | python3 -m json.tool
$# Expected: {"H100": {"available": 16, "capacity": 16}}

Verify API connectivity using the .localhost routing (not the Gateway address, which is cluster-internal on k3d):

$# Generate an admin token
$export NVCF_TOKEN=$(curl -s -X POST "http://api-keys.localhost:8080/v1/admin/keys" \
> | python3 -c "import sys,json; print(json.load(sys.stdin)['value'])")
$
$echo "Token: ${NVCF_TOKEN:0:20}..."
$
$# List functions (should return empty)
$curl -s "http://api.localhost:8080/v2/nvcf/functions" \
> -H "Authorization: Bearer ${NVCF_TOKEN}" | python3 -m json.tool
$# Expected: {"functions": []}

The standard control plane verification commands use the Gateway address from kubectl get gateway. On k3d this returns a cluster-internal IP that is not reachable from the host. Use localhost:8080 with .localhost hostnames instead, as shown above.
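
For example, the address reported on the Gateway resource (name and namespace taken from the environment file above) is only reachable from inside the cluster:

$kubectl get gateway -n envoy-gateway nvcf-gateway
$# The ADDRESS column shows a cluster-internal IP; use http://<route>.localhost:8080 from the host instead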

Accessing Routes Locally

NVCF routes use the .localhost top-level domain, which resolves to 127.0.0.1 automatically on most systems. Access services via the k3d load balancer on port 8080:

  • http://api.localhost:8080 — NVCF API
  • http://api-keys.localhost:8080 — API Keys service
  • http://invocation.localhost:8080 — Function invocation

If .localhost does not resolve automatically, add entries to /etc/hosts:

127.0.0.1 api.localhost
127.0.0.1 api-keys.localhost
127.0.0.1 invocation.localhost

Wildcard subdomains (e.g., <function-id>.invocation.localhost) cannot be added to /etc/hosts. For local testing with dynamic function IDs, add specific entries or use a local DNS resolver such as dnsmasq.
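
As a sketch, a single dnsmasq rule can resolve every .localhost subdomain to the loopback address; the config path and restart command below assume a typical Linux dnsmasq installation:

$# Hypothetical file name; dnsmasq reads any .conf file under /etc/dnsmasq.d/ on most Linux distributions
$echo 'address=/localhost/127.0.0.1' | sudo tee /etc/dnsmasq.d/nvcf-localhost.conf
$sudo systemctl restart dnsmasq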

Teardown

$# Remove the NVCF stack (use your environment name)
$HELMFILE_ENV=<name> helmfile destroy
$
$# Delete the k3d cluster
$k3d cluster delete ncp-local

Limitations

  • Fake GPUs — Function containers will be scheduled and deployed but cannot execute actual GPU workloads.
  • Single Cassandra replica — No high availability. Data may be lost on pod restart.
  • Ephemeral storage — local-path volumes are deleted when the cluster is destroyed.
  • Not suitable for performance testing — Resource constraints of a laptop do not represent production environments.