Self-Managed Clusters#

GPU clusters are registered with the NVCF control plane using the NVCA Operator. In self-managed mode, the operator reads its configuration from local Kubernetes ConfigMaps and authenticates through the local OpenBao (Vault) instance instead of contacting NGC cloud services.

Important

A running NVCF control plane (SIS, OpenBao, NATS, Cassandra, and all core services) is required. Install the control plane using either the standalone Helm chart installation or the Helmfile-based installation before proceeding.

Prerequisites#

Before installing the NVCA Operator, ensure the following prerequisites are met:

  • The control plane is installed and all core services are running.

  • The NVIDIA GPU Operator is installed on the GPU cluster. The GPU Operator manages the NVIDIA drivers, device plugin, and GPU feature discovery required for workload scheduling. For development or testing environments without physical GPUs, see Fake GPU Operator (Development / Testing).

  • The KAI Scheduler is installed on the GPU cluster. KAI Scheduler is required for optimized AI workload scheduling and bin-packing of GPU resources.

  • GPU Workload Components must be available in a user-managed registry that your Kubernetes cluster can access. See GPU Workload Components under Artifact Manifest for necessary artifacts and Image Mirroring for mirroring instructions.

  • The SMB CSI driver (smb.csi.k8s.io) must be installed on the GPU cluster. It is required for NVCA shared model cache storage (samba sidecar). Install it with:

    helm repo add csi-driver-smb \
      https://raw.githubusercontent.com/kubernetes-csi/csi-driver-smb/master/charts
    
    helm install csi-driver-smb csi-driver-smb/csi-driver-smb \
      -n kube-system --version v1.17.0
    
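Before continuing, you can spot-check these prerequisites from the cluster itself. A minimal sketch; the gpu-operator and kai-scheduler namespaces are common defaults, not guaranteed, so adjust them to match your installation:

```shell
# GPU Operator pods (namespace is a typical default; yours may differ)
kubectl get pods -n gpu-operator

# KAI Scheduler pods (namespace is a typical default; yours may differ)
kubectl get pods -n kai-scheduler

# The SMB CSI driver registers under its canonical driver name
kubectl get csidriver smb.csi.k8s.io
```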

How It Works#

When ngcConfig.clusterSource is set to self-managed, the NVCA Operator uses a local bootstrap process:

  1. The Helm chart creates a local ConfigMap (nvcfbackend-self-managed) with cluster configuration from your values file (name, region, capabilities, SIS endpoint).

  2. Because self-managed mode performs no NGC authentication, the ngcConfig.serviceKey field is required by the chart schema but never used. Set it to any non-empty string.

  3. An init container (nvca-self-managed-bootstrap) runs before the operator starts. It reads the vault-injected SIS JWT token and registers the cluster with the local SIS service, writing the resulting clusterId and clusterGroupId to the nvca-cluster-registration ConfigMap.

  4. The operator starts with the cluster already registered. It reads the IDs from the ConfigMap and creates the NVCFBackend resource.

  5. The NVCA agent pod is created by the operator. It authenticates with SIS using a vault-injected token and begins managing GPU workloads.

  6. Secrets (Cassandra credentials, registry pull secrets) are injected by the OpenBao vault agent sidecar, which authenticates using Kubernetes service account JWT tokens against the local OpenBao instance.

Installing the NVCA Operator#

  • Chart: helm-nvca-operator

  • Version: 1.6.6

  • Namespace: nvca-operator

  • Depends on: all control plane services and the gateway must be running

Configuration#

Create nvca-operator-values.yaml (download template):

nvca-operator-values.yaml
# NVCA Operator values for standalone self-managed installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.

# Operator image
image:
  repository: "<REGISTRY>/<REPOSITORY>/nvca-operator"

# NVCA agent image
nvcaImage:
  repositoryOverride: "<REGISTRY>/<REPOSITORY>/nvca"
  
generateImagePullSecret: false
# Image pull secret for private registries. Create the secret in the nvca-operator
# namespace, then uncomment the lines below. The chart passes this to both the
# operator and all NVCA agent pods (including samba and image-credential-helper).
# imagePullSecrets:
#   - name: nvcr-pull-secret

# NGC configuration -- self-managed mode does not use NGC cloud services.
# The serviceKey is not used but the field is required by the chart.
ngcConfig:
  clusterSource: self-managed
  serviceKey: "not-used"

# Self-managed backend configuration
selfManaged:
  nvcaVersion: "2.52.0-rc.5"  # NVCA agent version to deploy
  featureGateValues: ["DynamicGPUDiscovery", "SelfHosted", "KAIScheduler"]
  imageCredHelper:
    imageRepository: "<REGISTRY>/<REPOSITORY>/nvcf-image-credential-helper"
  sharedStorage:
    imageRepository: "<REGISTRY>/<REPOSITORY>/samba"
# Uncomment for node selectors
# nodeSelector:
#   key: nvcf.nvidia.com/workload
#   value: control-plane

Replace all <REGISTRY> and <REPOSITORY> placeholders with your registry settings.

Key values:

  • ngcConfig.clusterSource: must be self-managed for self-hosted deployments.

  • ngcConfig.serviceKey: required by the chart schema but not used; set to any non-empty string.

  • selfManaged.nvcaVersion: the NVCA agent container version to deploy.

  • selfManaged.featureGateValues: feature flags; DynamicGPUDiscovery enables automatic GPU detection, SelfHosted enables local vault-based auth.

  • selfManaged.imageCredHelper: image credential helper sidecar that lets function pods pull from private registries.

  • selfManaged.sharedStorage: samba sidecar for shared model cache storage.

If you are using node selectors, uncomment the nodeSelector section.

See also

For the full list of available feature flags and how to set or modify them, see Managing Feature Flags.

Image Pull Secrets#

The NVCA operator, NVCA agent, samba sidecar, and image-credential-helper all pull container images from the registry configured in your values file. If that registry is private, you must create a Kubernetes pull secret and reference it in the Helm values so that all pods can authenticate.

Note

The chart provides a generateImagePullSecret option, but this does not work in self-managed mode. It generates a pull secret from ngcConfig.serviceKey, which is set to a dummy value in self-managed deployments. Use imagePullSecrets with a pre-existing secret instead.

1. Create the pull secret in the nvca-operator and nvcf-backend namespaces:

for ns in nvca-operator nvcf-backend; do
  kubectl create namespace "$ns" --dry-run=client -o yaml | kubectl apply -f -
  kubectl create secret docker-registry nvcr-pull-secret \
    --docker-server=<REGISTRY> \
    --docker-username='$oauthtoken' \
    --docker-password="$REGISTRY_PASSWORD" \
    --namespace="$ns" \
    --dry-run=client -o yaml | kubectl apply -f -
done

Replace <REGISTRY> with your container registry (e.g., nvcr.io). For non-NGC registries, replace --docker-username and --docker-password with your registry credentials. For NGC (nvcr.io), $REGISTRY_PASSWORD is your NGC Personal Key or API Key.

The secret is needed in both namespaces because the operator and agent pods run in nvca-operator, while deployed function pods (which include the samba and image-credential-helper sidecars) run in nvcf-backend.

2. Reference the secret in your values file. Add the following to nvca-operator-values.yaml:

imagePullSecrets:
  - name: nvcr-pull-secret

The chart passes this secret to both the operator deployment and all NVCA agent pods it creates. The operator also propagates it to function pods in the nvcf-backend namespace.
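To confirm the secret landed in both namespaces before installing, a quick check (same names as the commands above):

```shell
# Expect the secret name to print once per namespace
for ns in nvca-operator nvcf-backend; do
  kubectl get secret nvcr-pull-secret -n "$ns" -o jsonpath='{.metadata.name}{"\n"}'
done
```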

Install#

helm upgrade --install nvca-operator \
  oci://${REGISTRY}/${REPOSITORY}/helm-nvca-operator \
  --namespace nvca-operator --create-namespace \
  --wait --timeout 10m \
  -f nvca-operator-values.yaml \
  --version 1.6.6

During installation, the chart will:

  1. Create the operator deployment with vault agent annotations for OpenBao auth

  2. Run the nvca-self-managed-bootstrap init container to register the cluster with SIS

  3. Start the operator, which creates the NVCFBackend and NVCA agent deployment

Verify#

Check the operator pod is running (4 containers: operator, nvca-mirror, nvca-bootstrap-watch, vault-agent):

kubectl get pods -n nvca-operator

# Expected:
# NAME                             READY   STATUS    RESTARTS   AGE
# nvca-operator-...                4/4     Running   0          1m

Check the bootstrap init container completed successfully:

kubectl logs -n nvca-operator -c nvca-self-managed-bootstrap \
  -l app.kubernetes.io/name=nvca-operator --tail=5

# Expected: "Bootstrap completed successfully" with cluster_id and cluster_group_id

Check the cluster registration ConfigMap has IDs populated:

kubectl get cm nvca-cluster-registration -n nvca-operator \
  -o jsonpath='{.data}'

# Expected: {"clusterGroupId":"<uuid>","clusterId":"<uuid>"}

Check the NVCFBackend resource was created:

kubectl get nvcfbackends -n nvca-operator

# Expected: one NVCFBackend resource with version and health status

Check the NVCA agent pod is running (the operator creates this automatically):

kubectl get pods -n nvca-system

# Expected:
# NAME                    READY   STATUS    RESTARTS   AGE
# nvca-...                3/3     Running   0          2m

Note

The NVCA agent pod has 3 containers: the agent, the admission webhook, and the OpenBao vault agent sidecar. All three should show Running. If the vault agent sidecar is in CrashLoopBackOff, verify that OpenBao is healthy and the migration jobs completed successfully.

Verify GPU discovery:

kubectl get nvcfbackends -n nvca-operator -o jsonpath='{.items[0].status}' | python3 -m json.tool

# Look for GPU information in the status output

Verify Workload Scheduling#

1. Set up environment variables:

# Get the Gateway address (from Step 1)
export GATEWAY_ADDR=$(kubectl get gateway nvcf-gateway -n envoy-gateway -o jsonpath='{.status.addresses[0].value}')
echo "Gateway Address: $GATEWAY_ADDR"

2. Generate an admin token:

# Generate an admin API token
export NVCF_TOKEN=$(curl -s -X POST "http://${GATEWAY_ADDR}/v1/admin/keys" \
  -H "Host: api-keys.${GATEWAY_ADDR}" \
  | grep -o '"value":"[^"]*"' | cut -d'"' -f4)

echo "Token generated: ${NVCF_TOKEN:0:20}..."

3. Create, deploy, and invoke a test function:

# Create a test function
# Replace <YOUR_REGISTRY>/<YOUR_REPO> with your container registry
# This should match the registry you set in the secrets file
curl -s -X POST "http://${GATEWAY_ADDR}/v2/nvcf/functions" \
  -H "Host: api.${GATEWAY_ADDR}" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${NVCF_TOKEN}" \
  -d '{
    "name": "my-echo-function",
    "inferenceUrl": "/echo",
    "healthUri": "/health",
    "inferencePort": 8000,
    "containerImage": "<YOUR_REGISTRY>/<YOUR_REPO>/load_tester_supreme:0.0.8"
  }' | jq .

# Extract function and version IDs from the response
export FUNCTION_ID=<function-id-from-response>
export FUNCTION_VERSION_ID=<version-id-from-response>

# Deploy the function
# Adjust instanceType and gpu based on your cluster configuration
# Instance Type Examples: NCP.GPU.A10G_1x, NCP.GPU.H100_1x, NCP.GPU.L40S_1x, etc.
# GPU Examples: A10G, H100, L40S, etc.
curl -s -X POST "http://${GATEWAY_ADDR}/v2/nvcf/deployments/functions/${FUNCTION_ID}/versions/${FUNCTION_VERSION_ID}" \
  -H "Host: api.${GATEWAY_ADDR}" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${NVCF_TOKEN}" \
  -d '{
    "deploymentSpecifications": [
      {
        "instanceType": "NCP.GPU.A10G_1x",
        "backend": "nvcf-default",
        "gpu": "A10G",
        "maxInstances": 1,
        "minInstances": 1
      }
    ]
  }' | jq .

# Generate an API key for invocation. The required scopes for each endpoint are
# documented in the NVCF OpenAPI spec (under the "API" page).
# Set expiration to 1 day from now (required field)
EXPIRES_AT=$(date -u -v+1d '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -d '+1 day' '+%Y-%m-%dT%H:%M:%SZ')
SERVICE_ID="nvidia-cloud-functions-ncp-service-id-aketm"

export API_KEY=$(curl -s -X POST "http://${GATEWAY_ADDR}/v1/keys" \
  -H "Host: api-keys.${GATEWAY_ADDR}" \
  -H "Content-Type: application/json" \
  -H "Key-Issuer-Service: nvcf-api" \
  -H "Key-Issuer-Id: ${SERVICE_ID}" \
  -H "Key-Owner-Id: test@nvcf-api.local" \
  -d '{
    "description": "test invocation key",
    "expires_at": "'"${EXPIRES_AT}"'",
    "authorizations": {
      "policies": [{
        "aud": "'"${SERVICE_ID}"'",
        "auds": ["'"${SERVICE_ID}"'"],
        "product": "nv-cloud-functions",
        "resources": [
          {"id": "*", "type": "account-functions"},
          {"id": "*", "type": "authorized-functions"}
        ],
        "scopes": ["invoke_function", "list_functions", "queue_details", "list_functions_details"]
      }]
    },
    "audience_service_ids": ["'"${SERVICE_ID}"'"]
  }' | jq -r '.value')

echo "API Key: ${API_KEY:0:20}..."

# Wait for deployment to be ready (list functions to see status), then invoke the function
# Uses wildcard subdomain routing: <function-id>.invocation.<gateway-addr>
curl -s -X POST "http://${GATEWAY_ADDR}/echo" \
  -H "Host: ${FUNCTION_ID}.invocation.${GATEWAY_ADDR}" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"message": "hello world", "repeats": 1}'
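Rather than copying the IDs by hand, they can be extracted with the same grep/cut pattern used for the admin token above. A sketch against a hypothetical saved response; the id and versionId field names are assumptions here, so check them against the actual JSON your API version returns:

```shell
# Hypothetical response; in practice: RESP=$(curl -s -X POST ... )
RESP='{"function":{"id":"11111111-2222-3333-4444-555555555555","versionId":"aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee","name":"my-echo-function"}}'

# Pull out the first "id" value and the "versionId" value
export FUNCTION_ID=$(echo "$RESP" | grep -o '"id":"[^"]*"' | head -1 | cut -d'"' -f4)
export FUNCTION_VERSION_ID=$(echo "$RESP" | grep -o '"versionId":"[^"]*"' | head -1 | cut -d'"' -f4)

echo "$FUNCTION_ID $FUNCTION_VERSION_ID"
```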

Note

The backend value should match the cluster group name registered by the NVCA operator. The instanceType and gpu values depend on the GPU types available in your cluster.

For invocation, the Host header uses wildcard subdomain routing: <function-id>.invocation.<gateway-addr>. The URL path should match the function’s inferenceUrl (e.g., /echo).
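The routing described above is plain string composition, so the invocation Host header can be built once and reused. A sketch with hypothetical placeholder values:

```shell
# Hypothetical values; in practice these come from the earlier steps
FUNCTION_ID="11111111-2222-3333-4444-555555555555"
GATEWAY_ADDR="203.0.113.10"

# Wildcard subdomain routing: <function-id>.invocation.<gateway-addr>
INVOKE_HOST="${FUNCTION_ID}.invocation.${GATEWAY_ADDR}"
echo "$INVOKE_HOST"
```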

You can also use the NVCF CLI for easier function management:

  • Create, deploy, and invoke functions with simple commands

  • Create or update registry credentials without manual API calls

See Self-hosted CLI for installation and usage instructions.

Manual Cluster Registration#

The cluster bootstrap runs automatically as an init container during helm install. If you need to re-run registration (for example, after a failed install or to update the cluster registration), you can invoke the bootstrap CLI directly from the running operator pod:

kubectl exec -n nvca-operator deploy/nvca-operator -c nvca-operator -- \
  /usr/bin/nvca-self-managed bootstrap --system-namespace nvca-operator

This command:

  • Reads the cluster config from the nvcfbackend-self-managed ConfigMap

  • Authenticates with SIS using the vault-injected JWT token

  • Registers the cluster (or re-discovers an existing registration)

  • Updates the nvca-cluster-registration ConfigMap with the cluster IDs

After manual registration, restart the operator so it re-reads the updated ConfigMap. The operator caches the cluster IDs at startup and does not watch the bootstrap ConfigMap for changes, so a restart is required:

kubectl rollout restart deployment nvca-operator -n nvca-operator

Wait for the operator to come back up (the bootstrap init container will run again and confirm the existing registration), then verify the agent starts successfully:

kubectl rollout status deployment nvca-operator -n nvca-operator --timeout=120s
kubectl get pods -n nvca-system

Warning

Simply annotating the NVCFBackend to force a rollout is not sufficient after manual registration. The operator must be restarted to pick up the new cluster IDs from the ConfigMap.

Uninstalling#

To fully remove the NVCA Operator and all associated resources:

Important

If functions are currently deployed on the cluster (pods in the nvcf-backend namespace), undeploy them through the NVCF API or CLI before uninstalling the operator. Attempting to delete NVCA while function pods are running can cause finalizers to block namespace deletion. If you encounter stuck resources, see Handling Stuck Resources below.

  1. Delete the NVCFBackend resource (triggers operator-managed cleanup of the agent deployment, NVCA system pods, and related resources):

    kubectl delete nvcfbackends --all -n nvca-operator --timeout=60s
    
  2. Verify the agent namespace is clean before proceeding:

    kubectl get pods -n nvca-system
    
    # Expected: "No resources found in nvca-system namespace."
    
  3. Uninstall the Helm release:

    helm uninstall nvca-operator -n nvca-operator
    

    Note

    Helm will report that the nvca-cluster-registration ConfigMap was kept due to resource policy. This is intentional — it preserves the cluster registration IDs so that a reinstall can reuse them. Delete it manually if you want a completely clean removal.

  4. Clean up the retained ConfigMap (optional — skip if you plan to reinstall):

    kubectl delete cm nvca-cluster-registration -n nvca-operator --ignore-not-found
    
  5. Delete CRDs (removes all NVCFBackend, MiniService, and StorageRequest custom resources cluster-wide):

    kubectl delete crd \
      nvcfbackends.nvcf.nvidia.io \
      miniservices.nvca.nvcf.nvidia.io \
      storagerequests.nvca.nvcf.nvidia.io \
      --ignore-not-found
    
  6. Delete namespaces:

    kubectl delete namespace nvca-operator nvca-system nvca-modelcache-init --ignore-not-found
    

Handling Stuck Resources#

If step 1 times out and namespaces remain stuck in Terminating state, or function pods in nvcf-backend prevent cleanup, use the Appendix: NVCA Force Cleanup Script. This script removes finalizers on stuck NVCA resources, force-deletes function pods, and cleans up all NVCA namespaces.

# Preview what will be deleted
./force-cleanup-nvcf.sh --dry-run

# Execute the cleanup
./force-cleanup-nvcf.sh

Warning

The force cleanup script bypasses normal cleanup procedures by removing finalizers. Always attempt the ordered uninstall steps above first.

For the full script, download link, and detailed usage instructions, see the NVCA Force Cleanup Script appendix in the self-hosted troubleshooting guide.

Troubleshooting#

  • Bootstrap init container failed: Check the bootstrap logs to see why registration failed:

    kubectl logs -n nvca-operator -c nvca-self-managed-bootstrap \
      -l app.kubernetes.io/name=nvca-operator
    
  • ConfigMap shows empty cluster IDs after install: The vault token may not have been available when the init container ran. Run the bootstrap manually (see Manual Cluster Registration above).

  • Operator pod not starting: Check the operator logs:

    kubectl logs -n nvca-operator -l app.kubernetes.io/name=nvca-operator -c nvca-operator --tail=100
    
  • NVCA agent pod not created: The operator creates the agent pod via the NVCFBackend resource. Check the operator logs for reconciliation errors:

    kubectl describe nvcfbackends -n nvca-operator
    
  • Agent fails to register with SIS (HTTP 401): Check the bootstrap registration ConfigMap for populated cluster IDs. If they are empty, run the bootstrap manually. Also verify the vault agent sidecar on the agent pod is running and rendering the secrets file:

    kubectl logs -n nvca-system -l app.kubernetes.io/name=nvca -c vault-agent --tail=10
    
  • Vault agent sidecar failing: The agent pod needs to authenticate with OpenBao. Verify the vault system is healthy:

    kubectl exec -n vault-system openbao-server-0 -- bao status
    
  • No GPUs discovered: Ensure the GPU Operator is installed and GPU nodes have the nvidia.com/gpu resource advertised:

    kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"