Self-Managed Clusters#
GPU clusters are registered with the NVCF control plane using the NVCA Operator. The operator reads its configuration from local Kubernetes ConfigMaps and authenticates through the local OpenBao (Vault) instance.
Important
A running NVCF control plane (SIS, OpenBao, NATS, Cassandra, and all core services) is required. Install the control plane using either the standalone Helm chart installation or the Helmfile-based installation before proceeding.
How It Works#
When ngcConfig.clusterSource is set to self-managed, the NVCA Operator uses a local
bootstrap process:
1. The Helm chart creates a local ConfigMap (nvcfbackend-self-managed) with cluster configuration from your values file (name, region, capabilities, SIS endpoint).
2. The ngcConfig.serviceKey field is required by the chart schema but not used. Set it to any non-empty string.
3. An init container (nvca-self-managed-bootstrap) runs before the operator starts. It reads the vault-injected SIS JWT token and registers the cluster with the local SIS service, writing the resulting clusterId and clusterGroupId to the nvcfbackend-cluster-bootstrap-registration ConfigMap.
4. The operator starts with the cluster already registered. It reads the IDs from the ConfigMap and creates the NVCFBackend resource.
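To make step 1 concrete, the generated ConfigMap might look roughly like the sketch below. This is hypothetical: the data key names and the SIS endpoint value are illustrative assumptions, not the chart's actual schema.

```yaml
# Illustrative only -- key names and values are assumptions, not the chart's
# exact schema. The real ConfigMap is generated from your values file.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvcfbackend-self-managed
  namespace: nvca-operator
data:
  clusterName: my-gpu-cluster        # from values: name
  region: us-east-1                  # from values: region
  capabilities: "gpu"                # from values: capabilities
  sisEndpoint: http://sis.nvcf.svc.cluster.local  # from values: SIS endpoint
```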
The NVCA agent pod is created by the operator. It authenticates with SIS using a vault-injected token and begins managing GPU workloads.
Secrets (Cassandra credentials, registry pull secrets) are injected by the OpenBao vault agent sidecar, which authenticates using Kubernetes service account JWT tokens against the local OpenBao instance.
Installing the NVCA Operator#
Chart: nvca-operator
Version: 1.2.7
Namespace: nvca-operator
Depends on: all control plane services and the gateway must be running
Configuration#
Create nvca-operator-values.yaml (download template):
nvca-operator-values.yaml
# NVCA Operator values for standalone self-managed installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.

# Operator image
image:
  repository: "<REGISTRY>/<REPOSITORY>/nvca-operator"
  tag: "1.19.1-rc.3"

# NVCA agent image
nvcaImage:
  repositoryOverride: "<REGISTRY>/<REPOSITORY>/nvca"

# NGC configuration -- self-managed mode does not use NGC cloud services.
# The serviceKey is not used but the field is required by the chart.
ngcConfig:
  clusterSource: self-managed
  serviceKey: "not-used"

# Self-managed backend configuration
selfManaged:
  nvcaVersion: "2.50.1-rc.5"  # NVCA agent version to deploy
  featureGateValues: ["DynamicGPUDiscovery", "SelfHosted"]
  imageCredHelper:
    imageRepository: "<REGISTRY>/<REPOSITORY>/nvcf-image-credential-helper"
    imageTag: "0.5.0"
  sharedStorage:
    imageRepository: "<REGISTRY>/<REPOSITORY>/samba"
    imageTag: "1.0.5"

# Uncomment for node selectors
# nodeSelector:
#   nvcf.nvidia.com/workload: control-plane
Replace all <REGISTRY> and <REPOSITORY> placeholders with your registry settings.
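One way to do the substitution is with sed. The snippet below runs against a one-line stand-in file so it is self-contained; the registry and repository values are illustrative, and the same sed invocation works on the full nvca-operator-values.yaml:

```shell
REGISTRY="registry.example.com"   # illustrative value -- use your own
REPOSITORY="nvcf"                 # illustrative value -- use your own
# Stand-in for one line of nvca-operator-values.yaml:
printf 'repository: "<REGISTRY>/<REPOSITORY>/nvca-operator"\n' > /tmp/values-snippet.yaml
# Replace both placeholders everywhere they appear:
sed -i \
  -e "s|<REGISTRY>|${REGISTRY}|g" \
  -e "s|<REPOSITORY>|${REPOSITORY}|g" \
  /tmp/values-snippet.yaml
cat /tmp/values-snippet.yaml
# repository: "registry.example.com/nvcf/nvca-operator"
```

Note that `sed -i` without a backup suffix is GNU sed syntax; on macOS use `sed -i ''`.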
Key values:

ngcConfig.clusterSource: Must be self-managed.
ngcConfig.serviceKey: Required by the chart but not used. Set to any non-empty string.
selfManaged.nvcaVersion: The NVCA agent container version to deploy.
selfManaged.featureGateValues: Feature flags.
selfManaged.imageCredHelper: Image credential helper sidecar (enables function pods to pull from private registries).
selfManaged.sharedStorage: Samba sidecar for shared model cache storage.
If you are using node selectors, uncomment the nodeSelector section.
See also
For the full list of available feature flags and how to set or modify them, see Managing Feature Flags.
Install#
helm upgrade --install nvca-operator \
oci://${REGISTRY}/${REPOSITORY}/nvca-operator \
--version 1.2.7 \
--namespace nvca-operator --create-namespace \
--wait --timeout 10m \
-f nvca-operator-values.yaml
During installation, the chart will:
Create the operator deployment with vault agent annotations for OpenBao auth
Run the nvca-self-managed-bootstrap init container to register the cluster with SIS
Start the operator, which creates the NVCFBackend and NVCA agent deployment
Verify#
Check the operator pod is running (3 containers: operator, nvca-mirror, vault-agent):
kubectl get pods -n nvca-operator
# Expected:
# NAME READY STATUS RESTARTS AGE
# nvca-operator-... 3/3 Running 0 1m
Check the bootstrap init container completed successfully:
kubectl logs -n nvca-operator -c nvca-self-managed-bootstrap \
-l app.kubernetes.io/name=nvca-operator --tail=5
# Expected: "Bootstrap completed successfully" with cluster_id and cluster_group_id
Check the cluster registration ConfigMap has IDs populated:
kubectl get cm nvcfbackend-cluster-bootstrap-registration -n nvca-operator \
-o jsonpath='{.data}'
# Expected: {"clusterGroupId":"<uuid>","clusterId":"<uuid>"}
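A quick way to sanity-check that output is to confirm both IDs are non-empty UUIDs. The JSON literal below is a stand-in for the kubectl jsonpath output so the check is self-contained; in practice, capture the real output into the data variable:

```shell
# Stand-in for: kubectl get cm nvcfbackend-cluster-bootstrap-registration \
#                 -n nvca-operator -o jsonpath='{.data}'
data='{"clusterGroupId":"1b4e28ba-2fa1-11d2-883f-0016d3cca427","clusterId":"6ba7b810-9dad-11d1-80b4-00c04fd430c8"}'
# ERE for a standard 8-4-4-4-12 hex UUID:
uuid='[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
if echo "$data" | grep -Eq "\"clusterId\":\"${uuid}\"" &&
   echo "$data" | grep -Eq "\"clusterGroupId\":\"${uuid}\""; then
  msg="registration OK"
else
  msg="registration INCOMPLETE: re-run the bootstrap (see Manual Cluster Registration)"
fi
echo "$msg"
# registration OK
```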
Check the NVCFBackend resource was created:
kubectl get nvcfbackends -n nvca-operator
# Expected: one NVCFBackend resource with version and health status
Check the NVCA agent pod is running (the operator creates this automatically):
kubectl get pods -n nvca-system
# Expected:
# NAME READY STATUS RESTARTS AGE
# nvca-... 2/2 Running 0 2m
Note
The NVCA agent pod has 2 containers: the agent itself and the OpenBao vault agent sidecar.
Both should show Running. If the vault agent sidecar is in CrashLoopBackOff, verify
that OpenBao is healthy and the migration jobs completed successfully.
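The 2/2 readiness check above can be scripted. The pod line below is a stand-in for real `kubectl get pods -n nvca-system` output, and the pod name is illustrative:

```shell
# Stand-in for one data line of: kubectl get pods -n nvca-system
line='nvca-6d9f8c7b4-x2k5q   2/2   Running   0   2m'
set -- $line   # split into columns: $1=NAME $2=READY $3=STATUS ...
if [ "$2" = "2/2" ] && [ "$3" = "Running" ]; then
  verdict="agent pod healthy"
else
  verdict="inspect the vault-agent sidecar logs"
fi
echo "$verdict"
# agent pod healthy
```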
Verify GPU discovery:
kubectl get nvcfbackends -n nvca-operator -o jsonpath='{.items[0].status}' | python3 -m json.tool
# Look for GPU information in the status output
Manual Cluster Registration#
The cluster bootstrap runs automatically as an init container during helm install. If
you need to re-run registration (for example, after a failed install or to update the
cluster registration), you can invoke the bootstrap CLI directly from the running operator
pod:
kubectl exec -n nvca-operator deploy/nvca-operator -c nvca-operator -- \
/usr/bin/nvca-self-managed bootstrap --system-namespace nvca-operator
This command:
Reads the cluster config from the nvcfbackend-self-managed ConfigMap
Authenticates with SIS using the vault-injected JWT token
Registers the cluster (or re-discovers an existing registration)
Updates the nvcfbackend-cluster-bootstrap-registration ConfigMap with the cluster IDs
After manual registration, restart the operator so it re-reads the updated ConfigMap. The operator caches the cluster IDs at startup and does not watch the bootstrap ConfigMap for changes, so a restart is required:
kubectl rollout restart deployment nvca-operator -n nvca-operator
Wait for the operator to come back up (the bootstrap init container will run again and confirm the existing registration), then verify the agent starts successfully:
kubectl rollout status deployment nvca-operator -n nvca-operator --timeout=120s
kubectl get pods -n nvca-system
Warning
Simply annotating the NVCFBackend to force a rollout is not sufficient after manual registration. The operator must be restarted to pick up the new cluster IDs from the ConfigMap.
Uninstalling#
To fully remove the NVCA Operator and all associated resources:
Important
If functions are currently deployed on the cluster (pods in the nvcf-backend namespace),
undeploy them through the NVCF API or CLI before uninstalling the operator. Attempting
to delete NVCA while function pods are running can cause finalizers to block namespace
deletion. If you encounter stuck resources, see Handling Stuck Resources below.
Delete the NVCFBackend resource (triggers operator-managed cleanup of the agent deployment, NVCA system pods, and related resources):
kubectl delete nvcfbackends --all -n nvca-operator --timeout=60s
Verify the agent namespace is clean before proceeding:
kubectl get pods -n nvca-system # Expected: "No resources found in nvca-system namespace."
Uninstall the Helm release:
helm uninstall nvca-operator -n nvca-operator
Note
Helm will report that the nvcfbackend-cluster-bootstrap-registration ConfigMap was kept due to resource policy. This is intentional: it preserves the cluster registration IDs so that a reinstall can reuse them. Delete it manually if you want a completely clean removal.
Clean up the retained ConfigMap (optional; skip if you plan to reinstall):
kubectl delete cm nvcfbackend-cluster-bootstrap-registration -n nvca-operator --ignore-not-found
Delete CRDs (removes all NVCFBackend, MiniService, and StorageRequest custom resources cluster-wide):
kubectl delete crd \
  nvcfbackends.nvcf.nvidia.io \
  miniservices.nvca.nvcf.nvidia.io \
  storagerequests.nvca.nvcf.nvidia.io \
  --ignore-not-found
Delete namespaces:
kubectl delete namespace nvca-operator nvca-system nvca-modelcache-init --ignore-not-found
Handling Stuck Resources#
If step 1 times out and namespaces remain stuck in Terminating state, or function pods in
nvcf-backend prevent cleanup, use the force cleanup script. This script removes
finalizers on stuck NVCA resources, force-deletes function pods, and cleans up all NVCA
namespaces.
# Preview what will be deleted
./force-cleanup-nvcf.sh --dry-run
# Execute the cleanup
./force-cleanup-nvcf.sh
Warning
The force cleanup script bypasses normal cleanup procedures by removing finalizers. Always attempt the ordered uninstall steps above first.
For the full script, download link, and detailed usage instructions, see the NVCA Force Cleanup Script appendix in the self-hosted troubleshooting guide.
Troubleshooting#
Bootstrap init container failed: Check the bootstrap logs to see why registration failed:
kubectl logs -n nvca-operator -c nvca-self-managed-bootstrap \
  -l app.kubernetes.io/name=nvca-operator
ConfigMap shows empty cluster IDs after install: The vault token may not have been available when the init container ran. Run the bootstrap manually (see Manual Cluster Registration above).
Operator pod not starting: Check the operator logs:
kubectl logs -n nvca-operator -l app.kubernetes.io/name=nvca-operator -c nvca-operator --tail=100
NVCA agent pod not created: The operator creates the agent pod via the NVCFBackend resource. Check the operator logs for reconciliation errors:
kubectl describe nvcfbackends -n nvca-operator
Agent fails to register with SIS (HTTP 401): Check the bootstrap registration ConfigMap for populated cluster IDs. If they are empty, run the bootstrap manually. Also verify the vault agent sidecar on the agent pod is running and rendering the secrets file:
kubectl logs -n nvca-system -l app.kubernetes.io/name=nvca -c vault-agent --tail=10
Vault agent sidecar failing: The agent pod needs to authenticate with OpenBao. Verify the vault system is healthy:
kubectl exec -n vault-system openbao-server-0 -- bao status
No GPUs discovered: Ensure the GPU Operator is installed and GPU nodes have the nvidia.com/gpu resource advertised:
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
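The custom-columns output can be filtered to list only nodes that actually advertise GPUs. The table literal below is a stand-in for real kubectl output, with illustrative node names:

```shell
# Stand-in for the custom-columns output above:
nodes='NAME         GPU
gpu-node-1   8
cpu-node-1   <none>'
# Keep rows where the GPU column is a number (i.e., GPUs are allocatable):
gpu_nodes=$(echo "$nodes" | awk 'NR > 1 && $2 ~ /^[0-9]+$/ { print $1 }')
echo "GPU nodes: ${gpu_nodes:-none found; check the GPU Operator install}"
# GPU nodes: gpu-node-1
```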