Self-Managed Clusters
GPU clusters are registered with the NVCF control plane using the NVCA Operator. In self-managed mode, the operator stores its configuration locally in Kubernetes ConfigMaps and authenticates through the local OpenBao (Vault) instance.
A running NVCF control plane (SIS, OpenBao, NATS, Cassandra, and all core services) is required. Install the control plane using either the standalone Helm chart installation or the Helmfile-based installation before proceeding.
Prerequisites
Before installing the NVCA Operator, ensure the following prerequisites are met:
- The control plane is installed and all core services are running.
- The NVIDIA GPU Operator is installed on the GPU cluster. The GPU Operator manages the NVIDIA drivers, device plugin, and GPU feature discovery required for workload scheduling. For development or testing environments without physical GPUs, see fake-gpu-operator.
- The KAI Scheduler is installed on the GPU cluster. KAI Scheduler is required for optimized AI workload scheduling and bin-packing of GPU resources.
- GPU Workload Components must be available in a user-managed registry that your Kubernetes cluster can access. See GPU Workload Components under self-hosted-artifact-manifest for the necessary artifacts and self-hosted-image-mirroring for mirroring instructions.
- The SMB CSI driver (`smb.csi.k8s.io`) must be installed on the GPU cluster. It is required for NVCA shared model cache storage (the samba sidecar). Install it with:
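A typical installation uses the upstream csi-driver-smb Helm chart; the chart version below is only an example, so pin whichever release your environment has validated:

```shell
# Add the upstream csi-driver-smb chart repository and install the driver.
helm repo add csi-driver-smb https://raw.githubusercontent.com/kubernetes-csi/csi-driver-smb/master/charts
helm repo update
helm install csi-driver-smb csi-driver-smb/csi-driver-smb \
  --namespace kube-system \
  --version v1.15.0   # example version; use your validated release

# Confirm the driver is registered with the cluster.
kubectl get csidrivers smb.csi.k8s.io
```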
How It Works
When `ngcConfig.clusterSource` is set to `self-managed`, the NVCA Operator uses a local bootstrap process:

- The Helm chart creates a local ConfigMap (`nvcfbackend-self-managed`) with cluster configuration from your values file (name, region, capabilities, SIS endpoint).
- The `ngcConfig.serviceKey` field is required by the chart schema but not used. Set it to any non-empty string.
- An init container (`nvca-self-managed-bootstrap`) runs before the operator starts. It reads the vault-injected SIS JWT token, registers the cluster with the local SIS service, and writes the resulting `clusterId` and `clusterGroupId` to the `nvca-cluster-registration` ConfigMap.
- The operator starts with the cluster already registered. It reads the IDs from the ConfigMap and creates the NVCFBackend resource.
- The NVCA agent pod is created by the operator. It authenticates with SIS using a vault-injected token and begins managing GPU workloads.
- Secrets (Cassandra credentials, registry pull secrets) are injected by the OpenBao vault agent sidecar, which authenticates against the local OpenBao instance using Kubernetes service account JWT tokens.
Installing the NVCA Operator
Configuration
Create nvca-operator-values.yaml (download template):
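The downloadable template is authoritative for the full schema; the fragment below is only a sketch of the self-managed keys discussed in this section (all values are placeholders, and any keys beyond `clusterSource` and `serviceKey` come from your template):

```yaml
# nvca-operator-values.yaml -- illustrative fragment; start from the template
image:
  repository: <REGISTRY>/<REPOSITORY>/nvca-operator   # your mirrored registry

ngcConfig:
  clusterSource: self-managed   # enables the local bootstrap path
  serviceKey: "unused"          # required by the chart schema but not used
  # cluster name, region, capabilities, and the SIS endpoint are also
  # configured here; copy the exact key names from the template
```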
Replace all <REGISTRY> and <REPOSITORY> placeholders with your registry settings.
Key values:
If you are using node selectors, uncomment the nodeSelector section.
For the full list of available feature flags and how to set or modify them, see managing-feature-flags.
Image Pull Secrets
The NVCA operator, NVCA agent, samba sidecar, and image-credential-helper all pull container images from the registry configured in your values file. If that registry is private, you must create a Kubernetes pull secret and reference it in the Helm values so that all pods can authenticate.
The chart provides a generateImagePullSecret option, but this does not work in
self-managed mode. It generates a pull secret from ngcConfig.serviceKey, which is
set to a dummy value in self-managed deployments. Use imagePullSecrets with a
pre-existing secret instead.
1. Create the pull secret in the nvca-operator and nvcf-backend namespaces:
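For example (the secret name `nvca-pull-secret` is illustrative; use whatever name you reference in your values file):

```shell
REGISTRY=<REGISTRY>                 # e.g., nvcr.io
REGISTRY_PASSWORD=<your-key>        # for nvcr.io, your NGC Personal Key or API Key

# Create the same pull secret in both namespaces that pull images.
for ns in nvca-operator nvcf-backend; do
  kubectl create secret docker-registry nvca-pull-secret \
    --namespace "$ns" \
    --docker-server="$REGISTRY" \
    --docker-username='$oauthtoken' \
    --docker-password="$REGISTRY_PASSWORD"
done
```

For NGC, the username is the literal string `$oauthtoken`; for other registries, substitute your own username and password.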
Replace <REGISTRY> with your container registry (e.g., nvcr.io). For non-NGC
registries, replace --docker-username and --docker-password with your registry
credentials. For NGC (nvcr.io), $REGISTRY_PASSWORD is your NGC Personal Key or
API Key.
The secret is needed in both namespaces because the operator and agent pods run in
nvca-operator, while deployed function pods (which include the samba and
image-credential-helper sidecars) run in nvcf-backend.
2. Reference the secret in your values file. Add the following to
nvca-operator-values.yaml:
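Assuming the pull secret is named `nvca-pull-secret` (adjust to your secret's actual name):

```yaml
imagePullSecrets:
  - name: nvca-pull-secret   # must match the pre-existing secret from step 1
```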
The chart passes this secret to both the operator deployment and all NVCA agent pods it
creates. The operator also propagates it to function pods in the nvcf-backend
namespace.
Install
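Install the chart into the `nvca-operator` namespace with your values file (the release name and chart reference below are illustrative):

```shell
helm install nvca-operator <CHART_REFERENCE> \
  --namespace nvca-operator \
  --create-namespace \
  -f nvca-operator-values.yaml
```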
During installation, the chart will:

- Create the operator deployment with vault agent annotations for OpenBao auth
- Run the `nvca-self-managed-bootstrap` init container to register the cluster with SIS
- Start the operator, which creates the NVCFBackend and NVCA agent deployment
Verify
Check the operator pod is running (4 containers: operator, nvca-mirror, nvca-bootstrap-watch, vault-agent):
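For example, assuming the operator runs in the `nvca-operator` namespace:

```shell
kubectl get pods -n nvca-operator
# The operator pod should show READY 4/4 once all containers are up.
```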
Check the bootstrap init container completed successfully:
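A sketch, assuming the operator Deployment is named `nvca-operator`:

```shell
# Logs from the bootstrap init container of the operator pod
kubectl logs -n nvca-operator deploy/nvca-operator -c nvca-self-managed-bootstrap
```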
Check the cluster registration ConfigMap has IDs populated:
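For example:

```shell
kubectl get configmap nvca-cluster-registration -n nvca-operator -o yaml
# Look for non-empty clusterId and clusterGroupId values under .data
```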
Check the NVCFBackend resource was created:
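For example (the resource name used here assumes `nvcfbackend` resolves as a short name for the CRD):

```shell
kubectl get nvcfbackend -A
```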
Check the NVCA agent pod is running (the operator creates this automatically):
The NVCA agent pod has 3 containers: the agent, the admission webhook, and the OpenBao vault agent sidecar.
Both the operator and agent pods should show Running. If the vault agent sidecar is in CrashLoopBackOff, verify that OpenBao is healthy and the migration jobs completed successfully.
Verify GPU discovery:
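One quick check is to look for the `nvidia.com/gpu` resource on the nodes:

```shell
# Nodes with GPUs should list nvidia.com/gpu under Capacity and Allocatable
kubectl describe nodes | grep -i 'nvidia.com/gpu'
```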
Verify Workload Scheduling
1. Set up environment variables:
2. Generate an admin token:
3. Create, deploy, and invoke a test function:
The backend value should match the cluster group name registered by the NVCA operator.
The instanceType and gpu values depend on the GPU types available in your cluster.
For invocation, the Host header uses wildcard subdomain routing: <function-id>.invocation.<gateway-addr>.
The URL path should match the function’s inferenceUrl (e.g., /echo).
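As a sketch of the invocation, assuming the gateway address, function ID, and admin token were captured in environment variables during the earlier steps (the variable names are illustrative):

```shell
curl -s -X POST "http://${GATEWAY_ADDR}/echo" \
  -H "Host: ${FUNCTION_ID}.invocation.${GATEWAY_ADDR}" \
  -H "Authorization: Bearer ${ADMIN_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"message": "hello"}'
```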
You can also use the NVCF CLI for easier function management:
- Create, deploy, and invoke functions with simple commands
- Create or update registry credentials without manual API calls
See self-hosted-cli for installation and usage instructions.
Manual Cluster Registration
The cluster bootstrap runs automatically as an init container during helm install. If
you need to re-run registration (for example, after a failed install or to update the
cluster registration), you can invoke the bootstrap CLI directly from the running operator
pod:
This command:
- Reads the cluster config from the `nvcfbackend-self-managed` ConfigMap
- Authenticates with SIS using the vault-injected JWT token
- Registers the cluster (or re-discovers an existing registration)
- Updates the `nvca-cluster-registration` ConfigMap with the cluster IDs
After manual registration, restart the operator so it re-reads the updated ConfigMap. The operator caches the cluster IDs at startup and does not watch the bootstrap ConfigMap for changes, so a restart is required:
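Assuming the operator Deployment is named `nvca-operator`:

```shell
kubectl rollout restart deployment/nvca-operator -n nvca-operator
kubectl rollout status deployment/nvca-operator -n nvca-operator
```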
Wait for the operator to come back up (the bootstrap init container will run again and confirm the existing registration), then verify the agent starts successfully:
Simply annotating the NVCFBackend to force a rollout is not sufficient after manual registration. The operator must be restarted to pick up the new cluster IDs from the ConfigMap.
Uninstalling
To fully remove the NVCA Operator and all associated resources:
If functions are currently deployed on the cluster (pods in the nvcf-backend namespace),
undeploy them through the NVCF API or CLI before uninstalling the operator. Attempting
to delete NVCA while function pods are running can cause finalizers to block namespace
deletion. If you encounter stuck resources, see [Handling Stuck Resources] below.
1. Delete the NVCFBackend resource (triggers operator-managed cleanup of the agent deployment, NVCA system pods, and related resources):
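For example (list first if you are unsure of the resource name; `nvcfbackend` is assumed to resolve as a short name for the CRD):

```shell
kubectl get nvcfbackend -A
kubectl delete nvcfbackend --all -A
```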
2. Verify the agent namespace is clean before proceeding:
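For example:

```shell
kubectl get pods -n nvcf-backend
# Should report no resources before you continue
```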
3. Uninstall the Helm release:
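Assuming the release was installed as `nvca-operator` in the `nvca-operator` namespace:

```shell
helm uninstall nvca-operator -n nvca-operator
```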
Helm will report that the `nvca-cluster-registration` ConfigMap was kept due to resource policy. This is intentional: it preserves the cluster registration IDs so that a reinstall can reuse them. Delete it manually if you want a completely clean removal.
4. Clean up the retained ConfigMap (optional; skip if you plan to reinstall):
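For example:

```shell
kubectl delete configmap nvca-cluster-registration -n nvca-operator
```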
5. Delete CRDs (removes all NVCFBackend, MiniService, and StorageRequest custom resources cluster-wide):
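A sketch that avoids hard-coding the exact CRD names (which vary by chart version) by matching the kinds listed above:

```shell
# List the matching CRDs first, then delete them
kubectl get crd -o name | grep -Ei 'nvcfbackend|miniservice|storagerequest'
kubectl get crd -o name | grep -Ei 'nvcfbackend|miniservice|storagerequest' | xargs kubectl delete
```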
6. Delete namespaces:
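For example:

```shell
kubectl delete namespace nvca-operator nvcf-backend
```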
Handling Stuck Resources
If step 1 times out and namespaces remain stuck in Terminating state, or function pods in
nvcf-backend prevent cleanup, use the force-cleanup-script. This script removes
finalizers on stuck NVCA resources, force-deletes function pods, and cleans up all NVCA
namespaces.
The force cleanup script bypasses normal cleanup procedures by removing finalizers. Always attempt the ordered uninstall steps above first.
For the full script, download link, and detailed usage instructions, see the NVCA Force Cleanup Script appendix in the self-hosted troubleshooting guide.
Troubleshooting
- Bootstrap init container failed: Check the bootstrap logs to see why registration failed:
- ConfigMap shows empty cluster IDs after install: The vault token may not have been available when the init container ran. Run the bootstrap manually (see [Manual Cluster Registration] above).
- Operator pod not starting: Check the operator logs:
- NVCA agent pod not created: The operator creates the agent pod via the NVCFBackend resource. Check the operator logs for reconciliation errors:
- Agent fails to register with SIS (HTTP 401): Check the bootstrap registration ConfigMap for populated cluster IDs. If they are empty, run the bootstrap manually. Also verify that the vault agent sidecar on the agent pod is running and rendering the secrets file:
- Vault agent sidecar failing: The agent pod needs to authenticate with OpenBao. Verify the vault system is healthy:
- No GPUs discovered: Ensure the GPU Operator is installed and GPU nodes have the `nvidia.com/gpu` resource advertised:
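As a sketch of that last check (the `gpu-operator` namespace is the GPU Operator's default install target and may differ in your environment):

```shell
# GPU Operator components should all be Running
kubectl get pods -n gpu-operator

# Per-node allocatable GPU count advertised by the device plugin
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```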