Helmfile Installation
This section covers the installation of the NVCF control plane components, which are required for all self-hosted NVCF deployments.
By default, the NVCF self-hosted stack is deployed using the provided Helmfile as described here. However, you can also install each Helm chart
individually using helm install or helm upgrade (see self-hosted-standalone-deployment).
This guide assumes you have already downloaded and extracted the nvcf-self-managed-stack helmfile bundle (see download-nvcf-self-managed-stack). All commands in this guide are run from inside the extracted nvcf-self-managed-stack/ directory unless otherwise noted. The directory contains the helmfile definitions, environment templates, and sample configurations referenced throughout.
Namespace Requirements
Each Helm chart in the NVCF stack must be installed into a specific namespace. These namespace assignments are fixed and must not be changed — service-to-service cluster DNS addressing and Vault (OpenBao) authentication claims depend on this layout.
Installing a chart into the wrong namespace will cause authentication failures such as:
error validating claims: claim "/kubernetes.io/namespace" does not match any associated bound claim values.
If you see this error, verify that every release is deployed in its required namespace.
Prerequisites
Required Tools and Software
The following tools must be installed on your deployment machine:
- kubectl
- helm >= 3.12
- helmfile >= 1.1.0 (recommended: 1.1.x)
- helm-diff plugin >= 3.11
Avoid Helmfile 1.2.x. Helmfile 1.2.0 removed sequential execution mode, which the NVCF stack requires for ordered deployments. Use version 1.1.x for compatibility with the commands in this guide.
Helmfile 1.3.0+ re-introduced sequential execution via the --sequential-helmfiles flag, but the command syntax differs from the 1.1.x examples shown here. If you choose to use 1.3.0+, add --sequential-helmfiles to every helmfile apply and helmfile sync command.
- A Kubernetes cluster (any CSP or on-prem).
- Kubernetes Gateway CRDs installed (optional, required for Gateway API Ingress)
- Artifacts must be available in a registry that your Kubernetes cluster can access. This can be the nvcf-onprem registry for NVCF control plane service artifacts, but function containers and Helm charts must be configured to use a user-managed registry. See self-hosted-artifact-manifest and self-hosted-image-mirroring.
- The nvcf-self-managed-stack repository must be downloaded to your local machine (see download-nvcf-self-managed-stack).
See terraform-installation for instructions on how to deploy a Kubernetes cluster on EKS or other CSPs if you don’t have one already.
Install Kubernetes Gateway CRDs
Install the Kubernetes Gateway API CRDs v1.2.0. If you replace v1.2.0 with a different version, ensure compatibility with the GatewayClass and Gateway resources created in Step 1.
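For example, the standard-channel CRD manifest can be applied directly from the upstream Gateway API release (the URL pattern follows the project's release artifacts):

```shell
# Pin the Gateway API version validated with this guide.
GATEWAY_API_VERSION="v1.2.0"

# Apply the standard-channel CRDs from the upstream release.
kubectl apply -f "https://github.com/kubernetes-sigs/gateway-api/releases/download/${GATEWAY_API_VERSION}/standard-install.yaml"
```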
Install helm-diff plugin
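If the plugin is not already present, it can be installed from its upstream repository:

```shell
# The helm-diff plugin upstream repository.
HELM_DIFF_REPO="https://github.com/databus23/helm-diff"

helm plugin install "$HELM_DIFF_REPO"
helm plugin list   # should list "diff" at version >= 3.11
```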
kubectl version must match your cluster (within one minor version). Using a kubectl version that is more than one minor version ahead of your Kubernetes cluster will cause kubectl apply and kubectl patch commands to fail — not just warn — due to stricter server-side field validation in newer clients.
This is especially common on macOS with Homebrew, where brew install kubectl or brew upgrade can silently install a version much newer than your cluster. Verify before proceeding:
If your client is too new, install a matching version directly from the Kubernetes release page.
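The skew check can be scripted. The helper below is a sketch, not an official tool; it compares minor versions and assumes matching major versions:

```shell
# Show client (and, with cluster access, server) versions.
kubectl version 2>/dev/null

# Succeed when the client is at most one minor version ahead of the server.
version_skew_ok() {
  c="${1#*.}"; c="${c%%.*}"   # client minor, e.g. 1.31.2 -> 31
  s="${2#*.}"; s="${s%%.*}"   # server minor, e.g. 1.29.8 -> 29
  [ $((c - s)) -le 1 ]
}

# Skew of 2 here, so this warns.
version_skew_ok "1.31.2" "1.29.8" || echo "client too new: install a matching kubectl"
```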
Access Requirements
- kubectl configured to the Kubernetes cluster you are deploying to
- Personal NGC API Key from ngc.nvidia.com, authenticated with the nvcf-onprem organization (only needed if you pull artifacts directly from NGC or use NGC as your registry)
- Registry credentials for your container registry (ECR, NGC, etc.); see third-party-registries-self-hosted for setup instructions
- Local Helm/Docker authentication to your container registry where NVCF charts are stored. Helmfile pulls OCI charts during deployment, so your local environment must be authenticated. Examples:
  - AWS ECR: aws ecr get-login-password --region <region> | helm registry login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
  - NGC: docker login nvcr.io -u '$oauthtoken' -p <NGC_API_KEY>
  - Other registries: use docker login or helm registry login as appropriate for your registry
If you are using NGC as your registry, you will use your NGC API key when generating the base64 registry credential in Step 3. Exporting NGC_API_KEY is optional and only needed if you prefer to reuse it in commands.
Installation Steps
The installation flow is as follows.
1. Prepare ingress configuration
2. Configure your environment file (environments/<environment-name>.yaml)
3. Configure your secrets file (secrets/<environment-name>-secrets.yaml)
4. Configure image pull secrets (skip if using a CSP registry with built-in credential helpers)
5. Deploy the NVCF control plane components
6. Verify the installation
Step 1. Prepare ingress configuration
- First, create the required namespaces for NVCF components:
- Next, label the namespaces for NVCF platform identification:
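A sketch of both steps is shown below. The namespace list and the label key/value are illustrative assumptions; use the exact namespaces and labels required by your nvcf-self-managed-stack release.

```shell
# Illustrative namespace list; substitute the namespaces your stack requires.
NVCF_NAMESPACES="nvcf nats cassandra openbao"

for ns in $NVCF_NAMESPACES; do
  # Idempotent namespace creation.
  kubectl create namespace "$ns" --dry-run=client -o yaml | kubectl apply -f -
  # Label for NVCF platform identification (label key is an assumption).
  kubectl label namespace "$ns" app.nvidia.com/platform=nvcf --overwrite
done
echo "prepared namespaces: $NVCF_NAMESPACES"
```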
- Install Envoy Gateway:
- Create the GatewayClass resource:
- Create the Gateway resource:
The annotations section below is cloud-provider specific and controls how the external load balancer is provisioned. Choose the appropriate annotations for your environment:
- AWS (EKS): Creates an internet-facing Network Load Balancer
- GCP (GKE): Creates an external HTTP(S) load balancer
- Azure (AKS): Creates a public load balancer
- On-prem: Requires a load balancer solution like MetalLB, or use NodePort/Ingress. Consult your infrastructure documentation.
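As a sketch for AWS (EKS), the two resources might look like the following. The resource names, namespace, and listener set are illustrative; the controllerName is Envoy Gateway's documented controller, and whether infrastructure annotations are propagated to the load balancer Service depends on your Gateway controller version:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: nvcf-gateway-class          # illustrative name
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: nvcf-gateway                # illustrative name
  namespace: envoy-gateway-system   # illustrative namespace
spec:
  gatewayClassName: nvcf-gateway-class
  infrastructure:
    annotations:
      # AWS Load Balancer Controller: internet-facing NLB with IP targets.
      service.beta.kubernetes.io/aws-load-balancer-type: external
      service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
```

For GCP, Azure, or on-prem, replace the annotations block with the equivalents for your load balancer implementation.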
- Verify the Gateway is ready:
- Proceed to Step 2. Ensure you have your GATEWAY_ADDR ready to use in your environment configuration.
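The address can be captured from the Gateway status (resource name and namespace below are illustrative; use the ones from your Gateway):

```shell
GW_NAME="nvcf-gateway"          # illustrative; match your Gateway resource
GW_NS="envoy-gateway-system"    # illustrative

# The external address appears in the Gateway status once provisioning completes.
GATEWAY_ADDR=$(kubectl get gateway "$GW_NAME" -n "$GW_NS" \
  -o jsonpath='{.status.addresses[0].value}' 2>/dev/null)
echo "GATEWAY_ADDR=${GATEWAY_ADDR}"
```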
The Gateway address is embedded throughout your deployment. The domain value in your environment file, the Gateway API HTTPRoutes/TCPRoutes, and service discovery all depend on this address. If the Gateway or its underlying load balancer is deleted and recreated (e.g., due to a TCPRoute misconfiguration), a new address will be assigned.
If the address changes after deployment, you must update the domain in your environment file and re-sync the affected releases. See [Recovering from Gateway Address Changes] for the procedure.
The Gateway you created here will be used by the nvcf-gateway-routes chart to create HTTPRoutes and TCPRoutes for NVCF services. For details on how routing works, verification commands, and production DNS/HTTPS setup, see gateway-routing.
Step 2. Configure your environment file (environments/<environment-name>.yaml)
Environment configuration files define how NVCF is deployed in your specific environment. They are YAML files that provide values to the Helm charts.
Create your environment file from the template below (cp-env-eks-example.yaml).
Configuration Template (Amazon EKS Environment)
The following example shows a typical configuration for Amazon EKS:
domain and ingress Configuration
The domain and ingress sections of the environment file are used to configure the external access to the NVCF control plane.
If using the above example directly for EKS, you would replace the GATEWAY_ADDR with the actual ELB domain you obtained in Step 1.
If using the above example directly for EKS, your ingress configuration would look like this:
nodeSelectors Configuration
The nodeSelectors section of the environment file is used to configure the nodes on which the NVCF control plane components are deployed. Disable this unless you have a cluster with node selectors pre-configured on node pools within your cluster.
If using nvcf-base to create your cluster, you would enable this section with the following configuration:
cassandra Resource Tuning
The default Cassandra resource limits may be insufficient for clusters with large instance types (e.g., p5.48xlarge), causing Cassandra pods to be OOM-killed during initialization. If you observe Cassandra pods restarting with OOMKilled status, increase the Cassandra resource requests and limits using a Helmfile release values override (see overriding-helm-chart-values).
Add a values block to the cassandra release in helmfile.d/01-dependencies.yaml.gotmpl:
Then apply the change to just Cassandra:
When overriding values on a release that uses <<: *dependency, you must re-include global.yaml.gotmpl and the secrets file in your values list because YAML merge replaces lists entirely. Adjust CPU and memory values to suit your workload.
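A sketch of such an override is shown below; the exact release structure, anchor name, and value paths are assumptions, so check the chart's values.yaml and your helmfile.d layout:

```yaml
# In helmfile.d/01-dependencies.yaml.gotmpl (sketch):
releases:
  - name: cassandra
    <<: *dependency
    values:
      # Re-include the shared values: the merge key replaces the list,
      # it does not append to it.
      - global.yaml.gotmpl
      - secrets/{{ .Environment.Name }}-secrets.yaml
      # Chart-specific override: raise Cassandra resource limits.
      - resources:
          requests:
            cpu: "4"
            memory: 16Gi
          limits:
            cpu: "8"
            memory: 32Gi
```

The change can then be applied to only this release with a label selector, e.g. helmfile -e <environment-name> apply -l name=cassandra.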
helm and image Configuration
The helm and image sections tell NVCF which registries to pull Helm charts and container images from.
- helm.sources: The OCI registry where NVCF Helm charts are stored. Helmfile pulls charts from here at deploy time (requires local authentication; see [Access Requirements]).
- image: The container registry where NVCF service images are stored. Kubernetes pulls images from here at runtime.
If you have mirrored NVCF artifacts to your own registry (e.g., ECR), update both helm.sources and image to point to your mirror. See self-hosted-image-mirroring for details on mirroring artifacts.
When upgrading to a new nvcf-self-managed-stack version, you must re-mirror all artifacts before running helmfile sync. Each stack release may introduce new or updated container images and Helm charts. If these are not present in your private registry, pods will fail with ImagePullBackOff. Check the self-hosted-artifact-manifest for the complete list of required artifacts and versions.
Pulling directly from NGC is the recommended approach and avoids the need to manually mirror artifacts on every upgrade. If your environment permits it, configure helm.sources and image to point to the NGC registry (nvcr.io) and use your NGC API key for authentication. This ensures you always have access to the latest artifacts without additional mirroring steps.
These settings control where images are pulled from, not how Kubernetes authenticates to pull them. If your image registry is private, you may also need to configure image pull secrets — see Step 4.
Quick Start Summary: If you are using the example EKS environment YAML directly, used nvcf-base to create your cluster, and followed the ingress setup from Step 1, you only need to change:
- domain: Replace GATEWAY_ADDR with the load balancer address from Step 1
- helm.sources.registry and helm.sources.repository: Point to your Helm chart registry
- image.registry and image.repository: Point to your container image registry
Overriding Helm Chart Values
The environment file (environments/<environment-name>.yaml) controls global settings like domain, image, and nodeSelectors. However, you may need to override values for a specific Helm chart — for example, to increase Cassandra memory limits or change an image tag for one service.
Helmfile releases support a values property that passes values through to the underlying helm install/helm upgrade command. To add chart-specific overrides, edit the release definition in the appropriate file under helmfile.d/ and add a values block:
When a release inherits from a template (<<: *dependency), specifying values on the release replaces the template’s values list (YAML merge does not append lists). You must re-include global.yaml.gotmpl and the secrets file.
The values block is a list of YAML mappings. Keys correspond to the chart’s values.yaml structure. For example, to override a deeply nested value:
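For instance, a hypothetical override of a nested image tag (the api/image path is illustrative; mirror the actual structure of the chart's values.yaml):

```yaml
values:
  # Shared values re-included when the release inherits from a template.
  - global.yaml.gotmpl
  - secrets/{{ .Environment.Name }}-secrets.yaml
  # Hypothetical nested override: keys mirror the chart's values.yaml.
  - api:
      image:
        tag: "1.2.3"
```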
Values defined here take the highest precedence, overriding both the environment file and global.yaml.gotmpl. Use helmfile template to preview the rendered manifests after adding overrides, then apply to a single release:
Step 3. Configure your secrets file (secrets/<environment-name>-secrets.yaml)
Secrets configuration contains any sensitive data required for NVCF operation. The image pull secret credentials you insert here will be used to bootstrap the NVCF API with registry credentials for all worker components (function sidecars), function containers and helm charts.
These credentials will then be used for function deployments. Note that if the registry credentials are not correct you can always update them using the steps in third-party-registries-self-hosted.
Create your secrets file from the template below (example-secrets.yaml). You must replace all instances of REPLACE_WITH_BASE64_DOCKER_CREDENTIAL with your actual base64-encoded registry credentials.
NVCF supports these registries for function containers (set in api.accountBootstrap.registryCredentials): ACR (Azure), ECR (AWS), NVCR (NVIDIA), VolcEngine CR, JFrog/Artifactory, and Harbor.
Generating Base64-encoded Registry Credentials
Registry credentials must be base64-encoded in the format username:password. For detailed instructions on setting up credentials for specific registries (including IAM user creation for ECR), see third-party-registries-self-hosted.
NGC Registry
Amazon ECR
VolcEngine CR
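As a sketch, the username:password pair can be encoded with a small helper. The NGC username is the literal string $oauthtoken; the API key value below is a placeholder:

```shell
# Encode "username:password" as base64 with no trailing newline.
make_registry_credential() {
  printf '%s:%s' "$1" "$2" | base64 | tr -d '\n'
}

# NGC: username is the literal $oauthtoken, password is your NGC API key.
NGC_API_KEY="<your-ngc-api-key>"   # placeholder
make_registry_credential '$oauthtoken' "$NGC_API_KEY"
echo

# ECR: username is AWS and the password comes from the AWS CLI, e.g.
#   make_registry_credential AWS "$(aws ecr get-login-password --region <region>)"
```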
Step 4. Configure image pull secrets (conditional)
Skip this step if you are using a CSP-managed registry with built-in credential helpers (e.g., AWS ECR with IAM node roles, GKE Artifact Registry with Workload Identity, Azure ACR with managed identity). Kubernetes can pull images automatically in those environments.
The secrets file you configured in Step 3 handles API bootstrap registry credentials — these allow the NVCF API service to pull user function containers at runtime. Separately, Kubernetes itself needs image pull secrets to pull the NVCF control plane service images (API, SIS, Cassandra, etc.) from your registry.
If your image registry is private and your cluster nodes do not have built-in credential helpers, you must create Kubernetes docker-registry secrets in each NVCF namespace and configure the helmfile to reference them.
1. Create the pull secret in each NVCF namespace (create-nvcr-pull-secrets.sh):
For registries other than NGC, replace --docker-server, --docker-username, and --docker-password with your registry credentials.
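A sketch of what the provided script does is shown below; the namespace list and secret name are illustrative:

```shell
# Illustrative namespace list and secret name; match your deployment.
NVCF_NAMESPACES="nvcf nats cassandra openbao"
NGC_API_KEY="<your-ngc-api-key>"   # placeholder

for ns in $NVCF_NAMESPACES; do
  # Idempotent create-or-update of the docker-registry secret.
  kubectl -n "$ns" create secret docker-registry nvcr-pull-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="$NGC_API_KEY" \
    --dry-run=client -o yaml | kubectl apply -f -
done
echo "pull secrets applied to: $NVCF_NAMESPACES"
```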
2. Reference the secret in your helmfile environment. The helmfile propagates imagePullSecrets to all NVCF charts automatically. Add the secret name to your environment YAML (e.g. environments/<your-env>.yaml):
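The exact key name depends on the stack's environment template; as an assumed sketch:

```yaml
# In environments/<your-env>.yaml (key name is an assumption; check the
# environment template shipped with the stack):
imagePullSecrets:
  - name: nvcr-pull-secret
```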
This replaces any need for a separate admission controller or policy engine to inject pull secrets.
Step 5. Deploy the NVCF control plane components
Set kubectl context to your cluster.
Ensure your local environment is authenticated to the container registry where your NVCF Helm charts are stored (see [Access Requirements]). Helmfile pulls OCI charts during deployment and will fail if not authenticated.
Before deploying, preview the rendered Kubernetes manifests:
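A typical preview invocation looks like this (the environment name is illustrative and matches your file under environments/):

```shell
ENV_NAME="my-env"               # illustrative; matches environments/my-env.yaml
helmfile -e "$ENV_NAME" template   # render manifests without applying them
```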
This command will:
- Render all Helm charts with your environment and secrets
- Run validation hooks
- Display the resulting Kubernetes manifests
Review the output carefully to ensure:
- Container image references are correct
- Storage classes match your cluster
Deploy the self-managed stack:
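With Helmfile 1.1.x, the full deployment is a single sync (environment name illustrative; on Helmfile 1.3.0+ add --sequential-helmfiles as noted in Prerequisites):

```shell
ENV_NAME="my-env"   # illustrative
helmfile -e "$ENV_NAME" sync
```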
The initial deployment takes approximately 5-10 minutes for local development and 10-20 minutes for cloud deployments.
Deployment Progression and Monitoring
Helmfile will deploy services in the correct order with dependencies:
Phase 1: Dependency Layer (5-10 minutes)
- NATS messaging service
- OpenBao (secrets management)
- Cassandra (database)
- Helmfile Selector:
release-group=dependencies
Phase 2: Control Plane Services (5-10 minutes)
- NVCF API Service
- SIS (Spot Instance Service)
- gRPC Proxy
- Invocation Service
- API Keys Service
- ESS API
- Notary Service
- Admin Issuer Proxy
- Helmfile Selector:
release-group=services
Monitor for account bootstrap failures: Once helmfile reaches Phase 3, open a separate terminal and watch events in the nvcf namespace:
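For example:

```shell
NVCF_NS="nvcf"
# Stream events to catch bootstrap-job failures as they happen; Ctrl-C to stop.
kubectl get events -n "$NVCF_NS" --watch
```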
The account bootstrap job runs as a post-install hook and is the most common failure point (usually due to environment or secrets misconfiguration). If it fails, see [Recovering from Partial Deployments] for recovery steps.
Phase 3: Ingress Configuration (1-2 minutes)
- Gateway API Routes (if enabled)
- Helmfile Selector:
release-group=ingress
Phase 4: (Optional) GPU Operator (1-2 minutes)
- Fake GPU Operator (optional, for development environments)
- Helmfile Selector:
release-group=workers
Open a separate terminal to monitor the deployment progress:
Monitor Each Deployment Phase:
Cassandra initialization pods showing "Error" is expected. The cassandra-initialize-cluster job runs multiple pods in parallel and retries on failure. It is normal to see one or more pods with Error status. The deployment is healthy as long as at least one initialization pod reaches Completed and the cassandra-migrations job completes successfully.
If any pod remains in Pending, ContainerCreating, or ImagePullBackOff state for more than 5 minutes, see self-hosted-troubleshooting for issue identification commands and solutions.
Recovering from Partial Deployments
Do not attempt to fix a partially failed deployment by re-running helmfile sync or helmfile apply. Helm releases in a failed state will skip initialization hooks on subsequent runs, leading to incomplete deployments that appear successful but don’t function correctly.
Redeploying Dependencies (if needed):
If a dependency service (Cassandra, NATS, OpenBao) fails or gets stuck, you can safely redeploy it individually:
Recovering from Services Failures (without destroying dependencies):
If the release-group=services deployment hangs or fails (for example, account bootstrap failure due to secrets misconfiguration), you can recover without destroying your dependencies.
1. Monitor for failures:
In a separate terminal, watch events in the nvcf namespace:
2. Check the account bootstrap logs (if it failed):
The bootstrap job auto-deletes after ~5 minutes. Monitor events to catch failures in real-time.
3. Check the NVCF API logs for detailed error messages:
4. Fix the root cause (e.g., correct your secrets/<environment-name>-secrets.yaml file).
5. Destroy the services and downstream releases:
6. Clean up the service namespaces:
7. Recreate namespaces and labels (required for Gateway API routing):
8. Re-sync services (this triggers fresh post-install hooks):
9. Sync remaining releases after services succeed:
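Steps 5-9 above can be sketched with Helmfile selectors (environment name is illustrative; the namespace recreation from step 7 follows Step 1 and is omitted here):

```shell
ENV_NAME="my-env"   # illustrative

# 5. Destroy the services layer and downstream releases.
helmfile -e "$ENV_NAME" destroy -l release-group=services
helmfile -e "$ENV_NAME" destroy -l release-group=ingress

# 7. Recreate namespaces and labels as in Step 1, then:

# 8. Re-sync services so post-install hooks run fresh.
helmfile -e "$ENV_NAME" sync -l release-group=services

# 9. Sync the remaining releases once services are healthy.
helmfile -e "$ENV_NAME" sync -l release-group=ingress
```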
Full Restart (if dependencies are also broken):
If dependencies are corrupted or you prefer a clean slate, follow the complete [Uninstalling] steps, fix your configuration, then redeploy from Step 1.
Recovering from Gateway Address Changes
If your Gateway or its underlying load balancer was deleted and recreated (e.g., due to a TCPRoute misconfiguration or infrastructure change), the external address will change. Services that depend on the domain value — including Gateway API routes, SIS cluster registration, and API hostname resolution — will break until the new address is propagated.
1. Get the new Gateway address:
2. Update your environment file with the new address:
3. Re-sync ingress and services that depend on the domain:
4. Verify routes are using the new address:
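The four steps above can be sketched as follows (Gateway name, namespace, and environment name are illustrative):

```shell
GW_NAME="nvcf-gateway"          # illustrative
GW_NS="envoy-gateway-system"    # illustrative
ENV_NAME="my-env"               # illustrative

# 1. Capture the new address from the Gateway status.
NEW_ADDR=$(kubectl get gateway "$GW_NAME" -n "$GW_NS" \
  -o jsonpath='{.status.addresses[0].value}' 2>/dev/null)
echo "new gateway address: ${NEW_ADDR}"

# 2. Update 'domain' in environments/${ENV_NAME}.yaml by hand, then:

# 3. Re-sync the releases that embed the domain.
helmfile -e "$ENV_NAME" sync -l release-group=ingress
helmfile -e "$ENV_NAME" sync -l release-group=services

# 4. Confirm routes reference the new address.
kubectl get httproutes,tcproutes -A
```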
If you encounter issues during deployment, consult the self-hosted-troubleshooting guide for common problems and solutions.
Step 6. Verify the Installation
Verify that the installation succeeded by checking that all pods are running and all Helm releases have deployed successfully.
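For example:

```shell
HEALTHY_STATUSES="Running Completed"   # pod phases that indicate success

# Every pod should be in one of the healthy phases above.
kubectl get pods --all-namespaces

# Every release should report STATUS "deployed".
helm list --all-namespaces
```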
Verify API Connectivity (Optional)
If you configured Gateway API ingress, you can verify the NVCF API is accessible by running the following commands.
1. Set up environment variables:
2. Generate an admin token:
3. List functions (should be empty initially):
Next Steps
After the control plane installation is successfully complete, proceed to self-managed-clusters to set up GPU cluster operations.
Uninstalling
This will delete all NVCF resources including data stored in persistent volumes. Ensure you have backups of any important data.
To remove the NVCF installation:
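A full teardown is a single destroy against your environment (name illustrative):

```shell
ENV_NAME="my-env"   # illustrative
helmfile -e "$ENV_NAME" destroy
```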
After helmfile destroy completes, clean up the namespaces:
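For example (the namespace list is illustrative; delete the namespaces your deployment actually created):

```shell
NVCF_NAMESPACES="nvcf nats cassandra openbao"

for ns in $NVCF_NAMESPACES; do
  # --ignore-not-found makes the cleanup safe to re-run.
  kubectl delete namespace "$ns" --ignore-not-found
done
echo "removed namespaces: $NVCF_NAMESPACES"
```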
To also remove the Gateway infrastructure created in Step 1: