Helmfile Installation#

This section covers the installation of the NVCF control plane components, which are required for all self-hosted NVCF deployments.

By default, the NVCF self-hosted stack is deployed using the provided Helmfile as described here. However, you can also install each Helm chart individually using helm install or helm upgrade (see Helm Chart Installation).

Important

This guide assumes you have already downloaded and extracted the nvcf-self-managed-stack helmfile bundle (see Downloading nvcf-self-managed-stack). All commands in this guide are run from inside the extracted nvcf-self-managed-stack/ directory unless otherwise noted. The directory contains the helmfile definitions, environment templates, and sample configurations referenced throughout.

cd path/to/nvcf-self-managed-stack
ls
# Expected contents: helmfile.d/  environments/  secrets/  global.yaml.gotmpl  ...

Namespace Requirements#

Each Helm chart in the NVCF stack must be installed into a specific namespace. These namespace assignments are fixed and must not be changed — service-to-service cluster DNS addressing and Vault (OpenBao) authentication claims depend on this layout.

Namespace               Services
--------------------    ------------------------------------------------------------------------
nvcf                    api, invocation-service, grpc-proxy, notary-service, reval, state-metrics
api-keys                api-keys, admin-issuer-proxy
ess                     ess-api
sis                     sis
nvca-operator           nvca-operator
vault-system            openbao-server
cassandra-system        cassandra
nats-system             nats
envoy-gateway-system    ingress (nvcf-gateway-routes)

Warning

Installing a chart into the wrong namespace will cause authentication failures such as error validating claims: claim "/kubernetes.io/namespace" does not match any associated bound claim values. If you see this error, verify that every release is deployed in the namespace shown above.
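You can audit the layout at any time by listing all Helm releases with their namespaces and comparing against the table above (a quick check, assuming helm is configured for your cluster):

```shell
# List every release with its namespace and compare against the table above
helm list --all-namespaces
```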

Prerequisites#

Required Tools and Software#

The following tools must be installed on your deployment machine:

  • kubectl

  • helm >= 3.12

  • helmfile >= 1.1.0 (recommended: 1.1.x)

  • helm-diff plugin >=3.11

Warning

Avoid Helmfile 1.2.x. Helmfile 1.2.0 removed sequential execution mode, which the NVCF stack requires for ordered deployments. Use version 1.1.x for compatibility with the commands in this guide.

Helmfile 1.3.0+ re-introduced sequential execution via the --sequential-helmfiles flag, but the command syntax differs from the 1.1.x examples shown here. If you choose to use 1.3.0+, add --sequential-helmfiles to every helmfile apply and helmfile sync command.

  • A Kubernetes cluster (any CSP, or on-prem).

  • Kubernetes Gateway CRDs installed (optional; required for Gateway API ingress)

  • Artifacts must be available in a registry that your Kubernetes cluster can access. This can be the nvcf-onprem registry for NVCF control plane service artifacts, but function containers and Helm charts must be hosted in a user-managed registry. See Artifact Manifest and Image Mirroring.

  • The nvcf-self-managed-stack repository must be downloaded to your local machine (see Downloading nvcf-self-managed-stack).

Note

See EKS Cluster Terraform (Optional) for instructions on how to deploy a Kubernetes cluster on EKS or other CSPs if you don’t have one already.
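The tool-version constraints above can be confirmed quickly before proceeding:

```shell
# Verify tool versions before running any helmfile commands
helm version --short          # expect v3.12 or newer
helmfile --version            # expect a 1.1.x release (avoid 1.2.x)
helm plugin list              # helm-diff should appear at >= 3.11
kubectl version --client      # checked against the cluster version below
```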

Install Kubernetes Gateway CRDs

Install the Kubernetes Gateway API CRDs (v1.2.0). If you replace v1.2.0 with a different version, verify compatibility with the GatewayClass and Gateway resources created in Step 1.

# Replace with desired version
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/experimental-install.yaml

Install helm-diff plugin

# Install helm-diff plugin (required for helmfile)
helm plugin install https://github.com/databus23/helm-diff

Warning

kubectl version must match your cluster (within one minor version). Using a kubectl version that is more than one minor version ahead of your Kubernetes cluster will cause kubectl apply and kubectl patch commands to fail – not just warn – due to stricter server-side field validation in newer clients.

This is especially common on macOS with Homebrew, where brew install kubectl or brew upgrade can silently install a version much newer than your cluster. Verify before proceeding:

kubectl version
# Ensure the Client Version and Server Version are within one minor version of each other.
# Example: Client v1.32.x against Server v1.31.x is OK.
#          Client v1.32.x against Server v1.29.x will cause failures.

If your client is too new, install a matching version directly from the Kubernetes release page.

Access Requirements#

  • kubectl configured to the kubernetes cluster you are deploying to

  • Personal NGC API Key from ngc.nvidia.com, authenticated against the nvcf-onprem organization (required only if you pull artifacts directly from NGC or use NGC as your registry)

  • Registry credentials for your container registry (ECR, NGC, etc.) - see Working with Third-Party Registries for setup instructions

  • Local Helm/Docker authentication to your container registry where NVCF charts are stored. Helmfile pulls OCI charts during deployment, so your local environment must be authenticated. Examples:

    • AWS ECR: aws ecr get-login-password --region <region> | helm registry login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com

    • NGC: docker login nvcr.io -u '$oauthtoken' -p <NGC_API_KEY>

    • Other registries: Use docker login or helm registry login as appropriate for your registry

Note

If you are using NGC as your registry, you will use your NGC API key when generating the base64 registry credential in Step 3. Exporting NGC_API_KEY is optional and only needed if you prefer to reuse it in commands.

Installation Steps#

The installation flow is as follows.

  1. Prepare ingress configuration

  2. Configure your environment file (environments/<environment-name>.yaml)

  3. Configure your secrets file (secrets/<environment-name>-secrets.yaml)

  4. Configure image pull secrets (skip if using a CSP registry with built-in credential helpers)

  5. Deploy the NVCF control plane components

  6. Verify the installation

Step 1. Prepare ingress configuration#

  1. First, create the required namespaces for NVCF components:

kubectl create namespace envoy-gateway-system && \
kubectl create namespace envoy-gateway && \
kubectl create namespace api-keys && \
kubectl create namespace ess && \
kubectl create namespace sis && \
kubectl create namespace nvcf

  2. Next, label the namespaces for NVCF platform identification:

kubectl label namespace envoy-gateway nvcf/platform=true && \
kubectl label namespace api-keys nvcf/platform=true && \
kubectl label namespace sis nvcf/platform=true && \
kubectl label namespace ess nvcf/platform=true && \
kubectl label namespace nvcf nvcf/platform=true
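You can confirm the labels were applied before moving on:

```shell
# All five labeled namespaces should be listed
kubectl get namespaces -l nvcf/platform=true
# Expect: envoy-gateway, api-keys, sis, ess, nvcf
```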

  3. Install Envoy Gateway:

helm install eg oci://docker.io/envoyproxy/gateway-helm \
  --version v1.1.3 \
  -n envoy-gateway-system

  4. Create the GatewayClass resource:

kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: eg
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
EOF

  5. Create the Gateway resource:

Note

The annotations section below is cloud-provider specific and controls how the external load balancer is provisioned. Choose the appropriate annotations for your environment:

  • AWS (EKS): Creates an internet-facing Network Load Balancer

  • GCP (GKE): Creates an external HTTP(S) load balancer

  • Azure (AKS): Creates a public load balancer

  • On-prem: Requires a load balancer solution like MetalLB, or use NodePort/Ingress. Consult your infrastructure documentation.

kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: nvcf-gateway
  namespace: envoy-gateway
  annotations:
    # --- AWS (EKS) ---
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    # --- GCP (GKE) - use these instead for GCP ---
    # cloud.google.com/load-balancer-type: "External"
    # --- Azure (AKS) - use these instead for Azure ---
    # service.beta.kubernetes.io/azure-load-balancer-internal: "false"
spec:
  gatewayClassName: eg
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: Selector
        selector:
          matchLabels:
            nvcf/platform: "true"
  - name: tcp
    protocol: TCP
    port: 10081
    allowedRoutes:
      namespaces:
        from: Selector
        selector:
          matchLabels:
            nvcf/platform: "true"
EOF

  6. Verify the Gateway is ready:

# Check Gateway status
kubectl get gateway nvcf-gateway -n envoy-gateway

# Wait for PROGRAMMED=True and ADDRESS to appear
kubectl wait --for=condition=Programmed gateway/nvcf-gateway -n envoy-gateway --timeout=300s

# Get the NLB address
GATEWAY_ADDR=$(kubectl get gateway nvcf-gateway -n envoy-gateway -o jsonpath='{.status.addresses[0].value}')

echo "$GATEWAY_ADDR"
# e.g. abc123-4567890.us-west-2.elb.amazonaws.com

  7. Proceed to Step 2. Ensure you have your GATEWAY_ADDR ready to use in your environment configuration.

Warning

The Gateway address is embedded throughout your deployment. The domain value in your environment file, the Gateway API HTTPRoutes/TCPRoutes, and service discovery all depend on this address. If the Gateway or its underlying load balancer is deleted and recreated (e.g., due to a TCPRoute misconfiguration), a new address will be assigned.

If the address changes after deployment, you must update the domain in your environment file and re-sync the affected releases. See Recovering from Gateway Address Changes for the procedure.

See also

The Gateway you created here will be used by the nvcf-gateway-routes chart to create HTTPRoutes and TCPRoutes for NVCF services. For details on how routing works, verification commands, and production DNS/HTTPS setup, see Gateway Routing and DNS.

Step 2. Configure your environment file (environments/<environment-name>.yaml)#

Environment configuration files define how NVCF is deployed in your specific environment. They are YAML files that provide values to the Helm charts.

Create your environment file from the template below (cp-env-eks-example.yaml).

cd path/to/nvcf-self-managed-stack
touch environments/<environment-name>.yaml
# Copy the template into the file

Configuration Template (Amazon EKS Environment)

The following example shows a typical configuration for Amazon EKS:

environments/eks-example.yaml#
global:

  # Domain for external access (used by Gateway API HTTPRoutes)
  domain: "GATEWAY_ADDR" # Replace with ELB domain

  # =============================================================================
  # Helm Chart Sources Configuration
  # =============================================================================
  # Configure the OCI registry where NVCF Helm charts are stored.
  # This must point to a registry containing the NVCF chart packages.
  # =============================================================================
  helm:
    sources:
      registry: <your-account-id>.dkr.ecr.<your-region>.amazonaws.com
      repository: <your-ecr-repository-name> # if using nvcf-base, this will match the cluster name set in the terraform configuration
      # NGC Example:
      # registry: nvcr.io
      # repository: YOUR_ORG/YOUR_TEAM # e.g. 123456789102/YOUR_TEAM
      # ECR Example:
      # registry: <your-account-id>.dkr.ecr.<your-region>.amazonaws.com
      # repository: <your-ecr-repository-name>

  # =============================================================================
  # Container Image Registry Configuration
  # =============================================================================
  # Configure the container registry where NVCF service images are stored.
  # These images are pulled by Kubernetes when deploying the NVCF stack.
  # =============================================================================
  image:
    registry: <your-account-id>.dkr.ecr.<your-region>.amazonaws.com
    repository: <your-ecr-repository-name> # if using nvcf-base, this will match the cluster name set in the terraform configuration
    # NGC Example:
    # registry: nvcr.io
    # repository: YOUR_ORG/YOUR_TEAM # e.g. 123456789102/YOUR_TEAM
    # ECR Example:
    # registry: <your-account-id>.dkr.ecr.<your-region>.amazonaws.com
    # repository: <your-ecr-repository-name>

  nodeSelectors:
    enabled: true # If using nvcf-base to create EKS cluster, enabled: true
    vault:
      key: nvcf.nvidia.com/workload
      value: vault
    cassandra:
      key: nvcf.nvidia.com/workload
      value: cassandra
    controlplane:
      key: nvcf.nvidia.com/workload
      value: control-plane

  storageClass: "gp3" # Customize to your storage class, if using nvcf-base gp3
  storageSize: "10Gi" # Customize to your storage size, if using nvcf-base 20Gi

  # =============================================================================
  # Observability Configuration
  # =============================================================================
  # Enable distributed tracing via OTLP (disabled by default).
  # This must point to an OTLP-compatible collector.
  # =============================================================================
  observability:
    tracing:
      enabled: false
      collectorEndpoint: ""
      collectorPort: 4317
      collectorProtocol: http
      # Example:
      # enabled: true
      # collectorEndpoint: <your-collector-endpoint>
      # collectorPort: <your-collector-port>
      # collectorProtocol: <your-collector-protocol>

fakeGpuOperator:
  enabled: false # If deploying locally with no GPUs, true
  ubuntu:
    imageName: alpine-k8s
    tag: 1.30.12

accounts: # Default NVCF account configuration
  limits:
    maxFunctions: 10
    maxTasks: 10 # Note: Tasks (NVCT) are not currently supported for EA
    maxTelemetries: 10 # Note: BYOO is not currently supported for EA
    maxRegistryCreds: 10

# These static global values are processed in the values template
nats:
  enabled: true

cassandra:
  enabled: true

openbao:
  enabled: true
  migrations:
    issuerDiscovery:
      enabled: true # Recommended true for EKS - discovers OIDC issuer automatically

# Ingress Gateway Configuration
ingress:
  gatewayApi:
    enabled: true
    controllerNamespace: "envoy-gateway-system" # must be set by the environment
    routes:
      nvcfApi:
        routeAnnotations: {}
      apiKeys:
        routeAnnotations: {}
      invocation:
        routeAnnotations: {}
      grpc:
        routeAnnotations: {}
    gateways:
      shared:
        name: "nvcf-gateway" # must be set by the environment
        namespace: "envoy-gateway" # must be set by the environment
        listenerName: http
      grpc:
        name: "nvcf-gateway" # must be set by the environment
        namespace: "envoy-gateway" # must be set by the environment
        listenerName: tcp


domain and ingress Configuration#

The domain and ingress sections of the environment file configure external access to the NVCF control plane.

If using the above example directly for EKS, you would replace the GATEWAY_ADDR with the actual ELB domain you obtained in Step 1.

domain: "GATEWAY_ADDR" # Replace with ELB domain

If using the above example directly for EKS, your ingress configuration would look like this:

ingress:
   gatewayApi:
      enabled: true
      controllerNamespace: "envoy-gateway-system"
      routes:
         nvcfApi:
            routeAnnotations: {}
         apiKeys:
            routeAnnotations: {}
         invocation:
            routeAnnotations: {}
         grpc:
            routeAnnotations: {}
      gateways:
         shared:
            name: "nvcf-gateway"
            namespace: "envoy-gateway"
            listenerName: http
         grpc:
            name: "nvcf-gateway"
            namespace: "envoy-gateway"
            listenerName: tcp

nodeSelectors Configuration#

The nodeSelectors section of the environment file controls which nodes the NVCF control plane components are scheduled on. Disable it unless your cluster's node pools are already labeled with the corresponding keys and values.

If using nvcf-base to create your cluster, you would enable this section with the following configuration:

nodeSelectors:
  enabled: true
  vault:
    key: nvcf.nvidia.com/workload
    value: vault
  cassandra:
    key: nvcf.nvidia.com/workload
    value: cassandra
  controlplane:
    key: nvcf.nvidia.com/workload
    value: control-plane
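Before enabling this section, confirm your nodes actually carry the expected labels; if the label column is empty, either label your node pools or set enabled: false:

```shell
# Show the workload label for every node (-L adds it as a column)
kubectl get nodes -L nvcf.nvidia.com/workload
```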

cassandra Resource Tuning#

The default Cassandra resource limits may be insufficient for clusters with large instance types (e.g., p5.48xlarge), causing Cassandra pods to be OOM-killed during initialization. If you observe Cassandra pods restarting with OOMKilled status, increase the Cassandra resource requests and limits using a Helmfile release values override (see Overriding Helm Chart Values).

Add a values block to the cassandra release in helmfile.d/01-dependencies.yaml.gotmpl:

- name: cassandra
  version: 0.9.0
  condition: cassandra.enabled
  namespace: cassandra-system
  <<: *dependency
  values:
    - ../global.yaml.gotmpl
    - ../secrets/{{ requiredEnv "HELMFILE_ENV" }}-secrets.yaml
    - cassandra:
        resources:
          limits:
            cpu: "8"
            memory: 8192Mi
          requests:
            cpu: "2"
            memory: 4096Mi

Then apply the change to just Cassandra:

HELMFILE_ENV=<environment-name> helmfile --selector name=cassandra sync

Note

When overriding values on a release that uses <<: *dependency, you must re-include global.yaml.gotmpl and the secrets file in your values list because YAML merge replaces lists entirely. Adjust CPU and memory values to suit your workload.

helm and image Configuration#

The helm and image sections tell NVCF which registries to pull Helm charts and container images from.

  • helm.sources: The OCI registry where NVCF Helm charts are stored. Helmfile pulls charts from here at deploy time (requires local authentication – see Access Requirements).

  • image: The container registry where NVCF service images are stored. Kubernetes pulls images from here at runtime.

# Helm Chart Sources Configuration
helm:
  sources:
    registry: "nvcr.io"
    repository: "YOUR_ORG/YOUR_TEAM"
    # NGC Example:
    # registry: nvcr.io
    # repository: 123456789102/YOUR_TEAM
    # ECR Example:
    # registry: <your-account-id>.dkr.ecr.<your-region>.amazonaws.com
    # repository: <your-ecr-repository-name>

# Container Image Registry Configuration
image:
  registry: nvcr.io
  repository: YOUR_ORG/YOUR_TEAM
  # NGC Example:
  # registry: nvcr.io
  # repository: 123456789102/YOUR_TEAM
  # ECR Example:
  # registry: <your-account-id>.dkr.ecr.<your-region>.amazonaws.com
  # repository: <your-ecr-repository-name>

Warning

If you have mirrored NVCF artifacts to your own registry (e.g., ECR), update both helm.sources and image to point to your mirror. See Image Mirroring for details on mirroring artifacts.

Note

These settings control where images are pulled from, not how Kubernetes authenticates to pull them. If your image registry is private, you may also need to configure image pull secrets – see Step 4.

Tip

Quick Start Summary: If you are using the example EKS environment YAML directly, used nvcf-base to create your cluster, and followed the ingress setup from Step 1, you only need to change:

  1. domain: Replace GATEWAY_ADDR with the load balancer address from Step 1

  2. helm.sources.registry and helm.sources.repository: Point to your Helm chart registry

  3. image.registry and image.repository: Point to your container image registry
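A quick way to catch missed edits is to grep the environment file for leftover placeholders before deploying. This is an illustrative sketch; the file name and placeholder list are assumptions based on the template above:

```shell
# Flag any template placeholders still present in the environment file
ENV_FILE=environments/my-env.yaml   # adjust to your environment name
if grep -nE 'GATEWAY_ADDR|YOUR_ORG|your-account-id|your-ecr-repository-name' "$ENV_FILE"; then
  echo "Placeholders remain; edit $ENV_FILE before deploying" >&2
fi
```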

Overriding Helm Chart Values

The environment file (environments/<environment-name>.yaml) controls global settings like domain, image, and nodeSelectors. However, you may need to override values for a specific Helm chart – for example, to increase Cassandra memory limits or change an image tag for one service.

Helmfile releases support a values property that passes values through to the underlying helm install/helm upgrade command. To add chart-specific overrides, edit the release definition in the appropriate file under helmfile.d/ and add a values block:

# Example: helmfile.d/01-dependencies.yaml.gotmpl
- name: cassandra
  version: 0.9.0
  condition: cassandra.enabled
  namespace: cassandra-system
  <<: *dependency
  values:
    - ../global.yaml.gotmpl
    - ../secrets/{{ requiredEnv "HELMFILE_ENV" }}-secrets.yaml
    - cassandra:
        resources:
          requests:
            cpu: "2"
            memory: 4096Mi
          limits:
            cpu: "8"
            memory: 8192Mi

Note

When a release inherits from a template (<<: *dependency), specifying values on the release replaces the template’s values list (YAML merge does not append lists). You must re-include global.yaml.gotmpl and the secrets file.

The values block is a list of YAML mappings. Keys correspond to the chart’s values.yaml structure. For example, to override a deeply nested value:

values:
  - api:
      image:
        tag: 2.223.9
      env:
        NVCF_REGISTRIES_ACCOUNT_PROVISIONING_ARTIFACT_TYPES: "CONTAINER,HELM"

Values defined here take the highest precedence, overriding both the environment file and global.yaml.gotmpl. Use helmfile template to preview the rendered manifests after adding overrides, then apply to a single release:

# Preview changes
HELMFILE_ENV=<environment-name> helmfile --selector name=cassandra template

# Apply changes to just that release
HELMFILE_ENV=<environment-name> helmfile --selector name=cassandra sync

Step 3. Configure your secrets file (secrets/<environment-name>-secrets.yaml)#

The secrets file contains the sensitive data required for NVCF operation. The image pull credentials you provide here bootstrap the NVCF API with registry credentials for all worker components (function sidecars), function containers, and Helm charts.

These credentials are then used for function deployments. If the registry credentials are incorrect, you can update them later using the steps in Working with Third-Party Registries.

Create your secrets file from the template below (example-secrets.yaml). You must replace all instances of REPLACE_WITH_BASE64_DOCKER_CREDENTIAL with your actual base64-encoded registry credentials.

cd path/to/nvcf-self-managed-stack
touch secrets/<environment-name>-secrets.yaml
# Copy the template into the file

Configuration Template
secrets/example-secrets.yaml#
# Required structure for any environment secrets.
# This is the minimal set of values to provide.

# Notes:
# Cassandra:
#   The password should match the value set in the cassandra keyspace migrations
#
# API:
#   The value for the registry will be used in three places, as it is
#   expected the same registry is used as a single source for all images.
#     openbao.migrations.env[1].value
#     api.accountBootstrap.registryCredentials[0].secret.value
#     api.accountBootstrap.registryCredentials[1].secret.value

openbao:
  migrations:
    env:
      # Stored in OpenBao shared secrets (written by migration job)
      - name: DEFAULT_CASSANDRA_PASSWORD
        value: "ch@ng3m3"
      # Stored in OpenBao KV for nvcf-api (written by migration job)
      - name: NVCF_API_SIDECARS_IMAGE_PULL_SECRET
        value: REPLACE_WITH_BASE64_DOCKER_CREDENTIAL # Replace with base64 credentials (ex. NGC / ECR / etc.) for your registry, refer to Working with Third-Party Registries.
      - name: ADMIN_CLIENT_ID
        value: ncp # <- keep this value

api:
  accountBootstrap:
    registryCredentials:
      - registryHostname: nvcr.io # ECR: <your-account-id>.dkr.ecr.<your-region>.amazonaws.com
        secret:
          name: nvcr-containers # ECR: ecr-containers
          value: REPLACE_WITH_BASE64_DOCKER_CREDENTIAL # Replace with base64 credentials (ex. NGC / ECR / etc.) for your registry, refer to Working with Third-Party Registries.
        artifactTypes: ["CONTAINER"]
        tags: []
        description: "NGC Container registry"
      - registryHostname: helm.ngc.nvidia.com # ECR: <your-account-id>.dkr.ecr.<your-region>.amazonaws.com
        secret:
          name: nvcr-helmcharts # ECR: ecr-helmcharts
          value: REPLACE_WITH_BASE64_DOCKER_CREDENTIAL # Replace with base64 credentials (ex. NGC / ECR / etc.) for your registry, refer to Working with Third-Party Registries.
        artifactTypes: ["HELM"]
        tags: []
        description: "NGC Helm registry"


Note

NVCF supports these registries for function containers (set in api.accountBootstrap.registryCredentials): ACR (Azure), ECR (AWS), NVCR (NVIDIA), VolcEngine CR, JFrog/Artifactory, and Harbor.

Generating Base64-encoded Registry Credentials#

Registry credentials must be base64-encoded in the format username:password. For detailed instructions on setting up credentials for specific registries (including IAM user creation for ECR), see Working with Third-Party Registries.

# Replace YOUR_NGC_API_KEY with your actual personal NGC API key from ngc.nvidia.com
echo -n '$oauthtoken:YOUR_NGC_API_KEY' | base64 -w 0

For AWS ECR, NVCF requires permanent IAM credentials. You must first create a dedicated IAM user with ECR permissions. See Adding AWS ECR Registry Credentials for complete setup instructions.

Once you have created the IAM user and obtained the access keys:

# Replace with your IAM user's access key ID and secret access key
ACCESS_KEY_ID="<access-key-id>"
SECRET_ACCESS_KEY="<secret-access-key>"

echo -n "${ACCESS_KEY_ID}:${SECRET_ACCESS_KEY}" | base64 -w 0

Once you have your VolcEngine Access Key ID and Secret Access Key (see Adding Volcano Engine Container Registry Credentials for full details):

# Replace with your VolcEngine Access Key ID and Secret Access Key
ACCESS_KEY_ID="<access-key-id>"
SECRET_ACCESS_KEY="<secret-access-key>"

echo -n "${ACCESS_KEY_ID}:${SECRET_ACCESS_KEY}" | base64 -w 0
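Whichever registry you use, verify that the encoded value round-trips to the expected username:password pair before pasting it into the secrets file (example values only; substitute your real username and key):

```shell
# Encode, then decode to confirm the credential survives the round trip
ENCODED=$(echo -n 'user:example-key' | base64 -w 0)
echo "$ENCODED" | base64 -d
# Prints: user:example-key
```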

Step 4. Configure image pull secrets (conditional)#

Note

Skip this step if you have mirrored NVCF artifacts to a CSP-managed registry with built-in credential helpers (e.g., AWS ECR with IAM node roles, GKE Artifact Registry with Workload Identity, Azure ACR with managed identity). In those environments Kubernetes can pull images without explicit pull secrets.

The secrets file you configured in Step 3 handles API bootstrap registry credentials – these allow the NVCF API service to pull user function containers at runtime. Separately, Kubernetes itself needs image pull secrets to pull the NVCF control plane service images (API, SIS, Cassandra, etc.) from your registry.

If your image registry is private and your cluster nodes do not have built-in credential helpers, you must create Kubernetes docker-registry secrets in each NVCF namespace and ensure every pod references them.

The recommended approach uses a mutating admission webhook (such as Kyverno) to automatically inject imagePullSecrets into every pod at admission time. This works for all Helm charts uniformly and requires no helmfile modifications. If you already have another mechanism for injecting image pull secrets (e.g., an existing admission controller, an operator, or a GitOps policy engine), use that instead and skip the Kyverno steps below.

1. Create the pull secret in each NVCF namespace (create-nvcr-pull-secrets.sh):

export NGC_API_KEY="<your-ngc-api-key>"

for ns in cassandra-system nats-system nvcf api-keys ess sis nvca-operator vault-system nvca-system; do
  kubectl create namespace "$ns" --dry-run=client -o yaml | kubectl apply -f -
done

for ns in cassandra-system nats-system nvcf api-keys ess sis nvca-operator vault-system nvca-system; do
  kubectl create secret docker-registry nvcr-pull-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="$NGC_API_KEY" \
    --namespace="$ns" \
    --dry-run=client -o yaml | kubectl apply -f -
done

For registries other than NGC, replace --docker-server, --docker-username, and --docker-password with your registry credentials.
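You can decode one of the created secrets to confirm it targets the right registry (a sketch; the jsonpath key is escaped because of the dots in the key name):

```shell
# Decode the docker config from one namespace and inspect the registry entry
kubectl get secret nvcr-pull-secret -n nvcf \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
# Expect JSON with an "auths" entry for your registry (e.g. nvcr.io)
```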

2. Ensure pods reference the secret. The simplest approach is a Kyverno ClusterPolicy that injects the secret into every pod in NVCF namespaces:

# Install Kyverno (if not already installed)
helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update
helm install kyverno kyverno/kyverno -n kyverno --create-namespace

# Apply the policy
kubectl apply -f kyverno-imagepullsecret-policy.yaml

Kyverno ClusterPolicy
kyverno-imagepullsecret-policy.yaml#
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: nvcf-add-imagepullsecrets
  annotations:
    policies.kyverno.io/title: Add Image Pull Secrets to NVCF Pods
    policies.kyverno.io/category: Image Security
    policies.kyverno.io/severity: medium
    policies.kyverno.io/subject: Pod
    policies.kyverno.io/description: >-
      Automatically adds the nvcr-pull-secret to all pods in NVCF namespaces.
      This ensures all pods can pull images from private registries without
      per-chart imagePullSecrets configuration.
spec:
  background: false
  rules:
    - name: add-imagepullsecret-to-nvcf-pods
      match:
        any:
        - resources:
            kinds:
            - Pod
            namespaces:
            - "nvcf"
            - "api-keys"
            - "sis"
            - "ess"
            - "nvca-operator"
            - "nats-system"
            - "cassandra-system"
            - "vault-system"
            - "nvca-system"
      mutate:
        patchStrategicMerge:
          metadata:
            annotations:
              nvcf.nvidia.com/imagepullsecret-injected-by: kyverno
          spec:
            imagePullSecrets:
            - name: nvcr-pull-secret

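Because the policy mutates pods at admission time (background: false), it only affects pods created after it is applied. To confirm injection on a newly created pod, something like the following can be used (the pod selection here is illustrative):

```shell
# Check that a freshly created pod in the nvcf namespace carries the pull secret
kubectl get pods -n nvcf -o jsonpath='{.items[0].spec.imagePullSecrets}'
# Expect a reference to nvcr-pull-secret
```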

Step 5. Deploy the NVCF control plane components#

Set kubectl context to your cluster.

Important

Ensure your local environment is authenticated to the container registry where your NVCF Helm charts are stored (see Access Requirements). Helmfile pulls OCI charts during deployment and will fail if not authenticated.

Before deploying, preview the rendered Kubernetes manifests:

cd path/to/nvcf-self-managed-stack
HELMFILE_ENV=<environment-name> helmfile template

This command will:

  1. Render all Helm charts with your environment and secrets

  2. Run validation hooks

  3. Display the resulting Kubernetes manifests

Important

Review the output carefully to ensure:

  • Container image references are correct

  • Storage classes match your clusters

Deploy the self-managed stack:

HELMFILE_ENV=<environment-name> helmfile sync

Note

The initial deployment takes approximately 5-10 minutes for local development and 10-20 minutes for cloud deployments.

Deployment Progression and Monitoring#

Helmfile will deploy services in the correct order with dependencies:

Phase 1: Dependency Layer (5-10 minutes)

  • NATS messaging service

  • OpenBao (secrets management)

  • Cassandra (database)

  • Helmfile Selector: release-group=dependencies

Phase 2: Control Plane Services (5-10 minutes)

  • NVCF API Service

  • SIS (Spot Instance Service)

  • gRPC Proxy

  • Invocation Service

  • API Keys Service

  • ESS API

  • Notary Service

  • Admin Issuer Proxy

  • Helmfile Selector: release-group=services

Important

Monitor for account bootstrap failures: Once helmfile reaches Phase 2 (control plane services), open a separate terminal and watch events in the nvcf namespace:

kubectl get events -n nvcf -w

The account bootstrap job runs as a post-install hook and is the most common failure point (usually due to environment or secrets misconfiguration). If it fails, see Recovering from Partial Deployments for recovery steps.

Phase 3: Ingress Configuration (1-2 minutes)

  • Gateway API Routes (if enabled)

  • Helmfile Selector: release-group=ingress

Phase 4: GPU Cluster Components (1-2 minutes)

  • NVCA Operator (Cluster Agent) for cluster registration and function deployment

  • Fake GPU Operator (optional, for development environments)

  • Helmfile Selector: release-group=workers
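The same selectors can be used to roll the stack out one phase at a time instead of a single `helmfile sync` (a sketch; run the groups in the order shown, since later groups depend on earlier ones):

```shell
# Deploy one release group at a time, in dependency order.
HELMFILE_ENV=<environment-name> helmfile --selector release-group=dependencies sync
HELMFILE_ENV=<environment-name> helmfile --selector release-group=services sync
HELMFILE_ENV=<environment-name> helmfile --selector release-group=ingress sync
HELMFILE_ENV=<environment-name> helmfile --selector release-group=workers sync
```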

Tip

To customize NVCA feature flags (e.g., LogPosting, MultiNodeWorkloads) during installation, see Managing Feature Flags for helmfile-specific examples.

Open a separate terminal to monitor the deployment progress:

Monitor Each Deployment Phase:

# Preliminary: check namespace creation and preparation
kubectl get ns

# Phase 1: Check dependency services (release-group=dependencies)
kubectl get pods -n nats-system        # Should see nats-0, nats-1, nats-2
kubectl get pods -n vault-system       # Should see openbao-server-0, openbao-server-1, openbao-server-2
kubectl get pods -n cassandra-system   # Should see cassandra-0, cassandra-1, cassandra-2
# Note: cassandra-initialize-cluster pods in "Error" status are expected while the
# job retries; the deployment is healthy once one pod shows "Completed" and
# cassandra-migrations is Running/Completed.

# Phase 2: Check control plane services (release-group=services)
kubectl get events -n nvcf -w       # Watch for account bootstrap failures
kubectl get pods -n nvcf            # API, invocation-service, grpc-proxy, notary-service
kubectl get pods -n sis             # Spot Instance Service
kubectl get pods -n api-keys        # API Keys service, admin-issuer-proxy
...

# Phase 3: Check ingress (release-group=ingress)
kubectl get httproutes -A          # Gateway API routes (if enabled)

# Phase 4: Check worker components (release-group=workers)
kubectl get pods -n nvca-operator   # Should see nvca-operator pod

Note

Cassandra initialization pods showing “Error” is expected. The cassandra-initialize-cluster job runs multiple pods in parallel and retries on failure. It is normal to see one or more pods with Error status. The deployment is healthy as long as at least one initialization pod reaches Completed and the cassandra-migrations job completes successfully.

Tip

If any pod remains in Pending, ContainerCreating, or ImagePullBackOff state for more than 5 minutes, see Troubleshooting / FAQ for issue identification commands and solutions.
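To find the offending pods quickly, you can filter out healthy ones and then inspect a stuck pod's events (the pod and namespace names are placeholders):

```shell
# List pods that are neither Running nor Completed, across all namespaces.
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Describe a stuck pod; scheduling and image-pull problems show up under Events.
kubectl describe pod <pod-name> -n <namespace>
```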

Recovering from Partial Deployments#

Warning

Do not attempt to fix a partially failed deployment by re-running helmfile sync or helmfile apply. Helm releases in a failed state will skip initialization hooks on subsequent runs, leading to incomplete deployments that appear successful but don’t function correctly.

Redeploying Dependencies (if needed):

If a dependency service (Cassandra, NATS, OpenBao) fails or gets stuck, you can safely redeploy it individually:

# Redeploy only Cassandra
HELMFILE_ENV=<environment-name> helmfile --selector name=cassandra apply

# Redeploy all dependencies (NATS, Cassandra, OpenBao)
HELMFILE_ENV=<environment-name> helmfile --selector release-group=dependencies apply

Recovering from Service Failures (without destroying dependencies):

If the release-group=services deployment hangs or fails (for example, account bootstrap failure due to secrets misconfiguration), you can recover without destroying your dependencies.

1. Monitor for failures:

In a separate terminal, watch events in the nvcf namespace:

kubectl get events -n nvcf -w

2. Check the account bootstrap logs (if it failed):

kubectl logs job/nvcf-api-account-bootstrap -n nvcf

Note

The bootstrap job auto-deletes after ~5 minutes. Watch events to catch failures in real time.

3. Check the NVCF API logs for detailed error messages:

kubectl logs -n nvcf -l app.kubernetes.io/name=nvcf-api --tail=100

4. Fix the root cause (e.g., correct your secrets/<environment-name>-secrets.yaml file).

5. Destroy the services and downstream releases:

# Destroy services release group
HELMFILE_ENV=<environment-name> helmfile --selector release-group=services destroy

# Destroy downstream releases (ingress, workers, admin-issuer-proxy)
HELMFILE_ENV=<environment-name> helmfile --selector release-group=ingress destroy
HELMFILE_ENV=<environment-name> helmfile --selector release-group=workers destroy
HELMFILE_ENV=<environment-name> helmfile --selector name=admin-issuer-proxy destroy

6. Clean up the service namespaces:

kubectl delete namespace nvcf api-keys ess sis nvca-operator nvca-system nvca-backend --ignore-not-found

7. Recreate namespaces and labels (required for Gateway API routing):

kubectl create namespace api-keys && \
kubectl create namespace ess && \
kubectl create namespace sis && \
kubectl create namespace nvcf

kubectl label namespace api-keys nvcf/platform=true && \
kubectl label namespace sis nvcf/platform=true && \
kubectl label namespace ess nvcf/platform=true && \
kubectl label namespace nvcf nvcf/platform=true
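To confirm the labels landed (Gateway API routing selects namespaces by this label), list the labeled namespaces; all four should appear:

```shell
# Expect api-keys, ess, sis, and nvcf in the output.
kubectl get namespaces -l nvcf/platform=true
```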

8. Re-sync services (this triggers fresh post-install hooks):

HELMFILE_ENV=<environment-name> helmfile --selector release-group=services sync

9. Sync remaining releases after services succeed:

HELMFILE_ENV=<environment-name> helmfile --selector name=admin-issuer-proxy sync
HELMFILE_ENV=<environment-name> helmfile --selector release-group=ingress sync
HELMFILE_ENV=<environment-name> helmfile --selector release-group=workers sync

Full Restart (if dependencies are also broken):

If dependencies are corrupted or you prefer a clean slate, follow the complete Uninstalling steps, fix your configuration, then redeploy from Step 1.

Recovering from Gateway Address Changes#

If your Gateway or its underlying load balancer was deleted and recreated (e.g., due to a TCPRoute misconfiguration or infrastructure change), the external address will change. Services that depend on the domain value – including Gateway API routes, SIS cluster registration, and API hostname resolution – will break until the new address is propagated.

1. Get the new Gateway address:

GATEWAY_ADDR=$(kubectl get gateway nvcf-gateway -n envoy-gateway -o jsonpath='{.status.addresses[0].value}')
echo "$GATEWAY_ADDR"

2. Update your environment file with the new address:

# Edit environments/<environment-name>.yaml
# Change: domain: "OLD_ADDRESS"
# To:     domain: "NEW_GATEWAY_ADDR"
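If you prefer to script the edit, a `sed` one-liner can rewrite the `domain:` line in place (a sketch; it assumes the key appears as a single `domain: "..."` line in the environment file):

```shell
# Rewrite the domain line with the new gateway address; keeps a .bak backup.
ENV_FILE="environments/<environment-name>.yaml"
sed -i.bak "s|^\([[:space:]]*domain:\).*|\1 \"${GATEWAY_ADDR}\"|" "$ENV_FILE"
grep 'domain:' "$ENV_FILE"   # confirm the new value
```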

3. Re-sync ingress and services that depend on the domain:

# Re-sync gateway routes (picks up new domain)
HELMFILE_ENV=<environment-name> helmfile --selector release-group=ingress sync

# Re-sync services that embed the domain (API, admin-issuer-proxy)
HELMFILE_ENV=<environment-name> helmfile --selector release-group=services sync
HELMFILE_ENV=<environment-name> helmfile --selector name=admin-issuer-proxy sync

4. Verify routes are using the new address:

kubectl get httproutes -A
kubectl get tcproutes -A

Tip

If you encounter issues during deployment, consult the Troubleshooting / FAQ guide for common problems and solutions.

Step 6: Verify the Installation#

Verify that the installation succeeded by confirming that all pods are healthy and all Helm releases deployed.

# View all pods with node assignment and status; all should be in Running or Completed state
kubectl get pods -A -o wide

# Check helm releases status
helm list -A
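To surface anything unhealthy without scanning the full table, you can filter for releases whose status is not `deployed` (a sketch; assumes `jq` is installed):

```shell
# Print any Helm release that is not in "deployed" status; empty output means all releases are healthy.
helm list -A -o json | jq -r '.[] | select(.status != "deployed") | "\(.namespace)/\(.name): \(.status)"'
```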

Verify API Connectivity (Optional)#

If you configured Gateway API ingress, you can verify the NVCF API is accessible by running the following commands.

1. Set up environment variables:

# Get the Gateway address (from Step 1)
export GATEWAY_ADDR=$(kubectl get gateway nvcf-gateway -n envoy-gateway -o jsonpath='{.status.addresses[0].value}')
echo "Gateway Address: $GATEWAY_ADDR"

2. Generate an admin token:

# Generate an admin API token
export NVCF_TOKEN=$(curl -s -X POST "http://${GATEWAY_ADDR}/v1/admin/keys" \
  -H "Host: api-keys.${GATEWAY_ADDR}" \
  | jq -r '.value')

echo "Token generated: ${NVCF_TOKEN:0:20}..."
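If the echoed token is empty, the request most likely failed; a small guard makes that explicit before you continue (a sketch, reusing the `NVCF_TOKEN` variable set above):

```shell
# Fail fast on an empty token; re-run the curl without the pipe to inspect the raw response.
if [ -z "${NVCF_TOKEN}" ]; then
  echo "NVCF_TOKEN is empty - re-run the request and inspect the raw response" >&2
fi
```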

3. List functions (should be empty initially):

# List all functions
curl -s -X GET "http://${GATEWAY_ADDR}/v2/nvcf/functions" \
  -H "Host: api.${GATEWAY_ADDR}" \
  -H "Authorization: Bearer ${NVCF_TOKEN}" | jq .

4. (Optional) Create, deploy, and invoke a test function:

# Create a test function
# Replace <YOUR_REGISTRY>/<YOUR_REPO> with your container registry
# This should match the registry you set in the secrets file
curl -s -X POST "http://${GATEWAY_ADDR}/v2/nvcf/functions" \
  -H "Host: api.${GATEWAY_ADDR}" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${NVCF_TOKEN}" \
  -d '{
    "name": "my-echo-function",
    "inferenceUrl": "/echo",
    "healthUri": "/health",
    "inferencePort": 8000,
    "containerImage": "<YOUR_REGISTRY>/<YOUR_REPO>/load_tester_supreme:0.0.8"
  }' | jq .

# Extract function and version IDs from the response
export FUNCTION_ID=<function-id-from-response>
export FUNCTION_VERSION_ID=<version-id-from-response>

# Deploy the function
# Adjust instanceType and gpu based on your cluster configuration
# Instance Type Examples: NCP.GPU.A10G_1x, NCP.GPU.H100_1x, NCP.GPU.L40S_1x, etc.
# GPU Examples: A10G, H100, L40S, etc.
curl -s -X POST "http://${GATEWAY_ADDR}/v2/nvcf/deployments/functions/${FUNCTION_ID}/versions/${FUNCTION_VERSION_ID}" \
  -H "Host: api.${GATEWAY_ADDR}" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${NVCF_TOKEN}" \
  -d '{
    "deploymentSpecifications": [
      {
        "instanceType": "NCP.GPU.A10G_1x",
        "backend": "nvcf-default",
        "gpu": "A10G",
        "maxInstances": 1,
        "minInstances": 1
      }
    ]
  }' | jq .

# Generate an API key for invocation (note the required scopes; the NVCF OpenAPI spec on the "API" page documents the scopes for each endpoint)
# Set expiration to 1 day from now (required field)
EXPIRES_AT=$(date -u -v+1d '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -d '+1 day' '+%Y-%m-%dT%H:%M:%SZ')
SERVICE_ID="nvidia-cloud-functions-ncp-service-id-aketm"

export API_KEY=$(curl -s -X POST "http://${GATEWAY_ADDR}/v1/keys" \
  -H "Host: api-keys.${GATEWAY_ADDR}" \
  -H "Content-Type: application/json" \
  -H "Key-Issuer-Service: nvcf-api" \
  -H "Key-Issuer-Id: ${SERVICE_ID}" \
  -H "Key-Owner-Id: test@nvcf-api.local" \
  -d '{
    "description": "test invocation key",
    "expires_at": "'"${EXPIRES_AT}"'",
    "authorizations": {
      "policies": [{
        "aud": "'"${SERVICE_ID}"'",
        "auds": ["'"${SERVICE_ID}"'"],
        "product": "nv-cloud-functions",
        "resources": [
          {"id": "*", "type": "account-functions"},
          {"id": "*", "type": "authorized-functions"}
        ],
        "scopes": ["invoke_function", "list_functions", "queue_details", "list_functions_details"]
      }]
    },
    "audience_service_ids": ["'"${SERVICE_ID}"'"]
  }' | jq -r '.value')

echo "API Key: ${API_KEY:0:20}..."

# Wait for deployment to be ready (list functions to see status), then invoke the function
# Uses wildcard subdomain routing: <function-id>.invocation.<gateway-addr>
curl -s -X POST "http://${GATEWAY_ADDR}/echo" \
  -H "Host: ${FUNCTION_ID}.invocation.${GATEWAY_ADDR}" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"message": "hello world", "repeats": 1}' | jq .

Note

The backend value should match the cluster group name registered by the NVCA operator. The instanceType and gpu values depend on the GPU types available in your cluster.

For invocation, the Host header uses wildcard subdomain routing: <function-id>.invocation.<gateway-addr>. The URL path should match the function’s inferenceUrl (e.g., /echo).

Next Steps#

Now that your NVCF control plane is installed, you can use the NVCF CLI for easier function management:

  • Create, deploy, and invoke functions with simple commands

  • Create or update registry credentials without manual API calls

See Self-hosted CLI for installation and usage instructions.

Uninstalling#

Warning

This will delete all NVCF resources including data stored in persistent volumes. Ensure you have backups of any important data.

To remove the NVCF installation:

HELMFILE_ENV=<environment-name> helmfile destroy

After helmfile destroy completes, clean up the namespaces:

# Delete NVCF namespaces
kubectl delete namespace cassandra-system nats-system vault-system \
  nvcf api-keys ess sis nvca-operator nvca-system nvca-backend \
  --ignore-not-found

To also remove the Gateway infrastructure created in Step 1:

# Delete the Gateway and GatewayClass resources
kubectl delete gateway nvcf-gateway -n envoy-gateway --ignore-not-found
kubectl delete gatewayclass eg --ignore-not-found

# Uninstall Envoy Gateway
helm uninstall eg -n envoy-gateway-system

# Delete the gateway namespaces
kubectl delete namespace envoy-gateway envoy-gateway-system --ignore-not-found

# (Optional) Remove Gateway API CRDs if no longer needed
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/experimental-install.yaml

Handling Stuck Resources During Uninstall#

If helmfile destroy hangs or namespaces remain stuck in Terminating state, this is typically caused by finalizers on NVCA resources (NVCFBackends, function pods, etc.). Use the Appendix: NVCA Force Cleanup Script to remove stuck NVCA components.