Infrastructure | NVIDIA Cloud Functions

This phase installs the three infrastructure services that all NVCF core services depend on: NATS (messaging), OpenBao (secrets management), and Cassandra (persistence).

Complete all steps in standalone-prerequisites before proceeding. You should have your shell variables (REGISTRY, REPOSITORY, STORAGE_CLASS, STORAGE_SIZE, CASSANDRA_PASSWORD, REGISTRY_CREDENTIAL_B64) exported and namespaces created.
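Before installing anything, it can help to confirm that the variables listed above are actually set in the current shell. The following is a minimal sketch, not part of the NVCF tooling: the function name `check_required_vars` is hypothetical, and it only checks that each variable is non-empty, not that its value is valid.

```shell
# Report every named variable that is unset or empty.
# Returns 0 when all are set, 1 otherwise.
check_required_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_required_vars REGISTRY REPOSITORY STORAGE_CLASS STORAGE_SIZE \
#     CASSANDRA_PASSWORD REGISTRY_CREDENTIAL_B64
```

A non-zero return means at least one variable still needs to be set as described in standalone-prerequisites.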

NATS

NATS provides the messaging backbone for inter-service communication across all NVCF components.

Chart	`helm-nvcf-nats`
Version	`0.5.0`
Namespace	`nats-system`
Depends on	None

Configuration

Create nats-values.yaml with your registry settings:

nats-values.yaml

# NATS values for standalone installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.
#
# Example:
#   REGISTRY: nvcr.io
#   REPOSITORY: YOUR_ORG/YOUR_TEAM

nats:
  container:
    image:
      registry: "<REGISTRY>"
      repository: "<REPOSITORY>/nats-server"

  reloader:
    image:
      registry: "<REGISTRY>"
      repository: "<REPOSITORY>/nats-server-config-reloader"

  natsBox:
    container:
      image:
        registry: "<REGISTRY>"
        repository: "<REPOSITORY>/nats-box"

  nkeyJob:
    image:
      registry: "<REGISTRY>"
      repository: "<REPOSITORY>/alpine-k8s"

  # Uncomment and set if using node selectors
  # podTemplate:
  #   merge:
  #     spec:
  #       nodeSelector:
  #         nvcf.nvidia.com/workload: control-plane

  # Uncomment and set to configure storage class for JetStream
  # config:
  #   jetstream:
  #     fileStore:
  #       pvc:
  #         storageClassName: "<STORAGE_CLASS>"

Replace all <REGISTRY> and <REPOSITORY> placeholders with your actual registry values.

If you are using a custom storage class, uncomment the config.jetstream.fileStore.pvc.storageClassName section and set it to your storage class.

If you are using node selectors (e.g., with nvcf-base EKS clusters), uncomment the podTemplate section.
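The placeholder replacement described above can also be scripted instead of edited by hand. This is a rough sketch under a few assumptions: the function name `render_values` is hypothetical, REGISTRY and REPOSITORY are the shell variables from standalone-prerequisites, and neither value contains `|`, `&`, or a backslash (those characters would need escaping for sed).

```shell
# Substitute <REGISTRY> and <REPOSITORY> placeholders in text read from stdin.
# "|" is used as the sed delimiter because REPOSITORY usually contains "/".
render_values() {
  sed -e "s|<REGISTRY>|${REGISTRY}|g" \
      -e "s|<REPOSITORY>|${REPOSITORY}|g"
}

# Usage (the .tmpl file name is a hypothetical copy of the template):
#   render_values < nats-values.yaml.tmpl > nats-values.yaml
```

The same function works for the OpenBao and Cassandra values files, since they use the same two placeholders. Review the rendered file before passing it to helm.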

Install

$ helm upgrade --install nats \
>   oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-nats \
>   --version 0.5.0 \
>   --namespace nats-system \
>   --wait --timeout 15m \
>   -f nats-values.yaml

Verify

$ kubectl get pods -n nats-system
$ 
$ # Expected output (3 replicas by default):
$ # NAME     READY   STATUS    RESTARTS   AGE
$ # nats-0   2/2     Running   0          2m
$ # nats-1   2/2     Running   0          2m
$ # nats-2   2/2     Running   0          2m

Verify the NATS cluster has formed:

$ kubectl logs nats-0 -n nats-system -c nats | grep "Cluster Name"

If pods remain in Pending state, check that your storage class is available and that nodes satisfy any configured node selectors.

OpenBao

OpenBao provides Vault-compatible secrets management. It handles secret injection into NVCF service pods and stores sensitive configuration such as Cassandra credentials and registry pull secrets.

Chart	`helm-nvcf-openbao-server`
Version	`0.27.1`
Namespace	`vault-system`
Depends on	NATS (must be running)

NATS must be running and healthy before installing OpenBao. The OpenBao migration job communicates with NATS during initialization.

Configuration

Create openbao-values.yaml with your registry and secret settings:

openbao-values.yaml

# OpenBao values for standalone installation
# Replace <REGISTRY>, <REPOSITORY>, and secret values with your settings.
#
# Example:
#   REGISTRY: nvcr.io
#   REPOSITORY: YOUR_ORG/YOUR_TEAM

openbao:
  migrations:
    image:
      registry: "<REGISTRY>"
      repository: "<REPOSITORY>/nvcf-openbao-migrations"
    issuerDiscovery:
      enabled: true  # Recommended true for EKS (discovers OIDC issuer automatically)
    env:
      - name: DEFAULT_CASSANDRA_PASSWORD
        value: "ch@ng3m3"  # Must match Cassandra superuser password
      - name: NVCF_API_SIDECARS_IMAGE_PULL_SECRET
        value: "<REGISTRY_CREDENTIAL_B64>"  # base64 of $oauthtoken:<NGC_API_KEY>
      - name: ADMIN_CLIENT_ID
        value: ncp  # Do not change

  injector:
    image:
      registry: "<REGISTRY>"
      repository: "<REPOSITORY>/oss-vault-k8s"
    agentImage:
      registry: "<REGISTRY>"
      repository: "<REPOSITORY>/nvcf-openbao"
    replicas: 2
    podDisruptionBudget:
      minAvailable: 1
    # Uncomment for node selectors
    # nodeSelector:
    #   nvcf.nvidia.com/workload: vault

  server:
    image:
      registry: "<REGISTRY>"
      repository: "<REPOSITORY>/nvcf-openbao"
    dataStorage:
      size: "10Gi"  # 20-50Gi recommended for production
      # storageClass: "<STORAGE_CLASS>"
    # Uncomment for node selectors
    # nodeSelector:
    #   nvcf.nvidia.com/workload: vault
    extraContainers:
      - name: auto-unseal-sidecar
        image: "<REGISTRY>/<REPOSITORY>/nvcf-openbao:2.5.1-nv-1.1.0"
        volumeMounts:
          - name: openbao-server-unseal
            mountPath: /vault/userconfig/unseal
            readOnly: true
        command: ["/bin/sh", "-c"]
        args:
          - |
            echo "Starting auto-unseal monitor..."
            export BAO_ADDR=http://$HOSTNAME:8200
            while true; do
              if [ -f /vault/userconfig/unseal/unseal_key ]; then
                UNSEAL_KEY=$(cat /vault/userconfig/unseal/unseal_key)
                if [ -n "$UNSEAL_KEY" ]; then
                  # Re-submit the key every 60 seconds so a restarted server is unsealed again
                  bao operator unseal "$UNSEAL_KEY"
                  sleep 60
                  continue
                else
                  echo "Unseal key is empty, waiting..."
                fi
              else
                echo "Unseal key file not found, waiting..."
              fi
              sleep 10
            done

Replace the following placeholders:

`<REGISTRY>`	Your container image registry
`<REPOSITORY>`	Your image repository path
`<REGISTRY_CREDENTIAL_B64>`	Base64-encoded registry credential (see standalone-prerequisites)

If you are using a custom storage class, uncomment dataStorage.storageClass and set it appropriately.

If you are using node selectors, uncomment the nodeSelector sections under both injector and server.
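The values file describes `<REGISTRY_CREDENTIAL_B64>` as the base64 encoding of `$oauthtoken:<NGC_API_KEY>`. If you need to regenerate it, the sketch below shows one way to produce that string. The function name `encode_registry_credential` and the NGC_API_KEY variable name are assumptions for illustration; the procedure in standalone-prerequisites remains the authoritative one.

```shell
# Encode "$oauthtoken:<NGC_API_KEY>" as a single-line base64 string.
# '$oauthtoken' is single-quoted because it is a literal user name,
# not a shell variable to expand.
encode_registry_credential() {
  printf '%s:%s' '$oauthtoken' "$1" | base64 | tr -d '\n'
}

# Usage:
#   REGISTRY_CREDENTIAL_B64=$(encode_registry_credential "$NGC_API_KEY")
```

The `tr -d '\n'` step removes the line wrapping that some base64 implementations add to long output, so the result can be pasted into the YAML value as one line.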

Install

$ helm upgrade --install openbao-server \
>   oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-openbao-server \
>   --version 0.27.1 \
>   --namespace vault-system \
>   --wait --wait-for-jobs --timeout 15m \
>   -f openbao-values.yaml

The release name must be openbao-server. Other NVCF charts reference this name for service discovery.

Post-Install Hooks

The OpenBao chart runs two post-install jobs automatically. The --wait-for-jobs flag ensures Helm waits for both to complete before returning.

1. Initialize Cluster (openbao-server-initialize-cluster)

This job initializes the OpenBao (Vault) cluster on first install:

Initializes the vault and generates unseal keys
Unseals all server replicas
Saves the unseal key to a Kubernetes secret (openbao-server-unseal) for the auto-unseal sidecar
Enables the Raft storage backend for HA
Registers and enables the JWT secrets plugin
Saves the JWT signing key to a Kubernetes secret (cluster-jwt)

2. Migrations (openbao-server-migrations)

This job runs after the cluster is initialized and configures OpenBao for NVCF services:

Creates KV secret stores for each NVCF service (api, sis, ess, invocation-service, etc.)
Writes the Cassandra password and registry pull secret (from your values file) into the vault
Configures Kubernetes JWT authentication backends so each service can authenticate using its service account
Creates service-specific policies that control which secrets each service can access
Sets up JWT signing roles used by SIS for cluster agent authentication

Both jobs must complete successfully before core services can start. If either job fails, the core services will not be able to authenticate with OpenBao. Check job logs for troubleshooting (see below).

Verify

$ kubectl get pods -n vault-system
$ 
$ # Expected output (3 server replicas + 2 injector replicas + 2 completed jobs):
$ # NAME                                        READY   STATUS      RESTARTS   AGE
$ # openbao-server-0                            2/2     Running     0          5m
$ # openbao-server-1                            2/2     Running     0          5m
$ # openbao-server-2                            2/2     Running     0          5m
$ # openbao-server-agent-injector-...           1/1     Running     0          5m
$ # openbao-server-agent-injector-...           1/1     Running     0          5m
$ # openbao-server-initialize-cluster-...       0/1     Completed   0          5m
$ # openbao-server-migrations-...               0/1     Completed   0          4m

Verify both post-install jobs completed:

$ kubectl get jobs -n vault-system
$ 
$ # Both jobs should show COMPLETIONS 1/1:
$ # NAME                                STATUS     COMPLETIONS   DURATION   AGE
$ # openbao-server-initialize-cluster   Complete   1/1           21s        5m
$ # openbao-server-migrations           Complete   1/1           8s         4m
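For scripted installs, the job check above can be turned into a pass/fail test. The following is a sketch only: the function name `check_jobs_succeeded` is hypothetical, and it assumes each job needs exactly one successful completion, as in the expected output shown above.

```shell
# Read "NAME SUCCEEDED" pairs on stdin, one job per line.
# Prints the name of every job whose succeeded count is not 1.
# Returns 0 only if at least one job was listed and all of them succeeded.
check_jobs_succeeded() {
  awk '
    { seen++ }
    $2 != "1" { print $1; bad++ }
    END { exit (seen > 0 && bad == 0) ? 0 : 1 }
  '
}

# Usage (kubectl prints <none> when .status.succeeded is not set yet):
#   kubectl get jobs -n vault-system --no-headers \
#     -o custom-columns=NAME:.metadata.name,SUCCEEDED:.status.succeeded \
#     | check_jobs_succeeded
```

The same pipeline can be pointed at another namespace, such as cassandra-system, by changing the -n argument.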

Check that OpenBao is initialized and unsealed:

$ kubectl exec -n vault-system openbao-server-0 -- bao status
$ 
$ # Look for:
$ #   Initialized     true
$ #   Sealed          false
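The command above checks only openbao-server-0. As a hedged sketch, the two fields can be checked programmatically so the test is easy to repeat for each replica; the function name `is_unsealed` is hypothetical, and it relies only on the Initialized and Sealed rows shown above.

```shell
# Read `bao status` output on stdin.
# Succeeds only if it reports Initialized true and Sealed false.
is_unsealed() {
  awk '
    $1 == "Initialized" { init = $2 }
    $1 == "Sealed"      { sealed = $2 }
    END { exit (init == "true" && sealed == "false") ? 0 : 1 }
  '
}

# Usage, looping over the three server replicas from the expected output:
#   for i in 0 1 2; do
#     kubectl exec -n vault-system "openbao-server-$i" -- bao status \
#       | is_unsealed || echo "openbao-server-$i is not ready"
#   done
```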

Troubleshooting

Initialize cluster job fails: Check the init job logs:

$ kubectl logs -n vault-system -l job-name=openbao-server-initialize-cluster --tail=100

Migration job fails: Check the migration job logs for details:

$ kubectl logs -n vault-system -l job-name=openbao-server-migrations --tail=100

Server remains sealed: The auto-unseal sidecar reads from a Kubernetes secret. Verify the unseal key secret exists:

$ kubectl get secret -n vault-system | grep unseal

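Stale resources from previous install: If reinstalling OpenBao after a failed attempt, uninstall the release and delete all remaining resources in the namespace first to avoid conflicts with leftover secrets, configmaps, and jobs. This also deletes the persistent volume claims, so any data already stored in OpenBao is discarded: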

$ helm uninstall openbao-server -n vault-system
$ kubectl delete all,cm,secret,pvc,job --all -n vault-system --ignore-not-found

Cassandra

Apache Cassandra provides the persistence layer for NVCF. It stores function metadata, deployment state, and other operational data.

Chart	`helm-nvcf-cassandra`
Version	`0.11.1`
Namespace	`cassandra-system`
Depends on	None (can be installed in parallel with NATS)

Configuration

Create cassandra-values.yaml with your registry and storage settings:

cassandra-values.yaml

# Cassandra values for standalone installation
# Replace <REGISTRY>, <REPOSITORY>, and storage settings with your configuration.
#
# Example:
#   REGISTRY: nvcr.io
#   REPOSITORY: YOUR_ORG/YOUR_TEAM

cassandra:
  global:
    security:
      allowInsecureImages: true
    # Uncomment to set default storage class
    # defaultStorageClass: "<STORAGE_CLASS>"

  replicaCount: 3  # Use 1 for local development only

  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/bitnami-cassandra"

  dynamicSeedDiscovery:
    image:
      registry: "<REGISTRY>"
      repository: "<REPOSITORY>/bitnami-cassandra"

  migrations:
    image:
      registry: "<REGISTRY>"
      repository: "<REPOSITORY>/nvcf-cassandra-migrations"

  initialization:
    image:
      registry: "<REGISTRY>"
      repository: "<REPOSITORY>/alpine-k8s"

  persistence:
    size: "10Gi"  # 50-100Gi recommended for production

  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: cassandra

Replace all <REGISTRY> and <REPOSITORY> placeholders with your actual registry values.

Adjust persistence.size based on your expected data volume (50-100Gi recommended for production).

If you are using node selectors, uncomment the nodeSelector section.

For local development with a single node, set replicaCount: 1. Production deployments should use a minimum of 3 replicas.

Install

$ helm upgrade --install cassandra \
>   oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-cassandra \
>   --version 0.11.1 \
>   --namespace cassandra-system \
>   --wait --wait-for-jobs --timeout 15m \
>   -f cassandra-values.yaml

Verify

$ kubectl get pods -n cassandra-system
$ 
$ # Expected output (3 replicas by default):
$ # NAME          READY   STATUS      RESTARTS   AGE
$ # cassandra-0   1/1     Running     0          8m
$ # cassandra-1   1/1     Running     0          6m
$ # cassandra-2   1/1     Running     0          4m

Cassandra initialization pods showing “Error” is expected. The cassandra-initialize-cluster job runs multiple pods in parallel and retries on failure. It is normal to see one or more pods with Error status. The deployment is healthy as long as at least one initialization pod reaches Completed and the cassandra-migrations job completes successfully.

Check the initialization and migration jobs:

$ kubectl get jobs -n cassandra-system
$ 
$ # Both jobs should show COMPLETIONS 1/1:
$ # NAME                             COMPLETIONS   DURATION   AGE
$ # cassandra-initialize-cluster     1/1           45s        8m
$ # cassandra-migrations             1/1           30s        7m

Verify that every Cassandra node is up and has joined the cluster:

$ kubectl exec -n cassandra-system cassandra-0 -- nodetool status
$ 
$ # All nodes should show UN (Up/Normal) status
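The UN check can be automated by scanning the node rows of the `nodetool status` output. This is a minimal sketch assuming the standard output layout, where each node row begins with a two-letter status code; the function name `all_nodes_up_normal` is hypothetical, and the default of 3 expected nodes matches `replicaCount: 3` in the values file.

```shell
# Read `nodetool status` output on stdin. Node rows start with a two-letter
# code: U or D (up/down) followed by N, L, J, or M
# (normal/leaving/joining/moving).
# Succeeds only if at least $1 node rows exist (default 3) and all are UN.
all_nodes_up_normal() {
  expected=${1:-3}
  awk -v expected="$expected" '
    /^[UD][NLJM][ \t]/ { total++; if ($1 == "UN") up++ }
    END { exit (total >= expected && up == total) ? 0 : 1 }
  '
}

# Usage:
#   kubectl exec -n cassandra-system cassandra-0 -- nodetool status \
#     | all_nodes_up_normal 3
```

Pass 1 instead of 3 for a single-node development cluster.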

Troubleshooting

Pods stuck in Pending: Verify your storage class can provision PVCs of the requested size. Some cloud providers (e.g., AWS EBS gp3) have minimum PVC size requirements.

Initialization job retries: This is normal. The initialization job may fail several times while Cassandra nodes are still starting. As long as one pod eventually reaches Completed, the cluster is healthy.

Migration job fails: Check migration logs:

$ kubectl logs -n cassandra-system -l job-name=cassandra-migrations --tail=100

Verify All Infrastructure

Before proceeding to the core services, confirm all three infrastructure components are healthy:

$ echo "=== NATS ==="
$ kubectl get pods -n nats-system
$ 
$ echo "=== OpenBao ==="
$ kubectl get pods -n vault-system
$ 
$ echo "=== Cassandra ==="
$ kubectl get pods -n cassandra-system

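All pods should be in Running or Completed state. The one exception is the Cassandra initialization pods, which may show Error as long as one of them reached Completed (see the Cassandra section). If any other pods are unhealthy, resolve the issues before continuing.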
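To avoid reading three pod listings by eye, the check can be sketched as a small filter over `kubectl get pods --no-headers` output. The function name `report_unhealthy_pods` is hypothetical. Note that it also flags Cassandra initialization pods in Error state, which the Cassandra section describes as expected, so those entries can be ignored.

```shell
# Read `kubectl get pods --no-headers` rows (NAME READY STATUS ...) on stdin.
# Prints each pod that is neither Completed nor Running with all of its
# containers ready. Returns 1 if any such pod was printed, 0 otherwise.
report_unhealthy_pods() {
  awk '
    $3 == "Completed" { next }
    $3 == "Running" {
      split($2, ready, "/")
      if (ready[1] == ready[2]) next
    }
    { print $1 " " $2 " " $3; bad++ }
    END { exit (bad > 0) ? 1 : 0 }
  '
}

# Usage:
#   for ns in nats-system vault-system cassandra-system; do
#     echo "=== $ns ==="
#     kubectl get pods -n "$ns" --no-headers | report_unhealthy_pods
#   done
```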

Next Steps

Once all infrastructure dependencies are running, proceed to standalone-core-services to install the NVCF control plane services.