Phase 2: Core Services


This phase installs the NVCF control plane services. These services depend on the infrastructure components installed in standalone-infrastructure.

All three infrastructure dependencies (NATS, OpenBao, Cassandra) must be running and healthy before proceeding. Verify with:

$kubectl get pods -n nats-system
$kubectl get pods -n vault-system
$kubectl get pods -n cassandra-system
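To gate on readiness rather than scanning pod lists by eye, `kubectl wait` can block until every pod in a namespace is Ready. The sketch below only prints the commands (so they can be reviewed, or piped to `sh`); the namespace names assume the defaults used in this guide:

```shell
# Emit one readiness gate per infrastructure namespace. Run the printed
# commands directly against the cluster, or pipe this script's output to `sh`.
for ns in nats-system vault-system cassandra-system; do
  echo "kubectl wait --for=condition=Ready pods --all -n $ns --timeout=300s"
done
```

`kubectl wait` exits non-zero on timeout, which makes it usable as a gate in automation.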

Install the services in the order shown below. Services with dependencies are noted — wait for the dependency to be healthy before installing the dependent service.

API Keys

API Keys provides authentication token management for all NVCF API interactions.

Chart: helm-nvcf-api-keys
Version: 1.0.4
Namespace: api-keys
Depends on: Infrastructure only

Configuration

Create api-keys-values.yaml (download template):

api-keys-values.yaml
# API Keys values for standalone installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.

apikeys:
  fullnameOverride: api-keys
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/nv-api-keys"

  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: control-plane

Replace <REGISTRY> and <REPOSITORY> with your registry settings.

Install

$helm upgrade --install api-keys \
> oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-api-keys \
> --version 1.0.4 \
> --namespace api-keys \
> --wait --timeout 10m \
> -f api-keys-values.yaml

Verify

$kubectl get pods -n api-keys
$
$# Expected: api-keys pod Running

SIS

The Spot Instance Service (SIS) handles cluster registration and GPU resource management.

Chart: helm-nvcf-sis
Version: 1.8.0
Namespace: sis
Depends on: Infrastructure only

Configuration

Create sis-values.yaml (download template):

sis-values.yaml
# SIS (Spot Instance Service) values for standalone installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.

sis:
  fullnameOverride: spot-instance-service
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/spot"

  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: control-plane

Install

$helm upgrade --install sis \
> oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-sis \
> --version 1.8.0 \
> --namespace sis \
> --wait --timeout 10m \
> -f sis-values.yaml

Verify

$kubectl get pods -n sis
$
$# Expected: spot-instance-service pod Running

ESS API

The ESS (Enterprise Secrets Service) API distributes secrets to NVCF services via OpenBao.

Chart: helm-nvcf-ess-api
Version: 1.3.0
Namespace: ess
Depends on: Infrastructure only

Configuration

Create ess-api-values.yaml (download template):

ess-api-values.yaml
# ESS API values for standalone installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.

ess:
  fullnameOverride: ess-api
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/ess-api"

Install

$helm upgrade --install ess-api \
> oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-ess-api \
> --version 1.3.0 \
> --namespace ess \
> --wait --timeout 10m \
> -f ess-api-values.yaml

Verify

$kubectl get pods -n ess
$
$# Expected: ess-api pod Running

NVCF API

The NVCF API is the primary control plane service. It manages functions, deployments, and account configuration. The API chart includes an account bootstrap job that runs on first install to initialize the NVCF account with registry credentials.

Chart: helm-nvcf-api
Version: 1.13.0
Namespace: nvcf
Depends on: ESS API (must be running)

The ESS API must be running before installing the NVCF API. The account bootstrap job communicates with ESS during initialization.

Configuration

Create nvcf-api-values.yaml (download template):

nvcf-api-values.yaml
# NVCF API values for standalone installation
# Replace <REGISTRY>, <REPOSITORY>, and credential placeholders with your settings.

api:
  fullnameOverride: nvcf-api
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/strap"

  accountBootstrap:
    image:
      registry: "<REGISTRY>"
      repository: "<REPOSITORY>/alpine-k8s"
      tag: 1.30.12
      pullPolicy: IfNotPresent

  # Registry credentials for function container/chart deployments.
  # NGC credentials are used by default. Additional registries (ECR, etc.)
  # can be added post-install via the NVCF CLI or API.
  registryCredentials:
    - registryHostname: nvcr.io
      secret:
        name: nvcr-containers
        value: "<REGISTRY_CREDENTIAL_B64>"  # base64 of $oauthtoken:<NGC_API_KEY>
      artifactTypes: ["CONTAINER"]
      tags: []
      description: "NGC Container registry"
    - registryHostname: helm.ngc.nvidia.com
      secret:
        name: nvcr-helmcharts
        value: "<REGISTRY_CREDENTIAL_B64>"  # base64 of $oauthtoken:<NGC_API_KEY>
      artifactTypes: ["HELM"]
      tags: []
      description: "NGC Helm chart registry"

  limits:
    maxFunctions: 10
    maxTasks: 10
    maxTelemetries: 10
    maxRegistryCreds: 10

  env:
    NVCF_NATS_REGION_PLACEMENT_TAG: "dc"
    NVCF_SIDECARS_HOSTNAME: "<REGISTRY>"
    NVCF_SIDECARS_REPOSITORY: "<REPOSITORY>"

  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: control-plane

Replace the following placeholders:

<REGISTRY>: Your container image registry
<REPOSITORY>: Your image repository path
<REGISTRY_CREDENTIAL_B64>: Base64-encoded registry credential (see standalone-prerequisites)
<HELM_REGISTRY>: Hostname for your Helm chart registry (e.g., helm.ngc.nvidia.com or your ECR hostname)
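The credential value can be generated and sanity-checked with standard shell tools. A minimal sketch; the NGC_API_KEY value here is an illustrative placeholder, and printf is used instead of echo so no trailing newline gets folded into the encoded value:

```shell
# Build the value for <REGISTRY_CREDENTIAL_B64>.
NGC_API_KEY='nvapi-example-key'   # illustrative placeholder, not a real key
CRED_B64=$(printf '%s' "\$oauthtoken:${NGC_API_KEY}" | base64)
echo "$CRED_B64"

# Sanity check: decoding must return the exact username:password string.
printf '%s' "$CRED_B64" | base64 -d
# prints: $oauthtoken:nvapi-example-key
```

Note that some BSD-derived base64 implementations spell the decode flag -D rather than -d.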

Install

$helm upgrade --install api \
> oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-api \
> --version 1.13.0 \
> --namespace nvcf \
> --wait --wait-for-jobs --timeout 15m \
> -f nvcf-api-values.yaml

Monitor for account bootstrap failures. Open a separate terminal and watch events:

$kubectl get events -n nvcf -w

The account bootstrap job is the most common failure point (usually due to misconfigured registry credentials in the values file).

Verify

$kubectl get pods -n nvcf
$
$# Expected: nvcf-api pod Running

Check the bootstrap job completed:

$kubectl get jobs -n nvcf
$
$# The nvcf-api-account-bootstrap job should show COMPLETIONS 1/1

The bootstrap job auto-deletes approximately 5 minutes after it finishes, so check promptly, or monitor events in real time to catch failures.

Troubleshooting

  • Bootstrap job fails: Check the job logs:

    $kubectl logs job/nvcf-api-account-bootstrap -n nvcf
  • Registry credential errors: Verify your <REGISTRY_CREDENTIAL_B64> value is correct. The base64-encoded credential should decode to username:password form; for NGC this is $oauthtoken:<NGC_API_KEY>.

  • Recovering from bootstrap failure: Uninstall the API chart, fix the values, and reinstall:

    $helm uninstall api -n nvcf
    $# Fix nvcf-api-values.yaml
    $helm upgrade --install api ...

Invocation Service

The Invocation Service handles function invocation requests and routes them to worker nodes.

Chart: helm-nvcf-invocation-service
Version: 1.3.1
Namespace: nvcf
Depends on: NVCF API (must be running)

Configuration

Create invocation-service-values.yaml (download template):

invocation-service-values.yaml
# Invocation Service values for standalone installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.

invocation:
  fullnameOverride: invocation-service
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/nvcf-invocation-service"

  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: control-plane

Install

$helm upgrade --install invocation-service \
> oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-invocation-service \
> --version 1.3.1 \
> --namespace nvcf \
> --wait --timeout 10m \
> -f invocation-service-values.yaml

Verify

$kubectl get pods -n nvcf -l app.kubernetes.io/name=invocation-service
$
$# Expected: invocation-service pod Running

gRPC Proxy

The gRPC Proxy enables streaming workloads over gRPC connections.

Chart: helm-nvcf-grpc-proxy
Version: 1.4.0
Namespace: nvcf
Depends on: NVCF API (must be running)

Configuration

Create grpc-proxy-values.yaml (download template):

grpc-proxy-values.yaml
# gRPC Proxy values for standalone installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.

grpcproxy:
  fullnameOverride: grpc-proxy
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/nvcf-grpc-proxy"

  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: control-plane

Install

$helm upgrade --install grpc-proxy \
> oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-grpc-proxy \
> --version 1.4.0 \
> --namespace nvcf \
> --wait --timeout 10m \
> -f grpc-proxy-values.yaml

Verify

$kubectl get pods -n nvcf -l app.kubernetes.io/name=grpc-proxy
$
$# Expected: grpc-proxy pod Running

Notary Service

The Notary Service handles request signing and validation for secure inter-service communication.

Chart: helm-nvcf-notary-service
Version: 1.2.0
Namespace: nvcf
Depends on: Infrastructure only

Configuration

Create notary-service-values.yaml (download template):

notary-service-values.yaml
# Notary Service values for standalone installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.

notary:
  fullnameOverride: notary-service
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/notary-service"

  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: control-plane

Install

$helm upgrade --install notary-service \
> oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-notary-service \
> --version 1.2.0 \
> --namespace nvcf \
> --wait --timeout 10m \
> -f notary-service-values.yaml

Verify

$kubectl get pods -n nvcf -l app.kubernetes.io/name=notary-service
$
$# Expected: notary-service pod Running

Reval

Reval renders Helm chart functions without requiring direct cluster access. It is installed in the nvcf namespace with the helm-reval chart.

Chart: helm-reval
Version: 1.2.2
Namespace: nvcf
Depends on: Infrastructure only

Configuration

Create reval-values.yaml (download template):

reval-values.yaml
# Reval values for standalone installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.

reval:
  fullnameOverride: reval
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/reval-server"

  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: control-plane

Replace <REGISTRY> and <REPOSITORY> with your registry settings.

Install

$helm upgrade --install reval \
> oci://${REGISTRY}/${REPOSITORY}/helm-reval \
> --version 1.2.2 \
> --namespace nvcf \
> --wait --timeout 10m \
> -f reval-values.yaml

Verify

$kubectl get pods -n nvcf -l app.kubernetes.io/name=reval
$
$# Expected: reval pod Running

Admin Token Issuer Proxy

The Admin Token Issuer Proxy provides an admin endpoint for generating API keys without requiring pre-existing credentials. It is used for initial setup and emergency access.

Chart: helm-admin-token-issuer-proxy
Version: 1.2.2
Namespace: api-keys
Depends on: API Keys (must be running)

Configuration

Create admin-issuer-proxy-values.yaml (download template):

admin-issuer-proxy-values.yaml
# Admin Token Issuer Proxy values for standalone installation
# Replace <REGISTRY>, <REPOSITORY>, and <DOMAIN> with your settings.

adminIssuerProxy:
  fullnameOverride: admin-token-issuer-proxy
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/admin-token-issuer-proxy"

  # Gateway is disabled during Phase 2 (core services) because the Gateway
  # resource and CRDs are not yet installed. The gateway route for the admin
  # endpoint is created in Phase 3 when the Gateway Routes chart is installed.
  gateway:
    enabled: false

  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: control-plane

The gateway setting is false during this phase because the Gateway API CRDs and Gateway resource are not yet installed. The admin endpoint HTTPRoute will be created in standalone-gateway when the Gateway Routes chart is deployed.

Install

$helm upgrade --install admin-issuer-proxy \
> oci://${REGISTRY}/${REPOSITORY}/helm-admin-token-issuer-proxy \
> --version 1.2.2 \
> --namespace api-keys \
> --wait --timeout 10m \
> -f admin-issuer-proxy-values.yaml

Verify

$kubectl get pods -n api-keys
$
$# Expected: api-keys and admin-token-issuer-proxy pods both Running

Verify All Core Services

Before proceeding to gateway configuration, confirm all core services are healthy:

$echo "=== NVCF namespace ==="
$kubectl get pods -n nvcf
$
$echo "=== API Keys namespace ==="
$kubectl get pods -n api-keys
$
$echo "=== ESS namespace ==="
$kubectl get pods -n ess
$
$echo "=== SIS namespace ==="
$kubectl get pods -n sis

All pods should be in the Running state. Then verify the Helm releases:

$helm list -A
$
$# All releases should show STATUS: deployed

If any pod is stuck in CrashLoopBackOff, check its logs with kubectl logs <pod-name> -n <namespace> --tail=100. Common causes include misconfigured secrets or unreachable infrastructure services.
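The per-namespace checks above can also be collapsed into a single filter that surfaces only unhealthy pods. A sketch using awk on the STATUS column of kubectl get pods output; the printf lines simulate cluster output so the filter is shown working, and not_healthy is a helper name introduced here, not an NVCF tool:

```shell
# Print NAME and STATUS for pods that are neither Running nor Completed.
# Live usage: kubectl get pods -n nvcf --no-headers | not_healthy
not_healthy() {
  awk '$3 != "Running" && $3 != "Completed" {print $1 "\t" $3}'
}

# Simulated `kubectl get pods --no-headers` output for demonstration:
printf '%s\n' \
  'nvcf-api-7d9f 1/1 Running 0 10m' \
  'ess-api-5c2a 0/1 CrashLoopBackOff 4 10m' \
  'reval-9b1e 1/1 Running 0 8m' | not_healthy
# prints: ess-api-5c2a	CrashLoopBackOff
```

An empty result from the live pipeline means every pod in the namespace is healthy.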

Next Steps

Once all core services are running, proceed to standalone-gateway to configure ingress and verify end-to-end API connectivity.