Phase 2: Core Services#
This phase installs the NVCF control plane services. These services depend on the infrastructure components installed in Phase 1: Infrastructure Dependencies.
Important
All three infrastructure dependencies (NATS, OpenBao, Cassandra) must be running and healthy before proceeding. Verify with:
kubectl get pods -n nats-system
kubectl get pods -n vault-system
kubectl get pods -n cassandra-system
Install the services in the order shown below. Services with dependencies are noted — wait for the dependency to be healthy before installing the dependent service.
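The health gate above can be scripted. A minimal sketch (the `all_running` helper is an illustration, not part of any chart) that checks a `kubectl get pods` listing for pods in any state other than Running:

```shell
# all_running: read `kubectl get pods` output on stdin and fail if any
# pod reports a STATUS other than Running. Assumes the default kubectl
# table layout, where STATUS is the third column.
all_running() {
  awk 'NR > 1 && $3 != "Running" { bad = 1 } END { exit bad }'
}

# Example against a live cluster:
# kubectl get pods -n nats-system | all_running && echo "nats-system healthy"
```

Note that pods in Completed state (finished Jobs) would also fail this check, so scope it to namespaces that run only long-lived services.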
API Keys#
API Keys provides authentication token management for all NVCF API interactions.
Chart: helm-nvcf-api-keys
Version: 1.0.4
Namespace: api-keys
Depends on: Infrastructure only
Configuration#
Create api-keys-values.yaml (download template):
api-keys-values.yaml
# API Keys values for standalone installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.
apikeys:
  fullnameOverride: api-keys
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/nv-api-keys"
  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: control-plane
Replace <REGISTRY> and <REPOSITORY> with your registry settings.
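Hand-editing each values file is error-prone; one way to render the placeholders with `sed` (the registry and repository values below are examples, and the inline sample input stands in for the values template):

```shell
# Substitute the <REGISTRY>/<REPOSITORY> placeholders. The REGISTRY and
# REPOSITORY values here are illustrative, not real endpoints.
REGISTRY="registry.example.com"
REPOSITORY="nvcf/images"
printf 'registry: "<REGISTRY>"\nrepository: "<REPOSITORY>/nv-api-keys"\n' |
  sed -e "s|<REGISTRY>|${REGISTRY}|g" \
      -e "s|<REPOSITORY>|${REPOSITORY}|g"
# Output:
# registry: "registry.example.com"
# repository: "nvcf/images/nv-api-keys"
```

In practice, point `sed` at the downloaded values template and redirect the result to the final values file.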
Install#
helm upgrade --install api-keys \
oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-api-keys \
--version 1.0.4 \
--namespace api-keys \
--wait --timeout 10m \
-f api-keys-values.yaml
Verify#
kubectl get pods -n api-keys
# Expected: api-keys pod Running
SIS#
The Spot Instance Service (SIS) handles cluster registration and GPU resource management.
Chart: helm-nvcf-sis
Version: 1.3.0
Namespace: sis
Depends on: Infrastructure only
Configuration#
Create sis-values.yaml (download template):
sis-values.yaml
# SIS (Spot Instance Service) values for standalone installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.
sis:
  fullnameOverride: spot-instance-service
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/spot"
  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: control-plane
Install#
helm upgrade --install sis \
oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-sis \
--version 1.3.0 \
--namespace sis \
--wait --timeout 10m \
-f sis-values.yaml
Verify#
kubectl get pods -n sis
# Expected: spot-instance-service pod Running
ESS API#
The ESS (Enterprise Secrets Service) API distributes secrets to NVCF services via OpenBao.
Chart: helm-nvcf-ess-api
Version: 1.2.1
Namespace: ess
Depends on: Infrastructure only
Configuration#
Create ess-api-values.yaml (download template):
ess-api-values.yaml
# ESS API values for standalone installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.
ess:
  fullnameOverride: ess-api
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/ess-api"
Install#
helm upgrade --install ess-api \
oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-ess-api \
--version 1.2.1 \
--namespace ess \
--wait --timeout 10m \
-f ess-api-values.yaml
Verify#
kubectl get pods -n ess
# Expected: ess-api pod Running
NVCF API#
The NVCF API is the primary control plane service. It manages functions, deployments, and account configuration. The API chart includes an account bootstrap job that runs on first install to initialize the NVCF account with registry credentials.
Chart: helm-nvcf-api
Version: 1.8.0
Namespace: nvcf
Depends on: ESS API (must be running)
Important
The ESS API must be running before installing the NVCF API. The account bootstrap job communicates with ESS during initialization.
Configuration#
Create nvcf-api-values.yaml (download template):
nvcf-api-values.yaml
# NVCF API values for standalone installation
# Replace <REGISTRY>, <REPOSITORY>, and credential placeholders with your settings.
api:
  fullnameOverride: nvcf-api
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/strap"
  accountBootstrap:
    image:
      registry: "<REGISTRY>"
      repository: "<REPOSITORY>/alpine-k8s"
      tag: 1.30.12
      pullPolicy: IfNotPresent
    # Registry credentials for function container/chart deployments.
    # NGC credentials are used by default. Additional registries (ECR, etc.)
    # can be added post-install via the NVCF CLI or API.
    registryCredentials:
      - registryHostname: nvcr.io
        secret:
          name: nvcr-containers
          value: "<REGISTRY_CREDENTIAL_B64>"  # base64 of $oauthtoken:<NGC_API_KEY>
        artifactTypes: ["CONTAINER"]
        tags: []
        description: "NGC Container registry"
      - registryHostname: helm.ngc.nvidia.com
        secret:
          name: nvcr-helmcharts
          value: "<REGISTRY_CREDENTIAL_B64>"  # base64 of $oauthtoken:<NGC_API_KEY>
        artifactTypes: ["HELM"]
        tags: []
        description: "NGC Helm chart registry"
    limits:
      maxFunctions: 10
      maxTasks: 10
      maxTelemetries: 10
      maxRegistryCreds: 10
  env:
    NVCF_NATS_REGION_PLACEMENT_TAG: "dc"
    NVCF_SIDECARS_HOSTNAME: "<REGISTRY>"
    NVCF_SIDECARS_REPOSITORY: "<REPOSITORY>"
  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: control-plane
Replace the following placeholders:
<REGISTRY>: Your container image registry
<REPOSITORY>: Your image repository path
<REGISTRY_CREDENTIAL_B64>: Base64-encoded registry credential (see Prerequisites and Configuration)
registryHostname: Hostname for your Helm chart registry (e.g., helm.ngc.nvidia.com)
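The credential placeholder can be generated directly from an NGC API key. A sketch (the key value below is a placeholder):

```shell
# Build the base64 registry credential expected by registryCredentials.
# The literal username is $oauthtoken (single-quoted so the shell does
# not expand it); replace the key with your actual NGC API key.
NGC_API_KEY="my-ngc-api-key"   # placeholder value
REGISTRY_CREDENTIAL_B64=$(printf '%s' '$oauthtoken:'"${NGC_API_KEY}" | base64 | tr -d '\n')

# Sanity check: the credential must decode back to username:password.
printf '%s' "${REGISTRY_CREDENTIAL_B64}" | base64 -d
# → $oauthtoken:my-ngc-api-key
```

Using `printf` rather than `echo` avoids a trailing newline being encoded into the credential.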
Install#
helm upgrade --install api \
oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-api \
--version 1.8.0 \
--namespace nvcf \
--wait --wait-for-jobs --timeout 15m \
-f nvcf-api-values.yaml
Important
Monitor for account bootstrap failures. Open a separate terminal and watch events:
kubectl get events -n nvcf -w
The account bootstrap job is the most common failure point (usually due to misconfigured registry credentials in the values file).
Verify#
kubectl get pods -n nvcf
# Expected: nvcf-api pod Running
Check the bootstrap job completed:
kubectl get jobs -n nvcf
# The nvcf-api-account-bootstrap job should show COMPLETIONS 1/1
Note
The bootstrap job auto-deletes after approximately 5 minutes. Monitor events in real-time to catch failures.
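Because the job is short-lived, polling is more reliable than a one-shot check. A generic retry helper (a sketch, not part of any chart):

```shell
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, up to ATTEMPTS
# times, sleeping DELAY seconds between tries. Returns success as soon
# as CMD succeeds, failure once the attempts are exhausted.
retry() {
  attempts=$1; delay=$2; shift 2
  n=0
  while [ "$n" -lt "$attempts" ]; do
    "$@" && return 0
    n=$((n + 1))
    sleep "$delay"
  done
  return 1
}

# Example: poll for the bootstrap job's logs before it is cleaned up.
# retry 30 10 kubectl logs job/nvcf-api-account-bootstrap -n nvcf
```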
Troubleshooting#
Bootstrap job fails: Check the job logs:
kubectl logs job/nvcf-api-account-bootstrap -n nvcf
Registry credential errors: Verify your <REGISTRY_CREDENTIAL_B64> value is correct. The base64-encoded credential should decode to username:password format.
Recovering from bootstrap failure: Uninstall the API chart, fix the values, and reinstall:
helm uninstall api -n nvcf
# Fix nvcf-api-values.yaml
helm upgrade --install api ...
Invocation Service#
The Invocation Service handles function invocation requests and routes them to worker nodes.
Chart: helm-nvcf-invocation-service
Version: 1.2.0
Namespace: nvcf
Depends on: NVCF API (must be running)
Configuration#
Create invocation-service-values.yaml (download template):
invocation-service-values.yaml
# Invocation Service values for standalone installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.
invocation:
  fullnameOverride: invocation-service
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/nvcf-invocation-service"
  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: control-plane
Install#
helm upgrade --install invocation-service \
oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-invocation-service \
--version 1.2.0 \
--namespace nvcf \
--wait --timeout 10m \
-f invocation-service-values.yaml
Verify#
kubectl get pods -n nvcf -l app.kubernetes.io/name=invocation-service
# Expected: invocation-service pod Running
gRPC Proxy#
The gRPC Proxy enables streaming workloads over gRPC connections.
Chart: helm-nvcf-grpc-proxy
Version: 1.3.1
Namespace: nvcf
Depends on: NVCF API (must be running)
Configuration#
Create grpc-proxy-values.yaml (download template):
grpc-proxy-values.yaml
# gRPC Proxy values for standalone installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.
grpcproxy:
  fullnameOverride: grpc-proxy
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/nvcf-grpc-proxy"
  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: control-plane
Install#
helm upgrade --install grpc-proxy \
oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-grpc-proxy \
--version 1.3.1 \
--namespace nvcf \
--wait --timeout 10m \
-f grpc-proxy-values.yaml
Verify#
kubectl get pods -n nvcf -l app.kubernetes.io/name=grpc-proxy
# Expected: grpc-proxy pod Running
Notary Service#
The Notary Service handles request signing and validation for secure inter-service communication.
Chart: helm-nvcf-notary-service
Version: 1.1.0
Namespace: nvcf
Depends on: Infrastructure only
Configuration#
Create notary-service-values.yaml (download template):
notary-service-values.yaml
# Notary Service values for standalone installation
# Replace <REGISTRY> and <REPOSITORY> with your container registry settings.
notary:
  fullnameOverride: notary-service
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/notary-service"
  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: control-plane
Install#
helm upgrade --install notary-service \
oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-notary-service \
--version 1.1.0 \
--namespace nvcf \
--wait --timeout 10m \
-f notary-service-values.yaml
Verify#
kubectl get pods -n nvcf -l app.kubernetes.io/name=notary-service
# Expected: notary-service pod Running
Admin Token Issuer Proxy#
The Admin Token Issuer Proxy provides an admin endpoint for generating API keys without requiring pre-existing credentials. It is used for initial setup and emergency access.
Chart: helm-admin-token-issuer-proxy
Version: 1.2.1
Namespace: api-keys
Depends on: API Keys (must be running)
Configuration#
Create admin-issuer-proxy-values.yaml (download template):
admin-issuer-proxy-values.yaml
# Admin Token Issuer Proxy values for standalone installation
# Replace <REGISTRY>, <REPOSITORY>, and <DOMAIN> with your settings.
adminIssuerProxy:
  fullnameOverride: admin-token-issuer-proxy
  image:
    registry: "<REGISTRY>"
    repository: "<REPOSITORY>/admin-token-issuer-proxy"
  # Gateway is disabled during Phase 2 (core services) because the Gateway
  # resource and CRDs are not yet installed. The gateway route for the admin
  # endpoint is created in Phase 3 when the Gateway Routes chart is installed.
  gateway:
    enabled: false
  # Uncomment for node selectors
  # nodeSelector:
  #   nvcf.nvidia.com/workload: control-plane
Note
The gateway setting is false during this phase because the Gateway API CRDs and
Gateway resource are not yet installed. The admin endpoint HTTPRoute will be created in
Phase 3: Gateway and Ingress when the Gateway Routes chart is deployed.
Install#
helm upgrade --install admin-issuer-proxy \
oci://${REGISTRY}/${REPOSITORY}/helm-admin-token-issuer-proxy \
--version 1.2.1 \
--namespace api-keys \
--wait --timeout 10m \
-f admin-issuer-proxy-values.yaml
Verify#
kubectl get pods -n api-keys
# Expected: api-keys and admin-token-issuer-proxy pods both Running
Verify All Core Services#
Before proceeding to gateway configuration, confirm all core services are healthy:
echo "=== NVCF namespace ==="
kubectl get pods -n nvcf
echo "=== API Keys namespace ==="
kubectl get pods -n api-keys
echo "=== ESS namespace ==="
kubectl get pods -n ess
echo "=== SIS namespace ==="
kubectl get pods -n sis
All pods should be in Running state. Verify helm releases:
helm list -A
# All releases should show STATUS: deployed
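The release check can be scripted as well. A sketch that assumes helm's default table output, where the UPDATED timestamp spans four whitespace-separated fields and STATUS is therefore the eighth:

```shell
# all_deployed: read `helm list -A` output on stdin and fail if any
# release's STATUS is not "deployed". Assumes helm's default table
# format (STATUS in the eighth whitespace-separated field).
all_deployed() {
  awk 'NR > 1 && $8 != "deployed" { bad = 1 } END { exit bad }'
}

# Usage: helm list -A | all_deployed && echo "all releases deployed"
```

If `jq` is available, `helm list -A -o json` avoids table parsing entirely and is more robust against formatting changes.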
Tip
If any pod is stuck in CrashLoopBackOff, check its logs with
kubectl logs <pod-name> -n <namespace> --tail=100. Common causes include
misconfigured secrets or unreachable infrastructure services.
Next Steps#
Once all core services are running, proceed to Phase 3: Gateway and Ingress to configure ingress and verify end-to-end API connectivity.