Control Plane Operations#
This section provides runbooks for operating the self-hosted NVCF control plane, including encryption key rotation, service management, and upgrades.
Encryption Key Management#
Self-hosted NVCF uses a two-tier encryption hierarchy to protect secrets stored in the Encrypted Secret Store (ESS):
| Key | Purpose |
|---|---|
| MEK (Master Encryption Key) | A single AES-256-GCM key stored in OpenBao. The MEK encrypts (wraps) all NEKs. It is shared across NVCF services in the control plane. |
| NEK (Namespace Encryption Key) | Per-namespace AES-256-GCM keys stored in Cassandra (encrypted by the MEK). NEKs directly encrypt user secrets such as API keys and registry credentials. |
When a user stores a secret through the NVCF API, ESS encrypts it with the active NEK for that namespace. The NEK itself is stored in Cassandra, encrypted by the MEK. To decrypt a secret, ESS retrieves the NEK from Cassandra, decrypts it using the MEK from OpenBAO, then decrypts the secret.
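The wrap/unwrap flow above is standard envelope encryption. The sketch below illustrates it with openssl; it is not the ESS code path. ESS uses AES-256-GCM internally, but the openssl enc tool does not support GCM, so AES-256-CBC stands in here, and the key handling is deliberately simplified.

```shell
# Illustrative envelope-encryption sketch only -- not the real ESS implementation.
MEK=$(openssl rand -hex 32)   # master key (in production: held in OpenBao)
NEK=$(openssl rand -hex 32)   # per-namespace key

# Wrap the NEK with the MEK; this wrapped form is what Cassandra would store
WRAPPED_NEK=$(printf '%s' "$NEK" | openssl enc -aes-256-cbc -pbkdf2 -a -A -pass pass:"$MEK")

# Encrypt a user secret with the plaintext NEK
SECRET='my-registry-password'
CIPHERTEXT=$(printf '%s' "$SECRET" | openssl enc -aes-256-cbc -pbkdf2 -a -A -pass pass:"$NEK")

# Decrypt path: unwrap the NEK using the MEK, then decrypt the secret with the NEK
RECOVERED_NEK=$(printf '%s' "$WRAPPED_NEK" | openssl enc -d -aes-256-cbc -pbkdf2 -a -A -pass pass:"$MEK")
PLAINTEXT=$(printf '%s' "$CIPHERTEXT" | openssl enc -d -aes-256-cbc -pbkdf2 -a -A -pass pass:"$RECOVERED_NEK")
echo "$PLAINTEXT"
```

Note that a compromised MEK does not directly expose secrets at rest (the attacker also needs the wrapped NEKs from Cassandra), which is why MEK rotation plus NEK re-wrapping is the recovery procedure.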
Note
Rotating the MEK requires a follow-up NEK re-encryption step so that all NEKs in Cassandra are re-wrapped with the new MEK. See the individual runbooks below for the full procedures.
Key Rotation Runbooks#
NEK Rotation and Re-encryption — Rotate the namespace encryption key and re-encrypt NEKs after a MEK rotation.
MEK (Master Encryption Key) Rotation — Rotate the master encryption key stored in OpenBao.
Basic Operations#
Service Reference#
The following table lists all NVCF control plane services with their namespace, resource name, and resource type. Use these values in the commands throughout this section.
| Namespace | Service | Resource Name | Type |
|---|---|---|---|
| nvcf | NVCF API | | Deployment |
| | Invocation Service | | Deployment |
| | gRPC Proxy | | Deployment |
| | Notary Service | | Deployment |
| sis | Spot Instance Service | | Deployment |
| api-keys | API Keys Service | | Deployment |
| | Admin Issuer Proxy | | Deployment |
| ess | ESS API | | Deployment |
| nats-system | NATS | | StatefulSet |
| vault-system | OpenBao | openbao-server | StatefulSet |
| cassandra-system | Cassandra | cassandra | StatefulSet |
| | NVCA Operator | | Deployment |
| | Envoy Gateway | | Deployment |
Restarting a Service#
Restarting a Deployment:
kubectl rollout restart deployment/<name> -n <namespace>
# Example: restart the NVCF API
kubectl rollout restart deployment/nvcf-api -n nvcf
# Verify the rollout completes
kubectl rollout status deployment/<name> -n <namespace> --timeout=120s
Restarting a StatefulSet:
StatefulSets perform a rolling restart, terminating and recreating one pod at a time in reverse ordinal order (highest first). For clustered services like NATS, OpenBao, and Cassandra, this preserves quorum as long as a majority of replicas remain available.
kubectl rollout restart statefulset/<name> -n <namespace>
# Example: restart Cassandra
kubectl rollout restart statefulset/cassandra -n cassandra-system
# Verify the rollout completes
kubectl rollout status statefulset/<name> -n <namespace> --timeout=300s
Note
For OpenBao, verify the seal status after the rollout completes. Each pod must unseal before it can serve requests:
kubectl exec -n vault-system openbao-server-0 -- bao status
Restarting all Deployments in a namespace:
kubectl rollout restart deployment -n <namespace>
# Example: restart all services in the nvcf namespace
kubectl rollout restart deployment -n nvcf
Checking Service Health#
List pods and their status:
# All pods in a namespace
kubectl get pods -n <namespace>
# All NVCF control plane pods at a glance
for ns in nvcf sis api-keys ess nats-system vault-system cassandra-system; do
echo "=== $ns ==="
kubectl get pods -n $ns
done
Check logs for a service:
# Recent logs
kubectl logs -n <namespace> -l app.kubernetes.io/name=<name> --tail=50
# Follow logs in real time
kubectl logs -n <namespace> -l app.kubernetes.io/name=<name> -f
Describe a pod for events and conditions:
kubectl describe pod -n <namespace> -l app.kubernetes.io/name=<name>
Scaling a Service#
To temporarily take a service offline (for example, during maintenance), scale it to zero, perform the work, then scale it back:
# Scale down
kubectl scale deployment/<name> -n <namespace> --replicas=0
# ... perform maintenance ...
# Scale back up
kubectl scale deployment/<name> -n <namespace> --replicas=1
# Verify
kubectl rollout status deployment/<name> -n <namespace>
Warning
Scaling infrastructure StatefulSets (Cassandra, NATS, OpenBao) to zero will cause a full outage. Only do this if you understand the implications for data availability and quorum.
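The scale-to-zero step can be wrapped in a small guardrail that refuses to zero out the clustered infrastructure namespaces. This is a hypothetical helper, not part of the NVCF tooling; the namespace list mirrors the StatefulSet services in the Service Reference table.

```shell
# Hypothetical guard around "kubectl scale" for maintenance windows.
# Refuses --replicas=0 in the clustered infrastructure namespaces,
# where a full scale-down would break quorum (see the warning above).
scale_for_maintenance() {
    local name="$1" namespace="$2" replicas="$3"
    case "$namespace" in
        nats-system|vault-system|cassandra-system)
            if [ "$replicas" -eq 0 ]; then
                echo "refusing: scaling $namespace to zero causes quorum loss" >&2
                return 1
            fi
            ;;
    esac
    kubectl scale "deployment/$name" -n "$namespace" --replicas="$replicas"
}

# Example: take the NVCF API offline for maintenance
# scale_for_maintenance nvcf-api nvcf 0
```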
Upgrading Services#
Danger
Upgrades are not officially supported during the Early Access period. The self-hosted
NVCF stack does not yet have a validated upgrade path. Even a full helmfile sync may
introduce breaking changes between releases — there is no guarantee of backward
compatibility for configuration, database schemas, or inter-service APIs at this stage.
The guidance below is provided for advanced users who need to apply targeted fixes or hotfixes to individual services. It is not a substitute for a validated upgrade procedure.
Warning
Spot upgrades carry additional risk. Beyond the general Early Access limitations above, spot-upgrading an individual Helm chart bypasses the Helmfile’s version coordination and automatic database migrations. Proceed only when you understand the compatibility implications for the specific version you are upgrading to.
When to Spot Upgrade#
| Use a spot upgrade when | Use a full Helmfile upgrade when |
|---|---|
| Applying a patch release to a single service | Upgrading the entire stack to a new minor or major version |
| Applying a targeted hotfix provided by NVIDIA support | The new version includes Cassandra schema migrations |
| You need to roll out a configuration change that requires a new chart version | Multiple services need to be upgraded together for compatibility |
Pre-Upgrade Checklist#
Before upgrading any chart:
1. Note the current chart version and app version:
   helm list -n <namespace>
2. Back up the current Helm values:
   helm get values <release> -n <namespace> -o yaml > <release>-values-backup.yaml
3. Review release notes for the target version. Check for breaking changes, required value changes, or new dependencies.
4. Verify the cluster is healthy before starting: all pods running, no pending operations.
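The record-and-backup steps of the checklist can be bundled into a small helper so they are never skipped. This is a hypothetical convenience function, not part of the NVCF tooling; the output file names are illustrative.

```shell
# Hypothetical helper bundling the version-record and values-backup steps.
# Writes two files next to the working directory before any upgrade runs.
pre_upgrade_snapshot() {
    local release="$1" namespace="$2"
    if [ -z "$release" ] || [ -z "$namespace" ]; then
        echo "usage: pre_upgrade_snapshot <release> <namespace>" >&2
        return 1
    fi
    # Record current chart and app versions for later comparison
    helm list -n "$namespace" > "${release}-helm-list-before.txt"
    # Back up the deployed values so the upgrade can reuse them
    helm get values "$release" -n "$namespace" -o yaml > "${release}-values-backup.yaml"
}

# Example: pre_upgrade_snapshot api nvcf
```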
Spot Upgrading a Helm Chart#
The following commands work for any Deployment-based service. Replace the placeholders with values from the Service Reference table above.
# 1. Upgrade the chart
helm upgrade <release> \
oci://<registry>/<repository>/<chart> \
--version <new-version> \
--namespace <namespace> \
--wait --timeout 5m \
-f <release>-values.yaml
# 2. Verify the rollout
kubectl rollout status deployment/<name> -n <namespace> --timeout=120s
# 3. Confirm the new chart version
helm list -n <namespace>
Example — upgrading the NVCF API chart:
helm upgrade api \
oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-api \
--version 2.1.0 \
--namespace nvcf \
--wait --timeout 5m \
-f nvcf-api-values.yaml
kubectl rollout status deployment/api -n nvcf --timeout=120s
Important
Always pass your values file (-f values.yaml) during upgrade. If you omit it, Helm
resets all values to chart defaults, which can break your deployment. If you no longer
have the original values file, back up the current values first with helm get values.
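One way to make the values-file requirement hard to forget is a wrapper that refuses to run without it. This is a sketch with hypothetical names, not NVCF tooling; the flags mirror the upgrade command above.

```shell
# Hypothetical wrapper: refuse to upgrade unless the values file exists,
# so an upgrade can never silently reset values to chart defaults.
safe_helm_upgrade() {
    local release="$1" chart="$2" version="$3" namespace="$4" values="$5"
    if [ ! -f "$values" ]; then
        echo "values file '$values' not found; back it up with 'helm get values' first" >&2
        return 1
    fi
    helm upgrade "$release" "$chart" \
        --version "$version" \
        --namespace "$namespace" \
        --wait --timeout 5m \
        -f "$values"
}

# Example: safe_helm_upgrade api oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-api 2.1.0 nvcf nvcf-api-values.yaml
```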
Upgrading StatefulSet-Based Services#
Cassandra, NATS, and OpenBao are deployed as StatefulSets. The helm upgrade command is the
same, but the rollout behavior differs:
Rolling update: StatefulSets restart pods one at a time in reverse ordinal order, waiting for each pod to become ready before proceeding to the next.
Quorum preserved: For 3-replica clusters, at most one pod is unavailable at a time, maintaining quorum throughout the upgrade.
helm upgrade <release> \
oci://<registry>/<repository>/<chart> \
--version <new-version> \
--namespace <namespace> \
--wait --timeout 10m \
-f <release>-values.yaml
kubectl rollout status statefulset/<name> -n <namespace> --timeout=300s
Service-specific notes:
| Service | Notes |
|---|---|
| Cassandra | After upgrading, check whether the new version requires a schema migration. |
| OpenBao | After each pod restarts, verify it unseals successfully: kubectl exec -n vault-system openbao-server-0 -- bao status |
| NATS | The NATS cluster maintains message availability during rolling updates as long as a majority of nodes remain online. Monitor the cluster state during the rollout. |
Rolling Back#
If an upgrade causes issues, roll back to the previous Helm revision:
# List revision history
helm history <release> -n <namespace>
# Roll back to the previous revision
helm rollback <release> -n <namespace>
# Verify
kubectl rollout status deployment/<name> -n <namespace> --timeout=120s
# Or for StatefulSets
kubectl rollout status statefulset/<name> -n <namespace> --timeout=300s
Note
helm rollback reverts both the chart version and the values. If you made intentional
value changes alongside the version upgrade, you will need to re-apply them after the
rollback.
Observability#
For observability configuration and reference architecture, see Observability Configuration.