Control Plane Operations

This section provides runbooks for operating the self-hosted NVCF control plane, including encryption key rotation, service management, and upgrades.

Encryption Key Management

Self-hosted NVCF uses a two-tier encryption hierarchy to protect secrets stored in the Encrypted Secret Store (ESS):

| Key | Purpose |
| --- | --- |
| MEK (Master Encryption Key) | A single AES-256-GCM key stored in OpenBao. The MEK encrypts (wraps) all NEKs. It is shared across NVCF services in the control plane. |
| NEK (Namespace Encryption Key) | Per-namespace AES-256-GCM keys stored in Cassandra (encrypted by the MEK). NEKs directly encrypt user secrets such as API keys and registry credentials. |

When a user stores a secret through the NVCF API, ESS encrypts it with the active NEK for that namespace. The NEK itself is stored in Cassandra, encrypted by the MEK. To decrypt a secret, ESS retrieves the NEK from Cassandra, decrypts it with the MEK from OpenBao, and then decrypts the secret with the unwrapped NEK.
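
As an illustration of this flow, here is a minimal sketch of the same envelope-encryption pattern using the openssl CLI. This is not the ESS implementation: openssl's command line does not expose AES-GCM, so AES-256-CBC stands in, and passphrase-derived keys stand in for the raw keys that OpenBao and Cassandra actually hold.

```shell
# Envelope encryption sketch (illustrative only; see caveats above).
MEK=$(openssl rand -hex 32)   # master key, normally held in OpenBao
NEK=$(openssl rand -hex 32)   # per-namespace key

# Wrap the NEK with the MEK; this wrapped blob is what Cassandra would store.
WRAPPED_NEK=$(echo "$NEK" | openssl enc -aes-256-cbc -pbkdf2 -pass pass:"$MEK" -base64 -A)

# Encrypt a user secret with the NEK.
CIPHERTEXT=$(echo "my-registry-password" | openssl enc -aes-256-cbc -pbkdf2 -pass pass:"$NEK" -base64 -A)

# Decrypt path: unwrap the NEK using the MEK, then decrypt the secret.
UNWRAPPED_NEK=$(echo "$WRAPPED_NEK" | openssl enc -d -aes-256-cbc -pbkdf2 -pass pass:"$MEK" -base64 -A)
echo "$CIPHERTEXT" | openssl enc -d -aes-256-cbc -pbkdf2 -pass pass:"$UNWRAPPED_NEK" -base64 -A
# prints: my-registry-password
```

A consequence of this layering is that rotating the MEK only requires re-wrapping each NEK; the user secrets themselves do not need to be re-encrypted.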

Key Rotation Runbooks

Basic Operations

Service Reference

The following table lists all NVCF control plane services with their namespace, resource name, and resource type. Use these values in the commands throughout this section.

| Namespace | Service | Resource Name | Type |
| --- | --- | --- | --- |
| nvcf | NVCF API | nvcf-api | Deployment |
| nvcf | Invocation Service | invocation-service | Deployment |
| nvcf | gRPC Proxy | grpc-proxy-deployment | Deployment |
| nvcf | Notary Service | notary-service | Deployment |
| sis | Spot Instance Service | spot-instance-service | Deployment |
| api-keys | API Keys Service | api-keys | Deployment |
| api-keys | Admin Issuer Proxy | admin-token-issuer-proxy | Deployment |
| ess | ESS API | ess-api-helm-nvcf-ess-api-deployment | Deployment |
| nats-system | NATS | nats | StatefulSet |
| vault-system | OpenBao | openbao-server | StatefulSet |
| cassandra-system | Cassandra | cassandra | StatefulSet |
| nvca-operator | NVCA Operator | nvca-operator | Deployment |
| envoy-gateway-system | Envoy Gateway | envoy-gateway | Deployment |

Restarting a Service

Restarting a Deployment:

$kubectl rollout restart deployment/<name> -n <namespace>
$
$# Example: restart the NVCF API
$kubectl rollout restart deployment/nvcf-api -n nvcf
$
$# Verify the rollout completes
$kubectl rollout status deployment/<name> -n <namespace> --timeout=120s

Restarting a StatefulSet:

StatefulSets perform a rolling restart, terminating and recreating one pod at a time in reverse ordinal order (highest first). For clustered services like NATS, OpenBao, and Cassandra, this preserves quorum as long as a majority of replicas remain available.

$kubectl rollout restart statefulset/<name> -n <namespace>
$
$# Example: restart Cassandra
$kubectl rollout restart statefulset/cassandra -n cassandra-system
$
$# Verify the rollout completes
$kubectl rollout status statefulset/<name> -n <namespace> --timeout=300s

For OpenBao, verify the seal status after the rollout completes. Each pod must unseal before it can serve requests:

$kubectl exec -n vault-system openbao-server-0 -- bao status
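
The reverse-ordinal order described above can be previewed with a quick loop. For a hypothetical 3-replica StatefulSet named cassandra (pods follow the <name>-<ordinal> naming convention), the restart sequence is:

```shell
# Pods are recreated one at a time, highest ordinal first.
name=cassandra
replicas=3
for i in $(seq $((replicas - 1)) -1 0); do
  echo "restart ${name}-${i}"
done
# restart cassandra-2
# restart cassandra-1
# restart cassandra-0
```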

Restarting all Deployments in a namespace:

$kubectl rollout restart deployment -n <namespace>
$
$# Example: restart all services in the nvcf namespace
$kubectl rollout restart deployment -n nvcf

Checking Service Health

List pods and their status:

$# All pods in a namespace
$kubectl get pods -n <namespace>
$
$# All NVCF control plane pods at a glance
$for ns in nvcf sis api-keys ess nats-system vault-system cassandra-system; do
$ echo "=== $ns ==="
$ kubectl get pods -n $ns
$done

Check logs for a service:

$# Recent logs
$kubectl logs -n <namespace> -l app.kubernetes.io/name=<name> --tail=50
$
$# Follow logs in real time
$kubectl logs -n <namespace> -l app.kubernetes.io/name=<name> -f

Describe a pod for events and conditions:

$kubectl describe pod -n <namespace> -l app.kubernetes.io/name=<name>

Scaling a Service

To temporarily take a service offline (for example, during maintenance), scale it to zero, perform the work, then scale it back:

$# Scale down
$kubectl scale deployment/<name> -n <namespace> --replicas=0
$
$# ... perform maintenance ...
$
$# Scale back up
$kubectl scale deployment/<name> -n <namespace> --replicas=1
$
$# Verify
$kubectl rollout status deployment/<name> -n <namespace>

Scaling infrastructure StatefulSets (Cassandra, NATS, OpenBao) to zero will cause a full outage. Only do this if you understand the implications for data availability and quorum.
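
To make the quorum caveat concrete: as noted in the restart runbook, these clustered services tolerate losing only a minority of replicas, i.e. a majority of floor(N/2) + 1 pods must stay up. A quick sketch of what that tolerates at common replica counts:

```shell
# Majority quorum: replicas / 2 + 1 pods must remain available (integer math).
for replicas in 1 3 5; do
  quorum=$(( replicas / 2 + 1 ))
  tolerable=$(( replicas - quorum ))
  echo "replicas=$replicas quorum=$quorum tolerable_failures=$tolerable"
done
# replicas=1 quorum=1 tolerable_failures=0
# replicas=3 quorum=2 tolerable_failures=1
# replicas=5 quorum=3 tolerable_failures=2
```

Scaling to zero drops below quorum immediately, which is why it takes the whole service down.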

Upgrading Services

Upgrades are not officially supported during the Early Access period. The self-hosted NVCF stack does not yet have a validated upgrade path. Even a full helmfile sync may introduce breaking changes between releases — there is no guarantee of backward compatibility for configuration, database schemas, or inter-service APIs at this stage.

The guidance below is provided for advanced users who need to apply targeted fixes or hotfixes to individual services. It is not a substitute for a validated upgrade procedure.

Spot upgrades carry additional risk. Beyond the general Early Access limitations above, spot-upgrading an individual Helm chart bypasses the Helmfile’s version coordination and automatic database migrations. Proceed only when you understand the compatibility implications for the specific version you are upgrading to.

When to Spot Upgrade

| Use a spot upgrade when | Use a full Helmfile upgrade when |
| --- | --- |
| Applying a patch release to a single service (e.g., 1.2.6 to 1.2.7) | Upgrading the entire stack to a new minor or major version |
| Applying a targeted hotfix provided by NVIDIA support | The new version includes Cassandra schema migrations |
| You need to roll out a configuration change that requires a new chart version | Multiple services need to be upgraded together for compatibility |

Pre-Upgrade Checklist

Before upgrading any chart:

  1. Note the current chart version and app version:

     $helm list -n <namespace>

  2. Back up the current Helm values:

     $helm get values <release> -n <namespace> -o yaml > <release>-values-backup.yaml

  3. Review release notes for the target version. Check for breaking changes, required value changes, or new dependencies.

  4. Verify the cluster is healthy before starting: all pods running, no pending operations.

Spot Upgrading a Helm Chart

The following commands work for any Deployment-based service. Replace the placeholders with values from the Service Reference table above.

$# 1. Upgrade the chart
$helm upgrade <release> \
> oci://<registry>/<repository>/<chart> \
> --version <new-version> \
> --namespace <namespace> \
> --wait --timeout 5m \
> -f <release>-values.yaml
$
$# 2. Verify the rollout
$kubectl rollout status deployment/<name> -n <namespace> --timeout=120s
$
$# 3. Confirm the new chart version
$helm list -n <namespace>

Example — upgrading the NVCF API chart:

$helm upgrade api \
> oci://${REGISTRY}/${REPOSITORY}/helm-nvcf-api \
> --version 2.1.0 \
> --namespace nvcf \
> --wait --timeout 5m \
> -f nvcf-api-values.yaml
$
$kubectl rollout status deployment/api -n nvcf --timeout=120s

Always pass your values file (-f values.yaml) during upgrade. If you omit it, Helm resets all values to chart defaults, which can break your deployment. If you no longer have the original values file, back up the current values first with helm get values.

Upgrading StatefulSet-Based Services

Cassandra, NATS, and OpenBao are deployed as StatefulSets. The helm upgrade command is the same, but the rollout behavior differs:

  • Rolling update: StatefulSets restart pods one at a time in reverse ordinal order, waiting for each pod to become ready before proceeding to the next.
  • Quorum preserved: For 3-replica clusters, at most one pod is unavailable at a time, maintaining quorum throughout the upgrade.
$helm upgrade <release> \
> oci://<registry>/<repository>/<chart> \
> --version <new-version> \
> --namespace <namespace> \
> --wait --timeout 10m \
> -f <release>-values.yaml
$
$kubectl rollout status statefulset/<name> -n <namespace> --timeout=300s

Service-specific notes:

| Service | Notes |
| --- | --- |
| Cassandra | After upgrading, check whether the new version requires a schema migration. If a cassandra-migrations job exists in the chart, it runs automatically as a Helm post-upgrade hook. Verify it completes: `kubectl get jobs -n cassandra-system`. |
| OpenBao | After each pod restarts, verify it unseals successfully: `kubectl exec -n vault-system openbao-server-0 -- bao status`. If auto-unseal is configured, this happens automatically. Otherwise, you must unseal each pod manually. |
| NATS | The NATS cluster maintains message availability during rolling updates as long as a majority of nodes remain online. Monitor the cluster state: `kubectl logs -n nats-system nats-0 --tail=10`. |

Rolling Back

If an upgrade causes issues, roll back to the previous Helm revision:

$# List revision history
$helm history <release> -n <namespace>
$
$# Roll back to the previous revision
$helm rollback <release> -n <namespace>
$
$# Verify
$kubectl rollout status deployment/<name> -n <namespace> --timeout=120s
$
$# Or for StatefulSets
$kubectl rollout status statefulset/<name> -n <namespace> --timeout=300s

helm rollback reverts both the chart version and the values. If you made intentional value changes alongside the version upgrade, you will need to re-apply them after the rollback.

Observability

For observability configuration and reference architecture, see self-hosted-observability.