Troubleshooting | NVIDIA Cloud Functions

This appendix contains common issues, tips, and clarifications learned from deploying self-hosted NVCF.

Configuration Issues and Best Practices

Incorrect Base64 Docker Credentials Format

Symptom:

API migration job fails during installation
Error appears as a “timeout” in the logs
Re-running helmfile sync or helmfile apply appears to succeed but the deployment doesn’t work properly
Functions fail to deploy or pull images

Root Cause:

The base64-encoded Docker credential in secrets.yaml was incorrectly formatted. A common mistake is encoding only the NGC API key instead of the full basic auth credential in the format $oauthtoken:API_KEY.

Incorrect (will fail):

$ # ❌ WRONG - Only encoding the API key
$ echo -n 'nvapi-1234567890abcdef' | base64
$ # Results in: bnZhcGktMTIzNDU2Nzg5MGFiY2RlZg==

Correct:

$ # ✅ CORRECT - Encoding the full credential in basic auth format
$ echo -n '$oauthtoken:nvapi-1234567890abcdef' | base64
$ # Results in: JG9hdXRodG9rZW46bnZhcGktMTIzNDU2Nzg5MGFiY2RlZg==

How to Diagnose:

Check the migration job logs specifically:

$ kubectl logs -n nvcf job/nvcf-api-migration -c migration

If you don’t see detailed errors, add debug output to migration scripts:

$ # Add set -x to the migration script for verbose output
$ kubectl edit configmap -n nvcf nvcf-api-migration-scripts
$ # Add 'set -x' at the top of the script

Fix your secrets.yaml with correct base64 credential, then follow the clean-install-procedure.

How to Prevent:

Always use the correct format: Encode $oauthtoken:YOUR_API_KEY, not just the API key

Verify before deploying: Decode your base64 string to verify it’s correct:

$ # Verify your encoded credential
$ echo 'YOUR_BASE64_STRING' | base64 -d
$ # Should output: $oauthtoken:nvapi-1234567890abcdef

Test NGC authentication: Before deploying, test that your credential works:

$ # Test NGC login with your credential
$ echo 'YOUR_BASE64_STRING' | base64 -d | IFS=: read username password
$ docker login nvcr.io -u "$username" -p "$password"

Registry Credential Change Not Taking Effect

Task creation fails with a Missing <TYPE> registry credential for hostname error, or keeps using a previous value, shortly after you add, update, or delete a registry credential, even though nvcf-cli registry-credential list and get already show the new value.

This is expected propagation delay, not a failure. Task processing caches each account’s registry credentials for about 5 minutes (nvct.nvcf.cache-ttl, default PT5M), and picks up the change once that cached copy refreshes.

Wait up to about 5 minutes and retry the task.
To apply the change immediately, restart the task service:
```
$ kubectl -n nvcf rollout restart deployment/nvct-api
```

Identifying Deployment Issues

Use these commands to diagnose deployment problems. For phase-by-phase monitoring during installation, see the Deployment Progression section in helmfile-installation.

Find Stuck Deployments:

$ # Pods stuck in pending state
$ kubectl get pods -A | grep Pending
$ 
$ # Pods with image pull issues
$ kubectl get pods -A | grep -E "ImagePullBackOff|ErrImagePull"
$ 
$ # Pods in crash loops
$ kubectl get pods -A | grep CrashLoopBackOff
$ 
$ # Check recent events for errors
$ kubectl get events --sort-by='.lastTimestamp' -A | grep -i error | tail -10

Resource Check:

$ # Check node resources
$ kubectl top nodes
$ 
$ # Check pod resource usage
$ kubectl top pods -A | grep -E "nvcf|nats|cassandra|openbao"

Installation & Deployment Issues

Account Bootstrap Job Failures

Symptom:

helmfile sync hangs or fails during the services phase
Events show BackoffLimitExceeded for nvcf-api-account-bootstrap
Bootstrap pod shows CrashLoopBackOff or Error status

Diagnosis:

Watch events in real-time (run this as soon as helmfile reaches services phase):
```
$ kubectl get events -n nvcf -w
```

Check the bootstrap job logs:

$ kubectl logs job/nvcf-api-account-bootstrap -n nvcf

Check the NVCF API logs for detailed error messages:

$ kubectl logs -n nvcf -l app.kubernetes.io/name=nvcf-api --tail=100

The bootstrap job auto-deletes after ~5 minutes (ttlSecondsAfterFinished: 300). Monitor events to catch failures in real-time.

Enable debug logging for the bootstrap job. The account bootstrap script supports a DEBUG environment variable that enables verbose output. To enable it before redeploying, patch the bootstrap secret:

$ kubectl patch secret nvcf-api-account-bootstrap-secret -n nvcf \
>   -p '{"stringData":{"DEBUG":"true"}}'

Then follow the “Recovering from Services Failures” steps in helmfile-installation to redeploy. The next bootstrap job run will include detailed debug logs visible via kubectl logs job/nvcf-api-account-bootstrap -n nvcf.

To disable debug logging afterward:

$ kubectl patch secret nvcf-api-account-bootstrap-secret -n nvcf \
>   -p '{"stringData":{"DEBUG":"false"}}'

Common Causes:

Invalid registry credentials format - See [Incorrect Base64 Docker Credentials Format]
Wrong registry hostname - Hostname in secrets doesn’t match actual registry (e.g., using nvcr.io but credentials are for ECR)
Missing “$oauthtoken“ prefix - NGC credentials must be in format $oauthtoken:API_KEY

Solution:

Fix your secrets/<environment-name>-secrets.yaml file, then follow the “Recovering from Services Failures” steps in helmfile-installation to preserve your dependencies.

Pods Stuck in ImagePullBackOff

Symptoms: Pods cannot pull container images

Solutions:

Verify registry credentials:

$ # Check secret exists
$ kubectl get secret -n nvcf nvcf-image-pull-secret
$ 
$ # Verify credential is valid
$ kubectl get secret -n nvcf nvcf-image-pull-secret -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

Verify images exist in your registry:

$ # For ECR (replace with your repository name)
$ aws ecr describe-images --repository-name <your-ecr-repository-name> --region <your-region>
$ 
$ # For NGC (if using)
$ ngc registry image list nvidia/nvcf/*

Check network connectivity from cluster to registry

Pods Stuck in Pending

Symptoms: Pods remain in Pending state

Solutions:

Check cluster resources:
```
$ kubectl describe node <node-name>
```
Verify storage class exists:
```
$ kubectl get storageclass
```

Check node selectors:

$ # View pod events
$ kubectl describe pod -n <namespace> <pod-name>
$ 
$ # Check node labels
$ kubectl get nodes --show-labels

Helm Release in Failed State After First Install

Symptom:

First helmfile sync fails partway through
Re-running helmfile sync or helmfile apply appears to succeed but things don’t work
Migrations or initialization jobs weren’t executed

Root Cause:

When a Helm installation fails, the release remains in a failed state. Subsequent commands run helm upgrade instead of helm install, which skips initialization hooks (migrations, account bootstrap, etc.).

Solution:

Fix the underlying issue (credentials, config, etc.), then follow the appropriate recovery procedure in helmfile-installation:

If only services failed (dependencies are healthy): Use the “Recovering from Services Failures” steps to preserve your dependencies
If dependencies are also broken: Follow the “Uninstalling” section in helmfile-installation

NVCA Operator Fails: nvcfbackends CRD Not Found

Symptom:

NVCA Operator installation fails with CRD not found error:

Error: customresourcedefinitions.apiextensions.k8s.io "nvcfbackends.nvcf.nvidia.io" not found

Root Cause:

A race condition occurs where Helm validates CRD references before the CRD is created by the operator’s installation hooks. This can happen during first install or when reinstalling after the CRD was deleted.

Solution:

Two changes are required in helmfile.d/03-worker.yaml.gotmpl:

Add “disableValidation: true“ to the nvca-operator release to disable OpenAPI validation:

1 wait: true
2 waitForJobs: true
3 disableValidation: true  # Add this line
4 labels:
5   release-group: workers

Remove “—dry-run=server“ from the helmDefaults.diffArgs section. This prevents server-side validation during the diff phase, which fails when the CRD doesn’t exist:

1 helmDefaults:
2   createNamespace: true
3   devel: true
4   timeout: 900
5   wait: true
6   waitForJobs: true
7   # Note: --dry-run=server removed for worker releases to avoid CRD validation failures
8   # when reinstalling nvca-operator after CRD deletion

Then run ./force-cleanup-nvcf.sh followed by HELMFILE_ENV=<environment> helmfile sync.

Accessing OpenBao Secrets (CLI)

NVCF stores most service credentials, signing keys, and internal passwords in OpenBao (a Vault-compatible secrets manager) running in the vault-system namespace. Use the bao CLI inside the OpenBao pod to inspect or manage these secrets.

For the full bao CLI reference, see the OpenBao CLI documentation. The KV secrets engine commands are documented at OpenBao KV commands.

Retrieve the Root Token

The OpenBao root token is stored in a Kubernetes secret created during initialization:

$ # Retrieve the root token
$ export BAO_ROOT_TOKEN=$(kubectl get secret openbao-server-root-token \
>   -n vault-system -o jsonpath='{.data.root_token}' | base64 -d)

The root token grants unrestricted access to all secrets in OpenBao. Treat it as a highly sensitive credential and avoid storing it in shell history or logs.

List Secrets Engines

To see all mounted secrets engines (each NVCF service has its own path):

$ kubectl exec -it openbao-server-0 -c openbao -n vault-system -- \
>   env BAO_TOKEN=$BAO_ROOT_TOKEN \
>   bao secrets list

Example output (abbreviated):

Path                             Type                        Description
----                             ----                        -----------
services/all/kv/                 kv                          n/a
services/api-keys-api/jwt/       vault-plugin-secrets-jwt    n/a
services/api-keys-api/kv/        kv                          n/a
services/ess-api/jwt/            vault-plugin-secrets-jwt    n/a
services/ess-api/kv/             kv                          n/a
services/invocation-api/jwt/     vault-plugin-secrets-jwt    n/a
services/invocation-api/kv/      kv                          n/a
services/nvcf-api/jwt/           vault-plugin-secrets-jwt    n/a
services/nvcf-api/kv/            kv                          n/a
services/nvcf-notary/kv/         kv                          n/a
services/sis-api/jwt/            vault-plugin-secrets-jwt    n/a
services/sis-api/kv/             kv                          n/a
...

List and Read Secrets

Browse secrets under a specific engine path:

$ # List top-level keys under a secrets engine
$ kubectl exec -it openbao-server-0 -c openbao -n vault-system -- \
>   env BAO_TOKEN=$BAO_ROOT_TOKEN \
>   bao kv list services/nvcf-api/kv
$ 
$ # List keys in a subdirectory (paths ending in / are directories)
$ kubectl exec -it openbao-server-0 -c openbao -n vault-system -- \
>   env BAO_TOKEN=$BAO_ROOT_TOKEN \
>   bao kv list services/nvcf-api/kv/cassandra
$ 
$ # Read a specific secret
$ kubectl exec -it openbao-server-0 -c openbao -n vault-system -- \
>   env BAO_TOKEN=$BAO_ROOT_TOKEN \
>   bao kv get services/nvcf-api/kv/cassandra/creds

Use bao kv get -format=json <path> for machine-readable output, or bao kv get -field=<key> <path> to extract a single field.

Run Arbitrary `bao` Commands

You can run any bao subcommand by exec-ing into the pod with the root token:

$ # General pattern
$ kubectl exec -it openbao-server-0 -c openbao -n vault-system -- \
>   env BAO_TOKEN=$BAO_ROOT_TOKEN \
>   bao <command> [args]
$ 
$ # Examples:
$ # Check server status
$ kubectl exec -it openbao-server-0 -c openbao -n vault-system -- \
>   env BAO_TOKEN=$BAO_ROOT_TOKEN \
>   bao status
$ 
$ # List auth methods
$ kubectl exec -it openbao-server-0 -c openbao -n vault-system -- \
>   env BAO_TOKEN=$BAO_ROOT_TOKEN \
>   bao auth list

Debugging Techniques

Enabling Verbose Logging

To get more detailed logs from specific components:

For Migration Jobs:

$ # Edit the migration script configmap
$ kubectl edit configmap -n nvcf nvcf-api-migration-scripts
$ 
$ # Add to the top of the script:
$ set -x  # Enable command tracing
$ set -e  # Exit on error

Example For API Service:

$ # Set log level via environment variable
$ kubectl set env -n nvcf deployment/nvcf-api LOG_LEVEL=debug

Inspecting Failed Pods

$ # Get pod status with more details
$ kubectl get pods -n nvcf -o wide
$ 
$ # Describe a problematic pod
$ kubectl describe pod -n nvcf <pod-name>
$ 
$ # View logs (current)
$ kubectl logs -n nvcf <pod-name>
$ 
$ # View logs (previous if pod restarted)
$ kubectl logs -n nvcf <pod-name> --previous
$ 
$ # Follow logs in real-time
$ kubectl logs -n nvcf <pod-name> -f
$ 
$ # Logs for all containers in a pod
$ kubectl logs -n nvcf <pod-name> --all-containers

Checking Events

Kubernetes events often contain valuable debugging information:

$ # Get recent events for a namespace
$ kubectl get events -n nvcf --sort-by='.lastTimestamp'
$ 
$ # Get events for a specific pod
$ kubectl get events -n nvcf --field-selector involvedObject.name=<pod-name>
$ 
$ # Watch events in real-time
$ kubectl get events -n nvcf --watch

Recovery Procedures

For detailed recovery steps, see the Recovering from Partial Deployments section in helmfile-installation. This section provides quick reference for common scenarios.

Choosing a Recovery Strategy

Failure Scenario	Recovery Strategy	Reference
Dependencies failed (Cassandra, NATS, OpenBao)	Redeploy individual dependency	See `Redeploy Stuck Dependencies`_ below
Services failed (API, api-keys, etc.) but dependencies OK	Partial recovery (preserve dependencies)	See “Recovering from Services Failures” in helmfile-installation
Everything broken or uncertain state	Full uninstall and reinstall	See “Uninstalling” in helmfile-installation

Do not attempt to fix failed services by re-running helmfile sync or helmfile apply. Helm will skip initialization hooks (migrations, account bootstrap) on upgrade, resulting in a deployment that appears successful but doesn’t function correctly.

Redeploy Stuck Dependencies

Dependency services (Cassandra, NATS, OpenBao) can be safely redeployed without affecting other components:

$ # Redeploy only Cassandra
$ HELMFILE_ENV=<environment-name> helmfile --selector name=cassandra apply
$ 
$ # Redeploy all dependencies
$ HELMFILE_ENV=<environment-name> helmfile --selector release-group=dependencies apply

Reinstalling NVCA Operator Only

If only NVCA needs reinstalling (and NVCF services are working):

$ ./force-cleanup-nvcf.sh
$ HELMFILE_ENV=<environment-name> helmfile --selector release-group=workers sync

If NVCF services are also broken, follow the “Recovering from Services Failures” steps in helmfile-installation.

NVCA Force Cleanup Script

If helmfile destroy hangs on NVCA cleanup (typically when functions are still deployed in nvcf-backend), use the force cleanup script in a new terminal. See force-cleanup-script for the full script and usage instructions.

$ ./force-cleanup-nvcf.sh --dry-run  # Preview
$ ./force-cleanup-nvcf.sh            # Execute

Database Issues

Cassandra Pods OOM During Initialization

Symptoms:

Cassandra pods restart with OOMKilled or exit code 137.
The cassandra-migrations job fails with a consistency-level error such as Cannot achieve consistency level ALL.
The install does not continue to OpenBao or NVCF services.

Diagnosis:

Check Cassandra pod restarts and the previous container state:

$ kubectl -n cassandra-system get pods
$ kubectl -n cassandra-system describe pod cassandra-0 | grep -E "OOMKilled|Exit Code: 137|Last State"

Check the migration job log for the first failed keyspace:

$ kubectl -n cassandra-system logs job/cassandra-helm-nvcf-cassandra-migrations

Root cause:

The Cassandra resource preset is too small for first boot, commit-log replay, or migration startup. The small preset is not recommended for cloud installs, and environments that still OOM on xlarge should move to 2xlarge.

Solution:

Increase the preset in your environment file, then resync Cassandra:

1 cassandra:
2   resourcesPreset: "2xlarge"

$ HELMFILE_ENV=<environment-name> helmfile --selector name=cassandra sync

If Cassandra was interrupted during migration, also check for dirty migration state before rerunning the full install.

Cassandra Pods Restart Due to Startup Probe Failures

Symptoms:

Cassandra pods restart repeatedly without an OOMKilled exit code.
kubectl describe shows Startup probe failed or container killed: failed startup probe.
The issue appears on resource-constrained nodes (Colima on Apple Silicon, low CPU quota, emulated architectures).

Diagnosis:

Check events on the Cassandra pod:

$ kubectl -n cassandra-system describe pod cassandra-0 | grep -A 2 "startup probe"

If the events show repeated startup probe failures before Cassandra finishes initializing, the probe window is too short for the available CPU.

Root cause:

The default startup probe runs nodetool status every 10 seconds with a 5-second timeout and allows 60 failures (10-minute window). Under CPU contention or architecture emulation, Cassandra initialization can take longer than this window.

Solution:

Relax the startup probe thresholds in your environment file to widen the window from 10 minutes to 30 minutes:

1 cassandra:
2   startupProbe:
3     failureThreshold: 120
4     periodSeconds: 15
5     timeoutSeconds: 10

$ HELMFILE_ENV=<environment-name> helmfile --selector name=cassandra sync

The local helmfile environment ships these relaxed defaults when using the self-managed stack.

Cassandra Migrations Fail After Dirty Migration State

Symptoms:

The cassandra-migrations job fails repeatedly after Cassandra restarts during a previous migration attempt.
Logs show an error similar to no migration found for version 0, or another migration error that does not identify the failed keyspace state.
A row in a schema_migrations.<keyspace> table has dirty=true.

Diagnosis:

The migration bookkeeping tables live in the schema_migrations keyspace, with one table per application keyspace. Check the keyspace named in the migration logs:

$ CPASS=$(kubectl -n cassandra-system get secret cassandra \
>   -o jsonpath='{.data.cassandra-password}' | base64 -d)
$ 
$ kubectl -n cassandra-system exec cassandra-0 -c cassandra -- \
>   /opt/bitnami/cassandra/bin/cqlsh -u cassandra -p "$CPASS" localhost \
>   -e "SELECT version, dirty FROM schema_migrations.<keyspace>;"

Root cause:

golang-migrate marks a keyspace dirty when a migration attempt is interrupted. Cassandra DDL statements may already have committed, but the migration marker remains dirty and blocks the next run.

Solution:

First verify whether the failed migration’s DDL was applied or needs manual reconciliation. After the schema matches the dirty version, clear the dirty flag and rerun the migration job:

$ kubectl -n cassandra-system exec cassandra-0 -c cassandra -- \
>   /opt/bitnami/cassandra/bin/cqlsh -u cassandra -p "$CPASS" localhost \
>   -e "UPDATE schema_migrations.<keyspace> SET dirty = false WHERE version = <version>;"
$ 
$ HELMFILE_ENV=<environment-name> helmfile --selector name=cassandra sync

If more than one keyspace is dirty, repeat the diagnosis and reconciliation for each affected schema_migrations.<keyspace> table.

Cassandra Migration Stuck Due to Missing ConfigMap

Symptoms:

Cassandra pods are running but migration job is stuck.

$ kubectl get pods -n cassandra-system
$ 
$ ...
$ pod/cassandra-initialize-cluster-qp4ft   0/1     ContainerCreating # Stuck in ContainerCreating state perpetually

Diagnosis:

Check all Cassandra resources including ConfigMaps:

$ kubectl -n cassandra-system get all,secrets,sa,cm

Expected output should show 3 ConfigMaps:

NAME                              DATA   AGE
configmap/cassandra-init-cql      1      5d19h # This one may be missing
configmap/cassandra-init-script   1      100s  # This one may be missing
configmap/kube-root-ca.crt        1      8d

If you only see 2 ConfigMaps (missing cassandra-migrations), this is a race condition during deployment.

Root Cause:

A race condition can occur where the Cassandra migration job starts before all ConfigMaps are created, causing the deployment to hang.

Solution:

Force a sync to recreate missing resources:

$ # Use helmfile sync instead of apply to force resource recreation
$ HELMFILE_ENV="<environment>" \
> helmfile --environment default --selector name=cassandra sync

The sync command differs from apply in that it will recreate resources if needed, which resolves the ConfigMap race condition.

Alternative Solution:

If the above doesn’t work, you can safely redeploy Cassandra (it’s a dependency without complex initialization hooks):

$ # Delete the stuck migration job first
$ kubectl delete job -n cassandra-system cassandra-migrations
$ 
$ # Then redeploy Cassandra
$ HELMFILE_ENV=<environment-name> helmfile --selector name=cassandra apply

Streaming Issues

WebRTC Streaming Fails After Function Shows Active

Symptom:

A streaming function deploys and shows ACTIVE
WebRTC clients fail to connect with NVST_R_GENERIC_ERROR
No errors in the function pod logs

Root Cause:

UDP traffic on the Kubernetes NodePort range (30000-32767) is blocked by a cloud-provider network security rule. The function health checks pass over TCP, so the function appears healthy, but the UDP media path is unreachable.

On Azure (AKS), AKS attaches a second NSG to node NICs in the managed resource group (MC_<resource-group>_<cluster>_<region>). Even if the subnet NSG allows UDP, the NIC NSG blocks it by default.

Diagnosis:

Confirm the function is ACTIVE and pods are running:

$ kubectl get pods -n nvcf-backend -l nvcf-function-name=<function-name>

On Azure, check whether the NIC NSG has a UDP allow rule:

$ MC_RG="MC_${RESOURCE_GROUP}_${CLUSTER_NAME}_${LOCATION}"
$ for NIC_NSG in $(az network nsg list -g "$MC_RG" --query "[].name" -o tsv); do
$   echo "=== $NIC_NSG ==="
$   az network nsg rule list -g "$MC_RG" --nsg-name "$NIC_NSG" \
>     --query "[?protocol=='Udp']" -o table
$ done

If any NSG is missing a UDP rule, add it to all of them:

$ for NIC_NSG in $(az network nsg list -g "$MC_RG" --query "[].name" -o tsv); do
$   az network nsg rule create -g "$MC_RG" --nsg-name "$NIC_NSG" \
>     -n allow-udp-nodeports-webrtc --priority 510 \
>     --direction Inbound --access Allow --protocol Udp \
>     --source-address-prefix Internet --source-port-range "*" \
>     --destination-address-prefix "*" --destination-port-range "30000-32767"
$ done

See Cloud Provider Network Requirements for the full CSP networking checklist.

gRPC Session Resumption Fails

Symptom

A streaming client receives gRPC NotFound (grpc-status: 5) with “no existing session found” when reconnecting to a function that was previously working. The function shows active in the control plane. The HTTP status remains 200 for gRPC responses. The proxy also clears the nvcf-request-id cookie so the next retry starts a fresh session.

Root cause

The client is sending a stale nvcf-reqid header or nvcf-request-id cookie from a previous session that no longer exists. The gRPC proxy looks up the request ID, finds no matching worker session, and returns gRPC NotFound.

Common causes of stale request IDs:

The worker pod was evicted or restarted between requests.
The session timed out because the connection was idle too long.
The client is reusing a request ID from a different function or function version.

Diagnosis

Check whether the client is sending the nvcf-reqid header or nvcf-request-id cookie:

$ # Look for session-not-found errors in gRPC proxy logs.
$ kubectl logs -n nvcf -l app.kubernetes.io/name=grpc-proxy --tail=200 \
>   | grep -E "no existing session found|worker connection not found"

If the logs show “no existing session found for request id”, the client is referencing an expired session.

Resolution

Remove the stale request ID from the client and reconnect without it. The proxy creates a new session and returns a fresh request ID. Update the client to handle gRPC NotFound by discarding the stored request ID and retrying without it.

See Session Resumption for the full request ID lifecycle.

Getting Help

When requesting support, provide:

Environment details:

$ kubectl version
$ helm version
$ kubectl get nodes -o wide

Deployment configuration:
- Environment file (sanitized)
- Secrets file structure (sanitized - no actual secrets!)

Relevant logs:

$ kubectl logs -n <namespace>  <problematic-pod> > pod-logs.txt

Events:

$ kubectl get events -n <namespace> --sort-by='.lastTimestamp' > events.txt

Resource status:

$ kubectl get all -n <namespace>  -o wide > resources.txt

Appendix: NVCA Force Cleanup Script

This script forcefully removes all NVCA components from a cluster. Use it when helmfile destroy hangs on NVCA cleanup, typically because functions are still deployed in nvcf-backend.

This script bypasses normal cleanup procedures by removing finalizers. Always try helmfile destroy first.

#!/bin/bash
# =============================================================================
# force-cleanup-nvcf.sh - NVCA Component Removal Script
# =============================================================================
# This script forcefully removes all NVCA components from a cluster.
# Use this as a LAST RESORT when normal cleanup methods fail due to:
# - Stuck finalizers on namespaces or custom resources
# - Orphaned resources blocking deletion
# - Partial deployments that need complete removal
#
# WARNING: This script will FORCEFULLY remove all NVCA resources, including
# removing finalizers which bypasses normal cleanup procedures.
#
# Usage: ./force-cleanup-nvcf.sh [--dry-run]
# =============================================================================
set -euo pipefail
# --- Configuration ---
# NVCA-related namespaces
NVCA_NAMESPACES=(
    "nvcf-backend"
    "nvca-system"
    "nvca-operator"
)
# CRDs created by NVCA components
NVCA_CRDS=(
    "nvcfbackends.nvcf.nvidia.io"
)
# --- Parse Arguments ---
DRY_RUN=false
while [[ $# -gt 0 ]]; do
    case $1 in
        --dry-run)
            DRY_RUN=true
            shift
            ;;
        -h|--help)
            echo "Usage: $0 [--dry-run]"
            echo ""
            echo "Options:"
            echo "  --dry-run            Show what would be deleted without making changes"
            echo "  -h, --help           Show this help message"
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            exit 1
            ;;
    esac
done
echo "=============================================="
echo "NVCA Force Cleanup Script"
echo "=============================================="
if $DRY_RUN; then
    echo "MODE: DRY-RUN (no changes will be made)"
fi
echo ""
# --- Step 1: Show and delete function pods in nvcf-backend ---
echo ">>> Step 1: Checking for function pods in nvcf-backend namespace..."
if kubectl get namespace nvcf-backend >/dev/null 2>&1; then
    pods=$(kubectl get pods -n nvcf-backend -o name 2>/dev/null || true)
    if [[ -n "$pods" ]]; then
        echo "    Found the following pods that will be deleted:"
        kubectl get pods -n nvcf-backend -o wide 2>/dev/null || true
        echo ""
        if ! $DRY_RUN; then
            echo "    Deleting all pods in nvcf-backend..."
            kubectl delete pods -n nvcf-backend --all --force --grace-period=0 2>/dev/null || true
        else
            echo "[DRY-RUN] Would delete all pods in nvcf-backend namespace"
        fi
    else
        echo "    No pods found in nvcf-backend namespace"
    fi
else
    echo "    nvcf-backend namespace not found, skipping..."
fi
echo ""
# --- Step 2: Delete NVCFBackend Custom Resources ---
echo ">>> Step 2: Deleting NVCFBackend custom resources..."
if kubectl get crd nvcfbackends.nvcf.nvidia.io >/dev/null 2>&1; then
    echo "    Found NVCFBackend CRD, deleting all instances..."
    if ! $DRY_RUN; then
        kubectl delete nvcfbackends -A --all --wait=false 2>/dev/null || true
        echo "    Waiting 15 seconds for operator cleanup..."
        sleep 15
    else
        echo "[DRY-RUN] Would delete all NVCFBackends and wait for cleanup"
    fi
else
    echo "    NVCFBackend CRD not found, skipping..."
fi
echo ""
# --- Step 3: Delete Helm Releases ---
echo ">>> Step 3: Deleting Helm releases in NVCA namespaces..."
for ns in "${NVCA_NAMESPACES[@]}"; do
    if kubectl get namespace "$ns" >/dev/null 2>&1; then
        releases=$(helm list -n "$ns" -q 2>/dev/null || true)
        if [[ -n "$releases" ]]; then
            for release in $releases; do
                echo "    Deleting Helm release: $release (namespace: $ns)"
                if ! $DRY_RUN; then
                    helm delete -n "$ns" "$release" --wait=false 2>/dev/null || true
                fi
            done
        fi
    fi
done
echo ""
# --- Step 4: Force-delete stuck NVCFBackend resources (remove finalizers) ---
echo ">>> Step 4: Removing finalizers from stuck NVCFBackend resources..."
if kubectl get crd nvcfbackends.nvcf.nvidia.io >/dev/null 2>&1; then
    nvcfbackends=$(kubectl get nvcfbackends -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' 2>/dev/null || true)
    if [[ -n "$nvcfbackends" ]]; then
        while IFS= read -r backend; do
            if [[ -n "$backend" ]]; then
                ns=$(echo "$backend" | cut -d'/' -f1)
                name=$(echo "$backend" | cut -d'/' -f2)
                echo "    Removing finalizers from NVCFBackend: $name (namespace: $ns)"
                if ! $DRY_RUN; then
                    kubectl patch nvcfbackend "$name" -n "$ns" -p '{"metadata":{"finalizers":[]}}' --type=merge 2>/dev/null || true
                    kubectl delete nvcfbackend "$name" -n "$ns" --wait=false 2>/dev/null || true
                fi
            fi
        done <<< "$nvcfbackends"
    else
        echo "    No stuck NVCFBackend resources found"
    fi
else
    echo "    NVCFBackend CRD not found, skipping..."
fi
echo ""
# --- Step 5: Delete Namespaces ---
echo ">>> Step 5: Deleting NVCA namespaces..."
for ns in "${NVCA_NAMESPACES[@]}"; do
    if kubectl get namespace "$ns" >/dev/null 2>&1; then
        echo "    Deleting namespace: $ns"
        if ! $DRY_RUN; then
            kubectl delete namespace "$ns" --wait=false 2>/dev/null || true
        fi
    fi
done
echo "    Waiting 10 seconds for namespace deletion..."
if ! $DRY_RUN; then
    sleep 10
fi
echo ""
# --- Step 6: Force-remove finalizers from stuck namespaces ---
echo ">>> Step 6: Removing finalizers from stuck namespaces..."
for ns in "${NVCA_NAMESPACES[@]}"; do
    phase=$(kubectl get namespace "$ns" -o jsonpath='{.status.phase}' 2>/dev/null || true)
    if [[ "$phase" == "Terminating" ]]; then
        echo "    Namespace $ns is stuck in Terminating, removing finalizers..."
        if ! $DRY_RUN; then
            # First, try to remove finalizers from all resources in the namespace
            for resource_type in deployments statefulsets daemonsets replicasets pods services configmaps secrets serviceaccounts roles rolebindings; do
                kubectl get "$resource_type" -n "$ns" -o name 2>/dev/null | while read -r resource; do
                    kubectl patch "$resource" -n "$ns" -p '{"metadata":{"finalizers":[]}}' --type=merge 2>/dev/null || true
                done
            done
            
            # Remove namespace finalizers using the API
            kubectl get namespace "$ns" -o json | \
                jq '.spec.finalizers = []' | \
                kubectl replace --raw "/api/v1/namespaces/$ns/finalize" -f - 2>/dev/null || true
        fi
    fi
done
echo ""
# --- Step 7: Delete CRDs ---
echo ">>> Step 7: Deleting NVCA CRDs..."
for crd in "${NVCA_CRDS[@]}"; do
    if kubectl get crd "$crd" >/dev/null 2>&1; then
        echo "    Deleting CRD: $crd"
        if ! $DRY_RUN; then
            kubectl delete crd "$crd" --wait=false 2>/dev/null || true
        fi
    fi
done
echo ""
# --- Step 8: Verification ---
echo ">>> Step 8: Verification..."
echo ""
echo "Remaining NVCA namespaces:"
remaining_ns=0
for ns in "${NVCA_NAMESPACES[@]}"; do
    if kubectl get namespace "$ns" >/dev/null 2>&1; then
        phase=$(kubectl get namespace "$ns" -o jsonpath='{.status.phase}' 2>/dev/null || echo "Unknown")
        echo "    - $ns (status: $phase)"
        remaining_ns=$((remaining_ns + 1))
    fi
done
if [[ $remaining_ns -eq 0 ]]; then
    echo "    None - all namespaces removed successfully"
fi
echo ""
echo "Remaining NVCA CRDs:"
remaining_crds=0
for crd in "${NVCA_CRDS[@]}"; do
    if kubectl get crd "$crd" >/dev/null 2>&1; then
        echo "    - $crd"
        remaining_crds=$((remaining_crds + 1))
    fi
done
if [[ $remaining_crds -eq 0 ]]; then
    echo "    None - all CRDs removed successfully"
fi
echo ""
echo "=============================================="
if $DRY_RUN; then
    echo "DRY-RUN complete. No changes were made."
else
    if [[ $remaining_ns -eq 0 ]] && [[ $remaining_crds -eq 0 ]]; then
        echo "Cleanup complete! All NVCA resources have been removed."
    else
        echo "Cleanup finished with some resources remaining."
        echo "You may need to run this script again or investigate manually."
    fi
fi
echo "=============================================="

force-cleanup-nvcf.sh

Usage:

Download or copy the script to your working directory
Make executable: chmod +x force-cleanup-nvcf.sh
Preview what will be deleted:
```
$ ./force-cleanup-nvcf.sh --dry-run
```
Run the cleanup:
```
$ ./force-cleanup-nvcf.sh
```

What the script does:

Lists and force-deletes all function pods in nvcf-backend namespace
Deletes all NVCFBackend custom resources
Deletes Helm releases in NVCA namespaces
Removes finalizers from stuck NVCFBackend resources
Deletes the NVCA namespaces (nvcf-backend, nvca-system, nvca-operator)
Removes finalizers from namespaces stuck in Terminating state
Deletes the NVCFBackend CRD
Verifies cleanup completion

Configuration Issues and Best Practices

Incorrect Base64 Docker Credentials Format

Registry Credential Change Not Taking Effect

Identifying Deployment Issues

Installation & Deployment Issues

Account Bootstrap Job Failures

Pods Stuck in ImagePullBackOff

Pods Stuck in Pending

Helm Release in Failed State After First Install

NVCA Operator Fails: nvcfbackends CRD Not Found

Accessing OpenBao Secrets (CLI)

Retrieve the Root Token

List Secrets Engines

List and Read Secrets

Run Arbitrary bao Commands

Debugging Techniques

Enabling Verbose Logging

Inspecting Failed Pods

Checking Events

Recovery Procedures

Choosing a Recovery Strategy

Redeploy Stuck Dependencies

Reinstalling NVCA Operator Only

NVCA Force Cleanup Script

Database Issues

Cassandra Pods OOM During Initialization

Cassandra Pods Restart Due to Startup Probe Failures

Cassandra Migrations Fail After Dirty Migration State

Cassandra Migration Stuck Due to Missing ConfigMap

Streaming Issues

WebRTC Streaming Fails After Function Shows Active

gRPC Session Resumption Fails

Symptom

Root cause

Diagnosis

Resolution

Getting Help

Appendix: NVCA Force Cleanup Script

Run Arbitrary `bao` Commands