Troubleshooting / FAQ#
This appendix contains common issues, tips, and clarifications learned from deploying self-hosted NVCF.
Configuration Issues and Best Practices#
Incorrect Base64 Docker Credentials Format#
Symptom:
- API migration job fails during installation
- Error appears as a “timeout” in the logs
- Re-running `helmfile sync` or `helmfile apply` appears to succeed, but the deployment doesn't work properly
- Functions fail to deploy or pull images
Root Cause:
The base64-encoded Docker credential in secrets.yaml was incorrectly formatted. A common mistake is encoding only the NGC API key instead of the full basic auth credential in the format $oauthtoken:API_KEY.
Incorrect (will fail):
# ❌ WRONG - Only encoding the API key
echo -n 'nvapi-1234567890abcdef' | base64
# Results in: bnZhcGktMTIzNDU2Nzg5MGFiY2RlZg==
Correct:
# ✅ CORRECT - Encoding the full credential in basic auth format
echo -n '$oauthtoken:nvapi-1234567890abcdef' | base64
# Results in: JG9hdXRodG9rZW46bnZhcGktMTIzNDU2Nzg5MGFiY2RlZg==
How to Diagnose:
Check the migration job logs specifically:
kubectl logs -n nvcf job/nvcf-api-migration -c migration
If you don’t see detailed errors, add debug output to migration scripts:
# Add set -x to the migration script for verbose output
kubectl edit configmap -n nvcf nvcf-api-migration-scripts
# Add 'set -x' at the top of the script
Fix your secrets.yaml with the correct base64 credential, then follow Choosing a Recovery Strategy.
How to Prevent:
- Always use the correct format: encode `$oauthtoken:YOUR_API_KEY`, not just the API key
- Verify before deploying: decode your base64 string to verify it's correct:
# Verify your encoded credential
echo 'YOUR_BASE64_STRING' | base64 -d
# Should output: $oauthtoken:nvapi-1234567890abcdef
Test NGC authentication: Before deploying, test that your credential works:
# Test NGC login with your credential
credential=$(echo 'YOUR_BASE64_STRING' | base64 -d)
username="${credential%%:*}"   # $oauthtoken
password="${credential#*:}"    # your API key
docker login nvcr.io -u "$username" -p "$password"
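The prevention checks above can be combined into a small pre-deploy validator. This is a sketch; the function name is illustrative, and the sample credential is the example value from this section:

```shell
# validate_cred: verify a base64-encoded Docker credential decodes to the
# expected "$oauthtoken:<API key>" basic-auth form (illustrative helper)
validate_cred() {
  decoded=$(printf '%s' "$1" | base64 -d 2>/dev/null) || {
    echo "not valid base64" >&2
    return 1
  }
  case "$decoded" in
    '$oauthtoken:'*) echo "credential format OK" ;;
    *) echo "missing \$oauthtoken: prefix (got: $decoded)" >&2; return 1 ;;
  esac
}

# Example with the sample credential from above:
validate_cred 'JG9hdXRodG9rZW46bnZhcGktMTIzNDU2Nzg5MGFiY2RlZg=='
```

Running the validator against your real secrets file before `helmfile sync` catches the most common bootstrap failure early.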
Identifying Deployment Issues#
Use these commands to diagnose deployment problems. For phase-by-phase monitoring during installation, see the Deployment Progression section in Control Plane Installation.
Find Stuck Deployments:
# Pods stuck in pending state
kubectl get pods -A | grep Pending
# Pods with image pull issues
kubectl get pods -A | grep -E "ImagePullBackOff|ErrImagePull"
# Pods in crash loops
kubectl get pods -A | grep CrashLoopBackOff
# Check recent events for errors
kubectl get events --sort-by='.lastTimestamp' -A | grep -i error | tail -10
Resource Check:
# Check node resources
kubectl top nodes
# Check pod resource usage
kubectl top pods -A | grep -E "nvcf|nats|cassandra|openbao"
Installation & Deployment Issues#
Account Bootstrap Job Failures#
Symptom:
- `helmfile sync` hangs or fails during the services phase
- Events show `BackoffLimitExceeded` for `nvcf-api-account-bootstrap`
- Bootstrap pod shows `CrashLoopBackOff` or `Error` status
Diagnosis:
Watch events in real-time (run this as soon as helmfile reaches services phase):
kubectl get events -n nvcf -w
Check the bootstrap job logs:
kubectl logs job/nvcf-api-account-bootstrap -n nvcf
Check the NVCF API logs for detailed error messages:
kubectl logs -n nvcf -l app.kubernetes.io/name=nvcf-api --tail=100
Note
The bootstrap job auto-deletes after ~5 minutes (ttlSecondsAfterFinished: 300). Monitor events to catch failures in real-time.
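Because the job is garbage-collected so quickly, a small polling helper can save the logs as soon as the pod produces output. This is a sketch; the function name is illustrative:

```shell
# capture_job_logs: poll for a job's logs and save them to a file before
# ttlSecondsAfterFinished garbage-collects the job (illustrative helper)
capture_job_logs() {
  job="$1"; ns="$2"; out="$3"
  attempts="${4:-60}"   # poll up to ~5 minutes at 5s intervals
  for _ in $(seq 1 "$attempts"); do
    if kubectl logs "job/$job" -n "$ns" > "$out" 2>/dev/null && [ -s "$out" ]; then
      echo "logs saved to $out"
      return 0
    fi
    sleep 5
  done
  echo "no logs captured for job/$job" >&2
  return 1
}

# Usage: capture_job_logs nvcf-api-account-bootstrap nvcf bootstrap.log
```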
Tip
Enable debug logging for the bootstrap job. The account bootstrap script supports
a DEBUG environment variable that enables verbose output. To enable it before
redeploying, patch the bootstrap secret:
kubectl patch secret nvcf-api-account-bootstrap-secret -n nvcf \
-p '{"stringData":{"DEBUG":"true"}}'
Then follow the “Recovering from Services Failures” steps in Control Plane Installation
to redeploy. The next bootstrap job run will include detailed debug logs visible via
kubectl logs job/nvcf-api-account-bootstrap -n nvcf.
To disable debug logging afterward:
kubectl patch secret nvcf-api-account-bootstrap-secret -n nvcf \
-p '{"stringData":{"DEBUG":"false"}}'
Common Causes:
- Invalid registry credentials format - see Incorrect Base64 Docker Credentials Format
- Wrong registry hostname - the hostname in secrets doesn't match the actual registry (e.g., using `nvcr.io` when the credentials are for ECR)
- Missing `$oauthtoken` prefix - NGC credentials must be in the format `$oauthtoken:API_KEY`
Solution:
Fix your secrets/<environment-name>-secrets.yaml file, then follow the “Recovering from Services Failures” steps in Control Plane Installation to preserve your dependencies.
Pods Stuck in ImagePullBackOff#
Symptoms: Pods cannot pull container images
Solutions:
Verify registry credentials:
# Check secret exists
kubectl get secret -n nvcf nvcf-image-pull-secret
# Verify credential is valid
kubectl get secret -n nvcf nvcf-image-pull-secret -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
Verify images exist in your registry:
# For ECR (replace with your repository name)
aws ecr describe-images --repository-name <your-ecr-repository-name> --region <your-region>
# For NGC (if using)
ngc registry image list 0833294136851237/nvcf-ncp-staging/*
Check network connectivity from cluster to registry
Pods Stuck in Pending#
Symptoms: Pods remain in Pending state
Solutions:
Check cluster resources:
kubectl describe node <node-name>
Verify storage class exists:
kubectl get storageclass
Check node selectors:
# View pod events
kubectl describe pod -n <namespace> <pod-name>
# Check node labels
kubectl get nodes --show-labels
Helm Release in Failed State After First Install#
Symptom:
- First `helmfile sync` fails partway through
- Re-running `helmfile sync` or `helmfile apply` appears to succeed, but things don't work
- Migrations or initialization jobs weren't executed
Root Cause:
When a Helm installation fails, the release remains in a failed state. Subsequent commands run helm upgrade instead of helm install, which skips initialization hooks (migrations, account bootstrap, etc.).
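You can confirm a release is in this state before choosing a recovery path. This is a sketch; the function name is illustrative, and `nvcf-api` in the usage line is an example release name:

```shell
# release_hooks_skipped: report whether a Helm release is in a state where
# install hooks will be skipped on the next upgrade (illustrative helper)
release_hooks_skipped() {
  release="$1"; ns="$2"
  status=$(helm status "$release" -n "$ns" 2>/dev/null | awk -F': ' '/^STATUS:/ {print $2}')
  case "$status" in
    failed|pending-install) echo "yes ($status) - use a recovery procedure, not helm upgrade" ;;
    deployed)               echo "no (deployed)" ;;
    *)                      echo "unknown status: ${status:-none}" ;;
  esac
}

# Usage: release_hooks_skipped nvcf-api nvcf
```

`helm list -n nvcf --all` also shows failed releases that the default listing hides.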
Solution:
Fix the underlying issue (credentials, config, etc.), then follow the appropriate recovery procedure in Control Plane Installation:
If only services failed (dependencies are healthy): Use the “Recovering from Services Failures” steps to preserve your dependencies
If dependencies are also broken: Follow the “Uninstalling” section in Control Plane Installation
NVCA Operator Fails: nvcfbackends CRD Not Found#
Symptom:
NVCA Operator installation fails with CRD not found error:
Error: customresourcedefinitions.apiextensions.k8s.io "nvcfbackends.nvcf.nvidia.io" not found
Root Cause:
A race condition occurs where Helm validates CRD references before the CRD is created by the operator’s installation hooks. This can happen during first install or when reinstalling after the CRD was deleted.
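When scripting installs, one way to sidestep this race is to wait until the CRD exists before starting the worker phase. This is a sketch, not part of the documented fix below; the function name is illustrative:

```shell
# wait_for_crd: poll until a CRD exists, to avoid validating chart resources
# against a CRD that hasn't been created yet (illustrative helper)
wait_for_crd() {
  crd="$1"; timeout="${2:-120}"; elapsed=0
  until kubectl get crd "$crd" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out waiting for CRD $crd" >&2
      return 1
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "CRD $crd is present"
}

# Usage: wait_for_crd nvcfbackends.nvcf.nvidia.io
```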
Solution:
Two changes are required in helmfile.d/03-worker.yaml.gotmpl:
Add `disableValidation: true` to the `nvca-operator` release to disable OpenAPI validation:
wait: true
waitForJobs: true
disableValidation: true  # Add this line
labels:
  release-group: workers
Remove `--dry-run=server` from the `helmDefaults.diffArgs` section. This prevents server-side validation during the diff phase, which fails when the CRD doesn't exist:
helmDefaults:
  createNamespace: true
  devel: true
  timeout: 900
  wait: true
  waitForJobs: true
  # Note: --dry-run=server removed for worker releases to avoid CRD validation failures
  # when reinstalling nvca-operator after CRD deletion
Then run `./force-cleanup-nvcf.sh` followed by `HELMFILE_ENV=<environment> helmfile sync`.
Accessing OpenBao Secrets (CLI)#
NVCF stores most service credentials, signing keys, and internal passwords in
OpenBao (a Vault-compatible secrets manager) running in the
vault-system namespace. Use the bao CLI inside the OpenBao pod to inspect or
manage these secrets.
See also
For the full bao CLI reference, see the
OpenBao CLI documentation.
The KV secrets engine commands are documented at
OpenBao KV commands.
Retrieve the Root Token#
The OpenBao root token is stored in a Kubernetes secret created during initialization:
# Retrieve the root token
export BAO_ROOT_TOKEN=$(kubectl get secret openbao-server-root-token \
-n vault-system -o jsonpath='{.data.root_token}' | base64 -d)
Warning
The root token grants unrestricted access to all secrets in OpenBao. Treat it as a highly sensitive credential and avoid storing it in shell history or logs.
List Secrets Engines#
To see all mounted secrets engines (each NVCF service has its own path):
kubectl exec -it openbao-server-0 -c openbao -n vault-system -- \
env BAO_TOKEN=$BAO_ROOT_TOKEN \
bao secrets list
Example output (abbreviated):
Path                            Type                      Description
----                            ----                      -----------
services/all/kv/                kv                        n/a
services/api-keys-api/jwt/      vault-plugin-secrets-jwt  n/a
services/api-keys-api/kv/       kv                        n/a
services/ess-api/jwt/           vault-plugin-secrets-jwt  n/a
services/ess-api/kv/            kv                        n/a
services/invocation-api/jwt/    vault-plugin-secrets-jwt  n/a
services/invocation-api/kv/     kv                        n/a
services/nvcf-api/jwt/          vault-plugin-secrets-jwt  n/a
services/nvcf-api/kv/           kv                        n/a
services/nvcf-notary/kv/        kv                        n/a
services/sis-api/jwt/           vault-plugin-secrets-jwt  n/a
services/sis-api/kv/            kv                        n/a
...
List and Read Secrets#
Browse secrets under a specific engine path:
# List top-level keys under a secrets engine
kubectl exec -it openbao-server-0 -c openbao -n vault-system -- \
env BAO_TOKEN=$BAO_ROOT_TOKEN \
bao kv list services/nvcf-api/kv
# List keys in a subdirectory (paths ending in / are directories)
kubectl exec -it openbao-server-0 -c openbao -n vault-system -- \
env BAO_TOKEN=$BAO_ROOT_TOKEN \
bao kv list services/nvcf-api/kv/cassandra
# Read a specific secret
kubectl exec -it openbao-server-0 -c openbao -n vault-system -- \
env BAO_TOKEN=$BAO_ROOT_TOKEN \
bao kv get services/nvcf-api/kv/cassandra/creds
Tip
Use bao kv get -format=json <path> for machine-readable output, or
bao kv get -field=<key> <path> to extract a single field.
Run Arbitrary bao Commands#
You can run any bao subcommand by exec-ing into the pod with the root token:
# General pattern
kubectl exec -it openbao-server-0 -c openbao -n vault-system -- \
env BAO_TOKEN=$BAO_ROOT_TOKEN \
bao <command> [args]
# Examples:
# Check server status
kubectl exec -it openbao-server-0 -c openbao -n vault-system -- \
env BAO_TOKEN=$BAO_ROOT_TOKEN \
bao status
# List auth methods
kubectl exec -it openbao-server-0 -c openbao -n vault-system -- \
env BAO_TOKEN=$BAO_ROOT_TOKEN \
bao auth list
Debugging Techniques#
Enabling Verbose Logging#
To get more detailed logs from specific components:
For Migration Jobs:
# Edit the migration script configmap
kubectl edit configmap -n nvcf nvcf-api-migration-scripts
# Add to the top of the script:
set -x # Enable command tracing
set -e # Exit on error
For the API Service:
# Set log level via environment variable
kubectl set env -n nvcf deployment/nvcf-api LOG_LEVEL=debug
Inspecting Failed Pods#
# Get pod status with more details
kubectl get pods -n nvcf -o wide
# Describe a problematic pod
kubectl describe pod -n nvcf <pod-name>
# View logs (current)
kubectl logs -n nvcf <pod-name>
# View logs (previous if pod restarted)
kubectl logs -n nvcf <pod-name> --previous
# Follow logs in real-time
kubectl logs -n nvcf <pod-name> -f
# Logs for all containers in a pod
kubectl logs -n nvcf <pod-name> --all-containers
Checking Events#
Kubernetes events often contain valuable debugging information:
# Get recent events for a namespace
kubectl get events -n nvcf --sort-by='.lastTimestamp'
# Get events for a specific pod
kubectl get events -n nvcf --field-selector involvedObject.name=<pod-name>
# Watch events in real-time
kubectl get events -n nvcf --watch
Recovery Procedures#
For detailed recovery steps, see the Recovering from Partial Deployments section in Control Plane Installation. This section provides a quick reference for common scenarios.
Choosing a Recovery Strategy#
| Failure Scenario | Recovery Strategy | Reference |
|---|---|---|
| Dependencies failed (Cassandra, NATS, OpenBao) | Redeploy individual dependency | See Redeploy Stuck Dependencies below |
| Services failed (API, api-keys, etc.) but dependencies OK | Partial recovery (preserve dependencies) | See "Recovering from Services Failures" in Control Plane Installation |
| Everything broken or uncertain state | Full uninstall and reinstall | See "Uninstalling" in Control Plane Installation |
Warning
Do not attempt to fix failed services by re-running helmfile sync or helmfile apply. Helm will skip initialization hooks (migrations, account bootstrap) on upgrade, resulting in a deployment that appears successful but doesn’t function correctly.
Redeploy Stuck Dependencies#
Dependency services (Cassandra, NATS, OpenBao) can be safely redeployed without affecting other components:
# Redeploy only Cassandra
HELMFILE_ENV=<environment-name> helmfile --selector name=cassandra apply
# Redeploy all dependencies
HELMFILE_ENV=<environment-name> helmfile --selector release-group=dependencies apply
Reinstalling NVCA Operator Only#
If only NVCA needs reinstalling (and NVCF services are working):
./force-cleanup-nvcf.sh
HELMFILE_ENV=<environment-name> helmfile --selector release-group=workers sync
If NVCF services are also broken, follow the “Recovering from Services Failures” steps in Control Plane Installation.
NVCA Force Cleanup Script#
If helmfile destroy hangs on NVCA cleanup (typically when functions are still deployed in nvcf-backend), use the force cleanup script in a new terminal. See Appendix: NVCA Force Cleanup Script for the full script and usage instructions.
./force-cleanup-nvcf.sh --dry-run # Preview
./force-cleanup-nvcf.sh # Execute
Database Issues#
Cassandra Migration Stuck Due to Missing ConfigMap#
Symptoms:
Cassandra pods are running but migration job is stuck.
kubectl get pods -n cassandra-system
...
pod/cassandra-initialize-cluster-qp4ft 0/1 ContainerCreating # Stuck in ContainerCreating state perpetually
Diagnosis:
Check all Cassandra resources including ConfigMaps:
kubectl -n cassandra-system get all,secrets,sa,cm
Expected output should show 3 ConfigMaps:
NAME                              DATA   AGE
configmap/cassandra-init-cql      1      5d19h   # This one may be missing
configmap/cassandra-init-script   1      100s    # This one may be missing
configmap/kube-root-ca.crt        1      8d
If you only see 2 ConfigMaps (one of the `cassandra-init-*` ConfigMaps is missing), this is a race condition during deployment.
Root Cause:
A race condition can occur where the Cassandra migration job starts before all ConfigMaps are created, causing the deployment to hang.
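The diagnosis step can be scripted as a quick check. This is a sketch; the function name is illustrative, and the ConfigMap names are taken from the expected output above:

```shell
# missing_cassandra_configmaps: print any expected Cassandra init ConfigMaps
# that are absent from the cluster (illustrative helper)
missing_cassandra_configmaps() {
  missing=0
  for cm in cassandra-init-cql cassandra-init-script; do
    if ! kubectl get configmap "$cm" -n cassandra-system >/dev/null 2>&1; then
      echo "missing: $cm"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

# Usage: missing_cassandra_configmaps || echo "race condition likely - force a sync"
```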
Solution:
Force a sync to recreate missing resources:
# Use helmfile sync instead of apply to force resource recreation
HELMFILE_ENV="<environment>" \
helmfile --environment default --selector name=cassandra sync
Note
The sync command differs from apply in that it will recreate resources if needed, which resolves the ConfigMap race condition.
Alternative Solution:
If the above doesn’t work, you can safely redeploy Cassandra (it’s a dependency without complex initialization hooks):
# Delete the stuck migration job first
kubectl delete job -n cassandra-system cassandra-migrations
# Then redeploy Cassandra
HELMFILE_ENV=<environment-name> helmfile --selector name=cassandra apply
Getting Help#
When requesting support, provide:
Environment details:
kubectl version
helm version
kubectl get nodes -o wide
Deployment configuration:
Environment file (sanitized)
Secrets file structure (sanitized - no actual secrets!)
Relevant logs:
kubectl logs -n <namespace> <problematic-pod> > pod-logs.txt
Events:
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > events.txt
Resource status:
kubectl get all -n <namespace> -o wide > resources.txt
Appendix: NVCA Force Cleanup Script#
This script forcefully removes all NVCA components from a cluster. Use it when helmfile destroy hangs on NVCA cleanup, typically because functions are still deployed in nvcf-backend.
Warning
This script bypasses normal cleanup procedures by removing finalizers. Always try helmfile destroy first.
#!/bin/bash
# =============================================================================
# force-cleanup-nvcf.sh - NVCA Component Removal Script
# =============================================================================
# This script forcefully removes all NVCA components from a cluster.
# Use this as a LAST RESORT when normal cleanup methods fail due to:
#   - Stuck finalizers on namespaces or custom resources
#   - Orphaned resources blocking deletion
#   - Partial deployments that need complete removal
#
# WARNING: This script will FORCEFULLY remove all NVCA resources, including
# removing finalizers which bypasses normal cleanup procedures.
#
# Usage: ./force-cleanup-nvcf.sh [--dry-run]
# =============================================================================

set -euo pipefail

# --- Configuration ---
# NVCA-related namespaces
NVCA_NAMESPACES=(
  "nvcf-backend"
  "nvca-system"
  "nvca-operator"
)

# CRDs created by NVCA components
NVCA_CRDS=(
  "nvcfbackends.nvcf.nvidia.io"
)

# --- Parse Arguments ---
DRY_RUN=false

while [[ $# -gt 0 ]]; do
  case $1 in
    --dry-run)
      DRY_RUN=true
      shift
      ;;
    -h|--help)
      echo "Usage: $0 [--dry-run]"
      echo ""
      echo "Options:"
      echo "  --dry-run    Show what would be deleted without making changes"
      echo "  -h, --help   Show this help message"
      exit 0
      ;;
    *)
      echo "Unknown option: $1"
      exit 1
      ;;
  esac
done

echo "=============================================="
echo "NVCA Force Cleanup Script"
echo "=============================================="
if $DRY_RUN; then
  echo "MODE: DRY-RUN (no changes will be made)"
fi
echo ""

# --- Step 1: Show and delete function pods in nvcf-backend ---
echo ">>> Step 1: Checking for function pods in nvcf-backend namespace..."
if kubectl get namespace nvcf-backend >/dev/null 2>&1; then
  pods=$(kubectl get pods -n nvcf-backend -o name 2>/dev/null || true)
  if [[ -n "$pods" ]]; then
    echo "  Found the following pods that will be deleted:"
    kubectl get pods -n nvcf-backend -o wide 2>/dev/null || true
    echo ""
    if ! $DRY_RUN; then
      echo "  Deleting all pods in nvcf-backend..."
      kubectl delete pods -n nvcf-backend --all --force --grace-period=0 2>/dev/null || true
    else
      echo "  [DRY-RUN] Would delete all pods in nvcf-backend namespace"
    fi
  else
    echo "  No pods found in nvcf-backend namespace"
  fi
else
  echo "  nvcf-backend namespace not found, skipping..."
fi
echo ""

# --- Step 2: Delete NVCFBackend Custom Resources ---
echo ">>> Step 2: Deleting NVCFBackend custom resources..."
if kubectl get crd nvcfbackends.nvcf.nvidia.io >/dev/null 2>&1; then
  echo "  Found NVCFBackend CRD, deleting all instances..."
  if ! $DRY_RUN; then
    kubectl delete nvcfbackends -A --all --wait=false 2>/dev/null || true
    echo "  Waiting 15 seconds for operator cleanup..."
    sleep 15
  else
    echo "  [DRY-RUN] Would delete all NVCFBackends and wait for cleanup"
  fi
else
  echo "  NVCFBackend CRD not found, skipping..."
fi
echo ""

# --- Step 3: Delete Helm Releases ---
echo ">>> Step 3: Deleting Helm releases in NVCA namespaces..."
for ns in "${NVCA_NAMESPACES[@]}"; do
  if kubectl get namespace "$ns" >/dev/null 2>&1; then
    releases=$(helm list -n "$ns" -q 2>/dev/null || true)
    if [[ -n "$releases" ]]; then
      for release in $releases; do
        echo "  Deleting Helm release: $release (namespace: $ns)"
        if ! $DRY_RUN; then
          helm delete -n "$ns" "$release" --wait=false 2>/dev/null || true
        fi
      done
    fi
  fi
done
echo ""

# --- Step 4: Force-delete stuck NVCFBackend resources (remove finalizers) ---
echo ">>> Step 4: Removing finalizers from stuck NVCFBackend resources..."
if kubectl get crd nvcfbackends.nvcf.nvidia.io >/dev/null 2>&1; then
  nvcfbackends=$(kubectl get nvcfbackends -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' 2>/dev/null || true)
  if [[ -n "$nvcfbackends" ]]; then
    while IFS= read -r backend; do
      if [[ -n "$backend" ]]; then
        ns=$(echo "$backend" | cut -d'/' -f1)
        name=$(echo "$backend" | cut -d'/' -f2)
        echo "  Removing finalizers from NVCFBackend: $name (namespace: $ns)"
        if ! $DRY_RUN; then
          kubectl patch nvcfbackend "$name" -n "$ns" -p '{"metadata":{"finalizers":[]}}' --type=merge 2>/dev/null || true
          kubectl delete nvcfbackend "$name" -n "$ns" --wait=false 2>/dev/null || true
        fi
      fi
    done <<< "$nvcfbackends"
  else
    echo "  No stuck NVCFBackend resources found"
  fi
else
  echo "  NVCFBackend CRD not found, skipping..."
fi
echo ""

# --- Step 5: Delete Namespaces ---
echo ">>> Step 5: Deleting NVCA namespaces..."
for ns in "${NVCA_NAMESPACES[@]}"; do
  if kubectl get namespace "$ns" >/dev/null 2>&1; then
    echo "  Deleting namespace: $ns"
    if ! $DRY_RUN; then
      kubectl delete namespace "$ns" --wait=false 2>/dev/null || true
    fi
  fi
done
echo "  Waiting 10 seconds for namespace deletion..."
if ! $DRY_RUN; then
  sleep 10
fi
echo ""

# --- Step 6: Force-remove finalizers from stuck namespaces ---
echo ">>> Step 6: Removing finalizers from stuck namespaces..."
for ns in "${NVCA_NAMESPACES[@]}"; do
  phase=$(kubectl get namespace "$ns" -o jsonpath='{.status.phase}' 2>/dev/null || true)
  if [[ "$phase" == "Terminating" ]]; then
    echo "  Namespace $ns is stuck in Terminating, removing finalizers..."
    if ! $DRY_RUN; then
      # First, try to remove finalizers from all resources in the namespace
      for resource_type in deployments statefulsets daemonsets replicasets pods services configmaps secrets serviceaccounts roles rolebindings; do
        kubectl get "$resource_type" -n "$ns" -o name 2>/dev/null | while read -r resource; do
          kubectl patch "$resource" -n "$ns" -p '{"metadata":{"finalizers":[]}}' --type=merge 2>/dev/null || true
        done
      done

      # Remove namespace finalizers using the API
      kubectl get namespace "$ns" -o json | \
        jq '.spec.finalizers = []' | \
        kubectl replace --raw "/api/v1/namespaces/$ns/finalize" -f - 2>/dev/null || true
    fi
  fi
done
echo ""

# --- Step 7: Delete CRDs ---
echo ">>> Step 7: Deleting NVCA CRDs..."
for crd in "${NVCA_CRDS[@]}"; do
  if kubectl get crd "$crd" >/dev/null 2>&1; then
    echo "  Deleting CRD: $crd"
    if ! $DRY_RUN; then
      kubectl delete crd "$crd" --wait=false 2>/dev/null || true
    fi
  fi
done
echo ""

# --- Step 8: Verification ---
echo ">>> Step 8: Verification..."
echo ""
echo "Remaining NVCA namespaces:"
remaining_ns=0
for ns in "${NVCA_NAMESPACES[@]}"; do
  if kubectl get namespace "$ns" >/dev/null 2>&1; then
    phase=$(kubectl get namespace "$ns" -o jsonpath='{.status.phase}' 2>/dev/null || echo "Unknown")
    echo "  - $ns (status: $phase)"
    remaining_ns=$((remaining_ns + 1))
  fi
done
if [[ $remaining_ns -eq 0 ]]; then
  echo "  None - all namespaces removed successfully"
fi

echo ""
echo "Remaining NVCA CRDs:"
remaining_crds=0
for crd in "${NVCA_CRDS[@]}"; do
  if kubectl get crd "$crd" >/dev/null 2>&1; then
    echo "  - $crd"
    remaining_crds=$((remaining_crds + 1))
  fi
done
if [[ $remaining_crds -eq 0 ]]; then
  echo "  None - all CRDs removed successfully"
fi

echo ""
echo "=============================================="
if $DRY_RUN; then
  echo "DRY-RUN complete. No changes were made."
else
  if [[ $remaining_ns -eq 0 ]] && [[ $remaining_crds -eq 0 ]]; then
    echo "Cleanup complete! All NVCA resources have been removed."
  else
    echo "Cleanup finished with some resources remaining."
    echo "You may need to run this script again or investigate manually."
  fi
fi
echo "=============================================="
Usage:
1. Download or copy the script to your working directory
2. Make it executable: `chmod +x force-cleanup-nvcf.sh`
3. Preview what will be deleted: `./force-cleanup-nvcf.sh --dry-run`
4. Run the cleanup: `./force-cleanup-nvcf.sh`
What the script does:
1. Lists and force-deletes all function pods in the `nvcf-backend` namespace
2. Deletes all NVCFBackend custom resources
3. Deletes Helm releases in NVCA namespaces
4. Removes finalizers from stuck NVCFBackend resources
5. Deletes the NVCA namespaces (`nvcf-backend`, `nvca-system`, `nvca-operator`)
6. Removes finalizers from namespaces stuck in Terminating state
7. Deletes the NVCFBackend CRD
8. Verifies cleanup completion