Troubleshooting / FAQ
This appendix contains common issues, tips, and clarifications learned from deploying self-hosted NVCF.
Symptom:
helmfile sync or helmfile apply appears to succeed but the deployment doesn’t work properly
Root Cause:
The base64-encoded Docker credential in secrets.yaml was incorrectly formatted. A common mistake is encoding only the NGC API key instead of the full basic auth credential in the format $oauthtoken:API_KEY.
Incorrect (will fail):
Correct:
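The difference between the two forms can be sketched as follows. This is a minimal illustration, assuming the standard base64 utility and a placeholder key YOUR_API_KEY; the variable names are hypothetical and not part of NVCF.

```shell
# Placeholder for illustration only; substitute your real NGC API key.
API_KEY='YOUR_API_KEY'

# Incorrect: only the API key is encoded.
incorrect=$(printf '%s' "$API_KEY" | base64 | tr -d '\n')

# Correct: the full basic auth credential is encoded. The single quotes stop
# the shell from expanding $oauthtoken as a variable.
correct=$(printf '%s:%s' '$oauthtoken' "$API_KEY" | base64 | tr -d '\n')

echo "incorrect: $incorrect"
echo "correct:   $correct"
```

printf is used instead of echo so that no trailing newline ends up inside the encoded value.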
How to Diagnose:
Check the migration job logs specifically:
If you don’t see detailed errors, add debug output to migration scripts:
Fix your secrets.yaml with the correct base64 credential, then follow the clean-install-procedure.
How to Prevent:
Always use the correct format: Encode $oauthtoken:YOUR_API_KEY, not just the API key
Verify before deploying: Decode your base64 string to verify it’s correct:
Test NGC authentication: Before deploying, test that your credential works:
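The last two checks could be scripted roughly as below. This is a sketch under stated assumptions: both function names are hypothetical, Docker is assumed to be installed for the login test, and NGC_API_KEY is an assumed environment variable holding the raw (not encoded) API key.

```shell
# Hypothetical helper: succeeds only if the base64 string decodes to
# "$oauthtoken:" followed by a non-empty key.
check_ngc_credential() {
  decoded=$(printf '%s' "$1" | base64 -d 2>/dev/null) || return 1
  case "$decoded" in
    '$oauthtoken:'?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: attempts a registry login against nvcr.io.
# Assumes Docker is installed and NGC_API_KEY is set in the environment.
test_ngc_login() {
  printf '%s' "$NGC_API_KEY" |
    docker login nvcr.io --username '$oauthtoken' --password-stdin
}
```

For example, check_ngc_credential "<base64-string-from-secrets.yaml>" && echo "format OK" confirms the format without printing the decoded key.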
Use these commands to diagnose deployment problems. For phase-by-phase monitoring during installation, see the Deployment Progression section in helmfile-installation.
Find Stuck Deployments:
Resource Check:
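The original commands for these two checks are not reproduced here, so the following is only a sketch of one way to run them with standard helm and kubectl subcommands. The function names are hypothetical, and kubectl top assumes metrics-server is installed in the cluster.

```shell
# Hypothetical helper: lists releases and pods that are not healthy.
find_stuck_deployments() {
  # Helm releases left in a failed or pending state, across all namespaces.
  helm list --all-namespaces --failed
  helm list --all-namespaces --pending

  # Pods whose phase is neither Running nor Succeeded.
  kubectl get pods --all-namespaces \
    --field-selector=status.phase!=Running,status.phase!=Succeeded
}

# Hypothetical helper: shows allocated resources and current node usage.
check_resources() {
  kubectl describe nodes | grep -A 8 'Allocated resources'
  kubectl top nodes
}
```

Pods in CrashLoopBackOff can still report the Running phase, so also scan the STATUS column of kubectl get pods --all-namespaces.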
Symptom:
helmfile sync hangs or fails during the services phase
BackoffLimitExceeded for nvcf-api-account-bootstrap
CrashLoopBackOff or Error status
Diagnosis:
Watch events in real-time (run this as soon as helmfile reaches services phase):
Check the bootstrap job logs:
Check the NVCF API logs for detailed error messages:
The bootstrap job auto-deletes after ~5 minutes (ttlSecondsAfterFinished: 300). Monitor events to catch failures in real-time.
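The three diagnosis steps above can be sketched as small helpers. The namespace and job name come from this guide; the function names are hypothetical, and the API workload name is not given in this section, so it is passed in as an argument.

```shell
# Namespace and job name as they appear in this guide.
NVCF_NAMESPACE=nvcf
BOOTSTRAP_JOB=nvcf-api-account-bootstrap

# Stream events as they happen; start this when helmfile reaches the
# services phase.
watch_nvcf_events() {
  kubectl get events -n "$NVCF_NAMESPACE" --watch
}

# Print the bootstrap job logs; run this before the job's TTL expires.
bootstrap_logs() {
  kubectl logs "job/$BOOTSTRAP_JOB" -n "$NVCF_NAMESPACE"
}

# Print recent NVCF API logs. $1 is the deployment name, which is an
# assumption here: look it up with `kubectl get deployments -n nvcf`.
api_logs() {
  kubectl logs "deployment/$1" -n "$NVCF_NAMESPACE" --tail=200
}
```

Running watch_nvcf_events in a second terminal while helmfile runs makes it less likely that a failure is missed before the job is deleted.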
Enable debug logging for the bootstrap job. The account bootstrap script supports
a DEBUG environment variable that enables verbose output. To enable it before
redeploying, patch the bootstrap secret:
Then follow the “Recovering from Services Failures” steps in helmfile-installation
to redeploy. The next bootstrap job run will include detailed debug logs visible via
kubectl logs job/nvcf-api-account-bootstrap -n nvcf.
To disable debug logging afterward:
Common Causes:
nvcr.io but credentials are for ECR)
$oauthtoken:API_KEY
Solution:
Fix your secrets/<environment-name>-secrets.yaml file, then follow the “Recovering from Services Failures” steps in helmfile-installation to preserve your dependencies.
Symptoms: Pods cannot pull container images
Solutions:
Verify registry credentials:
Verify images exist in your registry:
Check network connectivity from cluster to registry
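As a rough sketch of the first checks, the helpers below inspect an image pull secret and the pull errors recorded for a pod. The function names are hypothetical, and the secret, pod, and namespace names depend on your deployment, so they are passed as arguments.

```shell
# Hypothetical helper: prints the decoded Docker config of an image pull
# secret. $1 = secret name, $2 = namespace.
# The output contains credentials; do not paste it into tickets or logs.
show_pull_secret() {
  kubectl get secret "$1" -n "$2" \
    -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
}

# Hypothetical helper: shows image pull failures recorded for a pod.
# $1 = pod name, $2 = namespace.
show_pull_errors() {
  kubectl describe pod "$1" -n "$2" |
    grep -E 'Failed|ErrImagePull|ImagePullBackOff'
}
```

Compare the registry host in the decoded config with the registry in the pod's image reference; a mismatch explains most pull failures.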
Symptoms: Pods remain in Pending state
Solutions:
Check cluster resources:
Verify storage class exists:
Check node selectors:
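The three checks above could be combined into one helper, sketched below with standard kubectl subcommands. The function name is hypothetical; the pod and namespace are passed as arguments.

```shell
# Hypothetical helper: gathers the usual reasons a pod stays Pending.
# $1 = pod name, $2 = namespace.
why_pending() {
  # Scheduler messages appear in the Events section at the end of the output.
  kubectl describe pod "$1" -n "$2" | sed -n '/^Events:/,$p'

  # Storage classes available to PersistentVolumeClaims.
  kubectl get storageclass

  # Node selector requested by the pod, then the labels present on nodes.
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.nodeSelector}'
  echo
  kubectl get nodes --show-labels
}
```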
Symptom:
helmfile sync fails partway through
helmfile sync or helmfile apply appears to succeed but things don’t work
Root Cause:
When a Helm installation fails, the release remains in a failed state. Subsequent commands run helm upgrade instead of helm install, which skips initialization hooks (migrations, account bootstrap, etc.).
Solution:
Fix the underlying issue (credentials, config, etc.), then follow the appropriate recovery procedure in helmfile-installation:
Symptom:
NVCA Operator installation fails with CRD not found error:
Root Cause:
A race condition occurs where Helm validates CRD references before the CRD is created by the operator’s installation hooks. This can happen during first install or when reinstalling after the CRD was deleted.
Solution:
Two changes are required in helmfile.d/03-worker.yaml.gotmpl:
helmDefaults.diffArgs section. This prevents server-side validation during the diff phase, which fails when the CRD doesn’t exist:
Then run ./force-cleanup-nvcf.sh followed by HELMFILE_ENV=<environment> helmfile sync.
NVCF stores most service credentials, signing keys, and internal passwords in
OpenBao (a Vault-compatible secrets manager) running in the
vault-system namespace. Use the bao CLI inside the OpenBao pod to inspect or
manage these secrets.
For the full bao CLI reference, see the
OpenBao CLI documentation.
The KV secrets engine commands are documented at
OpenBao KV commands.
The OpenBao root token is stored in a Kubernetes secret created during initialization:
The root token grants unrestricted access to all secrets in OpenBao. Treat it as a highly sensitive credential and avoid storing it in shell history or logs.
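Reading the token into a shell variable keeps it out of terminal output and shell history. The helper below is a sketch only: the function name is hypothetical, and the secret name and key are not given in this section, so both are arguments to look up with kubectl get secrets -n vault-system.

```shell
# Hypothetical helper: prints the root token stored in a Kubernetes secret.
# $1 = secret name, $2 = key inside the secret (both are assumptions to be
# looked up in your cluster).
openbao_root_token() {
  kubectl get secret "$1" -n vault-system \
    -o jsonpath="{.data.$2}" | base64 -d
}

# Example (placeholders must be replaced):
#   BAO_TOKEN=$(openbao_root_token <secret-name> <key>)
```

Keys that contain dots need escaping inside the JSONPath expression.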
To see all mounted secrets engines (each NVCF service has its own path):
Example output (abbreviated):
Browse secrets under a specific engine path:
Use bao kv get -format=json <path> for machine-readable output, or
bao kv get -field=<key> <path> to extract a single field.
bao Commands
You can run any bao subcommand by exec-ing into the pod with the root token:
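A minimal wrapper for this pattern might look like the sketch below. It assumes BAO_POD holds the OpenBao pod name and BAO_TOKEN holds the root token (both variable names are choices made for this example), that the env utility exists in the container image, and that the bao CLI reads the token from the BAO_TOKEN environment variable.

```shell
# Hypothetical wrapper: runs any bao subcommand inside the OpenBao pod.
# Assumes BAO_POD and BAO_TOKEN are already set in the calling shell.
bao_exec() {
  kubectl exec -n vault-system "$BAO_POD" -- \
    env BAO_TOKEN="$BAO_TOKEN" bao "$@"
}

# Examples (paths are placeholders):
#   bao_exec secrets list
#   bao_exec kv list <path>
#   bao_exec kv get -format=json <path>
```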
To get more detailed logs from specific components:
For Migration Jobs:
Example for the API service:
Kubernetes events often contain valuable debugging information:
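The helpers below sketch how logs and events might be collected with standard kubectl flags. The function names are hypothetical, and job, pod, and namespace names are passed as arguments because they vary by component.

```shell
# Hypothetical helper: logs from a job's pod, including all containers.
# $1 = job name, $2 = namespace.
job_logs() {
  kubectl logs "job/$1" -n "$2" --all-containers=true
}

# Hypothetical helper: logs from the previous (crashed) container instance.
# $1 = pod name, $2 = namespace.
previous_logs() {
  kubectl logs "$1" -n "$2" --previous
}

# Hypothetical helper: warning events in a namespace, most recent last.
# $1 = namespace.
warning_events() {
  kubectl get events -n "$1" --field-selector type=Warning \
    --sort-by=.lastTimestamp
}
```

previous_logs is useful for pods in CrashLoopBackOff, where the current container may have produced no output yet.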
For detailed recovery steps, see the Recovering from Partial Deployments section in helmfile-installation. This section provides a quick reference for common scenarios.
Do not attempt to fix failed services by re-running helmfile sync or helmfile apply. Helm will skip initialization hooks (migrations, account bootstrap) on upgrade, resulting in a deployment that appears successful but doesn’t function correctly.
Dependency services (Cassandra, NATS, OpenBao) can be safely redeployed without affecting other components:
If only NVCA needs reinstalling (and NVCF services are working):
If NVCF services are also broken, follow the “Recovering from Services Failures” steps in helmfile-installation.
If helmfile destroy hangs on NVCA cleanup (typically when functions are still deployed in nvcf-backend), use the force cleanup script in a new terminal. See force-cleanup-script for the full script and usage instructions.
Symptoms:
Cassandra pods are running but migration job is stuck.
Diagnosis:
Check all Cassandra resources including ConfigMaps:
Expected output should show 3 ConfigMaps:
If you only see 2 ConfigMaps (missing cassandra-migrations), this is a race condition during deployment.
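The check for the missing ConfigMap can be scripted as in this sketch. The function name is hypothetical, and the Cassandra namespace is not stated in this section, so it is passed as an argument.

```shell
# Hypothetical helper: succeeds if the cassandra-migrations ConfigMap exists.
# $1 = namespace where Cassandra is deployed.
has_cassandra_migrations() {
  kubectl get configmap cassandra-migrations -n "$1" >/dev/null 2>&1
}

# Example (namespace is a placeholder):
#   has_cassandra_migrations <cassandra-namespace> ||
#     echo "cassandra-migrations is missing"
```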
Root Cause:
A race condition can occur where the Cassandra migration job starts before all ConfigMaps are created, causing the deployment to hang.
Solution:
Force a sync to recreate missing resources:
Unlike apply, which only acts on releases whose diff shows changes, sync runs the upgrade for every release, so missing resources are recreated. This resolves the ConfigMap race condition.
Alternative Solution:
If the above doesn’t work, you can safely redeploy Cassandra (it’s a dependency without complex initialization hooks):
Symptom:
NVST_R_GENERIC_ERROR
Root Cause:
UDP traffic on the Kubernetes NodePort range (30000-32767) is blocked by a cloud-provider network security rule. The function health checks pass over TCP, so the function appears healthy, but the UDP media path is unreachable.
On Azure (AKS), AKS attaches a second NSG to node NICs in the managed
resource group (MC_<resource-group>_<cluster>_<region>). Even if the subnet NSG
allows UDP, the NIC NSG blocks it by default.
Diagnosis:
Confirm the function is ACTIVE and pods are running:
On Azure, check whether the NIC NSG has a UDP allow rule:
If any NSG is missing a UDP rule, add it to all of them:
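On Azure, the NSG check and fix might be scripted with the Azure CLI as sketched below. The function names, the rule name AllowNodePortUdp, and the priority 1000 are placeholders chosen for this example; pick a priority that does not collide with existing rules.

```shell
# Hypothetical helper: lists UDP-capable rules on every NSG in the AKS
# managed resource group.
# $1 = resource group, in the form MC_<resource-group>_<cluster>_<region>.
list_udp_rules() {
  for nsg in $(az network nsg list --resource-group "$1" \
                 --query '[].name' -o tsv); do
    echo "== $nsg"
    az network nsg rule list --resource-group "$1" --nsg-name "$nsg" \
      --query "[?protocol=='Udp' || protocol=='*']" -o table
  done
}

# Hypothetical helper: allows inbound UDP on the NodePort range for one NSG.
# $1 = resource group, $2 = NSG name.
allow_nodeport_udp() {
  az network nsg rule create --resource-group "$1" --nsg-name "$2" \
    --name AllowNodePortUdp --priority 1000 \
    --direction Inbound --access Allow --protocol Udp \
    --destination-port-ranges 30000-32767
}
```

The rule as written does not restrict the source; consider adding --source-address-prefixes to limit it to the client networks you expect.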
See Cloud Provider Network Requirements for the full CSP networking checklist.
When requesting support, provide:
Environment details:
Deployment configuration:
Relevant logs:
Events:
Resource status:
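The items above could be gathered in one pass with a script along these lines. This is a sketch, not an official collection tool: the function name and output file names are hypothetical, and the namespace defaults to nvcf as used elsewhere in this guide.

```shell
# Hypothetical helper: writes a support bundle to a directory.
# $1 = output directory, $2 = namespace to collect from (default: nvcf).
collect_support_bundle() {
  out=$1
  ns=${2:-nvcf}
  mkdir -p "$out"

  # Environment details.
  kubectl version > "$out/kubectl-version.txt" 2>&1
  helm version > "$out/helm-version.txt" 2>&1
  kubectl get nodes -o wide > "$out/nodes.txt" 2>&1

  # Deployment configuration: release names, charts, and states.
  helm list --all-namespaces --all > "$out/helm-releases.txt" 2>&1

  # Events and resource status.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$out/events.txt" 2>&1
  kubectl get all -n "$ns" -o wide > "$out/resources.txt" 2>&1

  # Relevant logs: one file per pod, all containers.
  for pod in $(kubectl get pods -n "$ns" -o name); do
    kubectl logs "$pod" -n "$ns" --all-containers=true --tail=500 \
      > "$out/$(basename "$pod").log" 2>&1
  done
}
```

Review the collected files for credentials or tokens before sharing them.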
This script forcefully removes all NVCA components from a cluster. Use it when helmfile destroy hangs on NVCA cleanup, typically because functions are still deployed in nvcf-backend.
This script bypasses normal cleanup procedures by removing finalizers. Always try helmfile destroy first.
Usage:
Download or copy the script to your working directory
Make executable: chmod +x force-cleanup-nvcf.sh
Preview what will be deleted:
Run the cleanup:
What the script does:
nvcf-backend namespace
nvcf-backend, nvca-system, nvca-operator)