Troubleshooting / FAQ
This appendix contains common issues, tips, and clarifications learned from deploying self-hosted NVCF.
Configuration Issues and Best Practices
Incorrect Base64 Docker Credentials Format
Symptom:
- API migration job fails during installation
- Error appears as a “timeout” in the logs
- Re-running `helmfile sync` or `helmfile apply` appears to succeed, but the deployment doesn't work properly
- Functions fail to deploy or pull images
Root Cause:
The base64-encoded Docker credential in `secrets.yaml` was incorrectly formatted. A common mistake is encoding only the NGC API key instead of the full basic auth credential in the format `$oauthtoken:API_KEY`.
Incorrect (will fail):
Correct:
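For illustration, with a placeholder API key (the key shown is not real), the two encodings differ as follows:

```shell
# Incorrect: encodes only the API key
echo -n 'nvapi-EXAMPLE_KEY' | base64

# Correct: encodes the full basic-auth credential, including the
# literal "$oauthtoken" username (single quotes prevent shell expansion)
echo -n '$oauthtoken:nvapi-EXAMPLE_KEY' | base64
```

Note the `-n` flag: a trailing newline inside the encoded value is another common cause of authentication failures.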
How to Diagnose:
- Check the migration job logs specifically:
- If you don't see detailed errors, add debug output to the migration scripts:
- Fix your `secrets.yaml` with the correct base64 credential, then follow the clean-install-procedure.
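A sketch of the diagnosis steps above, assuming the default `nvcf` namespace and a migration job named `nvcf-api-migration` (the job name may differ in your deployment):

```shell
# List jobs to find the exact migration job name
kubectl get jobs -n nvcf

# Check the migration job logs (job name is an assumption)
kubectl logs job/nvcf-api-migration -n nvcf

# Describe the job to see failure events
kubectl describe job/nvcf-api-migration -n nvcf
```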
How to Prevent:
- Always use the correct format: encode `$oauthtoken:YOUR_API_KEY`, not just the API key
- Verify before deploying: decode your base64 string to confirm it's correct:
- Test NGC authentication: before deploying, test that your credential works:
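These checks can be scripted. The decode step is exact; the `docker login` line assumes Docker is available and that you are using NGC (`nvcr.io`):

```shell
# Verify the encoded credential round-trips to the expected format
ENCODED=$(echo -n '$oauthtoken:YOUR_API_KEY' | base64)
echo "$ENCODED" | base64 -d
# Should print: $oauthtoken:YOUR_API_KEY

# Test NGC authentication before deploying (uncomment to run):
# echo 'YOUR_API_KEY' | docker login nvcr.io --username '$oauthtoken' --password-stdin
```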
Identifying Deployment Issues
Use these commands to diagnose deployment problems. For phase-by-phase monitoring during installation, see the Deployment Progression section in helmfile-installation.
Find Stuck Deployments:
Resource Check:
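For example, the following commands surface stuck workloads and resource pressure (the `kubectl top` line requires metrics-server):

```shell
# Find pods that are not Running or Completed, in any namespace
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Deployments whose desired and ready replica counts disagree
kubectl get deployments -A

# Resource check: node utilization and allocations
kubectl top nodes
kubectl describe nodes | grep -A 8 'Allocated resources'
```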
Installation & Deployment Issues
Account Bootstrap Job Failures
Symptom:
- `helmfile sync` hangs or fails during the services phase
- Events show `BackoffLimitExceeded` for `nvcf-api-account-bootstrap`
- Bootstrap pod shows `CrashLoopBackOff` or `Error` status
Diagnosis:
- Watch events in real time (run this as soon as helmfile reaches the services phase):
- Check the bootstrap job logs:
- Check the NVCF API logs for detailed error messages:
The bootstrap job auto-deletes after ~5 minutes (`ttlSecondsAfterFinished: 300`). Monitor events to catch failures in real time.
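Concretely, the three diagnosis steps might look like this (the `nvcf-api` deployment name is an assumption):

```shell
# 1. Watch events in real time during the services phase
kubectl get events -n nvcf --watch

# 2. Bootstrap job logs (the job auto-deletes ~5 minutes after finishing)
kubectl logs job/nvcf-api-account-bootstrap -n nvcf

# 3. NVCF API logs (deployment name is an assumption)
kubectl logs deployment/nvcf-api -n nvcf --tail=200
```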
Enable debug logging for the bootstrap job. The account bootstrap script supports
a DEBUG environment variable that enables verbose output. To enable it before
redeploying, patch the bootstrap secret:
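Assuming the bootstrap configuration lives in a secret named `nvcf-api-account-bootstrap` in the `nvcf` namespace (both names are assumptions; check your cluster), the patch could look like:

```shell
# Set DEBUG=true in the bootstrap secret (secret name is an assumption)
kubectl patch secret nvcf-api-account-bootstrap -n nvcf \
  --type merge -p '{"stringData":{"DEBUG":"true"}}'
```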
Then follow the “Recovering from Services Failures” steps in helmfile-installation
to redeploy. The next bootstrap job run will include detailed debug logs visible via
kubectl logs job/nvcf-api-account-bootstrap -n nvcf.
To disable debug logging afterward:
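To turn it off, patch the bootstrap secret back (secret name is an assumption):

```shell
# Set DEBUG=false on the bootstrap secret (secret name is an assumption)
kubectl patch secret nvcf-api-account-bootstrap -n nvcf \
  --type merge -p '{"stringData":{"DEBUG":"false"}}'
```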
Common Causes:
- Invalid registry credentials format - see [Incorrect Base64 Docker Credentials Format]
- Wrong registry hostname - the hostname in secrets doesn't match the actual registry (e.g., using `nvcr.io` but the credentials are for ECR)
- Missing `$oauthtoken` prefix - NGC credentials must be in the format `$oauthtoken:API_KEY`
Solution:
Fix your `secrets/<environment-name>-secrets.yaml` file, then follow the "Recovering from Services Failures" steps in helmfile-installation to preserve your dependencies.
Pods Stuck in ImagePullBackOff
Symptoms: Pods cannot pull container images
Solutions:
- Verify registry credentials:
- Verify that the images exist in your registry:
- Check network connectivity from the cluster to the registry
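A sketch of these checks, assuming the `nvcf` namespace and a pull secret named `nvcf-registry-secret` (your names may differ):

```shell
# 1. Decode the image pull secret and inspect the registry hostname and auth
kubectl get secret nvcf-registry-secret -n nvcf \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

# 2. Check the exact image reference and pull error on an affected pod
kubectl describe pod <pod-name> -n nvcf | grep -A 5 'Events'

# 3. Test registry reachability from inside the cluster
kubectl run net-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sI https://nvcr.io/v2/
```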
Pods Stuck in Pending
Symptoms: Pods remain in Pending state
Solutions:
- Check cluster resources:
- Verify the storage class exists:
- Check node selectors:
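For example (pod and namespace names are placeholders):

```shell
# 1. Cluster resources: look for CPU/memory/GPU exhaustion
kubectl describe nodes | grep -A 8 'Allocated resources'

# 2. The storage class referenced by pending PVCs must exist
kubectl get storageclass
kubectl get pvc -n nvcf

# 3. Scheduling constraints: the Events section explains why a pod is Pending
kubectl describe pod <pod-name> -n nvcf
```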
Helm Release in Failed State After First Install
Symptom:
- First `helmfile sync` fails partway through
- Re-running `helmfile sync` or `helmfile apply` appears to succeed but things don't work
- Migrations or initialization jobs weren't executed
Root Cause:
When a Helm installation fails, the release remains in a failed state. Subsequent commands run `helm upgrade` instead of `helm install`, which skips initialization hooks (migrations, account bootstrap, etc.).
Solution:
Fix the underlying issue (credentials, config, etc.), then follow the appropriate recovery procedure in helmfile-installation:
- If only services failed (dependencies are healthy): Use the “Recovering from Services Failures” steps to preserve your dependencies
- If dependencies are also broken: Follow the “Uninstalling” section in helmfile-installation
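To confirm which releases are in a failed state before choosing a procedure (namespace is an assumption):

```shell
# Show all releases, including failed ones
helm list -n nvcf --all

# Inspect a specific release's status
helm status <release-name> -n nvcf
```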
NVCA Operator Fails: nvcfbackends CRD Not Found
Symptom:
NVCA Operator installation fails with CRD not found error:
Root Cause:
A race condition occurs where Helm validates CRD references before the CRD is created by the operator’s installation hooks. This can happen during first install or when reinstalling after the CRD was deleted.
Solution:
Two changes are required in helmfile.d/03-worker.yaml.gotmpl:
- Add `disableValidation: true` to the nvca-operator release to disable OpenAPI validation:
- Remove `--dry-run=server` from the `helmDefaults.diffArgs` section. This prevents server-side validation during the diff phase, which fails when the CRD doesn't exist:
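A sketch of the two changes in `helmfile.d/03-worker.yaml.gotmpl` (surrounding fields are abbreviated, not the full file):

```yaml
helmDefaults:
  # Previously contained --dry-run=server; remove that argument
  diffArgs: []

releases:
  - name: nvca-operator
    # Skip OpenAPI validation so the missing CRD doesn't fail the install
    disableValidation: true
```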
Then run `./force-cleanup-nvcf.sh` followed by `HELMFILE_ENV=<environment> helmfile sync`.
Accessing OpenBao Secrets (CLI)
NVCF stores most service credentials, signing keys, and internal passwords in
OpenBao (a Vault-compatible secrets manager) running in the `vault-system`
namespace. Use the `bao` CLI inside the OpenBao pod to inspect or
manage these secrets.
For the full bao CLI reference, see the
OpenBao CLI documentation.
The KV secrets engine commands are documented at
OpenBao KV commands.
Retrieve the Root Token
The OpenBao root token is stored in a Kubernetes secret created during initialization:
The root token grants unrestricted access to all secrets in OpenBao. Treat it as a highly sensitive credential and avoid storing it in shell history or logs.
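Assuming the initialization stored the token in a secret named `openbao-init` with key `root-token` (both names are assumptions; list the secrets to find the real ones):

```shell
# Find the secret created during OpenBao initialization
kubectl get secrets -n vault-system

# Extract the root token (secret and key names are assumptions)
kubectl get secret openbao-init -n vault-system \
  -o jsonpath='{.data.root-token}' | base64 -d
```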
List Secrets Engines
To see all mounted secrets engines (each NVCF service has its own path):
Example output (abbreviated):
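Assuming the OpenBao pod is `openbao-0` (check with `kubectl get pods -n vault-system`):

```shell
# List mounted secrets engines using the root token retrieved earlier
kubectl exec -n vault-system openbao-0 -- \
  sh -c 'BAO_TOKEN=<root-token> bao secrets list'
```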
List and Read Secrets
Browse secrets under a specific engine path:
Use `bao kv get -format=json <path>` for machine-readable output, or
`bao kv get -field=<key> <path>` to extract a single field.
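For example, from inside the OpenBao pod (engine and secret paths are illustrative):

```shell
# List secrets under an engine path
bao kv list nvcf-api/

# Read a secret, then extract a single field or get JSON output
bao kv get nvcf-api/credentials
bao kv get -field=password nvcf-api/credentials
bao kv get -format=json nvcf-api/credentials
```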
Run Arbitrary bao Commands
You can run any bao subcommand by exec-ing into the pod with the root token:
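The general pattern (pod name is an assumption):

```shell
kubectl exec -it -n vault-system openbao-0 -- sh -c \
  'BAO_TOKEN=<root-token> bao <subcommand> [args...]'
```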
Debugging Techniques
Enabling Verbose Logging
To get more detailed logs from specific components:
For Migration Jobs:
Example for the API service:
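A sketch, assuming the components read a log-level environment variable (the variable and deployment names are assumptions; check the chart values for the real knobs):

```shell
# Migration jobs are immutable once created: set the variable in the chart
# values or secret that the job template reads, then redeploy.

# API service: bump verbosity on the running deployment
kubectl set env deployment/nvcf-api -n nvcf LOG_LEVEL=debug

# Revert afterward (trailing "-" removes the variable)
kubectl set env deployment/nvcf-api -n nvcf LOG_LEVEL-
```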
Inspecting Failed Pods
Checking Events
Kubernetes events often contain valuable debugging information:
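Typical commands for inspecting failed pods and events (namespace assumed):

```shell
# All recent events, oldest first
kubectl get events -n nvcf --sort-by='.lastTimestamp'

# Only warnings
kubectl get events -n nvcf --field-selector type=Warning

# Inspect a failed pod, including logs from the previous crashed container
kubectl describe pod <pod-name> -n nvcf
kubectl logs <pod-name> -n nvcf --previous
```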
Recovery Procedures
For detailed recovery steps, see the Recovering from Partial Deployments section in helmfile-installation. This section provides quick reference for common scenarios.
Choosing a Recovery Strategy
Do not attempt to fix failed services by re-running helmfile sync or helmfile apply. Helm will skip initialization hooks (migrations, account bootstrap) on upgrade, resulting in a deployment that appears successful but doesn’t function correctly.
Redeploy Stuck Dependencies
Dependency services (Cassandra, NATS, OpenBao) can be safely redeployed without affecting other components:
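One way to do this with helmfile release selectors (the release names are assumptions about the helmfile layout):

```shell
# Destroy and re-sync a single dependency release, e.g. Cassandra
HELMFILE_ENV=<environment> helmfile -l name=cassandra destroy
HELMFILE_ENV=<environment> helmfile -l name=cassandra sync
```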
Reinstalling NVCA Operator Only
If only NVCA needs reinstalling (and NVCF services are working):
If NVCF services are also broken, follow the “Recovering from Services Failures” steps in helmfile-installation.
NVCA Force Cleanup Script
If helmfile destroy hangs on NVCA cleanup (typically when functions are still deployed in nvcf-backend), use the force cleanup script in a new terminal. See force-cleanup-script for the full script and usage instructions.
Database Issues
Cassandra Migration Stuck Due to Missing ConfigMap
Symptoms:
Cassandra pods are running, but the migration job is stuck.
Diagnosis:
Check all Cassandra resources including ConfigMaps:
Expected output should show 3 ConfigMaps:
If you only see 2 ConfigMaps (missing `cassandra-migrations`), this is a race condition during deployment.
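For example (namespace and label are assumptions):

```shell
# List all Cassandra resources, including ConfigMaps
kubectl get all,configmaps -n nvcf -l app=cassandra

# Or list ConfigMaps directly; expect 3, including cassandra-migrations
kubectl get configmaps -n nvcf | grep cassandra
```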
Root Cause:
A race condition can occur where the Cassandra migration job starts before all ConfigMaps are created, causing the deployment to hang.
Solution:
Force a sync to recreate missing resources:
The `sync` command differs from `apply` in that it will recreate resources if needed, which resolves the ConfigMap race condition.
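Using your environment name:

```shell
# Re-run sync to recreate the missing ConfigMap
HELMFILE_ENV=<environment> helmfile sync
```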
Alternative Solution:
If the above doesn’t work, you can safely redeploy Cassandra (it’s a dependency without complex initialization hooks):
Getting Help
When requesting support, provide:
- Environment details:
- Deployment configuration:
  - Environment file (sanitized)
  - Secrets file structure (sanitized - no actual secrets!)
- Relevant logs:
- Events:
- Resource status:
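Illustrative commands for collecting this information (output file names are arbitrary; the `nvcf-api` deployment name is an assumption):

```shell
# Environment details
kubectl version
helm version
helmfile version

# Resource status and events
kubectl get pods -A -o wide > pod-status.txt
kubectl get events -n nvcf --sort-by='.lastTimestamp' > events.txt

# Relevant logs
kubectl logs deployment/nvcf-api -n nvcf --tail=500 > api-logs.txt
```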
Appendix: NVCA Force Cleanup Script
This script forcefully removes all NVCA components from a cluster. Use it when helmfile destroy hangs on NVCA cleanup, typically because functions are still deployed in nvcf-backend.
This script bypasses normal cleanup procedures by removing finalizers. Always try helmfile destroy first.
Usage:
- Download or copy the script to your working directory
- Make it executable: `chmod +x force-cleanup-nvcf.sh`
- Preview what will be deleted:
- Run the cleanup:
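For example (the `--dry-run` flag is an assumption about the script's interface; check its usage output first):

```shell
chmod +x force-cleanup-nvcf.sh

# Preview what would be deleted (flag is an assumption)
./force-cleanup-nvcf.sh --dry-run

# Run the cleanup
./force-cleanup-nvcf.sh
```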
What the script does:
- Lists and force-deletes all function pods in the `nvcf-backend` namespace
- Deletes all NVCFBackend custom resources
- Deletes Helm releases in NVCA namespaces
- Removes finalizers from stuck NVCFBackend resources
- Deletes the NVCA namespaces (`nvcf-backend`, `nvca-system`, `nvca-operator`)
- Removes finalizers from namespaces stuck in Terminating state
- Deletes the NVCFBackend CRD
- Verifies cleanup completion
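The finalizer-removal step corresponds to a patch of this shape (the resource name is a placeholder):

```shell
# Remove finalizers from a stuck NVCFBackend resource
kubectl patch nvcfbackend <name> -n nvcf-backend \
  --type merge -p '{"metadata":{"finalizers":[]}}'
```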