Troubleshooting#
Here are some commonly occurring issues and ways to resolve them.
General Issues#
Problem:
a. Error signatures: “The maximum number of addresses has been reached”
b. Cause: The quota for Elastic IP Addresses has been reached.
c. Resolution: Either delete old Elastic IP Addresses or increase the quota.
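To decide which Elastic IP addresses are safe to delete, it can help to list the ones that are not attached to any resource. The following is a minimal sketch, not part of the installation scripts: the helper names are hypothetical, and it assumes boto3 is installed and credentials are configured. The filter relies on the shape of the EC2 DescribeAddresses response, where an attached address carries an AssociationId.

```python
def unassociated_addresses(response):
    """Return the Elastic IPs in a DescribeAddresses response that are not attached."""
    return [
        addr
        for addr in response.get("Addresses", [])
        if "AssociationId" not in addr
    ]


def find_unused_elastic_ips(region_name):
    """Query EC2 in `region_name` and return the unattached Elastic IPs."""
    # boto3 is imported lazily so the filter above works without it installed.
    import boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    return unassociated_addresses(ec2.describe_addresses())


# Example (requires AWS credentials; the region is an assumption):
# for addr in find_unused_elastic_ips("us-west-2"):
#     print(addr.get("PublicIp"), addr.get("AllocationId"))
```

Review the printed addresses before releasing any of them; an address that is unattached right now may still be reserved for another deployment.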
Problem:
a. Error signatures:
“The maximum number of internet gateways has been reached”
“The maximum number of VPCs has been reached”
b. Cause: The quota for Virtual Private Clouds (VPCs) or internet gateways has been reached.
c. Resolution: Either delete unused VPCs and internet gateways or increase the quota.
Problem:
a. Error signatures:
“An error occurred (ExpiredTokenException) when calling the DescribeNodegroup operation: The security token included in the request is expired”
“WARN: failed to get session token, falling back to IMDSv1: 404 Not Found: Not Found”
“ERROR session: fetching region failed: NoCredentialProviders: no valid providers in chain. Deprecated.”
b. Cause: The AWS access key has expired.
c. Resolution: Set the new keys using the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN.
Problem:
a. Error signatures: “socket.gaierror: [Errno -2] Name or service not known” from command - “vius pipelines list”.
b. Cause: Pods are still initializing or failing.
c. Resolution: Wait until the pods are initialized and try the command again. Contact the NVIDIA team if the issue persists. Use kubectl get pods to check the health of the pods.
Problem:
a. Error signatures: “Max retries exceeded with url: /v1/pipelines” from command - “vius pipelines list”
b. Cause: Pods are still initializing or failing.
c. Resolution: Wait until the pods are initialized and try the command again. Contact the NVIDIA team if the issue persists. Use kubectl get pods to check the health of the pods.
Problem:
a. Error signatures: “Got exception 502 Server Error: Bad Gateway for url: http://<>/api/v1/pipelines” from command - “vius pipelines list”
b. Cause: The CDS API endpoint is not ready.
c. Resolution: Wait for some time and try the command again. Contact the NVIDIA team if the issue persists.
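Several of the problems above come down to pods that are still initializing or failing. As a rough sketch, the output of kubectl get pods -o json can be filtered for pods that are not yet ready. The function names here are hypothetical, and the sketch assumes kubectl is on the PATH and already points at the right cluster and namespace.

```python
import json
import subprocess


def not_ready_pods(pod_list):
    """Return (name, phase) for every pod that is not fully ready.

    `pod_list` is the parsed output of `kubectl get pods -o json`.
    """
    pending = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase == "Succeeded":
            continue  # completed jobs are not a problem
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            pending.append((pod["metadata"]["name"], phase))
    return pending


def fetch_pods():
    """Run kubectl and return the parsed pod list (assumes kubectl is configured)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Example:
# for name, phase in not_ready_pods(fetch_pods()):
#     print(f"{name}: {phase}")
```

An empty result suggests the pods are up and the command is worth retrying; a pod that stays in the list for a long time is a candidate for kubectl logs or a report to the NVIDIA team.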
Problem:
a. Error signatures: UI Error: “Failed to search in collection <Collection ID>. Error details: “Something went wrong with the request: <ClientError: An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation: Not authorized to perform sts:AssumeRoleWithWebIdentity>”.
b. Cause: The UI is not able to access the assets (e.g. images/videos from the S3 bucket). The AWS configure step for the S3 bucket does not work properly with the installation scripts at the moment.
c. Resolution:
Delete the AWS config file.
Re-run ./enable_s3.sh. This will reset the policies for the bucket.
Uninstall the CDS service using helm uninstall visual-search.
Install the CDS service using helm install visual-search visual-search --values=values.yaml.
Refresh the UI, and it should display the assets.
Problem:
a. Error signature: S3 bucket creation fails with the “IllegalLocationConstraintException” message.
b. Cause: The region is not one of the valid regions, as mentioned in the AWS S3 API guide.
c. Resolution: Use one of the valid regions to create the S3 bucket or modify the script to comply with the AWS S3 API requirements.
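One common source of location-constraint errors when creating buckets with boto3 is a mismatch between the client region and the LocationConstraint value. In us-east-1 the CreateBucketConfiguration must be omitted entirely, while other regions need it set to the region name. The sketch below illustrates that rule; the helper name is hypothetical, and it is not confirmed that the installation script fails for this specific reason.

```python
def create_bucket_kwargs(bucket, region):
    """Build keyword arguments for boto3's s3.create_bucket for `region`."""
    kwargs = {"Bucket": bucket}
    # us-east-1 is the default location and must not be passed as a constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


# Example (requires boto3 and AWS credentials; the bucket name is a placeholder):
# import boto3
# region = "eu-west-1"
# s3 = boto3.client("s3", region_name=region)
# s3.create_bucket(**create_bucket_kwargs("my-example-bucket", region))
```

Creating the client with the same region that is passed as the constraint keeps the request endpoint and the constraint consistent.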
Problem:
a. Error signature: Problem installing the vius pip client outside of the installation Docker container.
b. Resolution: The vius pip client requires pip<=25.
Cosmos Embed NIM Service Issues#
Problem:
a. Error signatures:
“Failed to download model from NGC”
“NGC authentication failed”
“Model download timeout”
b. Cause: Issues with the NGC API key or model access permissions for Cosmos Embed NIM.
c. Resolution:
Verify the NGC_API_KEY environment variable is set correctly.
Ensure the API key has access to the nvidia/cosmos-embed model.
Check network connectivity to nvcr.io.
Increase timeout settings in the cosmos-embed-override.yaml file.
Problem:
a. Error signatures:
“CUDA out of memory”
“Pod killed due to memory limit”
“cosmos-embed pod in CrashLoopBackOff”
b. Cause: Insufficient GPU memory for Cosmos Embed NIM model.
c. Resolution:
Increase GPU memory limits in cosmos-embed-override.yaml.
Reduce batch sizes in pipeline configuration.
Use GPU with more memory (A100/H100 recommended).
Enable model quantization options if available.
Problem:
a. Error signatures:
“Model cache not found”
“Persistent volume claim failed”
“Storage class not found: high-perf-gp3”
b. Cause: Storage configuration issues with the Cosmos Embed NIM model cache.
c. Resolution:
Verify the high-perf-gp3 storage class is properly configured.
Check PVC creation and binding status with the kubectl get pvc command.
Ensure a sufficient storage quota (50GB+) for the model cache.
Verify the EBS CSI driver is properly installed.
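The PVC binding check above can be narrowed down to the claims that need attention. This is a small sketch with a hypothetical helper name: it takes the parsed output of kubectl get pvc -o json and reports every claim whose phase is not Bound, together with the storage class it requested.

```python
def unbound_pvcs(pvc_list):
    """Return (name, phase, storage_class) for every PVC that is not Bound.

    `pvc_list` is the parsed output of `kubectl get pvc -o json`.
    """
    results = []
    for pvc in pvc_list.get("items", []):
        phase = pvc.get("status", {}).get("phase", "Unknown")
        if phase != "Bound":
            storage_class = pvc.get("spec", {}).get("storageClassName")
            results.append((pvc["metadata"]["name"], phase, storage_class))
    return results


# Example (assumes kubectl is configured for the cluster):
# import json, subprocess
# out = subprocess.run(["kubectl", "get", "pvc", "-o", "json"],
#                      check=True, capture_output=True, text=True).stdout
# print(unbound_pvcs(json.loads(out)))
```

A claim stuck in Pending that requests high-perf-gp3 points toward the storage class or the EBS CSI driver rather than the quota.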
Problem:
a. Error signatures:
“cosmos-embed service unavailable”
“Connection refused to cosmos-embed:8000”
“Timeout waiting for cosmos-embed to be ready”
b. Cause: Cosmos Embed NIM service has not properly started or health checks are failing.
c. Resolution:
Check the pod status:
kubectl get pods -l app.kubernetes.io/name=nvidia-nim-cosmos-embed
Review the pod logs:
kubectl logs -f <cosmos-embed-pod-name>
Verify GPU node scheduling and tolerations.
Check startup probe timeout settings (30+ minutes may be needed for the first boot).
Use the debug script:
./debug_cosmos_embed.sh
REST API Troubleshooting#
API Not Responding#
Check if the CDS API is accessible:
curl http://localhost:8888/health
If this fails, verify services are running:
docker compose -f deploy/standalone/docker-compose.build.yml ps
make test-integration-logs
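Because services can take a while to come up, a single failed health check is not always conclusive. The following sketch polls the health endpoint shown above a few times before giving up. The function names, attempt count, and delay are assumptions chosen for illustration, and only the Python standard library is used.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url, timeout=5):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx responses.
        return False


def wait_until(check, attempts=10, delay=3.0, sleep=time.sleep):
    """Call `check()` up to `attempts` times, pausing `delay` seconds between tries.

    Returns the 1-based attempt number that succeeded, or None if none did.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None


# Example:
# ok = wait_until(lambda: is_healthy("http://localhost:8888/health"))
# print("API is up" if ok else "API did not respond; check the service logs")
```

The `sleep` parameter exists only so the loop can be exercised without real waiting; in normal use the default is fine.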
Invalid Request Errors (400)#
Problem: API returns 400 Bad Request
Resolution:
Verify JSON syntax is correct
Check the request matches the expected schema in the interactive API docs
Use Swagger UI at http://localhost:8888/v1/docs to see required fields and test requests
Ensure all required fields are provided
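The first check in the list, JSON syntax, can be done locally before the request is sent. Here is a minimal sketch using the standard json module; the function name is hypothetical, and it checks syntax only, not whether the body matches the API schema.

```python
import json


def check_json(text):
    """Return None if `text` is valid JSON, else a description of the first syntax error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# Example: a trailing comma is a frequent cause of 400 responses.
# print(check_json('{"name": "demo",}'))
```

If this returns None and the API still answers 400, the remaining checks (schema and required fields) are the more likely cause.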
Collection Not Found (404)#
Problem: API returns 404 for collection operations
Resolution:
Verify the collection ID is correct (check for typos)
List all collections to see available IDs:
curl http://localhost:8888/v1/collections
Ensure the collection wasn’t deleted
Video Ingestion Fails#
Problem: Video ingestion returns errors or fails silently
Resolution:
For LocalStack: Verify videos exist in bucket:
# Using boto3
python -c "import boto3; s3=boto3.client('s3',endpoint_url='http://localhost:4566',aws_access_key_id='test',aws_secret_access_key='test'); print(s3.list_objects_v2(Bucket='cosmos-test-bucket',Prefix='videos/'))"
Check CDS service logs:
docker compose -f deploy/standalone/docker-compose.build.yml logs visual-search --tail=50
Verify Cosmos-embed NIM is running and ready:
curl http://localhost:9000/v1/health/ready
Check presigned URLs are accessible (should return 200):
# Generate a test presigned URL and verify
python -c "import boto3,requests; s3=boto3.client('s3',endpoint_url='http://localhost:4566',aws_access_key_id='test',aws_secret_access_key='test'); url=s3.generate_presigned_url('get_object',Params={'Bucket':'cosmos-test-bucket','Key':'videos/video70.mp4'},ExpiresIn=3600); print(f'URL: {url}'); r=requests.head(url); print(f'Status: {r.status_code}')"