Kubernetes Troubleshooting Guide#
This guide covers common issues and solutions when deploying CDS on Kubernetes (tested on AWS EKS).
General Troubleshooting Steps#
Checking Pod Status#
# Check all pods
docker exec cds-deployment bash -c "kubectl get pods"
# Check pods in specific namespace
docker exec cds-deployment bash -c "kubectl get pods -n default"
# Watch pods in real-time
docker exec cds-deployment bash -c "kubectl get pods -w"
Viewing Logs#
# View logs from a specific pod
docker exec cds-deployment bash -c "kubectl logs <pod-name> --tail=100"
# Follow logs in real-time
docker exec cds-deployment bash -c "kubectl logs -f <pod-name>"
# View logs from all pods with a label
docker exec cds-deployment bash -c "kubectl logs -l app=visual-search --tail=50"
Describing Resources#
# Get detailed pod information
docker exec cds-deployment bash -c "kubectl describe pod <pod-name>"
# Describe a deployment
docker exec cds-deployment bash -c "kubectl describe deployment <deployment-name>"
# Check events (shows recent cluster events)
docker exec cds-deployment bash -c "kubectl get events --sort-by='.lastTimestamp'"
Accessing Pod Shell#
# Open interactive shell in a pod
docker exec -it cds-deployment bash -c "kubectl exec -it <pod-name> -- /bin/bash"
Common Issues#
Pods Won’t Start#
ImagePullBackOff#
Symptom: Pod stuck in ImagePullBackOff or ErrImagePull status
Check:
docker exec cds-deployment bash -c "kubectl describe pod <pod-name> | grep -A 10 Events"
Common Causes:
NGC API key is incorrect or not set
Image doesn’t exist or access denied
Network issues preventing image download
Solution:
# Verify NGC secrets exist
docker exec cds-deployment bash -c "kubectl get secret nvcr-io ngc-api ngc-secret"
# Check secret contents (base64 encoded)
docker exec cds-deployment bash -c "kubectl describe secret ngc-api"
# If secrets are wrong, recreate them
# Inside container:
cd /workspace/blueprint/bringup
./secrets.sh
# Restart the problematic pod
docker exec cds-deployment bash -c "kubectl delete pod <pod-name>"
CrashLoopBackOff#
Symptom: Pod keeps restarting
Check logs:
docker exec cds-deployment bash -c "kubectl logs <pod-name> --previous"
Common Causes:
Application configuration error
Missing environment variables
Out of memory
Dependency service not available
Solution: Check logs for specific error messages and fix the root cause.
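As a starting point, the checks below inspect the pod's environment variables, resource limits, and last termination state, which cover several of the causes listed above (adjust the grep patterns and pod name to your case):
# Inspect environment variables and resource limits on the failing pod
docker exec cds-deployment bash -c "kubectl describe pod <pod-name> | grep -E -A 15 'Environment|Limits'"
# Check whether the previous container was OOM-killed
docker exec cds-deployment bash -c "kubectl describe pod <pod-name> | grep -A 5 'Last State'"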
Pending Pods#
Symptom: Pod stuck in Pending status
Check:
docker exec cds-deployment bash -c "kubectl describe pod <pod-name>"
Common Causes:
Insufficient resources (CPU/memory/GPU)
Volume cannot be mounted
Node selector doesn’t match any nodes
Taints preventing scheduling
Solutions:
# Check node resources
docker exec cds-deployment bash -c "kubectl describe nodes | grep -A 10 'Allocated resources'"
# Check if nodes match selectors
docker exec cds-deployment bash -c "kubectl get nodes --show-labels | grep role=cvs-gpu"
# Check PVC status
docker exec cds-deployment bash -c "kubectl get pvc"
Insufficient Resources#
Symptom: Pod shows insufficient CPU, memory, or GPU
Solution:
# Check node capacity
docker exec cds-deployment bash -c "kubectl describe node <node-name>"
# For GPU nodes specifically
docker exec cds-deployment bash -c "kubectl describe nodes -l role=cvs-gpu | grep -A 15 'Allocated resources'"
# Scale up node group if needed
docker exec cds-deployment bash -c "eksctl scale nodegroup --cluster=\$CLUSTER_NAME --name=cvs-gpu --nodes=3"
GPU Issues#
GPU Not Available to Pods#
Check if GPU nodes exist:
docker exec cds-deployment bash -c "kubectl get nodes -l role=cvs-gpu"
Check if NVIDIA device plugin is running:
docker exec cds-deployment bash -c "kubectl get pods -n kube-system | grep nvidia"
Solution:
# Check NVIDIA device plugin logs
docker exec cds-deployment bash -c "kubectl logs -n kube-system -l name=nvidia-device-plugin-ds"
# Restart device plugin if needed
docker exec cds-deployment bash -c "kubectl delete pod -n kube-system -l name=nvidia-device-plugin-ds"
NVIDIA Device Plugin Issues#
Verify plugin is running on GPU nodes:
docker exec cds-deployment bash -c "kubectl get pods -n kube-system -o wide | grep nvidia"
Check GPU availability:
# Exec into GPU node and check nvidia-smi
docker exec cds-deployment bash -c "kubectl debug node/<gpu-node-name> -it --image=nvidia/cuda:11.8.0-base-ubuntu22.04 -- nvidia-smi"
Networking Issues#
Service Not Reachable#
Check service exists:
docker exec cds-deployment bash -c "kubectl get svc"
Check endpoints:
docker exec cds-deployment bash -c "kubectl get endpoints <service-name>"
Solution:
# Test service connectivity from another pod
docker exec cds-deployment bash -c "kubectl run -it --rm debug --image=busybox --restart=Never -- wget -O- http://<service-name>:<port>"
DNS Resolution Failures#
Test DNS from within cluster:
docker exec cds-deployment bash -c "kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup <service-name>"
Check CoreDNS pods:
docker exec cds-deployment bash -c "kubectl get pods -n kube-system -l k8s-app=kube-dns"
Ingress Not Working#
Check ingress status:
docker exec cds-deployment bash -c "kubectl get ingress simple-ingress"
Check ingress controller:
docker exec cds-deployment bash -c "kubectl get pods -n ingress-nginx"
Check load balancer:
docker exec cds-deployment bash -c "kubectl get svc -n ingress-nginx"
Common Issues:
ALB not created: Check ingress controller logs
404 errors: Verify ingress paths match service endpoints
Connection refused: Wait 2-3 minutes for ALB to initialize
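For the ALB and 404 cases above, the ingress controller logs are the first place to look. A minimal check, assuming the standard ingress-nginx chart labels:
# Check ingress controller logs (label assumes the standard ingress-nginx chart)
docker exec cds-deployment bash -c "kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50"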
LoadBalancer Pending#
Check service type:
docker exec cds-deployment bash -c "kubectl get svc -n ingress-nginx"
AWS Specific: Verify cloud controller is working:
docker exec cds-deployment bash -c "kubectl logs -n kube-system -l app=aws-cloud-controller-manager"
Storage Issues#
PVC Pending#
Check PVC status:
docker exec cds-deployment bash -c "kubectl get pvc"
docker exec cds-deployment bash -c "kubectl describe pvc <pvc-name>"
Common Causes:
Storage class doesn’t exist
No available volumes
Zone mismatch
Solution:
# Check storage classes
docker exec cds-deployment bash -c "kubectl get storageclass"
# For AWS EKS, verify EBS CSI driver is running
docker exec cds-deployment bash -c "kubectl get pods -n kube-system | grep ebs-csi"
Volume Mount Failures#
Check pod events:
docker exec cds-deployment bash -c "kubectl describe pod <pod-name> | grep -A 20 Events"
Common Causes:
PVC not bound
Volume in use by another pod (RWO volumes)
Permission issues
Solution: Delete the pod and let it recreate, or delete the PVC and recreate the volume.
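For example, a minimal recovery sequence (check which PVC is unbound with kubectl get pvc before deleting anything):
# Delete the pod so its controller recreates it
docker exec cds-deployment bash -c "kubectl delete pod <pod-name>"
# Confirm the PVC is Bound before the replacement pod starts
docker exec cds-deployment bash -c "kubectl get pvc"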
Model Loading Issues#
Models Not Found (Cosmos-embed)#
Symptom: Cosmos-embed pod stays in ContainerCreating for a long time
Check:
docker exec cds-deployment bash -c "kubectl logs -l app.kubernetes.io/name=nvidia-nim-cosmos-embed --tail=50"
This is usually normal: Cosmos-embed downloads a ~20 GB model on first start.
Expected logs:
INFO Downloaded filename: Cosmos-Embed1/model-00001-of-00005.safetensors
INFO Downloaded filename: Cosmos-Embed1/model-00002-of-00005.safetensors
Duration: 10-15 minutes on first download
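To follow the download in real time, tail the pod logs with the same label selector used above:
# Follow model download progress
docker exec cds-deployment bash -c "kubectl logs -f -l app.kubernetes.io/name=nvidia-nim-cosmos-embed"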
Out of Memory When Loading Models#
Check pod resource limits:
docker exec cds-deployment bash -c "kubectl describe pod -l app.kubernetes.io/name=nvidia-nim-cosmos-embed | grep -A 10 Limits"
Check node memory:
docker exec cds-deployment bash -c "kubectl top nodes"
Solution: Ensure cosmos-embed is scheduled on a node with sufficient GPU memory (16GB minimum).
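A quick way to verify the scheduling, assuming GPU nodes carry the role=cvs-gpu label used elsewhere in this guide:
# See which node the cosmos-embed pod landed on
docker exec cds-deployment bash -c "kubectl get pod -l app.kubernetes.io/name=nvidia-nim-cosmos-embed -o wide"
# Confirm that node advertises GPUs
docker exec cds-deployment bash -c "kubectl describe nodes -l role=cvs-gpu | grep nvidia.com/gpu"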
Database Issues#
Milvus Connection Errors#
Check Milvus pods:
docker exec cds-deployment bash -c "kubectl get pods | grep milvus"
Check Milvus proxy logs:
docker exec cds-deployment bash -c "kubectl logs deployment/milvus-proxy --tail=100"
Test Milvus connectivity:
docker exec cds-deployment bash -c "kubectl run -it --rm debug --image=busybox --restart=Never -- nc -zv milvus.default.svc.cluster.local 19530"
etcd Issues#
Check etcd pod:
docker exec cds-deployment bash -c "kubectl get pods | grep etcd"
docker exec cds-deployment bash -c "kubectl logs milvus-etcd-0 --tail=100"
Common Issues:
Disk full: Check PVC size
Memory issues: Check resource limits
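Two quick checks for these cases (the grep pattern assumes the etcd PVC name contains "etcd"):
# Check etcd resource limits
docker exec cds-deployment bash -c "kubectl describe pod milvus-etcd-0 | grep -A 10 Limits"
# Check the etcd PVC size and status
docker exec cds-deployment bash -c "kubectl get pvc | grep etcd"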
Data Persistence Issues#
Check PVCs for Milvus:
docker exec cds-deployment bash -c "kubectl get pvc | grep milvus"
Check if data is actually in S3:
aws s3 ls s3://$S3_BUCKET_NAME/ --recursive | head -20
Performance Issues#
Slow Query Response#
Check Milvus query node:
docker exec cds-deployment bash -c "kubectl logs deployment/milvus-querynode --tail=100"
docker exec cds-deployment bash -c "kubectl top pod -l component=querynode"
Solution:
Ensure the query node is on a high-memory instance (e.g., r7i.4xlarge)
Check if GPU CAGRA index is being used
Verify mmap is enabled for large collections
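To confirm where the query node is running and what its resource limits are (using the component=querynode label from above):
# Check query node placement and resource limits
docker exec cds-deployment bash -c "kubectl get pods -l component=querynode -o wide"
docker exec cds-deployment bash -c "kubectl describe pod -l component=querynode | grep -A 10 Limits"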
High Memory Usage#
Check pod memory:
docker exec cds-deployment bash -c "kubectl top pods"
Check node memory:
docker exec cds-deployment bash -c "kubectl top nodes"
Solution: Scale horizontally or increase node size if consistently hitting limits.
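For example (which deployment or node group to scale depends on which pods are hitting their limits):
# Scale a deployment horizontally
docker exec cds-deployment bash -c "kubectl scale deployment/visual-search --replicas=2"
# Or add nodes to the node group
docker exec cds-deployment bash -c "eksctl scale nodegroup --cluster=\$CLUSTER_NAME --name=cvs-gpu --nodes=3"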
GPU Underutilization#
Check GPU usage:
# Get GPU node name
GPU_NODE=$(docker exec cds-deployment bash -c "kubectl get nodes -l role=cvs-gpu -o jsonpath='{.items[0].metadata.name}'")
# Check GPU utilization
docker exec cds-deployment bash -c "kubectl debug node/$GPU_NODE -it --image=nvidia/cuda:11.8.0-base-ubuntu22.04 -- nvidia-smi"
Service-Specific Issues#
CDS Service#
Check status:
docker exec cds-deployment bash -c "kubectl get pods -l app=visual-search"
docker exec cds-deployment bash -c "kubectl logs deployment/visual-search --tail=100"
Common Issues:
Can’t connect to Milvus: Check Milvus is running
Can’t connect to Cosmos-embed: Check cosmos-embed pod is ready
500 errors: Check logs for specific error messages
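A quick triage for these cases is to confirm the dependencies are up and filter the service logs for errors:
# Confirm Milvus and Cosmos-embed pods are Running
docker exec cds-deployment bash -c "kubectl get pods | grep -E 'milvus-proxy|cosmos-embed'"
# Filter CDS logs for errors
docker exec cds-deployment bash -c "kubectl logs deployment/visual-search --tail=200 | grep -i error"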
Cosmos-embed NIM#
Check pod status:
docker exec cds-deployment bash -c "kubectl get pods -l app.kubernetes.io/name=nvidia-nim-cosmos-embed"
Debug script available:
docker exec cds-deployment bash -c "cd /workspace/blueprint/bringup && ./debug_cosmos_embed.sh"
Common Issues:
ImagePullBackOff: Verify NGC_API_KEY
Pending: Check GPU nodes and resources
Long startup time: Normal - downloading model
Milvus Vector Database#
Check all Milvus components:
docker exec cds-deployment bash -c "kubectl get pods | grep milvus"
Expected pods (15 total):
milvus-proxy
milvus-rootcoord, datacoord, querycoord, indexcoord
milvus-querynode, datanode, indexnode
milvus-etcd-0
milvus-kafka-0, kafka-1, kafka-2
milvus-zookeeper-0
Check specific component:
docker exec cds-deployment bash -c "kubectl logs deployment/milvus-<component> --tail=100"
Object Storage (S3)#
Verify S3 bucket exists:
aws s3 ls s3://$S3_BUCKET_NAME/
Check service account has S3 permissions:
docker exec cds-deployment bash -c "kubectl describe serviceaccount s3-access-sa"
The output should show an IAM role ARN annotation.
Test S3 access from pod:
docker exec cds-deployment bash -c "kubectl run -it --rm s3-test --image=amazon/aws-cli --serviceaccount=s3-access-sa --restart=Never -- s3 ls s3://$S3_BUCKET_NAME/"
Web UI#
Check UI pod:
docker exec cds-deployment bash -c "kubectl get pods -l app=visual-search-react-ui"
docker exec cds-deployment bash -c "kubectl logs deployment/visual-search-react-ui --tail=50"
Check UI can reach API:
UI needs ingress hostname configured
Check environment variables in UI pod
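To list the environment variables configured on the UI deployment:
# List environment variables set on the UI deployment
docker exec cds-deployment bash -c "kubectl set env deployment/visual-search-react-ui --list"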
Debugging Techniques#
Enable Debug Logging#
For CDS:
# Set log level via environment variable
docker exec cds-deployment bash -c "kubectl set env deployment/visual-search LOG_LEVEL=DEBUG"
Port Forwarding for Local Access#
# Forward CDS API
docker exec -it cds-deployment bash -c "kubectl port-forward svc/visual-search 8888:8888"
# Forward Milvus
docker exec -it cds-deployment bash -c "kubectl port-forward svc/milvus 19530:19530"
# Forward Cosmos-embed
docker exec -it cds-deployment bash -c "kubectl port-forward svc/cosmos-embed 8000:8000"
Resource Monitoring#
Check resource usage:
# Pod resources
docker exec cds-deployment bash -c "kubectl top pods"
# Node resources
docker exec cds-deployment bash -c "kubectl top nodes"
# Detailed node info
docker exec cds-deployment bash -c "kubectl describe nodes | grep -A 10 'Allocated resources'"
AWS EKS-Specific Issues#
EKS Node Group Issues#
Check node group status:
docker exec cds-deployment bash -c "eksctl get nodegroup --cluster=\$CLUSTER_NAME"
Scale node group:
docker exec cds-deployment bash -c "eksctl scale nodegroup --cluster=\$CLUSTER_NAME --name=cvs-gpu --nodes=2"
IAM Role Problems#
Check service account IAM role:
docker exec cds-deployment bash -c "kubectl describe serviceaccount s3-access-sa"
The service account should have the annotation eks.amazonaws.com/role-arn.
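To confirm the annotation quickly:
# Show the role-arn annotation on the service account
docker exec cds-deployment bash -c "kubectl get serviceaccount s3-access-sa -o yaml | grep role-arn"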
List IAM roles:
docker exec cds-deployment bash -c "aws iam list-roles | grep \$CLUSTER_NAME"
EBS CSI Driver Issues#
Check EBS CSI driver status:
docker exec cds-deployment bash -c "eksctl get addon --cluster=\$CLUSTER_NAME --name=aws-ebs-csi-driver"
Check driver pods:
docker exec cds-deployment bash -c "kubectl get pods -n kube-system | grep ebs-csi"
Data Issues#
Cannot Ingest Data#
Check:
Collection exists:
cds collections list
S3 bucket accessible:
aws s3 ls s3://$S3_BUCKET_NAME/
Videos in S3:
aws s3 ls s3://$S3_BUCKET_NAME/videos/
AWS profile configured (if using --s3-profile)
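To confirm the AWS profile exists and has credentials (assuming the cds-s3-aws profile used in the example below):
# Verify the AWS profile is configured
aws configure list --profile cds-s3-aws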
Test ingestion with limit:
cds ingest files s3://bucket/videos/ \
--collection-id <id> \
--extensions mp4 \
--limit 1 \
--s3-profile cds-s3-aws
Search Returns No Results#
Check collection has documents:
cds collections get <collection-id>
Look for total_documents_count in output.
Check Cosmos-embed is working:
docker exec cds-deployment bash -c "kubectl logs -l app.kubernetes.io/name=nvidia-nim-cosmos-embed --tail=50"
Test Cosmos-embed endpoint:
# Port forward and test
docker exec -it cds-deployment bash -c "kubectl port-forward svc/cosmos-embed 8000:8000" &
curl -k http://localhost:8000/v1/health/ready
Recovery Procedures#
Restart Specific Pods#
# Restart CDS
docker exec cds-deployment bash -c "kubectl rollout restart deployment/visual-search"
# Restart cosmos-embed
docker exec cds-deployment bash -c "kubectl rollout restart deployment/cosmos-embed-nvidia-nim-cosmos-embed"
# Restart Milvus component
docker exec cds-deployment bash -c "kubectl rollout restart deployment/milvus-proxy"
Scale Down and Up#
# Scale down deployment
docker exec cds-deployment bash -c "kubectl scale deployment/visual-search --replicas=0"
# Scale back up
docker exec cds-deployment bash -c "kubectl scale deployment/visual-search --replicas=1"
Delete and Recreate Service#
# Uninstall Helm release
docker exec cds-deployment bash -c "helm uninstall visual-search"
# Reinstall
docker exec cds-deployment bash -c "cd /workspace/blueprint/bringup && helm upgrade --install visual-search visual-search --values values.yaml"
Getting Help#
Collecting Diagnostic Information#
When reporting issues, collect:
# Pod status
docker exec cds-deployment bash -c "kubectl get pods -o wide"
# Recent events
docker exec cds-deployment bash -c "kubectl get events --sort-by='.lastTimestamp' | tail -50"
# Service logs
docker exec cds-deployment bash -c "kubectl logs deployment/visual-search --tail=200"
docker exec cds-deployment bash -c "kubectl logs -l app.kubernetes.io/name=nvidia-nim-cosmos-embed --tail=200"
# Node status
docker exec cds-deployment bash -c "kubectl get nodes"
docker exec cds-deployment bash -c "kubectl describe nodes | grep -A 10 'Allocated resources'"
# Helm releases
docker exec cds-deployment bash -c "helm list"
Reporting Issues#
Include:
Pod status (kubectl get pods)
Relevant logs (kubectl logs)
Pod descriptions (kubectl describe pod)
Cluster info (EKS version, node types)