Data Designer Troubleshooting#
This guide covers common issues and troubleshooting steps for the NeMo Data Designer microservice.
Common Issues#
Image Pull Errors#
Problem: Cannot pull Docker images from NGC registry.
Solution:
# Verify NGC authentication
echo "${NGC_CLI_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin
# Check if the image tag exists
docker pull nvcr.io/nvidia/nemo-microservices/data-designer:25.08
Permission Errors#
Problem: The service fails with permission errors when writing to the artifacts directory.
Solution:
# Ensure the artifacts directory is owned by the container user (UID/GID 1000 by default)
sudo chown -R 1000:1000 /path/to/artifacts
chmod -R 755 /path/to/artifacts
API Connection Issues#
Problem: Cannot connect to LLM endpoints or Data Designer API.
Solutions:
Test LLM endpoint connectivity:
# Test NVIDIA API connectivity
curl -H "Authorization: Bearer ${NVIDIA_API_KEY}" \
https://integrate.api.nvidia.com/v1/models
# Check Data Designer health
curl http://localhost:8000/health
Check firewall and network settings:
# Verify port is accessible
netstat -tlnp | grep :8000
# Test from inside the container (assumes curl is available in the image)
docker exec -it data-designer curl localhost:8000/health
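When you are scripting these checks, a small retry loop is often more useful than a single curl, since the service takes a moment to come up after `docker-compose up`. The following is a minimal sketch using only the Python standard library; the URL matches the health endpoint shown above, and the retry/timeout values are arbitrary defaults you should tune.

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url: str, retries: int = 5, delay: float = 2.0) -> bool:
    """Poll a health endpoint until it returns HTTP 200 or retries run out."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=3) as resp:
                if resp.status == 200:
                    print(f"healthy after {attempt} attempt(s)")
                    return True
        except (urllib.error.URLError, OSError) as exc:
            print(f"attempt {attempt}/{retries} failed: {exc}")
        if attempt < retries:
            time.sleep(delay)
    return False
```

Call `wait_for_health("http://localhost:8000/health")` before running generation jobs to distinguish "service still starting" from a genuine connectivity problem.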
Memory Issues#
Problem: Service running out of memory or performing poorly.
Solutions:
# Monitor resource usage
docker stats data-designer
# Check Docker memory limits
docker info | grep -i memory
# Increase Docker memory limits if needed
# Minimum recommended: 4GB RAM
For Docker Desktop users, increase memory allocation in Settings > Resources.
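On Linux hosts (where Docker Desktop's resource settings do not apply), you can cap and reserve memory per service in the Compose file instead. This is a hypothetical override; the service name `data-designer` and the 4 GB figure come from this guide, and the exact keys depend on your Compose version:

```yaml
# docker-compose.override.yml (sketch; adjust to your Compose version)
services:
  data-designer:
    mem_limit: 4g        # hard cap; matches the 4GB minimum noted above
    memswap_limit: 4g    # same value disables additional swap beyond the cap
```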
Asset Download Issues#
Problem: Cannot download assets from S3, or asset loading is slow.
Solutions:
Use local assets instead of S3:
# Download assets manually
mkdir -p ~/dev/data-designer-assets/datasets
cd ~/dev/data-designer-assets/datasets
export BUCKET_PATH="https://gretel-managed-assets-tmp-usw2.s3.us-west-2.amazonaws.com/datasets/"
curl -fL ${BUCKET_PATH}personal_details_streaming_1m.parquet -o personal_details_streaming_1m.parquet
curl -fL ${BUCKET_PATH}synthetic_personas_06_12_25_first_1000.parquet -o synthetic_personas_06_12_25.parquet
# Configure Data Designer to use local assets
export NEMO_MICROSERVICES_DATA_DESIGNER_ASSETS_STORAGE=~/dev/data-designer-assets
# Verify asset files exist
ls -la ~/dev/data-designer-assets/datasets/
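A common failure mode with manual downloads is a truncated file that still has the right name, so `ls` looks fine but loading fails later. As a cheap integrity check you can verify the Parquet magic bytes (`PAR1` at the start and end of the file) without needing pandas installed; this is a sketch, not a full validation of the file's internal structure:

```python
from pathlib import Path

def looks_like_parquet(path) -> bool:
    """Cheap sanity check: valid Parquet files begin and end with the magic bytes b'PAR1'."""
    data = Path(path).read_bytes()
    return len(data) > 12 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"

def check_assets(directory: str) -> None:
    """Flag any .parquet file in `directory` that looks truncated or corrupted."""
    for f in sorted(Path(directory).expanduser().glob("*.parquet")):
        status = "ok" if looks_like_parquet(f) else "POSSIBLY TRUNCATED"
        print(f"{f.name}: {status}")
```

For example, `check_assets("~/dev/data-designer-assets/datasets")` will flag partial downloads so you can re-run the curl commands above.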
Debug Mode#
Enable Debug Logging#
For detailed troubleshooting information, set the log level before starting the service (this assumes your Compose file forwards LOG_LEVEL into the container):
export LOG_LEVEL=DEBUG
docker-compose up data-designer
Container Debugging#
Access the running container for debugging:
# Get container ID
docker ps | grep data-designer
# Access container shell
docker exec -it <container-id> /bin/bash
# Check environment variables
docker exec -it <container-id> env | grep NEMO
# Check mounted volumes
docker exec -it <container-id> ls -la /artifacts_root
Log Analysis#
# View detailed logs
docker-compose logs --details data-designer
# Follow logs with timestamps
docker-compose logs -f -t data-designer
# Filter logs by level
docker-compose logs data-designer | grep ERROR
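When grepping isn't enough, a quick tally of log levels can show whether errors are isolated or sustained. This is a small helper sketch assuming the logs contain standard level keywords (DEBUG/INFO/WARNING/ERROR/CRITICAL); feed it lines from `docker-compose logs data-designer`:

```python
import re
from collections import Counter

LEVEL_RE = re.compile(r"\b(DEBUG|INFO|WARNING|ERROR|CRITICAL)\b")

def count_log_levels(lines) -> Counter:
    """Tally log-level keywords across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Pipe logs into a script that calls `count_log_levels(sys.stdin)` and print `counts.most_common()` to get a quick error-rate summary per service restart.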
Performance Issues#
Slow Response Times#
Check these common causes:
Asset Loading: Use local assets instead of S3
Memory Constraints: Increase Docker memory allocation
LLM Endpoint: Verify LLM service is responding quickly
# Test LLM endpoint response time
time curl -H "Authorization: Bearer ${NVIDIA_API_KEY}" \
https://integrate.api.nvidia.com/v1/models
# Monitor container performance
docker stats data-designer --no-stream
High Memory Usage#
# Check memory usage patterns
docker stats --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
# Review container resource limits
docker inspect data-designer | grep -A 5 "Memory"
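If you want to alert when the container approaches its limit, you can parse the `MemUsage` column of `docker stats` (strings like `512MiB / 4GiB`) into a ratio. A sketch of the unit conversion, assuming Docker's usual binary/decimal suffixes:

```python
def parse_size(value: str) -> float:
    """Convert a docker stats size string like '512MiB' or '1.5GiB' to bytes."""
    units = {
        "B": 1, "kB": 1000, "MB": 1000**2, "GB": 1000**3,
        "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3,
    }
    # Try longer suffixes first so 'MiB' is not mistaken for 'B'
    for unit in sorted(units, key=len, reverse=True):
        if value.endswith(unit):
            return float(value[: -len(unit)]) * units[unit]
    raise ValueError(f"unrecognized size: {value!r}")

def mem_usage_fraction(mem_usage: str) -> float:
    """Parse the MemUsage column ('512MiB / 4GiB') into a used/limit ratio."""
    used, limit = (parse_size(part.strip()) for part in mem_usage.split("/"))
    return used / limit
```

A ratio consistently above ~0.9 suggests raising the memory limit before the kernel OOM-killer terminates the container.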
Service Connectivity Issues#
Health Check Failures#
# Test health endpoint
curl -v http://localhost:8000/health
# Check if service is listening
netstat -tlnp | grep 8000
# Verify container is running
docker ps | grep data-designer
API Endpoint Issues#
# Test API endpoints
curl -X POST http://localhost:8000/v1beta1/data-designer/preview \
-H "Content-Type: application/json" \
-d '{"config": {"columns": [{"name": "test", "type": "name"}]}}'
# Check API documentation
curl http://localhost:8000/docs
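The same preview call can be scripted without extra dependencies, which makes it easier to inspect status codes and response bodies than with raw curl. This sketch builds the request from the curl example above; the endpoint path and payload schema are taken from that example, not independently verified:

```python
import json
import urllib.request

def build_preview_request(base_url: str, config: dict) -> urllib.request.Request:
    """Build the POST request used in the curl example above."""
    payload = json.dumps({"config": config}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1beta1/data-designer/preview",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Send it with `urllib.request.urlopen(req)`: a connection error points at the connectivity checks earlier in this guide, while an HTTP 4xx with a JSON body usually means an invalid column configuration.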
Data Issues#
Generation Failures#
Check logs for specific errors:
docker-compose logs data-designer | grep -iE -B 5 -A 5 "error|exception|failed"
Common causes:
Invalid column configurations
LLM endpoint unavailable
Insufficient disk space for artifacts
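The last cause is easy to rule out programmatically before a long generation run. A minimal pre-flight sketch using the standard library; the 1 GiB threshold is an assumption, not a documented requirement, so adjust it to your artifact sizes:

```python
import shutil

def check_disk_space(path: str = ".", min_free_gb: float = 1.0) -> bool:
    """Warn when free space at `path` drops below a threshold (threshold is a guess)."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    ok = free_gb >= min_free_gb
    print(f"{free_gb:.1f} GiB free at {path}: {'OK' if ok else 'LOW, generation may fail'}")
    return ok
```

Point it at your artifacts directory (for example `check_disk_space("/path/to/artifacts")`) before kicking off large jobs.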
Asset File Issues#
# Verify asset files are accessible
docker exec -it data-designer ls -la /app/data/data-designer/datasets/
# Check file permissions
docker exec -it data-designer stat /app/data/data-designer/datasets/*.parquet
# Test asset loading manually
docker exec -it data-designer python -c "
import pandas as pd
df = pd.read_parquet('/app/data/data-designer/datasets/personal_details_streaming_1m.parquet')
print(f'Loaded {len(df)} records')
"
Limitations and Known Issues#
Development Use: Docker Compose deployment is recommended for development and testing, not production
Single Node: This setup runs on a single machine without high availability
Storage: Artifacts are stored locally; consider backup strategies for important data
Scalability: Limited horizontal scaling compared to Kubernetes deployments
Security: Default configuration may not meet production security requirements
For production deployments, consider using Kubernetes with proper ingress, storage classes, and security configurations.