Data Designer Troubleshooting#

This guide covers common issues and troubleshooting steps for the NeMo Data Designer microservice.

Common Issues#

Image Pull Errors#

Problem: Cannot pull Docker images from the NGC registry.

Solution:

# Verify NGC authentication (--password-stdin avoids exposing the key in shell history)
echo "${NGC_CLI_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin

# Check if the image tag exists
docker pull nvcr.io/nvidia/nemo-microservices/data-designer:25.08
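Authentication failures often come down to the key variable being unset or empty in the current shell. A fail-fast check can rule that out before blaming the registry; `require_env` is an illustrative helper name, not part of the microservice:

```shell
# require_env NAME : fail with a message if the named variable is unset or empty
require_env() {
  eval "val=\${$1:-}"
  if [ -z "$val" ]; then
    echo "error: $1 is not set; export it before docker login" >&2
    return 1
  fi
}

# Example: check the NGC key before attempting docker login
require_env NGC_CLI_API_KEY || echo "set NGC_CLI_API_KEY and retry"
```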

Permission Errors#

Problem: Permission errors when the service reads from or writes to the artifacts directory.

Solution:

# Give the artifacts directory to UID/GID 1000 (the user the container runs as)
sudo chown -R 1000:1000 /path/to/artifacts
chmod -R 755 /path/to/artifacts

API Connection Issues#

Problem: Cannot connect to LLM endpoints or the Data Designer API.

Solutions:

Test LLM endpoint connectivity:

# Test NVIDIA API connectivity
curl -H "Authorization: Bearer ${NVIDIA_API_KEY}" \
  https://integrate.api.nvidia.com/v1/models

# Check Data Designer health
curl http://localhost:8000/health

Check firewall and network settings:

# Verify the port is accessible (use `ss -tlnp` if netstat is unavailable)
netstat -tlnp | grep :8000

# Test from inside container
docker exec -it data-designer curl localhost:8000/health
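The service can take a while to become healthy after startup, so a one-shot curl may fail even when nothing is wrong. A small retry loop separates "not ready yet" from "actually broken"; `wait_for` is a generic sketch, and the health URL is the one used above:

```shell
# wait_for TRIES DELAY CMD... : retry CMD up to TRIES times, sleeping DELAY
# seconds between attempts; succeeds as soon as CMD does
wait_for() {
  tries=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: give the service up to 60 seconds to come up
# wait_for 30 2 curl -fsS http://localhost:8000/health
```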

Memory Issues#

Problem: Service running out of memory or performing poorly.

Solutions:

# Monitor resource usage
docker stats data-designer

# Check Docker memory limits
docker info | grep -i memory

# Increase Docker memory limits if needed
# Minimum recommended: 4GB RAM

For Docker Desktop users, increase memory allocation in Settings > Resources.
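For Docker Compose deployments, the limit can also be set in the service definition. A hypothetical fragment (the service name and the 4g value mirror the recommendation above; `mem_limit` applies to non-Swarm Compose deployments):

```yaml
services:
  data-designer:
    # Cap container memory; raise this if the service is OOM-killed
    mem_limit: 4g
```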

Asset Download Issues#

Problem: Asset downloads from S3 fail, or assets load slowly.

Solutions:

Use local assets instead of S3:

# Download assets manually
mkdir -p ~/dev/data-designer-assets/datasets
cd ~/dev/data-designer-assets/datasets

export BUCKET_PATH="https://gretel-managed-assets-tmp-usw2.s3.us-west-2.amazonaws.com/datasets/"
curl -fL ${BUCKET_PATH}personal_details_streaming_1m.parquet -o personal_details_streaming_1m.parquet
curl -fL ${BUCKET_PATH}synthetic_personas_06_12_25_first_1000.parquet -o synthetic_personas_06_12_25.parquet

# Configure Data Designer to use local assets
export NEMO_MICROSERVICES_DATA_DESIGNER_ASSETS_STORAGE=~/dev/data-designer-assets

# Verify asset files exist
ls -la ~/dev/data-designer-assets/datasets/
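A truncated or interrupted download can leave a partial file that only fails later, at load time. Parquet files begin with the 4-byte magic `PAR1`, so a quick sanity check catches obvious corruption; `check_parquet` is an illustrative helper, and the path matches the example above:

```shell
# check_parquet FILE : succeed only if FILE starts with the Parquet magic bytes "PAR1"
check_parquet() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "PAR1" ]
}

# Example:
# check_parquet ~/dev/data-designer-assets/datasets/personal_details_streaming_1m.parquet
```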

Debug Mode#

Enable Debug Logging#

For detailed troubleshooting information:

# LOG_LEVEL must be passed through to the container (e.g. listed under
# `environment:` in the compose file) for this to take effect
export LOG_LEVEL=DEBUG
docker-compose up data-designer

Container Debugging#

Access the running container for debugging:

# Get container ID
docker ps | grep data-designer

# Access container shell
docker exec -it <container-id> /bin/bash

# Check environment variables
docker exec -it <container-id> env | grep NEMO

# Check mounted volumes
docker exec -it <container-id> ls -la /artifacts_root

Log Analysis#

# View detailed logs (`--details` is a `docker logs` flag, not a compose one)
docker logs --details data-designer

# Follow logs with timestamps
docker-compose logs -f -t data-designer

# Filter logs by level
docker-compose logs data-designer | grep ERROR

Performance Issues#

Slow Response Times#

Check these common causes:

  1. Asset Loading: Use local assets instead of S3

  2. Memory Constraints: Increase Docker memory allocation

  3. LLM Endpoint: Verify LLM service is responding quickly

# Test LLM endpoint response time
time curl -H "Authorization: Bearer ${NVIDIA_API_KEY}" \
  https://integrate.api.nvidia.com/v1/models

# Monitor container performance
docker stats data-designer --no-stream

High Memory Usage#

# Check memory usage patterns
docker stats --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

# Review container resource limits
docker inspect data-designer | grep -A 5 "Memory"

Service Connectivity Issues#

Health Check Failures#

# Test health endpoint
curl -v http://localhost:8000/health

# Check if service is listening
netstat -tlnp | grep :8000

# Verify container is running
docker ps | grep data-designer

API Endpoint Issues#

# Test API endpoints
curl -X POST http://localhost:8000/v1beta1/data-designer/preview \
  -H "Content-Type: application/json" \
  -d '{"config": {"columns": [{"name": "test", "type": "name"}]}}'

# Check API documentation
curl http://localhost:8000/docs

Data Issues#

Generation Failures#

Check logs for specific errors:

docker-compose logs data-designer | grep -i -A 5 -B 5 "error\|exception\|failed"

Common causes:

  • Invalid column configurations

  • LLM endpoint unavailable

  • Insufficient disk space for artifacts
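The last cause, insufficient disk space, is easy to check directly. A small sketch using POSIX `df -P`; `free_mb` is an illustrative helper, and the artifacts path is a placeholder as above:

```shell
# free_mb DIR : free space, in megabytes, on the filesystem holding DIR
free_mb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Example against the root filesystem; substitute the real artifacts path
free_mb /
```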

Asset File Issues#

# Verify asset files are accessible
docker exec -it data-designer ls -la /app/data/data-designer/datasets/

# Check file permissions
docker exec -it data-designer stat /app/data/data-designer/datasets/*.parquet

# Test asset loading manually
docker exec -it data-designer python -c "
import pandas as pd
df = pd.read_parquet('/app/data/data-designer/datasets/personal_details_streaming_1m.parquet')
print(f'Loaded {len(df)} records')
"

Limitations and Known Issues#

  • Development Use: Docker Compose deployment is recommended for development and testing, not production

  • Single Node: This setup runs on a single machine without high availability

  • Storage: Artifacts are stored locally; consider backup strategies for important data

  • Scalability: Limited horizontal scaling compared to Kubernetes deployments

  • Security: Default configuration may not meet production security requirements

For production deployments, consider using Kubernetes with proper ingress, storage classes, and security configurations.