Docker Compose Pattern Examples#
Overview#
- This reference provides complete examples of common multi-container patterns in AI Workbench.
Each pattern addresses a specific use case with full compose file examples. Use these as starting points and adapt them to your specific requirements.
- The patterns cover three common scenarios: model selection, full pipelines, and custom microservices.
Model selection lets users choose between multiple NIMs based on available hardware. Full pipelines demonstrate complex multi-service AI applications. Custom microservices show how to build and connect your own services.
- All examples use real-world configurations from NVIDIA reference projects.
These patterns are tested and production-ready. Adapt image names, ports, and configurations to match your specific needs.
Pattern 1: NIM Model Selection with Profiles#
Use Case#
- Run one of several NIM models based on available GPU resources.
Users select a model size that fits their hardware. Multiple models share the same port and interface. Profiles enable easy switching without maintaining separate compose files.
- This pattern works well for:
Development and testing with different model sizes
Demos where users have varying hardware capabilities
Projects where model selection happens at deployment time
Key Features#
Multiple services with identical interfaces
Profile-based service selection
Variable GPU requirements per model
Shared network configuration
Model cache volume management
Example Configuration#
Compose file with three model variants:
services:
  llama-3.1-8b-instruct:
    image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - type: bind
        source: /tmp
        target: /opt/nim/.cache/
    environment:
      - NGC_API_KEY=${NVIDIA_API_KEY:?Error NVIDIA_API_KEY not set}
    networks:
      - app-network
    profiles:
      - meta/llama-3.1-8b-instruct

  llama-3.1-70b-instruct:
    image: nvcr.io/nim/meta/llama-3.1-70b-instruct:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - type: bind
        source: /tmp
        target: /opt/nim/.cache/
    environment:
      - NGC_API_KEY=${NVIDIA_API_KEY:?Error NVIDIA_API_KEY not set}
    networks:
      - app-network
    profiles:
      - meta/llama-3.1-70b-instruct

  llama-3.1-405b-instruct:
    image: nvcr.io/nim/meta/llama-3.1-405b-instruct:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 8
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - type: bind
        source: /tmp
        target: /opt/nim/.cache/
    environment:
      - NGC_API_KEY=${NVIDIA_API_KEY:?Error NVIDIA_API_KEY not set}
    networks:
      - app-network
    profiles:
      - meta/llama-3.1-405b-instruct

networks:
  app-network:
    driver: bridge
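The bind mount of /tmp means downloaded model weights may be discarded between sessions. One variation, shown here as an illustrative sketch rather than part of the reference configuration, is to replace the bind mount with a named Docker volume so the cache persists across restarts:

# Hypothetical variation: persist the NIM model cache in a named volume instead of /tmp
services:
  llama-3.1-8b-instruct:
    # ...same settings as above...
    volumes:
      - type: volume
        source: nim-cache
        target: /opt/nim/.cache/

volumes:
  nim-cache: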
Configuration Notes#
- GPU count varies by model size:
8B model: 1 GPU
70B model: 2 GPUs
405B model: 8 GPUs
- All models use the same port (8000):
Only one model can run at a time. This is intentional for easy model switching.
- Model cache mounted to /tmp:
Change source: /tmp to a dedicated directory for persistent caching. Ensure the directory has write permissions.
- NGC_API_KEY is required:
Set the NVIDIA_API_KEY secret in AI Workbench. The compose file validates this variable is set before starting.
- Profiles match model names:
Select the profile matching your desired model in the AI Workbench UI. Only the selected model’s service will start.
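For reference, when running the same compose file outside AI Workbench, a profile can be selected on the command line with the docker compose --profile flag (the profile names match the service definitions above):

# Start only the 8B service by enabling its profile
docker compose --profile meta/llama-3.1-8b-instruct up -d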
Using This Pattern#
- Step One: Configure the NVIDIA_API_KEY secret.
Select Project Tab > Environment > Secrets
Add NVIDIA_API_KEY with your NGC API key
- Step Two: Select a model profile.
Select Project Tab > Environment > Compose
Select profile from dropdown (e.g., meta/llama-3.1-8b-instruct)
- Step Three: Start the compose environment.
Select Start
Monitor logs for model download and startup
Wait for the service to become ready
Success: The selected NIM is accessible at http://localhost:8000
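Once the service reports ready, you can verify it with the OpenAI-compatible API that NIMs expose. The sketch below assumes the meta/llama-3.1-8b-instruct profile was selected; substitute the model name for other profiles:

# List the model(s) served by the running NIM
curl http://localhost:8000/v1/models

# Send a small chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'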
Pattern 2: Full RAG Pipeline with Multiple Services#
Use Case#
- Run a complete RAG system with ingestion, retrieval, generation, and frontend services.
Each component runs in its own container with specific GPU assignments. Multiple profiles enable running different subsets of services. Services communicate over a shared network with healthcheck dependencies.
- This pattern works well for:
Production RAG applications
End-to-end AI pipelines
Applications requiring multiple specialized models
Systems with document processing, vector search, and generation
Key Features#
Multiple GPU-accelerated services on different GPUs
Profile-based deployment modes (local, ingest, rag, vectordb, guardrails)
Service dependencies with healthchecks
Persistent storage with volumes
Web service integration with NVWB_TRIM_PREFIX
Example Configuration#
Compose file excerpt showing key services:
services:
  # LLM for response generation
  nim-llm:
    container_name: nim-llm-ms
    image: nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:1.13.1
    volumes:
      - ${MODEL_DIRECTORY:-/tmp}:/opt/nim/.cache
    ports:
      - "8999:8000"
    environment:
      NGC_API_KEY: ${NGC_API_KEY}
    shm_size: 20gb
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['${LLM_MS_GPU_ID:-1}']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "python3", "-c", "import requests; requests.get('http://localhost:8000/v1/health/ready')"]
      interval: 10s
      timeout: 20s
      retries: 100
    profiles: ["local"]

  # Embedding model
  nemoretriever-embedding-ms:
    container_name: nemoretriever-embedding-ms
    image: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.10.0
    volumes:
      - ${MODEL_DIRECTORY:-/tmp}:/opt/nim/.cache
    ports:
      - "9080:8000"
    environment:
      NGC_API_KEY: ${NGC_API_KEY}
    shm_size: 16GB
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['${EMBEDDING_MS_GPU_ID:-0}']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v1/health/ready"]
      interval: 30s
      timeout: 20s
      retries: 3
      start_period: 10m
    profiles: ["local"]

  # RAG orchestrator server
  rag-server:
    container_name: rag-server
    image: nvcr.io/nvidia/blueprint/rag-server:2.3.0
    command: --port 8081 --host 0.0.0.0 --workers 8
    environment:
      APP_VECTORSTORE_URL: "http://milvus:19530"
      APP_LLM_SERVERURL: "nim-llm:8000"
      APP_EMBEDDINGS_SERVERURL: "nemoretriever-embedding-ms:8000"
      NVIDIA_API_KEY: ${NGC_API_KEY}
    ports:
      - "8081:8081"
    shm_size: 5gb
    profiles: ["rag"]

  # Frontend UI
  rag-frontend:
    container_name: rag-frontend
    image: nvcr.io/nvidia/blueprint/rag-frontend:2.3.0
    ports:
      - "8090:3000"
    depends_on:
      - rag-server
    environment:
      VITE_API_CHAT_URL: "http://rag-server:8081/v1"
      NVWB_TRIM_PREFIX: "true"
    profiles: ["rag"]

  # Vector database (GPU-accelerated)
  milvus:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.5.3-gpu
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9010
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-./volumes/milvus}:/var/lib/milvus
    ports:
      - "19530:19530"
    depends_on:
      - etcd
      - minio
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['${VECTORSTORE_GPU_DEVICE_ID:-0}']
              capabilities: [gpu]
    profiles: ["vectordb"]

  # Supporting services
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.19
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-./volumes/etcd}:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    profiles: ["vectordb"]

  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2025-02-28T09-55-16Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    ports:
      - "9011:9011"
      - "9010:9010"
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-./volumes/minio}:/minio_data
    command: minio server /minio_data --console-address ":9011" --address ":9010"
    profiles: ["vectordb"]

  redis:
    image: redis/redis-stack
    ports:
      - "6379:6379"
    profiles: ["ingest"]

volumes:
  nim_cache:
    external: true

networks:
  default:
    name: nvidia-rag
Configuration Notes#
- Services are organized by profile:
local: GPU-accelerated inference services (NIMs)
rag: RAG orchestration and frontend
vectordb: Vector database and dependencies
ingest: Document ingestion pipeline
guardrails: Optional content safety services
- GPU assignment uses device_ids:
device_ids: ['${LLM_MS_GPU_ID:-1}'] assigns a specific GPU. Environment variables (LLM_MS_GPU_ID, EMBEDDING_MS_GPU_ID) control which GPU each service uses. The defaults after :- are used if the variables are not set.
- Services communicate by name:
APP_LLM_SERVERURL: "nim-llm:8000" connects to the nim-llm service. All services share the nvidia-rag network.
- Healthchecks ensure proper startup order:
Services with depends_on start after their dependencies. To make Compose wait until a dependency's healthcheck passes, use the long depends_on form with condition: service_healthy (see the sketch after these notes). Healthchecks use HTTP endpoints or curl commands.
- NVWB_TRIM_PREFIX enables proxy for frontend:
The rag-frontend service is accessible through AI Workbench’s proxy. Backend services do not need this variable.
- Volumes provide persistent storage:
Model caches in /opt/nim/.cache persist between restarts. Vector database and MinIO data are stored in ./volumes/.
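The excerpt above uses the short depends_on form, which only orders container startup. If you want Compose itself to hold a service back until its dependencies pass their healthchecks, the long form can be used. A minimal sketch, not taken from the reference blueprint:

  rag-server:
    depends_on:
      nim-llm:
        condition: service_healthy
      nemoretriever-embedding-ms:
        condition: service_healthy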
Using This Pattern#
- Step One: Set required environment variables.
Create a .env file in your project root:
NGC_API_KEY=your_ngc_api_key
MODEL_DIRECTORY=/path/to/model/cache
LLM_MS_GPU_ID=1
EMBEDDING_MS_GPU_ID=0
VECTORSTORE_GPU_DEVICE_ID=0
- Step Two: Create required volumes.
docker volume create nim_cache
mkdir -p volumes/milvus volumes/etcd volumes/minio
- Step Three: Select profiles in AI Workbench.
Select Project Tab > Environment > Compose
Select profiles: local, rag, vectordb
Select Start
- Step Four: Wait for all services to become healthy.
Monitor compose output for healthcheck status. NIMs may take several minutes to download and initialize.
Success: Access the RAG frontend through the AI Workbench proxy URL.
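For reference, the same multi-profile deployment can be driven from the command line outside AI Workbench; docker compose accepts repeated --profile flags, and docker compose ps reports healthcheck status while the NIMs initialize:

# Enable several profiles at once
docker compose --profile local --profile rag --profile vectordb up -d

# Watch container state and healthcheck status
docker compose ps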
Pattern 3: Custom Microservices with Build Contexts#
Use Case#
- Build and run custom application services alongside supporting infrastructure.
Your own code runs in containers built from Dockerfiles. Services communicate through a shared network and message queues. Supporting services provide databases, caching, and observability.
- This pattern works well for:
Custom AI applications with multiple components
Microservices architectures
Applications requiring specialized build steps
Integration with external APIs and services
Key Features#
Custom Dockerfiles with build contexts
Service-to-service communication via networks
Message queues and task systems (Celery, Redis)
Observability with tracing (Jaeger)
Persistent volumes for shared data
Example Configuration#
Compose file with custom services:
services:
  # Custom API service
  api-service:
    build:
      context: .
      dockerfile: services/APIService/Dockerfile
    ports:
      - "8002:8002"
    environment:
      - PDF_SERVICE_URL=http://pdf-service:8003
      - AGENT_SERVICE_URL=http://agent-service:8964
      - TTS_SERVICE_URL=http://tts-service:8889
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
      - pdf-service
      - agent-service
      - tts-service
    networks:
      - app-network

  # Agent service with GPU access
  agent-service:
    build:
      context: .
      dockerfile: services/AgentService/Dockerfile
    ports:
      - "8964:8964"
    environment:
      - NVIDIA_API_KEY=${NVIDIA_API_KEY}
      - REDIS_URL=redis://redis:6379
      - MODEL_CONFIG_PATH=/app/config/models.json
    volumes:
      - ./models.json:/app/config/models.json
    depends_on:
      - redis
    networks:
      - app-network

  # PDF processing service
  pdf-service:
    build:
      context: .
      dockerfile: services/PDFService/Dockerfile
    ports:
      - "8003:8003"
    environment:
      - REDIS_URL=redis://redis:6379
      - MODEL_API_URL=http://pdf-api:8004
    depends_on:
      - redis
      - pdf-api
    networks:
      - app-network

  # Celery worker for async tasks
  celery-worker:
    build:
      context: services/PDFService/PDFModelService
      dockerfile: Dockerfile.worker
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - CELERY_RESULT_BACKEND=redis://redis:6379/0
    volumes:
      - pdf_temp:/tmp/pdf_conversions
    depends_on:
      - redis
    restart: unless-stopped
    networks:
      - app-network

  # Supporting services
  redis:
    image: redis:latest
    ports:
      - "6379:6379"
    command: redis-server --appendonly no
    networks:
      - app-network

  minio:
    image: minio/minio:latest
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin
    volumes:
      - ./data/minio:/data
    command: minio server /data --console-address ":9001"
    networks:
      - app-network

  # Observability
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    networks:
      - app-network

volumes:
  pdf_temp:

networks:
  app-network:
    driver: bridge
Configuration Notes#
- Services use custom Dockerfiles:
build.context sets the build directory. build.dockerfile specifies the Dockerfile path. AI Workbench builds these images when starting compose (a hypothetical Dockerfile sketch follows these notes).
- Services communicate through service names:
PDF_SERVICE_URL=http://pdf-service:8003 references the pdf-service by name. All services must be on the same network.
- Redis provides message queue and caching:
Multiple services connect to the same Redis instance. Celery uses Redis as broker and result backend.
- Volumes share data between services:
The pdf_temp volume is shared between pdf-api and celery-worker. Bind mounts (./models.json) inject configuration files.
- Dependencies ensure startup order:
depends_on starts redis before the services that need it. It does not wait for a dependency to be healthy unless a healthcheck and condition: service_healthy are defined.
- Jaeger provides distributed tracing:
Services can send traces to Jaeger for observability. Jaeger UI accessible at http://localhost:16686.
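The reference compose file assumes Dockerfiles such as services/APIService/Dockerfile exist in your repository; they are not shown here. The following is a hypothetical sketch of what one could look like (the requirements.txt and uvicorn entry point are placeholders, not part of the original project). Paths in COPY are relative to the build context, which the compose file sets to the project root (.):

# services/APIService/Dockerfile -- hypothetical sketch, not the actual service Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached between code changes
COPY services/APIService/requirements.txt ./requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Copy the service code
COPY services/APIService/ .

# Port matches the "8002:8002" mapping in the compose file
EXPOSE 8002
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8002"]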
Using This Pattern#
- Step One: Ensure your Dockerfiles exist.
Verify all paths in build.dockerfile are correct. Test building images locally before using them in compose.
- Step Two: Set required environment variables.
Create a .env file:
NVIDIA_API_KEY=your_api_key
ELEVENLABS_API_KEY=your_elevenlabs_key
MAX_CONCURRENT_REQUESTS=5
- Step Three: Build and start services.
Select Project Tab > Environment > Compose
Select Start
First start will build all custom images (may take several minutes)
- Step Four: Verify services are running.
Check logs for each service. Test service endpoints to ensure communication works.
Success: All services are running and communicating through the shared network.
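A few commands that can help with this verification step, assuming you are working from the project directory:

# Check that every container is up (and healthy, where healthchecks are defined)
docker compose ps

# Tail the logs of an individual service, e.g. the custom API service
docker compose logs -f api-service

# Traces appear in the Jaeger UI on port 16686 (published in the compose file above)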
Best Practices Across All Patterns#
- Use named networks for better isolation:
Create explicit networks instead of relying on the default network. This makes service communication clearer and easier to debug.
- Define healthchecks for critical services:
Prevents dependent services from starting before dependencies are ready. Use HTTP endpoints or simple commands that verify service readiness.
- Use environment variables for configuration:
Reference secrets and configuration through ${VARIABLE_NAME} syntax. Set variables in AI Workbench or .env files. Never hardcode sensitive values in compose files.
- Pin image versions in production:
Use specific tags (image:1.2.3) instead of latest. Ensures reproducible deployments across environments.
- Use volumes for persistent data:
Model caches, databases, and application data should use volumes. Prevents data loss when containers restart.
- Organize services with profiles:
Group related services into profiles for different deployment scenarios. Enables flexible deployments without maintaining multiple compose files.
- Document GPU requirements clearly:
Comment GPU assignments and memory requirements. Helps users understand hardware requirements before deployment.
- Use service names for inter-service communication:
Services on the same network can reach each other by service name. Avoid using localhost or IP addresses for service-to-service calls.
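To close, here is a compact, hypothetical service definition (the image name, port, and health endpoint are placeholders, not from any reference project) that combines several of these practices: a pinned image tag, a required environment variable, a named network, a healthcheck, and a named volume.

services:
  example-service:                       # hypothetical service illustrating the practices above
    image: nvcr.io/nvidia/example:1.2.3  # pinned tag, not :latest
    environment:
      API_KEY: ${EXAMPLE_API_KEY:?EXAMPLE_API_KEY is not set}  # fail fast if the secret is missing
    networks:
      - app-network                      # explicit, named network
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5
    volumes:
      - example-data:/var/lib/example    # named volume for persistent data

volumes:
  example-data:

networks:
  app-network:
    driver: bridge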