Docker Compose Pattern Examples#

This page provides complete, tested compose file examples for common multi-container patterns in AI Workbench.

Pattern 1: NIM Model Selection with Profiles#

Use Case#

Run one of several NIM models based on available GPU resources.

Users select a model size that fits their hardware. Multiple models share the same port and interface. Profiles enable easy switching without maintaining separate compose files.

This pattern works well for:
  • Development and testing with different model sizes

  • Demos where users have varying hardware capabilities

  • Projects where model selection happens at deployment time

Key Features#

  • Multiple services with identical interfaces

  • Profile-based service selection

  • Variable GPU requirements per model

  • Shared network configuration

  • Model cache volume management

Example Configuration#

Compose file with three model variants:

services:
  llama-3.1-8b-instruct:
    image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - type: bind
        source: /tmp
        target: /opt/nim/.cache/
    environment:
      - NGC_API_KEY=${NVIDIA_API_KEY:?Error NVIDIA_API_KEY not set}
    networks:
      - app-network
    profiles:
      - meta/llama-3.1-8b-instruct

  llama-3.1-70b-instruct:
    image: nvcr.io/nim/meta/llama-3.1-70b-instruct:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - type: bind
        source: /tmp
        target: /opt/nim/.cache/
    environment:
      - NGC_API_KEY=${NVIDIA_API_KEY:?Error NVIDIA_API_KEY not set}
    networks:
      - app-network
    profiles:
      - meta/llama-3.1-70b-instruct

  llama-3.1-405b-instruct:
    image: nvcr.io/nim/meta/llama-3.1-405b-instruct:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 8
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - type: bind
        source: /tmp
        target: /opt/nim/.cache/
    environment:
      - NGC_API_KEY=${NVIDIA_API_KEY:?Error NVIDIA_API_KEY not set}
    networks:
      - app-network
    profiles:
      - meta/llama-3.1-405b-instruct

networks:
  app-network:
    driver: bridge

Configuration Notes#

GPU count varies by model size:
  • 8B model: 1 GPU

  • 70B model: 2 GPUs

  • 405B model: 8 GPUs

All models use the same port (8000):

Every service publishes host port 8000, so only one model can run at a time. This is intentional: clients keep the same endpoint when you switch models.

Model cache mounted to /tmp:

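Change source: /tmp to a dedicated host directory for persistent caching, since /tmp is often cleared on reboot and downloaded models would then be fetched again. Ensure the directory is writable by the container.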
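
A minimal sketch of a configurable cache mount. The variable name MODEL_DIRECTORY is borrowed from Pattern 2 and is an assumption here; only the keys that change are shown.

```yaml
services:
  llama-3.1-8b-instruct:
    volumes:
      - type: bind
        # Assumption: MODEL_DIRECTORY points to a writable host directory.
        # Compose falls back to /tmp when the variable is unset or empty.
        source: ${MODEL_DIRECTORY:-/tmp}
        target: /opt/nim/.cache/
```

With a bind mount, the source directory generally needs to exist on the host before the service starts.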

NGC_API_KEY is required:

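Set the NVIDIA_API_KEY secret in AI Workbench. The compose file passes its value to the container as NGC_API_KEY, and the ${NVIDIA_API_KEY:?...} syntax makes Compose stop with the given error message if the variable is unset or empty.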
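
The two interpolation forms used on this page behave differently when a variable is missing. The fragment below is a sketch for comparison; EXAMPLE_LOG_LEVEL is a hypothetical variable that does not appear in the original file.

```yaml
services:
  llama-3.1-8b-instruct:
    environment:
      # ":?" aborts with the message when NVIDIA_API_KEY is unset or empty.
      - NGC_API_KEY=${NVIDIA_API_KEY:?Error NVIDIA_API_KEY not set}
      # ":-" substitutes a default instead of failing.
      # EXAMPLE_LOG_LEVEL is hypothetical and only illustrates the syntax.
      - EXAMPLE_LOG_LEVEL=${EXAMPLE_LOG_LEVEL:-info}
```

Use the ":?" form for values the service cannot run without, and ":-" for optional settings with a sensible default.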

Profiles match model names:

Select the profile matching your desired model in the AI Workbench UI. Only the selected model’s service will start.
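
Because the three services differ only in image, GPU count, and profile, the shared settings can be factored out with a YAML anchor. This is a sketch, not the configuration shipped with AI Workbench: the extension field name x-nim-common is hypothetical, and whether your tooling preserves anchors when it rewrites the file should be verified.

```yaml
# Hypothetical extension field holding the settings every variant shares.
x-nim-common: &nim-common
  runtime: nvidia
  ports:
    - "8000:8000"
  volumes:
    - type: bind
      source: /tmp
      target: /opt/nim/.cache/
  environment:
    - NGC_API_KEY=${NVIDIA_API_KEY:?Error NVIDIA_API_KEY not set}
  networks:
    - app-network

services:
  llama-3.1-8b-instruct:
    <<: *nim-common   # merge the shared keys into this service
    image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    profiles:
      - meta/llama-3.1-8b-instruct

  llama-3.1-70b-instruct:
    <<: *nim-common
    image: nvcr.io/nim/meta/llama-3.1-70b-instruct:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    profiles:
      - meta/llama-3.1-70b-instruct

networks:
  app-network:
    driver: bridge
```

The merge key is shallow: a key defined in the service replaces the anchored value entirely, which is why the deploy block stays in each service rather than in the shared fragment.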

Pattern 2: Full RAG Pipeline with Multiple Services#

Use Case#

Run a complete RAG system with ingestion, retrieval, generation, and frontend services.

Each component runs in its own container with specific GPU assignments. Multiple profiles enable running different subsets of services. Services communicate over a shared network with healthcheck dependencies.

This pattern works well for:
  • Production RAG applications

  • End-to-end AI pipelines

  • Applications requiring multiple specialized models

  • Systems with document processing, vector search, and generation

Key Features#

  • Multiple GPU-accelerated services on different GPUs

  • Profile-based deployment modes (local, ingest, rag, vectordb, guardrails)

  • Service dependencies with healthchecks

  • Persistent storage with volumes

  • Web service integration with NVWB_TRIM_PREFIX

Example Configuration#

Compose file excerpt showing key services:

services:
  # LLM for response generation
  nim-llm:
    container_name: nim-llm-ms
    image: nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:1.13.1
    volumes:
      - ${MODEL_DIRECTORY:-/tmp}:/opt/nim/.cache
    ports:
      - "8999:8000"
    environment:
      NGC_API_KEY: ${NGC_API_KEY}
    shm_size: 20gb
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['${LLM_MS_GPU_ID:-1}']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "python3", "-c", "import requests; requests.get('http://localhost:8000/v1/health/ready').raise_for_status()"]
      interval: 10s
      timeout: 20s
      retries: 100
    profiles: ["local"]

  # Embedding model
  nemoretriever-embedding-ms:
    container_name: nemoretriever-embedding-ms
    image: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.10.0
    volumes:
      - ${MODEL_DIRECTORY:-/tmp}:/opt/nim/.cache
    ports:
      - "9080:8000"
    environment:
      NGC_API_KEY: ${NGC_API_KEY}
    shm_size: 16GB
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['${EMBEDDING_MS_GPU_ID:-0}']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v1/health/ready"]
      interval: 30s
      timeout: 20s
      retries: 3
      start_period: 10m
    profiles: ["local"]

  # RAG orchestrator server
  rag-server:
    container_name: rag-server
    image: nvcr.io/nvidia/blueprint/rag-server:2.3.0
    command: --port 8081 --host 0.0.0.0 --workers 8
    environment:
      APP_VECTORSTORE_URL: "http://milvus:19530"
      APP_LLM_SERVERURL: "nim-llm:8000"
      APP_EMBEDDINGS_SERVERURL: "nemoretriever-embedding-ms:8000"
      NVIDIA_API_KEY: ${NGC_API_KEY}
    ports:
      - "8081:8081"
    shm_size: 5gb
    profiles: ["rag"]

  # Frontend UI
  rag-frontend:
    container_name: rag-frontend
    image: nvcr.io/nvidia/blueprint/rag-frontend:2.3.0
    ports:
      - "8090:3000"
    depends_on:
      - rag-server
    environment:
      VITE_API_CHAT_URL: "http://rag-server:8081/v1"
      NVWB_TRIM_PREFIX: "true"
    profiles: ["rag"]

  # Vector database (GPU-accelerated)
  milvus:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.5.3-gpu
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9010
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
    ports:
      - "19530:19530"
    depends_on:
      - etcd
      - minio
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['${VECTORSTORE_GPU_DEVICE_ID:-0}']
              capabilities: [gpu]
    profiles: ["vectordb"]

  # Supporting services
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.19
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    profiles: ["vectordb"]

  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2025-02-28T09-55-16Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    ports:
      - "9011:9011"
      - "9010:9010"
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data --console-address ":9011" --address ":9010"
    profiles: ["vectordb"]

  redis:
    image: redis/redis-stack
    ports:
      - "6379:6379"
    profiles: ["ingest"]

volumes:
  nim_cache:
    external: true

networks:
  default:
    name: nvidia-rag

Configuration Notes#

Services are organized by profile:
  • local: GPU-accelerated inference services (NIMs)

  • rag: RAG orchestration and frontend

  • vectordb: Vector database and dependencies

  • ingest: Document ingestion pipeline

  • guardrails: Optional content safety services
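
A short sketch of how profile membership decides what starts. The status-page service and the second profile on redis are hypothetical and are not part of the original file.

```yaml
services:
  # No "profiles" key: this service starts regardless of the selected profiles.
  # Hypothetical service, shown only to illustrate the rule.
  status-page:
    image: nginx:1.27

  # Listed under two profiles: starts when either "ingest" or "rag" is selected.
  # The original file lists only "ingest".
  redis:
    image: redis/redis-stack
    ports:
      - "6379:6379"
    profiles: ["ingest", "rag"]
```

Outside the AI Workbench UI, several profiles can be activated together, for example with docker compose --profile vectordb --profile rag up -d or by setting COMPOSE_PROFILES=vectordb,rag.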

GPU assignment uses device_ids:

device_ids: ['${LLM_MS_GPU_ID:-1}'] assigns a specific GPU. Environment variables (LLM_MS_GPU_ID, EMBEDDING_MS_GPU_ID, VECTORSTORE_GPU_DEVICE_ID) control which GPU each service uses. The value after :- is the default used when the variable is unset or empty.
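
For comparison, the fragment below contrasts the two ways to reserve GPUs used on this page. It is a sketch: the service name any-gpu-service is hypothetical.

```yaml
services:
  nim-llm:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              # Pin the service to one specific GPU by index (default: GPU 1).
              device_ids: ['${LLM_MS_GPU_ID:-1}']
              capabilities: [gpu]

  # Hypothetical service, shown only for contrast.
  any-gpu-service:
    image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              # Reserve a number of GPUs without choosing which ones.
              # count and device_ids are mutually exclusive in one entry.
              count: 1
              capabilities: [gpu]
```

Pinning with device_ids keeps the LLM and embedding services from competing for the same GPU, while count is simpler when any free GPU will do.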

Services communicate by name:

APP_LLM_SERVERURL: "nim-llm:8000" connects to the nim-llm service. All services share the nvidia-rag network.
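
The fragment below repeats two settings from the example to show which port applies where. Only the comments are new.

```yaml
services:
  nim-llm:
    ports:
      - "8999:8000"   # host port 8999 -> container port 8000

  rag-server:
    environment:
      # Container-to-container traffic uses the service name and the
      # container port (8000), not the published host port (8999).
      APP_LLM_SERVERURL: "nim-llm:8000"
```

From the host machine, the same LLM endpoint is reached at http://localhost:8999.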

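Healthchecks report service readiness:

The healthchecks in this excerpt probe each NIM's /v1/health/ready endpoint, using Python's requests library in one service and curl in the other. The depends_on entries use the short list form, which only waits for the dependency's container to start. To make a service wait until a dependency is healthy, use the long form of depends_on with condition: service_healthy.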
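
A sketch of the long depends_on form, applied to the vectordb profile. The etcd healthcheck is an assumption added for illustration and is not part of the excerpt; it relies on the etcdctl binary that ships in the etcd image.

```yaml
services:
  etcd:
    healthcheck:
      # Assumption: not in the original excerpt.
      test: ["CMD", "etcdctl", "endpoint", "health"]
      interval: 30s
      timeout: 20s
      retries: 3

  milvus:
    depends_on:
      etcd:
        # Wait until the etcd healthcheck passes.
        condition: service_healthy
      minio:
        # Only wait for the container to start (same as the short form).
        condition: service_started
```

A dependency referenced with service_healthy must define a healthcheck, otherwise Compose rejects the configuration.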

NVWB_TRIM_PREFIX enables proxy for frontend:

The rag-frontend service is accessible through AI Workbench’s proxy. Backend services do not need this variable.

Volumes provide persistent storage:

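Model caches mounted at /opt/nim/.cache persist between restarts when MODEL_DIRECTORY points to a durable host directory; the /tmp default may be cleared on reboot. Milvus, etcd, and MinIO data are stored under ./volumes/ by default.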
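
The excerpt declares an external nim_cache volume without mounting it. A sketch of how a service could use it in place of the host-path mount, assuming that is the intent:

```yaml
services:
  nim-llm:
    volumes:
      # Named volume instead of a host path; Docker manages its location.
      - nim_cache:/opt/nim/.cache

volumes:
  nim_cache:
    # external: true means Compose expects the volume to exist already
    # and will neither create nor remove it.
    external: true
```

An external volume has to be created beforehand, for example with docker volume create nim_cache.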

Pattern 3: Custom Microservices with Build Contexts#

Use Case#

Build and run custom application services alongside supporting infrastructure.

Your own code runs in containers built from Dockerfiles. Services communicate through a shared network and message queues. Supporting services provide databases, caching, and observability.

This pattern works well for:
  • Custom AI applications with multiple components

  • Microservices architectures

  • Applications requiring specialized build steps

  • Integration with external APIs and services

Key Features#

  • Custom Dockerfiles with build contexts

  • Service-to-service communication via networks

  • Message queues and task systems (Celery, Redis)

  • Observability with tracing (Jaeger)

  • Persistent volumes for shared data

Example Configuration#

Compose file excerpt with custom services. The tts-service and pdf-api definitions referenced in depends_on are not shown:

services:
  # Custom API service
  api-service:
    build:
      context: .
      dockerfile: services/APIService/Dockerfile
    ports:
      - "8002:8002"
    environment:
      - PDF_SERVICE_URL=http://pdf-service:8003
      - AGENT_SERVICE_URL=http://agent-service:8964
      - TTS_SERVICE_URL=http://tts-service:8889
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
      - pdf-service
      - agent-service
      - tts-service
    networks:
      - app-network

  # Agent service (uses NVIDIA_API_KEY; no GPU reservation is declared here)
  agent-service:
    build:
      context: .
      dockerfile: services/AgentService/Dockerfile
    ports:
      - "8964:8964"
    environment:
      - NVIDIA_API_KEY=${NVIDIA_API_KEY}
      - REDIS_URL=redis://redis:6379
      - MODEL_CONFIG_PATH=/app/config/models.json
    volumes:
      - ./models.json:/app/config/models.json
    depends_on:
      - redis
    networks:
      - app-network

  # PDF processing service
  pdf-service:
    build:
      context: .
      dockerfile: services/PDFService/Dockerfile
    ports:
      - "8003:8003"
    environment:
      - REDIS_URL=redis://redis:6379
      - MODEL_API_URL=http://pdf-api:8004
    depends_on:
      - redis
      - pdf-api
    networks:
      - app-network

  # Celery worker for async tasks
  celery-worker:
    build:
      context: services/PDFService/PDFModelService
      dockerfile: Dockerfile.worker
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - CELERY_RESULT_BACKEND=redis://redis:6379/0
    volumes:
      - pdf_temp:/tmp/pdf_conversions
    depends_on:
      - redis
    restart: unless-stopped
    networks:
      - app-network

  # Supporting services
  redis:
    image: redis:latest
    ports:
      - "6379:6379"
    command: redis-server --appendonly no
    networks:
      - app-network

  minio:
    image: minio/minio:latest
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin
    volumes:
      - ./data/minio:/data
    command: minio server /data --console-address ":9001"
    networks:
      - app-network

  # Observability
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "4317:4317"    # OTLP GRPC
      - "4318:4318"    # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    networks:
      - app-network

volumes:
  pdf_temp:

networks:
  app-network:
    driver: bridge

Configuration Notes#

Services use custom Dockerfiles:

build.context sets the build directory. build.dockerfile specifies the Dockerfile path. AI Workbench builds these images when starting compose.
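
The two build styles from the example, annotated to show how the paths resolve. The image key and its tag are hypothetical additions for illustration.

```yaml
services:
  api-service:
    build:
      # The context is the project root, so COPY instructions in the
      # Dockerfile can reach any file under it.
      context: .
      # Resolved relative to the context directory.
      dockerfile: services/APIService/Dockerfile
    # Hypothetical: when build and image are both set, the built image
    # is tagged with this name.
    image: example/api-service:dev

  celery-worker:
    build:
      # Narrower context: only this subdirectory is sent to the builder.
      context: services/PDFService/PDFModelService
      # Resolves to services/PDFService/PDFModelService/Dockerfile.worker
      dockerfile: Dockerfile.worker
```

A narrower context usually builds faster, but the Dockerfile can then only copy files from inside that subdirectory.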

Services communicate through service names:

PDF_SERVICE_URL=http://pdf-service:8003 references the pdf-service by name. All services must be on the same network.

Redis provides message queue and caching:

Multiple services connect to the same Redis instance. Celery uses Redis as broker and result backend.

Volumes share data between services:

The pdf_temp named volume is shared between celery-worker and pdf-api (the latter is not shown in the example). Bind mounts such as ./models.json inject configuration files.

Dependencies ensure startup order:

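depends_on starts redis before the services that need it. In the short list form used here, it does not wait for a dependency to be healthy; that requires a healthcheck on the dependency plus the long form of depends_on with condition: service_healthy.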
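
A sketch of a readiness gate for Redis. The healthcheck values are assumptions chosen for illustration; redis-cli ships in the official redis image.

```yaml
services:
  redis:
    image: redis:latest
    healthcheck:
      # Fails while the server is not yet accepting connections.
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  celery-worker:
    depends_on:
      redis:
        # Start the worker only after the Redis healthcheck passes.
        condition: service_healthy
```

The same depends_on block can be applied to api-service, agent-service, and pdf-service, which also connect to Redis at startup.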

Jaeger provides distributed tracing:

Services can send traces to Jaeger for observability. Jaeger UI accessible at http://localhost:16686.
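
A sketch of pointing a service at Jaeger, assuming the service is instrumented with an OpenTelemetry SDK that reads the standard OTEL_* environment variables. The original file does not show how the services export traces.

```yaml
services:
  api-service:
    environment:
      # Assumption: the application uses an OpenTelemetry SDK.
      # 4317 is the OTLP gRPC port exposed by the jaeger service.
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
      - OTEL_EXPORTER_OTLP_PROTOCOL=grpc
      # Name under which traces appear in the Jaeger UI.
      - OTEL_SERVICE_NAME=api-service
```

For OTLP over HTTP, the endpoint would use port 4318 and the protocol http/protobuf instead.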

Best Practices Across All Patterns#

Use named networks for better isolation:

Create explicit networks instead of relying on the default network. Makes service communication explicit and easier to debug.

Define healthchecks for critical services:

Combined with depends_on and condition: service_healthy, a healthcheck prevents dependent services from starting before their dependencies are ready. Use HTTP endpoints or simple commands that verify service readiness.

Use environment variables for configuration:

Reference secrets and configuration through ${VARIABLE_NAME} syntax. Set variables in AI Workbench or .env files. Never hardcode sensitive values in compose files.

Pin image versions in production:

Use specific tags (image:1.2.3) instead of latest. Ensures reproducible deployments across environments.

Use volumes for persistent data:

Model caches, databases, and application data should use volumes. Prevents data loss when containers are removed or recreated.

Organize services with profiles:

Group related services into profiles for different deployment scenarios. Enables flexible deployments without maintaining multiple compose files.

Document GPU requirements clearly:

Comment GPU assignments and memory requirements. Helps users understand hardware requirements before deployment.

Use service names for inter-service communication:

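Services on the same network can reach each other by service name. Avoid localhost or IP addresses for service-to-service calls: inside a container, localhost refers to that container itself, and container IP addresses can change between runs.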