Multi-node Examples#

This guide covers deploying vLLM across multiple nodes using Dynamo’s distributed capabilities.

Prerequisites#

Multi-node deployments require:

  • Multiple nodes with GPU resources

  • Network connectivity between nodes (faster the better)

  • Firewall rules allowing NATS/ETCD communication

Infrastructure Setup#

Step 1: Start NATS/ETCD on Head Node#

Start the required services on your head node. These endpoints must be accessible from all worker nodes:

# On head node (node-1)
docker compose -f deploy/docker-compose.yml up -d

Default ports:

  • NATS: 4222

  • ETCD: 2379

Step 2: Configure Environment Variables#

Set the head node IP address and service endpoints. Set this on all nodes for easy copy-paste:

# Set this on ALL nodes - replace with your actual head node IP
export HEAD_NODE_IP="<your-head-node-ip>"

# Service endpoints (set on all nodes)
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"

Deployment Patterns#

Multi-node Aggregated Serving#

Deploy vLLM workers across multiple nodes for horizontal scaling:

Node 1 (Head Node): Run ingress and first worker

# Start ingress
python -m dynamo.frontend --router-mode kv

# Start vLLM worker
python -m dynamo.vllm \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enforce-eager

Node 2: Run additional worker

# Start vLLM worker
python -m dynamo.vllm \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enforce-eager

Multi-node Disaggregated Serving#

Deploy prefill and decode workers on separate nodes for optimized resource utilization:

Node 1: Run ingress and prefill workers

# Start ingress
python -m dynamo.frontend --router-mode kv &

# Start prefill worker
python -m dynamo.vllm \
  --model meta-llama/Llama-3.3-70B-Instruct
  --tensor-parallel-size 8 \
  --enforce-eager

Node 2: Run decode workers

# Start decode worker
python -m dynamo.vllm \
  --model meta-llama/Llama-3.3-70B-Instruct
  --tensor-parallel-size 8 \
  --enforce-eager \
  --is-prefill-worker

TODO#

Large Model Deployment#

For models requiring more GPUs than available on a single node such as tensor-parallel-size 16:

Node 1: First part of tensor-parallel model

# Start ingress
python -m dynamo.frontend --router-mode kv &