vLLM | NVIDIA Dynamo Documentation

Dynamo vLLM integrates vLLM engines into Dynamo’s distributed runtime, enabling disaggregated serving, KV-aware routing, and request cancellation while maintaining full compatibility with vLLM’s native engine arguments. Dynamo leverages vLLM’s native KV cache events, NIXL-based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.

Installation

Install Latest Release

We recommend using uv to install:

$ uv venv --python 3.12 --seed
$ uv pip install "ai-dynamo[vllm]"

This installs Dynamo with the compatible vLLM version.

Container

We have public images available on NGC Catalog:

$ docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:<version>
$ ./container/run.sh -it --framework VLLM --image nvcr.io/nvidia/ai-dynamo/vllm-runtime:<version>

Build from source

$ python container/render.py --framework vllm --output-short-filename
$ docker build -f container/rendered.Dockerfile -t dynamo:latest-vllm .

$ ./container/run.sh -it --framework VLLM [--mount-workspace]

Development Setup

For development, use the devcontainer which has all dependencies pre-installed.

Feature Support Matrix

Feature	Status	Notes
Disaggregated Serving	✅	Prefill/decode separation with NIXL KV transfer
KV-Aware Routing	✅
SLA-Based Planner	✅
KVBM	✅
LMCache	✅
FlexKV	✅
Multimodal Support	✅	Via vLLM-Omni integration
Observability	✅	Metrics and monitoring
WideEP	✅	Support for DeepEP
DP Rank Routing	✅	Hybrid load balancing via external DP rank control
LoRA	✅	Dynamic loading/unloading from S3-compatible storage
GB200 Support	✅	Container functional on main

Quick Start

Start infrastructure services for local development:

$ docker compose -f deploy/docker-compose.yml up -d

Launch an aggregated serving deployment:

$ cd $DYNAMO_HOME/examples/backends/vllm
$ bash launch/agg.sh

Next Steps

Reference Guide: Configuration, arguments, and operational details
Examples: All deployment patterns with launch scripts
KV Cache Offloading: KVBM, LMCache, and FlexKV integrations
Observability: Metrics and monitoring
vLLM-Omni: Multimodal model serving
Kubernetes Deployment: Kubernetes deployment guide
vLLM Documentation: Upstream vLLM serve arguments