---
title: vLLM
---

# LLM Deployment using vLLM

Dynamo vLLM integrates [vLLM](https://github.com/vllm-project/vllm) engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, and request cancellation while maintaining full compatibility with vLLM's native engine arguments. Dynamo leverages vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting to enable KV-aware routing and prefill/decode (P/D) disaggregation.

## Installation

### Install Latest Release

We recommend using [uv](https://github.com/astral-sh/uv) to install:

```bash
uv venv --python 3.12 --seed
uv pip install "ai-dynamo[vllm]"
```

This installs Dynamo with the compatible vLLM version.

---

### Container

We have public images available on the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts):

```bash
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:
./container/run.sh -it --framework VLLM --image nvcr.io/nvidia/ai-dynamo/vllm-runtime:
```

Alternatively, build the container image locally:

```bash
python container/render.py --framework vllm --output-short-filename
docker build -f container/rendered.Dockerfile -t dynamo:latest-vllm .
```

Then run the container:

```bash
./container/run.sh -it --framework VLLM [--mount-workspace]
```

### Development Setup

For development, use the [devcontainer](https://github.com/ai-dynamo/dynamo/tree/main/.devcontainer), which has all dependencies pre-installed.
## Feature Support Matrix

| Feature | Status | Notes |
|---------|--------|-------|
| [**Disaggregated Serving**](/dynamo/dev/design-docs/disaggregated-serving) | ✅ | Prefill/decode separation with NIXL KV transfer |
| [**KV-Aware Routing**](/dynamo/dev/components/router) | ✅ | |
| [**SLA-Based Planner**](/dynamo/dev/components/planner/planner-guide) | ✅ | |
| [**KVBM**](/dynamo/dev/components/kvbm) | ✅ | |
| [**LMCache**](/dynamo/dev/integrations/lm-cache) | ✅ | |
| [**FlexKV**](/dynamo/dev/integrations/flex-kv) | ✅ | |
| [**Multimodal Support**](/dynamo/dev/user-guides/diffusion/v-llm-omni) | ✅ | Via vLLM-Omni integration |
| [**Observability**](/dynamo/dev/additional-resources/v-llm-details/observability) | ✅ | Metrics and monitoring |
| **WideEP** | ✅ | Support for DeepEP |
| **DP Rank Routing** | ✅ | [Hybrid load balancing](https://docs.vllm.ai/en/stable/serving/data_parallel_deployment/?h=external+dp#hybrid-load-balancing) via external DP rank control |
| [**LoRA**](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch/lora/README.md) | ✅ | Dynamic loading/unloading from S3-compatible storage |
| **GB200 Support** | ✅ | Container functional on main |

## Quick Start

Start infrastructure services for local development:

```bash
docker compose -f deploy/docker-compose.yml up -d
```

Launch an aggregated serving deployment:

```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg.sh
```

## Next Steps

- **[Reference Guide](/dynamo/dev/additional-resources/v-llm-details/reference-guide)**: Configuration, arguments, and operational details
- **[Examples](/dynamo/dev/additional-resources/v-llm-details/examples)**: All deployment patterns with launch scripts
- **[KV Cache Offloading](/dynamo/dev/additional-resources/v-llm-details/kv-cache-offloading)**: KVBM, LMCache, and FlexKV integrations
- **[Observability](/dynamo/dev/additional-resources/v-llm-details/observability)**: Metrics and monitoring
- **[vLLM-Omni](/dynamo/dev/user-guides/diffusion/v-llm-omni)**: Multimodal model serving
- **[Kubernetes Deployment](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)**: Kubernetes deployment guide
- **[vLLM Documentation](https://docs.vllm.ai/en/stable/)**: Upstream vLLM serve arguments
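Once the aggregated deployment from the Quick Start is running, it can be exercised through the frontend's OpenAI-compatible HTTP API. The sketch below is a minimal client; the base URL (`http://localhost:8000`) and model name are assumptions — adjust both to match your launch script:

```python
"""Minimal client sketch for Dynamo's OpenAI-compatible frontend.

The endpoint and model name below are assumptions; match them to your
deployment.
"""
import json
import urllib.request


def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> bytes:
    """Build an OpenAI-style /v1/chat/completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")


def chat(base_url: str, model: str, prompt: str) -> str:
    """POST a chat completion request and return the first choice's text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Assumes the aggregated deployment from launch/agg.sh is listening locally.
    print(chat("http://localhost:8000", "Qwen/Qwen3-0.6B", "Hello!"))
```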