LLM Deployment using vLLM
Dynamo vLLM integrates vLLM engines into Dynamo's distributed runtime, enabling disaggregated (prefill/decode) serving, KV-aware routing, and request cancellation while remaining fully compatible with vLLM's native engine arguments. To do this, Dynamo builds on vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting.
Installation
Install Latest Release
We recommend installing with uv; this installs Dynamo together with a compatible vLLM version.
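A minimal sketch of the install, assuming the package is published as `ai-dynamo` with a `vllm` extra (verify the exact package name and extra against the current release notes):

```shell
# Create and activate a virtual environment with uv
uv venv
source .venv/bin/activate

# Install Dynamo with the vLLM extra; this resolves a compatible vLLM version
uv pip install "ai-dynamo[vllm]"
```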
Container
Public container images are available on the NGC Catalog.
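Pulling an image might look like the following; the repository path and tag here are hypothetical placeholders, so look up the current values in the NGC Catalog before running:

```shell
# Hypothetical image path and tag -- replace with the repository name
# and release tag listed on the NGC Catalog
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z
```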
Build from source
Development Setup
For development, use the devcontainer which has all dependencies pre-installed.
Feature Support Matrix
Quick Start
Start infrastructure services for local development:
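Dynamo's distributed runtime depends on etcd and NATS for coordination and messaging. One common way to bring them up locally is a docker compose file shipped with the repository; the path below is an assumption, so check your checkout:

```shell
# Start etcd and NATS in the background
# (compose file path may differ in your checkout)
docker compose -f deploy/docker-compose.yml up -d
```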
Launch an aggregated serving deployment:
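A minimal aggregated deployment runs the OpenAI-compatible frontend plus a single vLLM worker in one node. The module names, flags, and model id below are assumptions based on common Dynamo conventions; adjust them to your installed version:

```shell
# OpenAI-compatible frontend with KV-aware routing (assumed flags)
python -m dynamo.frontend --router-mode kv --http-port 8000 &

# One vLLM worker serving a Hugging Face model (any model id works)
python -m dynamo.vllm --model Qwen/Qwen3-0.6B
```

Once both processes are up, requests go through the standard OpenAI chat completions endpoint on port 8000.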
Next Steps
- Reference Guide: Configuration, arguments, and operational details
- Examples: All deployment patterns with launch scripts
- KV Cache Offloading: KVBM, LMCache, and FlexKV integrations
- Observability: Metrics and monitoring
- vLLM-Omni: Multimodal model serving
- Kubernetes Deployment: Deploying Dynamo with vLLM on Kubernetes
- vLLM Documentation: Upstream vLLM serve arguments