For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Digest
  • Getting Started
    • Quickstart
    • Introduction
    • Local Installation
    • Building from Source
    • Contribution Guide
  • Resources
    • Support Matrix
    • Feature Matrix
    • Release Artifacts
    • Examples
  • Kubernetes Deployment
    • Deployment Guide
  • User Guides
    • KV Cache Aware Routing
    • Disaggregated Serving
    • KV Cache Offloading
    • Dynamo Benchmarking
    • Multimodal
    • Diffusion (Preview)
    • Tool Calling
    • LoRA Adapters
    • Agents
    • Observability (Local)
    • Fault Tolerance
    • Writing Python Workers
  • Backends
    • SGLang
    • TensorRT-LLM
    • vLLM
  • Components
    • Frontend
    • Router
    • Planner
    • Profiler
    • KVBM
  • Integrations
    • LMCache
    • SGLang HiCache
    • FlexKV
    • KV Events for Custom Engines
  • Design Docs
    • Overall Architecture
    • Architecture Flow
    • Disaggregated Serving
    • Distributed Runtime
    • Blog
  • Documentation
    • Dynamo Docs Guide
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
Digest
On this page
  • LLM Deployment using vLLM
  • Installation
  • Install Latest Release
  • Container
  • Development Setup
  • Feature Support Matrix
  • Quick Start
  • Next Steps
Backends

vLLM

||View as Markdown|
Edit this page
Previous

Known Issues and Mitigations

Next

Frontend

LLM Deployment using vLLM

Dynamo vLLM integrates vLLM engines into Dynamo’s distributed runtime, enabling disaggregated serving, KV-aware routing, and request cancellation while maintaining full compatibility with vLLM’s native engine arguments. Dynamo leverages vLLM’s native KV cache events, NIXL-based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.

Installation

Install Latest Release

We recommend using uv to install:

$uv venv --python 3.12 --seed
$uv pip install "ai-dynamo[vllm]"

This installs Dynamo with the compatible vLLM version.


Container

We have public images available on NGC Catalog:

$docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:<version>
$./container/run.sh -it --framework VLLM --image nvcr.io/nvidia/ai-dynamo/vllm-runtime:<version>
Build from source
$python container/render.py --framework vllm --output-short-filename
$docker build -f container/rendered.Dockerfile -t dynamo:latest-vllm .
$./container/run.sh -it --framework VLLM [--mount-workspace]

Development Setup

For development, use the devcontainer which has all dependencies pre-installed.

Feature Support Matrix

FeatureStatusNotes
Disaggregated Serving✅Prefill/decode separation with NIXL KV transfer
KV-Aware Routing✅
SLA-Based Planner✅
KVBM✅
LMCache✅
FlexKV✅
Multimodal Support✅Via vLLM-Omni integration
Observability✅Metrics and monitoring
WideEP✅Support for DeepEP
DP Rank Routing✅Hybrid load balancing via external DP rank control
LoRA✅Dynamic loading/unloading from S3-compatible storage
GB200 Support✅Container functional on main

Quick Start

Start infrastructure services for local development:

$docker compose -f deploy/docker-compose.yml up -d

Launch an aggregated serving deployment:

$cd $DYNAMO_HOME/examples/backends/vllm
$bash launch/agg.sh

Next Steps

  • Reference Guide: Configuration, arguments, and operational details
  • Examples: All deployment patterns with launch scripts
  • KV Cache Offloading: KVBM, LMCache, and FlexKV integrations
  • Observability: Metrics and monitoring
  • vLLM-Omni: Multimodal model serving
  • Kubernetes Deployment: Kubernetes deployment guide
  • vLLM Documentation: Upstream vLLM serve arguments