LLM Deployment using vLLM#
This directory contains the Dynamo vLLM engine and reference implementations for deploying Large Language Models (LLMs) in various configurations with vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
Use the Latest Release#
We recommend using the latest stable release of Dynamo to avoid breaking changes. You can find the latest release on the GitHub releases page and check out the corresponding tag with:
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
Feature Support Matrix#
Core Dynamo Features#
| Feature | vLLM | Notes |
|---|---|---|
| Disaggregated Serving | ✅ | |
| Conditional Disaggregation | 🚧 | WIP |
| KV-Aware Routing | ✅ | |
| SLA-Based Planner | ✅ | |
| Load Based Planner | 🚧 | WIP |
| KV Cache Block Manager | 🚧 | WIP |
Large Scale P/D and WideEP Features#
| Feature | vLLM | Notes |
|---|---|---|
| WideEP | ✅ | Support for PPLX / DeepEP not verified |
| Attention DP | ✅ | Supported via external control of DP ranks |
| GB200 Support | 🚧 | Container functional on main |
Quick Start#
Below we provide a guide that lets you run all of the common deployment patterns on a single node.
Start NATS and ETCD in the background#
Start them using Docker Compose:
docker compose -f deploy/docker-compose.yml up -d
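To verify both services are healthy before continuing, you can check them as sketched below; the health endpoints assume the compose file maps the default NATS monitoring port (8222) and etcd client port (2379) to the host:
docker compose -f deploy/docker-compose.yml ps

# Optional, assuming the default ports are exposed on the host:
curl -s localhost:8222/healthz   # NATS monitoring health check
curl -s localhost:2379/health    # etcd health check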
Pull or build container#
We have public images available on NGC Catalog. If you’d like to build your own container from source:
./container/build.sh --framework VLLM
Run container#
./container/run.sh -it --framework VLLM [--mount-workspace]
The container includes commit vllm-project/vllm#19790, which enables external control of the DP ranks.
Run Single Node Examples#
Important
Below we provide simple shell scripts that run the components for each configuration. Each script starts the ingress with python3 -m dynamo.frontend and the vLLM workers with python3 -m dynamo.vllm. You can also run each command in a separate terminal for better log visibility.
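For example, a minimal aggregated setup split across two terminals might look like the following sketch (any additional flags the launch scripts pass are omitted here):
# Terminal 1: start the HTTP ingress
python3 -m dynamo.frontend

# Terminal 2: start a vLLM worker
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B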
This figure shows an overview of the major components to deploy:
+------+ +-----------+ +------------------+ +---------------+
| HTTP |----->| dynamo |----->| vLLM Worker |------------>| vLLM Prefill |
| |<-----| ingress |<-----| |<------------| Worker |
+------+ +-----------+ +------------------+ +---------------+
| ^ |
query best | | return | publish kv events
worker | | worker_id v
| | +------------------+
| +---------| kv-router |
+------------->| |
+------------------+
Note: The figure above shows all of the components; which ones are actually spawned depends on the chosen deployment pattern.
Aggregated Serving#
# requires one gpu
cd components/backends/vllm
bash launch/agg.sh
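Once the workers are up, you can exercise any of these deployments with an OpenAI-compatible request; this sketch assumes the frontend listens on its default port 8000:
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32
  }'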
Aggregated Serving with KV Routing#
# requires two gpus
cd components/backends/vllm
bash launch/agg_router.sh
Disaggregated Serving#
# requires two gpus
cd components/backends/vllm
bash launch/disagg.sh
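Conceptually, the script wires together the ingress, a decode worker, and a prefill-only worker. A minimal sketch of the pattern (the actual launch/disagg.sh may pass additional flags):
# Sketch of the disaggregated pattern, not a verbatim copy of launch/disagg.sh
python3 -m dynamo.frontend &
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B &
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --is-prefill-worker &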
Disaggregated Serving with KV Routing#
# requires three gpus
cd components/backends/vllm
bash launch/disagg_router.sh
Single Node Data Parallel Attention / Expert Parallelism#
This example is not meant to be performant; it showcases Dynamo routing across data parallel workers.
# requires four gpus
cd components/backends/vllm
bash launch/dep.sh
Tip
Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
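For instance, assuming GPU 2 is free, an additional prefill worker could be started in a new terminal like this:
CUDA_VISIBLE_DEVICES=2 python3 -m dynamo.vllm \
  --model Qwen/Qwen3-0.6B \
  --is-prefill-worker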
Advanced Examples#
Below we provide a selected list of advanced deployments. Please open an issue if you'd like to see a specific example!
Kubernetes Deployment#
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the vLLM Kubernetes Deployment Guide.
Configuration#
vLLM workers are configured through command-line arguments. Key parameters include:
- --endpoint: Dynamo endpoint in the format dyn://namespace.component.endpoint
- --model: Model to serve (e.g., Qwen/Qwen3-0.6B)
- --is-prefill-worker: Enable prefill-only mode for disaggregated serving
- --metrics-endpoint-port: Port for publishing KV metrics to Dynamo

See args.py for the full list of configuration options and their defaults.
For the full set of vLLM CLI args, the vLLM documentation points to running vllm serve --help. We use the same argument parser as vLLM, so those arguments can be passed through to dynamo.vllm as well.
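As an illustration, a fully spelled-out worker invocation might combine Dynamo-specific flags with pass-through vLLM flags; the endpoint name, metrics port, and --gpu-memory-utilization value below are illustrative assumptions, not defaults taken from args.py:
# Hypothetical example: dyn:// names, port, and memory fraction are placeholders
python3 -m dynamo.vllm \
  --model Qwen/Qwen3-0.6B \
  --endpoint dyn://dynamo.backend.generate \
  --metrics-endpoint-port 9090 \
  --gpu-memory-utilization 0.90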
Request Migration#
In a distributed system, a request may fail due to connectivity issues between the Frontend and the Backend.
The Frontend automatically tracks which Backends it has connectivity issues with and avoids routing new requests to them.
For ongoing requests, there is a --migration-limit flag which can be set on the Backend to tell the Frontend how many times a request may be migrated to another Backend if connectivity to the current Backend is lost.
For example,
python3 -m dynamo.vllm ... --migration-limit=3
indicates that a request to this model may be migrated up to 3 times to another Backend, before failing the request, should the Frontend detect a connectivity issue with the current Backend.
A migrated request continues streaming its response to the original client, allowing a seamless transition between Backends and reducing the overall request failure rate at the Frontend for an enhanced user experience.