Local Installation
This guide walks through installing and running Dynamo on a local machine or VM with one or more GPUs. By the end, you’ll have a working OpenAI-compatible endpoint serving a model.
For production multi-node clusters, see the Kubernetes Deployment Guide. To build from source for development, see Building from Source.
System Requirements
TensorRT-LLM does not support Python 3.11.
For the full compatibility matrix including backend framework versions, see the Support Matrix.
Install Dynamo
Option A: Containers (Recommended)
Containers have all dependencies pre-installed. No setup required.
To run frontend and worker in the same container, either:

- Run processes in the background with `&` (see the Run Dynamo section below), or
- Open a second terminal and use `docker exec -it <container_id> bash`
See Release Artifacts for available versions and backend guides for run instructions: SGLang | TensorRT-LLM | vLLM
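As an illustration only (the image name and tag below are assumptions — substitute the actual image listed in Release Artifacts for your backend), launching a container looks like:

```shell
# Hypothetical image reference -- use a real tag from Release Artifacts
docker run --gpus all -it --rm --network host \
  nvcr.io/nvidia/ai-dynamo/vllm-runtime:latest
```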
Option B: Install from PyPI
Install system dependencies and the Dynamo wheel for your chosen backend:
SGLang
For CUDA 13 (B300/GB300), the container is recommended. See SGLang install docs for details.
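As a sketch, assuming the PyPI package is `ai-dynamo` with an `sglang` extra (verify the exact package and extra name against the SGLang install docs):

```shell
# Create and activate a virtual environment, then install the SGLang wheel
uv venv venv
source venv/bin/activate
uv pip install "ai-dynamo[sglang]"
```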
TensorRT-LLM
TensorRT-LLM requires pip due to a transitive Git URL dependency that uv doesn’t resolve. We recommend using the TensorRT-LLM container for broader compatibility. See the TRT-LLM backend guide for details.
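Since TensorRT-LLM needs pip rather than uv, a sketch (the `trtllm` extra name is an assumption; confirm it in the TRT-LLM backend guide):

```shell
# Plain venv + pip, since uv cannot resolve the Git URL dependency
python3 -m venv venv
source venv/bin/activate
pip install "ai-dynamo[trtllm]"
```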
vLLM
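A sketch for vLLM, assuming an `ai-dynamo[vllm]` extra (check PyPI for the exact extra name):

```shell
uv venv venv
source venv/bin/activate
uv pip install "ai-dynamo[vllm]"
```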
Run Dynamo
Discovery Backend
Dynamo components discover each other through a shared discovery backend. Two options are available: a file-based backend for single-machine setups, and etcd.
This guide uses --discovery-backend file. For etcd setup, see Service Discovery.
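Concretely, every component must be started with the same discovery backend. A sketch using the file backend (the module names are assumptions; see your backend guide for the exact commands):

```shell
# Frontend and worker must agree on the discovery backend
python -m dynamo.frontend --discovery-backend file &
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file
```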
Verify Installation (Optional)
Verify the CLI is installed and callable:
If you cloned the repository, you can run additional system checks:
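For example, assuming the frontend is exposed as a Python module (adjust to your install):

```shell
# Should print usage/help if the install succeeded
python -m dynamo.frontend --help
```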
Start the Frontend
To run in a single terminal (useful in containers), append `> logfile.log 2>&1 &` to run processes in the background:
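A sketch, assuming the frontend runs as a Python module:

```shell
# Foreground:
python -m dynamo.frontend

# Or in the background, with logs captured to a file:
python -m dynamo.frontend > logfile.log 2>&1 &
```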
Start a Worker
In another terminal (or same terminal if using background mode), start a worker for your chosen backend:
SGLang
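As an illustrative sketch only (the module and flag names are assumptions; see the SGLang backend guide for the real command):

```shell
python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B
```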
TensorRT-LLM
The warning `Cannot connect to ModelExpress server/transport error. Using direct download.` is expected in local deployments and can be safely ignored.
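As an illustrative sketch only (the module name is an assumption; see the TRT-LLM backend guide for the real command):

```shell
python -m dynamo.trtllm --model Qwen/Qwen3-0.6B
```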
vLLM
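A sketch for a vLLM worker serving the default small model (the module name is an assumption; see the vLLM backend guide for the exact command):

```shell
python -m dynamo.vllm --model Qwen/Qwen3-0.6B
```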
KV Events Configuration
For dependency-free local development, disable KV event publishing (avoids NATS):
- vLLM: Add `--kv-events-config '{"enable_kv_cache_events": false}'`
- SGLang: No flag needed (KV events disabled by default)
- TensorRT-LLM: No flag needed (KV events disabled by default)
vLLM automatically enables KV event publishing when prefix caching is active. In a future release, KV events will be disabled by default for all backends; start passing `--kv-events-config` explicitly to prepare.
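Putting it together for a local vLLM worker (the worker module name is an assumption; the flag itself is taken from the list above):

```shell
# Disable KV event publishing so no NATS instance is required locally
python -m dynamo.vllm --model Qwen/Qwen3-0.6B \
  --kv-events-config '{"enable_kv_cache_events": false}'
```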
Test Your Deployment
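Once the frontend and a worker are running, you can exercise the OpenAI-compatible endpoint with curl. A sketch (port 8000 is an assumption; use whatever address your frontend reports on startup):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'
```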
Troubleshooting
CUDA/driver version mismatch
Run nvidia-smi to check your driver version. Dynamo requires driver 575.51.03+ for CUDA 12 or 580.00.03+ for CUDA 13. B300/GB300 GPUs require CUDA 13. See the Support Matrix for full requirements.
Model doesn’t fit on GPU (OOM)
The default model Qwen/Qwen3-0.6B requires ~2GB of GPU memory; larger models need correspondingly more VRAM. Start with a small model and scale up based on your hardware.
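As a rough, backend-agnostic rule of thumb (not a Dynamo-specific formula): fp16/bf16 weights take about 2 bytes per parameter, before KV cache and runtime overhead are added on top:

```python
def weight_memory_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM for model weights alone (fp16/bf16 = 2 bytes/param).

    KV cache, activations, and runtime overhead come on top of this,
    so treat the result as a lower bound.
    """
    return params_billion * 1e9 * bytes_per_param / 2**30

# Qwen/Qwen3-0.6B: ~1.1 GiB of weights, in line with the ~2GB total
# footprint quoted above once overhead is included.
print(f"{weight_memory_gib(0.6):.1f} GiB")
```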
Python 3.11 with TensorRT-LLM
TensorRT-LLM does not support Python 3.11. If you see installation failures with TensorRT-LLM, check your Python version with python3 --version. Use Python 3.10 or 3.12 instead.
Container runs but GPU not detected
Ensure you passed --gpus all to docker run. Without this flag, the container won’t have access to GPUs:
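For example (`<image>` is a placeholder for your Dynamo image):

```shell
# --gpus all exposes the host GPUs; nvidia-smi inside the container
# should list them if the flag took effect
docker run --gpus all -it --rm <image> nvidia-smi
```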
Next Steps
- Backend Guides — Backend-specific configuration and features
- Disaggregated Serving — Scale prefill and decode independently
- KV Cache Aware Routing — Smart request routing
- Kubernetes Deployment — Production multi-node deployments