Local Installation
This guide walks through installing and running Dynamo on a local machine or VM with one or more GPUs. By the end, you’ll have a working OpenAI-compatible endpoint serving a model.
For production multi-node clusters, see the Kubernetes Deployment Guide. To build from source for development, see Building from Source.
System Requirements
TensorRT-LLM does not support Python 3.11.
For the full compatibility matrix including backend framework versions, see the Support Matrix.
Install Dynamo
Option A: Containers (Recommended)
Containers have all dependencies pre-installed. No setup required.
To run frontend and worker in the same container, either:

- Run processes in the background with `&` (see the Run Dynamo section below), or
- Open a second terminal and use `docker exec -it <container_id> bash`
See Release Artifacts for available versions and backend guides for run instructions: SGLang | TensorRT-LLM | vLLM
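As an illustration only (the image name and tag below are assumptions — substitute the actual image listed in Release Artifacts for your backend), launching a container looks like:

```shell
# Hypothetical image reference -- use a real tag from Release Artifacts
docker run --gpus all -it --rm --network host \
  nvcr.io/nvidia/ai-dynamo/vllm-runtime:latest
```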
Option B: Install from PyPI
Install system dependencies and the Dynamo wheel for your chosen backend:
SGLang
For CUDA 13 (B300/GB300), the container is recommended. See SGLang install docs for details.
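As a sketch, assuming the PyPI package is `ai-dynamo` with an `sglang` extra (verify the exact package and extra name against the SGLang install docs):

```shell
# Create and activate a virtual environment, then install the SGLang wheel
uv venv venv
source venv/bin/activate
uv pip install "ai-dynamo[sglang]"
```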
TensorRT-LLM
TensorRT-LLM requires pip due to a transitive Git URL dependency that uv doesn’t resolve. We recommend using the TensorRT-LLM container for broader compatibility. See the TRT-LLM backend guide for details.
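Since TensorRT-LLM needs pip rather than uv, a sketch (the `trtllm` extra name is an assumption; confirm it in the TRT-LLM backend guide):

```shell
# Plain venv + pip, since uv cannot resolve the Git URL dependency
python3 -m venv venv
source venv/bin/activate
pip install "ai-dynamo[trtllm]"
```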
vLLM
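A sketch for vLLM, assuming an `ai-dynamo[vllm]` extra (check PyPI for the exact extra name):

```shell
uv venv venv
source venv/bin/activate
uv pip install "ai-dynamo[vllm]"
```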
Run Dynamo
Discovery Backend
Dynamo components discover each other through a shared discovery backend. Two options are available: a file-based backend for single-machine setups, and etcd.
This guide uses --discovery-backend file. For etcd setup, see Service Discovery.
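Concretely, every component must be started with the same discovery backend. A sketch using the file backend (the module names are assumptions; see your backend guide for the exact commands):

```shell
# Frontend and worker must agree on the discovery backend
python -m dynamo.frontend --discovery-backend file &
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file
```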
Verify Installation (Optional)
Verify the CLI is installed and callable:
If you cloned the repository, you can run additional system checks:
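For example, assuming the frontend is exposed as a Python module (adjust to your install):

```shell
# Should print usage/help if the install succeeded
python -m dynamo.frontend --help
```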
Start the Frontend
To run in a single terminal (useful in containers), append `> logfile.log 2>&1 &` to run processes in the background:
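A sketch, assuming the frontend runs as a Python module:

```shell
# Foreground:
python -m dynamo.frontend

# Or in the background, with logs captured to a file:
python -m dynamo.frontend > logfile.log 2>&1 &
```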
Start a Worker
In another terminal (or same terminal if using background mode), start a worker for your chosen backend:
SGLang
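As an illustrative sketch only (the module and flag names are assumptions; see the SGLang backend guide for the real command):

```shell
python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B
```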
TensorRT-LLM
The warning `Cannot connect to ModelExpress server/transport error. Using direct download.` is expected in local deployments and can be safely ignored.
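As an illustrative sketch only (the module name is an assumption; see the TRT-LLM backend guide for the real command):

```shell
python -m dynamo.trtllm --model Qwen/Qwen3-0.6B
```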
vLLM
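A sketch for a vLLM worker serving the default small model (the module name is an assumption; see the vLLM backend guide for the exact command):

```shell
python -m dynamo.vllm --model Qwen/Qwen3-0.6B
```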
KV Events Configuration
For dependency-free local development, disable KV event publishing (avoids NATS):
- vLLM: Add `--kv-events-config '{"enable_kv_cache_events": false}'`
- SGLang: No flag needed (KV events disabled by default)
- TensorRT-LLM: No flag needed (KV events disabled by default)
vLLM automatically enables KV event publishing when prefix caching is active. In a future release, KV events will be disabled by default for all backends; start passing `--kv-events-config` explicitly to prepare.
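Putting it together for a local vLLM worker (the worker module name is an assumption; the flag itself is taken from the list above):

```shell
# Disable KV event publishing so no NATS instance is required locally
python -m dynamo.vllm --model Qwen/Qwen3-0.6B \
  --kv-events-config '{"enable_kv_cache_events": false}'
```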
Test Your Deployment
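Once the frontend and a worker are running, you can exercise the OpenAI-compatible endpoint with curl. A sketch (port 8000 is an assumption; use whatever address your frontend reports on startup):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'
```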
Troubleshooting
CUDA/driver version mismatch
Run nvidia-smi to check your driver version. Dynamo requires driver 575.51.03+ for CUDA 12 or 580.00.03+ for CUDA 13. B300/GB300 GPUs require CUDA 13. See the Support Matrix for full requirements.
Model doesn’t fit on GPU (OOM)
The default model Qwen/Qwen3-0.6B requires ~2GB of GPU memory; larger models need correspondingly more VRAM. Start with a small model and scale up based on your hardware.
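As a rough, backend-agnostic rule of thumb (not a Dynamo-specific formula): fp16/bf16 weights take about 2 bytes per parameter, before KV cache and runtime overhead are added on top:

```python
def weight_memory_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM for model weights alone (fp16/bf16 = 2 bytes/param).

    KV cache, activations, and runtime overhead come on top of this,
    so treat the result as a lower bound.
    """
    return params_billion * 1e9 * bytes_per_param / 2**30

# Qwen/Qwen3-0.6B: ~1.1 GiB of weights, in line with the ~2GB total
# footprint quoted above once overhead is included.
print(f"{weight_memory_gib(0.6):.1f} GiB")
```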
Python 3.11 with TensorRT-LLM
TensorRT-LLM does not support Python 3.11. If you see installation failures with TensorRT-LLM, check your Python version with python3 --version. Use Python 3.10 or 3.12 instead.
Container runs but GPU not detected
Ensure you passed --gpus all to docker run. Without this flag, the container won’t have access to GPUs:
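For example (`<image>` is a placeholder for your Dynamo image):

```shell
# --gpus all exposes the host GPUs; nvidia-smi inside the container
# should list them if the flag took effect
docker run --gpus all -it --rm <image> nvidia-smi
```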
Next Steps
- Backend Guides — Backend-specific configuration and features
- Disaggregated Serving — Scale prefill and decode independently
- KV Cache Aware Routing — Smart request routing
- Kubernetes Deployment — Production multi-node deployments