Quickstart | NVIDIA Dynamo Documentation

Choose Your Path

Quickstart

You’re here. Container fast path.

Local Installation

Full walkthrough — PyPI, configuration.

Kubernetes

Kubernetes-native production path.

Build from Source

For contributors against main.

Dynamo is backend-agnostic and Kubernetes-native without being Kubernetes-only. Use this container path to try the same frontend/router/worker stack locally; use the Kubernetes path when you want the operator, CRDs, Gateway API integration, autoscaling, scheduling, and cluster lifecycle management.

Run Dynamo Locally

Choose and install a build

This walkthrough uses the following build: NVIDIA GPU hardware, the SGLang backend (SGLang 0.5.14), and Dynamo 1.3.0, the latest stable release that supports this SGLang version, installed as a container.

$ docker run --gpus all --network host --ipc host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.3.0

Hugging Face token required for gated models. Llama, Kimi, Qwen-VL, and other gated models require HF_TOKEN in your environment, and you must accept the model card’s license on huggingface.co. Run export HF_TOKEN=hf_… before launching.
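Docker does not forward host environment variables into a container unless asked to, so a token exported on the host has to be passed explicitly. The following is a minimal sketch of one way to do that, assuming you export HF_TOKEN on the host and use the docker run command shown above. The function names require_hf_token and launch_dynamo_container are hypothetical helpers for illustration, not part of Dynamo.

```shell
# Hypothetical helper: fail early if HF_TOKEN is unset or empty.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; gated model downloads will fail" >&2
    return 1
  fi
}

# Hypothetical helper: same docker run command as above, plus -e HF_TOKEN.
# Passing -e with a name and no value copies the variable from the host shell.
launch_dynamo_container() {
  require_hf_token || return 1
  docker run --gpus all --network host --ipc host --rm -it \
    -e HF_TOKEN \
    nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.3.0
}

# Example:
#   export HF_TOKEN=hf_…
#   launch_dynamo_container
```

Exporting the token inside the container after it starts should work equally well; the sketch only saves a step when you relaunch often.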

The remaining steps run inside the selected container. If you installed an NVIDIA wheel instead, run the same python3 -m dynamo.* commands in your Python environment. See Local Installation for host prerequisites and virtual environment setup.

For published NVIDIA container versions and tags, see Release Artifacts.

Start the frontend

Start the OpenAI-compatible frontend on port 8000:

$ python3 -m dynamo.frontend --discovery-backend file

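--discovery-backend file avoids needing etcd. To run the frontend and worker in the same terminal, append > logfile.log 2>&1 & to each command so that it runs in the background and writes its output to a log file. Use a separate log file for each command.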
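The backgrounding pattern can be wrapped in a small shell function so that each process gets its own log file. This is a minimal sketch: start_bg is a hypothetical helper name, and the log file names are chosen for illustration.

```shell
# Hypothetical helper, not a Dynamo command.
# Usage: start_bg <name> <command> [args...]
# Runs the command in the background, writes stdout and stderr to <name>.log,
# and prints the background PID so the process can be stopped later with kill.
start_bg() {
  local name="$1"
  shift
  "$@" > "${name}.log" 2>&1 &
  echo "$!"
}

# Example (run inside the container):
#   start_bg frontend python3 -m dynamo.frontend --discovery-backend file
#   start_bg worker python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file
#   tail -f frontend.log worker.log
```

Keeping one log per process makes it easier to tell frontend startup messages apart from worker model-loading output.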

Start a worker

In another terminal, select the hardware and backend you installed, then launch the worker:

SGLang worker · NVIDIA GPU · Qwen/Qwen3-0.6B · file discovery

$ python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file

Verify the endpoint

Check that the endpoint is up:

$ curl -sf http://localhost:8000/health && echo OK
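Because the frontend takes a few seconds to start, a single check can fail with a refused connection. A minimal sketch of a polling loop around the same curl call follows; wait_for_health is a hypothetical helper name, and the default attempt count and delay are arbitrary choices, not Dynamo recommendations.

```shell
# Hypothetical helper, not part of Dynamo.
# Usage: wait_for_health [url] [attempts] [delay_seconds]
# Polls the URL until curl gets a successful HTTP response, or gives up.
wait_for_health() {
  local url="${1:-http://localhost:8000/health}"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s hides progress output; -f turns HTTP errors (status >= 400) into a
    # non-zero exit; --max-time bounds each attempt to 5 seconds.
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "endpoint not healthy after ${attempts} attempts: ${url}" >&2
  return 1
}

# Example:
#   wait_for_health && echo OK
```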

If you see OK, send a chat completion:

$ curl localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{"model": "Qwen/Qwen3-0.6B",
>        "messages": [{"role": "user", "content": "Hello!"}],
>        "max_tokens": 50}'

Connection refused? The frontend takes a few seconds to start, so wait briefly and retry. For production liveness and readiness probes, see Health Check Reference.
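The chat completion request above returns a JSON body. To print only the generated text, the response can be piped through a short filter. This sketch assumes the response follows the OpenAI chat completions shape, with the text at choices[0].message.content; extract_reply is a hypothetical helper name.

```shell
# Hypothetical helper: read a chat completion JSON body on stdin and print
# the first choice's message content.
# Assumes the OpenAI chat completions response shape.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Example:
#   curl -s localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}' \
#     | extract_reply
```

If the body is an error object rather than a completion, the filter fails with a Python KeyError, which is a useful signal to inspect the raw response.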

From the Digest

Full-Stack Optimizations for Agentic Inference

How Dynamo optimizes for agentic workloads at three layers: the frontend API, the router, and KV cache management.

Flash Indexer: Inter-Galactic KV Routing

How Dynamo’s concurrent global index evolved through six iterations to sustain over 100M ops/sec.

Dive Deeper

Pick a full install path from the four options above, or explore how Dynamo works under the hood:

Architecture

How the frontend, router, and workers fit together.

Frontend Guide

Worker discovery, multi-model routing, OpenAI compatibility.

KV Cache Aware Routing

How the router places requests for prefix reuse.