Copyright © 2026, NVIDIA Corporation.


Quickstart

Get a Dynamo OpenAI-compatible endpoint running in a container in about 5 minutes.


Choose Your Path

Quickstart

You’re here. Container fast path.

Local Installation

Full walkthrough — PyPI, configuration.

Kubernetes

Production multi-node clusters.

Build from Source

For contributors against main.

Dynamo is backend-agnostic — every install path works with SGLang, TensorRT-LLM, and vLLM. Pick the install path that fits your environment, then choose your backend.

Pull a Container

Containers have all dependencies pre-installed. Pick your backend:

SGLang | TensorRT-LLM | vLLM

Example for SGLang:

$ docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.2

Hugging Face token required for gated models. Llama, Kimi, Qwen-VL, and other gated models require HF_TOKEN in your environment, and you must accept the model card’s license on huggingface.co. Run export HF_TOKEN=hf_… before launching.

For container versions and tags, see Release Artifacts.

Start the Frontend

In your container, start the OpenAI-compatible frontend on port 8000:

$ python3 -m dynamo.frontend --discovery-backend file

The --discovery-backend file option removes the need for etcd. To run the frontend and the worker in the same terminal, run each command in the background by appending > logfile.log 2>&1 & to it.

Start a Worker

In another terminal, launch a worker for your backend:

SGLang | TensorRT-LLM | vLLM

Example for SGLang:

$ python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file
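
The same-terminal tip from the frontend step (appending > logfile.log 2>&1 & to each command) can be wrapped in a small helper. This is a minimal sketch, not part of Dynamo: start_bg and LAST_BG_PID are hypothetical names, and the log file names are arbitrary.

```shell
# start_bg LOGFILE CMD [ARGS...]
# Hypothetical helper (not a Dynamo command): runs CMD in the background,
# sends its stdout and stderr to LOGFILE, and records its PID in LAST_BG_PID.
start_bg() {
  logfile=$1
  shift
  "$@" > "$logfile" 2>&1 &
  LAST_BG_PID=$!
}

# Example usage inside the container (commands taken from the steps above):
#   start_bg frontend.log python3 -m dynamo.frontend --discovery-backend file
#   start_bg worker.log python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file
#   tail -f frontend.log worker.log   # follow both logs; Ctrl-C stops tail only
```

Recording the PID makes it possible to stop a process later with kill "$LAST_BG_PID" instead of searching for it by name.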

Verify and Test

Check the endpoint is up:

$ curl -sf http://localhost:8000/health && echo OK

If you see OK, send a chat completion:

$ curl localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen3-0.6B",
         "messages": [{"role": "user", "content": "Hello!"}],
         "max_tokens": 50}'
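
Because the endpoint is OpenAI-compatible, the reply text should sit at choices[0].message.content in the JSON response. The sketch below assumes that response shape and that python3 is on the PATH, as it is wherever the Dynamo commands above run. The name extract_reply is a hypothetical helper, not a Dynamo command.

```shell
# extract_reply: read a chat completion JSON response on stdin and print
# only the assistant's text. Assumes the OpenAI chat completions shape
# (choices[0].message.content); hypothetical helper, not part of Dynamo.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Example usage with the request shown above:
#   curl -s localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}' \
#     | extract_reply
```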

Connection refused? The frontend takes a few seconds to start, so wait briefly and retry. For production liveness and readiness probes, see Health Checks.
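
The retry advice can be automated with a small polling loop around the same /health check. This is a minimal sketch: wait_for_health is a hypothetical helper, and the defaults of 30 attempts with a 1-second pause are arbitrary assumptions, not Dynamo settings.

```shell
# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: polls URL with curl until it returns a success
# status, then prints OK. Gives up after ATTEMPTS tries (default 30),
# sleeping DELAY_SECONDS (default 1) after each failed try.
wait_for_health() {
  url=$1
  attempts=${2:-30}
  delay=${3:-1}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s silences progress output; -f makes curl exit non-zero on HTTP errors.
    if curl -sf "$url" > /dev/null; then
      echo OK
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}

# Example usage:
#   wait_for_health http://localhost:8000/health
```

A bounded number of attempts keeps a script from hanging forever if the frontend never comes up.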

From the Digest

Full-Stack Optimizations for Agentic Inference

How Dynamo optimizes for agentic workloads at three layers: the frontend API, the router, and KV cache management.

Flash Indexer: Inter-Galactic KV Routing

How Dynamo’s concurrent global index evolved through six iterations to sustain over 100M ops/sec.

Dive Deeper

Pick a full install path from the four options above, or explore how Dynamo works under the hood:

Architecture

How the frontend, router, and workers fit together.

Frontend Guide

Worker discovery, multi-model routing, OpenAI compatibility.

KV Cache Aware Routing

How the router places requests for prefix reuse.

Health Checks

Liveness and readiness probes for production deployments.