
Copyright © 2026, NVIDIA Corporation.

TensorRT-LLM

Use the Latest Release

We recommend using the latest stable release of Dynamo to avoid breaking changes.


Dynamo TensorRT-LLM integrates TensorRT-LLM engines into Dynamo’s distributed runtime, enabling disaggregated serving, KV-aware routing, multi-node deployments, and request cancellation. It supports LLM inference, multimodal models, video diffusion, and advanced features like speculative decoding and attention data parallelism.

Feature Support Matrix

Core Dynamo Features

| Feature | TensorRT-LLM | Notes |
| --- | --- | --- |
| Disaggregated Serving | ✅ | |
| Conditional Disaggregation | 🚧 | Not supported yet |
| KV-Aware Routing | ✅ | |
| SLA-Based Planner | ✅ | |
| Load Based Planner | 🚧 | Planned |
| KVBM | ✅ | |

Large Scale P/D and WideEP Features

| Feature | TensorRT-LLM | Notes |
| --- | --- | --- |
| WideEP | ✅ | |
| DP Rank Routing | ✅ | |
| GB200 Support | ✅ | |

Quick Start

Step 1 (host terminal): Start infrastructure services:

$ docker compose -f deploy/docker-compose.yml up -d

Step 2 (host terminal): Pull and run the prebuilt container:

$ DYNAMO_VERSION=1.0.0
$ docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
$ docker run --gpus all -it --network host --ipc host \
> nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION

Set DYNAMO_VERSION to any published version of the container. Available tensorrtllm-runtime versions are listed in the NVIDIA NGC Catalog entry for the Dynamo TensorRT-LLM Runtime.

Step 3 (inside the container): Launch an aggregated serving deployment (uses Qwen/Qwen3-0.6B by default):

$ cd $DYNAMO_HOME/examples/backends/trtllm
$ ./launch/agg.sh

The launch script automatically downloads the model and starts the TensorRT-LLM engine. To serve a different model, set the MODEL_PATH and SERVED_MODEL_NAME environment variables before running the script.
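As a minimal sketch of the override described above, the snippet below exports both variables and then runs the launch script. The values simply restate the documented default model, so replace them with your own. Two things here are assumptions rather than documented behavior: that MODEL_PATH takes a model identifier in this form, and that SERVED_MODEL_NAME is the name clients send in the "model" field of a request. The existence check around the script is added only for illustration, so the snippet exits cleanly when run outside the container.

```shell
# Sketch: override the model served by launch/agg.sh.
# The values below restate the documented default; replace them with your own.
export MODEL_PATH="Qwen/Qwen3-0.6B"         # model to load (value format is an assumption)
export SERVED_MODEL_NAME="Qwen/Qwen3-0.6B"  # assumed to be the name clients request

# DYNAMO_HOME is expected to be set inside the container.
launch_dir="${DYNAMO_HOME:-}/examples/backends/trtllm"

if [ -x "${launch_dir}/launch/agg.sh" ]; then
  cd "${launch_dir}" && ./launch/agg.sh
else
  # Guard added for illustration: gives a clear message outside the container.
  echo "launch/agg.sh not found under ${launch_dir}; run this inside the container" >&2
fi
```

Exporting the variables in the same shell that runs the script ensures the script inherits them. Setting them in a different terminal has no effect.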

Step 4 (host terminal): Verify the deployment:

$ curl localhost:8000/v1/chat/completions \
> -H "Content-Type: application/json" \
> -d '{
> "model": "Qwen/Qwen3-0.6B",
> "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
> "stream": true,
> "max_tokens": 30
> }'
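The deployment may not accept requests immediately after the launch script starts, because the model has to be downloaded and the engine started first. The following is a minimal sketch of a readiness poll that could run before the request above. The function name wait_for_http is hypothetical. The /v1/models path in the usage comment is an assumption based on the OpenAI-style shape of the chat completions endpoint and is not confirmed by this page.

```shell
# wait_for_http URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL with curl until it returns a successful HTTP status.
# Returns 0 on the first success, 1 if every attempt fails.
wait_for_http() {
  wfh_url=$1
  wfh_attempts=${2:-30}   # default: 30 attempts
  wfh_delay=${3:-2}       # default: 2 seconds between attempts
  wfh_i=1
  while [ "$wfh_i" -le "$wfh_attempts" ]; do
    # --fail makes curl exit non-zero on HTTP errors (status >= 400);
    # --max-time bounds each attempt so a hung connection cannot stall the loop.
    if curl --silent --fail --output /dev/null --max-time 5 "$wfh_url"; then
      return 0
    fi
    sleep "$wfh_delay"
    wfh_i=$((wfh_i + 1))
  done
  return 1
}

# Example usage (the endpoint path is an assumption, see the note above):
#   wait_for_http "http://localhost:8000/v1/models" 60 5 && echo "ready"
```

A bounded number of attempts is used so that a failed launch surfaces as a non-zero exit status instead of an endless wait.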

Kubernetes Deployment

You can deploy TensorRT-LLM with Dynamo on Kubernetes using a DynamoGraphDeployment. For more details, see the TensorRT-LLM Kubernetes Deployment Guide.

Next Steps

  • Reference Guide: Features, configuration, and operational details
  • Examples: All deployment patterns with launch scripts
  • KV Cache Transfer: KV cache transfer methods for disaggregated serving
  • Observability: Metrics and monitoring
  • Multinode Examples: Multi-node deployment with SLURM
  • Deploying TensorRT-LLM with Dynamo on Kubernetes: Kubernetes deployment guide