Copyright © 2026, NVIDIA Corporation.

TensorRT-LLM

Use the Latest Release

We recommend using the latest stable release of Dynamo to avoid breaking changes.


Dynamo TensorRT-LLM integrates TensorRT-LLM engines into Dynamo’s distributed runtime, enabling disaggregated serving, KV-aware routing, multi-node deployments, and request cancellation. It supports LLM inference, multimodal models, video diffusion, and advanced features like speculative decoding and attention data parallelism.

Feature Support Matrix

Core Dynamo Features

Feature | TensorRT-LLM | Notes
Disaggregated Serving | ✅ |
Conditional Disaggregation | 🚧 | Not supported yet
KV-Aware Routing | ✅ |
SLA-Based Planner | ✅ |
Load Based Planner | 🚧 | Planned
KVBM | ✅ |

Large Scale P/D and WideEP Features

Feature | TensorRT-LLM | Notes
WideEP | ✅ |
DP Rank Routing | ✅ |
GB200 Support | ✅ |

Quick Start

Step 1 (host terminal): Start infrastructure services:

$ docker compose -f deploy/docker-compose.yml up -d

Step 2 (host terminal): Pull and run the prebuilt container:

$ DYNAMO_VERSION=1.0.0
$ docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
$ docker run --gpus all -it --network host --ipc host \
    nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION

Set DYNAMO_VERSION to any published version of the container. To find the available tensorrtllm-runtime versions for Dynamo, visit the NVIDIA NGC Catalog for Dynamo TensorRT-LLM Runtime.

Step 3 (inside the container): Launch an aggregated serving deployment (uses Qwen/Qwen3-0.6B by default):

$ cd $DYNAMO_HOME/examples/backends/trtllm
$ ./launch/agg.sh

The launch script automatically downloads the model and starts the TensorRT-LLM engine. To serve a different model, set the MODEL_PATH and SERVED_MODEL_NAME environment variables before running the script.
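
As a minimal sketch of the override described above, the helper below runs the launch script with both variables set for that one invocation only. The function name launch_with_model is hypothetical and not part of Dynamo, and defaulting the served name to the model path is a convenience of this sketch, not documented behavior. It assumes DYNAMO_HOME is set, as it is in Step 3.

```shell
# Hypothetical helper (not part of Dynamo): run the aggregated launch script
# with MODEL_PATH and SERVED_MODEL_NAME overridden for this invocation only.
#   $1 = model path or identifier
#   $2 = served model name (optional; defaults to $1 in this sketch)
# The body runs in a subshell, so the caller's directory and variables are untouched.
launch_with_model() (
  model_path="$1"
  served_name="${2:-$1}"
  cd "$DYNAMO_HOME/examples/backends/trtllm" || exit 1
  MODEL_PATH="$model_path" SERVED_MODEL_NAME="$served_name" ./launch/agg.sh
)

# Example (inside the container):
#   launch_with_model "Qwen/Qwen3-0.6B"
```

Requests sent to the deployment would then presumably use the served name in their "model" field, as the request in Step 4 does with the default model.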

Step 4 (host terminal): Verify the deployment:

$ curl localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen/Qwen3-0.6B",
      "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
      "stream": true,
      "max_tokens": 30
    }'
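
For scripted checks, the same request can be wrapped in a small helper. The sketch below is illustrative only: build_chat_payload and check_deployment are hypothetical names, and it assumes the endpoint also accepts "stream": false and answers with a single JSON body, which this page does not confirm. With --fail, curl exits non-zero on an HTTP error, so the helper can gate later steps.

```shell
# Hypothetical helpers (not part of Dynamo) for a scripted version of Step 4.

# Print a chat-completions request body.
#   $1 = model name, $2 = prompt, $3 = max_tokens (optional, default 30)
# The prompt is inserted verbatim, so it must not contain double quotes or backslashes.
build_chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "stream": false, "max_tokens": %d}' \
    "$1" "$2" "${3:-30}"
}

# Send the request to the endpoint used in Step 4.
# Returns curl's exit status: non-zero if the server is unreachable or replies with an HTTP error.
check_deployment() {
  build_chat_payload "$1" "$2" | curl -sS --fail localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @-
}

# Example (host terminal, after Step 3 is running):
#   check_deployment "Qwen/Qwen3-0.6B" "Say hello" && echo "deployment OK"
```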

Kubernetes Deployment

You can deploy TensorRT-LLM with Dynamo on Kubernetes using a DynamoGraphDeployment. For more details, see the TensorRT-LLM Kubernetes Deployment Guide.

Next Steps

  • Reference Guide: Features, configuration, and operational details
  • Examples: All deployment patterns with launch scripts
  • KV Cache Transfer: KV cache transfer methods for disaggregated serving
  • Prometheus Metrics: Metrics and monitoring
  • Multinode Examples: Multi-node deployment with SLURM
  • Deploying TensorRT-LLM with Dynamo on Kubernetes: Kubernetes deployment guide