TensorRT-LLM

Use the Latest Release

We recommend using the latest stable release of Dynamo to avoid breaking changes.


Dynamo TensorRT-LLM integrates TensorRT-LLM engines into Dynamo’s distributed runtime, enabling disaggregated serving, KV-aware routing, multi-node deployments, and request cancellation. It supports LLM inference, multimodal models, video diffusion, and advanced features like speculative decoding and attention data parallelism.

Feature Support Matrix

Core Dynamo Features

| Feature | TensorRT-LLM | Notes |
| --- | --- | --- |
| Disaggregated Serving | ✅ | |
| Conditional Disaggregation | 🚧 | Not supported yet |
| KV-Aware Routing | ✅ | |
| SLA-Based Planner | ✅ | |
| Load Based Planner | 🚧 | Planned |
| KVBM | ✅ | |

Large Scale P/D and WideEP Features

| Feature | TensorRT-LLM | Notes |
| --- | --- | --- |
| WideEP | ✅ | |
| DP Rank Routing | ✅ | |
| GB200 Support | ✅ | |

Quick Start

Step 1 (host terminal): Start infrastructure services:

$ docker compose -f deploy/docker-compose.yml up -d
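To confirm the infrastructure services are reachable before moving on, a quick TCP probe works. This is a sketch: it assumes the compose file exposes etcd and NATS on their standard default ports (2379 and 4222), which you should verify against your `deploy/docker-compose.yml`.

```python
# Probe the infrastructure services started by docker compose.
# Port numbers are the standard etcd/NATS defaults and an assumption
# about this particular compose file.
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in [("etcd", 2379), ("NATS", 4222)]:
        status = "up" if port_open("localhost", port) else "down"
        print(f"{name} (:{port}): {status}")
```

`docker compose -f deploy/docker-compose.yml ps` gives the same information from the Docker side.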

Step 2 (host terminal): Pull and run the prebuilt container:

$ DYNAMO_VERSION=1.0.0
$ docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
$ docker run --gpus all -it --network host --ipc host \
>   nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION

Set DYNAMO_VERSION to any available version of the container. To find the available tensorrtllm-runtime tags, visit the NVIDIA NGC Catalog entry for the Dynamo TensorRT-LLM Runtime.

Step 3 (inside the container): Launch an aggregated serving deployment (uses Qwen/Qwen3-0.6B by default):

$ cd $DYNAMO_HOME/examples/backends/trtllm
$ ./launch/agg.sh

The launch script automatically downloads the model and starts the TensorRT-LLM engine. To serve a different model, set the MODEL_PATH and SERVED_MODEL_NAME environment variables before running the script.

Step 4 (host terminal): Verify the deployment:

$ curl localhost:8000/v1/chat/completions \
> -H "Content-Type: application/json" \
> -d '{
> "model": "Qwen/Qwen3-0.6B",
> "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
> "stream": true,
> "max_tokens": 30
> }'
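The same check can be scripted from Python with only the standard library. This is a sketch mirroring the curl call above; it assumes the default localhost:8000 address, and disables streaming so the response body parses as a single JSON object.

```python
# Minimal OpenAI-compatible chat client for verifying the deployment.
# Mirrors the curl example above; streaming is off so the response is
# one JSON object rather than server-sent events.
import json
import urllib.request

DEFAULT_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(model: str, prompt: str, max_tokens: int = 30) -> dict:
    """Build a chat completion payload matching the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }


def chat(model: str, prompt: str, url: str = DEFAULT_URL) -> dict:
    """POST the payload and return the parsed JSON response."""
    payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    try:
        reply = chat("Qwen/Qwen3-0.6B", "Say hello in one sentence.")
        print(reply["choices"][0]["message"]["content"])
    except OSError as exc:  # deployment not reachable
        print("request failed:", exc)
```

Swap in your own model name if you overrode SERVED_MODEL_NAME in Step 3.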

Kubernetes Deployment

You can deploy TensorRT-LLM with Dynamo on Kubernetes using a DynamoGraphDeployment. For more details, see the TensorRT-LLM Kubernetes Deployment Guide.

Next Steps