TensorRT-LLM

Use the Latest Release

We recommend using the latest stable release of Dynamo to avoid breaking changes.


Dynamo TensorRT-LLM integrates TensorRT-LLM engines into Dynamo’s distributed runtime, enabling disaggregated serving, KV-aware routing, multi-node deployments, and request cancellation. It supports LLM inference, multimodal models, video diffusion, and advanced features like speculative decoding and attention data parallelism.

Feature Support Matrix

Core Dynamo Features

Feature | TensorRT-LLM | Notes
Disaggregated Serving | |
Conditional Disaggregation | 🚧 | Not supported yet
KV-Aware Routing | |
SLA-Based Planner | |
Load Based Planner | 🚧 | Planned
KVBM | |

Large Scale P/D and WideEP Features

Feature | TensorRT-LLM | Notes
WideEP | |
DP Rank Routing | |
GB200 Support | |

Prerequisites

  • yq for in-place YAML edits. Install the standalone binary with wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O /usr/local/bin/yq && chmod +x /usr/local/bin/yq, or run pip install yq. Note that the pip package is a different tool that shares the name and has similar syntax. If neither is available, a sed fallback is shown inline where yq is used.

Container / driver matrix

Container tag | Backend version | CUDA | Min NVIDIA driver
tensorrtllm-runtime:1.0.2 | TRT-LLM v1.3.0rc5.post1 | v13.1 | 580+
vllm-runtime:1.0.2 | vLLM v0.16.0 | v12.9 | 575+
vllm-runtime:1.0.2-cuda13 | vLLM v0.16.0 | v13.0 | 580+
sglang-runtime:1.0.2 | SGLang v0.5.9 | v12.9 | 575+
sglang-runtime:1.0.2-cuda13 | SGLang v0.5.9 | v13.0 | 580+

Source of truth: docs/reference/support-matrix.md and docs/reference/release-artifacts.md. If those differ from the values above, the source-of-truth files win.
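
Before pulling a container, it can help to confirm that the host driver meets the minimum in the matrix above. The following is a minimal sketch, not part of Dynamo: the helper name driver_meets_min is hypothetical, and it assumes that comparing only the major driver number (for example 580 in 580.65.06) is enough to interpret a "580+" requirement.

```shell
# Hypothetical helper: succeed if a driver version string such as "580.65.06"
# has a major number greater than or equal to the required minimum.
driver_meets_min() {
  version="$1"
  min_major="$2"
  major="${version%%.*}"          # keep everything before the first dot
  case "$major" in
    ''|*[!0-9]*) return 2 ;;      # empty or non-numeric: report an error
  esac
  [ "$major" -ge "$min_major" ]
}

# Usage on a host with the NVIDIA driver installed (580 is the minimum listed
# above for tensorrtllm-runtime:1.0.2):
#   installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   if driver_meets_min "$installed" 580; then echo "driver OK"; else echo "driver too old"; fi
```

The nvidia-smi query is left as a comment so the helper can be sourced on machines without a GPU.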

Quick Start

Step 1 (host terminal): Start infrastructure services:

$ docker compose -f dev/docker-compose.yml up -d

Step 2 (host terminal): Pull and run the prebuilt container:

$ DYNAMO_VERSION=1.0.2
$ docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
$ docker run --gpus all -it --network host --ipc host \
> nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION

Set DYNAMO_VERSION to any available version of the container. To find the available tensorrtllm-runtime versions for Dynamo, visit the NVIDIA NGC Catalog for Dynamo TensorRT-LLM Runtime.

Step 3 (inside the container): Launch an aggregated serving deployment (uses Qwen/Qwen3-0.6B by default):

$ cd $DYNAMO_HOME/examples/backends/trtllm
$ ./launch/agg.sh

The launch script will automatically download the model and start the TensorRT-LLM engine. You can override the model by setting MODEL_PATH and SERVED_MODEL_NAME environment variables before running the script.

Step 4 (host terminal): Verify the deployment:

$ curl localhost:8000/v1/chat/completions \
> -H "Content-Type: application/json" \
> -d '{
> "model": "Qwen/Qwen3-0.6B",
> "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
> "stream": true,
> "max_tokens": 30
> }'
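
Because the engine may still be loading when the request above is first sent, a small retry loop can make verification less error-prone. This is a sketch only: wait_until is a hypothetical helper, and the /v1/models probe in the usage comment is an assumption based on common OpenAI-compatible servers rather than something stated on this page.

```shell
# Hypothetical helper: run a command repeatedly until it succeeds.
# Arguments: <max attempts> <seconds between attempts> <command...>
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Discard the probe's output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt is still to come.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (the endpoint is an assumption; adjust to whatever your frontend exposes):
#   wait_until 60 5 curl -sf -o /dev/null localhost:8000/v1/models \
#     && echo "frontend is answering" \
#     || echo "gave up waiting"
```

Once the probe succeeds, the chat completion request from Step 4 can be sent as shown.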

Deploy

Deploy TensorRT-LLM with Dynamo on Kubernetes using a DynamoGraphDeployment. Before running kubectl apply, replace the container image tag in the deployment YAML with the tag you want to deploy. The first command uses yq; the sed command is a fallback for environments without yq:

$ # yq
$ yq -i '(.spec.services[].extraPodSpec.mainContainer.image) |= sub(":1\.0\.2", ":<your-tag>")' deploy.yaml
$ # sed fallback
$ sed -i.bak 's|:1\.0\.2|:<your-tag>|g' deploy.yaml
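
To see what the sed fallback does before touching a real manifest, the substitution can be tried on a throwaway file. The sketch below makes several assumptions: the manifest fragment is heavily trimmed and hypothetical (the service name Worker is invented; only the image path mirrors the yq expression above), and my-tag is a placeholder.

```shell
# Work on a temporary copy so a real deploy.yaml is never modified.
workdir="$(mktemp -d)"
manifest="$workdir/deploy.yaml"

# Hypothetical, trimmed manifest fragment: only the image line matters here.
cat > "$manifest" <<'EOF'
spec:
  services:
    Worker:
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.2
EOF

new_tag="my-tag"   # placeholder; substitute the tag you actually want

# Same substitution as the sed fallback, with the tag taken from a variable.
# The -i.bak flag keeps the original file as deploy.yaml.bak.
sed -i.bak "s|:1\.0\.2|:${new_tag}|g" "$manifest"

# Show the change (diff exits non-zero when files differ, hence "|| true").
diff "$manifest.bak" "$manifest" || true

# Confirm that no reference to the old tag remains.
if grep -q ':1\.0\.2' "$manifest"; then
  echo "old tag still present" >&2
else
  echo "image tag updated to ${new_tag}"
fi
```

Keeping the .bak file makes it easy to restore the manifest if the substitution matched more than intended.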

For full Kubernetes deployment instructions, see the TensorRT-LLM Kubernetes Deployment Guide.

Next Steps