TensorRT-LLM
Use the Latest Release
We recommend using the latest stable release of Dynamo to avoid breaking changes.
Dynamo TensorRT-LLM integrates TensorRT-LLM engines into Dynamo’s distributed runtime, enabling disaggregated serving, KV-aware routing, multi-node deployments, and request cancellation. It supports LLM inference, multimodal models, video diffusion, and advanced features like speculative decoding and attention data parallelism.
Feature Support Matrix
Core Dynamo Features
Large Scale P/D and WideEP Features
Quick Start
Step 1 (host terminal): Start infrastructure services:
Step 2 (host terminal): Pull and run the prebuilt container:
The DYNAMO_VERSION variable above can be set to any specific available version of the container.
To find the available tensorrtllm-runtime versions for Dynamo, visit the NVIDIA NGC Catalog for Dynamo TensorRT-LLM Runtime.
Step 3 (inside the container): Launch an aggregated serving deployment (uses Qwen/Qwen3-0.6B by default):
The launch script will automatically download the model and start the TensorRT-LLM engine. You can override the model by setting MODEL_PATH and SERVED_MODEL_NAME environment variables before running the script.
Step 4 (host terminal): Verify the deployment:
Kubernetes Deployment
You can deploy TensorRT-LLM with Dynamo on Kubernetes using a DynamoGraphDeployment. For more details, see the TensorRT-LLM Kubernetes Deployment Guide.
Next Steps
- Reference Guide: Features, configuration, and operational details
- Examples: All deployment patterns with launch scripts
- KV Cache Transfer: KV cache transfer methods for disaggregated serving
- Prometheus Metrics: Metrics and monitoring
- Multinode Examples: Multi-node deployment with SLURM
- Deploying TensorRT-LLM with Dynamo on Kubernetes: Kubernetes deployment guide