TensorRT-LLM
TensorRT-LLM
TensorRT-LLM
We recommend using the latest stable release of Dynamo to avoid breaking changes.
Dynamo TensorRT-LLM integrates TensorRT-LLM engines into Dynamo’s distributed runtime, enabling disaggregated serving, KV-aware routing, multi-node deployments, and request cancellation. It supports LLM inference, multimodal models, video diffusion, and advanced features like speculative decoding and attention data parallelism.
Step 1 (host terminal): Start infrastructure services:
Step 2 (host terminal): Pull and run the prebuilt container:
The DYNAMO_VERSION variable above can be set to any specific available version of the container.
To find the available tensorrtllm-runtime versions for Dynamo, visit the NVIDIA NGC Catalog for Dynamo TensorRT-LLM Runtime.
Step 3 (inside the container): Launch an aggregated serving deployment (uses Qwen/Qwen3-0.6B by default):
The launch script will automatically download the model and start the TensorRT-LLM engine. You can override the model by setting MODEL_PATH and SERVED_MODEL_NAME environment variables before running the script.
Step 4 (host terminal): Verify the deployment:
You can deploy TensorRT-LLM with Dynamo on Kubernetes using a DynamoGraphDeployment. For more details, see the TensorRT-LLM Kubernetes Deployment Guide.