For quick start instructions, see the TensorRT-LLM README. This document provides all deployment patterns for running TensorRT-LLM with Dynamo, including single-node, multi-node, and Kubernetes deployments.
For local/bare-metal development, start etcd and optionally NATS using Docker Compose:
--discovery-backend file to use file system based discovery.--no-router-kv-events on the frontend for prediction-based routing without events.DYN_DISCOVERY_BACKEND=kubernetes to enable native K8s service discovery (DynamoWorkerMetadata CRD).Each launch script runs the frontend and worker(s) in a single terminal. You can run each command separately in different terminals for testing. Each shell script simply runs python3 -m dynamo.frontend <args> to start up the ingress and python3 -m dynamo.trtllm <args> to start up the workers.
For detailed information about the architecture and how KV-aware routing works, see the Router Guide.
In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
ignore_eos should generally be omitted or set to false when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.For comprehensive instructions on multinode serving, see the Multinode Examples guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see the Llama4 + Eagle guide to learn how to use these scripts when a single worker fits on a single node.
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the TensorRT-LLM Kubernetes Deployment Guide.
For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the TensorRT-LLM Benchmark Scripts for DeepSeek R1 model.
See the client section to learn how to send requests to the deployment.
To send a request to a multi-node deployment, target the node which is running python3 -m dynamo.frontend <args>.
To benchmark your deployment with AIPerf, see this utility script, configuring the
model name and host based on your deployment: perf.sh