LLM Deployment using TensorRT-LLM#

This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.

Use the Latest Release#

We recommend using the latest stable release of Dynamo to avoid breaking changes.

You can find the latest release here and check out the corresponding branch with:

git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

Feature Support Matrix#

Core Dynamo Features#

| Feature                    | TensorRT-LLM | Notes             |
|----------------------------|--------------|-------------------|
| Disaggregated Serving      | ✅           |                   |
| Conditional Disaggregation | 🚧           | Not supported yet |
| KV-Aware Routing           | ✅           |                   |
| SLA-Based Planner          | ✅           |                   |
| Load Based Planner         | 🚧           | Planned           |
| KVBM                       | ✅           |                   |

Large Scale P/D and WideEP Features#

| Feature         | TensorRT-LLM | Notes |
|-----------------|--------------|-------|
| WideEP          | ✅           |       |
| DP Rank Routing | ✅           |       |
| GB200 Support   | ✅           |       |

TensorRT-LLM Quick Start#

Below is a guide to running all of the common deployment patterns on a single node.

Start NATS and ETCD in the background#

Start them using Docker Compose:

docker compose -f deploy/docker-compose.yml up -d
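To confirm both services came up before continuing, you can list the compose services:

docker compose -f deploy/docker-compose.yml ps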

Build container#

# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs

# On an x86 machine:
./container/build.sh --framework trtllm

# On an ARM machine:
./container/build.sh --framework trtllm --platform linux/arm64

# Build the container with the default experimental TensorRT-LLM commit
# WARNING: This is for experimental feature testing only.
# The container should not be used in a production environment.
./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit main

Run container#

./container/run.sh --framework trtllm -it

Single Node Examples#

Important

Below we provide some simple shell scripts that run the components for each configuration. Each shell script simply runs python3 -m dynamo.frontend <args> to start the ingress and python3 -m dynamo.trtllm <args> to start the workers. You can easily take each command and run it in a separate terminal.
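For example, an aggregated deployment reduces to roughly the following two commands run in separate terminals. The flags shown are illustrative assumptions; the launch scripts are the authoritative reference:

# Terminal 1: OpenAI-compatible ingress
python3 -m dynamo.frontend --http-port 8000

# Terminal 2: TensorRT-LLM worker
python3 -m dynamo.trtllm --model-path "$MODEL_PATH" --served-model-name "$SERVED_MODEL_NAME"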

This figure shows an overview of the major components to deploy:

+------+      +-----------+      +------------------+             +---------------+
| HTTP |----->| processor |----->|      Worker1     |------------>|    Worker2    |
|      |<-----|           |<-----|                  |<------------|               |
+------+      +-----------+      +------------------+             +---------------+
                  |    ^                  |
       query best |    | return           | publish kv events
           worker |    | worker_id        v
                  |    |         +------------------+
                  |    +---------|     kv-router    |
                  +------------->|                  |
                                 +------------------+

Note: The diagram above shows all possible components in a deployment. In disaggregated serving, Worker1 acts as the decode worker and Worker2 as the prefill worker, with the unified frontend coordinating request routing between them.

Aggregated#

cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh

Aggregated with KV Routing#

cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg_router.sh

Disaggregated#

cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg.sh

Disaggregated with KV Routing#

Important

In the disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.

cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg_router.sh

Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1#

cd $DYNAMO_HOME/examples/backends/trtllm

export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh

Notes:

  • There is noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark (see the example request below).

  • MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, ignore_eos should generally be omitted or set to false when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
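For reference, a warm-up request against the frontend might look like the following (assuming the frontend's default HTTP port of 8000; the model name must match SERVED_MODEL_NAME):

curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/DeepSeek-R1-FP4",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'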

Advanced Examples#

Below is a selected list of advanced examples. Please open an issue if you’d like to see a specific example!

Multinode Deployment#

For comprehensive instructions on multinode serving, see the multinode-examples.md guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. See the Llama4+Eagle guide to learn how to use these scripts when a single worker fits on a single node.

Speculative Decoding#

Kubernetes Deployment#

For complete Kubernetes deployment instructions, configurations, and troubleshooting, see TensorRT-LLM Kubernetes Deployment Guide.

Client#

See the client section to learn how to send requests to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running python3 -m dynamo.frontend <args>.

Benchmarking#

To benchmark your deployment with AIPerf, see this utility script, configuring the model name and host based on your deployment: perf.sh

KV Cache Transfer in Disaggregated Serving#

Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the KV cache transfer guide.
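As a sketch, switching to the experimental NIXL backend is done by exporting an environment variable before launching the disaggregated workers. The variable name below is an assumption; confirm it against the KV cache transfer guide:

# Experimental: use NIXL instead of UCX for KV cache transfer
# (verify the exact variable name in the KV cache transfer guide)
export TRTLLM_USE_NIXL_KVCACHE_EXPERIMENTAL=1
./launch/disagg.sh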

Request Migration#

You can enable request migration to handle worker failures gracefully. Use the --migration-limit flag to specify how many times a request can be migrated to another worker:

# For decode and aggregated workers
python3 -m dynamo.trtllm ... --migration-limit=3

Important

Prefill workers do not support request migration and must use --migration-limit=0 (the default). Prefill workers only process prompts and return KV cache state; they don’t maintain long-running generation requests that would benefit from migration.

See the Request Migration Architecture documentation for details on how this works.

Request Cancellation#

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
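Because cancellation is driven by client disconnects, simply dropping the HTTP connection mid-stream triggers it. A minimal sketch using Python's requests library (the endpoint, port, and model name are assumptions; adjust to your deployment):

import requests

# Start a streaming completion, then abandon it early.
with requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Write a very long story."}],
        "max_tokens": 4096,
        "stream": True,
    },
    stream=True,
) as resp:
    for i, line in enumerate(resp.iter_lines()):
        if i >= 10:
            break  # exiting the block closes the connection; the workers cancel the request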

Cancellation Support Matrix#

|               | Prefill | Decode |
|---------------|---------|--------|
| Aggregated    | ✅      | ✅     |
| Disaggregated | ✅      | ✅     |

For more details, see the Request Cancellation Architecture documentation.

Multimodal support#

Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the Multimodal Support Guide.
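For instance, a multimodal request to the frontend typically follows the OpenAI chat format with an image_url content part. The sketch below uses placeholder model and image values; see the guide above for supported models and exact payloads:

curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<multimodal-model-name>",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "http://example.com/image.png"}}
          ]
        }],
        "max_tokens": 64
      }'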

Logits Processing#

Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.

How it works#

  • Interface: Implement dynamo.logits_processing.BaseLogitsProcessor which defines __call__(input_ids, logits) and modifies logits in-place.

  • TRT-LLM adapter: Use dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...) to convert Dynamo processors into TRT-LLM-compatible processors and assign them to SamplingParams.logits_processor.

  • Examples: See example processors in lib/bindings/python/src/dynamo/logits_processing/examples/ (temperature, hello_world).

Quick test: HelloWorld processor#

You can enable a test-only processor that forces the model to respond with “Hello world!”. This is useful to verify the wiring without modifying your model or engine code.

cd $DYNAMO_HOME/examples/backends/trtllm
export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
./launch/agg.sh

Notes:

  • When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.

  • Expected chat response contains “Hello world” (see the example request below).
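To verify, send any chat request to the frontend; the response content should contain “Hello world” regardless of the prompt. The port and model name below are assumptions based on the default launch script:

curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "What is 2+2?"}], "max_tokens": 16}'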

Bring your own processor#

Implement a processor by conforming to BaseLogitsProcessor, modifying logits in place. For example, temperature scaling:

from typing import Sequence
import torch
from dynamo.logits_processing import BaseLogitsProcessor

class TemperatureProcessor(BaseLogitsProcessor):
    def __init__(self, temperature: float = 1.0):
        if temperature <= 0:
            raise ValueError("Temperature must be positive")
        self.temperature = temperature

    def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
        # No-op at the default temperature.
        if self.temperature == 1.0:
            return
        # Scale the logits in place; do not return a new tensor.
        logits.div_(self.temperature)
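Continuing from the class above, a quick way to sanity-check the in-place behavior outside of a deployment (plain PyTorch, no TRT-LLM required):

import torch

proc = TemperatureProcessor(temperature=0.5)
logits = torch.tensor([1.0, 2.0, 4.0])
proc([101, 7592], logits)  # the token IDs are arbitrary here
print(logits)              # tensor([2., 4., 8.]): scaled in place, nothing returned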

Wire it into TRT-LLM by adapting and attaching to SamplingParams:

from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
from dynamo.logits_processing.examples import TemperatureProcessor

processors = [TemperatureProcessor(temperature=0.7)]
sampling_params.logits_processor = create_trtllm_adapters(processors)
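Because processing is per-request (see the limitations below), the adapted processors are attached per request via SamplingParams; if a processor keeps internal state, construct a fresh instance for each request rather than sharing one across requests.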

Current limitations#

  • Per-request processing only (batch size must be 1); beam width > 1 is not supported.

  • Processors must modify logits in-place and not return a new tensor.

  • If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).

Performance Sweep#

For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the TensorRT-LLM Benchmark Scripts for DeepSeek R1 model. This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.

Dynamo KV Block Manager Integration#

Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.

For setup instructions, see Running KVBM in TensorRT-LLM.