For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Digest
  • Getting Started
    • Quickstart
    • Introduction
    • Local Installation
    • Building from Source
    • Contribution Guide
  • Resources
    • Support Matrix
    • Feature Matrix
    • Release Artifacts
    • Examples
  • Kubernetes Deployment
    • Deployment Guide
  • User Guides
    • KV Cache Aware Routing
    • Disaggregated Serving
    • KV Cache Offloading
    • Dynamo Benchmarking
    • Multimodal
    • Diffusion (Preview)
    • Tool Calling
    • LoRA Adapters
    • Agents
    • Observability (Local)
    • Fault Tolerance
    • Writing Python Workers
  • Backends
    • SGLang
    • TensorRT-LLM
      • Reference Guide
      • Examples
      • Prometheus Metrics
      • Video Diffusion (Experimental)
      • Known Issues and Mitigations
    • vLLM
  • Components
    • Frontend
    • Router
    • Planner
    • Profiler
    • KVBM
  • Integrations
    • LMCache
    • SGLang HiCache
    • FlexKV
    • KV Events for Custom Engines
  • Design Docs
    • Overall Architecture
    • Architecture Flow
    • Disaggregated Serving
    • Distributed Runtime
    • Blog
  • Documentation
    • Dynamo Docs Guide
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
Digest
On this page
  • Table of Contents
  • Infrastructure Setup
  • Single Node Examples
  • Aggregated
  • Aggregated with KV Routing
  • Disaggregated
  • Disaggregated with KV Routing
  • Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
  • Advanced Examples
  • Multinode Deployment
  • Speculative Decoding
  • Model-Specific Guides
  • Kubernetes Deployment
  • Performance Sweep
  • Client
  • Benchmarking
BackendsTensorRT-LLM

Examples

||View as Markdown|
Edit this page
Previous

Reference Guide

Next

Prometheus

For quick start instructions, see the TensorRT-LLM README. This document provides all deployment patterns for running TensorRT-LLM with Dynamo, including single-node, multi-node, and Kubernetes deployments.

Table of Contents

  • Infrastructure Setup
  • Single Node Examples
  • Advanced Examples
  • Client
  • Benchmarking

Infrastructure Setup

For local/bare-metal development, start etcd and optionally NATS using Docker Compose:

$docker compose -f deploy/docker-compose.yml up -d
  • etcd is optional but is the default local discovery backend. You can also use --discovery-backend file to use file system based discovery.
  • NATS is optional - only needed if using KV routing with events. Workers must be explicitly configured to publish events. Use --no-router-kv-events on the frontend for prediction-based routing without events.
  • On Kubernetes, neither is required when using the Dynamo operator, which explicitly sets DYN_DISCOVERY_BACKEND=kubernetes to enable native K8s service discovery (DynamoWorkerMetadata CRD).

Each launch script runs the frontend and worker(s) in a single terminal. You can run each command separately in different terminals for testing. Each shell script simply runs python3 -m dynamo.frontend <args> to start up the ingress and python3 -m dynamo.trtllm <args> to start up the workers.

For detailed information about the architecture and how KV-aware routing works, see the Router Guide.

Single Node Examples

Aggregated

$cd $DYNAMO_HOME/examples/backends/trtllm
$./launch/agg.sh

Aggregated with KV Routing

$cd $DYNAMO_HOME/examples/backends/trtllm
$./launch/agg_router.sh

Disaggregated

$cd $DYNAMO_HOME/examples/backends/trtllm
$./launch/disagg.sh

Disaggregated with KV Routing

In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.

$cd $DYNAMO_HOME/examples/backends/trtllm
$./launch/disagg_router.sh

Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1

$cd $DYNAMO_HOME/examples/backends/trtllm
$
$export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
$export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
$# nvidia/DeepSeek-R1-FP4 is a large model
$export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
$./launch/agg.sh
  • There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
  • MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, ignore_eos should generally be omitted or set to false when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.

Advanced Examples

Multinode Deployment

For comprehensive instructions on multinode serving, see the Multinode Examples guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see the Llama4 + Eagle guide to learn how to use these scripts when a single worker fits on a single node.

Speculative Decoding

  • Llama 4 Maverick Instruct + Eagle Speculative Decoding

Model-Specific Guides

  • Gemma3 with Sliding Window Attention
  • GPT-OSS-120b — Reasoning model with tool calling support

Kubernetes Deployment

For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the TensorRT-LLM Kubernetes Deployment Guide.

Performance Sweep

For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the TensorRT-LLM Benchmark Scripts for DeepSeek R1 model.

Client

See the client section to learn how to send requests to the deployment.

To send a request to a multi-node deployment, target the node which is running python3 -m dynamo.frontend <args>.

Benchmarking

To benchmark your deployment with AIPerf, see this utility script, configuring the model name and host based on your deployment: perf.sh