Examples | NVIDIA Dynamo Documentation

For quick start instructions, see the TensorRT-LLM README. This document provides all deployment patterns for running TensorRT-LLM with Dynamo, including single-node, multi-node, and Kubernetes deployments.

Infrastructure Setup

For local/bare-metal development, start etcd and optionally NATS using Docker Compose:

$ docker compose -f deploy/docker-compose.yml up -d

etcd is optional but is the default local discovery backend. You can also use --discovery-backend file to use file system based discovery.
NATS is optional - only needed if using KV routing with events. Workers must be explicitly configured to publish events. Use --no-router-kv-events on the frontend for prediction-based routing without events.
On Kubernetes, neither is required when using the Dynamo operator, which explicitly sets DYN_DISCOVERY_BACKEND=kubernetes to enable native K8s service discovery (DynamoWorkerMetadata CRD).

Each launch script runs the frontend and worker(s) in a single terminal. You can run each command separately in different terminals for testing. Each shell script simply runs python3 -m dynamo.frontend <args> to start up the ingress and python3 -m dynamo.trtllm <args> to start up the workers.

For detailed information about the architecture and how KV-aware routing works, see the Router Guide.

Single Node Examples

Aggregated

$ cd $DYNAMO_HOME/examples/backends/trtllm
$ ./launch/agg.sh

Aggregated with KV Routing

$ cd $DYNAMO_HOME/examples/backends/trtllm
$ ./launch/agg_router.sh

Disaggregated

$ cd $DYNAMO_HOME/examples/backends/trtllm
$ ./launch/disagg.sh

Disaggregated with KV Routing

In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.

$ cd $DYNAMO_HOME/examples/backends/trtllm
$ ./launch/disagg_router.sh

Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1

$ cd $DYNAMO_HOME/examples/backends/trtllm
$ 
$ export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
$ export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
$ # nvidia/DeepSeek-R1-FP4 is a large model
$ export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
$ ./launch/agg.sh

There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, ignore_eos should generally be omitted or set to false when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.

Advanced Examples

Multinode Deployment

For comprehensive instructions on multinode serving, see the Multinode Examples guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see the Llama4 + Eagle guide to learn how to use these scripts when a single worker fits on a single node.

Speculative Decoding

Llama 4 Maverick Instruct + Eagle Speculative Decoding

Model-Specific Guides

Gemma3 with Sliding Window Attention
GPT-OSS-120b — Reasoning model with tool calling support

Kubernetes Deployment

For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the TensorRT-LLM Kubernetes Deployment Guide.

Performance Sweep

For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the TensorRT-LLM Benchmark Scripts for DeepSeek R1 model.

Client

See the client section to learn how to send requests to the deployment.

To send a request to a multi-node deployment, target the node which is running python3 -m dynamo.frontend <args>.

Benchmarking

To benchmark your deployment with AIPerf, see this utility script, configuring the model name and host based on your deployment: perf.sh

Table of Contents