TensorRT-LLM Multimodal

This document provides a comprehensive guide to multimodal inference using the TensorRT-LLM backend in Dynamo.

You can provide multimodal inputs in the following ways:

  • By sending image URLs
  • By providing paths to pre-computed embedding files

Note: Provide either image URLs or embedding file paths in a single request, not both.
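
For example, a minimal Python client sketch (using the requests library; the model name is illustrative and matches the examples later in this document) can send either input style through the same image_url field:

```python
import requests

# Pass either a remote image URL or the path to a pre-computed embedding
# file - use one style per request, not both.
def describe(media_url: str) -> str:
    payload = {
        "model": "Qwen/Qwen2-VL-7B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image"},
                # For embeddings, put the file path in the same field,
                # e.g. {"url": "/tmp/embedding.pt"}
                {"type": "image_url", "image_url": {"url": media_url}},
            ],
        }],
        "max_tokens": 160,
    }
    resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```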

Support Matrix

| Modality | Input Format | Aggregated | Disaggregated | Notes |
|---|---|---|---|---|
| Image | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
| Image | Pre-computed Embeddings (.pt, .pth, .bin) | Yes | Yes | Direct embedding files |
| Video | HTTP/HTTPS URL | No | No | Not implemented |
| Audio | HTTP/HTTPS URL | No | No | Not implemented |

Supported URL Formats

| Format | Example | Description |
|---|---|---|
| HTTP/HTTPS | http://example.com/image.jpg | Remote media files |
| Pre-computed Embeddings | /path/to/embedding.pt | Local embedding files (.pt, .pth, .bin) |

Deployment Patterns

TRT-LLM supports aggregated, traditional disaggregated (EP/D), and fully disaggregated (E/P/D) patterns. See Architecture Patterns for detailed explanations.

| Pattern | Supported | Launch Script | Notes |
|---|---|---|---|
| Aggregated | Yes | agg.sh | Easiest setup, single worker |
| EP/D (Traditional Disaggregated) | Yes | disagg_multimodal.sh | Prefill handles encoding, 2 workers |
| E/P/D (Full - Image URLs) | Yes | epd_multimodal_image_and_embeddings.sh | Standalone encoder with MultimodalEncoder, 3 workers |
| E/P/D (Full - Pre-computed Embeddings) | Yes | epd_multimodal_image_and_embeddings.sh | Standalone encoder with NIXL transfer, 3 workers |
| E/P/D (Large Models) | Yes | epd_disagg.sh | For Llama-4 Scout/Maverick, multi-node |

Component Flags

| Component | Flag | Purpose |
|---|---|---|
| Worker | --modality multimodal | Complete pipeline (aggregated) |
| Prefill Worker | --disaggregation-mode prefill | Image processing + prefill (multimodal tokenization happens here) |
| Decode Worker | --disaggregation-mode decode | Decode only |
| Encode Worker | --disaggregation-mode encode | Image encoding (E/P/D flow) |

Aggregated Serving

Quick steps to launch Llama-4 Maverick BF16 in aggregated mode:

$cd $DYNAMO_HOME
$
$export AGG_ENGINE_ARGS=./examples/backends/trtllm/engine_configs/llama4/multimodal/agg.yaml
$export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
$export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
$./examples/backends/trtllm/launch/agg.sh

Client:

$curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
> "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
> "messages": [
> {
> "role": "user",
> "content": [
> {
> "type": "text",
> "text": "Describe the image"
> },
> {
> "type": "image_url",
> "image_url": {
> "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
> }
> }
> ]
> }
> ],
> "stream": false,
> "max_tokens": 160
>}'

Disaggregated Serving

Example using Qwen/Qwen2-VL-7B-Instruct:

$cd $DYNAMO_HOME
$
$export MODEL_PATH="Qwen/Qwen2-VL-7B-Instruct"
$export SERVED_MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
$export PREFILL_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/prefill.yaml"
$export DECODE_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/decode.yaml"
$export MODALITY="multimodal"
$
$./examples/backends/trtllm/launch/disagg.sh

Client:

$curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
> "model": "Qwen/Qwen2-VL-7B-Instruct",
> "messages": [
> {
> "role": "user",
> "content": [
> {
> "type": "text",
> "text": "Describe the image"
> },
> {
> "type": "image_url",
> "image_url": {
> "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
> }
> }
> ]
> }
> ],
> "stream": false,
> "max_tokens": 160
>}'

For a large model like meta-llama/Llama-4-Maverick-17B-128E-Instruct, disaggregated serving requires a multi-node setup (see Multi-node Deployment below), while aggregated serving can run on a single node. In a disaggregated configuration the model is too large to fit on a single node's GPUs; for instance, running this model in disaggregated mode requires 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs.
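
As a rough back-of-envelope check (assuming ~400B total parameters for Maverick in BF16 and 141 GB of HBM per H200; real deployments also need KV cache and activation memory on top of the weights):

```python
# Illustrative memory estimate - weights only, ignoring KV cache/activations.
total_params_b = 400              # Llama-4 Maverick: ~400B total parameters
weights_gb = total_params_b * 2   # BF16 = 2 bytes/param -> ~800 GB

node_hbm_gb = 8 * 141             # 8xH200 node -> ~1128 GB HBM

print(weights_gb <= node_hbm_gb)      # True: one replica fits (aggregated)
print(2 * weights_gb <= node_hbm_gb)  # False: prefill + decode replicas
                                      # exceed one node (disaggregated)
```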

Full E/P/D Flow (Image URLs)

For high-performance multimodal inference, Dynamo supports a standalone encoder with an Encode-Prefill-Decode (E/P/D) flow using TRT-LLM’s MultimodalEncoder. This separates the vision encoding stage from prefill and decode, enabling better GPU utilization and scalability.

Supported Input Formats

| Format | Example | Description |
|---|---|---|
| HTTP/HTTPS URL | https://example.com/image.jpg | Remote image files |
| Base64 Data URL | data:image/jpeg;base64,... | Inline base64-encoded images |
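
A data URL for a local image can be built like this (a minimal sketch; the file name is illustrative):

```python
import base64

# Read a local JPEG and wrap it as an inline data URL for the image_url field.
with open("cat.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("ascii")

content = [
    {"type": "text", "text": "Describe the image"},
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
]
```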

How It Works

In the full E/P/D flow:

  1. Encode Worker: Runs TRT-LLM’s MultimodalEncoder.generate() to process image URLs through the vision encoder and projector
  2. Prefill Worker: Receives disaggregated_params containing multimodal embedding handles, processes context and generates KV cache
  3. Decode Worker: Performs streaming token generation using the KV cache

The encode worker uses TRT-LLM's MultimodalEncoder class (which inherits from BaseLLM) and requires only the model path and batch size; no KV cache configuration is needed, since it runs only the vision encoder and projector.
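
A minimal sketch of the encode worker's core step, assuming an llmapi-style import path and constructor (these are assumptions based on the description above, not a verbatim excerpt from Dynamo):

```python
# Hypothetical sketch - the exact import path and signature may differ.
from tensorrt_llm.llmapi import MultimodalEncoder

# Only a model path and batch size: the encoder runs just the vision
# encoder + projector, so no KV cache configuration is needed.
encoder = MultimodalEncoder(
    model="llava-hf/llava-v1.6-mistral-7b-hf",
    max_batch_size=16,
)

# generate() processes image URLs through the vision encoder and returns
# outputs whose disaggregated params carry multimodal embedding handles
# for the prefill worker.
outputs = encoder.generate(multimodal_inputs)  # preprocessed multimodal prompts
```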

How to Launch

$cd $DYNAMO_HOME
$
$# Launch 3-worker E/P/D flow with image URL support
$./examples/backends/trtllm/launch/epd_multimodal_image_and_embeddings.sh

Example Request

$curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
> "model": "llava-v1.6-mistral-7b-hf",
> "messages": [
> {
> "role": "user",
> "content": [
> {"type": "text", "text": "Describe the image"},
> {
> "type": "image_url",
> "image_url": {
> "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
> }
> }
> ]
> }
> ],
> "max_tokens": 160
>}'

E/P/D Architecture (Image URLs)

Key Differences from EP/D (Traditional Disaggregated)

| Aspect | EP/D (Traditional) | E/P/D (Full) |
|---|---|---|
| Encoding | Prefill worker handles image encoding | Dedicated encode worker |
| Prefill Load | Higher (encoding + prefill) | Lower (prefill only) |
| Use Case | Simpler setup | Better scalability for vision-heavy workloads |
| Launch Script | disagg_multimodal.sh | epd_multimodal_image_and_embeddings.sh |

Pre-computed Embeddings with E/P/D Flow

For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an Encode-Prefill-Decode (E/P/D) flow using NIXL (RDMA) for zero-copy tensor transfer.

Supported File Types

  • .pt - PyTorch tensor files
  • .pth - PyTorch checkpoint files
  • .bin - Binary tensor files

Embedding File Formats

TRT-LLM supports two formats for embedding files:

1. Simple Tensor Format

Direct tensor saved as .pt file containing only the embedding tensor:

```python
import torch

embedding_tensor = torch.rand(1, 576, 4096)  # [batch, seq_len, hidden_dim]
torch.save(embedding_tensor, "embedding.pt")
```

2. Dictionary Format with Auxiliary Data

Dictionary containing multiple keys, used by models like Llama-4 that require additional metadata:

```python
import torch

embedding_dict = {
    "mm_embeddings": torch.rand(1, 576, 4096),
    "special_tokens": [128256, 128257],
    "image_token_offsets": [[0, 576]],
    # ... other model-specific metadata
}
torch.save(embedding_dict, "llama4_embedding.pt")
```

  • Simple tensors: loaded directly and passed to the mm_embeddings parameter
  • Dictionary format: the mm_embeddings key is extracted as the main tensor; other keys are preserved as auxiliary data
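
A minimal sketch of this dispatch logic (the helper name is hypothetical; Dynamo's actual loader lives in the encode worker):

```python
import torch

def load_embedding(path: str):
    """Hypothetical loader handling both supported file formats."""
    obj = torch.load(path, map_location="cpu")
    if isinstance(obj, dict):
        # Dictionary format: extract the main tensor, keep the rest as aux data.
        embeddings = obj.pop("mm_embeddings")
        aux_data = obj
    else:
        # Simple tensor format: the file is the embedding tensor itself.
        embeddings, aux_data = obj, {}
    return embeddings, aux_data
```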

How to Launch

$cd $DYNAMO_HOME/examples/backends/trtllm
$
$# Launch 3-worker E/P/D flow with NIXL
$./launch/epd_disagg.sh

Note: This script is designed for an 8-node H200 setup with the Llama-4-Scout-17B-16E-Instruct model and assumes you have a model-specific embedding file ready.

Configuration

$# Encode endpoint for Prefill → Encode communication
$export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate"
$
$# Security: Allowed directory for embedding files (default: /tmp)
$export ALLOWED_LOCAL_MEDIA_PATH="/tmp"
$
$# Security: Max file size to prevent DoS attacks (default: 50MB)
$export MAX_FILE_SIZE_MB=50
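
These variables imply checks along the following lines before a file is loaded (a sketch; the function and constant names are illustrative, not Dynamo's actual code):

```python
import os

ALLOWED_DIR = os.environ.get("ALLOWED_LOCAL_MEDIA_PATH", "/tmp")
MAX_BYTES = int(os.environ.get("MAX_FILE_SIZE_MB", "50")) * 1024 * 1024

def validate_embedding_path(path: str) -> str:
    """Illustrative validation: confine paths to ALLOWED_DIR, cap file size."""
    real = os.path.realpath(path)  # resolve symlinks and ".." segments
    if not real.startswith(os.path.realpath(ALLOWED_DIR) + os.sep):
        raise PermissionError(f"{path} is outside {ALLOWED_DIR}")
    if os.path.getsize(real) > MAX_BYTES:
        raise ValueError(f"{path} exceeds the {MAX_BYTES}-byte limit")
    return real
```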

Example Request with Pre-computed Embeddings

$curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
> "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
> "messages": [
> {
> "role": "user",
> "content": [
> {"type": "text", "text": "Describe the image"},
> {"type": "image_url", "image_url": {"url": "/path/to/embedding.pt"}}
> ]
> }
> ],
> "max_tokens": 160
>}'

E/P/D Architecture

The E/P/D flow implements a 3-worker architecture:

  • Encode Worker: Loads pre-computed embeddings, transfers via NIXL
  • Prefill Worker: Receives embeddings, handles context processing and KV-cache generation
  • Decode Worker: Performs streaming token generation

Multi-node Deployment (Slurm)

This section demonstrates how to deploy large multimodal models that require a multi-node setup using Slurm.

Note: The scripts referenced in this section can be found in examples/basics/multinode/trtllm/.

Environment Setup

Assuming you have allocated your nodes via salloc and are inside an interactive shell:

$# Container image (build using docs/backends/trtllm/README.md#build-container)
$export IMAGE="<dynamo_trtllm_image>"
$
$# Host:container path pairs for mounting
$export MOUNTS="${PWD}/../../../../:/mnt"
$
$# Model configuration
$export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
$export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
$export MODALITY=${MODALITY:-"multimodal"}

Multi-node Disaggregated Launch

For four 4xGB200 nodes (2 for prefill, 2 for decode):

$# Customize parallelism to match your engine configs
$# export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/prefill.yaml"
$# export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/decode.yaml"
$# export NUM_PREFILL_NODES=2
$# export NUM_DECODE_NODES=2
$# export NUM_GPUS_PER_NODE=4
$
$# Launches frontend + etcd/nats on head node, plus prefill and decode workers
$./srun_disaggregated.sh

Understanding the Output

  1. srun_disaggregated.sh launches three srun jobs: frontend, prefill worker, and decode worker
  2. The OpenAI frontend will dynamically discover workers as they register:
    INFO dynamo_run::input::http: Watching for remote model at models
    INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000
  3. TRT-LLM workers output progress from each MPI rank while loading
  4. When ready, the frontend logs:
    INFO dynamo_llm::discovery::watcher: added model model_name="meta-llama/Llama-4-Maverick-17B-128E-Instruct"

Cleanup

$pkill srun

Embedding Cache

Dynamo provides an embedding cache for TRT-LLM; support differs between the aggregated and disaggregated settings:

| Setting | Implementation | Launch Script | Status |
|---|---|---|---|
| Disaggregated Encoder | Dynamo-managed cache in the PD worker layer on top of the TRT-LLM engine | disagg_e_pd.sh + --multimodal-embedding-cache-capacity-gb | Supported |
| Aggregated | N/A | N/A | Not yet supported |

The cache uses MultimodalEmbeddingCacheManager to maintain an LRU cache of encoder embeddings on CPU. When the same image is seen again, the cached embedding is reused instead of re-encoding.

Disaggregated Encoder (Embedding Cache in Prefill Worker)

In the disaggregated setting, the Prefill Worker (P) owns a CPU-side LRU embedding cache (EmbeddingCacheManager). On each request, P checks the cache first; on a hit, the Encode Worker is skipped entirely. On a miss, P routes to the Encode Worker (E), receives the embeddings via NIXL, saves them to the cache, and then feeds the embeddings along with the request into the TRT-LLM instance for prefill.
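
A minimal sketch of this hit/miss flow with an OrderedDict-based LRU keyed by image identity (illustrative; not the actual MultimodalEmbeddingCacheManager):

```python
from collections import OrderedDict
import torch

class LRUEmbeddingCache:
    """Illustrative CPU-side LRU embedding cache with a byte-size capacity."""
    def __init__(self, capacity_gb: float):
        self.capacity = int(capacity_gb * 1024**3)
        self.used = 0
        self.cache: OrderedDict[str, torch.Tensor] = OrderedDict()

    def get(self, key: str):
        if key in self.cache:                # hit: skip the encode worker
            self.cache.move_to_end(key)      # mark as most recently used
            return self.cache[key]
        return None                          # miss: route to the encode worker

    def put(self, key: str, emb: torch.Tensor):
        emb = emb.cpu()
        if key in self.cache:
            self.used -= self._size(self.cache.pop(key))
        self.cache[key] = emb
        self.used += self._size(emb)
        while self.used > self.capacity:     # evict least recently used
            _, old = self.cache.popitem(last=False)
            self.used -= self._size(old)

    @staticmethod
    def _size(t: torch.Tensor) -> int:
        return t.element_size() * t.nelement()
```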

The disagg_e_pd.sh script launches a separate encode worker and a PD worker; extra arguments are forwarded to the PD worker. Enable the embedding cache by passing --multimodal-embedding-cache-capacity-gb:

$cd $DYNAMO_HOME/examples/backends/trtllm
$./launch/disagg_e_pd.sh --multimodal-embedding-cache-capacity-gb 10

NIXL Usage

| Use Case | Script | NIXL Used? | Data Transfer |
|---|---|---|---|
| Aggregated | agg.sh | No | All in one worker |
| EP/D (Traditional Disaggregated) | disagg_multimodal.sh | Optional | Prefill → Decode (KV cache via UCX or NIXL) |
| E/P/D (Image URLs) | epd_multimodal_image_and_embeddings.sh | No | Encoder → Prefill (handles via params), Prefill → Decode (KV cache) |
| E/P/D (Pre-computed Embeddings) | epd_multimodal_image_and_embeddings.sh | Yes | Encoder → Prefill (embeddings via NIXL RDMA) |
| E/P/D (Large Models) | epd_disagg.sh | Yes | Encoder → Prefill (embeddings via NIXL), Prefill → Decode (KV cache) |

Note: NIXL for KV cache transfer is currently in beta and only supported on the AMD64 (x86_64) architecture.

ModelInput Types and Registration

TRT-LLM workers register with Dynamo using:

| ModelInput Type | Preprocessing | Use Case |
|---|---|---|
| ModelInput.Tokens | Rust frontend may tokenize, but multimodal flows re-tokenize and build inputs in the Python worker; Rust token_ids are ignored | All TRT-LLM workers |

```python
# TRT-LLM Worker - Register with Tokens
await register_model(
    ModelInput.Tokens,  # Rust does minimal preprocessing
    model_type,         # ModelType.Chat or ModelType.Prefill
    generate_endpoint,
    model_name,
    ...
)
```

Inter-Component Communication

| Transfer Stage | Message | NIXL Transfer |
|---|---|---|
| Frontend → Prefill | Request with image URL or embedding path | No |
| Prefill → Encode (Image URL) | Request with image URL | No |
| Encode → Prefill (Image URL) | ep_disaggregated_params with multimodal_embedding_handles, processed prompt, and token IDs | No |
| Prefill → Encode (Embedding Path) | Request with embedding file path | No |
| Encode → Prefill (Embedding Path) | NIXL readable metadata + shape/dtype + auxiliary data | Yes (embeddings tensor via RDMA) |
| Prefill → Decode | disaggregated_params with _epd_metadata (prompt, token IDs) | Configurable (KV cache: NIXL default, UCX optional) |
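
For orientation, the payloads above look roughly like the following (only the field names cited in the table come from the source; everything else is illustrative):

```python
# Encode → Prefill (image URL flow): handles, not raw tensors, are passed.
encode_to_prefill_image_url = {
    "ep_disaggregated_params": {
        "multimodal_embedding_handles": ["<handle>"],  # illustrative value
        "prompt": "<processed prompt>",
        "token_ids": [101, 102, 103],
    },
}

# Encode → Prefill (embedding path flow): metadata only; the embedding
# tensor itself moves out-of-band via NIXL RDMA.
encode_to_prefill_embedding_path = {
    "nixl_metadata": "<readable descriptor>",  # illustrative field name
    "shape": [1, 576, 4096],
    "dtype": "bfloat16",
    "aux_data": {},  # auxiliary keys from dictionary-format embedding files
}
```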

Known Limitations

  • No video support - No video encoder implementation
  • No audio support - No audio encoder implementation
  • Multimodal preprocessing/tokenization happens in Python - Rust may forward token_ids, but multimodal requests are parsed and re-tokenized in the Python worker
  • Multi-node H100 limitation - Loading meta-llama/Llama-4-Maverick-17B-128E-Instruct on 8 H100 nodes with TP=16 is not possible due to head-count divisibility (num_attention_heads: 40 is not divisible by tp_size: 16)
  • llava-v1.6-mistral-7b-hf model crash - Known compatibility issue between the TRT-LLM backend and TensorRT-LLM version 1.2.0rc6.post1. To use the Llava model, download revision 52320fb52229 locally from Hugging Face.
  • Embedding file crash - Known compatibility issue between the TRT-LLM backend and TensorRT-LLM version 1.2.0rc6.post1: embedding file parsing crashes in attach_multimodal_embeddings(). To be fixed in the next TRT-LLM upgrade.

Supported Models

Dynamo supports the multimodal models listed in TensorRT-LLM's supported models documentation.

Common examples:

  • Llama 4 Vision models (Maverick, Scout) - Recommended for large-scale deployments
  • LLaVA models (e.g., llava-hf/llava-v1.6-mistral-7b-hf) - Default model for E/P/D examples
  • Qwen2-VL models - Supported in traditional disaggregated mode
  • Other vision-language models with TRT-LLM support

Key Files

| File | Description |
|---|---|
| components/src/dynamo/trtllm/main.py | Worker initialization and setup |
| components/src/dynamo/trtllm/engine.py | TensorRTLLMEngine wrapper (LLM and MultimodalEncoder) |
| components/src/dynamo/trtllm/constants.py | DisaggregationMode enum (AGGREGATED, PREFILL, DECODE, ENCODE) |
| components/src/dynamo/trtllm/encode_helper.py | Encode worker request processing (embedding-path and full EPD flows) |
| components/src/dynamo/trtllm/multimodal_processor.py | Multimodal request processing |
| components/src/dynamo/trtllm/request_handlers/handlers.py | Request handlers (EncodeHandler, PrefillHandler, DecodeHandler) |
| components/src/dynamo/trtllm/request_handlers/handler_base.py | Base handler with disaggregated params encoding/decoding |
| components/src/dynamo/trtllm/utils/disagg_utils.py | DisaggregatedParamsCodec for network transfer |
| components/src/dynamo/trtllm/utils/trtllm_utils.py | Command-line argument parsing |