TensorRT-LLM Multimodal

This document provides a comprehensive guide to multimodal inference using the TensorRT-LLM backend in Dynamo.

You can provide multimodal inputs in the following ways:

  • By sending image URLs
  • By providing paths to pre-computed embedding files

Note: Provide either image URLs or embedding file paths in a single request, not both.
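
For example, a minimal Python client sketch (using the requests library; the model name is illustrative and matches the examples later in this document) can send either input style through the same image_url field:

```python
import requests

# Pass either a remote image URL or the path to a pre-computed embedding
# file - use one style per request, not both.
def describe(media_url: str) -> str:
    payload = {
        "model": "Qwen/Qwen2-VL-7B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image"},
                # For embeddings, put the file path in the same field,
                # e.g. {"url": "/tmp/embedding.pt"}
                {"type": "image_url", "image_url": {"url": media_url}},
            ],
        }],
        "max_tokens": 160,
    }
    resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```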

Support Matrix

| Modality | Input Format | Aggregated | Disaggregated | Notes |
|---|---|---|---|---|
| Image | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
| Image | Pre-computed Embeddings (.pt, .pth, .bin) | Yes | Yes | Direct embedding files |
| Video | HTTP/HTTPS URL | No | No | Not implemented |
| Audio | HTTP/HTTPS URL | No | No | Not implemented |

Supported URL Formats

| Format | Example | Description |
|---|---|---|
| HTTP/HTTPS | http://example.com/image.jpg | Remote media files |
| Pre-computed Embeddings | /path/to/embedding.pt | Local embedding files (.pt, .pth, .bin) |

Deployment Patterns

TRT-LLM supports aggregated, traditional disaggregated (EP/D), and fully disaggregated (E/P/D) patterns. See Architecture Patterns for detailed explanations.

| Pattern | Supported | Launch Script | Notes |
|---|---|---|---|
| Aggregated | Yes | agg.sh | Easiest setup, single worker |
| EP/D (Traditional Disaggregated) | Yes | disagg_multimodal.sh | Prefill handles encoding, 2 workers |
| E/P/D (Full - Image URLs) | Yes | epd_multimodal_image_and_embeddings.sh | Standalone encoder with MultimodalEncoder, 3 workers |
| E/P/D (Full - Pre-computed Embeddings) | Yes | epd_multimodal_image_and_embeddings.sh | Standalone encoder with NIXL transfer, 3 workers |
| E/P/D (Large Models) | Yes | epd_disagg.sh | For Llama-4 Scout/Maverick, multi-node |

Component Flags

| Component | Flag | Purpose |
|---|---|---|
| Worker | --modality multimodal | Complete pipeline (aggregated) |
| Prefill Worker | --disaggregation-mode prefill | Image processing + prefill (multimodal tokenization happens here) |
| Decode Worker | --disaggregation-mode decode | Decode only |
| Encode Worker | --disaggregation-mode encode | Image encoding (E/P/D flow) |

Aggregated Serving

Quick steps to launch Llama-4 Maverick BF16 in aggregated mode:

$cd $DYNAMO_HOME
$
$export AGG_ENGINE_ARGS=./examples/backends/trtllm/engine_configs/llama4/multimodal/agg.yaml
$export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
$export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
$./examples/backends/trtllm/launch/agg.sh

Client:

$curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
> "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
> "messages": [
> {
> "role": "user",
> "content": [
> {
> "type": "text",
> "text": "Describe the image"
> },
> {
> "type": "image_url",
> "image_url": {
> "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
> }
> }
> ]
> }
> ],
> "stream": false,
> "max_tokens": 160
>}'

Disaggregated Serving

Example using Qwen/Qwen2-VL-7B-Instruct:

$cd $DYNAMO_HOME
$
$export MODEL_PATH="Qwen/Qwen2-VL-7B-Instruct"
$export SERVED_MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
$export PREFILL_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/prefill.yaml"
$export DECODE_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/decode.yaml"
$export MODALITY="multimodal"
$
$./examples/backends/trtllm/launch/disagg.sh

Client:

$curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
> "model": "Qwen/Qwen2-VL-7B-Instruct",
> "messages": [
> {
> "role": "user",
> "content": [
> {
> "type": "text",
> "text": "Describe the image"
> },
> {
> "type": "image_url",
> "image_url": {
> "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
> }
> }
> ]
> }
> ],
> "stream": false,
> "max_tokens": 160
>}'

For a large model like meta-llama/Llama-4-Maverick-17B-128E-Instruct, disaggregated serving requires a multi-node setup (see Multi-node Deployment below), while aggregated serving can run on a single node. In a disaggregated configuration the model is too large to fit on a single node's GPUs; for instance, running this model in disaggregated mode requires 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs.
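
As a rough back-of-envelope check (assuming ~400B total parameters for Maverick in BF16 and 141 GB of HBM per H200; real deployments also need KV cache and activation memory on top of the weights):

```python
# Illustrative memory estimate - weights only, ignoring KV cache/activations.
total_params_b = 400              # Llama-4 Maverick: ~400B total parameters
weights_gb = total_params_b * 2   # BF16 = 2 bytes/param -> ~800 GB

node_hbm_gb = 8 * 141             # 8xH200 node -> ~1128 GB HBM

print(weights_gb <= node_hbm_gb)      # True: one replica fits (aggregated)
print(2 * weights_gb <= node_hbm_gb)  # False: prefill + decode replicas
                                      # exceed one node (disaggregated)
```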

Full E/P/D Flow (Image URLs)

For high-performance multimodal inference, Dynamo supports a standalone encoder with an Encode-Prefill-Decode (E/P/D) flow using TRT-LLM’s MultimodalEncoder. This separates the vision encoding stage from prefill and decode, enabling better GPU utilization and scalability.

Supported Input Formats

| Format | Example | Description |
|---|---|---|
| HTTP/HTTPS URL | https://example.com/image.jpg | Remote image files |
| Base64 Data URL | data:image/jpeg;base64,... | Inline base64-encoded images |
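
A data URL for a local image can be built like this (a minimal sketch; the file name is illustrative):

```python
import base64

# Read a local JPEG and wrap it as an inline data URL for the image_url field.
with open("cat.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("ascii")

content = [
    {"type": "text", "text": "Describe the image"},
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
]
```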

How It Works

In the full E/P/D flow:

  1. Encode Worker: Runs TRT-LLM’s MultimodalEncoder.generate() to process image URLs through the vision encoder and projector
  2. Prefill Worker: Receives disaggregated_params containing multimodal embedding handles, processes context and generates KV cache
  3. Decode Worker: Performs streaming token generation using the KV cache

The encode worker uses TRT-LLM's MultimodalEncoder class (which inherits from BaseLLM) and requires only the model path and batch size; no KV cache configuration is needed, since it runs only the vision encoder and projector.
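
A minimal sketch of the encode worker's core step, assuming an llmapi-style import path and constructor (these are assumptions based on the description above, not a verbatim excerpt from Dynamo):

```python
# Hypothetical sketch - the exact import path and signature may differ.
from tensorrt_llm.llmapi import MultimodalEncoder

# Only a model path and batch size: the encoder runs just the vision
# encoder + projector, so no KV cache configuration is needed.
encoder = MultimodalEncoder(
    model="llava-hf/llava-v1.6-mistral-7b-hf",
    max_batch_size=16,
)

# generate() processes image URLs through the vision encoder and returns
# outputs whose disaggregated params carry multimodal embedding handles
# for the prefill worker.
outputs = encoder.generate(multimodal_inputs)  # preprocessed multimodal prompts
```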

How to Launch

$cd $DYNAMO_HOME
$
$# Launch 3-worker E/P/D flow with image URL support
$./examples/backends/trtllm/launch/epd_multimodal_image_and_embeddings.sh

Example Request

$curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
> "model": "llava-v1.6-mistral-7b-hf",
> "messages": [
> {
> "role": "user",
> "content": [
> {"type": "text", "text": "Describe the image"},
> {
> "type": "image_url",
> "image_url": {
> "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
> }
> }
> ]
> }
> ],
> "max_tokens": 160
>}'

E/P/D Architecture (Image URLs)

Key Differences from EP/D (Traditional Disaggregated)

| Aspect | EP/D (Traditional) | E/P/D (Full) |
|---|---|---|
| Encoding | Prefill worker handles image encoding | Dedicated encode worker |
| Prefill Load | Higher (encoding + prefill) | Lower (prefill only) |
| Use Case | Simpler setup | Better scalability for vision-heavy workloads |
| Launch Script | disagg_multimodal.sh | epd_multimodal_image_and_embeddings.sh |

Pre-computed Embeddings with E/P/D Flow

For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an Encode-Prefill-Decode (E/P/D) flow using NIXL (RDMA) for zero-copy tensor transfer.

Supported File Types

  • .pt - PyTorch tensor files
  • .pth - PyTorch checkpoint files
  • .bin - Binary tensor files

Embedding File Formats

TRT-LLM supports two formats for embedding files:

1. Simple Tensor Format

Direct tensor saved as .pt file containing only the embedding tensor:

```python
import torch

embedding_tensor = torch.rand(1, 576, 4096)  # [batch, seq_len, hidden_dim]
torch.save(embedding_tensor, "embedding.pt")
```

2. Dictionary Format with Auxiliary Data

Dictionary containing multiple keys, used by models like Llama-4 that require additional metadata:

```python
import torch

embedding_dict = {
    "mm_embeddings": torch.rand(1, 576, 4096),
    "special_tokens": [128256, 128257],
    "image_token_offsets": [[0, 576]],
    # ... other model-specific metadata
}
torch.save(embedding_dict, "llama4_embedding.pt")
```

  • Simple tensors: loaded directly and passed to the mm_embeddings parameter
  • Dictionary format: the mm_embeddings key is extracted as the main tensor; other keys are preserved as auxiliary data
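
A minimal sketch of this dispatch logic (the helper name is hypothetical; Dynamo's actual loader lives in the encode worker):

```python
import torch

def load_embedding(path: str):
    """Hypothetical loader handling both supported file formats."""
    obj = torch.load(path, map_location="cpu")
    if isinstance(obj, dict):
        # Dictionary format: extract the main tensor, keep the rest as aux data.
        embeddings = obj.pop("mm_embeddings")
        aux_data = obj
    else:
        # Simple tensor format: the file is the embedding tensor itself.
        embeddings, aux_data = obj, {}
    return embeddings, aux_data
```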

How to Launch

$cd $DYNAMO_HOME/examples/backends/trtllm
$
$# Launch 3-worker E/P/D flow with NIXL
$./launch/epd_disagg.sh

Note: This script is designed for an 8-node H200 setup with the Llama-4-Scout-17B-16E-Instruct model and assumes you have a model-specific embedding file ready.

Configuration

$# Encode endpoint for Prefill → Encode communication
$export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate"
$
$# Security: Allowed directory for embedding files (default: /tmp)
$export ALLOWED_LOCAL_MEDIA_PATH="/tmp"
$
$# Security: Max file size to prevent DoS attacks (default: 50MB)
$export MAX_FILE_SIZE_MB=50
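
These variables imply checks along the following lines before a file is loaded (a sketch; the function and constant names are illustrative, not Dynamo's actual code):

```python
import os

ALLOWED_DIR = os.environ.get("ALLOWED_LOCAL_MEDIA_PATH", "/tmp")
MAX_BYTES = int(os.environ.get("MAX_FILE_SIZE_MB", "50")) * 1024 * 1024

def validate_embedding_path(path: str) -> str:
    """Illustrative validation: confine paths to ALLOWED_DIR, cap file size."""
    real = os.path.realpath(path)  # resolve symlinks and ".." segments
    if not real.startswith(os.path.realpath(ALLOWED_DIR) + os.sep):
        raise PermissionError(f"{path} is outside {ALLOWED_DIR}")
    if os.path.getsize(real) > MAX_BYTES:
        raise ValueError(f"{path} exceeds the {MAX_BYTES}-byte limit")
    return real
```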

Example Request with Pre-computed Embeddings

$curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
> "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
> "messages": [
> {
> "role": "user",
> "content": [
> {"type": "text", "text": "Describe the image"},
> {"type": "image_url", "image_url": {"url": "/path/to/embedding.pt"}}
> ]
> }
> ],
> "max_tokens": 160
>}'

E/P/D Architecture

The E/P/D flow implements a 3-worker architecture:

  • Encode Worker: Loads pre-computed embeddings, transfers via NIXL
  • Prefill Worker: Receives embeddings, handles context processing and KV-cache generation
  • Decode Worker: Performs streaming token generation

Multi-node Deployment (Slurm)

This section demonstrates how to deploy large multimodal models that require a multi-node setup using Slurm.

Note: The scripts referenced in this section can be found in examples/basics/multinode/trtllm/.

Environment Setup

Assuming you have allocated your nodes via salloc and are inside an interactive shell:

$# Container image (build using docs/backends/trtllm/README.md#build-container)
$export IMAGE="<dynamo_trtllm_image>"
$
$# Host:container path pairs for mounting
$export MOUNTS="${PWD}/../../../../:/mnt"
$
$# Model configuration
$export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
$export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
$export MODALITY=${MODALITY:-"multimodal"}

Multi-node Disaggregated Launch

For four 4xGB200 nodes (2 for prefill, 2 for decode):

$# Customize parallelism to match your engine configs
$# export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/prefill.yaml"
$# export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/decode.yaml"
$# export NUM_PREFILL_NODES=2
$# export NUM_DECODE_NODES=2
$# export NUM_GPUS_PER_NODE=4
$
$# Launches frontend + etcd/nats on head node, plus prefill and decode workers
$./srun_disaggregated.sh

Understanding the Output

  1. srun_disaggregated.sh launches three srun jobs: frontend, prefill worker, and decode worker
  2. The OpenAI frontend will dynamically discover workers as they register:
    INFO dynamo_run::input::http: Watching for remote model at models
    INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000
  3. TRT-LLM workers output progress from each MPI rank while loading
  4. When ready, the frontend logs:
    INFO dynamo_llm::discovery::watcher: added model model_name="meta-llama/Llama-4-Maverick-17B-128E-Instruct"

Cleanup

$pkill srun

Embedding Cache

Dynamo provides an embedding cache for TRT-LLM; support differs between the aggregated and disaggregated settings:

| Setting | Implementation | Launch Script | Status |
|---|---|---|---|
| Disaggregated Encoder | Dynamo-managed cache in the PD worker layer on top of the TRT-LLM engine | disagg_e_pd.sh + --multimodal-embedding-cache-capacity-gb | Supported |
| Aggregated | N/A | N/A | Not yet supported |

The cache uses MultimodalEmbeddingCacheManager to maintain an LRU cache of encoder embeddings on CPU. When the same image is seen again, the cached embedding is reused instead of re-encoding.

Disaggregated Encoder (Embedding Cache in Prefill Worker)

In the disaggregated setting, the Prefill Worker (P) owns a CPU-side LRU embedding cache (EmbeddingCacheManager). On each request, P checks the cache first; on a hit, the Encode Worker is skipped entirely. On a miss, P routes to the Encode Worker (E), receives the embeddings via NIXL, saves them to the cache, and then feeds the embeddings along with the request into the TRT-LLM instance for prefill.
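
A minimal sketch of this hit/miss flow with an OrderedDict-based LRU keyed by image identity (illustrative; not the actual MultimodalEmbeddingCacheManager):

```python
from collections import OrderedDict
import torch

class LRUEmbeddingCache:
    """Illustrative CPU-side LRU embedding cache with a byte-size capacity."""
    def __init__(self, capacity_gb: float):
        self.capacity = int(capacity_gb * 1024**3)
        self.used = 0
        self.cache: OrderedDict[str, torch.Tensor] = OrderedDict()

    def get(self, key: str):
        if key in self.cache:                # hit: skip the encode worker
            self.cache.move_to_end(key)      # mark as most recently used
            return self.cache[key]
        return None                          # miss: route to the encode worker

    def put(self, key: str, emb: torch.Tensor):
        emb = emb.cpu()
        if key in self.cache:
            self.used -= self._size(self.cache.pop(key))
        self.cache[key] = emb
        self.used += self._size(emb)
        while self.used > self.capacity:     # evict least recently used
            _, old = self.cache.popitem(last=False)
            self.used -= self._size(old)

    @staticmethod
    def _size(t: torch.Tensor) -> int:
        return t.element_size() * t.nelement()
```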

The disagg_e_pd.sh script launches a separate encode worker and a PD worker; extra arguments are forwarded to the PD worker. Enable the embedding cache by passing --multimodal-embedding-cache-capacity-gb:

$cd $DYNAMO_HOME/examples/backends/trtllm
$./launch/disagg_e_pd.sh --multimodal-embedding-cache-capacity-gb 10

NIXL Usage

| Use Case | Script | NIXL Used? | Data Transfer |
|---|---|---|---|
| Aggregated | agg.sh | No | All in one worker |
| EP/D (Traditional Disaggregated) | disagg_multimodal.sh | Optional | Prefill → Decode (KV cache via UCX or NIXL) |
| E/P/D (Image URLs) | epd_multimodal_image_and_embeddings.sh | No | Encoder → Prefill (handles via params), Prefill → Decode (KV cache) |
| E/P/D (Pre-computed Embeddings) | epd_multimodal_image_and_embeddings.sh | Yes | Encoder → Prefill (embeddings via NIXL RDMA) |
| E/P/D (Large Models) | epd_disagg.sh | Yes | Encoder → Prefill (embeddings via NIXL), Prefill → Decode (KV cache) |

Note: NIXL for KV cache transfer is currently in beta and only supported on the AMD64 (x86_64) architecture.

ModelInput Types and Registration

TRT-LLM workers register with Dynamo using:

| ModelInput Type | Preprocessing | Use Case |
|---|---|---|
| ModelInput.Tokens | Rust frontend may tokenize, but multimodal flows re-tokenize and build inputs in the Python worker; Rust token_ids are ignored | All TRT-LLM workers |

```python
# TRT-LLM Worker - Register with Tokens
await register_model(
    ModelInput.Tokens,  # Rust does minimal preprocessing
    model_type,         # ModelType.Chat or ModelType.Prefill
    generate_endpoint,
    model_name,
    ...
)
```

Inter-Component Communication

| Transfer Stage | Message | NIXL Transfer |
|---|---|---|
| Frontend → Prefill | Request with image URL or embedding path | No |
| Prefill → Encode (Image URL) | Request with image URL | No |
| Encode → Prefill (Image URL) | ep_disaggregated_params with multimodal_embedding_handles, processed prompt, and token IDs | No |
| Prefill → Encode (Embedding Path) | Request with embedding file path | No |
| Encode → Prefill (Embedding Path) | NIXL readable metadata + shape/dtype + auxiliary data | Yes (embeddings tensor via RDMA) |
| Prefill → Decode | disaggregated_params with _epd_metadata (prompt, token IDs) | Configurable (KV cache: NIXL default, UCX optional) |
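
For orientation, the payloads above look roughly like the following (only the field names cited in the table come from the source; everything else is illustrative):

```python
# Encode → Prefill (image URL flow): handles, not raw tensors, are passed.
encode_to_prefill_image_url = {
    "ep_disaggregated_params": {
        "multimodal_embedding_handles": ["<handle>"],  # illustrative value
        "prompt": "<processed prompt>",
        "token_ids": [101, 102, 103],
    },
}

# Encode → Prefill (embedding path flow): metadata only; the embedding
# tensor itself moves out-of-band via NIXL RDMA.
encode_to_prefill_embedding_path = {
    "nixl_metadata": "<readable descriptor>",  # illustrative field name
    "shape": [1, 576, 4096],
    "dtype": "bfloat16",
    "aux_data": {},  # auxiliary keys from dictionary-format embedding files
}
```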

Known Limitations

  • No video support - No video encoder implementation
  • No audio support - No audio encoder implementation
  • Multimodal preprocessing/tokenization happens in Python - Rust may forward token_ids, but multimodal requests are parsed and re-tokenized in the Python worker
  • Multi-node H100 limitation - Loading meta-llama/Llama-4-Maverick-17B-128E-Instruct on 8 H100 nodes with TP=16 is not possible due to head-count divisibility (num_attention_heads: 40 is not divisible by tp_size: 16)
  • llava-v1.6-mistral-7b-hf model crash - Known compatibility issue between the TRT-LLM backend and TensorRT-LLM version 1.2.0rc6.post1. To use the Llava model, download revision 52320fb52229 locally from Hugging Face.
  • Embedding file crash - Known compatibility issue between the TRT-LLM backend and TensorRT-LLM version 1.2.0rc6.post1: embedding file parsing crashes in attach_multimodal_embeddings(). To be fixed in the next TRT-LLM upgrade.

Supported Models

Dynamo supports the multimodal models listed in TensorRT-LLM's supported models documentation.

Common examples:

  • Llama 4 Vision models (Maverick, Scout) - Recommended for large-scale deployments
  • LLaVA models (e.g., llava-hf/llava-v1.6-mistral-7b-hf) - Default model for E/P/D examples
  • Qwen2-VL models - Supported in traditional disaggregated mode
  • Other vision-language models with TRT-LLM support

Key Files

| File | Description |
|---|---|
| components/src/dynamo/trtllm/main.py | Worker initialization and setup |
| components/src/dynamo/trtllm/engine.py | TensorRTLLMEngine wrapper (LLM and MultimodalEncoder) |
| components/src/dynamo/trtllm/constants.py | DisaggregationMode enum (AGGREGATED, PREFILL, DECODE, ENCODE) |
| components/src/dynamo/trtllm/encode_helper.py | Encode worker request processing (embedding-path and full EPD flows) |
| components/src/dynamo/trtllm/multimodal_processor.py | Multimodal request processing |
| components/src/dynamo/trtllm/request_handlers/handlers.py | Request handlers (EncodeHandler, PrefillHandler, DecodeHandler) |
| components/src/dynamo/trtllm/request_handlers/handler_base.py | Base handler with disaggregated params encoding/decoding |
| components/src/dynamo/trtllm/utils/disagg_utils.py | DisaggregatedParamsCodec for network transfer |
| components/src/dynamo/trtllm/utils/trtllm_utils.py | Command-line argument parsing |