TensorRT-LLM Multimodal
TensorRT-LLM Multimodal
TensorRT-LLM Multimodal
This document provides a comprehensive guide for multimodal inference using TensorRT-LLM backend in Dynamo.
You can provide multimodal inputs in the following ways:
You should provide either image URLs or embedding file paths in a single request.
TRT-LLM supports aggregated and traditional disaggregated patterns. See Architecture Patterns for detailed explanations.
Quick steps to launch Llama-4 Maverick BF16 in aggregated mode:
Client:
Example using Qwen/Qwen2-VL-7B-Instruct:
For a large model like meta-llama/Llama-4-Maverick-17B-128E-Instruct, a multi-node setup is required for disaggregated serving (see Multi-node Deployment below), while aggregated serving can run on a single node. This is because the model with a disaggregated configuration is too large to fit on a single node’s GPUs. For instance, running this model in disaggregated mode requires 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs.
For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an Encode-Prefill-Decode (E/P/D) flow using NIXL (RDMA) for zero-copy tensor transfer.
.pt - PyTorch tensor files.pth - PyTorch checkpoint files.bin - Binary tensor filesTRT-LLM supports two formats for embedding files:
1. Simple Tensor Format
Direct tensor saved as .pt file containing only the embedding tensor:
2. Dictionary Format with Auxiliary Data
Dictionary containing multiple keys, used by models like Llama-4 that require additional metadata:
mm_embeddings parametermm_embeddings key extracted as main tensor, other keys preserved as auxiliary dataThis script is designed for 8-node H200 with Llama-4-Scout-17B-16E-Instruct model and assumes you have a model-specific embedding file ready.
The E/P/D flow implements a 3-worker architecture:
This section demonstrates how to deploy large multimodal models that require a multi-node setup using Slurm.
The scripts referenced in this section can be found in examples/basics/multinode/trtllm/.
Assuming you have allocated your nodes via salloc and are inside an interactive shell:
For 4 4xGB200 nodes (2 for prefill, 2 for decode):
srun_disaggregated.sh launches three srun jobs: frontend, prefill worker, and decode workerNIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture.
TRT-LLM workers register with Dynamo using:
data:image/... base64 URLs not supportedmeta-llama/Llama-4-Maverick-17B-128E-Instruct with 8 nodes of H100 with TP=16 is not possible due to head count divisibility (num_attention_heads: 40 not divisible by tp_size: 16)Multimodal models listed in TensorRT-LLM supported models are supported by Dynamo.
Common examples: