Multimodal Support#
Dynamo supports multimodal models with the TensorRT-LLM (TRTLLM) backend. You can provide multimodal inputs in the following ways:
By sending image URLs
By providing paths to pre-computed embedding files
Note that a single request must use either image URLs or embedding file paths, not both.
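Both styles go through the same OpenAI-compatible chat completions endpoint. As a minimal sketch only (assuming a frontend listening on localhost:8000, as in the launch examples below, and a hypothetical describe helper):
import requests

# Hypothetical helper showing the request shape for both input styles.
# For image URLs, image_ref is an HTTP(S) URL; for the EPD flow described
# below, it is a local path to a pre-computed embedding file.
def describe(image_ref: str, model: str) -> str:
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image"},
                {"type": "image_url", "image_url": {"url": image_ref}},
            ],
        }],
        "max_tokens": 160,
    }
    r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]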
Aggregated#
Here are the quick steps to launch Llama-4 Maverick (BF16) in aggregated mode:
cd $DYNAMO_HOME
export AGG_ENGINE_ARGS=./examples/backends/trtllm/engine_configs/llama4/multimodal/agg.yaml
export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
./launch/agg.sh
Example Requests#
With Image URL#
Below is an example of sending an image to the Llama-4-Maverick-17B-128E-Instruct model.
Request:
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe the image"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
          }
        }
      ]
    }
  ],
  "stream": false,
  "max_tokens": 160
}'
Response:
{"id":"unknown-id","choices":[{"index":0,"message":{"content":"The image depicts a serene landscape featuring a large rock formation, likely El Capitan in Yosemite National Park, California. The scene is characterized by a winding road that curves from the bottom-right corner towards the center-left of the image, with a few rocks and trees lining its edge.\n\n**Key Features:**\n\n* **Rock Formation:** A prominent, tall, and flat-topped rock formation dominates the center of the image.\n* **Road:** A paved road winds its way through the landscape, curving from the bottom-right corner towards the center-left.\n* **Trees and Rocks:** Trees are visible on both sides of the road, with rocks scattered along the left side.\n* **Sky:** The sky above is blue, dotted with white clouds.\n* **Atmosphere:** The overall atmosphere of the","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"stop","logprobs":null}],"created":1753322607,"model":"meta-llama/Llama-4-Maverick-17B-128E-Instruct","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":null}
Disaggregated#
The following is an example of launching a model in disaggregated mode. While this example uses Qwen/Qwen2-VL-7B-Instruct, you can adapt it for other models by modifying the environment variables for the model path and engine configurations.
cd $DYNAMO_HOME
export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen2-VL-7B-Instruct"}
export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen2-VL-7B-Instruct"}
export PREFILL_ENGINE_ARGS=${PREFILL_ENGINE_ARGS:-"examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/prefill.yaml"}
export DECODE_ENGINE_ARGS=${DECODE_ENGINE_ARGS:-"examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/decode.yaml"}
export MODALITY=${MODALITY:-"multimodal"}
./launch/disagg.sh
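Once the workers are up, requests use the same OpenAI-compatible shape as in the aggregated example above. For instance, with the hypothetical describe helper sketched at the top of this page:
describe(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png",
    model="Qwen/Qwen2-VL-7B-Instruct",
)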
For a large model like meta-llama/Llama-4-Maverick-17B-128E-Instruct, aggregated serving can run on a single node, but disaggregated serving requires a multi-node setup: with separate prefill and decode engines, the model is too large to fit on a single node's GPUs. For instance, running this model in disaggregated mode requires 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs.
In general, disaggregated serving can run on a single node, provided the model fits on the GPUs. The multi-node requirement here is specific to the size and configuration of the meta-llama/Llama-4-Maverick-17B-128E-Instruct model.
To deploy Llama-4-Maverick-17B-128E-Instruct in disaggregated mode, follow the multi-node setup instructions in the Dynamo documentation.
Pre-computed Embeddings with EPD Flow#
For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an Encode-Prefill-Decode (EPD) flow using NIXL (RDMA) for zero-copy tensor transfer.
Supported File Types#
.pt - PyTorch tensor files
.pth - PyTorch checkpoint files
.bin - Binary tensor files
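An embedding file is a serialized tensor that the encode worker loads from disk. As an illustrative sketch only (the expected shape and dtype are model-specific; the values below are placeholders):
import torch

# Placeholder shape/dtype: a real embedding must come from the model's
# vision encoder and match what the serving engine expects.
embedding = torch.randn(1, 576, 4096, dtype=torch.float16)

# Save under /tmp, the default ALLOWED_LOCAL_MEDIA_PATH (see Configuration below).
torch.save(embedding, "/tmp/embedding.pt")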
How to Launch#
cd $DYNAMO_HOME/examples/backends/trtllm
# Launch 3-worker EPD flow with NIXL
./launch/epd_disagg.sh
Note: This script is designed for an 8-node H200 setup with the Llama-4-Scout-17B-16E-Instruct model and assumes you have a model-specific embedding file ready.
Configuration#
# Encode endpoint for Prefill → Encode communication
export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate"
# Security: Allowed directory for embedding files (default: /tmp)
export ALLOWED_LOCAL_MEDIA_PATH="/tmp"
# Security: Max file size to prevent DoS attacks (default: 50MB)
export MAX_FILE_SIZE_MB=50
Example Request#
The url field points to a local embedding file; the path must resolve inside the directory allowed by ALLOWED_LOCAL_MEDIA_PATH (default /tmp).
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe the image"},
        {"type": "image_url", "image_url": {"url": "/path/to/embedding.pt"}}
      ]
    }
  ],
  "max_tokens": 160
}'
Architecture#
The EPD flow implements a 3-worker architecture:
Encode Worker: Loads pre-computed embeddings and transfers them via NIXL
Prefill Worker: Receives embeddings, handles context processing and KV-cache generation
Decode Worker: Performs streaming token generation
Request Flow#
sequenceDiagram
participant Client
participant Frontend
participant PrefillWorker as "Prefill Worker"
participant EncodeWorker as "Encode Worker"
participant DecodeWorker as "Decode Worker"
participant NIXL as "NIXL (RDMA)"
Client->>Frontend: POST /v1/chat/completions
Frontend->>PrefillWorker: Route to prefill worker
PrefillWorker->>EncodeWorker: Send request (embedding paths)
EncodeWorker->>NIXL: Create readable operation
EncodeWorker->>PrefillWorker: Send metadata + NIXL info
PrefillWorker->>NIXL: Begin read operation
NIXL-->>PrefillWorker: Zero-copy transfer complete
PrefillWorker->>Frontend: Return prefill response
Frontend->>DecodeWorker: Route to decode worker
DecodeWorker->>Frontend: Stream response chunks
Frontend->>Client: Stream response
Supported Multimodal Models#
Dynamo supports the multimodal models listed in the TensorRT-LLM supported models documentation.