Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models.
Security Requirement: Multimodal processing must be explicitly enabled at startup, which prevents unintended processing of multimodal data from untrusted sources.
See each backend's documentation for the necessary flags.
* E/P/D is supported in TRT-LLM with pre-computed embeddings only; image URL support is WIP (PR #4668).
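Pattern Key: E = encode, P = prefill, D = decode. A slash separates stages that run in separate workers; for example, EP/D combines encode with prefill and runs decode separately.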
Status: ✅ Supported | 🚧 WIP | 🧪 Experimental | ❌ Not supported
Dynamo supports several deployment patterns for multimodal inference based on two dimensions:
Encoding: Is media encoding handled inline (within prefill) or by a separate Encode Worker?
Prefill/Decode: Are prefill and decode in the same worker or separate?
These combine into four deployment patterns:
All processing happens within a single worker, which makes this the simplest setup.
When to use: Quick setup, smaller models, development/testing.
Encoding happens in a separate worker; prefill and decode share the same engine.
When to use: Offload vision encoding to a separate GPU, or scale encode workers independently.
Full disaggregation with separate workers for encoding, prefill, and decode. There are two variants of this workflow:
Prefill-first:
OR
Decode-first:
When to use: Maximum optimization, multi-node deployment, independent scaling of each phase.
Encoding is combined with prefill, with decode separate.
TRT-LLM’s EP/D mode skips the Python Processor: the Rust frontend handles tokenization and routes requests directly to the Prefill worker. For multimodal requests, the Python prefill worker still re-tokenizes the prompt and builds its own inputs, so the token_ids produced by the Rust frontend are ignored.
When to use: Models without pre-computed embedding support (Llama 4), or TRT-LLM disaggregated deployment.
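The way the two dimensions combine into four patterns can be sketched in a few lines of Python. This is only an illustration of the naming logic, not part of Dynamo: the `Pattern` class and `all_patterns` function are hypothetical, and the names `EPD` and `E/PD` are extrapolated here from the same slash notation used for `E/P/D` and `EP/D`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """Hypothetical helper: one deployment pattern described by the two dimensions."""

    separate_encode: bool  # True: a separate Encode Worker handles media encoding
    separate_decode: bool  # True: prefill and decode run in different workers

    @property
    def name(self) -> str:
        # Stages that share a worker are written together; "/" separates workers.
        groups = ["E"]
        if self.separate_encode:
            groups.append("P")
        else:
            groups[-1] += "P"
        if self.separate_decode:
            groups.append("D")
        else:
            groups[-1] += "D"
        return "/".join(groups)


def all_patterns() -> list[str]:
    """Enumerate every combination of the two dimensions."""
    return [
        Pattern(separate_encode=e, separate_decode=d).name
        for e in (False, True)
        for d in (False, True)
    ]


if __name__ == "__main__":
    for pattern_name in all_patterns():
        print(pattern_name)
```

Two boolean choices yield exactly four combinations, which is why there are four patterns and no others under this model.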
You can find example workflows and reference implementations for deploying multimodal models in: