Multimodal Model Serving | NVIDIA Dynamo Documentation

Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text.

Security Requirement: Multimodal processing must be explicitly enabled at startup. See the relevant backend documentation (vLLM, SGLang, TRT-LLM) for the necessary flags. This prevents unintended processing of multimodal data from untrusted sources.

Key Features

Dynamo improves latency and throughput for vision-and-language workloads through the following features, which can be used together or separately depending on your workload characteristics:

Feature	Description
Embedding Cache	CPU-side LRU cache that skips re-encoding repeated images
Encoder Disaggregation	Separate vision encoder worker for independent scaling
Multimodal KV Routing	MM-aware KV cache routing for optimal worker selection
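
To illustrate the idea behind the Embedding Cache, the following is a minimal sketch of a CPU-side LRU cache that skips re-encoding an image it has already seen. It is not Dynamo's implementation: the `EmbeddingCache` class, the `fake_encoder` stand-in, and the choice of a SHA-256 digest of the raw image bytes as the cache key are all assumptions made for illustration.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List


class EmbeddingCache:
    """Hypothetical LRU cache keyed by a hash of the raw image bytes."""

    def __init__(self, encode: Callable[[bytes], List[float]], capacity: int = 2) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._encode = encode
        self._capacity = capacity
        self._entries: "OrderedDict[str, List[float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes: bytes) -> List[float]:
        # Assumption: identical bytes mean identical images, so hash the content.
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._entries:
            # Cache hit: mark as most recently used and skip the encoder.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # Cache miss: run the (expensive) encoder and store the result.
        self.misses += 1
        embedding = self._encode(image_bytes)
        self._entries[key] = embedding
        if len(self._entries) > self._capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return embedding


def fake_encoder(image_bytes: bytes) -> List[float]:
    # Stand-in for a real vision encoder; returns a tiny deterministic vector.
    return [float(len(image_bytes)), float(sum(image_bytes) % 255)]


cache = EmbeddingCache(fake_encoder, capacity=2)
cache.get(b"image-a")  # miss: encoder runs
cache.get(b"image-a")  # hit: encoder is skipped
```

A repeated image costs one hash and one dictionary lookup instead of an encoder pass, which is why such a cache tends to help most when the same images recur across requests (for example, in multi-turn conversations about one picture).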

Support Matrix

Stack	Image	Video	Audio
vLLM	✅	🧪	🧪
TRT-LLM	✅	❌	❌
SGLang	✅	❌	❌

Status: ✅ Supported | 🧪 Experimental | ❌ Not supported
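
As a convenience, the matrix above can be encoded as a lookup table so that a client or gateway can reject unsupported modalities before dispatching a request. This is a sketch only: the `SUPPORT` dictionary and the `is_servable` helper are hypothetical names, not part of Dynamo, and the values simply mirror the table as written here.

```python
# Hypothetical encoding of the support matrix above (not a Dynamo API).
SUPPORT = {
    "vLLM": {"image": "supported", "video": "experimental", "audio": "experimental"},
    "TRT-LLM": {"image": "supported", "video": "unsupported", "audio": "unsupported"},
    "SGLang": {"image": "supported", "video": "unsupported", "audio": "unsupported"},
}


def is_servable(backend: str, modality: str, allow_experimental: bool = False) -> bool:
    """Return True if the backend can serve the modality per the matrix."""
    status = SUPPORT[backend][modality]
    if status == "supported":
        return True
    # Experimental modalities are opt-in in this sketch.
    return allow_experimental and status == "experimental"


print(is_servable("vLLM", "video"))                           # False
print(is_servable("vLLM", "video", allow_experimental=True))  # True
print(is_servable("SGLang", "audio", allow_experimental=True))  # False
```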

Example Workflows

Reference implementations for deploying multimodal models:

vLLM multimodal examples
TRT-LLM multimodal examples
SGLang multimodal examples
Experimental multimodal examples (video, audio)

Backend Documentation

Detailed deployment guides, configuration, and examples for each backend: