Multimodal Model Serving

Deploy multimodal models with image, video, and audio support in Dynamo

Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text.

Security Requirement: Multimodal processing must be explicitly enabled at startup. See the relevant backend documentation (vLLM, SGLang, TRT-LLM) for the necessary flags. This prevents unintended processing of multimodal data from untrusted sources.
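Once multimodal processing is enabled, requests follow the standard OpenAI-compatible chat format, with images passed as `image_url` content parts. The sketch below builds such a payload; the model name, the base64 data-URL encoding, and the frontend address in the comment are illustrative assumptions, not Dynamo-specific requirements.

```python
import base64
import json

# Placeholder image bytes; in practice, read these from a file.
image_bytes = b"\x89PNG\r\n\x1a\n"

# Encode the image as a base64 data URL (one common way to inline images).
image_b64 = base64.b64encode(image_bytes).decode()

# OpenAI-compatible chat payload mixing text and image content parts.
# The model name here is an example, not a required value.
payload = {
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
    "max_tokens": 128,
}

print(json.dumps(payload, indent=2))
# Assuming a local frontend, send with e.g.:
# requests.post("http://localhost:8000/v1/chat/completions", json=payload)
```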

Key Features

Dynamo improves latency and throughput for vision-and-language workloads through the following features, which can be used together or independently depending on your workload characteristics:

| Feature | Description |
|---------|-------------|
| Embedding Cache | CPU-side LRU cache that skips re-encoding repeated images |
| Encoder Disaggregation | Separate vision encoder worker for independent scaling |
| Multimodal KV Routing | MM-aware KV cache routing for optimal worker selection |
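To make the embedding-cache idea concrete, here is a minimal sketch of a CPU-side LRU cache keyed by a hash of the image bytes, so repeated images skip the vision encoder. This is a hypothetical illustration of the technique, not Dynamo's actual implementation; the class name and capacity default are invented for the example.

```python
import hashlib
from collections import OrderedDict


class EmbeddingCache:
    """Illustrative CPU-side LRU cache for vision embeddings.

    Keys are content hashes of the raw image bytes, so the same image
    submitted twice hits the cache instead of the vision encoder.
    (Hypothetical sketch, not Dynamo's internal implementation.)
    """

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._cache: OrderedDict[str, object] = OrderedDict()

    @staticmethod
    def _key(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def get(self, image_bytes: bytes):
        key = self._key(image_bytes)
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        return None  # cache miss: caller runs the encoder and calls put()

    def put(self, image_bytes: bytes, embedding) -> None:
        key = self._key(image_bytes)
        self._cache[key] = embedding
        self._cache.move_to_end(key)
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
```

Hashing the bytes rather than a URL means the cache also deduplicates identical images that arrive under different names.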

Support Matrix

| Stack | Image | Video | Audio |
|-------|-------|-------|-------|
| vLLM | 🧪 | 🧪 | |
| TRT-LLM | | | |
| SGLang | | | |

Status: ✅ Supported | 🧪 Experimental | ❌ Not supported

Example Workflows

Reference implementations for deploying multimodal models:

Backend Documentation

Detailed deployment guides, configuration, and examples for each backend: