---
title: Multimodal Model Serving
subtitle: 'Deploy multimodal models with image, video, and audio support in Dynamo'
---

Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text.

<Warning>
**Security Requirement**: Multimodal processing must be explicitly enabled at startup. See the relevant backend documentation ([vLLM](multimodal-vllm.md), [SGLang](multimodal-sglang.md), [TRT-LLM](multimodal-trtllm.md)) for the necessary flags. This prevents unintended processing of multimodal data from untrusted sources.
</Warning>

```mermaid
---
title: Sample flow for an aggregated VLM serving scenario
---
flowchart TD
    A[Request] --> B{KV cache hit?}
    B -->|Yes| C[Use KV]
    B -->|No| D{Embedding cache hit?}
    D -->|Yes| E[Load embedding]
    D -->|No| F[Run encoder]
    F --> G[save to cache]
    G --> H["PREFILL (image tokens + text tokens → KV cache)"]
    E --> H
    C --> I[DECODE]
    H --> I
    I --> J[Response]
```

## Key Features

Dynamo provides support for improving latency and throughput for vision-and-language workloads through the following features, that can be used together or separately, depending on your workload characteristics:
| Feature | Description |
|---------|-------------|
| **[Embedding Cache](/dynamo/user-guides/multimodal/embedding-cache)** | CPU-side LRU cache that skips re-encoding repeated images |
| **[Encoder Disaggregation](/dynamo/user-guides/multimodal/encoder-disaggregation)** | Separate vision encoder worker for independent scaling |
| **[Multimodal KV Routing](/dynamo/user-guides/multimodal/multimodal-kv-routing)** | MM-aware KV cache routing for optimal worker selection |

## Support Matrix

| Stack | Image | Video | Audio |
|-------|-------|-------|-------|
| **[vLLM](https://github.com/ai-dynamo/dynamo/blob/v1.0.1/docs/features/multimodal/multimodal-vllm.md)** | ✅ | 🧪  | 🧪 |
| **[TRT-LLM](https://github.com/ai-dynamo/dynamo/blob/v1.0.1/docs/features/multimodal/multimodal-trtllm.md)** | ✅ | ❌ | ❌ |
| **[SGLang](https://github.com/ai-dynamo/dynamo/blob/v1.0.1/docs/features/multimodal/multimodal-sglang.md)** | ✅ | ❌ | ❌ |

**Status:** ✅ Supported | 🧪 Experimental | ❌ Not supported

## Example Workflows

Reference implementations for deploying multimodal models:

- [vLLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/v1.0.1/examples/backends/vllm/launch)
- [TRT-LLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/v1.0.1/examples/backends/trtllm/launch)
- [SGLang multimodal examples](https://github.com/ai-dynamo/dynamo/tree/v1.0.1/examples/backends/sglang/launch)
- [Experimental multimodal examples](https://github.com/ai-dynamo/dynamo/tree/v1.0.1/examples/multimodal/launch) (video, audio)

## Backend Documentation

Detailed deployment guides, configuration, and examples for each backend:

- **[vLLM Multimodal](https://github.com/ai-dynamo/dynamo/blob/v1.0.1/docs/features/multimodal/multimodal-vllm.md)**
- **[TensorRT-LLM Multimodal](https://github.com/ai-dynamo/dynamo/blob/v1.0.1/docs/features/multimodal/multimodal-trtllm.md)**
- **[SGLang Multimodal](https://github.com/ai-dynamo/dynamo/blob/v1.0.1/docs/features/multimodal/multimodal-sglang.md)**
