Multimodal Model Serving
Deploy multimodal models with image, video, and audio support in Dynamo
Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text.
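To make the request shape concrete, the sketch below builds an OpenAI-style chat-completions payload that pairs a text prompt with an image URL. The endpoint path, model name, and image URL are illustrative assumptions; substitute the values from your own deployment.

```python
import json

# Hypothetical endpoint and model name -- replace with your deployment's values.
ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "llava-hf/llava-1.5-7b-hf"

payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "user",
            "content": [
                # Text and image parts travel together in one user message.
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    # Placeholder image URL for illustration only.
                    "image_url": {"url": "http://images.example.com/cat.jpg"},
                },
            ],
        }
    ],
    "max_tokens": 128,
}

# Serialize as it would be sent in the POST body (e.g. via curl or requests).
body = json.dumps(payload, indent=2)
print(body)
```

The same JSON body works across the backends listed below wherever they expose an OpenAI-compatible frontend, which keeps client code independent of the serving engine.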
Key Features
Dynamo improves latency and throughput for vision-and-language workloads through the following features, which can be used together or separately depending on your workload characteristics:
Support Matrix
Status: ✅ Supported | 🧪 Experimental | ❌ Not supported
Example Workflows
Reference implementations for deploying multimodal models:
- vLLM multimodal examples
- TRT-LLM multimodal examples
- SGLang multimodal examples
- Experimental multimodal examples (video, audio)
Backend Documentation
Detailed deployment guides, configuration, and examples for each backend: