Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text.
Dynamo improves latency and throughput for vision-and-language workloads through the following features, which can be used together or separately depending on your workload characteristics:
Status: ✅ Supported | 🧪 Experimental | ❌ Not supported
Reference implementations for deploying multimodal models:
Detailed deployment guides, configuration, and examples for each backend: