Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text.
Dynamo improves latency and throughput for vision-and-language workloads through the following features, which can be used together or separately depending on your workload characteristics:
Status: ✅ Supported | 🧪 Experimental | ❌ Not supported
All multimodal loaders route remote fetches through a shared URL policy (dynamo.common.multimodal.url_validator).
By default, only https:// and data: URLs are allowed, private/internal IPs are blocked, and local file access is disabled.
Every HTTP redirect hop is re-validated against the policy.
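
To make the policy concrete, the following is a minimal sketch of the three default rules (scheme allow-list, private-address blocking, per-hop redirect validation). It is not the code in dynamo.common.multimodal.url_validator: the names `validate_url`, `fetch_with_policy`, and the `send` callable are hypothetical, and the use of `ipaddress.is_global` to decide what counts as "internal" is an assumption.

```python
import ipaddress
import socket
from urllib.parse import urljoin, urlsplit

# Sketch only: NOT Dynamo's actual url_validator implementation.
ALLOWED_SCHEMES = {"https", "data"}  # the defaults described above
REDIRECT_CODES = (301, 302, 303, 307, 308)


def validate_url(url, resolve=socket.getaddrinfo):
    """Raise ValueError if the URL violates the sketched policy."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        # Rejects http://, file://, and anything else off the allow-list.
        raise ValueError(f"scheme not allowed: {scheme!r}")
    if scheme == "data":
        return  # inline payload, no network fetch involved
    host = parts.hostname
    if not host:
        raise ValueError("URL has no host")
    # Check every address the host resolves to, not only the first one.
    for info in resolve(host, parts.port or 443, type=socket.SOCK_STREAM):
        addr = info[4][0]
        if not ipaddress.ip_address(addr).is_global:
            raise ValueError(f"host resolves to non-public address: {addr}")


def fetch_with_policy(url, send, max_hops=5, resolve=socket.getaddrinfo):
    """Follow redirects manually so each hop is validated before it is requested.

    `send(url)` is a hypothetical callable returning (status, location, body)
    without following redirects itself.
    """
    for _ in range(max_hops + 1):
        validate_url(url, resolve)
        status, location, body = send(url)
        if status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # handles relative Location headers
            continue
        return body
    raise ValueError("too many redirects")
```

Validating inside the redirect loop is what stops a public URL from bouncing the fetcher to an internal address. This sketch resolves DNS separately from the eventual connection, so it does not address DNS rebinding; a production validator would need to pin the resolved address.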
Two environment variables loosen the defaults for non-public deployments:
Never set DYN_MM_ALLOW_INTERNAL=1 on public-facing deployments. It opens SSRF paths to cloud metadata endpoints (AWS IMDS, GCE, Azure) and other internal services.
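
One way to enforce the warning above is a startup check that refuses to run when the override is enabled on a public deployment. The sketch below is illustrative only: `DEPLOYMENT_EXPOSURE` is a hypothetical variable invented for this example (Dynamo does not define it), and treating only the literal value `1` as "enabled" is an assumption based on the wording of the warning.

```python
import os


def internal_fetch_enabled(env=os.environ):
    # Assumption: only the literal value "1" enables the override.
    return env.get("DYN_MM_ALLOW_INTERNAL") == "1"


def preflight(env=os.environ):
    """Abort startup if internal fetches are enabled on a public deployment."""
    # DEPLOYMENT_EXPOSURE is hypothetical; substitute however your
    # environment marks a service as public-facing.
    is_public = env.get("DEPLOYMENT_EXPOSURE") == "public"
    if is_public and internal_fetch_enabled(env):
        raise RuntimeError(
            "DYN_MM_ALLOW_INTERNAL=1 is set on a public-facing deployment; "
            "this exposes SSRF paths to internal services"
        )
```

Calling `preflight()` before the worker starts turns a silent misconfiguration into a hard failure.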
Reference implementations for deploying multimodal models:
Detailed deployment guides, configuration, and examples for each backend: