
Copyright © 2026, NVIDIA Corporation.

Multimodal Model Serving

Deploy multimodal models with image, video, and audio support in Dynamo

Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text.

Security Requirement: Multimodal processing must be explicitly enabled at startup. See the relevant backend documentation (vLLM, SGLang, TRT-LLM) for the necessary flags. This prevents unintended processing of multimodal data from untrusted sources.

Key Features

Dynamo improves latency and throughput for vision-and-language workloads through the following features, which can be used together or separately depending on your workload characteristics:

| Feature                | Description                                                |
|------------------------|------------------------------------------------------------|
| Embedding Cache        | CPU-side LRU cache that skips re-encoding repeated images  |
| Encoder Disaggregation | Separate vision encoder worker for independent scaling     |
| Multimodal KV Routing  | MM-aware KV cache routing for optimal worker selection     |
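
To make the Embedding Cache idea concrete, here is a minimal sketch of a CPU-side LRU cache that skips re-encoding an image it has already seen. It is an illustration only, not Dynamo's implementation: the `EmbeddingCache` class, the SHA-256 content key, and the `fake_encoder` stand-in are all hypothetical names chosen for this example.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List


class EmbeddingCache:
    """Illustrative LRU cache keyed by a hash of the raw image bytes (hypothetical)."""

    def __init__(self, capacity: int, encode: Callable[[bytes], List[float]]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.encode = encode  # the expensive vision-encoder call
        self._entries: "OrderedDict[str, List[float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes: bytes) -> List[float]:
        # Assumption: identical bytes mean an identical image, so hash the content.
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._entries:
            # Cache hit: mark as most recently used and skip re-encoding.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # Cache miss: run the encoder once and store the result.
        self.misses += 1
        embedding = self.encode(image_bytes)
        self._entries[key] = embedding
        if len(self._entries) > self.capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return embedding


def fake_encoder(image_bytes: bytes) -> List[float]:
    # Stand-in for a real vision encoder; returns a one-element "embedding".
    return [float(len(image_bytes))]


cache = EmbeddingCache(capacity=2, encode=fake_encoder)
cache.get(b"image-a")  # miss: encoded
cache.get(b"image-a")  # hit: encoder skipped
```

The benefit grows with how often the same images recur across requests; a workload in which every image is unique would see only misses.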

Support Matrix

| Stack   | Image | Video | Audio |
|---------|-------|-------|-------|
| vLLM    | ✅    | 🧪    | 🧪    |
| TRT-LLM | ✅    | ❌    | ❌    |
| SGLang  | ✅    | ❌    | ❌    |

Status: ✅ Supported | 🧪 Experimental | ❌ Not supported
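
If you want to check a planned deployment against the support matrix programmatically, the matrix can be encoded as a small lookup. The sketch below mirrors the statuses listed above at the time of writing; `Status`, `SUPPORT_MATRIX`, and `modality_status` are hypothetical names for illustration and are not part of Dynamo.

```python
from enum import Enum


class Status(Enum):
    SUPPORTED = "supported"
    EXPERIMENTAL = "experimental"
    NOT_SUPPORTED = "not supported"


# Hand-copied from the support matrix above; update if the matrix changes.
SUPPORT_MATRIX = {
    "vLLM": {
        "image": Status.SUPPORTED,
        "video": Status.EXPERIMENTAL,
        "audio": Status.EXPERIMENTAL,
    },
    "TRT-LLM": {
        "image": Status.SUPPORTED,
        "video": Status.NOT_SUPPORTED,
        "audio": Status.NOT_SUPPORTED,
    },
    "SGLang": {
        "image": Status.SUPPORTED,
        "video": Status.NOT_SUPPORTED,
        "audio": Status.NOT_SUPPORTED,
    },
}


def modality_status(backend: str, modality: str) -> Status:
    """Return the documented status, raising KeyError for unknown names."""
    return SUPPORT_MATRIX[backend][modality]


def backends_for(modality: str, include_experimental: bool = False) -> list:
    """List backends that handle a modality, optionally including experimental ones."""
    allowed = {Status.SUPPORTED}
    if include_experimental:
        allowed.add(Status.EXPERIMENTAL)
    return [name for name, row in SUPPORT_MATRIX.items() if row[modality] in allowed]
```

For example, `backends_for("video", include_experimental=True)` yields only vLLM, which matches the table.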

Example Workflows

Reference implementations for deploying multimodal models:

  • vLLM multimodal examples
  • TRT-LLM multimodal examples
  • SGLang multimodal examples
  • Experimental multimodal examples (video, audio)

Backend Documentation

Detailed deployment guides, configuration, and examples for each backend:

  • vLLM Multimodal
  • TensorRT-LLM Multimodal
  • SGLang Multimodal