For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Digest
  • Getting Started
    • Quickstart
    • Installation
    • Support Matrix
    • Feature Matrix
    • Examples
  • Kubernetes Deployment
  • User Guides
    • Tool Calling
    • Multimodality Support
    • Finding Best Initial Configs
    • Dynamo Benchmarking Guide
    • Tuning Disaggregated Performance
    • Writing Python Workers in Dynamo
    • Glossary
  • Components
    • Router
  • Design Docs
    • Overall Architecture
    • Architecture Flow
    • Disaggregated Serving
    • Distributed Runtime
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
Digest
On this page
  • Backend Documentation
  • Support Matrix
  • Backend Capabilities
  • Input Format Support
  • Architecture Patterns
  • EPD - Simple Aggregated
  • E/PD - Encode Separate
  • E/P/D - Full Disaggregation
  • EP/D - Traditional Disaggregated
  • Example Workflows
User Guides

Multimodal Inference in Dynamo

||View as Markdown|
Previous

Tool Calling with Dynamo

Next

Finding Best Initial Configs using AIConfigurator

Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models.

Security Requirement: Multimodal processing must be explicitly enabled at startup. See the relevant documentation for each backend for the necessary flags.

This prevents unintended processing of multimodal data from untrusted sources.

Backend Documentation

1:maxdepth: 1
2
3vLLM Multimodal <vllm.md>
4TensorRT-LLM Multimodal <trtllm.md>
5SGLang Multimodal <sglang.md>

Support Matrix

Backend Capabilities

StackE/PDE/P/DEP/DEPDImageVideoAudio
vLLM✅✅✅✅✅✅🧪
TRT-LLM❌🚧*✅✅✅❌❌
SGLang✅✅❌❌✅❌❌

* E/P/D supported in TRT-LLM with pre-computed embeddings only; image URL support is WIP (PR #4668)

Pattern Key:

  • EPD - All-in-one worker (Simple Aggregated)
  • E/PD - Separate encode, combined prefill+decode
  • E/P/D - All stages separate
  • EP/D - Combined encode+prefill, separate decode

Status: ✅ Supported | 🚧 WIP | 🧪 Experimental | ❌ Not supported

Input Format Support

FormatvLLMTRT-LLMSGLang
HTTP/HTTPS URL✅✅✅
Data URL (Base64)✅❌❌
Pre-computed Embeddings (.pt)❌✅❌

Architecture Patterns

Dynamo supports several deployment patterns for multimodal inference based on two dimensions:

  1. Encoding: Is media encoding handled inline (within prefill) or by a separate Encode Worker?

    • Inline: Simpler setup, encoding happens in the prefill worker
    • Separate (EPD): Dedicated encode worker transfers embeddings via NIXL (RDMA), enabling independent scaling
  2. Prefill/Decode: Are prefill and decode in the same worker or separate?

    • Aggregated: Single worker handles both prefill and decode
    • Disaggregated: Separate workers for prefill and decode, with KV cache transfer between them

These combine into four deployment patterns:

EPD - Simple Aggregated

All processing happens within a single worker - the simplest setup.

HTTP Frontend (Rust)
↓
Worker (Python)
↓ image load + encode + prefill + decode
Response
ComponentPurpose
Frontend (Rust)HTTP entry point, tokenization, image URL preprocessing
WorkerComplete inference pipeline (encode + prefill + decode)

When to use: Quick setup, smaller models, development/testing.

E/PD - Encode Separate

Encoding happens in a separate worker; prefill and decode share the same engine.

HTTP Frontend (Rust)
↓
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
PD Worker (Python)
↓ receives embeddings via NIXL, prefill + decode
Response
ComponentPurpose
Frontend (Rust)HTTP entry point
Processor (Python)Tokenization, extracts media URLs
Encode WorkerMedia encoding, embeddings generation
PD WorkerPrefill + Decode with embeddings

When to use: Offload vision encoding to separate GPU, scale encode workers independently.

E/P/D - Full Disaggregation

Full disaggregation with separate workers for encoding, prefill, and decode. There are two variants of this workflow:

  • Prefill-first, used by vLLM
  • Decode-first, used by SGlang

Prefill-first:

HTTP Frontend (Rust)
↓
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
Prefill Worker (Python)
↓ receives embeddings via NIXL, prefill only, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response

OR

Decode-first:

HTTP Frontend (Rust)
↓
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
Decode Worker (Python)
↓ Bootstraps prefill worker
Prefill Worker (Python)
↓ receives embeddings via NIXL, prefill only, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
ComponentPurpose
Frontend (Rust)HTTP entry point
Processor (Python)Tokenization, extracts media URLs
Encode WorkerMedia encoding, embeddings generation
Prefill WorkerPrefill only, transfers KV cache
Decode WorkerDecode only, token generation

When to use: Maximum optimization, multi-node deployment, independent scaling of each phase.

EP/D - Traditional Disaggregated

Encoding is combined with prefill, with decode separate.

HTTP Frontend (Rust)
↓
Processor (Python)
↓ tokenizes, extracts media URL
Encode+Prefill Worker (Python)
↓ downloads media, encodes inline, prefill, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
ComponentPurpose
Frontend (Rust)HTTP entry point
Processor (Python)Tokenization, extracts media URLs (vLLM only)
Encode+Prefill WorkerCombined encoding and prefill
Decode WorkerDecode only, token generation

TRT-LLM’s EP/D mode skips the Python Processor - the Rust frontend handles tokenization and routes directly to the Prefill worker. For multimodal requests, the Python prefill worker still re-tokenizes/builds inputs; Rust token_ids are ignored.

When to use: Models without pre-computed embedding support (Llama 4), or TRT-LLM disaggregated deployment.

Example Workflows

You can find example workflows and reference implementations for deploying multimodal models in:

  • vLLM multimodal examples
  • TRT-LLM multimodal examples
  • SGLang multimodal examples
  • Advanced multimodal examples (video, audio)