vLLM-Omni
vLLM-Omni
vLLM-Omni
Dynamo supports multimodal generation through the vLLM-Omni backend. This integration exposes text-to-text, text-to-image, text-to-video, and text-to-audio (TTS) capabilities via OpenAI-compatible API endpoints.
This guide assumes familiarity with deploying Dynamo with vLLM as described in the vLLM backend guide.
Dynamo container images include vLLM-Omni pre-installed. If you are using pip install ai-dynamo[vllm], vLLM-Omni is not included automatically because the matching release is not yet available on PyPI. Install it separately from source, pinning the vLLM-Omni release that matches your installed vLLM version (see the vLLM-Omni releases page):
ARM64 not supported: vLLM-Omni is currently only installed on
amd64builds. Onarm64, the container build skips the install and vLLM-Omni features are unavailable.
The --output-modalities flag determines which endpoint(s) the worker registers. When set to image, both /v1/chat/completions (returns inline base64 images) and /v1/images/generations are available. When set to video, the worker serves /v1/videos. When set to audio, the worker serves /v1/audio/speech.
To run a non-default model, pass --model to any launch script:
Launch an aggregated deployment (frontend + omni worker):
This starts Qwen/Qwen2.5-Omni-7B with a single-stage thinker config on one GPU.
Verify the deployment:
This script uses a custom stage config (stage_configs/single_stage_llm.yaml) that configures the thinker stage for text generation. See Stage Configuration for details.
Launch using the provided script with Qwen/Qwen-Image:
/v1/chat/completionsThe response includes base64-encoded images inline:
/v1/images/generationsLaunch using the provided script with Wan-AI/Wan2.1-T2V-1.3B-Diffusers:
Generate a video via /v1/videos:
The response returns a video URL or base64 data depending on response_format:
The /v1/videos endpoint also accepts NVIDIA extensions via the nvext field for fine-grained control:
Image-to-video (I2V) uses the same /v1/videos endpoint as text-to-video, with an additional input_reference field that provides the source image. The image can be an HTTP URL, a base64 data URI, or a local file path.
Launch with the provided script using Wan-AI/Wan2.2-TI2V-5B-Diffusers:
Generate a video from an image:
The input_reference field accepts:
"https://example.com/image.png""data:image/png;base64,iVBORw0KGgo...""/path/to/image.png" or "file:///path/to/image.png"The I2V-specific nvext fields (boundary_ratio, guidance_scale_2) control the dual-expert MoE denoising schedule in Wan2.x models. See Wan2.2-I2V model card for details.
Launch using the provided script with Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice:
The /v1/audio/speech endpoint follows the vLLM-Omni API format. All TTS-specific parameters are top-level fields:
Available voices and languages are loaded dynamically from the model’s config.json at startup. Non-Qwen3-TTS audio models (e.g., MiMo-Audio) use a generic text prompt and ignore TTS-specific parameters.
The omni backend uses a dedicated entrypoint: python -m dynamo.vllm.omni.
Generated images, videos, and audio files are stored via fsspec, which supports local filesystems, S3, GCS, and Azure Blob.
By default, media is written to the local filesystem at file:///tmp/dynamo_media. To use cloud storage:
When --media-output-http-url is set, response URLs are rewritten as {base-url}/{storage-path} (e.g., https://cdn.example.com/media/videos/req-id.mp4). When unset, the raw filesystem path is returned.
For S3 credential configuration, set the standard AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) or use IAM roles. See the fsspec S3 docs for details.
Omni pipelines are configured via YAML stage configs. See examples/backends/vllm/launch/stage_configs/single_stage_llm.yaml for an example. For full documentation on stage config format and multi-stage pipelines, refer to the vLLM-Omni Stage Configs documentation.
For models with multiple pipeline stages (e.g., AR + Diffusion), Dynamo supports disaggregated serving where each stage runs as an independent process on its own GPU. This enables independent scaling, GPU isolation, and multi-worker replicas per stage.
Each stage runs as an independent process on its own GPU. A lightweight router coordinates them, acting as a pure message broker — it never inspects or transforms inter-stage data.
How it works:
ar2diffusion, thinker2talker), then runs its engine.GLM-Image is a 2-stage text-to-image model with an AR stage (generates prior token IDs) and a DiT stage (diffusion denoising + VAE decode). The built-in vLLM-Omni stage config already assigns each stage to a separate GPU.
Experimental: GLM-Image support is experimental; generation may fail or produce incorrect/garbled outputs for some prompts and sizes.
Test:
Each stage registers independently with Dynamo’s service discovery. To scale a bottleneck stage, launch additional workers with the same --stage-id on different GPUs — the router automatically load-balances across all replicas for that stage. Other stages are unaffected.
input_reference in /v1/videos. Other endpoints accept text prompts only.stream: true) is not yet supported.async_chunk=true (streaming between stages) is not yet supported.