This guide covers deploying FastVideo text-to-video generation on Dynamo using a custom worker (worker.py) exposed through the /v1/videos endpoint.
Dynamo also supports diffusion through built-in backends: SGLang Diffusion (LLM diffusion, image, video), vLLM-Omni (text-to-image, text-to-video), and TRT-LLM Video Diffusion. See the Diffusion Overview for the full support matrix.
FastVideo/LTX2-Distilled-Diffusers: a distilled variant of the LTX-2 Diffusion Transformer (Lightricks) that reduces inference from 50+ steps to just 5.
FP4 quantization and torch.compile are enabled by default for maximum throughput.
The generated video is returned in data[0].b64_json (non-streaming).
This example is optimized for NVIDIA B200/B300 GPUs (CUDA arch 10.0) with FP4 quantization and flash-attention. It can run on other GPUs (H100, A100, etc.) if you pass --disable-optimizations to worker.py, which disables FP4 quantization and torch.compile and switches the attention backend from FLASH_ATTN to TORCH_SDPA. Expect lower performance but broader compatibility.
The local Docker workflow builds a runtime image from the Dockerfile:
nvidia/cuda:13.1.1-devel-ubuntu24.04
release/1.0.0 branch (for /v1/videos support)
The first Docker image build can take 20–40+ minutes because FastVideo and CUDA-dependent components are compiled during the build. Subsequent builds are much faster if the Docker layer cache is preserved. Compiling flash-attention can use significant RAM, so low-memory builders may hit out-of-memory failures. If that happens, lower MAX_JOBS in the Dockerfile to reduce parallel compile memory usage. The flash-attn install notes specifically recommend this on machines with less than 96 GB RAM and many CPU cores.
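As a rough illustration of the MAX_JOBS advice above, the following Python sketch picks a conservative value from a builder's RAM and core count. The 96 GB threshold comes from the flash-attn guidance mentioned above; the helper name suggest_max_jobs and the fallback of 4 jobs are assumptions for illustration, not values taken from this example's Dockerfile.

```python
import os


def suggest_max_jobs(ram_gb: float, cpu_cores: int, low_ram_jobs: int = 4) -> int:
    """Suggest a MAX_JOBS value for compiling flash-attention.

    Hypothetical helper: below 96 GB of RAM, cap parallel compile jobs at
    `low_ram_jobs` (an assumed default); otherwise allow one job per core.
    """
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    if ram_gb < 96:
        return min(cpu_cores, low_ram_jobs)
    return cpu_cores


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    # RAM is passed explicitly here; detect it however suits your builder.
    print(suggest_max_jobs(ram_gb=64, cpu_cores=cores))
```

If builds still run out of memory at the suggested value, lowering it further trades build time for a smaller peak memory footprint.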
On first start, workers download model weights and run compile/warmup steps. Expect roughly 10–20 minutes before the first request is ready (hardware-dependent). After the first successful response, the second request can still take around 35 seconds while runtime caches finish warming up; steady-state performance is typically reached from the third request onward.
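One way to observe the warmup behavior described above is to time the first few requests in sequence. The sketch below is a minimal, client-agnostic timer; send_request is a hypothetical callable that you would replace with your own function that calls the endpoint.

```python
import time
from typing import Callable, List


def time_requests(send_request: Callable[[], object], count: int = 3) -> List[float]:
    """Call send_request `count` times in sequence; return each latency in seconds."""
    latencies = []
    for _ in range(count):
        start = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    # Stand-in for a real request; replace with your own client function.
    def fake_request() -> None:
        time.sleep(0.01)

    for i, seconds in enumerate(time_requests(fake_request), start=1):
        print(f"request {i}: {seconds:.2f}s")
```

With a real client, the first two latencies would be expected to exceed the third, which approximates steady state.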
When using Kubernetes, mount a shared Hugging Face cache PVC (see Kubernetes Deployment) so model weights are downloaded once and reused across pod restarts.
For Docker Compose:
For host-local script:
The Compose file builds from the Dockerfile and exposes the API on http://localhost:8000. See the Docker Image Build section for build time expectations.
Environment variables:
Example:
--disable-optimizations is a worker.py flag (not a dynamo.frontend flag), so pass it through WORKER_EXTRA_ARGS.
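To make the pass-through concrete, here is a minimal Python sketch of how a launcher could forward WORKER_EXTRA_ARGS to worker.py. It assumes the variable is split shell-style and appended to the worker command line; the helper build_worker_command is hypothetical, and the actual launch script may handle the variable differently.

```python
import os
import shlex
import sys
from typing import List, Mapping


def build_worker_command(env: Mapping[str, str]) -> List[str]:
    """Build a worker.py command line, appending any WORKER_EXTRA_ARGS.

    Assumption: extra arguments are split with shell-style quoting rules.
    """
    extra_args = shlex.split(env.get("WORKER_EXTRA_ARGS", ""))
    return [sys.executable, "worker.py", *extra_args]


if __name__ == "__main__":
    env = dict(os.environ, WORKER_EXTRA_ARGS="--disable-optimizations")
    print(build_worker_command(env))
```

Keeping worker flags in a separate variable means they cannot be mistaken for dynamo.frontend flags.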
The script writes logs to:
.runtime/logs/worker.log
.runtime/logs/frontend.log
Create a Hugging Face token secret if needed:
For clusters with tainted user-workload nodes and private registry pulls:
agg_user_workload.yaml.
If this is the first request after startup, expect it to take longer while warmup completes. See Warmup Time for details.
Send a request and decode the response:
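A minimal Python sketch of this flow, using only the standard library, might look like the following. The URL combines the localhost:8000 address and /v1/videos path mentioned in this guide. The request body fields ("model", "prompt") and the long timeout are assumptions for illustration, so check them against the API schema before relying on them.

```python
import base64
import json
import urllib.request
from pathlib import Path

API_URL = "http://localhost:8000/v1/videos"  # frontend address from the Compose setup


def request_video(prompt: str, model: str, timeout_s: float = 1800.0) -> dict:
    """POST a generation request and return the parsed JSON response.

    The body fields ("model", "prompt") are assumptions; adjust to the API schema.
    """
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.load(resp)


def save_video(response: dict, out_path: str) -> int:
    """Decode data[0].b64_json and write the MP4 bytes; return bytes written."""
    video_bytes = base64.b64decode(response["data"][0]["b64_json"])
    Path(out_path).write_bytes(video_bytes)
    return len(video_bytes)
```

Calling save_video(request_video(prompt, model), "output.mp4") would then write the generated clip to disk. Because the response is non-streaming, the whole payload arrives in one JSON body, which is why the sketch uses a generous timeout.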
nvext)
The example source lives at examples/diffusers/ in the Dynamo repository.
/v1/videos