This guide covers deploying FastVideo text-to-video generation on Dynamo using a custom worker (worker.py) exposed through the /v1/videos endpoint.
Dynamo also supports diffusion through built-in backends: SGLang Diffusion (LLM diffusion, image, video), vLLM-Omni (text-to-image, text-to-video), and TRT-LLM Diffusion (text-to-image, text-to-video). See the Diffusion Overview for the full support matrix.
FastVideo/LTX2-Distilled-Diffusers is a distilled variant of the LTX-2 Diffusion Transformer (Lightricks) that reduces inference from 50+ steps to just 5.
FP4 and torch.compile are available via --enable-optimizations; attention backend selection is controlled separately via --attention-backend.
The generated video is returned in data[0].b64_json (non-streaming).
worker.py defaults to --attention-backend TORCH_SDPA for broader compatibility across GPUs, including systems such as H100. For the B200/B300-oriented path, enable FP4/compile with --enable-optimizations and, if desired, opt into flash-attention explicitly with --attention-backend FLASH_ATTN.
The local Docker workflow builds a runtime image from the Dockerfile:
nvidia/cuda:13.1.1-devel-ubuntu24.04
release/1.0.0 branch (for /v1/videos support)
The Dockerfile exposes TORCH_CUDA_ARCH_LIST as a build argument (default: 10.0 10.0a for Blackwell). Pass --build-arg to target a different architecture:
MAX_JOBS (default: 4) controls parallel compilation jobs for flash-attention. Lower it if the build runs out of memory:
When using Docker Compose, set these as environment variables before running docker compose up --build:
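As a minimal sketch of both paths, the helpers below assemble a docker build command and a Docker Compose environment for a non-default architecture. They assume MAX_JOBS is exposed as a build argument alongside TORCH_CUDA_ARCH_LIST; the image tag fastvideo-dynamo:local and the helper names are hypothetical, and 9.0 is used only as an example value for Hopper-class GPUs such as H100.

```python
import os
from typing import Dict, List


def docker_build_cmd(arch_list: str, max_jobs: int,
                     tag: str = "fastvideo-dynamo:local") -> List[str]:
    """Return an argv list for `docker build` with both build arguments set.

    The tag is a hypothetical placeholder; MAX_JOBS being a build argument
    is an assumption based on the text above.
    """
    return [
        "docker", "build",
        "--build-arg", f"TORCH_CUDA_ARCH_LIST={arch_list}",
        "--build-arg", f"MAX_JOBS={max_jobs}",
        "-t", tag,
        ".",
    ]


def compose_env(arch_list: str, max_jobs: int) -> Dict[str, str]:
    """Return a copy of the current environment with the build settings added.

    Pass the result as `env=` when launching `docker compose up --build`.
    """
    env = dict(os.environ)
    env["TORCH_CUDA_ARCH_LIST"] = arch_list
    env["MAX_JOBS"] = str(max_jobs)
    return env


# Command used for the Compose path, as named in the text above.
COMPOSE_UP_CMD = ["docker", "compose", "up", "--build"]
```

To run either path, hand the result to subprocess.run, for example subprocess.run(docker_build_cmd("9.0", 2), check=True) or subprocess.run(COMPOSE_UP_CMD, env=compose_env("9.0", 2), check=True). Passing the command as a list avoids shell quoting problems with values that contain spaces, such as the default 10.0 10.0a.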
The first Docker image build can take 20–40+ minutes because FastVideo and CUDA-dependent components are compiled during the build. Subsequent builds are much faster if the Docker layer cache is preserved. Compiling flash-attention can use significant RAM, so low-memory builders may hit out-of-memory failures. If that happens, lower MAX_JOBS in the Dockerfile to reduce parallel compile memory usage. The flash-attn install notes specifically recommend this on machines with less than 96 GB RAM and many CPU cores.
On first start, workers download model weights. When --enable-optimizations is enabled, compile/warmup steps can push the first ready time to roughly 10–20 minutes (hardware-dependent). After the first successful optimized response, the second request can still take around 35 seconds while runtime caches finish warming up; steady-state performance is typically reached from the third request onward.
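Because the first ready time can run to many minutes, client scripts may want to wait for readiness rather than fail on the first refused connection. The following is a minimal sketch of such a wait loop. The readiness check itself is left to a caller-supplied probe function, since this guide does not name a health endpoint; wait_until_ready and its parameters are hypothetical names, and the 30-minute default timeout is simply a margin above the 10–20 minute estimate.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 1800.0,
                     interval_s: float = 15.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `probe` repeatedly until it returns True or the timeout expires.

    `probe` is supplied by the caller and should return True once the
    deployment answers requests. Returns True on success, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The sleep parameter exists only so the loop can be exercised in tests without real delays; in normal use, leave it at its default.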
When using Kubernetes, mount a shared Hugging Face cache PVC (see Kubernetes Deployment) so model weights are downloaded once and reused across pod restarts.
For Docker Compose:
For the host-local script:
The Compose file builds from the Dockerfile and exposes the API on http://localhost:8000. See the Docker Image Build section for build time expectations.
Environment variables:
Example:
--enable-optimizations and --attention-backend are worker.py flags, not dynamo.frontend flags, so pass them through WORKER_EXTRA_ARGS when you want a non-default worker configuration.
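One way to assemble the WORKER_EXTRA_ARGS value programmatically is sketched below. It uses only the two worker.py flags named above; the helper names are hypothetical, and how your launch command consumes the variable is assumed to match the description in this section.

```python
import os
import shlex
from typing import Dict, List, Optional


def worker_extra_args(enable_optimizations: bool = False,
                      attention_backend: Optional[str] = None) -> str:
    """Build a shell-safe string of worker.py flags for WORKER_EXTRA_ARGS."""
    args: List[str] = []
    if enable_optimizations:
        args.append("--enable-optimizations")
    if attention_backend is not None:
        args += ["--attention-backend", attention_backend]
    # shlex.join quotes each item so the string survives shell word splitting.
    return shlex.join(args)


def launch_env(extra_args: str) -> Dict[str, str]:
    """Return a copy of the current environment with WORKER_EXTRA_ARGS set."""
    env = dict(os.environ)
    env["WORKER_EXTRA_ARGS"] = extra_args
    return env
```

For the B200/B300-oriented path, worker_extra_args(True, "FLASH_ATTN") yields the string to export; pass launch_env(...) as the env argument of subprocess.run together with whichever launch command you use.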
The script writes logs to:
.runtime/logs/worker.log
.runtime/logs/frontend.log
Create a Hugging Face token secret if needed:
For clusters with tainted user-workload nodes and private registry pulls:
agg_user_workload.yaml.
If this is the first request after startup, expect it to take longer while warmup completes. See Warmup Time for details.
Send a request and decode the response:
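A minimal Python sketch of this round trip follows, using only the standard library. The URL combines the documented http://localhost:8000 address with the /v1/videos endpoint, and the video is read from data[0].b64_json as described earlier. The request field names model and prompt are assumptions rather than a confirmed schema, and the function names and long default timeout are illustrative.

```python
import base64
import json
import urllib.request

API_URL = "http://localhost:8000/v1/videos"
MODEL = "FastVideo/LTX2-Distilled-Diffusers"


def build_request(prompt: str) -> urllib.request.Request:
    """Build a JSON POST request for the video endpoint.

    The "model" and "prompt" field names are assumptions; adjust them to
    match the actual request schema.
    """
    body = json.dumps({"model": MODEL, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_video(response: dict) -> bytes:
    """Return the raw video bytes stored in data[0].b64_json."""
    return base64.b64decode(response["data"][0]["b64_json"])


def generate(prompt: str, out_path: str, timeout_s: float = 1800.0) -> None:
    """Send one non-streaming request and write the decoded video to out_path."""
    request = build_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        payload = json.load(resp)
    with open(out_path, "wb") as f:
        f.write(decode_video(payload))
```

Calling generate("A cat surfing a wave at sunset", "output.mp4") would send one request and save the result; the output file name and extension are placeholders. The generous timeout reflects that the first request after startup can take much longer than steady-state requests.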
nvext)
The example source lives at examples/diffusers/ in the Dynamo repository.
/v1/videos