FastVideo

This guide covers deploying FastVideo text-to-video generation on Dynamo using a custom worker (worker.py) exposed through the /v1/videos endpoint.

Dynamo also supports diffusion through built-in backends: SGLang Diffusion (LLM diffusion, image, video), vLLM-Omni (text-to-image, text-to-video), and TRT-LLM Diffusion (text-to-image, text-to-video). See the Diffusion Overview for the full support matrix.

Overview

  • Default model: FastVideo/LTX2-Distilled-Diffusers — a distilled variant of the LTX-2 Diffusion Transformer (Lightricks), reducing inference from 50+ steps to just 5.
  • Two-stage pipeline: Stage 1 generates video at target resolution; Stage 2 refines with a distilled LoRA for improved fidelity and texture.
  • Optimized inference: FP4 quantization and torch.compile are available via --enable-optimizations; attention backend selection is controlled separately via --attention-backend.
  • Response format: Returns one complete MP4 payload per request as data[0].b64_json (non-streaming).
  • Concurrency: One request at a time per worker (VideoGenerator is not re-entrant). Scale throughput by running multiple workers.
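The response shape above can be handled with a short Python sketch. The helper name and error handling here are illustrative, assuming only the `data[0].b64_json` schema described in this guide:

```python
import base64
import json

def save_video(response_json: str, path: str) -> int:
    """Decode the single MP4 payload from a /v1/videos response body.

    Returns the number of bytes written to `path`.
    """
    payload = json.loads(response_json)
    # Non-streaming: exactly one complete MP4 arrives as data[0].b64_json.
    video_bytes = base64.b64decode(payload["data"][0]["b64_json"])
    with open(path, "wb") as f:
        f.write(video_bytes)
    return len(video_bytes)
```

This mirrors the `jq | base64 --decode` pipeline shown later in the Test Request section, but in one portable step.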

worker.py defaults to --attention-backend TORCH_SDPA for broader compatibility across GPUs, including systems such as H100. For the B200/B300-oriented path, enable FP4/compile with --enable-optimizations and, if desired, opt into flash-attention explicitly with --attention-backend FLASH_ATTN.

Docker Image Build

The local Docker workflow builds a runtime image from the Dockerfile:

  • Base image: nvidia/cuda:13.1.1-devel-ubuntu24.04
  • Installs FastVideo from GitHub
  • Installs Dynamo from the release/1.0.0 branch (for /v1/videos support)
  • Compiles a flash-attention fork from source

The Dockerfile exposes TORCH_CUDA_ARCH_LIST as a build argument (default: 10.0 10.0a for Blackwell). Pass --build-arg to target a different architecture:

$# Blackwell (default)
$docker build examples/diffusers/ --build-arg TORCH_CUDA_ARCH_LIST="10.0 10.0a"
$
$# Hopper
$docker build examples/diffusers/ --build-arg TORCH_CUDA_ARCH_LIST="9.0 9.0a"

MAX_JOBS (default: 4) controls parallel compilation jobs for flash-attention. Lower it if the build runs out of memory:

$docker build examples/diffusers/ --build-arg MAX_JOBS=2

When using Docker Compose, set these as environment variables before running docker compose up --build:

$# Hopper on a memory-constrained builder
$TORCH_CUDA_ARCH_LIST="9.0 9.0a" MAX_JOBS=2 COMPOSE_PROFILES=4 docker compose up --build

The first Docker image build can take 20–40+ minutes because FastVideo and CUDA-dependent components are compiled during the build; subsequent builds are much faster when the Docker layer cache is preserved. Compiling flash-attention can use significant RAM, so low-memory builders may hit out-of-memory failures. If that happens, lower MAX_JOBS (via --build-arg or by editing the Dockerfile) to reduce parallel compile memory usage. The flash-attn install notes specifically recommend this on machines with less than 96 GB RAM and many CPU cores.

Warmup Time

On first start, workers download model weights. When --enable-optimizations is enabled, compile/warmup steps can push the first ready time to roughly 10–20 minutes (hardware-dependent). After the first successful optimized response, the second request can still take around 35 seconds while runtime caches finish warming up; steady-state performance is typically reached from the third request onward.

When using Kubernetes, mount a shared Hugging Face cache PVC (see Kubernetes Deployment) so model weights are downloaded once and reused across pod restarts.
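Given these warmup times, a client can poll for readiness before sending its first real request. Below is a minimal polling sketch; the actual readiness check (e.g. an HTTP GET against the frontend) is supplied by the caller, since the exact health endpoint path is deployment-specific and not specified here:

```python
import time

def wait_until_ready(probe, timeout_s=1800, interval_s=15.0):
    """Call `probe` (a zero-arg callable returning True once the worker is
    ready) until it succeeds or `timeout_s` elapses. Returns seconds waited."""
    start = time.monotonic()
    while True:
        if probe():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError(f"worker not ready after {timeout_s}s")
        time.sleep(interval_s)
```

A 30-minute default timeout comfortably covers the 10–20 minute compile/warmup window described above.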

Local Deployment

Prerequisites

For Docker Compose:

  • Docker Engine 26.0+
  • Docker Compose v2
  • NVIDIA Container Toolkit

For host-local script:

  • Python environment with Dynamo + FastVideo dependencies installed
  • CUDA-compatible GPU runtime available on host

Option 1: Docker Compose

$cd <dynamo-root>/examples/diffusers/local
$
$# Start 4 workers on GPUs 0..3
$COMPOSE_PROFILES=4 docker compose up --build

The Compose file builds from the Dockerfile and exposes the API on http://localhost:8000. See the Docker Image Build section for build time expectations.

Option 2: Host-Local Script

$cd <dynamo-root>/examples/diffusers/local
$./run_local.sh

Environment variables:

| Variable | Default | Description |
|---|---|---|
| PYTHON_BIN | python3 | Python interpreter |
| MODEL | FastVideo/LTX2-Distilled-Diffusers | HuggingFace model path |
| NUM_GPUS | 1 | Number of GPUs |
| HTTP_PORT | 8000 | Frontend HTTP port |
| WORKER_EXTRA_ARGS | | Extra flags for worker.py (for example, --enable-optimizations --attention-backend FLASH_ATTN) |
| FRONTEND_EXTRA_ARGS | | Extra flags for dynamo.frontend |

Example:

$MODEL=FastVideo/LTX2-Distilled-Diffusers \
>NUM_GPUS=1 \
>HTTP_PORT=8000 \
>WORKER_EXTRA_ARGS="--enable-optimizations --attention-backend FLASH_ATTN" \
>./run_local.sh

--enable-optimizations and --attention-backend are worker.py flags, not dynamo.frontend flags, so pass them through WORKER_EXTRA_ARGS when you want a non-default worker configuration.

The script writes logs to:

  • .runtime/logs/worker.log
  • .runtime/logs/frontend.log

Kubernetes Deployment

Files

| File | Description |
|---|---|
| agg.yaml | Base aggregated deployment (Frontend + FastVideoWorker) |
| agg_user_workload.yaml | Same deployment with user-workload tolerations and imagePullSecrets |
| huggingface-cache-pvc.yaml | Shared HF cache PVC for model weights |
| dynamo-platform-values-user-workload.yaml | Optional Helm values for clusters with tainted user-workload nodes |

Prerequisites

  1. Dynamo Kubernetes Platform installed
  2. GPU-enabled Kubernetes cluster
  3. FastVideo runtime image pushed to your registry
  4. Optional HF token secret (for gated models)

Create a Hugging Face token secret if needed:

$export NAMESPACE=<your-namespace>
$export HF_TOKEN=<your-hf-token>
$kubectl create secret generic hf-token-secret \
> --from-literal=HF_TOKEN=${HF_TOKEN} \
> -n ${NAMESPACE}

Deploy

$cd <dynamo-root>/examples/diffusers/deploy
$export NAMESPACE=<your-namespace>
$
$kubectl apply -f huggingface-cache-pvc.yaml -n ${NAMESPACE}
$kubectl apply -f agg.yaml -n ${NAMESPACE}

For clusters with tainted user-workload nodes and private registry pulls:

  1. Set your pull secret name and image in agg_user_workload.yaml.
  2. Apply:
$kubectl apply -f huggingface-cache-pvc.yaml -n ${NAMESPACE}
$kubectl apply -f agg_user_workload.yaml -n ${NAMESPACE}

Update Image Quickly

$export DEPLOYMENT_FILE=agg.yaml
$export FASTVIDEO_IMAGE=<my-registry/fastvideo-runtime:my-tag>
$
$yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FASTVIDEO_IMAGE)' \
> ${DEPLOYMENT_FILE} > ${DEPLOYMENT_FILE}.generated
$
$kubectl apply -f ${DEPLOYMENT_FILE}.generated -n ${NAMESPACE}

Verify and Access

$kubectl get dgd -n ${NAMESPACE}
$kubectl get pods -n ${NAMESPACE}
$kubectl logs -n ${NAMESPACE} -l nvidia.com/dynamo-component=FastVideoWorker
$kubectl port-forward -n ${NAMESPACE} svc/fastvideo-agg-frontend 8000:8000

Test Request

If this is the first request after startup, expect it to take longer while warmup completes. See Warmup Time for details.

Send a request and decode the response:

$curl -s -X POST http://localhost:8000/v1/videos \
> -H 'Content-Type: application/json' \
> -d '{
> "model": "FastVideo/LTX2-Distilled-Diffusers",
> "prompt": "A cinematic drone shot over a snowy mountain range at sunrise",
> "size": "1920x1088",
> "seconds": 5,
> "nvext": {
> "fps": 24,
> "num_frames": 121,
> "num_inference_steps": 5,
> "guidance_scale": 1.0,
> "seed": 10
> }
> }' > response.json
$
$# Linux
$jq -r '.data[0].b64_json' response.json | base64 --decode > output.mp4
$
$# macOS
$jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4

Worker Configuration Reference

CLI Flags

| Flag | Default | Description |
|---|---|---|
| --model | FastVideo/LTX2-Distilled-Diffusers | HuggingFace model path |
| --num-gpus | 1 | Number of GPUs for distributed inference |
| --enable-optimizations | off | Enables FP4 quantization and torch.compile |
| --attention-backend | TORCH_SDPA | Sets FASTVIDEO_ATTENTION_BACKEND; choices: FLASH_ATTN, TORCH_SDPA, SAGE_ATTN, SAGE_ATTN_THREE, VIDEO_SPARSE_ATTN, VMOBA_ATTN, SLA_ATTN, SAGE_SLA_ATTN |

Request Parameters (nvext)

| Field | Default | Description |
|---|---|---|
| fps | 24 | Frames per second |
| num_frames | 121 | Total frames; overrides fps * seconds when set |
| num_inference_steps | 5 | Diffusion inference steps |
| guidance_scale | 1.0 | Classifier-free guidance scale |
| seed | 10 | RNG seed for reproducibility |
| negative_prompt | | Text to avoid in generation |
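A request body using these defaults can be assembled with a small helper. The function name is illustrative; the field names and default values come from the tables in this guide:

```python
def build_video_request(prompt, size="1920x1088", seconds=5, **nvext_overrides):
    """Assemble a /v1/videos request body, applying the documented nvext defaults.

    Keyword arguments override individual nvext fields (e.g. seed=42).
    """
    nvext = {
        "fps": 24,
        "num_frames": 121,
        "num_inference_steps": 5,
        "guidance_scale": 1.0,
        "seed": 10,
    }
    nvext.update(nvext_overrides)
    return {
        "model": "FastVideo/LTX2-Distilled-Diffusers",
        "prompt": prompt,
        "size": size,
        "seconds": seconds,
        "nvext": nvext,
    }
```

Serializing the returned dict with `json.dumps` yields the same payload shown in the curl example under Test Request.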

Environment Variables

| Variable | Default | Description |
|---|---|---|
| FASTVIDEO_VIDEO_CODEC | libx264 | Video codec for MP4 encoding |
| FASTVIDEO_X264_PRESET | ultrafast | x264 encoding speed preset |
| FASTVIDEO_ATTENTION_BACKEND | TORCH_SDPA | Attention backend; worker.py sets this from --attention-backend and validates FLASH_ATTN, TORCH_SDPA, SAGE_ATTN, SAGE_ATTN_THREE, VIDEO_SPARSE_ATTN, VMOBA_ATTN, SLA_ATTN, and SAGE_SLA_ATTN |
| FASTVIDEO_STAGE_LOGGING | 1 | Enable per-stage timing logs |
| FASTVIDEO_LOG_LEVEL | | Set to DEBUG for verbose logging |
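The default-and-validate behavior described above can be sketched as follows. This is an illustrative helper, not worker.py's actual parsing code; the defaults and the valid backend list are taken from the tables in this section:

```python
import os

# Defaults mirror the environment variable table above.
ENV_DEFAULTS = {
    "FASTVIDEO_VIDEO_CODEC": "libx264",
    "FASTVIDEO_X264_PRESET": "ultrafast",
    "FASTVIDEO_ATTENTION_BACKEND": "TORCH_SDPA",
    "FASTVIDEO_STAGE_LOGGING": "1",
}

VALID_BACKENDS = {
    "FLASH_ATTN", "TORCH_SDPA", "SAGE_ATTN", "SAGE_ATTN_THREE",
    "VIDEO_SPARSE_ATTN", "VMOBA_ATTN", "SLA_ATTN", "SAGE_SLA_ATTN",
}

def read_fastvideo_env(env=os.environ):
    """Resolve FastVideo settings from the environment, applying documented
    defaults and rejecting unsupported attention backend names."""
    cfg = {key: env.get(key, default) for key, default in ENV_DEFAULTS.items()}
    backend = cfg["FASTVIDEO_ATTENTION_BACKEND"]
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unsupported attention backend: {backend}")
    return cfg
```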

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| OOM during Docker build | flash-attention compilation uses too much RAM | Pass --build-arg MAX_JOBS=2 (or lower) at build time |
| "no kernel image available for this GPU" or CUDA arch error at runtime | Image was built for a different GPU architecture | Rebuild with the correct TORCH_CUDA_ARCH_LIST (e.g. 9.0 9.0a for Hopper) |
| 10–20 min wait on first start with optimizations enabled | Model download + torch.compile warmup | Expected behavior; subsequent starts are faster if weights are cached |
| ~35 s second request | Runtime caches still warming | Steady-state performance from third request onward |
| Lower throughput than expected on B200/B300 | FP4/compile and flash-attention are configured separately | Pass --enable-optimizations and, if desired, --attention-backend FLASH_ATTN |
| Startup or import failure after enabling optimizations or changing the attention backend | FP4 and some attention backends depend on specific hardware/software support | Re-run worker.py without --enable-optimizations, or use --attention-backend TORCH_SDPA |

Source Code

The example source lives at examples/diffusers/ in the Dynamo repository.

See Also