FastVideo


This guide covers deploying FastVideo text-to-video generation on Dynamo using a custom worker (worker.py) exposed through the /v1/videos endpoint.

Dynamo also supports diffusion through built-in backends: SGLang Diffusion (LLM diffusion, image, video), vLLM-Omni (text-to-image, text-to-video), and TRT-LLM Video Diffusion. See the Diffusion Overview for the full support matrix.

Overview

  • Default model: FastVideo/LTX2-Distilled-Diffusers — a distilled variant of the LTX-2 Diffusion Transformer (Lightricks), reducing inference from 50+ steps to just 5.
  • Two-stage pipeline: Stage 1 generates video at target resolution; Stage 2 refines with a distilled LoRA for improved fidelity and texture.
  • Optimized inference: FP4 quantization and torch.compile are enabled by default for maximum throughput.
  • Response format: Returns one complete MP4 payload per request as data[0].b64_json (non-streaming).
  • Concurrency: One request at a time per worker (VideoGenerator is not re-entrant). Scale throughput by running multiple workers.

This example is optimized for NVIDIA B200/B300 GPUs (CUDA arch 10.0) with FP4 quantization and flash-attention. It can run on other GPUs (H100, A100, etc.) by passing --disable-optimizations to worker.py, which disables FP4 quantization and torch.compile and switches the attention backend from FLASH_ATTN to TORCH_SDPA. Expect lower performance but broader compatibility.
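As a convenience, the choice can be automated. The sketch below (an illustration, not part of the example's scripts) reads the GPU's CUDA compute capability via nvidia-smi -- the compute_cap query field requires a reasonably recent driver -- and sets the run_local.sh WORKER_EXTRA_ARGS variable accordingly:

```shell
# Sketch: pick worker flags from the GPU's CUDA compute capability.
# Arch 10.0 (B200/B300) keeps the default optimizations; older GPUs
# fall back to --disable-optimizations.
needs_fallback() {
  # $1 is a compute capability string such as "9.0" (H100) or "10.0" (B200)
  [ "${1%%.*}" -lt 10 ]
}

WORKER_EXTRA_ARGS=""
if command -v nvidia-smi >/dev/null 2>&1; then
  cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)"
  if needs_fallback "$cap"; then
    WORKER_EXTRA_ARGS="--disable-optimizations"
  fi
fi
echo "WORKER_EXTRA_ARGS=$WORKER_EXTRA_ARGS"
```

On a host without nvidia-smi the variable is simply left empty; how WORKER_EXTRA_ARGS is consumed by run_local.sh is described under Local Deployment.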

Docker Image Build

The local Docker workflow builds a runtime image from the Dockerfile:

  • Base image: nvidia/cuda:13.1.1-devel-ubuntu24.04
  • Installs FastVideo from GitHub
  • Installs Dynamo from the release/1.0.0 branch (for /v1/videos support)
  • Compiles a flash-attention fork from source

The first Docker image build can take 20–40+ minutes because FastVideo and CUDA-dependent components are compiled during the build. Subsequent builds are much faster if Docker layer cache is preserved. Compiling flash-attention can use significant RAM — low-memory builders may hit out-of-memory failures. If that happens, lower MAX_JOBS in the Dockerfile to reduce parallel compile memory usage. The flash-attn install notes specifically recommend this on machines with less than 96 GB RAM and many CPU cores.

Warmup Time

On first start, workers download model weights and run compile/warmup steps. Expect roughly 10–20 minutes before the first request is ready (hardware-dependent). After the first successful response, the second request can still take around 35 seconds while runtime caches finish warming up; steady-state performance is typically reached from the third request onward.

When using Kubernetes, mount a shared Hugging Face cache PVC (see Kubernetes Deployment) so model weights are downloaded once and reused across pod restarts.

Local Deployment

Prerequisites

For Docker Compose:

  • Docker Engine 26.0+
  • Docker Compose v2
  • NVIDIA Container Toolkit

For host-local script:

  • Python environment with Dynamo + FastVideo dependencies installed
  • CUDA-compatible GPU runtime available on host

Option 1: Docker Compose

$cd <dynamo-root>/examples/diffusers/local
$
$# Start 4 workers on GPUs 0..3
$COMPOSE_PROFILES=4 docker compose up --build

The Compose file builds from the Dockerfile and exposes the API on http://localhost:8000. See the Docker Image Build section for build time expectations.

Option 2: Host-Local Script

$cd <dynamo-root>/examples/diffusers/local
$./run_local.sh

Environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| PYTHON_BIN | python3 | Python interpreter |
| MODEL | FastVideo/LTX2-Distilled-Diffusers | Hugging Face model path |
| NUM_GPUS | 1 | Number of GPUs |
| HTTP_PORT | 8000 | Frontend HTTP port |
| WORKER_EXTRA_ARGS | (unset) | Extra flags for worker.py (e.g., --disable-optimizations) |
| FRONTEND_EXTRA_ARGS | (unset) | Extra flags for dynamo.frontend |

Example:

$MODEL=FastVideo/LTX2-Distilled-Diffusers \
>NUM_GPUS=1 \
>HTTP_PORT=8000 \
>WORKER_EXTRA_ARGS="--disable-optimizations" \
>./run_local.sh

--disable-optimizations is a worker.py flag (not a dynamo.frontend flag), so pass it through WORKER_EXTRA_ARGS.

The script writes logs to:

  • .runtime/logs/worker.log
  • .runtime/logs/frontend.log

Kubernetes Deployment

Files

| File | Description |
| --- | --- |
| agg.yaml | Base aggregated deployment (Frontend + FastVideoWorker) |
| agg_user_workload.yaml | Same deployment with user-workload tolerations and imagePullSecrets |
| huggingface-cache-pvc.yaml | Shared HF cache PVC for model weights |
| dynamo-platform-values-user-workload.yaml | Optional Helm values for clusters with tainted user-workload nodes |

Prerequisites

  1. Dynamo Kubernetes Platform installed
  2. GPU-enabled Kubernetes cluster
  3. FastVideo runtime image pushed to your registry
  4. Optional HF token secret (for gated models)

Create a Hugging Face token secret if needed:

$export NAMESPACE=<your-namespace>
$export HF_TOKEN=<your-hf-token>
$kubectl create secret generic hf-token-secret \
> --from-literal=HF_TOKEN=${HF_TOKEN} \
> -n ${NAMESPACE}

Deploy

$cd <dynamo-root>/examples/diffusers/deploy
$export NAMESPACE=<your-namespace>
$
$kubectl apply -f huggingface-cache-pvc.yaml -n ${NAMESPACE}
$kubectl apply -f agg.yaml -n ${NAMESPACE}

For clusters with tainted user-workload nodes and private registry pulls:

  1. Set your pull secret name and image in agg_user_workload.yaml.
  2. Apply:
$kubectl apply -f huggingface-cache-pvc.yaml -n ${NAMESPACE}
$kubectl apply -f agg_user_workload.yaml -n ${NAMESPACE}

Update Image Quickly

$export DEPLOYMENT_FILE=agg.yaml
$export FASTVIDEO_IMAGE=<my-registry/fastvideo-runtime:my-tag>
$
$yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FASTVIDEO_IMAGE)' \
> ${DEPLOYMENT_FILE} > ${DEPLOYMENT_FILE}.generated
$
$kubectl apply -f ${DEPLOYMENT_FILE}.generated -n ${NAMESPACE}

Verify and Access

$kubectl get dgd -n ${NAMESPACE}
$kubectl get pods -n ${NAMESPACE}
$kubectl logs -n ${NAMESPACE} -l nvidia.com/dynamo-component=FastVideoWorker
$kubectl port-forward -n ${NAMESPACE} svc/fastvideo-agg-frontend 8000:8000

Test Request

If this is the first request after startup, expect it to take longer while warmup completes. See Warmup Time for details.

Send a request and decode the response:

$curl -s -X POST http://localhost:8000/v1/videos \
> -H 'Content-Type: application/json' \
> -d '{
> "model": "FastVideo/LTX2-Distilled-Diffusers",
> "prompt": "A cinematic drone shot over a snowy mountain range at sunrise",
> "size": "1920x1088",
> "seconds": 5,
> "nvext": {
> "fps": 24,
> "num_frames": 121,
> "num_inference_steps": 5,
> "guidance_scale": 1.0,
> "seed": 10
> }
> }' > response.json
$
$# Linux
$jq -r '.data[0].b64_json' response.json | base64 --decode > output.mp4
$
$# macOS
$jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4
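A portable alternative to the platform-specific base64 flags is to decode with Python, which behaves the same on Linux and macOS (python3 on the host is an assumption). So the sketch can run in isolation, it fabricates a tiny stand-in response.json when none exists -- the bytes are not a real MP4; after a real request it decodes the actual payload:

```shell
# Create a stand-in response.json only if the curl request above hasn't produced one.
[ -f response.json ] || python3 -c '
import base64, json
sample = {"data": [{"b64_json": base64.b64encode(b"stand-in-bytes").decode()}]}
json.dump(sample, open("response.json", "w"))
'

# Decode data[0].b64_json to output.mp4; identical on Linux and macOS.
python3 -c '
import base64, json
payload = json.load(open("response.json"))
open("output.mp4", "wb").write(base64.b64decode(payload["data"][0]["b64_json"]))
'
```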

Worker Configuration Reference

CLI Flags

| Flag | Default | Description |
| --- | --- | --- |
| --model | FastVideo/LTX2-Distilled-Diffusers | Hugging Face model path |
| --num-gpus | 1 | Number of GPUs for distributed inference |
| --disable-optimizations | off | Disables FP4 quantization and torch.compile; switches attention from FLASH_ATTN to TORCH_SDPA |

Request Parameters (nvext)

| Field | Default | Description |
| --- | --- | --- |
| fps | 24 | Frames per second |
| num_frames | 121 | Total frames; overrides fps * seconds when set |
| num_inference_steps | 5 | Diffusion inference steps |
| guidance_scale | 1.0 | Classifier-free guidance scale |
| seed | 10 | RNG seed for reproducibility |
| negative_prompt | (none) | Text to avoid in generation |
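These fields map directly onto the request body shown in Test Request. As a sketch (assuming python3 on the host), the body can be assembled programmatically and saved for reuse:

```shell
# Build a /v1/videos request body using the nvext fields documented above.
# Values mirror the Test Request example.
python3 - <<'EOF' > request.json
import json

body = {
    "model": "FastVideo/LTX2-Distilled-Diffusers",
    "prompt": "A cinematic drone shot over a snowy mountain range at sunrise",
    "size": "1920x1088",
    "seconds": 5,
    "nvext": {
        "fps": 24,
        "num_frames": 121,  # overrides fps * seconds when set
        "num_inference_steps": 5,
        "guidance_scale": 1.0,
        "seed": 10,
    },
}
print(json.dumps(body))
EOF
```

The saved file can then be sent with curl -d @request.json instead of an inline JSON literal.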

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| FASTVIDEO_VIDEO_CODEC | libx264 | Video codec for MP4 encoding |
| FASTVIDEO_X264_PRESET | ultrafast | x264 encoding speed preset |
| FASTVIDEO_ATTENTION_BACKEND | FLASH_ATTN | Attention backend (FLASH_ATTN or TORCH_SDPA) |
| FASTVIDEO_STAGE_LOGGING | 1 | Enable per-stage timing logs |
| FASTVIDEO_LOG_LEVEL | (unset) | Set to DEBUG for verbose logging |

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| OOM during Docker build | flash-attention compilation uses too much RAM | Lower MAX_JOBS in the Dockerfile |
| 10–20 min wait on first start | Model download + torch.compile warmup | Expected behavior; subsequent starts are faster if weights are cached |
| ~35 s second request | Runtime caches still warming | Steady-state performance from the third request onward |
| Poor performance on non-B200/B300 GPUs | FP4 and flash-attention optimizations require CUDA arch 10.0 | Pass --disable-optimizations to worker.py |

Source Code

The example source lives at examples/diffusers/ in the Dynamo repository.

See Also