SGLang Video Generation

Overview

This guide shows how to benchmark text-to-video generation APIs using SGLang and AIPerf. You’ll learn how to set up the SGLang video generation server, create input prompts, run benchmarks, and analyze the results.

Video generation follows an asynchronous job pattern:

  1. Submit - POST to /v1/videos with your prompt, receive a job ID
  2. Poll - GET /v1/videos/{id} until status is completed or failed
  3. Download - GET /v1/videos/{id}/content to retrieve the generated video

AIPerf handles this polling workflow automatically.
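The three-step flow above can be sketched in Python. This is an illustrative sketch, not AIPerf's actual implementation: the endpoint paths follow the pattern described above, and the `fetch` callable is injected so the loop can be demonstrated without a live server.

```python
import time

def poll_until_done(video_id, fetch, interval_s=0.5, timeout_s=3600):
    """Poll GET /v1/videos/{id} until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = fetch(f"/v1/videos/{video_id}")  # returns the parsed JSON body
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(interval_s)
    raise TimeoutError(f"video {video_id} did not finish within {timeout_s}s")

# Demo with a fake fetcher that completes on the third status check
states = iter(["queued", "in_progress", "completed"])
fake_fetch = lambda path: {"id": "vid_1", "status": next(states)}
print(poll_until_done("vid_1", fake_fetch, interval_s=0)["status"])  # completed
```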

Supported Models

AIPerf supports any SGLang-compatible text-to-video model, including:

| Model | Model Path | Notes |
|---|---|---|
| Wan2.1-T2V | Wan-AI/Wan2.1-T2V-1.3B-Diffusers | Lightweight, good for testing |
| Wan2.1-T2V (14B) | Wan-AI/Wan2.1-T2V-14B-Diffusers | Higher quality, requires more VRAM |
| HunyuanVideo | tencent/HunyuanVideo | Tencent's video generation model |

Setting Up the Server

Export your Hugging Face token as an environment variable:

$export HF_TOKEN=<your-huggingface-token>

Option 1: Docker Installation

Start the SGLang Docker container:

$docker run --gpus all \
> --shm-size 32g \
> -it \
> --rm \
> -p 30010:30010 \
> -v ~/.cache/huggingface:/root/.cache/huggingface \
> --env "HF_TOKEN=$HF_TOKEN" \
> --ipc=host \
> lmsysorg/sglang:dev

The following steps are to be performed inside the SGLang Docker container.

Install the diffusion dependencies:

$uv pip install "sglang[diffusion]" --prerelease=allow --system

Set the server arguments:

The following arguments set up the SGLang server to use Wan2.1-T2V-1.3B on port 30010. Adjust --num-gpus, --ulysses-degree, and --ring-degree based on your GPU configuration.

Single GPU setup:

$SERVER_ARGS=(
> --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
> --text-encoder-cpu-offload
> --pin-cpu-memory
> --num-gpus 1
> --port 30010
> --host 0.0.0.0
>)

Multi-GPU setup (4 GPUs with sequence parallelism):

$SERVER_ARGS=(
> --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
> --text-encoder-cpu-offload
> --pin-cpu-memory
> --num-gpus 4
> --ulysses-degree 2
> --ring-degree 2
> --port 30010
> --host 0.0.0.0
>)

Start the SGLang server:

$sglang serve "${SERVER_ARGS[@]}"

Wait until the server is ready (watch the logs for the following message):

Uvicorn running on http://0.0.0.0:30010 (Press CTRL+C to quit)

Option 2: Native Installation

Install SGLang with diffusion support:

$pip install --upgrade pip
$pip install uv
$uv pip install "sglang[diffusion]" --prerelease=allow

Start the server:

$sglang serve \
> --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --text-encoder-cpu-offload \
> --pin-cpu-memory \
> --num-gpus 1 \
> --port 30010 \
> --host 0.0.0.0

Running the Benchmark

The following steps are to be performed on your local machine (outside the SGLang Docker container).

Basic Usage: Text-to-Video with Input File

Create an input file with video prompts:

$cat > video_prompts.jsonl << 'EOF'
${"text": "A serene lake at sunset with mountains in the background"}
${"text": "A cat playing with a ball of yarn in a cozy living room"}
${"text": "A futuristic city with flying cars and neon lights"}
$EOF
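If you prefer to build the prompt file programmatically, a short Python equivalent writes the same three prompts, one JSON object per line:

```python
import json

prompts = [
    "A serene lake at sunset with mountains in the background",
    "A cat playing with a ball of yarn in a cozy living room",
    "A futuristic city with flying cars and neon lights",
]

# Each JSONL line is a standalone JSON object with a "text" field
with open("video_prompts.jsonl", "w") as f:
    for p in prompts:
        f.write(json.dumps({"text": p}) + "\n")
```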

Run the benchmark:

$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --input-file video_prompts.jsonl \
> --custom-dataset-type single_turn \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --concurrency 1 \
> --request-count 3

Done! This sends 3 requests to http://localhost:30010/v1/videos and polls until each video is complete.

Sample Output (Successful Run):

NVIDIA AIPerf | Video Generation Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p50 ┃ std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
│ Request Latency (ms) │ 45,234.56 │ 42,123.45 │ 48,567.89 │ 48,432.12 │ 47,654.32 │ 45,012.34 │ 2634.78 │
│ Input Sequence Length (tokens) │ 8.33 │ 7.00 │ 10.00 │ 9.98 │ 9.80 │ 8.00 │ 1.25 │
│ Request Throughput (requests/sec) │ 0.02 │ - │ - │ - │ - │ - │ - │
│ Request Count (requests) │ 3.00 │ - │ - │ - │ - │ - │ - │
└───────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴─────────┘

Basic Usage: Text-to-Video with Synthetic Prompts

Generate videos using synthetic prompts with configurable token lengths:

$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --synthetic-input-tokens-mean 50 \
> --synthetic-input-tokens-stddev 10 \
> --concurrency 1 \
> --request-count 5

Generation Parameters

Control video generation through --extra-inputs:

| Parameter | Description | Example |
|---|---|---|
| size | Video resolution | 1280x720, 720x1280, 480x480 |
| seconds | Video duration in seconds | 4, 8, 12 |
| seed | Random seed for reproducibility | 42 |
| num_inference_steps | Diffusion denoising steps | 50 |
| guidance_scale | Classifier-free guidance scale | 7.5 |
| negative_prompt | Concepts to exclude | "blurry, low quality" |
| fps | Frames per second | 24 |
| num_frames | Total frames to generate | 48 |
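Each `--extra-inputs` entry is a `key:value` pair that AIPerf forwards in the request body. The exact coercion rules are internal to AIPerf, but the mapping can be sketched roughly as follows (the `parse_extra_inputs` helper is hypothetical, for illustration only):

```python
def parse_extra_inputs(pairs):
    """Rough sketch: turn 'key:value' strings into request-body fields.
    Values that parse as numbers are coerced; everything else stays a string.
    (Illustrative only -- AIPerf's real parsing is internal.)"""
    body = {}
    for pair in pairs:
        key, _, value = pair.partition(":")
        for cast in (int, float):
            try:
                body[key] = cast(value)
                break
            except ValueError:
                continue
        else:
            body[key] = value
    return body

payload = parse_extra_inputs(["size:1280x720", "seconds:4", "guidance_scale:7.5"])
print(payload)  # {'size': '1280x720', 'seconds': 4, 'guidance_scale': 7.5}
```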

Video Download Option:

Use --download-video-content to include video content download in the benchmark timing. When enabled, request latency includes the time to download the generated video from the server. By default, only generation time is measured.

$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --input-file video_prompts.jsonl \
> --custom-dataset-type single_turn \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --download-video-content \
> --concurrency 1 \
> --request-count 3

Example with advanced parameters:

$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --input-file video_prompts.jsonl \
> --custom-dataset-type single_turn \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:8" \
> --extra-inputs "seed:42" \
> --extra-inputs "guidance_scale:7.5" \
> --extra-inputs "num_inference_steps:50" \
> --concurrency 1 \
> --request-count 3

Polling Configuration

AIPerf automatically handles polling for video generation. Configure polling behavior:

| Setting | Description | Default |
|---|---|---|
| --request-timeout-seconds | Maximum wait time before timeout | 21600 (6 hours) |
| AIPERF_HTTP_VIDEO_POLL_INTERVAL | Seconds between status checks (0.1-60) | 0.1 |

Example with custom timeout and polling interval:

$# Set slower polling (0.5s) with 20 minute timeout
$AIPERF_HTTP_VIDEO_POLL_INTERVAL=0.5 aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --input-file video_prompts.jsonl \
> --custom-dataset-type single_turn \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --request-timeout-seconds 1200 \
> --concurrency 1 \
> --request-count 3
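As a back-of-envelope check, the polling interval trades status-check overhead against completion-detection latency. For a generation that takes roughly 45 seconds (as in the sample output above):

```python
def expected_status_checks(generation_s, interval_s):
    """Approximate number of GET /v1/videos/{id} calls for one job."""
    return round(generation_s / interval_s)

print(expected_status_checks(45, 0.1))  # default 0.1s interval -> ~450 checks
print(expected_status_checks(45, 0.5))  # 0.5s interval -> ~90 checks
```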

Advanced Usage: Extracting Generated Videos

To extract and save the generated videos, use --export-level raw to capture the full response payloads.

Run the benchmark with raw export:

$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --input-file video_prompts.jsonl \
> --custom-dataset-type single_turn \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --concurrency 1 \
> --request-count 3 \
> --export-level raw

Download the generated videos:

The response contains a URL to download the video. Copy the following script to download_videos.py:

#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Download generated videos from AIPerf JSONL output file."""
import json
import os
from pathlib import Path
import sys
import urllib.request

# Read input file path
input_file = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(
    'artifacts/Wan-AI_Wan2.1-T2V-1.3B-Diffusers-openai-video_generation-concurrency1/profile_export_raw.jsonl'
)
output_dir = Path(sys.argv[2]) if len(sys.argv) > 2 else Path('downloaded_videos')

# Create output directory
os.makedirs(output_dir, exist_ok=True)

# Process each line in the JSONL file
with open(input_file, 'r') as f:
    for line_num, line in enumerate(f, 1):
        record = json.loads(line)

        # Extract video URL from responses (look for the completed status response)
        for response in record.get('responses', []):
            response_data = json.loads(response.get('text', '{}'))

            # Check if this is a completed video response with a URL
            if response_data.get('status') == 'completed' and response_data.get('url'):
                video_url = response_data['url']
                video_id = response_data.get('id', f'video_{line_num}')

                # Download the video
                filename = output_dir / f"{video_id}.mp4"
                print(f"Downloading: {video_url}")

                try:
                    urllib.request.urlretrieve(video_url, filename)
                    print(f"Saved: {filename.resolve()}")
                except Exception as e:
                    print(f"Failed to download {video_id}: {e}")

print(f"\nVideos saved to: {output_dir.resolve()}")

Run the script:

$python download_videos.py

Output:

Downloading: http://localhost:30010/v1/videos/video_abc123/content
Saved: /path/to/downloaded_videos/video_abc123.mp4
Downloading: http://localhost:30010/v1/videos/video_def456/content
Saved: /path/to/downloaded_videos/video_def456.mp4
Downloading: http://localhost:30010/v1/videos/video_ghi789/content
Saved: /path/to/downloaded_videos/video_ghi789.mp4
Videos saved to: /path/to/downloaded_videos

Benchmark Scenarios

Scenario 1: Throughput Testing

Test maximum throughput with multiple concurrent requests:

$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --extra-inputs "size:720x480" \
> --extra-inputs "seconds:4" \
> --synthetic-input-tokens-mean 30 \
> --concurrency 4 \
> --request-count 20
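For context, steady-state throughput on a job-based API is roughly bounded by concurrency divided by per-video latency. Using the ~45-second latencies from the sample output above as a rough figure:

```python
# Rough upper bound on request throughput for a polling-based workload
latency_s = 45.0   # approximate per-video generation time (from sample output)
concurrency = 4
estimated_rps = concurrency / latency_s
print(f"{estimated_rps:.3f} requests/sec")  # ~0.089 requests/sec
```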

Scenario 2: Latency Testing

Test single-request latency for different video sizes:

$# Short 4-second video
$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --synthetic-input-tokens-mean 50 \
> --concurrency 1 \
> --request-count 5
$
$# Longer 8-second video
$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:8" \
> --synthetic-input-tokens-mean 50 \
> --concurrency 1 \
> --request-count 5

Scenario 3: Quality vs Speed Trade-off

Compare generation quality at different inference step counts:

$# Fast generation (fewer steps)
$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --extra-inputs "num_inference_steps:20" \
> --synthetic-input-tokens-mean 50 \
> --concurrency 1 \
> --request-count 5
$
$# High quality (more steps)
$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --extra-inputs "num_inference_steps:100" \
> --synthetic-input-tokens-mean 50 \
> --concurrency 1 \
> --request-count 5

Troubleshooting

Connection Refused

If you see Connection refused errors:

  1. Verify the SGLang server is running: curl http://localhost:30010/health
  2. Check the port matches your server configuration
  3. If using Docker, ensure port mapping is correct (-p 30010:30010)

Timeout Errors

If requests time out during generation:

  1. Increase the request timeout: --request-timeout-seconds 1200
  2. Check server logs for errors
  3. Reduce video resolution or duration for faster generation

Out of Memory

If the server crashes with OOM errors:

  1. Use a smaller model (e.g., Wan2.1-T2V-1.3B instead of 14B)
  2. Reduce video resolution: --extra-inputs "size:720x480"
  3. Enable CPU offloading: --text-encoder-cpu-offload
  4. Reduce concurrency: --concurrency 1

Model Not Found

If you see model loading errors:

  1. Verify your Hugging Face token has access to the model
  2. Check the model path is correct
  3. Ensure sufficient disk space for model download

Response Fields

The video generation API returns the following fields:

| Field | Description |
|---|---|
| id | Unique video job identifier (mapped to video_id internally) |
| object | Object type, always "video" |
| status | Job status: queued, in_progress, completed, failed |
| progress | Completion percentage (0-100) |
| url | Download URL (only when status=completed) |
| size | Video resolution (e.g., "1280x720") |
| seconds | Video duration (returned as string) |
| quality | Quality setting for the generated video |
| model | Model used for generation |
| created_at | Unix timestamp of job creation |
| completed_at | Unix timestamp of completion |
| expires_at | Unix timestamp when video assets expire |
| inference_time_s | Total generation time in seconds |
| peak_memory_mb | Peak GPU memory usage in MB |
| error | Error details if status=failed |
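A minimal sketch of consuming these fields, using a hand-written sample response shaped like the table above (all values are illustrative, not captured from a real server):

```python
import json

sample = json.dumps({
    "id": "video_abc123",
    "object": "video",
    "status": "completed",
    "progress": 100,
    "url": "http://localhost:30010/v1/videos/video_abc123/content",
    "size": "1280x720",
    "seconds": "4",  # note: duration comes back as a string
    "model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
})

job = json.loads(sample)
if job["status"] == "completed" and job.get("url"):
    print(f"download {job['id']} from {job['url']}")
elif job["status"] == "failed":
    print(f"job failed: {job.get('error')}")
```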

Conclusion

You’ve successfully set up SGLang for video generation, run benchmarks with AIPerf, and learned how to download the generated videos. You can now experiment with different models, prompts, resolutions, and generation parameters to optimize your text-to-video workloads.

Key takeaways:

  • Use --endpoint-type video_generation
  • Control video parameters via --extra-inputs
  • The transport handles polling automatically
  • Use --export-level raw to capture full responses for video extraction

Now go forth and generate!