SGLang Video Generation

Overview

This guide shows how to benchmark text-to-video generation APIs using SGLang and AIPerf. You’ll learn how to set up the SGLang video generation server, create input prompts, run benchmarks, and analyze the results.

Video generation follows an asynchronous job pattern:

  1. Submit - POST to /v1/videos with your prompt, receive a job ID
  2. Poll - GET /v1/videos/{id} until status is completed or failed
  3. Download - GET /v1/videos/{id}/content to retrieve the generated video

AIPerf handles this polling workflow automatically.
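The three-step flow above can be sketched in Python. This is an illustrative sketch, not AIPerf's actual implementation: the endpoint paths follow the pattern described above, and the `fetch` callable is injected so the loop can be demonstrated without a live server.

```python
import time

def poll_until_done(video_id, fetch, interval_s=0.5, timeout_s=3600):
    """Poll GET /v1/videos/{id} until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = fetch(f"/v1/videos/{video_id}")  # returns the parsed JSON body
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(interval_s)
    raise TimeoutError(f"video {video_id} did not finish within {timeout_s}s")

# Demo with a fake fetcher that completes on the third status check
states = iter(["queued", "in_progress", "completed"])
fake_fetch = lambda path: {"id": "vid_1", "status": next(states)}
print(poll_until_done("vid_1", fake_fetch, interval_s=0)["status"])  # completed
```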

Supported Models

AIPerf supports any SGLang-compatible text-to-video model, including:

| Model | Model Path | Notes |
|---|---|---|
| Wan2.1-T2V | Wan-AI/Wan2.1-T2V-1.3B-Diffusers | Lightweight, good for testing |
| Wan2.1-T2V (14B) | Wan-AI/Wan2.1-T2V-14B-Diffusers | Higher quality, requires more VRAM |
| HunyuanVideo | tencent/HunyuanVideo | Tencent's video generation model |

Setting Up the Server

Export your Hugging Face token as an environment variable:

$export HF_TOKEN=<your-huggingface-token>

Option 1: Docker Installation

Start the SGLang Docker container:

$docker run --gpus all \
> --shm-size 32g \
> -it \
> --rm \
> -p 30010:30010 \
> -v ~/.cache/huggingface:/root/.cache/huggingface \
> --env "HF_TOKEN=$HF_TOKEN" \
> --ipc=host \
> lmsysorg/sglang:dev

The following steps are to be performed inside the SGLang Docker container.

Install the diffusion dependencies:

$uv pip install "sglang[diffusion]" --prerelease=allow --system

Set the server arguments:

The following arguments set up the SGLang server to use Wan2.1-T2V-1.3B on port 30010. Adjust --num-gpus, --ulysses-degree, and --ring-degree based on your GPU configuration.

Single GPU setup:

$SERVER_ARGS=(
> --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
> --text-encoder-cpu-offload
> --pin-cpu-memory
> --num-gpus 1
> --port 30010
> --host 0.0.0.0
>)

Multi-GPU setup (4 GPUs with sequence parallelism):

$SERVER_ARGS=(
> --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
> --text-encoder-cpu-offload
> --pin-cpu-memory
> --num-gpus 4
> --ulysses-degree 2
> --ring-degree 2
> --port 30010
> --host 0.0.0.0
>)

Start the SGLang server:

$sglang serve "${SERVER_ARGS[@]}"

Wait until the server is ready (watch the logs for the following message):

Uvicorn running on http://0.0.0.0:30010 (Press CTRL+C to quit)

Option 2: Native Installation

Install SGLang with diffusion support:

$pip install --upgrade pip
$pip install uv
$uv pip install "sglang[diffusion]" --prerelease=allow

Start the server:

$sglang serve \
> --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --text-encoder-cpu-offload \
> --pin-cpu-memory \
> --num-gpus 1 \
> --port 30010 \
> --host 0.0.0.0

Running the Benchmark

The following steps are to be performed on your local machine (outside the SGLang Docker container).

Basic Usage: Text-to-Video with Input File

Create an input file with video prompts:

$cat > video_prompts.jsonl << 'EOF'
${"text": "A serene lake at sunset with mountains in the background"}
${"text": "A cat playing with a ball of yarn in a cozy living room"}
${"text": "A futuristic city with flying cars and neon lights"}
$EOF
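If you prefer to build the prompt file programmatically, a short Python equivalent writes the same three prompts, one JSON object per line:

```python
import json

prompts = [
    "A serene lake at sunset with mountains in the background",
    "A cat playing with a ball of yarn in a cozy living room",
    "A futuristic city with flying cars and neon lights",
]

# Each JSONL line is a standalone JSON object with a "text" field
with open("video_prompts.jsonl", "w") as f:
    for p in prompts:
        f.write(json.dumps({"text": p}) + "\n")
```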

Run the benchmark:

$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --input-file video_prompts.jsonl \
> --custom-dataset-type single_turn \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --concurrency 1 \
> --request-count 3

Done! This sends 3 requests to http://localhost:30010/v1/videos and polls until each video is complete.

Sample Output (Successful Run):

NVIDIA AIPerf | Video Generation Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p50 ┃ std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
│ Request Latency (ms) │ 45,234.56 │ 42,123.45 │ 48,567.89 │ 48,432.12 │ 47,654.32 │ 45,012.34 │ 2634.78 │
│ Input Sequence Length (tokens) │ 8.33 │ 7.00 │ 10.00 │ 9.98 │ 9.80 │ 8.00 │ 1.25 │
│ Request Throughput (requests/sec) │ 0.02 │ - │ - │ - │ - │ - │ - │
│ Request Count (requests) │ 3.00 │ - │ - │ - │ - │ - │ - │
└───────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴─────────┘

Basic Usage: Text-to-Video with Synthetic Prompts

Generate videos using synthetic prompts with configurable token lengths:

$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --synthetic-input-tokens-mean 50 \
> --synthetic-input-tokens-stddev 10 \
> --concurrency 1 \
> --request-count 5

Generation Parameters

Control video generation through --extra-inputs:

| Parameter | Description | Example |
|---|---|---|
| size | Video resolution | 1280x720, 720x1280, 480x480 |
| seconds | Video duration in seconds | 4, 8, 12 |
| seed | Random seed for reproducibility | 42 |
| num_inference_steps | Diffusion denoising steps | 50 |
| guidance_scale | Classifier-free guidance scale | 7.5 |
| negative_prompt | Concepts to exclude | "blurry, low quality" |
| fps | Frames per second | 24 |
| num_frames | Total frames to generate | 48 |
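Each `--extra-inputs` entry is a `key:value` pair that AIPerf forwards in the request body. The exact coercion rules are internal to AIPerf, but the mapping can be sketched roughly as follows (the `parse_extra_inputs` helper is hypothetical, for illustration only):

```python
def parse_extra_inputs(pairs):
    """Rough sketch: turn 'key:value' strings into request-body fields.
    Values that parse as numbers are coerced; everything else stays a string.
    (Illustrative only -- AIPerf's real parsing is internal.)"""
    body = {}
    for pair in pairs:
        key, _, value = pair.partition(":")
        for cast in (int, float):
            try:
                body[key] = cast(value)
                break
            except ValueError:
                continue
        else:
            body[key] = value
    return body

payload = parse_extra_inputs(["size:1280x720", "seconds:4", "guidance_scale:7.5"])
print(payload)  # {'size': '1280x720', 'seconds': 4, 'guidance_scale': 7.5}
```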

Video Download Option:

Use --download-video-content to include video content download in the benchmark timing. When enabled, request latency includes the time to download the generated video from the server. By default, only generation time is measured.

$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --input-file video_prompts.jsonl \
> --custom-dataset-type single_turn \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --download-video-content \
> --concurrency 1 \
> --request-count 3

Example with advanced parameters:

$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --input-file video_prompts.jsonl \
> --custom-dataset-type single_turn \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:8" \
> --extra-inputs "seed:42" \
> --extra-inputs "guidance_scale:7.5" \
> --extra-inputs "num_inference_steps:50" \
> --concurrency 1 \
> --request-count 3

Polling Configuration

AIPerf automatically handles polling for video generation. Configure polling behavior:

| Setting | Description | Default |
|---|---|---|
| --request-timeout-seconds | Maximum wait time before timeout | 21600 (6 hours) |
| AIPERF_HTTP_VIDEO_POLL_INTERVAL | Seconds between status checks (0.1-60) | 0.1 |

Example with custom timeout and polling interval:

$# Set slower polling (0.5s) with 20 minute timeout
$AIPERF_HTTP_VIDEO_POLL_INTERVAL=0.5 aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --input-file video_prompts.jsonl \
> --custom-dataset-type single_turn \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --request-timeout-seconds 1200 \
> --concurrency 1 \
> --request-count 3
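As a back-of-envelope check, the polling interval trades status-check overhead against completion-detection latency. For a generation that takes roughly 45 seconds (as in the sample output above):

```python
def expected_status_checks(generation_s, interval_s):
    """Approximate number of GET /v1/videos/{id} calls for one job."""
    return round(generation_s / interval_s)

print(expected_status_checks(45, 0.1))  # default 0.1s interval -> ~450 checks
print(expected_status_checks(45, 0.5))  # 0.5s interval -> ~90 checks
```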

Advanced Usage: Extracting Generated Videos

To extract and save the generated videos, use --export-level raw to capture the full response payloads.

Run the benchmark with raw export:

$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --input-file video_prompts.jsonl \
> --custom-dataset-type single_turn \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --concurrency 1 \
> --request-count 3 \
> --export-level raw

Download the generated videos:

The response contains a URL to download the video. Copy the following script to download_videos.py:

#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Download generated videos from AIPerf JSONL output file."""
import json
import os
from pathlib import Path
import sys
import urllib.request

# Read input file path
input_file = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(
    'artifacts/Wan-AI_Wan2.1-T2V-1.3B-Diffusers-openai-video_generation-concurrency1/profile_export_raw.jsonl'
)
output_dir = Path(sys.argv[2]) if len(sys.argv) > 2 else Path('downloaded_videos')

# Create output directory
os.makedirs(output_dir, exist_ok=True)

# Process each line in the JSONL file
with open(input_file, 'r') as f:
    for line_num, line in enumerate(f, 1):
        record = json.loads(line)

        # Extract video URL from responses (look for the completed status response)
        for response in record.get('responses', []):
            response_data = json.loads(response.get('text', '{}'))

            # Check if this is a completed video response with a URL
            if response_data.get('status') == 'completed' and response_data.get('url'):
                video_url = response_data['url']
                video_id = response_data.get('id', f'video_{line_num}')

                # Download the video
                filename = output_dir / f"{video_id}.mp4"
                print(f"Downloading: {video_url}")

                try:
                    urllib.request.urlretrieve(video_url, filename)
                    print(f"Saved: {filename.resolve()}")
                except Exception as e:
                    print(f"Failed to download {video_id}: {e}")

print(f"\nVideos saved to: {output_dir.resolve()}")

Run the script:

$python download_videos.py

Output:

Downloading: http://localhost:30010/v1/videos/video_abc123/content
Saved: /path/to/downloaded_videos/video_abc123.mp4
Downloading: http://localhost:30010/v1/videos/video_def456/content
Saved: /path/to/downloaded_videos/video_def456.mp4
Downloading: http://localhost:30010/v1/videos/video_ghi789/content
Saved: /path/to/downloaded_videos/video_ghi789.mp4
Videos saved to: /path/to/downloaded_videos

Benchmark Scenarios

Scenario 1: Throughput Testing

Test maximum throughput with multiple concurrent requests:

$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --extra-inputs "size:720x480" \
> --extra-inputs "seconds:4" \
> --synthetic-input-tokens-mean 30 \
> --concurrency 4 \
> --request-count 20
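For context, steady-state throughput on a job-based API is roughly bounded by concurrency divided by per-video latency. Using the ~45-second latencies from the sample output above as a rough figure:

```python
# Rough upper bound on request throughput for a polling-based workload
latency_s = 45.0   # approximate per-video generation time (from sample output)
concurrency = 4
estimated_rps = concurrency / latency_s
print(f"{estimated_rps:.3f} requests/sec")  # ~0.089 requests/sec
```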

Scenario 2: Latency Testing

Test single-request latency for different video sizes:

$# Short 4-second video
$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --synthetic-input-tokens-mean 50 \
> --concurrency 1 \
> --request-count 5
$
$# Longer 8-second video
$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:8" \
> --synthetic-input-tokens-mean 50 \
> --concurrency 1 \
> --request-count 5

Scenario 3: Quality vs Speed Trade-off

Compare generation quality at different inference step counts:

$# Fast generation (fewer steps)
$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --extra-inputs "num_inference_steps:20" \
> --synthetic-input-tokens-mean 50 \
> --concurrency 1 \
> --request-count 5
$
$# High quality (more steps)
$aiperf profile \
> --model Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --tokenizer gpt2 \
> --url http://localhost:30010 \
> --endpoint-type video_generation \
> --extra-inputs "size:1280x720" \
> --extra-inputs "seconds:4" \
> --extra-inputs "num_inference_steps:100" \
> --synthetic-input-tokens-mean 50 \
> --concurrency 1 \
> --request-count 5

Troubleshooting

Connection Refused

If you see Connection refused errors:

  1. Verify the SGLang server is running: curl http://localhost:30010/health
  2. Check the port matches your server configuration
  3. If using Docker, ensure port mapping is correct (-p 30010:30010)

Timeout Errors

If requests time out during generation:

  1. Increase the request timeout: --request-timeout-seconds 1200
  2. Check server logs for errors
  3. Reduce video resolution or duration for faster generation

Out of Memory

If the server crashes with OOM errors:

  1. Use a smaller model (e.g., Wan2.1-T2V-1.3B instead of 14B)
  2. Reduce video resolution: --extra-inputs "size:720x480"
  3. Enable CPU offloading: --text-encoder-cpu-offload
  4. Reduce concurrency: --concurrency 1

Model Not Found

If you see model loading errors:

  1. Verify your Hugging Face token has access to the model
  2. Check the model path is correct
  3. Ensure sufficient disk space for model download

Response Fields

The video generation API returns the following fields:

| Field | Description |
|---|---|
| id | Unique video job identifier (mapped to video_id internally) |
| object | Object type, always "video" |
| status | Job status: queued, in_progress, completed, failed |
| progress | Completion percentage (0-100) |
| url | Download URL (only when status=completed) |
| size | Video resolution (e.g., "1280x720") |
| seconds | Video duration (returned as string) |
| quality | Quality setting for the generated video |
| model | Model used for generation |
| created_at | Unix timestamp of job creation |
| completed_at | Unix timestamp of completion |
| expires_at | Unix timestamp when video assets expire |
| inference_time_s | Total generation time in seconds |
| peak_memory_mb | Peak GPU memory usage in MB |
| error | Error details if status=failed |
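A minimal sketch of consuming these fields, using a hand-written sample response shaped like the table above (all values are illustrative, not captured from a real server):

```python
import json

sample = json.dumps({
    "id": "video_abc123",
    "object": "video",
    "status": "completed",
    "progress": 100,
    "url": "http://localhost:30010/v1/videos/video_abc123/content",
    "size": "1280x720",
    "seconds": "4",  # note: duration comes back as a string
    "model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
})

job = json.loads(sample)
if job["status"] == "completed" and job.get("url"):
    print(f"download {job['id']} from {job['url']}")
elif job["status"] == "failed":
    print(f"job failed: {job.get('error')}")
```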

Conclusion

You’ve successfully set up SGLang for video generation, run benchmarks with AIPerf, and learned how to download the generated videos. You can now experiment with different models, prompts, resolutions, and generation parameters to optimize your text-to-video workloads.

Key takeaways:

  • Use --endpoint-type video_generation
  • Control video parameters via --extra-inputs
  • The transport handles polling automatically
  • Use --export-level raw to capture full responses for video extraction

Now go forth and generate!