Profile Image Generation Models with AIPerf
Overview
This guide shows how to benchmark image generation APIs using a Docker-based server and AIPerf. You’ll learn how to:
- Set up the server
- Create an input file and run the benchmark
- View the results and extract the generated images
References
For the most up-to-date information, please refer to the following resources:
Setting up the server
Log in to Hugging Face and accept the terms of use for the following model: FLUX.1-dev.
Export your Hugging Face token as an environment variable:
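A minimal example, assuming a standard Hugging Face access token (replace the placeholder value with your own token):

```shell
# Replace the placeholder with your actual Hugging Face access token
export HF_TOKEN=hf_your_token_here
```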
Start the Docker container:
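A hedged sketch of the launch command; the image name (`lmsysorg/sglang:latest`) and flags are assumptions, so adjust them to match the image your deployment actually uses:

```shell
# Start an interactive container with GPU access, forwarding the
# server port and passing through the Hugging Face token.
docker run --gpus all -it --rm \
  --ipc=host \
  -p 30000:30000 \
  -e HF_TOKEN=$HF_TOKEN \
  lmsysorg/sglang:latest \
  bash
```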
The following steps are to be performed inside the Docker container.
Install the dependencies:
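The exact packages depend on what the container image already ships; if the image-generation extras are missing, something like the following may be needed (the package spec is an assumption):

```shell
# Install SGLang with all optional extras (only needed if the image
# does not already include them).
pip install "sglang[all]"
```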
Set the server arguments:
The following arguments will set up the server to use the FLUX.1-dev model on a single GPU, on port 30000. You can modify these arguments to use a different model, a different number of GPUs, a different port, and so on. See the SGLang Image Generation CLI for more details.
Start the server:
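A sketch of the two steps above combined, assuming the standard SGLang launcher; the flag names are assumptions, so consult the SGLang Image Generation CLI docs for the authoritative list:

```shell
# Single GPU (--tp 1), FLUX.1-dev, listening on port 30000.
SERVER_ARGS="--model-path black-forest-labs/FLUX.1-dev --tp 1 --port 30000"
python3 -m sglang.launch_server $SERVER_ARGS
```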
Wait until the server is ready (watch the logs for the following message):
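Instead of watching the logs manually, you can poll the server until it responds; the `/health` endpoint is an assumption here, so fall back to tailing the logs if your server version exposes a different path:

```shell
# Block until the server answers health checks.
until curl -sf http://localhost:30000/health > /dev/null; do
  sleep 2
done
echo "Server is ready"
```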
Running the benchmark (basic usage)
The following steps are to be performed on your local machine, outside the Docker container.
Text-to-Image Generation Using Input File
Create an input file:
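A sketch of an input file with three prompts, one JSON object per line; the file name `inputs.jsonl` and the `"text"` field name are assumptions, so check your AIPerf version's input-file schema:

```shell
# Write three prompts to inputs.jsonl, one JSON object per line.
cat > inputs.jsonl <<'EOF'
{"text": "A photo of a red fox in a snowy forest"}
{"text": "A watercolor painting of a lighthouse at dusk"}
{"text": "A futuristic city skyline at night, neon lights"}
EOF
```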
Run the benchmark:
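A hedged sketch of the benchmark invocation; the flag names are assumptions based on common AIPerf usage, so run `aiperf profile --help` to confirm them for your version:

```shell
# Send 3 image-generation requests to the local server.
aiperf profile \
  --model black-forest-labs/FLUX.1-dev \
  --url http://localhost:30000 \
  --endpoint /v1/images/generations \
  --input-file inputs.jsonl \
  --request-count 3
```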
Done! This sends 3 requests to http://localhost:30000/v1/images/generations
Sample Output (Successful Run):
Text-to-Image Generation Using Synthetic Inputs
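To benchmark without a prompt file, AIPerf can generate synthetic inputs; a hedged sketch (the flag names are assumptions, so confirm them with `aiperf profile --help`):

```shell
# Same target as before, but with no --input-file: AIPerf
# synthesizes the prompts itself.
aiperf profile \
  --model black-forest-labs/FLUX.1-dev \
  --url http://localhost:30000 \
  --endpoint /v1/images/generations \
  --request-count 3
```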
Done! This sends 3 requests to http://localhost:30000/v1/images/generations
Sample Output (Successful Run):
Understanding the Metrics
Image generation endpoints report a focused set of metrics. Unlike LLM text endpoints, there are no token-level streaming metrics (TTFT, ITL) since the image is returned as a single response.
To increase throughput, raise --concurrency. Each concurrent worker sends requests independently, allowing multiple images to be generated in parallel.
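For example, the basic run can be widened to eight parallel workers, assuming an input file named `inputs.jsonl` and the same (assumed) flag names as above:

```shell
# 24 total requests, issued by 8 concurrent workers.
aiperf profile \
  --model black-forest-labs/FLUX.1-dev \
  --url http://localhost:30000 \
  --endpoint /v1/images/generations \
  --input-file inputs.jsonl \
  --request-count 24 \
  --concurrency 8
```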
Running the benchmark (advanced usage)
Create an input file:
Run the benchmark:
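A sketch of the advanced run; `--export-level raw` comes from this guide, while the remaining flag names and the input file name are assumptions, so confirm them with `aiperf profile --help`:

```shell
# Re-run the benchmark, exporting raw payloads so the base64 image
# data is preserved for extraction.
aiperf profile \
  --model black-forest-labs/FLUX.1-dev \
  --url http://localhost:30000 \
  --endpoint /v1/images/generations \
  --input-file inputs.jsonl \
  --request-count 3 \
  --export-level raw
```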
Use --export-level raw to get the raw input/output payloads.
Viewing the generated images
Extract the generated images:
Copy the following code into a file called extract_images.py:
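An illustrative version of the script is sketched below. It assumes the raw export is a JSONL file whose records contain an OpenAI-style images response with the image bytes under `data[i]["b64_json"]`, and the default file names are placeholders; adjust both to match your AIPerf version's export schema.

```python
#!/usr/bin/env python3
"""Extract base64-encoded images from an AIPerf raw export file."""
import base64
import json
import sys
from pathlib import Path


def extract_images(input_file: str, output_dir: str) -> int:
    """Decode every b64_json field in the export and write PNG files."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(input_file) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # The payload location is an assumption; adapt as needed.
            payload = record.get("response", record)
            for item in payload.get("data", []):
                b64 = item.get("b64_json")
                if not b64:
                    continue
                (out / f"image_{count}.png").write_bytes(base64.b64decode(b64))
                count += 1
    return count


if __name__ == "__main__":
    # Default paths are placeholders; pass explicit ones as shown in
    # the usage line below.
    input_file = sys.argv[1] if len(sys.argv) > 1 else "profile_export_raw.jsonl"
    output_dir = sys.argv[2] if len(sys.argv) > 2 else "images"
    if Path(input_file).exists():
        n = extract_images(input_file, output_dir)
        print(f"Extracted {n} image(s) to {output_dir}/")
    else:
        print(f"Input file not found: {input_file}")
```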
Run the script:
The script is set up to use the default directory and file names for the input and output files, but it can be modified to use different files.
Usage: python extract_images.py <input_file> <output_dir>
Output:
View the generated images:
(Each of the three generated images appears here alongside the prompt that produced it.)
Conclusion
You’ve successfully set up an image generation server, run your first benchmarks, and learned how to extract and view the generated images. You can now experiment with different models, prompts, and concurrency settings to optimize your image generation workloads.
Now go forth and generate!