
# Synthetic Dataset Generation

AIPerf generates synthetic datasets for benchmarking LLM inference servers. This tutorial explains how synthetic data is generated for text, images, audio, and video inputs.

## Overview

Synthetic datasets enable consistent, reproducible benchmarking with full control over input characteristics. Each modality uses a specialized generator:

| Modality | Source Material | Configurable Properties |
|----------|-----------------|-------------------------|
| **Text** | Shakespeare corpus | Token length, distribution |
| **Images** | 4 source images | Width, height, format |
| **Audio** | Gaussian noise | Duration, sample rate, bit depth, channels |
| **Video** | Synthetic animations | Resolution, FPS, duration, codec |

All generators use deterministic random sampling for reproducibility (see [Reproducibility Guide](/aiperf/tutorials/configuration/random-number-generation-reproducibility)).

---

## Text/Prompt Generation

### How It Works

Text prompts are generated by sampling from a Shakespeare corpus that is tokenized once at startup:

1. **Corpus Loading**: The `assets/shakespeare.txt` file is tokenized once at startup
2. **Character-Based Chunking**: Text is split into fixed-size chunks for parallel tokenization
3. **Deterministic Sampling**: Random slices of the tokenized corpus are extracted and decoded into prompts
4. **Length Control**: Prompt lengths follow a normal distribution with the specified mean and standard deviation
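
The sampling and length-control steps can be sketched roughly as follows. This is a minimal illustration, not AIPerf's implementation: the whitespace tokenizer is a stand-in for the model's tokenizer, and `build_corpus_tokens` and `sample_prompt` are hypothetical names introduced here.

```python
import random


def build_corpus_tokens(text: str) -> list[str]:
    # Stand-in tokenizer (assumption): a real run would use the model's tokenizer.
    return text.split()


def sample_prompt(tokens: list[str], rng: random.Random, mean: float, stddev: float) -> str:
    # Draw the target length from a normal distribution and keep it within
    # [1, corpus size]. The clamping rule is an assumption of this sketch.
    length = max(1, round(rng.gauss(mean, stddev)))
    length = min(length, len(tokens))
    # Pick a random start position such that the slice fits inside the corpus.
    start = rng.randrange(len(tokens) - length + 1)
    # "Decode" the slice back into text (here: join the whitespace tokens).
    return " ".join(tokens[start:start + length])


corpus = "to be or not to be that is the question " * 50
tokens = build_corpus_tokens(corpus)

# Two generators seeded identically yield identical prompt sequences.
rng_a = random.Random(42)
rng_b = random.Random(42)
prompts_a = [sample_prompt(tokens, rng_a, mean=15, stddev=3) for _ in range(3)]
prompts_b = [sample_prompt(tokens, rng_b, mean=15, stddev=3) for _ in range(3)]
print(prompts_a == prompts_b)
```

With a standard deviation of 0, every prompt in this sketch has exactly the mean length.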

**Key Feature**: Character-based chunking keeps results reproducible across machines with different CPU counts, so the same random seed always produces identical prompts.
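
To illustrate why fixed-size character chunks make the result independent of the worker count, here is a small sketch. The chunk size, the whitespace tokenizer, and the function names are all assumptions made for illustration; AIPerf's actual chunk size and tokenizer are not shown here.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_CHARS = 64  # hypothetical fixed chunk size, in characters


def split_fixed(text: str, size: int = CHUNK_CHARS) -> list[str]:
    # Chunk boundaries depend only on character offsets, never on worker count.
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_tokenize(chunk: str) -> list[str]:
    # Stand-in tokenizer (assumption): whitespace split.
    return chunk.split()


def tokenize_corpus(text: str, workers: int) -> list[str]:
    chunks = split_fixed(text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, whichever worker finishes first.
        per_chunk = list(pool.map(toy_tokenize, chunks))
    return [token for chunk_tokens in per_chunk for token in chunk_tokens]


text = "shall I compare thee to a summer's day " * 40
print(tokenize_corpus(text, workers=1) == tokenize_corpus(text, workers=8))
```

If the text were instead split into one chunk per CPU, the chunk boundaries (and therefore the tokens near them) could differ between machines, which is what the fixed character size avoids.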

### Configuration

```bash
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --url localhost:8000 \
  --endpoint-type chat \
  --synthetic-input-tokens-mean 150 \
  --synthetic-input-tokens-stddev 30 \
  --output-tokens-mean 50 \
  --request-count 10
```

**Options:**
- `--synthetic-input-tokens-mean`: Mean input token count (default: 550)
- `--synthetic-input-tokens-stddev`: Standard deviation for input length variability (default: 0)
- `--output-tokens-mean`: Mean number of output tokens requested (default: None, in which case the model decides)
- `--output-tokens-stddev`: Standard deviation for output token length (default: 0)
- `--seq-dist`: Distribution of (ISL, OSL) pairs for mixed workload simulation (default: None). See [Sequence Length Distributions](/aiperf/tutorials/datasets-inputs/sequence-length-distributions-for-advanced-benchmarking) for format details.
- `--random-seed`: Seed for reproducible prompt generation (default: None)

### Advanced: Prefix Synthesis

For shared-prefix benchmarking (e.g., RAG scenarios):

```bash
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --url localhost:8000 \
  --endpoint-type chat \
  --synthetic-input-tokens-mean 100 \
  --prefix-prompt-length 512 \
  --prefix-prompt-pool-size 10 \
  --request-count 10
```

Each request randomly selects one 512-token prefix from a pool of 10 and appends a randomly sampled 100-token continuation. See [Prefix Synthesis](/aiperf/tutorials/datasets-inputs/prefix-data-synthesis-tutorial) for details.
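
The pool-then-select behavior can be pictured with a short sketch. It assumes prefixes and continuations are both random slices of one token list; the helper names and the integer-labelled tokens are hypothetical, and AIPerf's real logic may differ.

```python
import random


def random_slice(tokens: list[str], rng: random.Random, length: int) -> list[str]:
    # Random contiguous slice of the given length.
    start = rng.randrange(len(tokens) - length + 1)
    return tokens[start:start + length]


def build_prefix_pool(tokens, rng, pool_size: int, prefix_length: int):
    # The pool is built once, then reused by every request.
    return [random_slice(tokens, rng, prefix_length) for _ in range(pool_size)]


def build_request(tokens, pool, rng, input_length: int) -> list[str]:
    prefix = rng.choice(pool)  # shared across requests
    continuation = random_slice(tokens, rng, input_length)  # unique per request
    return prefix + continuation


tokens = [f"tok{i}" for i in range(5000)]  # placeholder corpus tokens
rng = random.Random(7)
pool = build_prefix_pool(tokens, rng, pool_size=10, prefix_length=512)
requests = [build_request(tokens, pool, rng, input_length=100) for _ in range(10)]
print(len(requests[0]))  # 612 tokens: 512 prefix + 100 continuation
```

Because only 10 distinct prefixes exist, repeated requests share long leading token runs, which is what exercises a server's prefix cache.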

---

## Image Generation

### How It Works

Images are generated by resizing source images from `assets/source_images/`:

1. **Source Images**: 4 source images pre-loaded into memory
2. **Random Selection**: One source image is randomly selected for each generation
3. **Resizing**: Image is resized to target dimensions using PIL (Pillow)
4. **Format Conversion**: Converted to the configured format (PNG, JPEG, or randomly selected)
5. **Base64 Encoding**: Encoded as data URI for API requests
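
The selection, format choice, and encoding steps can be sketched with the standard library alone. The resize and format-conversion steps (done with Pillow in AIPerf) are deliberately left out, the source images are placeholder byte strings, and the helper names and the minimum-dimension clamp are assumptions of this sketch.

```python
import base64
import random

MIME_TYPES = {"png": "image/png", "jpeg": "image/jpeg"}


def pick_format(configured: str, rng: random.Random) -> str:
    # "random" picks one of the supported formats for each image.
    if configured == "random":
        return rng.choice(sorted(MIME_TYPES))
    return configured


def sample_dimension(mean: float, stddev: float, rng: random.Random) -> int:
    # Normal sample rounded to whole pixels; the floor of 1 is an assumption.
    return max(1, round(rng.gauss(mean, stddev)))


def to_data_uri(image_bytes: bytes, image_format: str) -> str:
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{MIME_TYPES[image_format]};base64,{encoded}"


rng = random.Random(0)
# Placeholder bytes standing in for the 4 pre-loaded source images.
source_images = [b"source-0", b"source-1", b"source-2", b"source-3"]

chosen = rng.choice(source_images)
width = sample_dimension(512, 50, rng)
height = sample_dimension(512, 50, rng)
fmt = pick_format("random", rng)
uri = to_data_uri(chosen, fmt)  # a real pipeline would encode the resized image
print(width, height, fmt, uri[:30])
```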

### Configuration

```bash
aiperf profile \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --url localhost:8000 \
  --endpoint-type chat \
  --image-width-mean 512 \
  --image-height-mean 512 \
  --image-width-stddev 50 \
  --image-height-stddev 50 \
  --image-format png \
  --image-batch-size 2 \
  --synthetic-input-tokens-mean 100 \
  --request-count 5
```

**Options:**
- `--image-width-mean`: Mean width in pixels (default: 0)
- `--image-width-stddev`: Width standard deviation (default: 0)
- `--image-height-mean`: Mean height in pixels (default: 0)
- `--image-height-stddev`: Height standard deviation (default: 0)
- `--image-format`: `png`, `jpeg`, or `random` (default: `png`)
- `--image-batch-size`: Number of images per request (default: 1)

**Note**: Image generation requires both `--image-width-mean` and `--image-height-mean` to be > 0. Setting either to 0 disables images.

---

## Audio Generation

### How It Works

Audio files are generated as synthetic Gaussian noise:

1. **Parameter Selection**: Random selection of sample rate and bit depth from configured lists
2. **Duration Sampling**: Duration follows a normal distribution; samples shorter than 0.01 s are rejected and redrawn
3. **Noise Generation**: Gaussian noise generated as a NumPy array
4. **Scaling**: Clipped to [-1, 1] and scaled to the target bit depth range
5. **Encoding**: Written as WAV or MP3 using the soundfile library
6. **Base64 Encoding**: Encoded as `<format>,<base64data>` string

**Audio Characteristics**:
- White noise (all frequencies equally represented)
- Gaussian amplitude distribution
- No structured speech or music content
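
The pipeline above can be approximated with the standard library. Note the substitutions: this sketch uses `random` and `wave` instead of NumPy and soundfile, handles only 16-bit mono WAV, and the noise standard deviation (`sigma`) and function names are assumptions, not AIPerf values.

```python
import base64
import io
import random
import struct
import wave

MIN_DURATION_S = 0.01


def sample_duration(mean: float, stddev: float, rng: random.Random) -> float:
    # Rejection sampling: redraw until the duration is at least 0.01 s.
    if stddev == 0 and mean < MIN_DURATION_S:
        raise ValueError("mean below minimum with zero stddev would never be accepted")
    while True:
        duration = rng.gauss(mean, stddev)
        if duration >= MIN_DURATION_S:
            return duration


def noise_wav_bytes(duration_s: float, sample_rate_hz: int, rng: random.Random,
                    sigma: float = 0.3) -> bytes:
    n = int(duration_s * sample_rate_hz)
    max_int = 2 ** 15 - 1  # 16-bit signed range
    samples = []
    for _ in range(n):
        x = max(-1.0, min(1.0, rng.gauss(0.0, sigma)))  # clip to [-1, 1]
        samples.append(int(x * max_int))  # scale to the bit depth
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits
        wav.setframerate(sample_rate_hz)
        wav.writeframes(struct.pack(f"<{n}h", *samples))
    return buffer.getvalue()


rng = random.Random(123)
duration = sample_duration(mean=0.05, stddev=0.01, rng=rng)
sample_rate_hz = int(16.0 * 1000)  # rates are configured in kHz
wav_bytes = noise_wav_bytes(duration, sample_rate_hz, rng)
payload = "wav," + base64.b64encode(wav_bytes).decode("ascii")
print(payload[:24])
```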

### Configuration

```bash
aiperf profile \
  --model Qwen/Qwen2-Audio-7B-Instruct \
  --url localhost:8000 \
  --endpoint-type chat \
  --audio-length-mean 5.0 \
  --audio-length-stddev 1.0 \
  --audio-sample-rates 16 \
  --audio-depths 16 \
  --audio-format wav \
  --audio-num-channels 1 \
  --audio-batch-size 1 \
  --synthetic-input-tokens-mean 50 \
  --request-count 10
```

**Options:**
- `--audio-length-mean`: Mean duration in seconds (default: 0.0)
- `--audio-length-stddev`: Duration standard deviation (default: 0.0)
- `--audio-sample-rates`: List of sample rates in kHz to randomly select from (default: `[16.0]`)
- `--audio-depths`: List of bit depths (8, 16, 24, 32) to randomly select from (default: `[16]`)
- `--audio-format`: `wav` or `mp3` (default: `wav`)
- `--audio-num-channels`: 1 (mono) or 2 (stereo) (default: 1)
- `--audio-batch-size`: Number of audio files per request (default: 1)

**Note**: Set `--audio-length-mean` > 0 to enable audio generation. MP3 supports a limited set of sample rates; use WAV for custom rates.

---

## Video Generation

Video generation is fully documented in [Synthetic Video Generation](/aiperf/tutorials/model-endpoint-guides/synthetic-video-generation). Key points:

- **Synthesis Types**: `moving_shapes` (animated geometry), `grid_clock` (grid with animation), or `noise` (random pixels)
- **Codecs**: CPU (`libvpx-vp9`, `libx264`, `libx265`) or GPU (`h264_nvenc`, `hevc_nvenc`)
- **Formats**: WebM (default) or MP4

**Prerequisite:** Video generation requires FFmpeg. For installation instructions, see [Synthetic Video Tutorial](/aiperf/tutorials/model-endpoint-guides/synthetic-video-generation#installing-ffmpeg).

```bash
aiperf profile \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --url localhost:8000 \
  --endpoint-type chat \
  --video-width 640 \
  --video-height 480 \
  --video-fps 4 \
  --video-duration 5.0 \
  --video-synth-type moving_shapes \
  --video-codec libvpx-vp9 \
  --output-tokens-mean 50 \
  --request-count 5
```
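
As a rough sizing aid for the settings above, the following sketch estimates how many frames are generated and how large they are before encoding. It assumes the frame count is FPS times duration and that frames are 8-bit RGB (3 bytes per pixel); both are assumptions, and the encoded WebM or MP4 payload will be much smaller.

```python
def frame_count(fps: float, duration_s: float) -> int:
    # Assumed relationship: one frame every 1/fps seconds.
    return round(fps * duration_s)


def raw_rgb_bytes(width: int, height: int, frames: int, bytes_per_pixel: int = 3) -> int:
    # Uncompressed size, assuming 8-bit RGB frames.
    return width * height * bytes_per_pixel * frames


frames = frame_count(fps=4, duration_s=5.0)
raw = raw_rgb_bytes(640, 480, frames)
print(frames, raw)  # 20 18432000
```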

See [Synthetic Video Tutorial](/aiperf/tutorials/model-endpoint-guides/synthetic-video-generation) for complete details.