Custom Dataset Guide
Benchmark LLMs with your own data using single-turn requests, multi-turn conversations, or random sampling.
Overview
AIPerf supports three custom dataset types for benchmarking with your own data:
- single_turn: independent, one-request-per-line prompts
- multi_turn: complete conversations with multiple turns
- random_pool: random sampling from one or more data pools
All three support:
- Client-side batching
- Automatic media handling: local files are converted to base64 format, while remote URLs are sent directly to the API
Server Setup
Start a vLLM server for testing:
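One way to bring up a local OpenAI-compatible endpoint, assuming vLLM is installed (the model name here is only an example):

```shell
# Serve a small model on the default port 8000 (model choice is illustrative)
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000
```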
Verify the server is ready:
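A quick readiness check against vLLM's health endpoint (assumes the server from the previous step is listening on port 8000):

```shell
# Prints 200 once the server is ready to accept requests
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
```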
Single-Turn Datasets
Each line represents one independent single-turn request.
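As a sketch, a single-turn file is just one JSON object per line. The `text` field name is an assumption based on common AIPerf examples; check your version's documented schema:

```shell
# Create a minimal two-entry single-turn dataset
cat > single_turn.jsonl <<'EOF'
{"text": "What is the capital of France?"}
{"text": "Summarize the theory of relativity in one sentence."}
EOF

# Every line must parse as a standalone JSON object
python3 - <<'PY'
import json
lines = open("single_turn.jsonl").read().splitlines()
for line in lines:
    json.loads(line)  # raises if any line is not valid JSON
print(len(lines), "valid entries")
PY
```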
When to Use
Use single_turn when you need deterministic, sequential execution where requests always run in the exact order they appear in the file:
- Debugging: Test specific prompts in a known sequence
- Regression testing: Same input file → same output order every time
- Timing control: Schedule requests with precise timestamps or delays
- Predictable testing: Know exactly which request runs when
Execution: Sequential by default (request 1, then request 2, then request 3, and so on).
Input: Single JSONL file only.
Basic Text Example
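A sketch of the corresponding AIPerf invocation against the local server. The flag names below (`--custom-dataset-type`, `--input-file`, `--endpoint-type`) are assumptions based on common AIPerf usage and may differ in your version:

```shell
aiperf profile \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --custom-dataset-type single_turn \
  --input-file single_turn.jsonl
```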
Multi-Turn Datasets
Each entry represents a complete conversation with multiple turns.
When to Use
Use multi_turn when you need conversations with context where each turn builds on previous turns in the conversation:
- Chat testing: Test conversational AI that maintains context across turns
- Realistic interactions: Simulate real user conversations with follow-up questions
- Task completion: Test multi-step tasks that require conversation history
Execution: Sequential within each conversation (turn 1, then turn 2, then turn 3, and so on), but multiple conversations run concurrently.
Input: Single JSONL file only.
Basic Conversation
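A minimal sketch of a multi-turn entry: one conversation per line, with an ordered list of turns. The `turns` and `text` field names are assumptions based on common AIPerf examples; verify against your version's schema:

```shell
# One conversation with two turns; the second turn depends on the first
cat > multi_turn.jsonl <<'EOF'
{"turns": [{"text": "What is JSON Lines?"}, {"text": "Show me a one-line example."}]}
EOF

# Confirm the conversation parses and count its turns
python3 - <<'PY'
import json
conversation = json.loads(open("multi_turn.jsonl").read())
print(len(conversation["turns"]), "turns")
PY
```

Running it with `--custom-dataset-type multi_turn` (flag name assumed, as above) sends the turns in order, carrying forward the conversation history.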
Key Points:
- Each turn includes full conversation history
- Turns execute sequentially within each conversation
- Multiple conversations run concurrently (up to `--concurrency`)
Random Pool Datasets
Randomly sample from one or more data pools for varied request patterns.
When to Use
Use random_pool when you need random sampling with replacement for unpredictable, varied request patterns:
- Load testing: Generate a diverse mix of request patterns
- Production simulation: Model real-world workloads where requests vary
- Stress testing: Test system behavior under mixed input patterns
- Multiple data sources: Combine files from a directory (each file becomes a pool)
Execution: Random sampling with replacement (the same entry can be selected multiple times).
Input: Single JSONL file OR a directory of JSONL files.
Note: Does NOT support timing control or multi-turn conversations.
Basic Single-File Sampling
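An illustrative 8-entry pool (the `text` field name is an assumption, as in the earlier examples):

```shell
# Build an 8-entry pool; AIPerf samples from it with replacement
cat > pool.jsonl <<'EOF'
{"text": "Explain overfitting in one paragraph."}
{"text": "What is gradient descent?"}
{"text": "Define tokenization."}
{"text": "Compare CNNs and transformers."}
{"text": "What is a KV cache?"}
{"text": "Explain beam search."}
{"text": "What is quantization?"}
{"text": "Describe speculative decoding."}
EOF
wc -l < pool.jsonl   # number of entries in the pool
```

Running AIPerf against this file with something like `--custom-dataset-type random_pool --request-count 50 --random-seed 42` (exact flag names may differ by version) produces the sampling behavior described below.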
Behavior:
- Randomly samples 50 requests from the 8-entry pool
- Sampling with replacement (entries can repeat)
- Use `--random-seed` for reproducibility
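To use multiple pools, pass a directory instead of a single file; as noted above, each file becomes its own pool. A minimal sketch (directory and file names are arbitrary):

```shell
# Each JSONL file in the directory becomes a separate pool
mkdir -p pools
cat > pools/questions.jsonl <<'EOF'
{"text": "What is RAG?"}
{"text": "Explain attention."}
EOF
cat > pools/instructions.jsonl <<'EOF'
{"text": "Write a haiku about GPUs."}
{"text": "Draft a short release note."}
EOF
ls pools/
```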
Related
- Multi-Turn Conversations - Multi-turn conversation benchmarking
- Conversation Context Mode - How conversation history accumulates in multi-turn