AIPerf supports benchmarking Automatic Speech Recognition (ASR) models using publicly available speech datasets from HuggingFace. Each dataset entry sends real speech audio alongside a fixed “Transcribe this audio.” prompt to measure end-to-end transcription latency and throughput.
Five ASR datasets are built in:
Clips longer than 30 seconds are automatically skipped to stay within typical ASR model context limits.
Launch vLLM with an audio-capable model such as Qwen2-Audio:
Verify the server is ready:
LibriSpeech is the standard read-speech benchmark and requires no authentication:
Sample Output:
High TTFT variance is expected for ASR workloads — audio encoding time scales with clip duration. Clips vary in length (up to 30 seconds) so TTFT will vary across requests.
VoxPopuli contains European Parliament recordings and requires no authentication:
AMI contains meeting recordings with individual headset microphone audio and requires no authentication:
GigaSpeech is a multi-domain corpus covering audiobooks, podcasts, and YouTube. It requires a HuggingFace account and acceptance of the dataset terms:
SPGISpeech contains financial earnings call recordings. It requires a HuggingFace account and acceptance of the dataset terms: