Profile ASR Models with Public Datasets
AIPerf supports benchmarking Automatic Speech Recognition (ASR) models using publicly available speech datasets from HuggingFace. Each dataset entry sends real speech audio alongside a fixed “Transcribe this audio.” prompt to measure end-to-end transcription latency and throughput.
Five ASR datasets are built in: LibriSpeech, VoxPopuli, AMI, GigaSpeech, and SPGISpeech.
Clips longer than 30 seconds are automatically skipped to stay within typical ASR model context limits.
Start a vLLM Server
Launch vLLM with an audio-capable model such as Qwen2-Audio:
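For example, the server can be started like this. The model tag below is one published audio-capable checkpoint and is used here only as an illustration; substitute any audio-capable model you have access to:

```shell
# Launch vLLM's OpenAI-compatible server with an audio-capable model.
# Qwen/Qwen2-Audio-7B-Instruct is an example checkpoint, not a requirement.
vllm serve Qwen/Qwen2-Audio-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000
```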
Verify the server is ready:
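vLLM exposes a `/health` endpoint that returns HTTP 200 once the server is up; a quick check (assuming the default port 8000) is:

```shell
# Fails (non-zero exit) until the server is healthy.
curl -fsS http://localhost:8000/health && echo "server is ready"
```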
Profile with LibriSpeech
LibriSpeech is the standard read-speech benchmark and requires no authentication:
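A minimal invocation might look like the following sketch. The exact flag names, and in particular the dataset selector and its `librispeech` value, are assumptions here; confirm them against `aiperf profile --help` before running:

```shell
# Profile the vLLM server against the built-in LibriSpeech dataset.
# Flag names below are illustrative; verify with `aiperf profile --help`.
aiperf profile \
  --model Qwen/Qwen2-Audio-7B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --streaming \
  --public-dataset librispeech \
  --request-count 50 \
  --concurrency 4
```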
Sample Output:
High TTFT variance is expected for ASR workloads because audio encoding time scales with clip duration. Since clips vary in length (up to 30 seconds), TTFT will vary across requests.
Profile with VoxPopuli
VoxPopuli contains European Parliament recordings and requires no authentication:
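A sketch of the invocation, with the caveat that the dataset selector flag and its `voxpopuli` value are assumptions to be checked against `aiperf profile --help`:

```shell
# Profile against the built-in VoxPopuli dataset (flag names illustrative).
aiperf profile \
  --model Qwen/Qwen2-Audio-7B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --streaming \
  --public-dataset voxpopuli \
  --request-count 50
```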
Profile with AMI
AMI contains meeting recordings with individual headset microphone audio and requires no authentication:
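The invocation follows the same shape; the dataset selector flag and its `ami` value are assumptions to verify against `aiperf profile --help`:

```shell
# Profile against the built-in AMI dataset (flag names illustrative).
aiperf profile \
  --model Qwen/Qwen2-Audio-7B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --streaming \
  --public-dataset ami \
  --request-count 50
```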
Profile with GigaSpeech
GigaSpeech is a multi-domain corpus covering audiobooks, podcasts, and YouTube. It requires a HuggingFace account and acceptance of the dataset terms:
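Because the dataset is gated, accept the terms on the GigaSpeech HuggingFace dataset page first, then supply a read token via the standard `HF_TOKEN` environment variable. The aiperf flag names and the `gigaspeech` dataset tag below are assumptions; confirm them with `aiperf profile --help`:

```shell
# Export a HuggingFace read token so the gated dataset can be downloaded.
export HF_TOKEN=<your-huggingface-token>   # placeholder, not a real token

# Profile against the built-in GigaSpeech dataset (flag names illustrative).
aiperf profile \
  --model Qwen/Qwen2-Audio-7B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --streaming \
  --public-dataset gigaspeech \
  --request-count 50
```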
Profile with SPGISpeech
SPGISpeech contains financial earnings call recordings. It requires a HuggingFace account and acceptance of the dataset terms:
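As with GigaSpeech, accept the terms on the SPGISpeech HuggingFace dataset page and provide a read token via `HF_TOKEN`. The aiperf flag names and the `spgispeech` dataset tag are assumptions; verify them with `aiperf profile --help`:

```shell
# Export a HuggingFace read token so the gated dataset can be downloaded.
export HF_TOKEN=<your-huggingface-token>   # placeholder, not a real token

# Profile against the built-in SPGISpeech dataset (flag names illustrative).
aiperf profile \
  --model Qwen/Qwen2-Audio-7B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --streaming \
  --public-dataset spgispeech \
  --request-count 50
```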