For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
    • Welcome to AIPerf Documentation
  • Getting Started
    • Profiling with AIPerf
    • Comprehensive LLM Benchmarking
    • Migrating from GenAI-Perf
    • GenAI-Perf vs AIPerf CLI Feature Comparison Matrix
  • Tutorials
      • Profile OpenAI-Compatible Text APIs Using AIPerf
      • Profile the OpenAI Responses API with AIPerf
      • Profile Hugging Face TGI Models with AIPerf
      • Profile Vision Language Models with AIPerf
      • Profile Audio Language Models with AIPerf
      • Profile ASR Models with Public Datasets
      • Profile Embedding Models with AIPerf
      • Profile Ranking Models with AIPerf
      • Profile NIM Image Retrieval with AIPerf
      • SGLang Image Generation
      • SGLang Image Edit
      • SGLang Video Generation
      • Synthetic Video Generation
      • Template Endpoint
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
On this page
  • Start a vLLM Server
  • Profile with LibriSpeech
  • Profile with VoxPopuli
  • Profile with AMI
  • Profile with GigaSpeech
  • Profile with SPGISpeech
TutorialsModel & Endpoint Guides

Profile ASR Models with Public Datasets

||View as Markdown|
Previous

Profile Audio Language Models with AIPerf

Next

Profile Embedding Models with AIPerf

AIPerf supports benchmarking Automatic Speech Recognition (ASR) models using publicly available speech datasets from HuggingFace. Each dataset entry sends real speech audio alongside a fixed “Transcribe this audio.” prompt to measure end-to-end transcription latency and throughput.

Five ASR datasets are built in:

Dataset--public-datasetAuth RequiredDescription
LibriSpeechlibrispeechNoRead speech from audiobooks (test split)
VoxPopulivoxpopuliNoEuropean Parliament recordings (Meta)
GigaSpeechgigaspeechYesMulti-domain corpus: audiobooks, podcasts, YouTube
AMIamiNoMeeting recordings with individual headset microphone audio
SPGISpeechspgispeechYesFinancial earnings call recordings (Kensho)

Clips longer than 30 seconds are automatically skipped to stay within typical ASR model context limits.


Start a vLLM Server

Launch vLLM with an audio-capable model such as Qwen2-Audio:

$docker build -t vllm-audio - << 'EOF'
$FROM vllm/vllm-openai:latest
$RUN pip install 'vllm[audio]'
$EOF
$
$docker run --gpus all -p 8000:8000 vllm-audio \
> --model Qwen/Qwen2-Audio-7B-Instruct \
> --trust-remote-code

Verify the server is ready:

$curl -s localhost:8000/v1/chat/completions \
> -H "Content-Type: application/json" \
> -d '{"model":"Qwen/Qwen2-Audio-7B-Instruct","messages":[{"role":"user","content":"test"}],"max_tokens":1}'

Profile with LibriSpeech

LibriSpeech is the standard read-speech benchmark and requires no authentication:

$aiperf profile \
> --model Qwen/Qwen2-Audio-7B-Instruct \
> --endpoint-type chat \
> --streaming \
> --url localhost:8000 \
> --public-dataset librispeech \
> --request-count 10 \
> --concurrency 4

Sample Output:

NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p50 ┃ std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Time to First Token │ 31,838.96 │ 10,246.67 │ 51,101.07 │ 50,755.33 │ 47,643.67 │ 34,232.54 │ 11,974.17 │
│ (ms) │ │ │ │ │ │ │ │
│ Time to Second Token │ 1,612.02 │ 348.35 │ 11,892.33 │ 10,868.74 │ 1,656.45 │ 468.33 │ 3,427.09 │
│ (ms) │ │ │ │ │ │ │ │
│ Time to First Output │ 31,838.96 │ 10,246.67 │ 51,101.07 │ 50,755.33 │ 47,643.67 │ 34,232.54 │ 11,974.17 │
│ Token (ms) │ │ │ │ │ │ │ │
│ Request Latency (ms) │ 118,587.60 │ 25,920.68 │ 293,970.87 │ 281,381.83 │ 168,080.54 │ 110,066.27 │ 73,231.75 │
│ Inter Token Latency │ 2,306.51 │ 187.13 │ 4,102.92 │ 4,051.39 │ 3,587.58 │ 2,498.81 │ 1,079.49 │
│ (ms) │ │ │ │ │ │ │ │
│ Output Token │ 0.95 │ 0.24 │ 5.34 │ 4.95 │ 1.42 │ 0.40 │ 1.48 │
│ Throughput Per User │ │ │ │ │ │ │ │
│ (tokens/sec/user) │ │ │ │ │ │ │ │
│ E2E Output Token │ 0.37 │ 0.19 │ 0.90 │ 0.87 │ 0.59 │ 0.29 │ 0.20 │
│ Throughput │ │ │ │ │ │ │ │
│ (tokens/sec/user) │ │ │ │ │ │ │ │
│ Output Sequence │ 37.10 │ 10.00 │ 78.00 │ 75.21 │ 50.10 │ 39.50 │ 18.10 │
│ Length (tokens) │ │ │ │ │ │ │ │
│ Input Sequence Length │ 5.00 │ 5.00 │ 5.00 │ 5.00 │ 5.00 │ 5.00 │ 0.00 │
│ (tokens) │ │ │ │ │ │ │ │
│ Output Token │ 1.24 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Throughput │ │ │ │ │ │ │ │
│ (tokens/sec) │ │ │ │ │ │ │ │
│ Request Throughput │ 0.03 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ (requests/sec) │ │ │ │ │ │ │ │
│ Request Count │ 10.00 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ (requests) │ │ │ │ │ │ │ │
└───────────────────────┴────────────┴───────────┴────────────┴────────────┴────────────┴────────────┴───────────┘

High TTFT variance is expected for ASR workloads — audio encoding time scales with clip duration. Clips vary in length (up to 30 seconds) so TTFT will vary across requests.


Profile with VoxPopuli

VoxPopuli contains European Parliament recordings and requires no authentication:

$aiperf profile \
> --model Qwen/Qwen2-Audio-7B-Instruct \
> --endpoint-type chat \
> --streaming \
> --url localhost:8000 \
> --public-dataset voxpopuli \
> --request-count 10 \
> --concurrency 4

Profile with AMI

AMI contains meeting recordings with individual headset microphone audio and requires no authentication:

$aiperf profile \
> --model Qwen/Qwen2-Audio-7B-Instruct \
> --endpoint-type chat \
> --streaming \
> --url localhost:8000 \
> --public-dataset ami \
> --request-count 10 \
> --concurrency 4

Profile with GigaSpeech

GigaSpeech is a multi-domain corpus covering audiobooks, podcasts, and YouTube. It requires a HuggingFace account and acceptance of the dataset terms:

$uv run hf auth login
$
$aiperf profile \
> --model Qwen/Qwen2-Audio-7B-Instruct \
> --endpoint-type chat \
> --streaming \
> --url localhost:8000 \
> --public-dataset gigaspeech \
> --request-count 10 \
> --concurrency 4

Profile with SPGISpeech

SPGISpeech contains financial earnings call recordings. It requires a HuggingFace account and acceptance of the dataset terms:

$uv run hf auth login
$
$aiperf profile \
> --model Qwen/Qwen2-Audio-7B-Instruct \
> --endpoint-type chat \
> --streaming \
> --url localhost:8000 \
> --public-dataset spgispeech \
> --request-count 10 \
> --concurrency 4