AIPerf supports benchmarking embedding models that convert text into dense vector representations.
This guide covers profiling OpenAI-compatible embedding endpoints using vLLM.
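For orientation, the request that an OpenAI-compatible embeddings endpoint expects can be sketched in Python. This is a minimal illustration, not part of AIPerf: the helper names `build_embeddings_request` and `extract_vectors` are hypothetical, and the base URL and model name are placeholders you would replace with your own.

```python
import json
import urllib.request


def build_embeddings_request(base_url, model, texts):
    """Build a POST request for an OpenAI-compatible /v1/embeddings endpoint."""
    # The OpenAI embeddings schema takes a model name and one or more input strings.
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_vectors(response_body):
    """Return the embedding vectors from a parsed response, ordered by input index."""
    items = sorted(response_body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

Once a server is running, the request can be sent with `urllib.request.urlopen(request)` and the parsed JSON body passed to `extract_vectors`.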
Launch a vLLM server with an embedding model:
Verify the server is ready:
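If you want to script the readiness check, for example in a test harness, a polling loop along these lines may help. It is a sketch under assumptions: the server is taken to listen on `http://localhost:8000` and to answer `GET /v1/models` with HTTP 200 once the model is loaded, and the helper names `http_ok` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: treat as not ready yet.
        return False


def wait_until_ready(url, attempts=30, delay=2.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; return False if all attempts fail."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Example (assumed address): wait_until_ready("http://localhost:8000/v1/models")
```

The `probe` and `sleep` parameters are injectable so the loop can be exercised without a live server.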
Run AIPerf against the embeddings endpoint using synthetic inputs:
Sample Output (Successful Run):
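For embeddings endpoints, AIPerf reports metrics focused on request latency and throughput. Token-level metrics such as time to first token (TTFT) and inter-token latency (ITL) are not reported, because an embeddings response is a single vector per request rather than a stream of tokens.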
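To make the request-level metrics concrete, the sketch below derives latency statistics and throughput from a list of per-request latencies. It is illustrative only and is not AIPerf's implementation: the function name `summarize_requests` is hypothetical, and the nearest-rank percentile used here is an assumption that may differ from the method AIPerf applies.

```python
import math
import statistics


def summarize_requests(latencies_ms, wall_time_s):
    """Summarize per-request latencies (ms) observed over `wall_time_s` seconds."""
    if not latencies_ms or wall_time_s <= 0:
        raise ValueError("need at least one latency and a positive wall time")
    ordered = sorted(latencies_ms)

    def percentile(p):
        # Nearest-rank percentile: smallest sample with at least p% of samples at or below it.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        "avg_ms": statistics.fmean(ordered),
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
        "p50_ms": percentile(50),
        "p99_ms": percentile(99),
        # Throughput counts completed requests per second of wall-clock time.
        "requests_per_s": len(ordered) / wall_time_s,
    }
```

Because each request yields one complete response, a single latency value per request is enough to compute everything in this summary.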
Create a JSONL input file for the embeddings endpoint and run AIPerf against it. The two
steps are combined into a single bash block so that the test-docs CI actually
exercises the aiperf profile invocation: the runner extracts only the first
bash block after the tag, so splitting the steps would leave the profile
command unrun.
Sample Output (Successful Run):
With custom inputs, AIPerf sends your own text samples instead of synthetic data, so input sequence lengths vary with the content of each sample.
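A JSONL input file like the one described above can also be generated programmatically. The sketch below writes one JSON object per line. The `text` field name is an assumption made for illustration, as are the helper name `write_jsonl_inputs` and the file name; check AIPerf's input-file documentation for the exact schema it expects.

```python
import json


def write_jsonl_inputs(path, texts):
    """Write one JSON object per line, each holding a single text sample."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for text in texts:
            # ASSUMPTION: the "text" key is illustrative, not a confirmed AIPerf schema.
            handle.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count


# Example with a hypothetical file name:
# write_jsonl_inputs("inputs.jsonl", ["What is deep learning?", "Explain vector search."])
```

Writing each record with `json.dumps` keeps quotes and newlines inside a sample escaped, so every sample stays on its own line.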