This guide covers benchmarking servers that implement the OpenAI Responses API (POST /v1/responses) using AIPerf.
The Responses API is OpenAI’s newer API primitive that replaces Chat Completions for new projects. It supports text, images, audio, streaming, and reasoning output.
AIPerf’s responses endpoint type handles the key differences between the Responses API and Chat Completions:
Launch an OpenAI Responses API-compatible server. For example, using a vLLM server:
Verify the server is ready:
Run AIPerf against the Responses API endpoint using synthetic inputs:
Sample Output:
Create a JSONL input file:
Run AIPerf:
In the Responses API, system instructions use a top-level instructions field rather than a system role message. AIPerf handles this mapping automatically when you use --shared-system-prompt-length to generate a synthetic system prompt:
This generates a synthetic system prompt of approximately 50 tokens and places it in the "instructions" field of the Responses API payload, rather than adding a system message to the input array. The same prompt is shared across all requests in the session.
Profile vision-capable models with synthetic images:
Image inputs are formatted as {"type": "input_image", "image_url": "<url>"} in the Responses API (compared to {"type": "image_url", "image_url": {"url": "<url>"}} in Chat Completions).
Profile audio-capable models with the Responses API:
Audio inputs are formatted as {"type": "input_audio", "input_audio": {"data": "<base64>", "format": "<fmt>"}}, the same structure used by Chat Completions.
See the Audio tutorial for details on audio input configuration and supported formats.
Run without streaming to get full responses:
Without --streaming, time-to-first-token (TTFT) and inter-token latency (ITL) metrics are not available. Use streaming mode for the most detailed latency breakdown.
Control load generation the same way as other endpoint types:
Benchmark multi-turn conversations using the Responses API:
See the Multi-Turn Conversations tutorial for details on conversation control parameters.
Use server-reported token counts instead of client-side tokenization:
When --use-server-token-count is enabled with streaming, AIPerf automatically sets stream_options.include_usage in the request payload to receive usage data in the response.completed event.
Pass additional API parameters using --extra-inputs:
When migrating AIPerf benchmarks from --endpoint-type chat to --endpoint-type responses:
--endpoint-type chat to --endpoint-type responses--endpoint /v1/chat/completions to --endpoint /v1/responses--use-legacy-max-tokens flag is not applicable (the Responses API always uses max_output_tokens)--streaming, --concurrency, --extra-inputs, etc.) work the same wayFor reference, AIPerf processes these Responses API streaming events:
This enables accurate measurement of TTFT, ITL, and token throughput metrics when streaming is enabled.