# Profile the OpenAI Responses API with AIPerf

This guide covers benchmarking servers that implement the OpenAI Responses API (`POST /v1/responses`) using AIPerf.

The Responses API is OpenAI's newer API primitive, recommended over Chat Completions for new projects. It supports text, image, and audio inputs, streaming, and reasoning output.
## Overview

AIPerf's `responses` endpoint type handles the key differences between the Responses API and Chat Completions, including the top-level `instructions` field, the `input` array's content formats, and the `max_output_tokens` parameter.
## Start a Server
Launch an OpenAI Responses API-compatible server. For example, using a vLLM server:
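A minimal sketch; the model name and port are placeholders, and `/v1/responses` support depends on your vLLM version:

```shell
# Serve a model with vLLM's OpenAI-compatible server.
# Recent vLLM releases expose /v1/responses alongside /v1/chat/completions.
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```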
Verify the server is ready:
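One way to check, assuming the server exposes the standard OpenAI-compatible model listing:

```shell
# Returns a JSON list of served models once the server is up.
curl -s http://localhost:8000/v1/models
```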
## Profile with Synthetic Inputs
Run AIPerf against the Responses API endpoint using synthetic inputs:
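A sketch of a basic run. `--endpoint-type`, `--endpoint`, and `--streaming` come from this guide; the model and URL are placeholders, and the token-length and request-count flag names are assumptions to verify against `aiperf profile --help`:

```shell
aiperf profile \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type responses \
  --endpoint /v1/responses \
  --streaming \
  --synthetic-input-tokens-mean 550 \
  --output-tokens-mean 150 \
  --request-count 100 \
  --concurrency 8
```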
Sample Output:
## Profile with Custom Input Files
Create a JSONL input file:
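For instance (the `text` field name follows a common JSONL prompt convention; confirm the expected schema for your AIPerf version):

```shell
# Write three single-prompt requests, one JSON object per line.
cat > inputs.jsonl <<'EOF'
{"text": "Summarize the plot of Hamlet in two sentences."}
{"text": "Explain the difference between TCP and UDP."}
{"text": "Write a haiku about GPU clusters."}
EOF
```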
Run AIPerf:
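A sketch; `--input-file` is an assumed flag name for supplying the JSONL file:

```shell
aiperf profile \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type responses \
  --streaming \
  --input-file inputs.jsonl
```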
## System Instructions

In the Responses API, system instructions use a top-level `instructions` field rather than a `system` role message. AIPerf handles this mapping automatically when you use `--shared-system-prompt-length` to generate a synthetic system prompt:
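For example (model and URL are placeholders):

```shell
aiperf profile \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type responses \
  --streaming \
  --shared-system-prompt-length 50
```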
This generates a synthetic system prompt of approximately 50 tokens and places it in the `instructions` field of the Responses API payload, rather than adding a system message to the `input` array. The same prompt is shared across all requests in the session.
## Vision (Image Inputs)
Profile vision-capable models with synthetic images:
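A sketch; the image-dimension flag names are assumptions, and the model is a placeholder for any vision-capable model:

```shell
aiperf profile \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type responses \
  --streaming \
  --image-width-mean 512 \
  --image-height-mean 512
```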
Image inputs are formatted as `{"type": "input_image", "image_url": "<url>"}` in the Responses API (compared to `{"type": "image_url", "image_url": {"url": "<url>"}}` in Chat Completions).
## Audio Inputs
Profile audio-capable models with the Responses API:
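A sketch; the audio flag names here are assumptions (see the Audio tutorial for the authoritative list), and the model is a placeholder:

```shell
aiperf profile \
  --model Qwen/Qwen2-Audio-7B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type responses \
  --streaming \
  --audio-length-mean 5 \
  --audio-format wav
```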
Audio inputs are formatted as `{"type": "input_audio", "input_audio": {"data": "<base64>", "format": "<fmt>"}}`, the same structure used by Chat Completions.
See the Audio tutorial for details on audio input configuration and supported formats.
## Non-Streaming Mode
Run without streaming to get full responses:
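The same command shape as a streaming run, with `--streaming` omitted (model and URL are placeholders):

```shell
aiperf profile \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type responses \
  --concurrency 8
```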
Without `--streaming`, time-to-first-token (TTFT) and inter-token latency (ITL) metrics are not available. Use streaming mode for the most detailed latency breakdown.
## Concurrency and Rate Control
Control load generation the same way as other endpoint types:
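For example, a closed-loop run that holds 32 requests in flight until 500 requests complete; the request-count flag name is an assumption, and a fixed-rate flag (if your AIPerf version provides one) can replace `--concurrency` for open-loop load:

```shell
aiperf profile \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type responses \
  --streaming \
  --concurrency 32 \
  --request-count 500
```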
## Multi-Turn Conversations
Benchmark multi-turn conversations using the Responses API:
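A sketch; the session-control flag names here are assumptions — see the Multi-Turn Conversations tutorial for the authoritative parameters:

```shell
aiperf profile \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type responses \
  --streaming \
  --num-sessions 20 \
  --session-turns-mean 4
```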
See the Multi-Turn Conversations tutorial for details on conversation control parameters.
## Server Token Counts
Use server-reported token counts instead of client-side tokenization:
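For example (model and URL are placeholders):

```shell
aiperf profile \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type responses \
  --streaming \
  --use-server-token-count
```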
When `--use-server-token-count` is enabled with streaming, AIPerf automatically sets `stream_options.include_usage` in the request payload to receive usage data in the `response.completed` event.
## Extra Parameters
Pass additional API parameters using `--extra-inputs`:
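For example, assuming the `key:value` syntax for `--extra-inputs`:

```shell
aiperf profile \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type responses \
  --streaming \
  --extra-inputs temperature:0.7 \
  --extra-inputs top_p:0.9
```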
## Key Differences from Chat Completions

When migrating AIPerf benchmarks from `--endpoint-type chat` to `--endpoint-type responses`:

- Change `--endpoint-type chat` to `--endpoint-type responses`
- Change `--endpoint /v1/chat/completions` to `--endpoint /v1/responses`
- The `--use-legacy-max-tokens` flag is not applicable (the Responses API always uses `max_output_tokens`)
- All other AIPerf flags (`--streaming`, `--concurrency`, `--extra-inputs`, etc.) work the same way
## Streaming Event Handling
For reference, AIPerf processes these Responses API streaming events:
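A sketch of the event flow for one streamed request (event names follow the OpenAI Responses API streaming specification; the exact subset AIPerf consumes may differ):

```
event: response.created            # request accepted; generation begins
event: response.output_text.delta  # one text chunk; the first delta marks TTFT
event: response.output_text.delta  # subsequent deltas drive ITL measurement
event: response.completed          # terminal event; carries usage token counts
```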
This enables accurate measurement of TTFT, ITL, and token throughput metrics when streaming is enabled.