Migrating from GenAI-Perf
AIPerf is designed to be a drop-in replacement for GenAI-Perf for currently supported features. Most options from GenAI-Perf map directly to AIPerf options. Options that don’t are noted below.
Some options, primarily for the analyze subcommand, are not yet supported; they’re planned for future releases.
See the GenAI-Perf vs AIPerf CLI Feature Comparison Matrix for a detailed comparison of the supported CLI options.
--max-threads: You no longer need to set a maximum thread count. Previously, this was a global setting that controlled GenAI-Perf’s total thread count. AIPerf provides more fine-grained control over the number of workers issuing requests to the endpoint through the --workers-max option.
--: The passthrough args flag is no longer required. All options are now natively supported by AIPerf.
To migrate your previous GenAI-Perf commands to AIPerf commands, remove the above options.
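The removal step can be sketched in Python as a small argument filter. This is only an illustration, not part of either tool: the function name strip_removed_options is hypothetical, and the sketch assumes --max-threads is followed by exactly one value (or written as --max-threads=N). Arguments that followed the bare -- separator are kept unchanged, so they may still need to be checked against the AIPerf option names.

```python
def strip_removed_options(args):
    """Return a copy of args without --max-threads (plus its value) or the bare -- separator.

    Hypothetical helper for illustration; assumes --max-threads takes one value.
    """
    cleaned = []
    skip_next = False
    for arg in args:
        if skip_next:
            # This token is the value that belonged to --max-threads.
            skip_next = False
            continue
        if arg == "--max-threads":
            skip_next = True
            continue
        if arg.startswith("--max-threads="):
            continue
        if arg == "--":
            # Drop only the separator; keep whatever followed it.
            continue
        cleaned.append(arg)
    return cleaned


old_args = ["profile", "-m", "my-model", "--max-threads", "8", "--concurrency", "4", "--", "--some-option"]
new_args = strip_removed_options(old_args)
print(" ".join(new_args))
```

Here my-model and --some-option are placeholders rather than real model or option names.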
--server-metrics-url → --gpu-telemetry (Not --server-metrics)
GenAI-Perf’s --server-metrics-url is misleadingly named. Despite the “server metrics” label, the flag points GenAI-Perf at a Triton / DCGM telemetry endpoint (GPU power, utilization, memory); it is not a general Prometheus inference-server metrics scraper.
AIPerf splits the concern into two clearly-scoped flags:
--gpu-telemetry: GPU telemetry collection. Supports both the DCGM exporter HTTP endpoint (default; localhost:9400 + localhost:9401) and the local pynvml library (pass pynvml). Custom DCGM exporter URLs and a dashboard realtime view are also accepted.
--server-metrics: Prometheus inference-server metrics from the model endpoint (base_url + /metrics). Enabled by default; pass additional URLs to scrape extra Prometheus targets.
Porting rule: GenAI-Perf --server-metrics-url http://node:9400 ⇒ AIPerf --gpu-telemetry http://node:9400. Do not map it to --server-metrics; that would target the inference endpoint’s Prometheus exporter, which is a different surface.
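As a rough sketch, the porting rule amounts to renaming one flag and leaving its value alone. The helper name port_server_metrics_flag is hypothetical. The sketch also carries the --flag=value spelling over unchanged, on the assumption that both tools accept it; verify that against the AIPerf CLI before relying on it.

```python
def port_server_metrics_flag(args):
    """Rename GenAI-Perf's --server-metrics-url to AIPerf's --gpu-telemetry.

    Hypothetical helper for illustration; values are passed through untouched.
    """
    old, new = "--server-metrics-url", "--gpu-telemetry"
    ported = []
    for arg in args:
        if arg == old:
            ported.append(new)
        elif arg.startswith(old + "="):
            # Keep the "=value" suffix and swap only the flag name.
            ported.append(new + arg[len(old):])
        else:
            ported.append(arg)
    return ported


ported_args = port_server_metrics_flag(["--server-metrics-url", "http://node:9400"])
print(ported_args)
```

The flag is deliberately never rewritten to --server-metrics, which matches the rule above.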
The format for the inputs.json file, which contains the input prompts used in benchmarking, has changed slightly from GenAI-Perf to AIPerf:
payload → payloads: The singular payload field has been renamed to payloads (plural).
Each entry in the payloads array now represents a turn in a conversation/session, providing better support for multi-turn interactions.
session_id field: A new session_id field has been added to each entry. This enables correlation between requests and payloads for future analytics and tracking purposes.
These changes allow AIPerf to better handle conversational workloads and provide more detailed traceability for performance analysis.
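The field changes can be illustrated with a small conversion sketch. Several details here are assumptions rather than documented behavior: the top-level "data" list, the use of a random UUID string for session_id, and the helper name convert_inputs are all hypothetical. Only the payload → payloads rename and the added session_id field come from the description above.

```python
import uuid


def convert_inputs(genai_perf_inputs):
    """Convert a GenAI-Perf-style inputs dict to the AIPerf-style shape.

    Assumes a top-level "data" list of entries (an assumption, not a documented schema).
    """
    converted = {"data": []}
    for entry in genai_perf_inputs.get("data", []):
        # Copy every field except the old singular "payload".
        new_entry = {key: value for key, value in entry.items() if key != "payload"}
        # Assumption: any unique string is acceptable as a session identifier.
        new_entry["session_id"] = str(uuid.uuid4())
        # Each element of "payloads" is one turn of the conversation/session.
        new_entry["payloads"] = entry.get("payload", [])
        converted["data"].append(new_entry)
    return converted


old_inputs = {"data": [{"payload": [{"prompt": "Hello"}]}]}
new_inputs = convert_inputs(old_inputs)
print(sorted(new_inputs["data"][0].keys()))
```

A single-turn entry from the old file becomes a session with a one-element payloads array.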
Modern language models with reasoning capabilities, such as openai/gpt-oss-120b, DeepSeek-R1, and Qwen3, generate reasoning tokens before producing their final response. These reasoning tokens are typically returned in the reasoning_content field of the API response and represent the model’s internal thought process.
GenAI-Perf does not parse or process reasoning tokens. Content in the reasoning_content field is ignored, which means GenAI-Perf waits until the first non-reasoning output token is generated before recording the Time to First Token (TTFT).
AIPerf fully supports parsing and processing of reasoning tokens. The TTFT metric captures the time to generate the first token of any type, whether it’s a reasoning token or an output token. Additionally, AIPerf introduces a new metric: Time to First Output Token (TTFO), which measures the time to the first non-reasoning output token, equivalent to GenAI-Perf’s TTFT.
When comparing benchmark results between the two tools for reasoning-capable models:
When migrating from GenAI-Perf, use AIPerf TTFO to compare against GenAI-Perf TTFT for equivalent measurements of reasoning-capable models.
By providing both TTFT and TTFO metrics, AIPerf enables more comprehensive performance analysis of reasoning-capable models by offering complete visibility into the token generation timeline.
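The difference between the two timings can be sketched with a toy streamed response. The Chunk type, its field names, and first_token_times are hypothetical and do not reflect AIPerf internals. The sketch assumes each streamed chunk carries an arrival timestamp plus either reasoning_content or regular content.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """One streamed response chunk (hypothetical shape for illustration)."""
    timestamp_ms: float
    reasoning_content: str = ""
    content: str = ""


def first_token_times(request_start_ms, chunks):
    """Return (ttft_ms, ttfo_ms) for a stream of chunks ordered by arrival time."""
    ttft = None  # first token of any type (reasoning or output)
    ttfo = None  # first non-reasoning output token
    for chunk in chunks:
        elapsed = chunk.timestamp_ms - request_start_ms
        if ttft is None and (chunk.reasoning_content or chunk.content):
            ttft = elapsed
        if ttfo is None and chunk.content:
            ttfo = elapsed
            break  # both values are known once output content appears
    return ttft, ttfo


stream = [
    Chunk(timestamp_ms=120.0, reasoning_content="Let me think"),
    Chunk(timestamp_ms=180.0, reasoning_content=" about this."),
    Chunk(timestamp_ms=950.0, content="The answer is 4."),
]
ttft_ms, ttfo_ms = first_token_times(0.0, stream)
print(ttft_ms, ttfo_ms)
```

In this toy stream, the first value corresponds to AIPerf TTFT and the second to AIPerf TTFO. The second is the figure to set against GenAI-Perf TTFT.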
When migrating from GenAI-Perf, use AIPerf Output Token Count to compare against GenAI-Perf OSL for equivalent measurements of reasoning-capable models.
By providing OSL, Reasoning Token Count, and Output Token Count metrics, AIPerf enables more comprehensive performance analysis of reasoning-capable models by providing a complete picture of the token generation process.
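The relationship between the three counts can be sketched as simple arithmetic. This rests on an assumption inferred from the guidance above, not a statement quoted from either tool: that AIPerf OSL counts reasoning and output tokens together, while GenAI-Perf OSL counts only non-reasoning output tokens. The function name token_count_metrics is hypothetical.

```python
def token_count_metrics(reasoning_tokens, output_tokens):
    """Return the three per-request token counts under the stated assumption.

    Assumption: AIPerf OSL = Reasoning Token Count + Output Token Count.
    """
    return {
        "reasoning_token_count": reasoning_tokens,
        "output_token_count": output_tokens,
        "osl": reasoning_tokens + output_tokens,
    }


metrics = token_count_metrics(reasoning_tokens=300, output_tokens=50)
# Value to set against GenAI-Perf OSL for a reasoning-capable model.
comparable_to_genai_perf_osl = metrics["output_token_count"]
print(metrics["osl"], comparable_to_genai_perf_osl)
```

Under that assumption, comparing AIPerf OSL directly with GenAI-Perf OSL would overstate the difference for models that emit many reasoning tokens.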