Migrating from GenAI-Perf | NVIDIA AIPerf Documentation

AIPerf is designed to be a drop-in replacement for GenAI-Perf for currently supported features. Most options from GenAI-Perf map directly to AIPerf options. Options that don’t are noted below. Some options, primarily for the analyze subcommand, are not yet supported; they’re planned for future releases.

See the GenAI-Perf vs AIPerf CLI Feature Comparison Matrix for a detailed comparison of the supported CLI options.

Known CLI Argument Differences

--max-threads: You no longer need to set a max-thread option. Previously, this was a global setting to control GenAI-Perf total thread count. AIPerf provides more-fine grained control of the number of workers issuing requests to the endpoint by using the --workers-max option.
--: The passthrough args flag is no longer required. All options are now natively supported by AIPerf.

To migrate your previous GenAI-Perf commands to AIPerf commands, remove the above options.

`--server-metrics-url` → `--gpu-telemetry` (Not `--server-metrics`)

GenAI-Perf’s --server-metrics-url is misleadingly named. Despite the “server metrics” label, the flag points GenAI-Perf at a Triton / DCGM telemetry endpoint (GPU power, utilization, memory) — it is not a general Prometheus inference-server metrics scraper.

AIPerf splits the concern into two clearly-scoped flags:

--gpu-telemetry — GPU telemetry collection. Supports both the DCGM exporter HTTP endpoint (default; localhost:9400 + localhost:9401) and the local pynvml library (pass pynvml). Custom DCGM exporter URLs and a dashboard realtime view are also accepted.
--server-metrics — Prometheus inference-server metrics from the model endpoint (base_url + /metrics). Enabled by default; pass additional URLs to scrape extra Prometheus targets.

Porting rule: --server-metrics-url http://node:9400 ⇒ AIPerf --gpu-telemetry http://node:9400. Do not map it to --server-metrics — that would target the inference endpoint’s Prometheus exporter, which is a different surface.

Input File Format Changes

The format for the inputs.json file, which contains the input prompts used in benchmarking, has changed slightly from GenAI-Perf to AIPerf:

payload → payloads: The singular payload field has been renamed to payloads (plural).
Conversation turns: Each item in the payloads array now represents a turn in a conversation/session, providing better support for multi-turn interactions.
session_id field: A new session_id field has been added to each entry. This enables correlation between requests and payloads for future analytics and tracking purposes.

These changes allow AIPerf to better handle conversational workloads and provide more detailed traceability for performance analysis.

Reasoning Tokens and Differences in Metrics

Modern language models with reasoning capabilities such as openai/gpt-oss-120b, DeepSeek-R1, Qwen3, or similar, generate reasoning tokens before producing their final response. These reasoning tokens are typically returned in the reasoning_content field of the API response and represent the model’s internal thought process.

Behavioral Differences

GenAI-Perf does not parse or process reasoning tokens. Content in the reasoning_content field is ignored, which means GenAI-Perf waits until the first non-reasoning output token is generated before recording the Time to First Token (TTFT).

AIPerf fully supports parsing and processing of reasoning tokens. The TTFT metric captures the time to generate the first token of any type, whether it’s a reasoning token or an output token. Additionally, AIPerf introduces a new metric: Time to First Output Token (TTFO), which measures the time to the first non-reasoning output token, equivalent to GenAI-Perf’s TTFT.

Impact on Metrics

When comparing benchmark results between the two tools for reasoning-capable models:

Time to First Token (TTFT)

When migrating from GenAI-Perf, use AIPerf TTFO to compare against GenAI-Perf TTFT for equivalent measurements of reasoning-capable models.

AIPerf TTFT measures time to the first token of any type (including reasoning tokens) from the start of the request, and will be lower than GenAI-Perf TTFT
GenAI-Perf TTFT measures time to the first non-reasoning output token from the start of the request, and will be higher than AIPerf TTFT
AIPerf TTFO is equivalent to GenAI-Perf TTFT

By providing both TTFT and TTFO metrics, AIPerf enables more comprehensive performance analysis of reasoning-capable models by offering complete visibility into the token generation timeline.

Output Sequence Length (OSL)

When migrating from GenAI-Perf, use AIPerf Output Token Count to compare against GenAI-Perf OSL for equivalent measurements of reasoning-capable models.

AIPerf OSL includes both reasoning and output tokens, so it will be higher than GenAI-Perf OSL
GenAI-Perf OSL excludes reasoning tokens from the count, so it will be lower than AIPerf OSL
AIPerf Output Token Count is equivalent to GenAI-Perf OSL, as it excludes reasoning tokens from the count
AIPerf Reasoning Token Count is exclusive to AIPerf, as it includes reasoning tokens only

By providing OSL, Reasoning Token Count, and Output Token Count metrics, AIPerf enables more comprehensive performance analysis of reasoning-capable models by providing a complete picture of the token generation process.