YAML Configuration Files | NVIDIA AIPerf Documentation

Overview

AIPerf can be driven entirely from a single YAML file instead of a long string of CLI flags. The YAML format is more readable, easier to version-control, and unlocks features with limited or no CLI equivalent: multi-parameter sweeps, multi-run aggregation, environment variable substitution, and computed values.

This tutorial walks through what a config file looks like, how to grow it from a tiny example to a full sweep, and how it compares to running everything through aiperf profile flags.

You don’t need to choose between the two: CLI flags still work, and they layer on top of a YAML file when you pass both.

Why use a YAML config?

A typical concurrency sweep on the command line looks like this:

$ aiperf profile \
>   --model meta-llama/Llama-3.1-8B-Instruct \
>   --url http://localhost:8000/v1/chat/completions \
>   --endpoint-type chat --streaming \
>   --synthetic-input-tokens-mean 512 --synthetic-input-tokens-stddev 0 \
>   --output-tokens-mean 128 --output-tokens-stddev 0 \
>   --concurrency 8,16,32,64 \
>   --request-count 500 \
>   --warmup-request-count 50 \
>   --artifact-dir ./artifacts/my-test

The same run as a YAML file:

# benchmark.yaml
schemaVersion: "2.0"

benchmark:
  model: meta-llama/Llama-3.1-8B-Instruct
  endpoint:
    url: http://localhost:8000/v1/chat/completions
    type: chat
    streaming: true
  dataset:
    type: synthetic
    entries: 500
    prompts: {isl: 512, osl: 128}
  phases:
    - {name: warmup, type: concurrency, concurrency: 8, requests: 50, exclude_from_results: true}
    - {name: profiling, type: concurrency, requests: 500}
  artifacts:
    dir: ./artifacts/my-test

sweep:
  type: grid
  parameters:
    concurrency: [8, 16, 32, 64]

Run it with:

$ aiperf profile --config benchmark.yaml

Note the two 500s map to different things. The CLI’s --request-count 500 is the stop condition — keep firing until 500 requests complete — and corresponds to phases.profiling.requests: 500. The dataset.entries: 500 is the dataset size — how many unique synthetic prompts to generate up front — and has no CLI shorthand; it’s recycled across requests if the phase runs longer than the dataset. They happen to share a value here but tune independently.
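The difference between dataset size and stop condition can be sketched in a few lines of Python. This is an illustration only, not AIPerf's implementation: the function name prompt_index_for_request is hypothetical, and plain wrap-around order is an assumption about how recycling proceeds.

```python
def prompt_index_for_request(request_number: int, dataset_entries: int) -> int:
    """Map the Nth request (0-based) onto a dataset slot, wrapping around."""
    if dataset_entries <= 0:
        raise ValueError("dataset_entries must be positive")
    return request_number % dataset_entries


# 100 unique prompts but a stop condition of 250 requests: the dataset is
# walked two and a half times before the phase ends.
used = [prompt_index_for_request(i, 100) for i in range(250)]
unique_prompts = len(set(used))
```

Raising the request count changes how long the phase runs; raising the entry count changes how much prompt variety the server sees.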

What you gain over the flag form:

It’s all in one place and you can comment it. No more lost shell history.
Sweeps are first-class. Grid, lockstep zip, named scenarios, and quasi-random search all work out of the box.
You can substitute values from environment variables (${VAR:default}) and compute values with simple expressions ({{ var * 2 }}).
Editors validate as you type if you wire up the bundled JSON Schema.
Errors are kinder. Misspelled keys produce a “did you mean…?” hint instead of being silently ignored.

Your first config: a minimal file that actually works

The smallest legal config is short:

# minimal.yaml
schemaVersion: "2.0"

benchmark:
  model: meta-llama/Llama-3.1-8B-Instruct
  endpoint:
    url: http://localhost:8000
  dataset:
    type: synthetic
    entries: 100
    prompts: {isl: 512, osl: 128}
  phases:
    type: concurrency
    concurrency: 8
    requests: 100

Then:

$ aiperf profile --config minimal.yaml

That’s a complete benchmark — model, endpoint, dataset, and one profiling phase. The endpoint path (/v1/chat/completions) is auto-detected from endpoint.type (defaulting to chat).

You can scaffold this exact file from the bundle without typing it:

$ aiperf config init --template minimal --output minimal.yaml

aiperf config init --list prints every bundled template, grouped by category.

Anatomy of a config

A YAML config has two layers:

# --- envelope (cross-run knobs) ---
schemaVersion: "2.0"
random_seed: 42
variables: {...}
sweep: {...}
multi_run: {...}

# --- benchmark body (the workload itself) ---
benchmark:
  model: ...
  endpoint: {...}
  dataset: {...}
  phases: [...]

The envelope holds settings that apply across runs — sweep definitions, multi-run aggregation, the random seed, and reusable variables.

The benchmark: body holds everything that defines a single benchmark workload. When a sweep is active, this body is what gets varied across runs.

Shorthand vs named forms

Short configs use singular keys. Bigger configs use plural lists with names:

# Shorthand: fastest to read for simple cases
benchmark:
  model: meta-llama/Llama-3.1-8B-Instruct
  dataset: {type: synthetic, prompts: {isl: 512, osl: 128}}
  phases: {type: concurrency, concurrency: 8, requests: 100}

# Named: clearer for phases or models; datasets are currently limited to one entry
benchmark:
  models: [meta-llama/Llama-3.1-8B-Instruct]
  datasets:
    - {name: main, type: synthetic, prompts: {isl: 512, osl: 128}}
  phases:
    - {name: warmup, type: concurrency, concurrency: 4, requests: 50, exclude_from_results: true}
    - {name: profiling, type: poisson, rate: 30.0, duration: 120}

You can mix and match — the loader auto-expands model: into a one-element models: list, dataset: into a one-entry datasets: list named default, and a flat phases: block into a one-element list named profiling. The normalized datasets: form is future-facing but currently accepts exactly one dataset; multiple datasets are a roadmap item.
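The auto-expansion described above amounts to a small normalization pass. The Python sketch below shows the idea under stated assumptions: normalize_body is a hypothetical helper, not the AIPerf loader, and it handles only the three shorthand keys named in the text.

```python
def normalize_body(body: dict) -> dict:
    """Expand singular shorthand keys into the plural, named list forms."""
    out = dict(body)
    if "model" in out:
        # model: X  ->  models: [X]
        out["models"] = [out.pop("model")]
    if "dataset" in out:
        # dataset: {...}  ->  datasets: [{name: default, ...}]
        out["datasets"] = [{"name": "default", **out.pop("dataset")}]
    if isinstance(out.get("phases"), dict):
        # flat phases: {...}  ->  phases: [{name: profiling, ...}]
        out["phases"] = [{"name": "profiling", **out["phases"]}]
    return out


shorthand = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "dataset": {"type": "synthetic", "prompts": {"isl": 512, "osl": 128}},
    "phases": {"type": "concurrency", "concurrency": 8, "requests": 100},
}
normalized = normalize_body(shorthand)
```

A body that already uses the plural forms passes through this sketch unchanged, which is why the two styles can be mixed.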

Inline datasets

Instead of pointing at a prompts.jsonl file with dataset.path:, you can embed records directly in the YAML:

benchmark:
  dataset:
    type: file
    format: single_turn
    records:
      - {text: "What is machine learning?"}
      - {text: "Explain GANs.", output_length: 200}

Useful for shareable repros, k8s ConfigMaps, and small regression fixtures. See Inline Datasets for full coverage including multi-turn, random_pool (with multi-pool dict-of-lists), and mooncake_trace examples.

Both naming styles work

AIPerf accepts either snake_case or camelCase for any field. These two are equivalent:

multi_run: {num_runs: 3, cooldown_seconds: 15.0}

multiRun: {numRuns: 3, cooldownSeconds: 15.0}

Pick one and stick with it within a file.
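One way to picture the equivalence is a key-normalization pass that folds camelCase into snake_case before validation. The Python sketch below is an assumption about the general technique, not AIPerf's actual mechanism; to_snake_case and normalize_keys are hypothetical names.

```python
import re


def to_snake_case(key: str) -> str:
    """Insert an underscore before each interior capital, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", key).lower()


def normalize_keys(value):
    """Recursively rewrite mapping keys; leave scalar values untouched."""
    if isinstance(value, dict):
        return {to_snake_case(k): normalize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_keys(v) for v in value]
    return value


camel = {"multiRun": {"numRuns": 3, "cooldownSeconds": 15.0}}
snake = {"multi_run": {"num_runs": 3, "cooldown_seconds": 15.0}}
```

Both inputs normalize to the same dictionary, so downstream code only ever sees one spelling.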

Editor autocomplete and validation

A bundled JSON Schema gives you autocomplete, type-checking, and inline docs in any editor that supports the YAML language server (VS Code, JetBrains, Vim/Neovim with coc-yaml, Helix, etc.). The schema lives at src/aiperf/config/schema/aiperf-config.schema.json in the AIPerf repo. Copy or symlink it next to your config and point your editor at it with a relative path:

# yaml-language-server: $schema=./aiperf-config.schema.json

Now the editor will:

Suggest valid keys as you type.
Underline misspelled fields in red.
Show field descriptions on hover.
Catch type errors (e.g. setting concurrency: "eight" instead of 8).

If your editor already has a workspace mapping for **/aiperf-config.yaml or **/benchmark.yaml, you can skip the header. See src/aiperf/config/schema/README.md for VS Code workspace and IntelliJ configuration examples.

Helpful errors when you typo

Top-level envelope keys reject unknown names with a “did you mean” hint. Writing sweeps: instead of sweep: produces:

Unknown top-level envelope key(s): 'sweeps' (did you mean 'sweep'?). Known keys: ['benchmark', 'multiRun', 'multi_run', 'noSweepTable', 'no_sweep_table', 'plot', 'randomSeed', 'random_seed', 'schemaVersion', 'schema_version', 'sweep', 'variables']
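A hint of this kind needs nothing more than fuzzy string matching. The sketch below uses Python's standard difflib module; it illustrates the technique and is not AIPerf's validator. The did_you_mean helper is hypothetical, and KNOWN_KEYS is a trimmed snake_case subset of the list in the message above.

```python
import difflib
from typing import List, Optional

# Snake_case subset of the known envelope keys shown in the error message.
KNOWN_KEYS = [
    "benchmark", "multi_run", "no_sweep_table", "plot",
    "random_seed", "schema_version", "sweep", "variables",
]


def did_you_mean(unknown: str, known: List[str]) -> Optional[str]:
    """Return the closest known key, or None when nothing is similar enough."""
    matches = difflib.get_close_matches(unknown, known, n=1)
    return matches[0] if matches else None


hint = did_you_mean("sweeps", KNOWN_KEYS)
```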

Inside the benchmark: body and inside sweep parameter paths, every section is set to reject unknown fields outright. A typo’d sweep parameter like phases.profiling.concurency (one r) is caught at validate time — aiperf config validate runs the same sweep-expansion pipeline profile does and surfaces the error before any compute is spent:

ValidationError: 1 validation error for BenchmarkConfig
phases.0.concurrency.profiling
  Extra inputs are not permitted [type=extra_forbidden, input_value={'concurency': 8}, ...]

Use aiperf config validate <file> for routine linting. Use aiperf config expand <file> when you want to preview the actual variations a sweep will produce (see below). Both catch sweep-path typos; expand additionally renders the variation list.

Substituting environment variables

Use ${VAR} for required values and ${VAR:default} for optional ones:

benchmark:
  model: ${MODEL_NAME:meta-llama/Llama-3.1-8B-Instruct}
  endpoint:
    url: ${INFERENCE_URL:http://localhost:8000/v1/chat/completions}
    api_key: ${OPENAI_API_KEY}             # required, errors if unset
    timeout: ${TIMEOUT:600.0}

Run it across deployments without editing the file:

$ MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct \
> INFERENCE_URL=http://prod.example.com:8000/v1/chat/completions \
> OPENAI_API_KEY=sk-... \
> aiperf profile --config benchmark.yaml

Strings are auto-coerced to the right type — TIMEOUT=600.0 becomes a float, STREAMING=true becomes a bool.

If a required ${VAR} is unset, you get a clean error naming the variable, not a silent fallback.
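The substitution and coercion rules can be sketched with a regular expression. This is a minimal Python illustration under stated assumptions, not AIPerf's code: substitute_env and coerce are hypothetical helpers, and the rule that the default is everything after the first colon is inferred from the URL example above.

```python
import re

# ${NAME} or ${NAME:default}; the default may itself contain colons (URLs).
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def substitute_env(text: str, env: dict) -> str:
    """Replace placeholders from env, falling back to inline defaults."""
    def replace(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        raise KeyError(f"required environment variable {name!r} is not set")
    return _PLACEHOLDER.sub(replace, text)


def coerce(text: str):
    """Turn 'true'/'false' into bool and numeric strings into int or float."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text
```

In real use the env argument would be os.environ; a plain dict is used here so the behavior is easy to check in isolation.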

Reusable values and computed expressions

Define values once at the top, reference them anywhere with {{ }} Jinja expressions:

variables:
  base_concurrency: 16
  isl_target: 512
  test_duration: 120

benchmark:
  dataset:
    type: synthetic
    entries: "{{ base_concurrency * 10 }}"      # 160
    prompts:
      isl:
        mean: "{{ isl_target }}"
        stddev: "{{ isl_target // 10 }}"        # 51 (integer division)
  phases:
    - name: warmup
      type: concurrency
      concurrency: "{{ base_concurrency // 2 }}"  # 8
      requests: "{{ base_concurrency * 5 }}"      # 80
      exclude_from_results: true
    - name: profiling
      type: gamma
      rate: 30
      duration: "{{ test_duration }}"
      concurrency: "{{ base_concurrency * 4 }}"   # 64
      rate_ramp: "{{ test_duration // 4 }}"       # 30

A few things worth knowing:

Variables can reference other variables, in any order. AIPerf resolves the dependency graph for you.
Typos like {{ base_concurrancy }} raise an error immediately — they don’t silently render as an empty string.
Numeric strings like "42" and "3.14" are coerced to int/float automatically, so you don’t have to remember which fields expect numbers.
Env vars run before Jinja, so you can do entries: "{{ base * '${MULT:10}' | int }}".
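Order-independent variables and fail-fast typos both fall out of a topological sort over the references. The Python sketch below shows that idea using the standard graphlib module. It is not AIPerf's resolver and does not use Jinja: resolve_variables is a hypothetical helper that accepts only whole-string {{ ... }} arithmetic expressions and evaluates them with eval, which is acceptable for a demonstration but unsafe on untrusted input.

```python
import re
from graphlib import TopologicalSorter

_EXPR = re.compile(r"^\s*\{\{(.*)\}\}\s*$")   # a value that is one {{ ... }}
_NAME = re.compile(r"\b[A-Za-z_]\w*")          # identifiers inside it


def resolve_variables(variables: dict) -> dict:
    """Evaluate variables in dependency order, rejecting unknown names."""
    expressions, graph = {}, {}
    for name, value in variables.items():
        match = _EXPR.match(value) if isinstance(value, str) else None
        expressions[name] = match.group(1).strip() if match else None
        refs = set(_NAME.findall(expressions[name])) if match else set()
        missing = refs - set(variables)
        if missing:
            raise NameError(f"{name}: undefined variable(s) {sorted(missing)}")
        graph[name] = refs

    resolved = {}
    # static_order() yields each variable after everything it depends on.
    for name in TopologicalSorter(graph).static_order():
        expr = expressions[name]
        if expr is None:
            resolved[name] = variables[name]
        else:
            # Demonstration only: eval with builtins removed.
            resolved[name] = eval(expr, {"__builtins__": {}}, dict(resolved))
    return resolved


result = resolve_variables({
    "total": "{{ per_user * users }}",   # declared before its dependencies
    "users": 4,
    "per_user": "{{ base // 2 }}",
    "base": 16,
})
```

A circular reference makes TopologicalSorter raise graphlib.CycleError, which is the natural place to report it.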

Multiple phases in one file

A typical benchmark is a quick warmup followed by the real measurement. CLI warmup flags are limited to scalar values per phase shape (--warmup-request-count, --warmup-duration, --warmup-concurrency, --warmup-request-rate, --warmup-arrival-pattern, and a handful of ramp/grace-period siblings). YAML lets you describe warmup as a full phase with all the same fields available to profiling:

benchmark:
  phases:
    - name: warmup
      type: concurrency
      concurrency: 8
      requests: 50
      exclude_from_results: true   # don't pollute the report

    - name: profiling
      type: poisson
      rate: 30.0
      duration: 120
      concurrency: 64
      grace_period: 60             # finish in-flight requests after duration

Each phase is a complete arrival pattern in its own right, with its own concurrency, duration, and arrival shape (concurrency, constant, poisson, gamma, fixed_schedule, user_centric, …).

Adaptive scale in YAML

Adaptive scale is for single-run boundary discovery. Instead of launching a sweep or separate search trials, AIPerf runs one profiling phase, starts at a low control value, evaluates SLA windows, ramps up while every SLA filter passes, and then sustains near the discovered boundary.

The canonical adaptive shape uses a nested control block. Supported v1 control variables are concurrency, prefill_concurrency, request_rate, and users. Existing flat concurrency fields such as control_variable: concurrency and min_concurrency are still accepted as compatibility aliases, but new configs should use control.variable, control.min, and control.max.

schemaVersion: "2.0"

benchmark:
  model: meta-llama/Llama-3.1-8B-Instruct
  endpoint:
    url: http://localhost:8000/v1/chat/completions
    type: chat
    streaming: true
  dataset:
    type: synthetic
    entries: 1000
    prompts: {isl: 512, osl: 128}
  phases:
    - name: profiling
      type: concurrency
      concurrency: 200
      prefill_concurrency: 64
      duration: 3600
      adaptive_scale:
        enabled: true
        control:
          variable: prefill_concurrency
          min: 1
          max: 64
        assessment_period: 60
        min_completed_requests: 20
        sustain_duration: 1800
        strategy:
          type: ramp_until_fail
          step_policy: sla_margin
          base_step: 10
          max_step_multiplier: 4
      sla:
        request_latency:
          p95:
            le: 30000
        error_rate:
          avg:
            le: 0.01

Adaptive scale rejects fixed ramps on the same variable it controls. For example, do not combine control.variable: prefill_concurrency with prefill_ramp. Fixed ramps for other variables are allowed.

The CLI exposes a compact sweep-like control flag for the common single-phase case: --adaptive-scale-control variable:min,max:type, plus repeated --adaptive-scale-sla metric:stat:op:threshold flags. For example: --adaptive-scale-control "concurrency:1,1000:int" --adaptive-scale-sla "request_latency:p95:le:30000". Expanded --adaptive-control-variable, --adaptive-control-min, and --adaptive-control-max flags remain supported for advanced scripting; if expanded --adaptive-control-max is omitted, AIPerf infers it from the matching phase target such as --concurrency, --prefill-concurrency, --request-rate, or --num-users. Do not mix compact and expanded control forms.
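The two compact formats are simple colon-separated fields, and reading them can be sketched in Python as follows. This is not AIPerf's argument parser: Control, SlaFilter, parse_control, and parse_sla are hypothetical names, and the validation rules (min not above max, four known operators) are assumptions drawn from the surrounding text.

```python
from typing import NamedTuple


class Control(NamedTuple):
    variable: str
    minimum: float
    maximum: float
    value_type: str


class SlaFilter(NamedTuple):
    metric: str
    stat: str
    op: str
    threshold: float


def parse_control(spec: str) -> Control:
    """Parse 'variable:min,max:type', e.g. 'concurrency:1,1000:int'."""
    variable, bounds, value_type = spec.split(":")
    low, high = (float(part) for part in bounds.split(","))
    if low > high:
        raise ValueError(f"min {low} exceeds max {high}")
    return Control(variable, low, high, value_type)


def parse_sla(spec: str) -> SlaFilter:
    """Parse 'metric:stat:op:threshold', e.g. 'request_latency:p95:le:30000'."""
    metric, stat, op, threshold = spec.split(":")
    if op not in ("lt", "le", "gt", "ge"):
        raise ValueError(f"unknown comparison operator {op!r}")
    return SlaFilter(metric, stat, op, float(threshold))


control = parse_control("concurrency:1,1000:int")
sla = parse_sla("request_latency:p95:le:30000")
```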

Adaptive scale combines SLA filters with simple AND semantics. A window passes only when every configured filter passes. Step sizing uses the smallest normalized passing margin, so the closest SLA boundary controls the next increase. There are no weights, formulas, or multi-objective scoring in single-run adaptive scale.
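The AND rule and the idea of a binding constraint can be sketched in Python. The margin formula here is an assumption made for illustration (distance to the threshold as a fraction of the threshold); the text does not spell out AIPerf's exact normalization, and evaluate_window and normalized_margin are hypothetical names.

```python
import operator

_OPS = {"lt": operator.lt, "le": operator.le,
        "gt": operator.gt, "ge": operator.ge}


def normalized_margin(value: float, op: str, threshold: float) -> float:
    """Headroom as a fraction of the threshold; positive means passing."""
    if op in ("lt", "le"):              # lower is better
        return (threshold - value) / threshold
    return (value - threshold) / threshold   # higher is better


def evaluate_window(observed: dict, filters: list):
    """filters holds (metric, op, threshold); returns (passed, binding metric)."""
    passed = all(_OPS[op](observed[m], t) for m, op, t in filters)
    binding = min(filters,
                  key=lambda f: normalized_margin(observed[f[0]], f[1], f[2]))
    return passed, binding[0]


filters = [("request_latency_p95", "le", 30000), ("error_rate_avg", "le", 0.01)]
ok = evaluate_window(
    {"request_latency_p95": 24000, "error_rate_avg": 0.002}, filters)
too_slow = evaluate_window(
    {"request_latency_p95": 31000, "error_rate_avg": 0.002}, filters)
```

In the first window latency has 20% headroom and error rate 80%, so latency is the binding constraint and would govern the next step size.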

For lower-is-better SLA filters such as latency, TTFT, error rate, or cancellation rate, use lt or le:

sla:
  time_to_first_token:
    p95:
      le: 3000
  cancellation_rate:
    avg:
      le: 0.05

For higher-is-better SLA filters such as throughput or adaptive-window success ratio, use gt or ge:

sla:
  request_throughput:
    avg:
      ge: 80
  success_rate:
    avg:
      ge: 0.95

Quality-qualified adaptive goodput is also higher-is-better, but must be paired with at least one per-request quality filter:

sla:
  request_latency:
    p95:
      le: 30000
  goodput:
    avg:
      ge: 20

For streaming token quality, pair goodput with TTFT and ITL filters:

sla:
  ttft:
    p95:
      le: 2000
  itl:
    p95:
      le: 100
  goodput:
    avg:
      ge: 20

Adaptive SLA metric support

Timing thresholds use milliseconds. This table is the adaptive SLA metric/stat support matrix:

| Metric family | Metric tags and aliases | Supported stats | Window semantics |
| --- | --- | --- | --- |
| E2E latency | `request_latency` | `avg`, `min`, `max`, `p1`, `p5`, `p10`, `p25`, `p50`, `p75`, `p90`, `p95`, `p99` | Per-request latency samples from successful requests. |
| Time to first token | `time_to_first_token`, `ttft` | `avg`, `min`, `max`, `p1`, `p5`, `p10`, `p25`, `p50`, `p75`, `p90`, `p95`, `p99` | Per-request TTFT samples from successful streaming requests. Missing TTFT samples fail TTFT SLA windows. |
| Inter-token latency | `inter_token_latency`, `itl`, `tpot` | `avg`, `min`, `max`, `p1`, `p5`, `p10`, `p25`, `p50`, `p75`, `p90`, `p95`, `p99` | Per-request ITL/TPOT samples from successful streaming requests. Missing ITL samples fail ITL SLA windows. |
| Request throughput | `throughput`, `request_throughput`, `completed_request_throughput` | `avg`, `min`, `max` | Successful completed requests per second in the adaptive window. |
| Output token throughput | `output_token_throughput` | `avg`, `min`, `max` | Output tokens from successful completed requests per second in the adaptive window. |
| Quality goodput | `goodput` | `avg`, `min`, `max` | Quality-passing successful requests per second. Requires at least one request-latency, TTFT, or ITL quality filter. |
| Goodput ratio | `goodput_ratio` | `avg`, `min`, `max` | Quality-passing successful requests divided by total attempts. Requires at least one request-latency, TTFT, or ITL quality filter. |
| Success rate | `success_rate`, `request_success_rate` | `avg`, `min`, `max` | Successful completed requests divided by total attempts. |
| Error rate | `error_rate`, `request_error_rate` | `avg`, `min`, `max` | Failed requests divided by total attempts. |
| Cancellation rate | `cancellation_rate`, `request_cancellation_rate` | `avg`, `min`, `max` | Cancelled requests divided by total attempts. |

For window-level rate and ratio metrics, avg, min, and max currently evaluate the same scalar value for each adaptive window.

Adaptive users is valid only on user_centric phases. It controls target live simulated user timelines while keeping total QPS fixed; changing users changes per-user turn gap and population pressure rather than acting as another spelling of request rate.

benchmark:
  phases:
    - name: profiling
      type: user_centric
      users: 5000
      rate: 100
      duration: 8h
      adaptive_scale:
        enabled: true
        control:
          variable: users
          min: 500
          max: 5000
        assessment_period: 300
        sustain_duration: 6h
      sla:
        time_to_first_token:
          p95:
            le: 3000
        cancellation_rate:
          avg:
            le: 0.10

Adaptive scale writes two timing-owned artifacts into the run directory:

adaptive_scale_events.jsonl
adaptive_scale_summary.json

These artifacts use schema version 2 and generic control fields such as control_variable, control_value_before, control_value_after, boundary_value, last_passing_value, and first_failing_value. Every adaptive_window event includes all evaluated SLA values and the binding constraint. Dynamo-style pollers should gate fault injection on explicit events such as sustain_started rather than fixed sleeps.
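Gating on an explicit event rather than a sleep might look like the Python sketch below. The record layout is an assumption: the text names the event types and the control fields, but not the key that carries the event type, so "event" is a hypothetical key name, as are first_event and the sample records.

```python
import json


def first_event(lines, event_type, key="event"):
    """Return the first JSONL record whose `key` equals event_type, else None.

    `key` defaults to "event", which is an assumed field name.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get(key) == event_type:
            return record
    return None


# Hypothetical contents of adaptive_scale_events.jsonl.
sample = [
    '{"event": "adaptive_window", "control_variable": "concurrency", '
    '"control_value_before": 10, "control_value_after": 20}',
    '{"event": "sustain_started", "control_variable": "concurrency", '
    '"boundary_value": 20}',
]
started = first_event(sample, "sustain_started")
```

A poller would re-read or tail the file until this returns a record, then begin fault injection.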

Use adaptive scale when you want continuous pressure inside one benchmark invocation. Use sweep or adaptive_search when you want offline multi-run exploration across many independent trials.

Sweeps in YAML

Sweeps are the killer feature of YAML configs. The CLI only ever supported list-style flags like --concurrency 8,16,32. YAML lets you sweep any field, combine multiple parameters, or pull from a quasi-random distribution.

Here’s a 3 × 3 = 9-run grid sweep over input length and request rate:

schemaVersion: "2.0"

sweep:
  type: grid
  parameters:
    datasets.default.prompts.isl: [128, 512, 2048]
    rate: [10.0, 30.0, 50.0]

benchmark:
  model: meta-llama/Llama-3.1-8B-Instruct
  endpoint:
    url: http://localhost:8000/v1/chat/completions
  dataset:
    type: synthetic
    entries: 500
    prompts: {isl: 512, osl: 128}      # isl is overridden by the sweep
  phases:
    - name: profiling
      type: poisson
      rate: 20.0                        # overridden by the sweep
      duration: 120

The parameters: keys are dot-paths into the benchmark: body. For lists, the second segment is the entry’s name:

phases.profiling.rate → the phase named profiling, field rate
datasets.default.prompts.isl → the dataset named default (the singular dataset: shorthand auto-names it default)

The 12 most-swept phase fields also have bare-name sugar: concurrency, prefill_concurrency, rate, requests, duration, sessions, users, smoothness, grace_period, concurrency_ramp, prefill_ramp, rate_ramp. Each expands to phases.profiling.<name> (resolves to the unique non-warmup phase). The two forms are equivalent — see Bare-Name Aliases.
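Dot-path resolution and bare-name expansion can be sketched together in Python. This is an illustration rather than AIPerf's resolver: expand_path and set_by_path are hypothetical helpers, the body is assumed to be in the normalized plural form already, and the non-warmup phase is assumed to be named profiling.

```python
BARE_NAMES = {
    "concurrency", "prefill_concurrency", "rate", "requests", "duration",
    "sessions", "users", "smoothness", "grace_period",
    "concurrency_ramp", "prefill_ramp", "rate_ramp",
}


def expand_path(path: str, phase: str = "profiling") -> str:
    """Rewrite a bare phase field into its full dot-path."""
    return f"phases.{phase}.{path}" if path in BARE_NAMES else path


def set_by_path(body: dict, path: str, value) -> None:
    """Walk the path; inside a list, a segment selects the entry by name."""
    node = body
    parts = expand_path(path).split(".")
    for part in parts[:-1]:
        if isinstance(node, list):
            matches = [item for item in node if item.get("name") == part]
            if not matches:
                raise KeyError(f"no entry named {part!r}")
            node = matches[0]
        else:
            node = node[part]
    node[parts[-1]] = value


body = {
    "datasets": [{"name": "default", "prompts": {"isl": 512, "osl": 128}}],
    "phases": [{"name": "profiling", "type": "poisson", "rate": 20.0}],
}
set_by_path(body, "rate", 30.0)                          # bare-name form
set_by_path(body, "datasets.default.prompts.isl", 2048)  # full dot-path
```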

Other sweep modes available in YAML:

zip — pair parameters lockstep instead of cross-product (useful for paired ISL/OSL).
scenarios — hand-curated named workload profiles, each a deep-merge over the base body.
sobol / latin_hypercube — quasi-random space-filling samples.
adaptive_search — Bayesian optimization over multiple objectives.
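The difference between grid and zip is the difference between a cross-product and a lockstep pairing. The Python sketch below shows both on the parameters from the grid example; it is not AIPerf's expansion code, the function names are hypothetical, and requiring equal-length lists for zip is an assumption.

```python
from itertools import product


def grid(parameters: dict) -> list:
    """Every combination of every value: the cross-product."""
    names = list(parameters)
    return [dict(zip(names, combo)) for combo in product(*parameters.values())]


def lockstep(parameters: dict) -> list:
    """Pair the Nth value of each parameter with the Nth of the others."""
    if len({len(values) for values in parameters.values()}) > 1:
        raise ValueError("lockstep pairing needs lists of equal length")
    names = list(parameters)
    return [dict(zip(names, combo)) for combo in zip(*parameters.values())]


params = {
    "datasets.default.prompts.isl": [128, 512, 2048],
    "rate": [10.0, 30.0, 50.0],
}
grid_runs = grid(params)      # 3 x 3 = 9 variations
zip_runs = lockstep(params)   # 3 variations
```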

For a guided picker, see Parameter Sweeps — Choosing a sweep mode.

You can preview what a sweep will run before spending any compute:

$ aiperf config expand sweep.yaml             # lists the variations
$ aiperf config expand sweep.yaml --full      # dumps each variation's full body
$ aiperf config expand sweep.yaml --index 2 --full   # inspect one variation

Repeating a benchmark for confidence intervals

Running the same benchmark several times and taking the mean ± confidence interval is a separate envelope-level setting:

multi_run:
  num_runs: 3
  cooldown_seconds: 15.0
  confidence_level: 0.95
  set_consistent_seed: true
  disable_warmup_after_first: true   # warmup once, reuse the warm cache

multi_run and sweep compose: a 9-variation grid × 3 runs = 27 benchmarks, with confidence intervals computed per variation. See Multi-Run Confidence Reporting for what the report looks like.
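As a worked example of mean ± confidence interval, the Python sketch below computes a two-sided 95% interval for a handful of runs using Student's t critical values. Whether AIPerf uses this exact estimator is an assumption; mean_with_ci95 is a hypothetical helper and the throughput figures are made up.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def mean_with_ci95(samples):
    """Return (mean, half_width) so the interval is mean +/- half_width."""
    n = len(samples)
    if n - 1 not in T_CRIT_95:
        raise ValueError("this sketch supports 2 to 6 samples")
    mean = statistics.fmean(samples)
    std_error = statistics.stdev(samples) / math.sqrt(n)
    return mean, T_CRIT_95[n - 1] * std_error


# Three runs of one sweep variation (illustrative numbers).
mean, half_width = mean_with_ci95([100.0, 110.0, 120.0])
total_benchmarks = 9 * 3   # 9 variations x 3 runs each
```

With only three runs the critical value is large, so the interval is wide; adding runs narrows it quickly.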

CLI helpers for working with configs

Three commands cover the common authoring tasks:

$ # Scaffold from a template (27+ bundled, covering most workloads)
$ aiperf config init --list                          # browse
$ aiperf config init --search sweep                  # search by keyword
$ aiperf config init --template goodput_slo \
>   --model meta-llama/Llama-3.1-70B-Instruct \
>   --url http://localhost:8000/v1/chat/completions \
>   --output benchmark.yaml

$ # Lint a config without running it
$ aiperf config validate benchmark.yaml

$ # Preview sweep variations
$ aiperf config expand sweep.yaml --full

validate runs the same load pipeline profile does, so anything wrong shows up here — typos, missing required fields, sweep paths that don’t resolve, env vars that aren’t set.

Mixing YAML with CLI flags

YAML configs and CLI flags are not either/or. Flags overlay whatever’s in the file:

$ aiperf profile --config benchmark.yaml \
>   --concurrency 32 \
>   --artifact-dir ./run-2026-05-09

This loads benchmark.yaml as the base, then overrides the profiling phase’s concurrency with 32 and the artifact directory with the new path. (CLI loadgen flags overlay onto the phase named profiling — they don’t broadcast to every named phase, so multi-phase configs need YAML edits to tweak warmup or other phases.) Useful when most of your config is stable but you want to tweak one knob from a script or CI job.

The precedence order, lowest to highest:

Defaults baked into AIPerf
Values in the YAML file
Explicit CLI flags
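This layering behaves like a series of deep merges, each later layer winning on conflicts. The Python sketch below shows the idea; deep_merge is a hypothetical helper and the default values are invented for the example, not AIPerf's real defaults.

```python
def deep_merge(base: dict, overlay: dict) -> dict:
    """Return base with overlay applied; nested dicts merge key by key."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Lowest to highest precedence. The defaults here are hypothetical.
defaults = {"endpoint": {"type": "chat", "streaming": False},
            "artifacts": {"dir": "./artifacts"}}
from_yaml = {"endpoint": {"streaming": True},
             "artifacts": {"dir": "./artifacts/my-test"}}
from_cli = {"artifacts": {"dir": "./run-2026-05-09"}}

effective = deep_merge(deep_merge(defaults, from_yaml), from_cli)
```

The CLI value replaces only the key it names; everything else from the YAML file and the defaults survives.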

Where to go next

Bundled templates — 27+ ready-to-run examples grouped by category (Getting Started, Load Testing, Datasets, Sweep & Multi-Run, Advanced, Multimodal, Specialized Endpoints).
Parameter Sweeps — Choosing a sweep mode — picker for grid vs. zip vs. scenarios vs. Sobol vs. adaptive search.
Parameter Sweeps — deeper dive on sweep mechanics, output structure, and Pareto analysis.
Multi-Run Confidence Reporting — how multi_run propagates through reports.
CLI Options — every CLI flag, in case you want to overlay one onto a YAML config.