Benchmark the Performance of Your NVIDIA RAG Blueprint System#

After you deploy your NVIDIA RAG Blueprint system, benchmark its performance: latency, throughput, and per-stage timing: using the bundled rag-perf CLI.

For accuracy benchmarks (RAGAS-based scoring of answer quality), see scripts/eval/README.md: the runnable evaluate_rag.py CLI. (For the conceptual / notebook overview of RAGAS metrics, see Evaluate Your NVIDIA RAG Blueprint System.) The two tools are complementary: evaluate_rag.py measures how well the system answers; rag-perf measures how fast and at what concurrency.

What `rag-perf` measures#

For each benchmark point, rag-perf runs two passes against the deployed RAG server and folds the results into a unified report:

Profiling pass: direct async httpx requests against the RAG server. Captures server-side per-stage timing that a generic load tool cannot see: time spent in retrieval, reranking, and LLM TTFT, plus citation counts and relevance scores from the streamed response. From this pass rag-perf infers which stage is the bottleneck for the current configuration (retrieval, reranking, or llm).
Load-test pass: aiperf drives concurrent traffic against the same server through the bundled nvidia_rag endpoint plugin. Captures TTFT mean / p50 / p90 / p99, end-to-end latency p90 / p99, output-token throughput, request throughput, and error rate.

The combined output is a single RagMetricsSummary rendered as a Rich terminal table, a Markdown report, and structured JSON / CSV for downstream graphing. Set aiperf.enabled: false in the YAML to skip the load-test pass entirely: useful for fast iteration on retrieval/reranker tuning.

Quickstart#

This section runs a full benchmark in under three minutes against a default deployment using the queries shipped with the tool (scripts/rag-perf/examples/queries.jsonl).

Prerequisites: the RAG server must be running and reachable on the network: for example, after completing the Quickstart: self-hosted Docker. Python ≥ 3.11 on the machine running the benchmark.

# 1. Install rag-perf into its own uv-managed venv (one-time).
uv sync --project scripts/rag-perf

# 2. Edit configs/single_run.yaml to point at your collection (rag.collection_names),
#    then run it.
uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/single_run.yaml

# 3. View the report.
ls rag-perf-results/single_run/run_*/
# report.md  results.csv  results.json  profiling/  aiperf_rag_on/

You should see a Rich-rendered table on stdout with the bottleneck stage, TTFT percentiles, throughput, and error rate. The report.md file contains the same data in Markdown form for sharing or PR attachments.

Tip: copy the preset to your own YAML (for example cp configs/single_run.yaml my_run.yaml) and edit fields there. The CLI takes only --config, so the YAML is the unit of versionable configuration.

Three preset workflows#

rag-perf is a single command: rag-perf -c <config>: and behaviour is fully driven by the YAML you pass it. Three presets cover the common workflows; each section below describes when to use it, what it produces, and how to invoke it.

Preset	When to use	Approximate runtime
`quick_profile.yaml`	Iterating on retrieval / reranker tuning. No load test.	~30 s
`single_run.yaml`	One concurrency level; full report (profiling + load test).	~2 min
`sweep.yaml`	Compare across multiple values of any axis. Make `load.concurrency`, `rag.vdb_top_k`, or `rag.reranker_top_k` a list to sweep that axis; multiple lists give a full Cartesian sweep.	A few minutes per point.

Quick profiling#

Use this when you want server-side stage timing fast: for example to check whether retrieval or reranking is the bottleneck after changing vdb_top_k. No load is generated.

Config: scripts/rag-perf/configs/quick_profile.yaml.

Before you run: edit rag.collection_names in the config to point at a real collection on your deployed ingestor server. The shipped value is ["<collection_name>"], which the run will fail to retrieve from.

uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/quick_profile.yaml

Output (under output.dir, default ./rag-perf-results/quick_profile/):

run_<ts>/
├── profile_report.md
├── profile_results.json
└── profiling/
    └── profiler_records.jsonl

The profile_* filename prefix flags this as a profile-only run (no aiperf data). To convert any other config to profile-only, set aiperf.enabled: false in the YAML.

Single-point run#

Use this when you want a single benchmark point with both passes: for example a regression check at a known-good concurrency before launching a larger sweep.

Config: scripts/rag-perf/configs/single_run.yaml.

Before you run: edit rag.collection_names in the config to point at a real collection on your deployed ingestor server.

uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/single_run.yaml

Output:

run_<ts>/
├── report.md
├── results.csv
├── results.json
├── profiling/
│   └── profiler_records.jsonl
└── aiperf_rag_on/
    ├── profile_export_aiperf.csv
    ├── profile_export_aiperf.json
    └── profile_export.jsonl

Concurrency sweep#

Use this to compare TTFT, latency, and throughput across multiple concurrency levels. The config’s load.concurrency is a list (for example [1, 4, 8, 16, 32]); each value is a benchmark point. Edit the list in the YAML to add or remove levels.

Config: scripts/rag-perf/configs/sweep.yaml.

Before you run: edit rag.collection_names in the config to point at a real collection on your deployed ingestor server.

uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/sweep.yaml

Output is nested: each grid point gets its own subdirectory, with aggregate report files at the run root:

run_<ts>/
├── report.md
├── results.csv             # one row per point
├── results.json
└── iter_1/
    ├── CR:1_ISL:50_OSL:512_VDB-K:20_RERANKER-K:4_Model:.../
    │   ├── profiling/
    │   └── aiperf_rag_on/
    ├── CR:4_ISL:50_OSL:512_VDB-K:20_RERANKER-K:4_Model:.../
    └── ...

When load.iterations > 1, the entire grid is repeated and each repetition writes to its own iter_<i>/ subdirectory so variance can be measured across runs.

To run a full Cartesian sweep across concurrency × vdb_top_k × reranker_top_k, change any of those fields from a scalar to a list (for example rag.vdb_top_k: [20, 100]). For overnight runs, set load.sleep_between_points_s: 60 so the server has time to drain in-flight requests between points (this matches the blueprint pipeline’s default).

What you will see on stdout#

Every invocation prints, in order:

The startup banner plus a one-line summary of target / collection / top-k / input source / concurrency / aiperf state.
The fully resolved configuration as YAML, dumped verbatim so the run is reproducible from the terminal output alone (every field that drove the run, including the synthetic block and any defaults that were filled in).
For each grid point, a per-iteration log line plus the aiperf shell command in copy-pastable form ($ python -m aiperf profile -m … --endpoint-type nvidia_rag …) before the subprocess fires.
After each grid point completes, a rich per-point summary table with the full stage breakdown (retrieval / reranking / LLM TTFT) with percent-of-TTFT bars, citation count and relevance score, the inferred bottleneck, plus the load-test block (TTFT / E2E / throughput / error rate).
After all points finish, a side-by-side comparison table auto-labelled by whichever axis varied (concurrency / vdb_top_k / reranker_top_k / iter), and a one-liner identifying the optimal-throughput point.

The same data is also persisted under output.dir/run_<ts>/: see Output layout: so terminal scrollback is not load-bearing.

Query inputs#

rag-perf needs a stream of queries to drive at the RAG server. The input block in the YAML chooses where they come from. Set exactly one of input.file or input.synthetic: they are mutually exclusive and validation fails if both are present. When neither is set, synthetic is auto-filled with defaults so a bare config still validates.

File-based queries#

Set input.file to a path. The format is auto-detected from the extension:

.jsonl: one JSON object per line. Each object must have a query key. Any field also defined under rag.* or generation.* becomes a per-query override (so a single file can mix multiple collections, top-K values, max-token caps, and so on):
```
{"query": "What was NVIDIA revenue?", "collection_names": ["finance"]}
{"query": "Summarize the 10-K risks.", "vdb_top_k": 50}
{"query": "Show me chart-heavy pages.", "max_tokens": 1024}
```
.csv: must have a query column. Other columns whose names match rag.* or generation.* field names become per-query overrides; CSV cell values are JSON-parsed when possible (so a cell like ["finance"] is interpreted as a list, not a string).

If the requested number of requests exceeds the file’s row count, rag-perf re-uses queries according to input.sampling:

random (default): random with replacement.
sequential: cycle through in order.
shuffle-once: shuffle once, then cycle.

input.seed (default 42) makes sampling reproducible across runs.

A small example file ships at scripts/rag-perf/examples/queries.jsonl; the three preset configs all point at it by default.

Synthetic queries (LLM-generated)#

When input.synthetic is set, rag-perf calls an OpenAI-compatible chat-completions endpoint before the benchmark to generate synthetic.num_queries queries, writes them to synthetic.jsonl_output_path, and then runs the benchmark from that JSONL: so the run is reproducible even though the queries were generated.

Two modes:

random: the LLM generates queries from scratch using the prompt templates in synthetic.prompts_file (or the bundled defaults at scripts/rag-perf/prompts/default_prompts.yaml if unset). Useful when you do not have a query corpus yet but want plausible questions to drive load against your collection.
dataset_based: the LLM is seeded with reference questions from a JSON dataset and asked to produce variations. Set either synthetic.dataset_file (explicit path) or synthetic.dataset_name (auto-lookup under ./datasets/<name>/train.json and so on). Validation fails if neither is set in dataset_based mode.

Key knobs:

Field	Purpose
`synthetic.num_queries`	How many distinct queries to generate. The query list is cycled if `total_requests` exceeds it.
`synthetic.min_query_tokens`	Approximate minimum token count per generated query. Combined with `generation.min_tokens == generation.max_tokens` and `generation.ignore_eos: true`, this lets you pin exact input/output token lengths for like-for-like comparisons.
`synthetic.generation_concurrency`	Parallel LLM calls during generation (default `8`). Each completed query is streamed to disk under a write lock: a mid-generation failure preserves everything finished so far. Raise for fast endpoints, lower for rate-limited ones.
`synthetic.temperature`	Sampling temperature for the generator LLM (default `0.9`).
`synthetic.disable_thinking`	Default `true`. Injects `chat_template_kwargs: {enable_thinking: false}` into the request so reasoning models (Nemotron Omni, Qwen-Reasoning, …) skip chain-of-thought and return the question in `content`; otherwise CoT can exhaust the token budget and leave `content` empty. Set `false` only for non-reasoning endpoints.
`synthetic.extra_body`	Escape hatch: arbitrary keys merged into the LLM request body, for example `{top_p: 0.95, presence_penalty: 0.5, response_format: {type: json_object}}`. Merged after `disable_thinking`, so explicit keys win.
`synthetic.llm_url`	OpenAI-compatible endpoint used for generation. Typically the same NIM the RAG server proxies, but it can be any chat-completions endpoint.
`synthetic.llm_model`	Model name passed to the endpoint. Empty string → auto-discover from `/v1/models`.
`synthetic.prompts_file`	Custom YAML of prompt templates. `null` falls back to the bundled `prompts/default_prompts.yaml`.
`synthetic.jsonl_output_path`	Where the generated queries land. Re-running with the same path overwrites it.

Because generated queries are persisted to disk, you can flip a synthetic-driven config to a file-driven one for subsequent runs by replacing the synthetic block with file: <jsonl_output_path>: useful for keeping the load identical while iterating on retrieval or reranker tuning.

Configuration reference#

Configuration is a YAML file validated by rag_perf.config.RunConfig (Pydantic v2). Top-level sections:

Top-level#

Field	Type	Default	Purpose
`target`	object	:	Where the RAG server lives.
`aiperf`	object	:	Whether to run the aiperf load-test phase.
`load`	object	:	Load-generation parameters.
`rag`	object	:	RAG-specific request body fields forwarded to `/v1/generate`.
`generation`	object	:	LLM generation parameters (max_tokens, temperature, …).
`input`	object	:	Where queries come from.
`output`	object	:	Output directory and formats.
`model_name`	string	`nvidia/nemotron-3-super-120b-a12b`	Model name passed to aiperf using `-m`. Should match `APP_LLM_MODELNAME`.
`tokenizer`	string	`""`	Optional HuggingFace tokenizer ID for token counting; empty = use server-reported counts.

`target`#

Field	Type	Default	Purpose
`url`	string	`http://localhost:8081`	Base URL of the RAG server.
`timeout_s`	int	`300`	Per-request wall-clock timeout in seconds.

`aiperf`#

Field	Type	Default	Purpose
`enabled`	bool	`true`	When `false`, skip the aiperf load test (profile-only mode).

`load`#

Field	Type	Default	Purpose
`mode`	`concurrency` \| `request_rate`	`concurrency`	Load-generation strategy.
`concurrency`	int \| list[int]	`8`	Scalar = single point. List = sweep across that axis.
`request_rate`	float \| null	`null`	Target req/s when `mode=request_rate`.
`warmup_requests`	int	`10`	Requests sent (and discarded) before measurement.
`total_requests`	int	`200`	Total measured requests per point.
`duration_s`	float \| null	`null`	If set, run by wall-clock duration instead of request count.
`profile_requests`	int	`20`	Number of requests in the server-side profiling pass that runs before aiperf.
`iterations`	int	`1`	Repeat the full grid this many times. Each repetition writes to its own `iter_<i>/` subdirectory.
`sleep_between_points_s`	int	`0`	Seconds to sleep between grid points so the server can drain. `60` matches the blueprint pipeline’s default.

`rag`#

Forwarded verbatim to POST /v1/generate. Any field can be overridden per-query (see Query inputs).

Field	Type	Default	Purpose
`collection_names`	list[string]	`["default"]`	Vector-DB collection(s) to search.
`vdb_top_k`	int \| list[int] (each 1–400)	`100`	Chunks retrieved from the vector DB before reranking. Scalar = single value, list = sweep axis.
`reranker_top_k`	int \| list[int] (each 1–25)	`10`	Chunks passed to the LLM after reranking. Scalar = single value, list = sweep axis.
`enable_reranker`	bool	`true`	Whether to run the reranker stage.
`enable_citations`	bool	`true`	Whether the server returns citation chunks.
`use_knowledge_base`	bool	`true`	False = skip retrieval, send query bare to the LLM.
`confidence_threshold`	float (0–1)	`0.0`	Minimum relevance score for retrieved chunks.

`generation`#

Field	Type	Default	Purpose
`max_tokens`	int	`512`	Maximum output tokens.
`min_tokens`	int \| null	`null`	Minimum output tokens. Set equal to `max_tokens` to pin output length.
`ignore_eos`	bool	`false`	Pass `ignore_eos:true` to the inference backend; combine with `min_tokens` for fixed output length.
`temperature`	float (0–2)	`0.0`	Sampling temperature.

`input`#

Set exactly one of file or synthetic (mutually exclusive). When both are unset, synthetic is auto-filled with defaults so a bare config still validates.

Field	Type	Default	Purpose
`file`	string \| null	`null`	Path to a query file. Format auto-detected from extension: `.jsonl` (one JSON object per line with a `query` key) or `.csv` (must have a `query` column). Mutually exclusive with `synthetic`.
`synthetic`	object \| null	`null` (auto-filled when `file` is also null)	LLM-generated synthetic queries. Fields below. Mutually exclusive with `file`.
`sampling`	string	`random`	Sampling strategy for the query list (`random`, `sequential`, `shuffle-once`).
`seed`	int	`42`	RNG seed for reproducible sampling.
`synthetic.mode`	`random` \| `dataset_based`	`random`	LLM-based generation strategy.
`synthetic.num_queries`	int	`50`	Distinct synthetic queries to generate.
`synthetic.min_query_tokens`	int	`50`	Approximate min token count per generated query.
`synthetic.llm_url`	string	`http://localhost:8999/v1/chat/completions`	OpenAI-compatible endpoint for generation.
`synthetic.llm_model`	string	`""`	Model name; empty = auto-discover from `/v1/models`.
`synthetic.prompts_file`	string \| null	`null`	Custom prompt templates YAML; null = use bundled defaults.
`synthetic.jsonl_output_path`	string	`./rag-perf-synthetic-queries.jsonl`	Where generated queries are written.
`synthetic.dataset_file`	string \| null	`null`	Explicit dataset file (required for `dataset_based`).
`synthetic.dataset_name`	string \| null	`null`	Dataset name for auto-lookup under `./datasets/`.

`output`#

Field	Type	Default	Purpose
`dir`	string	`./rag-perf-results`	Root output directory; a timestamped subdir is created per run.
`formats`	list[string]	`[json, csv]`	Export formats: `json`, `csv`, `jsonl_raw`.
`markdown_report`	bool	`true`	Write a Markdown summary (`report.md`).
`save_responses`	bool	`false`	Persist full generated text per request (large).
`cluster`	string	`""`	Cluster identifier stamped into artifact dir names.
`gpu`	string	`""`	GPU type stamped into artifact dir names.
`experiment_name`	string	`""`	Experiment label stamped into artifact dir names.

CLI surface#

The CLI is intentionally minimal. The YAML is the single source of truth for behaviour; to vary a parameter, edit the file or copy it to a new one. This keeps every run reproducible from a single artefact you can commit to version control.

Flag	Purpose
`-c / --config FILE`	Required. Path to the YAML config.
`--help`	Show usage and exit.
`--version`	Print the rag-perf version and exit.

Output layout#

Every invocation creates a fresh timestamped directory output.dir/run_<ts>/. The contents depend on the run shape:

Single point + aiperf.enabled=true: flat layout:

run_<ts>/{report.md, results.csv, results.json, profiling/, aiperf_rag_on/}

Single point + aiperf.enabled=false: flat, profile-only layout:

run_<ts>/{profile_report.md, profile_results.json, profiling/}

Multiple points or load.iterations > 1: nested layout:

run_<ts>/
├── report.md, results.csv, results.json     # aggregate, one row per point
└── iter_<i>/
    └── CR:<c>_ISL:<i>_OSL:<o>_VDB-K:<v>_RERANKER-K:<r>_Model:<m>[_Cluster:<c>][_GPU:<g>][_Experiment:<e>]/
        ├── profiling/
        └── aiperf_rag_on/

Source layout#

scripts/rag-perf/
├── pyproject.toml
├── configs/
│   ├── quick_profile.yaml      # profile-only preset
│   ├── single_run.yaml         # one concurrency, full report
│   └── sweep.yaml              # multi-axis sweep (concurrency / vdb_top_k / reranker_top_k as scalar or list)
├── examples/queries.jsonl       # sample query JSONL
├── prompts/default_prompts.yaml # synthetic-query prompt templates
└── rag_perf/
    ├── __init__.py             # public API re-exports
    ├── __main__.py             # python -m rag_perf entry point
    ├── config.py               # RunConfig + sub-models + the three enums
    ├── query.py                # QueryLoader + SyntheticQueryGenerator
    ├── runner.py               # RagProfiler + AiperfRunner + BenchmarkRunner.run
    ├── reporting.py            # MetricsAggregator + Reporter + result dataclasses
    ├── cli.py                  # Single Click command + startup banner
    └── plugin/                 # aiperf endpoint plugin (nvidia_rag)

Unit tests live under tests/unit/test_rag_perf/ (run with uv run --project scripts/rag-perf python -m pytest tests/unit/test_rag_perf/).