Comprehensive LLM Benchmarking | NVIDIA AIPerf Documentation

Presentation Date: November 13, 2025
Updated: February 2, 2026
Tool: AIPerf v0.5.0 | Architecture Overview | Full Documentation

Setup: Installing AIPerf 0.5.0

$ pip install aiperf

Key Features in 0.5.0:

✅ Server-side metrics collection via Prometheus
✅ Automatic plot generation (aiperf plot command)
✅ KV cache efficiency testing with trace synthesis
✅ User-centric timing mode for multi-turn KV cache TTL testing
✅ Goodput analysis for SLA compliance measurement
✅ Time-sliced analysis for performance trends over time

📚 Documentation: See the full CLI Options Reference for all available parameters.

Test Endpoint Details

Note: This was a demo endpoint used for the November 13, 2025 presentation. The cluster has been taken down.

Model: Qwen3-0.6B (Qwen/Qwen3-0.6B)
Inference Engine: vLLM v0.11.0
Architecture: 8-way data parallelism (8 independent vLLM replicas)
Hardware: 8x NVIDIA H200 GPUs (1 GPU per replica)
Deployment: Kubernetes on Nebius Cloud

Why this endpoint was chosen for the demo:

Small model (~600M parameters) = high throughput for benchmarking
8 replicas = demonstrated horizontal scaling
Public access = allowed live demonstration

Follow along locally: You can run a single vLLM replica to try the commands in this guide (results will differ from the multi-replica setup above):

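$ # -d runs the container in the background; the extra served name matches --model qwen3-0.6b used below
$ docker run -d --gpus all -p 8000:8000 vllm/vllm-openai:latest \
>   --model Qwen/Qwen3-0.6B \
>   --served-model-name qwen3-0.6b Qwen/Qwen3-0.6B \
>   --host 0.0.0.0 --port 8000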
$ 
$ # Wait for the server to be ready
$ timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }
$ 
$ export ENDPOINT_URL=localhost:8000

Use Case 1: Simple Profiling with Static ISL/OSL

Goal: Measure baseline performance under controlled load

Command

$ aiperf profile \
>   --model qwen3-0.6b \
>   --url $ENDPOINT_URL \
>   --endpoint-type chat \
>   --streaming \
>   --concurrency 100 \
>   --request-count 1000 \
>   --isl 1000 \
>   --osl 500 \
>   --tokenizer Qwen/Qwen3-0.6B

Parameters Explained

Arg	Value	Purpose
`--model`	`qwen3-0.6b`	Model identifier (matches endpoint)
`--url`	`$ENDPOINT_URL`	Target inference server
`--endpoint-type`	`chat`	OpenAI chat completions API
`--streaming`	(flag)	Enable token streaming
`--concurrency`	`100`	Simultaneous connections
`--request-count`	`1000`	Total requests to send
`--isl`	`1000`	Input tokens per request
`--osl`	`500`	Output tokens per response
`--tokenizer`	`Qwen/Qwen3-0.6B`	HuggingFace tokenizer for accuracy

Key Insight: This creates 100 “virtual users” sending 1,000 requests total with large payloads (1000→500 tokens).

Results

                          NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ Metric                  ┃      avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p50 ┃   std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ Time to First Token (ms)│   347.15 │ 204.55 │1,052.66│  815.02│  577.05│  289.49│ 143.57│
│ Request Latency (ms)    │ 2,101.75 │ 693.08 │4,770.98│3,613.75│2,319.79│2,057.50│ 303.17│
│ Inter Token Latency (ms)│     3.57 │   1.99 │   8.55 │   5.78 │   3.93 │   3.49 │   0.54│
│ Output Token Throughput │22,521.42 │    N/A │    N/A │    N/A │    N/A │    N/A │    N/A│
│           (tokens/sec)  │          │        │        │        │        │        │       │
│ Request Throughput      │    45.70 │    N/A │    N/A │    N/A │    N/A │    N/A │    N/A│
│      (requests/sec)     │          │        │        │        │        │        │       │
│ Request Count           │ 1,000.00 │    N/A │    N/A │    N/A │    N/A │    N/A │    N/A│
└─────────────────────────┴──────────┴────────┴────────┴────────┴────────┴────────┴───────┘
Benchmark Duration: 21.88 sec
Success Rate: 100% (0 errors)

Key Takeaways

✅ TTFT = 347ms: Fast first token delivery - users see responses quickly
✅ Request Latency = 2.1s: Total time to generate 500 tokens per request
✅ System Throughput = 22.5K tokens/sec: High capacity with 100 concurrent users
✅ ITL = 3.57ms: Smooth, consistent token streaming
✅ P99 Latency = 3.6s: Even worst-case requests complete reasonably fast

What we learned:

With 100 concurrent users and large payloads (1000→500 tokens), the system maintained stable performance
P99 latency (3.6s) is about 72% above the average (2.1s), so the tail stays within roughly 1.7x of a typical request
Zero errors = reliable service under load
22.5K tokens/sec sustained throughput reflects the combined capacity of the 8 replicas
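
The averages in the results table can be cross-checked with two back-of-the-envelope relations: request latency ≈ TTFT + (OSL − 1) × ITL, and Little's law (throughput ≈ concurrency / latency). The sketch below is a rough sanity check, not part of AIPerf; it assumes every response produced exactly 500 tokens, which the run may not have achieved.

```python
# Averages copied from the Use Case 1 results table.
ttft_ms = 347.15
itl_ms = 3.57
request_latency_ms = 2101.75
concurrency = 100
osl = 500  # assumption: every response hit the requested output length

# Latency model: first token, then (OSL - 1) inter-token gaps.
estimated_latency_ms = ttft_ms + (osl - 1) * itl_ms

# Little's law: in-flight requests = throughput * latency.
estimated_rps = concurrency / (request_latency_ms / 1000)
estimated_tps = estimated_rps * osl

print(f"Estimated latency:    {estimated_latency_ms:.0f} ms (measured 2,102 ms)")
print(f"Estimated throughput: {estimated_rps:.1f} req/s (measured 45.70)")
print(f"Estimated output TPS: {estimated_tps:,.0f} (measured 22,521)")
```

Estimates that land within a few percent of the measured values suggest the reported metrics are internally consistent and that the 100 connections stayed busy for most of the run.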

Evolution: Pareto Curve Analysis - Resource Efficiency vs. User Experience

Goal: Understand the trade-off between resource utilization (TPS/GPU) and user experience (TPS/User) at different concurrency levels.

The Experiment

We ran the same benchmark at 5 different concurrency levels (10, 50, 100, 200, 500) to observe how throughput per GPU and throughput per user change:

$ # Run the same benchmark at multiple concurrency levels
$ for c in 10 50 100 200 500; do
>   aiperf profile --model qwen3-0.6b --url "$ENDPOINT_URL" \
>     --endpoint-type chat --streaming --concurrency "$c" \
>     --request-count 1000 --isl 1000 --osl 500 \
>     --tokenizer Qwen/Qwen3-0.6B --artifact-dir "artifacts/pareto-c$c"
> done

Results: The Pareto Curve

Concurrency	Total TPS	TPS/GPU	TPS/User	TTFT (avg)
10	3,045	1,522	364.69	~250 ms
50	12,890	6,445	326.10	~270 ms
100	22,521	11,261	285.03	~347 ms
200	35,999	18,000 ⭐	238.67	~420 ms
500	29,836	14,918	128.85	~1,129 ms

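Hardware: 8 vLLM replicas on 8 H200 GPUs. Note that the TPS/GPU column above equals Total TPS / 2; dividing Total TPS across all 8 GPUs gives roughly 381, 1,611, 2,815, 4,500, and 3,730 TPS/GPU for c=10 through c=500. The relative comparisons between concurrency levels hold either way.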

Visualizing the Trade-off

The Pareto frontier shows the inverse relationship between resource efficiency and user experience:

Point	Concurrency	TPS/GPU	TPS/User	Interpretation
Far Right	c=10	1,522	365	Best user experience, poor GPU utilization
Moving Up-Left	c=50	6,445	326	Trading UX for efficiency
	c=100	11,261	285	Balanced middle ground
Peak ⭐	c=200	18,000	239	Maximum GPU efficiency
Collapse	c=500	14,918	129	Over-saturation degrades both

Key Insight: The Pareto curve demonstrates you cannot optimize both metrics simultaneously. Choose your operating point based on whether you prioritize cost efficiency (c=200) or user experience (c=10-50).
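
To make "Pareto frontier" concrete, the sketch below filters the table down to the operating points that no other point beats on both metrics at once. The values are copied from the table as printed; the helper names `dominates` and `pareto_frontier` are illustrative and not part of AIPerf.

```python
# (concurrency, TPS/GPU, TPS/User) copied from the results table.
points = [
    (10, 1522, 364.69),
    (50, 6445, 326.10),
    (100, 11261, 285.03),
    (200, 18000, 238.67),
    (500, 14918, 128.85),
]

def dominates(a, b):
    """True if a is at least as good as b on both metrics and better on one."""
    return a[1] >= b[1] and a[2] >= b[2] and (a[1] > b[1] or a[2] > b[2])

def pareto_frontier(pts):
    """Keep only the points that no other point dominates."""
    return [p for p in pts if not any(dominates(q, p) for q in pts)]

frontier = pareto_frontier(points)
for c, tps_gpu, tps_user in frontier:
    print(f"c={c}: {tps_gpu} TPS/GPU, {tps_user} TPS/User")
```

With these numbers, c=500 drops out because c=200 is better on both axes, which is why it is labeled "Collapse" rather than treated as a valid trade-off.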

Key Insights from the Pareto Curve

✅ Low Concurrency (10-50):

Poor resource utilization: Only 1,500-6,500 TPS/GPU = GPUs are underutilized
Best user experience: 365 tokens/sec per user = very responsive
Use case: Premium tier, low-latency applications

✅ Medium Concurrency (100-200):

Balanced performance: ~11,000-18,000 TPS/GPU
Good user experience: ~240-285 tokens/sec per user
Sweet spot at c=200: Peak resource utilization (18K TPS/GPU) with acceptable user experience
Use case: General production workloads

❌ High Concurrency (500+):

Degraded resource utilization: TPS/GPU drops from 18K → 15K
Poor user experience: 129 tokens/sec per user, TTFT = 1.1 seconds
Queuing dominates: Request backlog causes both metrics to degrade
Use case: Avoid this region; c=200 delivers both higher TPS/GPU and better per-user throughput

The Business Trade-off

Question: Should you optimize for cost efficiency (max TPS/GPU) or user satisfaction (max TPS/User)?

Priority	Optimal Concurrency	Justification
User Experience	10-50	Sub-300ms TTFT, 325+ tokens/sec/user
Balanced	100-200 ⭐	11K-18K TPS/GPU, 240-285 tokens/sec/user
Cost Efficiency	200	Peak TPS/GPU before degradation

The c=200 “sweet spot”:

12x better resource utilization vs. c=10 (18K vs. 1.5K TPS/GPU)
Only 35% reduction in per-user throughput (239 vs. 365 tokens/sec/user)
TTFT still under 500ms for most requests

What We Learned

🔍 Performance is non-linear: Doubling concurrency doesn’t double throughput
📊 The inverted-U curve: TPS/GPU rises, peaks at c=200, then falls due to queuing overhead
⚖️ No free lunch: Up to the saturation point, higher concurrency = better GPU utilization BUT worse user experience
🎯 Know your SLA: Choose concurrency based on your latency vs. throughput priorities

Pro tip: Run this analysis on YOUR endpoint with YOUR request patterns to find YOUR sweet spot!

Use Case 2: Auditing Raw Results - Custom Percentile Analysis

Scenario: Your management defines SLAs using P75, not the standard P50/P90/P99 that AIPerf reports by default.

Goal: Calculate P75 TTFT from raw benchmark data.

Understanding the Raw Data: profile_export.jsonl

AIPerf outputs detailed per-request data in profile_export.jsonl. Each line is a JSON record. See the Working with Profile Exports tutorial for more analysis techniques.

{
  "metadata": {
    "session_num": 87,
    "x_request_id": "abd8df1a-7904-4aa0-8107-0d74ba0ac0d7",
    "turn_index": 0,
    "request_start_ns": 1763066701865462000,
    "request_end_ns": 1763066703082535666,
    "worker_id": "worker_b431129c",
    "record_processor_id": "record_processor_a1b2c3d4",
    "benchmark_phase": "profiling"
  },
  "metrics": {
    "time_to_first_token": {
      "value": 582.66,
      "unit": "ms"
    },
    "output_token_count": {
      "value": 194,
      "unit": "tokens"
    },
    "request_latency": {
      "value": 1210.008,
      "unit": "ms"
    },
    "input_sequence_length": {
      "value": 1000,
      "unit": "tokens"
    },
    "output_sequence_length": {
      "value": 194,
      "unit": "tokens"
    },
    "inter_token_latency": {
      "value": 3.25,
      "unit": "ms"
    }
  }
}

Key fields: Every request has time_to_first_token, request_latency, ISL, OSL, and more.

Note: The metadata section may contain additional optional fields including was_cancelled, cancellation_time_ns, conversation_id, x_correlation_id, and timing fields like credit_issued_ns and request_ack_ns. The benchmark_phase field is either "warmup" or "profiling".
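
As a small illustration of working with one record, the sketch below parses a trimmed copy of the sample above and derives two values that are not stored directly: the wall-clock span from the nanosecond timestamps, and the time spent streaming after the first token. The variable names are illustrative only.

```python
import json

# Trimmed copy of the sample record shown above.
line = (
    '{"metadata": {"request_start_ns": 1763066701865462000, '
    '"request_end_ns": 1763066703082535666, "benchmark_phase": "profiling"}, '
    '"metrics": {"time_to_first_token": {"value": 582.66, "unit": "ms"}, '
    '"request_latency": {"value": 1210.008, "unit": "ms"}, '
    '"output_sequence_length": {"value": 194, "unit": "tokens"}}}'
)
record = json.loads(line)
meta, metrics = record["metadata"], record["metrics"]

# Wall-clock span: subtract as integers first so no nanoseconds are lost.
span_ms = (meta["request_end_ns"] - meta["request_start_ns"]) / 1e6

# Time spent streaming tokens after the first one arrived.
decode_ms = (
    metrics["request_latency"]["value"]
    - metrics["time_to_first_token"]["value"]
)

# Warmup records are usually excluded from analysis.
is_profiling = meta["benchmark_phase"] == "profiling"

print(f"span={span_ms:.1f} ms, decode={decode_ms:.1f} ms, profiling={is_profiling}")
```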

Calculating P75 TTFT

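import json

import numpy as np

# Collect TTFT from every request that reported one
ttft_values = []
with open("./artifacts/profile_export.jsonl") as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Failed or cancelled requests may have no TTFT metric
        ttft = record.get("metrics", {}).get("time_to_first_token")
        if ttft is not None:
            ttft_values.append(ttft["value"])

# Report the custom percentile (P75) alongside the defaults
print(f"Total requests analyzed: {len(ttft_values)}")
for p in (25, 50, 75, 90, 99):
    print(f"  P{p}: {np.percentile(ttft_values, p):.2f} ms")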

Results from Our Benchmark

============================================================
TTFT Percentile Analysis
============================================================
Total requests analyzed: 1000
Percentiles (ms):
  P25 (25th percentile): 242.45 ms
  P50 (50th percentile): 289.49 ms
  P75 (75th percentile): 422.87 ms  ⭐ YOUR SLA METRIC
  P90 (90th percentile): 577.05 ms
  P99 (99th percentile): 815.02 ms
============================================================

Key Takeaways

✅ P75 = 422.87ms: 75% of requests get first token within this time
✅ Raw data access: Calculate ANY custom metric your org needs
✅ Full transparency: Every request is logged with complete metrics
✅ Easy parsing: Standard JSON format, one record per line

Why this matters:

Different orgs have different SLA definitions
P75 is a common SLA target (balance between typical and worst-case)
AIPerf’s raw exports let you calculate ANY percentile or custom metric
No need to re-run benchmarks for different analysis
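
If NumPy is not available in your analysis environment, a percentile can be computed with the standard library alone. The sketch below uses linear interpolation between ranks, which is intended to match NumPy's default method; the `percentile` helper and the sample values are illustrative, not taken from the benchmark.

```python
import math

def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

sample_ttft_ms = [210.0, 250.0, 290.0, 420.0, 580.0]  # made-up values
print(f"P75: {percentile(sample_ttft_ms, 75):.2f} ms")
```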

Use Case 3: Trace-Based Benchmarking with Mooncake

Goal: Test your system under realistic production workload patterns using privacy-preserving traces.

📚 Documentation: See Benchmark Datasets for supported dataset formats and Trace Replay Mode for detailed configuration.

What is Mooncake Trace Data?

Mooncake is an open-source KV cache sharing system whose authors released real production traces from their arXiv Q&A service. These traces capture actual user behavior including:

Request arrival times
Input/output token lengths
Block hash IDs: Privacy-preserving identifiers for KV cache reuse patterns

Understanding Block Hashing

The Problem: Sharing production traces risks leaking sensitive user data.

Mooncake’s Solution: Hash every 512-token block of input. Users asking about the same document get the same hash IDs, enabling cache reuse analysis without revealing content.

Example: Multi-turn conversation

Turn 1: User uploads paper (7,500 tokens) + question (500 tokens)
├─ Total: 8,000 tokens = 16 blocks
└─ Hash IDs: [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]
Turn 2: Same paper + different question (8,500 tokens)
├─ Total: 8,500 tokens = 17 blocks
├─ Hash IDs: [46-61] (reused!) + [62] (new)
└─ ✅ Cache hit rate: 94% (16/17 blocks reused)
Turn 3: Same paper + another question (9,000 tokens)
├─ Total: 9,000 tokens = 18 blocks
├─ Hash IDs: [46-61] (reused!) + [63, 64] (new)
└─ ✅ Cache hit rate: 89% (16/18 blocks reused)

Key insight: Hash IDs reveal cache reuse opportunities while completely protecting user privacy.
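
The block arithmetic in the example can be reproduced in a few lines. This is a minimal sketch of the idea, not Mooncake's implementation: it models the cache as a set of previously seen hash IDs, and it gives turn 3's new blocks the IDs 63 and 64 as an assumption so that they are distinct from the block turn 2 added.

```python
import math

BLOCK_SIZE = 512  # tokens per hashed block, as described above

def block_count(num_tokens):
    """Number of blocks needed to cover a prompt (the last may be partial)."""
    return math.ceil(num_tokens / BLOCK_SIZE)

def hit_rate(hash_ids, cache):
    """Fraction of a request's blocks whose hash IDs are already cached."""
    return sum(1 for h in hash_ids if h in cache) / len(hash_ids)

shared = list(range(46, 62))  # IDs 46-61: the 16 blocks every turn reuses
turns = [
    (8000, shared),             # turn 1
    (8500, shared + [62]),      # turn 2: one new block
    (9000, shared + [63, 64]),  # turn 3: two new blocks (IDs assumed)
]

cache = set()
for tokens, hash_ids in turns:
    assert block_count(tokens) == len(hash_ids)
    rate = hit_rate(hash_ids, cache)
    print(f"{tokens} tokens -> {len(hash_ids)} blocks, hit rate {rate:.0%}")
    cache.update(hash_ids)
```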

The Mooncake arXiv Trace Dataset

======================================================================
MOONCAKE ARXIV TRACE - DATASET CHARACTERISTICS
======================================================================
📊 OVERALL STATISTICS
  Total Requests: 23,608
  Duration: 60.0 minutes (3,600 seconds)
  Avg Request Rate: 393.5 requests/minute
📏 TOKEN DISTRIBUTION (Input + Output)
  Mean: 8,772 tokens
  Median: 6,402 tokens
  P25: 3,331 tokens  |  P75: 7,562 tokens
  P90: 17,140 tokens |  P99: 61,961 tokens
  Max: 125,878 tokens
📊 REQUEST SIZE DISTRIBUTION
  Token Range          | Count  | Percentage | Visualization
  ──────────────────────────────────────────────────────────
       0 -  5,000   | 7,632 |  32.3%    | ████████████████
   5,000 - 10,000   | 11,626|  49.2%    | ████████████████████████
  10,000 - 20,000   | 2,499 |  10.6%    | █████
  20,000 - 40,000   | 1,325 |   5.6%    | ██
  40,000 - 60,000   |   272 |   1.2%    |
  60,000 - 80,000   |   135 |   0.6%    |
  80,000 - 100,000  |    65 |   0.3%    |
  100,000+          |    54 |   0.2%    |
⏱️  REQUEST ARRIVAL PATTERN (5-minute windows)
  Time Window         | Requests | Load Pattern
  ───────────────────────────────────────────────────────
    0 -   4 min     | 1,765    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
    5 -   9 min     | 1,657    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   10 -  14 min     | 1,875    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   15 -  19 min     | 1,860    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   20 -  24 min     | 1,992    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   25 -  29 min     | 2,010    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   30 -  34 min     | 2,012    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   35 -  39 min     | 2,063    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   40 -  44 min     | 2,133    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   45 -  49 min     | 2,026    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   50 -  54 min     | 2,125    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   55 -  59 min     | 1,680    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
======================================================================

Key characteristics of real production traffic:

✅ Highly Variable Request Sizes: 49% of requests are 5K-10K tokens, but tail extends to 125K
✅ Long-Context Dominant: Median of 6,402 tokens vs. typical benchmarks using 1K-2K
✅ Consistent Load: ~393 requests/minute with relatively steady arrival rate
✅ Heavy Tail Distribution: 2% of requests exceed 40K tokens (production reality!)

This represents real-world patterns you won’t get from synthetic benchmarks:

Multi-turn conversations (shared hash IDs across requests)
Variable request sizes (not uniform 1K/500 like Use Case 1)
Realistic timing (actual production arrival patterns)
Long-context queries that stress-test model limits
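
Statistics like the ones above can be recomputed from the trace file itself. The sketch below counts requests per 5-minute arrival window and per size bucket. It assumes each trace line is a JSON object with `timestamp` (milliseconds), `input_length`, `output_length`, and `hash_ids` fields; check the file you downloaded before relying on those names. The two sample lines are made up.

```python
import json
from collections import Counter

def summarize_trace(lines, window_ms=5 * 60 * 1000):
    """Count requests per arrival window and per total-token bucket."""
    arrivals, sizes = Counter(), Counter()
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        total_tokens = rec["input_length"] + rec["output_length"]
        arrivals[rec["timestamp"] // window_ms] += 1
        sizes["<10K" if total_tokens < 10_000 else ">=10K"] += 1
    return arrivals, sizes

# Made-up sample lines in the assumed trace format.
sample = [
    '{"timestamp": 0, "input_length": 6000, "output_length": 500, "hash_ids": [0, 1]}',
    '{"timestamp": 310000, "input_length": 12000, "output_length": 100, "hash_ids": [0, 2]}',
]
arrivals, sizes = summarize_trace(sample)
print(dict(arrivals), dict(sizes))
```

To run it on the real trace, pass an open file handle instead of `sample`, since iterating a file yields one line at a time.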

Running a Trace-Based Benchmark

$ # Download the Mooncake trace
$ curl -o mooncake_trace.jsonl https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/arxiv-trace/mooncake_trace.jsonl
$ 
$ # Option 1: Replay with original timing (for end-to-end system testing)
$ aiperf profile \
>   --model qwen3-0.6b \
>   --url $ENDPOINT_URL \
>   --endpoint-type chat \
>   --streaming \
>   --input-file mooncake_trace.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --fixed-schedule \
>   --tokenizer Qwen/Qwen3-0.6B
$ 
$ # Option 2: Replay as fast as possible (for capacity testing)
$ aiperf profile \
>   --model qwen3-0.6b \
>   --url $ENDPOINT_URL \
>   --endpoint-type chat \
>   --streaming \
>   --input-file mooncake_trace.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --tokenizer Qwen/Qwen3-0.6B

Key Differences from Synthetic Benchmarks

Aspect	Use Case 1 (Synthetic)	Use Case 3 (Trace-Based)
Request Pattern	Uniform (all 1000→500)	Variable (up to ~126K tokens per request)
Arrival Timing	Constant concurrency	Bursty, realistic timing
KV Cache	No reuse patterns	Real cache-sharing patterns
Use Case	Steady-state capacity	Production validation

📚 Documentation: See the Timing Modes Reference for all supported timing modes.

Why Trace-Based Benchmarking Matters

✅ Realistic Load Testing: Test how your system handles actual production patterns, not idealized synthetic load
✅ KV Cache Validation: If you implement cache sharing (like Mooncake), trace data shows real hit rates
✅ Capacity Planning: See performance under bursty traffic with variable request sizes
✅ Privacy-Preserving: Hash-based traces enable sharing without exposing sensitive data

Pro tip: Use --fixed-schedule for end-to-end system validation (respects timing), or remove it to stress-test maximum throughput capacity. See the Fixed Schedule Tutorial for more details.

Real Benchmark Results: 5-Minute Mooncake Trace (5x Speed)

We extracted the first 5 minutes of the Mooncake trace (1,765 requests) and sped it up 5x to replay in ~1 minute:

$ # Replay the subset (first 5 minutes, sped up 5x); mooncake_trace_5min_5x.jsonl must already exist
$ aiperf profile \
>   --model qwen3-0.6b \
>   --url $ENDPOINT_URL \
>   --endpoint-type chat \
>   --streaming \
>   --input-file mooncake_trace_5min_5x.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --fixed-schedule \
>   --tokenizer Qwen/Qwen3-0.6B
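
The command above reads mooncake_trace_5min_5x.jsonl, which is not part of the original download. One way to produce such a file is sketched below; it is an illustration rather than the script used for the presentation. It assumes each trace line carries a `timestamp` field in milliseconds, keeps requests from the first 5 minutes, and divides their timestamps by 5.

```python
import json

def make_subset(lines, keep_ms=5 * 60 * 1000, speedup=5):
    """Keep requests arriving before keep_ms and compress their timestamps."""
    out = []
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec["timestamp"] >= keep_ms:
            continue  # outside the first 5 minutes
        rec["timestamp"] = rec["timestamp"] // speedup
        out.append(json.dumps(rec))
    return out

# Made-up sample lines in the assumed trace format.
sample = [
    '{"timestamp": 0, "input_length": 6000, "output_length": 500, "hash_ids": [0]}',
    '{"timestamp": 299000, "input_length": 900, "output_length": 20, "hash_ids": [1]}',
    '{"timestamp": 300000, "input_length": 4000, "output_length": 80, "hash_ids": [2]}',
]
subset = make_subset(sample)
print(len(subset), "requests kept")

# To convert the real file:
# with open("mooncake_trace.jsonl") as src, \
#         open("mooncake_trace_5min_5x.jsonl", "w") as dst:
#     dst.write("\n".join(make_subset(src)) + "\n")
```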

Results:

                          NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ Metric                ┃     avg ┃    min ┃     max ┃     p99 ┃     p90 ┃    p50 ┃     std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ Time to First Token   │  407.42 │ 212.68 │ 1,519.5 │  951.16 │  586.01 │ 370.20 │  150.12 │
│              (ms)     │         │        │         │         │         │        │         │
│ Request Latency (ms)  │ 1,171.0 │ 243.14 │ 6,665.7 │ 4,184.4 │ 2,615.9 │ 648.33 │  978.09 │
│ Inter Token Latency   │    5.97 │   0.00 │   88.31 │   17.88 │   10.72 │   4.54 │    5.46 │
│              (ms)     │         │        │         │         │         │        │         │
│ Output Sequence Length│  175.27 │   1.00 │ 1,165.0 │  761.65 │  510.00 │  28.00 │  220.30 │
│          (tokens)     │         │        │         │         │         │        │         │
│ Input Sequence Length │ 7,243.0 │ 890.00 │32,236.0 │27,260.0 │15,157.0 │6,344.0 │ 5,536.0 │
│          (tokens)     │         │        │         │         │         │        │         │
│ Output Token          │ 4,675.0 │    N/A │     N/A │     N/A │     N/A │    N/A │     N/A │
│ Throughput (tok/sec)  │         │        │         │         │         │        │         │
│ Request Throughput    │   26.68 │    N/A │     N/A │     N/A │     N/A │    N/A │     N/A │
│      (requests/sec)   │         │        │         │         │         │        │         │
│ Request Count         │ 1,690   │    N/A │     N/A │     N/A │     N/A │    N/A │     N/A │
│      (successful)     │         │        │         │         │         │        │         │
└───────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┴────────┴─────────┘
Benchmark Duration: 63.35 sec
Success Rate: 96% (75 requests exceeded 32K context window)

Key Observations from Trace-Based Testing

✅ Highly Variable Request Sizes:

Input: 890→32,236 tokens (36x range!)
Output: 1→1,165 tokens
Median input: 6,344 tokens (much larger than our synthetic 1K)

✅ Performance Under Real Load:

TTFT = 407ms average despite 7K+ token average inputs (median 6.3K)
System handled 4,675 tokens/sec with bursty, variable traffic
P99 TTFT = 951ms (some large requests took longer, as expected)

✅ Realistic Failures:

75 requests (4%) exceeded Qwen3-0.6B’s 32K context limit
This reveals a real operational constraint you’d miss with synthetic tests
Production insight: Need longer-context model or request filtering

✅ Production Timing Patterns:

Trace shows realistic request bursts and lulls
Not constant load like --concurrency 100
More representative of actual user traffic patterns

What we learned from trace-based vs. synthetic testing:

Use Case 1 (synthetic): 100% success, uniform 1K→500 tokens, 22.5K TPS
Use Case 3 (trace): 96% success, variable 890→32K input tokens, 4.7K TPS, revealed context window issues

Trace-based testing exposes real-world challenges that synthetic benchmarks hide!

Use Case 4: Goodput Analysis - Measuring SLA Compliance

Goal: Measure what percentage of requests meet your defined Service Level Objectives (SLOs), not just average performance.

📚 Documentation: See the Goodput Tutorial for additional examples and SLO configuration options.

What is Goodput?

Goodput = The rate of requests (requests/sec) that meet ALL specified SLO thresholds. Dividing goodput by total request throughput gives the fraction of compliant requests.

Why it matters:

Throughput tells you how many requests/sec your system handles
Goodput tells you how many requests/sec deliver acceptable user experience
A system can have high throughput but low goodput if most requests miss SLAs!

Definition (from DistServe paper):

“Goodput measures the number of requests per second that meet specified service-level objectives (SLOs), providing a metric that directly reflects user-perceived quality of service.”

Real-World Example: Why Goodput > Throughput

Imagine two systems serving 1000 requests/min:

System A: 950 requests under SLA, 50 requests timeout → 95% goodput
System B: 500 requests under SLA, 500 requests slow → 50% goodput

Both have the same throughput, but System A delivers acceptable service to nearly 2x as many requests (950 vs. 500)!

Running Goodput Analysis

We’ll use the same Mooncake trace, but add SLO thresholds:

$ # Define SLA thresholds based on your business requirements
$ # Example: TTFT ≤ 370ms, Request Latency ≤ 648ms
$ 
$ aiperf profile \
>   --model qwen3-0.6b \
>   --url $ENDPOINT_URL \
>   --endpoint-type chat \
>   --streaming \
>   --input-file mooncake_trace_5min_5x.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --fixed-schedule \
>   --tokenizer Qwen/Qwen3-0.6B \
>   --goodput "time_to_first_token:370 request_latency:648"

Goodput Results

                          NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ Metric                ┃     avg ┃    min ┃     max ┃     p99 ┃     p90 ┃    p50 ┃     std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ Time to First Token   │  428.86 │ 209.96 │ 1,651.8 │ 1,109.7 │  649.21 │ 385.29 │  176.32 │
│              (ms)     │         │        │         │         │         │        │         │
│ Request Latency (ms)  │ 1,208.9 │ 229.80 │ 6,280.6 │ 4,350.7 │ 2,726.4 │ 691.07 │ 1,005.5 │
│ Request Throughput    │   26.67 │    N/A │     N/A │     N/A │     N/A │    N/A │     N/A │
│      (requests/sec)   │         │        │         │         │         │        │         │
│ Goodput               │    7.43 │    N/A │     N/A │     N/A │     N/A │    N/A │     N/A │  ⭐
│ (requests/sec)        │         │        │         │         │         │        │         │
└───────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┴────────┴─────────┘
Benchmark Duration: 63.37 sec
Success Rate: 96% (75 requests exceeded 32K context window)

Key Insights from Goodput Analysis

Goodput vs. Throughput:

Total Throughput: 26.67 requests/sec (100%)
Goodput:           7.43 requests/sec (28%)  ⚠️
────────────────────────────────────────────
Only 28% of requests met BOTH SLO requirements!

Understanding the results:

SLO Thresholds: TTFT ≤ 370ms AND Request Latency ≤ 648ms
Average TTFT: 429ms (above threshold)
Median Latency: 691ms (above threshold)
72% of requests failed to meet at least one SLO
Raw throughput doesn’t reveal this user experience gap!

How SLO compliance works:

Requests must meet ALL SLO criteria to count toward goodput
A request with TTFT=350ms but latency=700ms fails (missed latency SLO)
A request with TTFT=400ms but latency=600ms fails (missed TTFT SLO)
Only requests with TTFT≤370ms AND latency≤648ms count as goodput
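
The all-or-nothing rule can be expressed in a few lines. This is a minimal sketch of the logic described above, not AIPerf's internal implementation; `meets_slos` and `goodput` are illustrative names, and the three requests mirror the examples in the list.

```python
SLOS_MS = {"time_to_first_token": 370, "request_latency": 648}

def meets_slos(request, slos=SLOS_MS):
    """A request counts only if every SLO metric is at or under its limit."""
    return all(request[name] <= limit for name, limit in slos.items())

def goodput(requests, duration_s, slos=SLOS_MS):
    """SLO-compliant requests per second over the benchmark duration."""
    return sum(meets_slos(r, slos) for r in requests) / duration_s

requests = [
    {"time_to_first_token": 350, "request_latency": 700},  # misses latency SLO
    {"time_to_first_token": 400, "request_latency": 600},  # misses TTFT SLO
    {"time_to_first_token": 300, "request_latency": 600},  # meets both
]
for r in requests:
    print(r, "->", "good" if meets_slos(r) else "bad")
print(f"Goodput: {goodput(requests, duration_s=1.0):.2f} req/s")
```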

What Goodput Tells You That Metrics Don’t

Metric	What It Measures	What It Misses
Average TTFT	Typical first token delay	Tail latency, SLA violations
P99 Latency	Worst-case performance	Overall SLA compliance rate
Throughput	System capacity	User experience quality
Goodput ⭐	Rate of requests meeting all SLOs	How far failing requests miss by (pair with percentiles)

Using Goodput for Capacity Planning

Question: How many servers do I need to deliver 1000 SLO-compliant req/sec?

Without goodput analysis:

Measure throughput: 26.67 req/sec per server
Calculate: 1000 / 26.67 = 38 servers
Problem: This assumes all requests meet SLAs! ❌

With goodput analysis:

Measure goodput: 7.43 req/sec per server (28% of throughput)
Calculate: 1000 / 7.43 = 135 servers
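Reality: Need ~3.5x the capacity to meet SLAs ✅

The cost of ignoring goodput: sizing on throughput alone provisions 38 servers where 135 are needed, a shortfall of about 72%. This is a linear estimate that assumes goodput per server stays constant as servers are added; re-measure goodput at the planned per-server load before committing.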
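
The sizing arithmetic above, written out as a short sketch. It uses the per-server figures from the goodput results table and assumes capacity scales linearly with server count, which is a simplification.

```python
import math

target_rps = 1000
throughput_per_server = 26.67  # req/s, from the goodput results table
goodput_per_server = 7.43      # req/s meeting both SLOs

naive_servers = math.ceil(target_rps / throughput_per_server)
slo_aware_servers = math.ceil(target_rps / goodput_per_server)

# Share of the required capacity that the naive estimate leaves out.
shortfall = 1 - naive_servers / slo_aware_servers

print(f"Throughput-based estimate: {naive_servers} servers")
print(f"Goodput-based estimate:    {slo_aware_servers} servers")
print(f"Capacity ratio: {slo_aware_servers / naive_servers:.2f}x, "
      f"shortfall: {shortfall:.0%}")
```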

Adjusting SLOs for Your Business

Different use cases need different SLOs:

$ # Strict SLOs (premium tier)
$ --goodput "time_to_first_token:250 request_latency:500"
$ 
$ # Balanced SLOs (standard tier)
$ --goodput "time_to_first_token:370 request_latency:648"
$ 
$ # Relaxed SLOs (batch processing)
$ --goodput "time_to_first_token:600 request_latency:2500"

Pro tip: Set SLO thresholds based on your business requirements, then use goodput to measure compliance and plan capacity accordingly.

Use Case 5: Time-Sliced Analysis - Performance Over Time

Goal: Understand how performance metrics evolve during a benchmark to detect warm-up effects, degradation patterns, or load-dependent behavior.

📚 Documentation: See the Timeslices Tutorial for configuration options and the Warmup Tutorial for managing cold-start effects.

What is Time-Slicing?

Time-slicing divides your benchmark into sequential time windows, computing metrics independently for each window.

Why it matters:

Detect warm-up effects: Identify cold-start latency vs. steady-state performance
Spot degradation: Find memory leaks or resource exhaustion over time
Understand load patterns: See how performance changes as traffic evolves
Validate SLAs over time: Ensure consistent performance, not just averages
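
Conceptually, time-slicing is a group-by on request start time. The sketch below buckets made-up (start time, TTFT) pairs into 10-second windows and averages each window. It illustrates the idea only; AIPerf's own slicing may assign requests to windows differently, and the `time_slices` helper is hypothetical.

```python
from collections import defaultdict

NS_PER_S = 1_000_000_000

def time_slices(records, slice_s=10):
    """Group (start_ns, ttft_ms) pairs into windows; return avg TTFT per window."""
    if not records:
        return {}
    t0 = min(start for start, _ in records)
    buckets = defaultdict(list)
    for start_ns, ttft_ms in records:
        index = (start_ns - t0) // (slice_s * NS_PER_S)
        buckets[index].append(ttft_ms)
    return {i: sum(v) / len(v) for i, v in sorted(buckets.items())}

# Made-up records: (request_start_ns, time_to_first_token in ms).
records = [
    (0, 540.0),
    (4 * NS_PER_S, 550.0),   # slice 0
    (12 * NS_PER_S, 380.0),  # slice 1
    (25 * NS_PER_S, 370.0),
    (29 * NS_PER_S, 390.0),  # slice 2
]
for index, avg in time_slices(records).items():
    print(f"slice {index}: avg TTFT {avg:.0f} ms")
```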

Running Time-Sliced Analysis

We’ll use the same Mooncake trace with 10-second time slices:

$ aiperf profile \
>   --model qwen3-0.6b \
>   --url $ENDPOINT_URL \
>   --endpoint-type chat \
>   --streaming \
>   --input-file mooncake_trace_5min_5x.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --fixed-schedule \
>   --tokenizer Qwen/Qwen3-0.6B \
>   --slice-duration 10

Output: AIPerf generates additional files:

profile_export_aiperf_timeslices.csv - Time-series data in tidy format
profile_export_aiperf_timeslices.json - Hierarchical time-series data

Time-Sliced Results

==========================================================================================
TIME-SLICED PERFORMANCE ANALYSIS (10-second slices)
==========================================================================================
Slice |  Time   | Requests | TTFT (ms) | Latency (ms) | Throughput
  #   | Window  |  Count   |  avg (p90)|  avg  (p90)  | (tokens/s)
------------------------------------------------------------------------------------------
  0   |  0-10s  |      111 |   545 (  900) |  1516 ( 3217) |       3203
  1   | 10-20s  |      223 |   381 (  560) |  1050 ( 2300) |       3027
  2   | 20-30s  |      279 |   376 (  502) |  1266 ( 3008) |       4014
  3   | 30-40s  |      293 |   388 (  655) |  1272 ( 2942) |       3648
  4   | 40-50s  |      302 |   387 (  500) |   976 ( 2173) |       3554
  5   | 50-60s  |      303 |   344 (  444) |   999 ( 2313) |       3470
  6   | 60-70s  |      179 |   374 (  517) |  1427 ( 2803) |       4258
==========================================================================================
TREND ANALYSIS:
  TTFT Range: 344ms - 545ms (variation: 58.6%)
  Throughput Range: 3027 - 4258 tokens/s
  First slice TTFT: 545ms vs. Last slice: 374ms
✅ Warm-up detected: TTFT improved after first slice (cold start effect)
==========================================================================================

Key Insights from Time-Sliced Analysis

1. Warm-Up Effect Detected:

Slice 0 (0-10s):   TTFT = 545ms  ⚠️  Cold start
Slice 1 (10-20s):  TTFT = 381ms  ✅  30% improvement after warm-up
Slices 2-6:        TTFT = 344-388ms  ✅  Stable steady-state

Why this matters:

First 10 seconds show 545ms TTFT (above target)
Performance improves 30% after warm-up
Steady-state performance (344-388ms) is significantly better than cold-start
Implication: Pre-warming servers before production traffic prevents SLA violations

2. Variable Load Patterns:

Request distribution not uniform: 111 requests (slice 0) → 303 requests (slice 5)
Throughput varies with load: 3.0K - 4.3K tokens/sec
System handles variable load without significant degradation

3. No Performance Degradation:

TTFT remains stable from slice 1-6 (344-388ms range)
No upward trend in latency over time
No signs of memory leaks or resource exhaustion
System is healthy for sustained operation

Comparing Overall vs. Time-Sliced Metrics

Metric	Overall Average	Slice 0 (Cold)	Slice 1-6 (Warm)
TTFT	386ms	545ms (+41%)	344-388ms (baseline)
Latency	1,172ms	1,516ms	976-1,427ms

The hidden truth: Overall averages mask the 41% cold-start penalty!
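
The overall figure in the table is the request-weighted average of the per-slice values, which can be checked directly. The sketch below uses the request counts and average TTFTs from the time-sliced results table; slice averages are rounded in that table, so the results are approximate.

```python
# (requests, avg TTFT ms) per slice, copied from the time-sliced results.
slices = [(111, 545), (223, 381), (279, 376), (293, 388),
          (302, 387), (303, 344), (179, 374)]

total_requests = sum(n for n, _ in slices)
overall_ttft = sum(n * ttft for n, ttft in slices) / total_requests

# Steady state: everything after the cold first slice.
warm = slices[1:]
warm_ttft = sum(n * ttft for n, ttft in warm) / sum(n for n, _ in warm)

cold_ttft = slices[0][1]
penalty = cold_ttft / overall_ttft - 1

print(f"Overall avg TTFT: {overall_ttft:.0f} ms over {total_requests} requests")
print(f"Warm avg TTFT (slices 1-6): {warm_ttft:.0f} ms")
print(f"Cold-start penalty vs. overall: {penalty:.0%}")
```

Because slice 0 holds only 111 of the 1,690 requests, it moves the overall average by roughly 11 ms, which is why the cold-start penalty is hard to see without slicing.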

Use Cases for Time-Slicing

Scenario 1: Detecting Warm-Up Effects

Problem: SLA violations in first minute of operation
Solution: Use time-slicing to quantify warm-up penalty
Action: Pre-warm servers or set longer health check delays

Scenario 2: Finding Memory Leaks

Problem: Performance degrades after hours of operation
Solution: Run long benchmark with time-slicing (--benchmark-duration 3600 --slice-duration 300)
Look for: Increasing TTFT/latency in later slices

Scenario 3: Load Pattern Validation

Problem: Trace-based tests with varying load
Solution: Time-slice to see if performance varies with request density
Look for: Correlation between requests/slice and latency

Best Practices

✅ Choose appropriate slice duration:

Too short (<5s): High variance, unstable metrics
Too long (>60s): Miss fine-grained patterns
Recommended: 10-30 seconds for most workloads

✅ Use with trace-based benchmarks:

Time-slicing + realistic traces = complete picture
See both overall AND time-evolving performance

✅ Compare cold vs. warm state:

Exclude slice 0 from steady-state SLA calculations
Report both cold-start and warm-state performance separately

✅ Monitor for degradation:

Upward trend in latency = resource issue
Flat or decreasing latency = healthy system
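
One simple way to turn "upward trend" into a number is a least-squares slope over the per-slice averages. This is a sketch of one possible check, not an AIPerf feature; the input values are the warm-state TTFT averages (slices 1-6) from the results above, and with only six points the slope is a rough indicator at best.

```python
def slope(values):
    """Least-squares slope of values against their index (units per slice)."""
    n = len(values)
    if n < 2:
        raise ValueError("slope() needs at least two values")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx

warm_ttft_ms = [381, 376, 388, 387, 344, 374]  # slices 1-6
trend = slope(warm_ttft_ms)

print(f"TTFT trend: {trend:+.1f} ms per slice")
print("possible degradation" if trend > 0 else "no upward trend")
```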

Summary

We’ve demonstrated 5 powerful AIPerf use cases:

Simple Profiling + Pareto Analysis: Find the sweet spot between user experience and resource utilization
Custom Percentile Analysis: Calculate any metric your organization needs from raw data
Trace-Based Benchmarking: Test with realistic production workload patterns
Goodput Analysis: Measure actual SLA compliance, not just raw throughput
Time-Sliced Analysis: Understand performance evolution and detect warm-up/degradation

Key Takeaway: Synthetic benchmarks (Use Case 1) provide baseline capacity, but real-world validation requires traces (Use Case 3), goodput (Use Case 4), and time-series analysis (Use Case 5) to ensure production readiness.

Advanced Topics

In-Cluster Benchmarking

For high-scale testing, consider running AIPerf from within your Kubernetes cluster to:

Minimize network latency between client and server
Avoid ephemeral port exhaustion on client machines at extreme concurrency
Test true server capacity without client-side bottlenecks

Deploy a load-tester pod in the same cluster as your inference endpoint and use the internal ClusterIP service address for benchmarking.

Request Cancellation Testing

Simulate real-world user behavior where requests are cancelled mid-flight (e.g., users navigating away, timeouts). See the Request Cancellation Tutorial for detailed configuration.

$ aiperf profile \
>   --model qwen3-0.6b \
>   --url $ENDPOINT_URL \
>   --endpoint-type chat \
>   --streaming \
>   --concurrency 10 \
>   --request-count 100 \
>   --request-cancellation-rate 20 \
>   --request-cancellation-delay 0.5 \
>   --isl 800 \
>   --osl 400 \
>   --tokenizer Qwen/Qwen3-0.6B

Parameters:

--request-cancellation-rate 20: Cancel 20% of requests
--request-cancellation-delay 0.5: Wait 0.5 seconds before cancelling
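To make the two parameters concrete, the sketch below simulates the idea with asyncio: a fraction of requests is selected for cancellation, and each selected request is cancelled if it is still in flight after the delay. This is a conceptual illustration only, not AIPerf's implementation; the function names, the random selection, and the stand-in request are assumptions, and the timings are scaled down so the example runs quickly.

```python
import asyncio
import random

async def fake_streaming_request(duration_s: float) -> str:
    """Stand-in for a streaming request that takes duration_s to finish."""
    await asyncio.sleep(duration_s)
    return "completed"

async def send(duration_s: float, cancel: bool, cancel_delay_s: float) -> str:
    task = asyncio.create_task(fake_streaming_request(duration_s))
    if not cancel:
        return await task
    try:
        # wait_for cancels the task if it has not finished within the delay.
        return await asyncio.wait_for(task, timeout=cancel_delay_s)
    except asyncio.TimeoutError:
        return "cancelled"

async def run(n: int, cancel_rate_pct: float, cancel_delay_s: float,
              duration_s: float, seed: int = 0) -> list:
    rng = random.Random(seed)
    # Assumption: each request is independently selected with the given rate.
    flags = [rng.random() * 100 < cancel_rate_pct for _ in range(n)]
    return await asyncio.gather(
        *(send(duration_s, flag, cancel_delay_s) for flag in flags)
    )

outcomes = asyncio.run(
    run(n=100, cancel_rate_pct=20, cancel_delay_s=0.05, duration_s=0.2)
)
print(outcomes.count("cancelled"), "cancelled of", len(outcomes))
```

Note that a request selected for cancellation still completes normally if it finishes before the delay elapses, so short requests can lower the observed cancellation count.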

Use Cases:

Test server resource cleanup and connection pooling
Measure impact of cancellations on remaining requests
Validate graceful degradation under partial failures

Additional Features (v0.5.0)

Server-Side Metrics Collection

AIPerf can collect server-side metrics from Prometheus endpoints exposed by your inference server (e.g., vLLM, TensorRT-LLM). See the Server Metrics Guide for detailed configuration and the Server Metrics Reference for supported metrics.

$ # Auto-discovers Prometheus metrics endpoint from your server URL
$ aiperf profile \
>   --model qwen3-0.6b \
>   --url $ENDPOINT_URL \
>   --endpoint-type chat \
>   --streaming \
>   --concurrency 100 \
>   --request-count 1000
$ 
$ # Or specify additional metrics endpoints
$ aiperf profile \
>   --model qwen3-0.6b \
>   --url $ENDPOINT_URL \
>   --server-metrics "http://server1:9090/metrics" "http://server2:9090/metrics"

Capabilities:

Prometheus integration: Automatically scrape metrics from /metrics endpoints
Multi-server support: Collect from multiple inference replicas
Resource utilization: Track GPU memory, request queuing, and batch sizes
GPU telemetry: See the GPU Telemetry Tutorial for DCGM integration
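For readers unfamiliar with what a /metrics endpoint returns, the sketch below parses a small payload in the Prometheus text exposition format. The payload is hypothetical, the metric names are only examples of what a vLLM server may expose, and the parser is a simplified illustration (it ignores optional timestamps), not the code AIPerf uses.

```python
def parse_prometheus_text(text: str) -> dict:
    """Parse 'name{labels} value' sample lines; comments and blanks are skipped."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces survive.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples

# Hypothetical payload for illustration only.
sample_payload = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="Qwen/Qwen3-0.6B"} 42.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="Qwen/Qwen3-0.6B"} 7.0
"""

metrics = parse_prometheus_text(sample_payload)
for series, value in metrics.items():
    print(f"{series} = {value}")
```

In a real scrape the text would come from an HTTP GET against each endpoint's /metrics path, repeated at intervals so values can be aligned with the benchmark timeline.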

Automatic Plot Generation

Generate visualizations from your profiling results with the aiperf plot command. See the Plot Tutorial for all available plot types and configuration options.

$ # Generate plots from default ./artifacts directory
$ aiperf plot
$ 
$ # Generate plots from specific directories
$ aiperf plot --paths ./run1 ./run2
$ 
$ # Launch interactive dashboard
$ aiperf plot --dashboard
$ 
$ # Use dark theme
$ aiperf plot --theme dark
$ 
$ # Specify output directory
$ aiperf plot --output ./my_plots

Available Plot Types:

Latency distributions: Histograms and percentile bands for TTFT, ITL, request latency
Throughput curves: Token throughput and request throughput over time
Pareto analysis: TPS/GPU vs TPS/User trade-off visualization
Time-series analysis: Performance metrics over time slices
Scatter plots: Request-level latency vs. sequence length
Comparison views: Side-by-side analysis of multiple benchmark runs

Interactive Dashboard:

$ # Launch on localhost:8050
$ aiperf plot --dashboard
$ 
$ # Custom port (default 8050)
$ aiperf plot --dashboard --port 8080
$ 
$ # Custom host (coming soon)
$ # aiperf plot --dashboard --host 0.0.0.0 --port 8080

KV Cache Efficiency Testing

Test prefix caching and KV cache reuse patterns with trace synthesis and user-centric timing. See the Prefix Synthesis Tutorial for detailed configuration.

Trace Synthesis - Generate synthetic traces with controlled prefix-sharing patterns:

$ # Analyze existing trace for prefix statistics
$ aiperf analyze-trace mooncake_trace.jsonl --output-file analysis.json
$ 
$ # Synthesize new traces with controlled scaling
$ aiperf profile \
>   --model qwen3-0.6b \
>   --url $ENDPOINT_URL \
>   --input-file mooncake_trace.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --synthesis-speedup-ratio 2.0 \
>   --synthesis-prefix-len-multiplier 1.5 \
>   --tokenizer Qwen/Qwen3-0.6B

Synthesis Options:

--synthesis-speedup-ratio: Scale trace timing (2.0 = 2x faster replay)
--synthesis-prefix-len-multiplier: Scale shared prefix lengths
--synthesis-prefix-root-multiplier: Distribute traces across N independent prefix trees
--synthesis-prompt-len-multiplier: Scale unique prompt lengths
--synthesis-output-len-multiplier: Scale output lengths
--synthesis-max-isl: Filter out traces exceeding max input length
--synthesis-max-osl: Cap output length for traces exceeding max
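The sketch below illustrates three of these options as they are described above: timing speedup, the input-length filter, and the output-length cap. It is an assumption-laden illustration rather than AIPerf's synthesis code: the function `transform_trace` is hypothetical, the records are made up, the field names follow the Mooncake trace format (timestamp, input_length, output_length, hash_ids), and the real tool may apply these steps in a different order.

```python
def transform_trace(records, speedup_ratio=1.0, max_isl=None, max_osl=None):
    """Apply speedup, ISL filtering, and OSL capping to Mooncake-style records."""
    transformed = []
    for rec in records:
        # Filter: drop records whose input length exceeds the limit.
        if max_isl is not None and rec["input_length"] > max_isl:
            continue
        new_rec = dict(rec)
        # Speedup: a ratio of 2.0 halves every timestamp (2x faster replay).
        new_rec["timestamp"] = rec["timestamp"] / speedup_ratio
        # Cap: clamp output length instead of dropping the record.
        if max_osl is not None:
            new_rec["output_length"] = min(rec["output_length"], max_osl)
        transformed.append(new_rec)
    return transformed

# Hypothetical trace records for illustration only.
trace = [
    {"timestamp": 0, "input_length": 1000, "output_length": 500, "hash_ids": [1, 2]},
    {"timestamp": 2000, "input_length": 9000, "output_length": 100, "hash_ids": [1, 3]},
    {"timestamp": 4000, "input_length": 2000, "output_length": 50, "hash_ids": [4]},
]

result = transform_trace(trace, speedup_ratio=2.0, max_isl=8000, max_osl=200)
for rec in result:
    print(rec)
```

The difference between the two limits matters for request counts: the input limit removes requests from the replay, while the output limit keeps them and only shortens generation.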

User-Centric Timing Mode

Simulate realistic multi-turn conversations with controlled per-user timing for KV cache TTL testing. See the User-Centric Timing Tutorial for detailed configuration.

$ aiperf profile \
>   --model qwen3-0.6b \
>   --url $ENDPOINT_URL \
>   --endpoint-type chat \
>   --streaming \
>   --user-centric-rate 1.0 \
>   --num-users 15 \
>   --session-turns-mean 20 \
>   --shared-system-prompt-length 1000 \
>   --tokenizer Qwen/Qwen3-0.6B

Key Features:

Controlled per-user turn gaps: Each user waits exactly num_users / QPS seconds between turns, where QPS is the --user-centric-rate value (15 / 1.0 = 15 seconds in the command above)
Shared system prompt: Test prefix caching benefits with --shared-system-prompt-length
Virtual history: Immediate steady-state without cold-start transient
Cache TTL testing: Verify KV cache retention at specific time intervals
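The turn-gap formula makes it straightforward to target a specific cache-retention interval. A minimal sketch, assuming QPS corresponds to the --user-centric-rate value; the helper names and the TTL value are hypothetical and not part of AIPerf.

```python
def turn_gap_seconds(num_users: int, qps: float) -> float:
    """Gap between consecutive turns of the same user: num_users / QPS."""
    return num_users / qps

def users_for_gap(target_gap_s: float, qps: float) -> int:
    """Number of users needed to produce a target per-user turn gap."""
    return round(target_gap_s * qps)

# Values from the command above: --num-users 15, --user-centric-rate 1.0
gap_s = turn_gap_seconds(num_users=15, qps=1.0)
print(f"Per-user turn gap: {gap_s} s")

# Hypothetical cache TTL to probe; pick gaps on both sides of it.
cache_ttl_s = 60.0
for target_gap_s in (30.0, 90.0):
    users = users_for_gap(target_gap_s, qps=1.0)
    side = "shorter than" if target_gap_s < cache_ttl_s else "longer than"
    print(f"--num-users {users} gives a {target_gap_s:.0f} s gap ({side} the TTL)")
```

Running one benchmark with a gap below the suspected TTL and one above it, at the same request rate, isolates the effect of cache expiry from the effect of load.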

Coming Soon

New AIPerf capabilities under active development:

Kubernetes-Native Benchmarking

Distributed load generation: Deploy multiple load-tester pods to simulate thousands of concurrent users
Large-scale workloads: Test production-scale traffic patterns without client-side bottlenecks
Automated orchestration: Kubernetes operators to manage benchmark lifecycles and resource allocation
