AIPerf: Comprehensive LLM Benchmarking
Presentation Date: November 13, 2025
Updated: February 2, 2026
Tool: AIPerf v0.5.0 | Architecture Overview | Full Documentation
Key Features in 0.5.0:
aiperf plot command)
📚 Documentation: See the full CLI Options Reference for all available parameters.
Note: This was a demo endpoint used for the November 13, 2025 presentation. The cluster has been taken down.
Model: Qwen3-0.6B (Qwen/Qwen3-0.6B)
Inference Engine: vLLM v0.11.0
Architecture: 8-way data parallelism (8 independent vLLM replicas)
Hardware: 8x NVIDIA H200 GPUs (1 GPU per replica)
Deployment: Kubernetes on Nebius Cloud
Why this endpoint was chosen for the demo:
Follow along locally: You can run a single vLLM replica to try the commands in this guide (results will differ from the multi-replica setup above):
Goal: Measure baseline performance under controlled load
Key Insight: This creates 100 “virtual users” sending 1,000 requests in total, each with a large payload (1,000 input tokens → 500 output tokens).
✅ TTFT = 347ms: Fast first token delivery - users see responses quickly
✅ Request Latency = 2.1s: Total time to generate 500 tokens per request
✅ System Throughput = 22.5K tokens/sec: High capacity with 100 concurrent users
✅ ITL = 3.57ms: Smooth, consistent token streaming
✅ P99 Latency = 3.6s: Even worst-case requests complete reasonably fast
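The figures above can be cross-checked with a little arithmetic. The sketch below assumes that request latency is roughly TTFT plus one inter-token gap for each remaining output token; this is a common approximation, not necessarily how AIPerf computes the metric, and the function name is hypothetical.

```python
def estimate_request_latency_ms(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    """Approximate end-to-end latency: first token, then one ITL gap per later token."""
    return ttft_ms + (output_tokens - 1) * itl_ms


# Inputs are the averages reported above: TTFT = 347 ms, ITL = 3.57 ms, 500 output tokens.
estimated_ms = estimate_request_latency_ms(ttft_ms=347.0, itl_ms=3.57, output_tokens=500)
print(f"Estimated request latency: {estimated_ms / 1000:.2f} s")  # prints 2.13 s
```

The estimate of about 2.13 s is in line with the reported 2.1s request latency, which suggests the three metrics are mutually consistent.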
What we learned:
Goal: Understand the trade-off between resource utilization (TPS/GPU) and user experience (TPS/User) at different concurrency levels.
We ran the same benchmark at 5 different concurrency levels (10, 50, 100, 200, 500) to observe how throughput per GPU and throughput per user change:
Hardware: 8 vLLM replicas on 8 H200 GPUs (so we divide Total TPS by 8 for TPS/GPU)
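As a minimal sketch of the normalization described above, the snippet below divides total throughput by the 8 GPUs. The per-user figure is an assumption: it simply divides total throughput by concurrency, which may differ from how AIPerf derives its own per-user metric. The function names are hypothetical, and the input reuses the 22.5K tokens/sec figure from the baseline run at concurrency 100.

```python
NUM_GPUS = 8  # 8 vLLM replicas, one H200 GPU each


def per_gpu_tps(total_tps: float, num_gpus: int = NUM_GPUS) -> float:
    """Total output tokens/sec divided evenly across GPUs."""
    return total_tps / num_gpus


def approx_per_user_tps(total_tps: float, concurrency: int) -> float:
    """Rough per-user throughput. Assumption: every concurrent user gets an equal share."""
    return total_tps / concurrency


total_tps = 22_500.0  # baseline system throughput reported earlier
concurrency = 100

print(f"TPS/GPU:  {per_gpu_tps(total_tps):.1f}")
print(f"TPS/User: {approx_per_user_tps(total_tps, concurrency):.1f} (approximation)")
```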
The Pareto frontier shows the inverse relationship between resource efficiency and user experience:
Key Insight: The Pareto curve demonstrates you cannot optimize both metrics simultaneously. Choose your operating point based on whether you prioritize cost efficiency (c=200) or user experience (c=10-50).
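To make the idea of a Pareto frontier concrete, here is a small sketch that keeps only the operating points not beaten on both metrics by some other point. The `OperatingPoint` type and the numbers are placeholders for illustration only; they are not the measurements from this benchmark.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    concurrency: int
    tps_per_gpu: float
    tps_per_user: float


def pareto_frontier(points: list[OperatingPoint]) -> list[OperatingPoint]:
    """Return points that no other point matches or beats on both metrics."""
    frontier = []
    for p in points:
        dominated = any(
            q.tps_per_gpu >= p.tps_per_gpu
            and q.tps_per_user >= p.tps_per_user
            and (q.tps_per_gpu > p.tps_per_gpu or q.tps_per_user > p.tps_per_user)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.concurrency)


# Placeholder values, NOT measured results.
points = [
    OperatingPoint(10, 500.0, 400.0),
    OperatingPoint(50, 1800.0, 290.0),
    OperatingPoint(100, 2800.0, 225.0),
    OperatingPoint(200, 3400.0, 135.0),
    OperatingPoint(500, 3000.0, 50.0),
]

for point in pareto_frontier(points):
    print(point)
```

With these placeholder values the c=500 point drops out, because c=200 is better on both axes. Every remaining point trades one metric against the other.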
✅ Low Concurrency (10-50):
✅ Medium Concurrency (100-200):
❌ High Concurrency (500+):
Question: Should you optimize for cost efficiency (max TPS/GPU) or user satisfaction (max TPS/User)?
The c=200 “sweet spot”:
🔍 Performance is non-linear: Doubling concurrency doesn’t double throughput
📊 The inverted-U curve: TPS/GPU rises, peaks at c=200, then falls due to queuing overhead
⚖️ No free lunch: Higher concurrency = better GPU utilization BUT worse user experience
🎯 Know your SLA: Choose concurrency based on your latency vs. throughput priorities
Pro tip: Run this analysis on YOUR endpoint with YOUR request patterns to find YOUR sweet spot!
Scenario: Your management defines SLAs using P75, not the standard P50/P90/P99 that AIPerf reports by default.
Goal: Calculate P75 TTFT from raw benchmark data.
AIPerf outputs detailed per-request data in profile_export.jsonl. Each line is a JSON record. See the Working with Profile Exports tutorial for more analysis techniques.
Key fields: Every request has time_to_first_token, request_latency, ISL, OSL, and more.
Note: The metadata section may contain additional optional fields including was_cancelled, cancellation_time_ns, conversation_id, x_correlation_id, and timing fields like credit_issued_ns and request_ack_ns. The benchmark_phase field is either "warmup" or "profiling".
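A custom percentile can be computed from these records with a few lines of Python. The sketch below assumes each record stores metrics as `{"metrics": {"time_to_first_token": {"value": ..., "unit": "ms"}}}` and keeps `benchmark_phase` under `metadata`. That layout is an assumption, so check it against your own `profile_export.jsonl` before relying on it. The helper names are hypothetical.

```python
import json
import math


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two closest ranks."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def ttft_values(lines) -> list[float]:
    """Collect TTFT values from JSONL lines, skipping warmup requests."""
    values = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("metadata", {}).get("benchmark_phase") == "warmup":
            continue
        ttft = record.get("metrics", {}).get("time_to_first_token")  # assumed layout
        if ttft is not None:
            values.append(float(ttft["value"]))
    return values


# Tiny in-memory sample in the assumed layout. For real data, pass an open file:
#   with open("profile_export.jsonl") as f: print(percentile(ttft_values(f), 75))
sample_lines = [
    json.dumps({"metadata": {"benchmark_phase": phase},
                "metrics": {"time_to_first_token": {"value": value, "unit": "ms"}}})
    for phase, value in [("warmup", 9999.0), ("profiling", 100.0), ("profiling", 200.0),
                         ("profiling", 300.0), ("profiling", 400.0), ("profiling", 500.0)]
]
p75 = percentile(ttft_values(sample_lines), 75)
print(f"P75 TTFT: {p75:.2f} ms")
```

Changing the `75` argument gives any other percentile, and swapping the metric name gives the same statistic for `request_latency`.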
✅ P75 = 422.87ms: 75% of requests get first token within this time
✅ Raw data access: Calculate ANY custom metric your org needs
✅ Full transparency: Every request is logged with complete metrics
✅ Easy parsing: Standard JSON format, one record per line
Why this matters:
Goal: Test your system under realistic production workload patterns using privacy-preserving traces.
📚 Documentation: See Benchmark Datasets for supported dataset formats and Trace Replay Mode for detailed configuration.
Mooncake is an open-source KV cache sharing system that released real production traces from their arXiv Q&A service. These traces capture actual user behavior including:
The Problem: Sharing production traces risks leaking sensitive user data.
Mooncake’s Solution: Hash every 512-token block of input. Users asking about the same document get the same hash IDs, enabling cache reuse analysis without revealing content.
Example: Multi-turn conversation
Key insight: Hash IDs reveal cache reuse opportunities while completely protecting user privacy.
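The block-hashing idea can be sketched as follows. This is an illustration rather than Mooncake's actual implementation. It assumes each block's hash also covers everything before it, so equal IDs imply an equal prefix. It also assumes a trailing partial block gets its own ID and that digests are mapped to small integers in order of first appearance. `block_hash_ids` and `id_table` are hypothetical names.

```python
import hashlib

BLOCK_SIZE = 512  # tokens per hashed block, as described above


def block_hash_ids(token_ids: list[int], id_table: dict[bytes, int],
                   block_size: int = BLOCK_SIZE) -> list[int]:
    """Map each block of tokens to a small integer ID derived from a chained hash."""
    ids = []
    prefix_digest = b""
    for start in range(0, len(token_ids), block_size):
        block = token_ids[start:start + block_size]
        payload = prefix_digest + ",".join(map(str, block)).encode()
        prefix_digest = hashlib.sha256(payload).digest()
        # Reuse the ID if this exact prefix was seen before; otherwise assign the next one.
        ids.append(id_table.setdefault(prefix_digest, len(id_table)))
    return ids


# Two users ask different questions about the same 1,024-token document.
id_table: dict[bytes, int] = {}
document = list(range(1024))
user_a = block_hash_ids(document + [7] * 10, id_table)
user_b = block_hash_ids(document + [9] * 10, id_table)
print(user_a)  # [0, 1, 2]
print(user_b)  # [0, 1, 3]
```

The two shared document blocks receive the same IDs (a cache reuse opportunity), while the differing questions get distinct IDs. Only the integers are shared; the tokens themselves are not.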
Key characteristics of real production traffic:
✅ Highly Variable Request Sizes: 49% of requests are 5K-10K tokens, but tail extends to 125K
✅ Long-Context Dominant: Median of 6,402 tokens vs. typical benchmarks using 1K-2K
✅ Consistent Load: ~393 requests/minute with relatively steady arrival rate
✅ Heavy Tail Distribution: 2% of requests exceed 40K tokens (production reality!)
This represents real-world patterns you won’t get from synthetic benchmarks:
📚 Documentation: See the Timing Modes Reference for all supported timing modes.
✅ Realistic Load Testing: Test how your system handles actual production patterns, not idealized synthetic load
✅ KV Cache Validation: If you implement cache sharing (like Mooncake), trace data shows real hit rates
✅ Capacity Planning: See performance under bursty traffic with variable request sizes
✅ Privacy-Preserving: Hash-based traces enable sharing without exposing sensitive data
Pro tip: Use --fixed-schedule for end-to-end system validation (respects timing), or remove it to stress-test maximum throughput capacity. See the Fixed Schedule Tutorial for more details.
We extracted the first 5 minutes of the Mooncake trace (1,765 requests) and sped it up 5x to replay in ~1 minute:
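One way to prepare such a slice is sketched below. It assumes the trace is JSONL with a `timestamp` field in milliseconds from the start of the trace, as in the published Mooncake trace files. It rewrites timestamps directly, which is not necessarily how the demo applied the speed-up. The function name is hypothetical.

```python
import json


def slice_and_speed_up(lines, window_ms: int = 5 * 60 * 1000, speedup: float = 5.0) -> list[dict]:
    """Keep requests from the first `window_ms` and compress their timestamps by `speedup`."""
    selected = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record["timestamp"] >= window_ms:  # assumed: milliseconds since trace start
            continue
        record["timestamp"] = int(record["timestamp"] / speedup)
        selected.append(record)
    return selected


# Illustrative records only; real traces also carry fields such as hash_ids.
sample_trace = [
    json.dumps({"timestamp": ts, "input_length": 6000, "output_length": 500})
    for ts in (0, 150_000, 299_999, 300_000)
]
for record in slice_and_speed_up(sample_trace):
    print(json.dumps(record))
```

A 5-minute window compressed 5x spans about 60 seconds, matching the ~1 minute replay described above.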
Results:
✅ Highly Variable Request Sizes:
✅ Performance Under Real Load:
✅ Realistic Failures:
✅ Production Timing Patterns:
--concurrency 100
What we learned from trace-based vs. synthetic testing:
Trace-based testing exposes real-world challenges that synthetic benchmarks hide!
Goal: Measure what percentage of requests meet your defined Service Level Objectives (SLOs), not just average performance.
📚 Documentation: See the Goodput Tutorial for additional examples and SLO configuration options.
Goodput = The fraction of requests that meet ALL specified SLO thresholds.
Why it matters:
Definition (from DistServe paper):
“Goodput measures the number of requests per second that meet specified service-level objectives (SLOs), providing a metric that directly reflects user-perceived quality of service.”
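Both views of goodput, requests per second and fraction of requests, can be computed from per-request results. The sketch below is a minimal illustration: the `RequestResult` type, the function name, and the threshold values are hypothetical, and a request counts as good only if it meets every threshold.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    ttft_ms: float
    request_latency_ms: float


def goodput(results: list[RequestResult], duration_s: float,
            max_ttft_ms: float, max_latency_ms: float) -> tuple[float, float]:
    """Return (good requests per second, fraction of requests meeting all SLOs)."""
    good = sum(
        1 for r in results
        if r.ttft_ms <= max_ttft_ms and r.request_latency_ms <= max_latency_ms
    )
    fraction = good / len(results) if results else 0.0
    return good / duration_s, fraction


# Illustrative requests and thresholds, not measured data.
results = [
    RequestResult(ttft_ms=200.0, request_latency_ms=1500.0),   # meets both SLOs
    RequestResult(ttft_ms=450.0, request_latency_ms=1800.0),   # meets both SLOs
    RequestResult(ttft_ms=900.0, request_latency_ms=1900.0),   # TTFT too slow
    RequestResult(ttft_ms=300.0, request_latency_ms=5200.0),   # latency too slow
]
rate, fraction = goodput(results, duration_s=2.0, max_ttft_ms=500.0, max_latency_ms=2000.0)
print(f"Goodput: {rate:.2f} req/s ({fraction:.0%} of requests)")
```

Here throughput is 2 requests per second, but only half of the requests meet both thresholds, so goodput is 1 request per second.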
Imagine two systems serving 1000 requests/min:
Both have the same throughput, but System A delivers 2x better user experience!
We’ll use the same Mooncake trace, but add SLO thresholds:
Goodput vs. Throughput:
Understanding the results:
How SLO compliance works:
Question: How many servers do I need to handle 1000 req/sec with 95% goodput?
Without goodput analysis:
With goodput analysis:
The cost of ignoring goodput: Underprovisioning by 250%!
Different use cases need different SLOs:
Pro tip: Set SLO thresholds based on your business requirements, then use goodput to measure compliance and plan capacity accordingly.
Goal: Understand how performance metrics evolve during a benchmark to detect warm-up effects, degradation patterns, or load-dependent behavior.
📚 Documentation: See the Timeslices Tutorial for configuration options and the Warmup Tutorial for managing cold-start effects.
Time-slicing divides your benchmark into sequential time windows, computing metrics independently for each window.
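The windowing idea can be sketched in a few lines. This is an illustration, not AIPerf's implementation. It assumes each sample is assigned to a window by its start time in seconds since the benchmark began, and it reports only the mean per window. The function name and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def timeslice_means(samples: list[tuple[float, float]], slice_s: float = 10.0) -> dict[int, float]:
    """Group (start_time_s, value) samples into fixed windows and average each window."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for start_time_s, value in samples:
        buckets[int(start_time_s // slice_s)].append(value)
    return {index: mean(values) for index, values in sorted(buckets.items())}


# Illustrative TTFT samples (ms), not measured data: slower at first, then steady.
samples = [(0.5, 600.0), (4.0, 400.0), (12.0, 300.0), (19.9, 300.0), (25.0, 310.0)]
for index, avg in timeslice_means(samples).items():
    print(f"slice {index} ({index * 10}-{(index + 1) * 10}s): mean TTFT {avg:.1f} ms")
```

A single average over all five samples would hide the fact that the first window is noticeably slower than the rest.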
Why it matters:
We’ll use the same Mooncake trace with 10-second time slices:
Output: AIPerf generates additional files:
profile_export_aiperf_timeslices.csv - Time-series data in tidy format
profile_export_aiperf_timeslices.json - Hierarchical time-series data
1. Warm-Up Effect Detected:
Why this matters:
2. Variable Load Patterns:
3. No Performance Degradation:
The hidden truth: Overall averages mask the 41% cold-start penalty!
Scenario 1: Detecting Warm-Up Effects
Scenario 2: Finding Memory Leaks
Scenario 3: Load Pattern Validation
✅ Choose appropriate slice duration:
✅ Use with trace-based benchmarks:
✅ Compare cold vs. warm state:
✅ Monitor for degradation:
We’ve demonstrated 5 powerful AIPerf use cases:
Key Takeaway: Synthetic benchmarks (Use Case 1) provide baseline capacity, but real-world validation requires traces (Use Case 3), goodput (Use Case 4), and time-series analysis (Use Case 5) to ensure production readiness.
For high-scale testing, consider running AIPerf from within your Kubernetes cluster to:
Deploy a load-tester pod in the same cluster as your inference endpoint and use the internal ClusterIP service address for benchmarking.
Simulate real-world user behavior where requests are cancelled mid-flight (e.g., users navigating away, timeouts). See the Request Cancellation Tutorial for detailed configuration.
Parameters:
--request-cancellation-rate 20: Cancel 20% of requests
--request-cancellation-delay 0.5: Wait 0.5 seconds before cancelling
Use Cases:
AIPerf can collect server-side metrics from Prometheus endpoints exposed by your inference server (e.g., vLLM, TensorRT-LLM). See the Server Metrics Guide for detailed configuration and the Server Metrics Reference for supported metrics.
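For readers unfamiliar with what a Prometheus endpoint returns, the sketch below parses a small sample of the text exposition format. It is not AIPerf's collector. It assumes one sample per line with no trailing timestamp, and the sample values are illustrative. `parse_prometheus_text` is a hypothetical helper.

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse 'series value' lines, skipping blank lines and # HELP / # TYPE comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: the value is the last space-separated field (no timestamp).
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Illustrative scrape output in the style of a vLLM /metrics endpoint.
sample_scrape = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="Qwen/Qwen3-0.6B"} 12.0
vllm:num_requests_waiting{model_name="Qwen/Qwen3-0.6B"} 3.0
"""

metrics = parse_prometheus_text(sample_scrape)
for series, value in metrics.items():
    print(series, value)
```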
Capabilities:
/metrics endpoints
Generate visualizations from your profiling results with the aiperf plot command. See the Plot Tutorial for all available plot types and configuration options.
Available Plot Types:
Interactive Dashboard:
Test prefix caching and KV cache reuse patterns with trace synthesis and user-centric timing. See the Prefix Synthesis Tutorial for detailed configuration.
Trace Synthesis - Generate synthetic traces with controlled prefix-sharing patterns:
Synthesis Options:
--synthesis-speedup-ratio: Scale trace timing (2.0 = 2x faster replay)
--synthesis-prefix-len-multiplier: Scale shared prefix lengths
--synthesis-prefix-root-multiplier: Distribute traces across N independent prefix trees
--synthesis-prompt-len-multiplier: Scale unique prompt lengths
--synthesis-max-isl: Filter out traces exceeding max input length
--synthesis-max-osl: Cap output length for traces exceeding max
Simulate realistic multi-turn conversations with controlled per-user timing for KV cache TTL testing. See the User-Centric Timing Tutorial for detailed configuration.
Key Features:
num_users / QPS seconds between turns
--shared-system-prompt-length
AIPerf is actively developing new capabilities: