AIPerf: Comprehensive LLM Benchmarking
Presentation Date: November 13, 2025
Updated: February 2, 2026
Tool: AIPerf v0.5.0 | Architecture Overview | Full Documentation
Table of Contents
- Setup: Installing AIPerf 0.5.0
- Test Endpoint Details
- Use Case 1: Simple Profiling with Static ISL/OSL
- Use Case 2: Auditing Raw Results - Custom Percentile Analysis
- Use Case 3: Trace-Based Benchmarking with Mooncake
- Use Case 4: Goodput Analysis - Measuring SLA Compliance
- Use Case 5: Time-Sliced Analysis - Performance Over Time
- Summary
- Advanced Topics
- Additional Features (v0.5.0)
- Coming Soon
Setup: Installing AIPerf 0.5.0
Key Features in 0.5.0:
- ✅ Server-side metrics collection via Prometheus
- ✅ Automatic plot generation (`aiperf plot` command)
- ✅ KV cache efficiency testing with trace synthesis
- ✅ User-centric timing mode for multi-turn KV cache TTL testing
- ✅ Goodput analysis for SLA compliance measurement
- ✅ Time-sliced analysis for performance trends over time
📚 Documentation: See the full CLI Options Reference for all available parameters.
Test Endpoint Details
Note: This was a demo endpoint used for the November 13, 2025 presentation. The cluster has been taken down.
Model: Qwen3-0.6B (Qwen/Qwen3-0.6B)
Inference Engine: vLLM v0.11.0
Architecture: 8-way data parallelism (8 independent vLLM replicas)
Hardware: 8x NVIDIA H200 GPUs (1 GPU per replica)
Deployment: Kubernetes on Nebius Cloud
Why this endpoint was chosen for the demo:
- Small model (~600M parameters) = high throughput for benchmarking
- 8 replicas = demonstrated horizontal scaling
- Public access = allowed live demonstration
Follow along locally: You can run a single vLLM replica to try the commands in this guide (results will differ from the multi-replica setup above):
Use Case 1: Simple Profiling with Static ISL/OSL
Goal: Measure baseline performance under controlled load
Command
Parameters Explained
Key Insight: This creates 100 “virtual users” sending 1,000 requests total with large payloads (1000→500 tokens).
Results
Key Takeaways
✅ TTFT = 347ms: Fast first token delivery - users see responses quickly
✅ Request Latency = 2.1s: Total time to generate 500 tokens per request
✅ System Throughput = 22.5K tokens/sec: High capacity with 100 concurrent users
✅ ITL = 3.57ms: Smooth, consistent token streaming
✅ P99 Latency = 3.6s: Even worst-case requests complete reasonably fast
What we learned:
- With 100 concurrent users and large payloads (1000→500 tokens), the system maintained stable performance
- P99 latency (3.6s) is only ~70% above the average (2.1s) - good consistency at the tail
- Zero errors = reliable service under load
- 22.5K tokens/sec sustained throughput demonstrates 8-replica scaling effectiveness
Evolution: Pareto Curve Analysis - Resource Efficiency vs. User Experience
Goal: Understand the trade-off between resource utilization (TPS/GPU) and user experience (TPS/User) at different concurrency levels.
The Experiment
We ran the same benchmark at 5 different concurrency levels (10, 50, 100, 200, 500) to observe how throughput per GPU and throughput per user change:
Results: The Pareto Curve
Hardware: 8 vLLM replicas on 8 H200 GPUs (so we divide Total TPS by 8 for TPS/GPU)
Visualizing the Trade-off
The Pareto frontier shows the inverse relationship between resource efficiency and user experience:
Key Insight: The Pareto curve demonstrates you cannot optimize both metrics simultaneously. Choose your operating point based on whether you prioritize cost efficiency (c=200) or user experience (c=10-50).
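The two axes of the Pareto curve are simple derivations from total token throughput. A minimal sketch of that arithmetic (the total-TPS values below are illustrative placeholders, not measured results from this run; only the 8-GPU divisor comes from the setup above):

```python
# Sketch: deriving the two Pareto axes from total output-token throughput.
# NUM_GPUS matches the demo setup (8 vLLM replicas, 1 GPU each).

NUM_GPUS = 8

def pareto_point(concurrency: int, total_tps: float) -> dict:
    """Convert total token throughput into the two Pareto axes."""
    return {
        "concurrency": concurrency,
        "tps_per_gpu": total_tps / NUM_GPUS,      # resource efficiency
        "tps_per_user": total_tps / concurrency,  # user experience
    }

# Example sweep with placeholder totals
for c, tps in [(10, 30_000), (50, 90_000), (200, 144_000), (500, 120_000)]:
    p = pareto_point(c, tps)
    print(f"c={p['concurrency']:>3}  {p['tps_per_gpu']:>8.0f} TPS/GPU  "
          f"{p['tps_per_user']:>6.0f} TPS/user")
```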
Key Insights from the Pareto Curve
✅ Low Concurrency (10-50):
- Poor resource utilization: Only 1,500-6,500 TPS/GPU = GPUs are underutilized
- Best user experience: 365 tokens/sec per user = very responsive
- Use case: Premium tier, low-latency applications
✅ Medium Concurrency (100-200):
- Balanced performance: ~11,000-18,000 TPS/GPU
- Good user experience: ~240-285 tokens/sec per user
- Sweet spot at c=200: Peak resource utilization (18K TPS/GPU) with acceptable user experience
- Use case: General production workloads
❌ High Concurrency (500+):
- Degraded resource utilization: TPS/GPU drops from 18K → 15K
- Poor user experience: 129 tokens/sec per user, TTFT = 1.1 seconds
- Queuing dominates: Request backlog causes both metrics to degrade
- Use case: Avoid this region unless cost is the only priority
The Business Trade-off
Question: Should you optimize for cost efficiency (max TPS/GPU) or user satisfaction (max TPS/User)?
The c=200 “sweet spot”:
- 12x better resource utilization vs. c=10 (18K vs. 1.5K TPS/GPU)
- Only 35% reduction in per-user throughput (239 vs. 365 tokens/sec/user)
- TTFT still under 500ms for most requests
What We Learned
🔍 Performance is non-linear: Doubling concurrency doesn’t double throughput
📊 The inverted-U curve: TPS/GPU rises, peaks at c=200, then falls due to queuing overhead
⚖️ No free lunch: Higher concurrency = better GPU utilization BUT worse user experience
🎯 Know your SLA: Choose concurrency based on your latency vs. throughput priorities
Pro tip: Run this analysis on YOUR endpoint with YOUR request patterns to find YOUR sweet spot!
Use Case 2: Auditing Raw Results - Custom Percentile Analysis
Scenario: Your management defines SLAs using P75, not the standard P50/P90/P99 that AIPerf reports by default.
Goal: Calculate P75 TTFT from raw benchmark data.
Understanding the Raw Data: profile_export.jsonl
AIPerf outputs detailed per-request data in profile_export.jsonl. Each line is a JSON record. See the Working with Profile Exports tutorial for more analysis techniques.
Key fields: Every request has time_to_first_token, request_latency, ISL, OSL, and more.
Note: The metadata section may contain additional optional fields including was_cancelled, cancellation_time_ns, conversation_id, x_correlation_id, and timing fields like credit_issued_ns and request_ack_ns. The benchmark_phase field is either "warmup" or "profiling".
Calculating P75 TTFT
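A minimal sketch of the calculation: stream `profile_export.jsonl`, collect `time_to_first_token` from each record, and interpolate the 75th percentile. Field names follow the export description above, but the exact record layout and units can vary between AIPerf versions, so treat this as a template rather than a canonical parser.

```python
# Sketch: computing a custom percentile (P75 TTFT) from profile_export.jsonl.
import json

def percentile(values, p):
    """Linear-interpolation percentile over a list of numbers."""
    xs = sorted(values)
    k = (len(xs) - 1) * p / 100
    lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

def p75_ttft(path="profile_export.jsonl"):
    ttfts = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            # Skip warmup-phase requests so they don't skew the percentile
            if rec.get("metadata", {}).get("benchmark_phase") == "warmup":
                continue
            ttfts.append(rec["time_to_first_token"])
    return percentile(ttfts, 75)
```

The same `percentile` helper works for any cut your SLA defines (P75, P95, P99.9) without re-running the benchmark.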
Results from Our Benchmark
Key Takeaways
✅ P75 = 422.87ms: 75% of requests get first token within this time
✅ Raw data access: Calculate ANY custom metric your org needs
✅ Full transparency: Every request is logged with complete metrics
✅ Easy parsing: Standard JSON format, one record per line
Why this matters:
- Different orgs have different SLA definitions
- P75 is a common SLA target (balance between typical and worst-case)
- AIPerf’s raw exports let you calculate ANY percentile or custom metric
- No need to re-run benchmarks for different analyses
Use Case 3: Trace-Based Benchmarking with Mooncake
Goal: Test your system under realistic production workload patterns using privacy-preserving traces.
📚 Documentation: See Benchmark Datasets for supported dataset formats and Trace Replay Mode for detailed configuration.
What is Mooncake Trace Data?
Mooncake is an open-source KV cache sharing system that released real production traces from their arXiv Q&A service. These traces capture actual user behavior including:
- Request arrival times
- Input/output token lengths
- Block hash IDs: Privacy-preserving identifiers for KV cache reuse patterns
Understanding Block Hashing
The Problem: Sharing production traces risks leaking sensitive user data.
Mooncake’s Solution: Hash every 512-token block of input. Users asking about the same document get the same hash IDs, enabling cache reuse analysis without revealing content.
Example: Multi-turn conversation
Key insight: Hash IDs reveal cache reuse opportunities while completely protecting user privacy.
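The idea can be sketched in a few lines. Note the chaining of each block's hash into the next is an assumption about how prefix reuse is captured; the essential property is that identical prefixes (and only prefixes) produce identical ID sequences:

```python
# Sketch of Mooncake-style block hashing: hash every 512-token block,
# chaining in the previous block's ID so shared prefixes share ID sequences.
import hashlib

BLOCK = 512

def block_hash_ids(tokens: list[int]) -> list[str]:
    ids, prev = [], ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        blob = (prev + ",".join(map(str, tokens[i:i + BLOCK]))).encode()
        prev = hashlib.sha256(blob).hexdigest()[:8]
        ids.append(prev)
    return ids

# Two users asking about the same 1024-token document share a prefix...
doc = list(range(1024))
a = block_hash_ids(doc + [7] * 512)   # user A's follow-up question
b = block_hash_ids(doc + [9] * 512)   # user B's different question
print(a[:2] == b[:2], a[2] == b[2])   # shared prefix matches, tails differ
```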
The Mooncake arXiv Trace Dataset
Key characteristics of real production traffic:
✅ Highly Variable Request Sizes: 49% of requests are 5K-10K tokens, but tail extends to 125K
✅ Long-Context Dominant: Median of 6,402 tokens vs. typical benchmarks using 1K-2K
✅ Consistent Load: ~393 requests/minute with relatively steady arrival rate
✅ Heavy Tail Distribution: 2% of requests exceed 40K tokens (production reality!)
This represents real-world patterns you won’t get from synthetic benchmarks:
- Multi-turn conversations (shared hash IDs across requests)
- Variable request sizes (not uniform 1K/500 like Use Case 1)
- Realistic timing (actual production arrival patterns)
- Long-context queries that stress-test model limits
Running a Trace-Based Benchmark
Key Differences from Synthetic Benchmarks
📚 Documentation: See the Timing Modes Reference for all supported timing modes.
Why Trace-Based Benchmarking Matters
✅ Realistic Load Testing: Test how your system handles actual production patterns, not idealized synthetic load
✅ KV Cache Validation: If you implement cache sharing (like Mooncake), trace data shows real hit rates
✅ Capacity Planning: See performance under bursty traffic with variable request sizes
✅ Privacy-Preserving: Hash-based traces enable sharing without exposing sensitive data
Pro tip: Use --fixed-schedule for end-to-end system validation (respects timing), or remove it to stress-test maximum throughput capacity. See the Fixed Schedule Tutorial for more details.
Real Benchmark Results: 5-Minute Mooncake Trace (5x Speed)
We extracted the first 5 minutes of the Mooncake trace (1,765 requests) and sped it up 5x to replay in ~1 minute:
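The extraction step can be sketched as a small JSONL filter. This assumes each trace record carries a millisecond `timestamp` field and that records are sorted by arrival time; adjust field names and units to your copy of the trace:

```python
# Sketch: keep the first N seconds of a Mooncake-style trace and compress
# its timeline by a speedup factor (5x here replays ~5 min in ~1 min).
import json

def speed_up_trace(in_path, out_path, window_s=300, speedup=5.0):
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            rec = json.loads(line)
            if rec["timestamp"] > window_s * 1000:  # trace assumed sorted
                break
            rec["timestamp"] = int(rec["timestamp"] / speedup)
            dst.write(json.dumps(rec) + "\n")
            kept += 1
    return kept
```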
Results:
Key Observations from Trace-Based Testing
✅ Highly Variable Request Sizes:
- Input: 890→32,236 tokens (36x range!)
- Output: 1→1,165 tokens
- Median input: 6,344 tokens (much larger than our synthetic 1K)
✅ Performance Under Real Load:
- TTFT = 407ms average despite a 6K+ token median input
- System handled 4,675 tokens/sec with bursty, variable traffic
- P99 TTFT = 951ms (some large requests took longer, as expected)
✅ Realistic Failures:
- 75 requests (4%) exceeded Qwen3-0.6B’s 32K context limit
- This reveals a real operational constraint you’d miss with synthetic tests
- Production insight: Need longer-context model or request filtering
✅ Production Timing Patterns:
- Trace shows realistic request bursts and lulls
- Not constant load like `--concurrency 100`
- More representative of actual user traffic patterns
What we learned from trace-based vs. synthetic testing:
- Use Case 1 (synthetic): 100% success, uniform 1K→500 tokens, 22.5K TPS
- Use Case 3 (trace): 96% success, variable 890→32K input tokens, 4.7K TPS, revealed context window issues
Trace-based testing exposes real-world challenges that synthetic benchmarks hide!
Use Case 4: Goodput Analysis - Measuring SLA Compliance
Goal: Measure what percentage of requests meet your defined Service Level Objectives (SLOs), not just average performance.
📚 Documentation: See the Goodput Tutorial for additional examples and SLO configuration options.
What is Goodput?
Goodput = The fraction of requests that meet ALL specified SLA thresholds.
Why it matters:
- Throughput tells you how many requests/sec your system handles
- Goodput tells you how many requests/sec deliver acceptable user experience
- A system can have high throughput but low goodput if most requests miss SLAs!
Definition (from DistServe paper):
“Goodput measures the number of requests per second that meet specified service-level objectives (SLOs), providing a metric that directly reflects user-perceived quality of service.”
Real-World Example: Why Goodput > Throughput
Imagine two systems serving 1000 requests/min:
- System A: 950 requests under SLA, 50 requests timeout → 95% goodput
- System B: 500 requests under SLA, 500 requests slow → 50% goodput
Both have the same throughput, but System A delivers 2x better user experience!
Running Goodput Analysis
We’ll use the same Mooncake trace, but add SLO thresholds:
Goodput Results
Key Insights from Goodput Analysis
Goodput vs. Throughput:
Understanding the results:
- SLO Thresholds: TTFT ≤ 370ms AND Request Latency ≤ 648ms
- Average TTFT: 428ms (above threshold)
- Median Latency: 691ms (above threshold)
- 72% of requests failed to meet at least one SLO
- Raw throughput doesn’t reveal this user experience gap!
How SLO compliance works:
- Requests must meet ALL SLO criteria to count toward goodput
- A request with TTFT=350ms but latency=700ms fails (missed latency SLO)
- A request with TTFT=400ms but latency=600ms fails (missed TTFT SLO)
- Only requests with TTFT≤370ms AND latency≤648ms count as goodput
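The all-or-nothing check above is easy to express directly. A minimal sketch, using the SLO thresholds from this run (TTFT ≤ 370ms, request latency ≤ 648ms):

```python
# Sketch: per-request SLO check for goodput. A request counts only if it
# meets EVERY threshold; missing any single SLO disqualifies it.

SLOS = {"ttft_ms": 370, "latency_ms": 648}

def meets_slos(req: dict) -> bool:
    return all(req[k] <= limit for k, limit in SLOS.items())

requests = [
    {"ttft_ms": 350, "latency_ms": 700},  # fails: latency SLO missed
    {"ttft_ms": 400, "latency_ms": 600},  # fails: TTFT SLO missed
    {"ttft_ms": 360, "latency_ms": 640},  # passes both
]
goodput_fraction = sum(map(meets_slos, requests)) / len(requests)
print(f"goodput: {goodput_fraction:.0%}")  # → goodput: 33%
```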
What Goodput Tells You That Other Metrics Don't
Using Goodput for Capacity Planning
Question: How many servers do I need to handle 1000 req/sec with 95% goodput?
Without goodput analysis:
- Measure throughput: 26.67 req/sec per server
- Calculate: 1000 / 26.67 = 38 servers
- Problem: This assumes all requests meet SLAs! ❌
With goodput analysis:
- Measure goodput: 7.43 req/sec per server (28% of throughput)
- Calculate: 1000 / 7.43 = 135 servers
- Reality: Need 3.5x more capacity to meet SLAs ✅
The cost of ignoring goodput: underprovisioning by ~3.5x (38 servers planned vs. 135 needed)!
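The sizing arithmetic above in one small sketch, using the per-server throughput and goodput figures from this run:

```python
# Sketch: size the fleet on goodput (SLO-compliant req/sec per server),
# not on raw throughput, or you will underprovision.
import math

def servers_needed(target_rps: float, per_server_rps: float) -> int:
    return math.ceil(target_rps / per_server_rps)

print(servers_needed(1000, 26.67))  # throughput-based estimate: 38 servers
print(servers_needed(1000, 7.43))   # goodput-based estimate:   135 servers
```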
Adjusting SLOs for Your Business
Different use cases need different SLOs:
Pro tip: Set SLO thresholds based on your business requirements, then use goodput to measure compliance and plan capacity accordingly.
Use Case 5: Time-Sliced Analysis - Performance Over Time
Goal: Understand how performance metrics evolve during a benchmark to detect warm-up effects, degradation patterns, or load-dependent behavior.
📚 Documentation: See the Timeslices Tutorial for configuration options and the Warmup Tutorial for managing cold-start effects.
What is Time-Slicing?
Time-slicing divides your benchmark into sequential time windows, computing metrics independently for each window.
Why it matters:
- Detect warm-up effects: Identify cold-start latency vs. steady-state performance
- Spot degradation: Find memory leaks or resource exhaustion over time
- Understand load patterns: See how performance changes as traffic evolves
- Validate SLAs over time: Ensure consistent performance, not just averages
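At its core, time-slicing is just bucketing per-request results into fixed windows and recomputing each metric per window. A minimal sketch of that logic (timestamps and TTFT values here are illustrative, in seconds and milliseconds):

```python
# Sketch: bucket (start_time, ttft) samples into fixed windows and compute
# a per-slice mean - the essence of what the time-sliced export contains.
from collections import defaultdict
from statistics import mean

def slice_metric(samples, slice_s=10):
    """samples: (start_time_s, ttft_ms) pairs -> {slice_index: mean_ttft}."""
    buckets = defaultdict(list)
    for t, ttft in samples:
        buckets[int(t // slice_s)].append(ttft)
    return {i: mean(v) for i, v in sorted(buckets.items())}

samples = [(1, 540), (4, 550), (12, 380), (17, 360), (23, 350)]
print(slice_metric(samples))  # slice 0 is the cold-start window
```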
Running Time-Sliced Analysis
We’ll use the same Mooncake trace with 10-second time slices:
Output: AIPerf generates additional files:
- `profile_export_aiperf_timeslices.csv` - Time-series data in tidy format
- `profile_export_aiperf_timeslices.json` - Hierarchical time-series data
Time-Sliced Results
Key Insights from Time-Sliced Analysis
1. Warm-Up Effect Detected:
Why this matters:
- First 10 seconds show 545ms TTFT (above target)
- Performance improves 30% after warm-up
- Steady-state performance (344-388ms) is significantly better than cold-start
- Implication: Pre-warming servers before production traffic prevents SLA violations
2. Variable Load Patterns:
- Request distribution not uniform: 111 requests (slice 0) → 303 requests (slice 5)
- Throughput varies with load: 3.0K - 4.3K tokens/sec
- System handles variable load without significant degradation
3. No Performance Degradation:
- TTFT remains stable from slice 1-6 (344-388ms range)
- No upward trend in latency over time
- No signs of memory leaks or resource exhaustion
- System is healthy for sustained operation
Comparing Overall vs. Time-Sliced Metrics
The hidden truth: Overall averages mask the 41% cold-start penalty!
Use Cases for Time-Slicing
Scenario 1: Detecting Warm-Up Effects
Scenario 2: Finding Memory Leaks
Scenario 3: Load Pattern Validation
Best Practices
✅ Choose appropriate slice duration:
- Too short (<5s): High variance, unstable metrics
- Too long (>60s): Miss fine-grained patterns
- Recommended: 10-30 seconds for most workloads
✅ Use with trace-based benchmarks:
- Time-slicing + realistic traces = complete picture
- See both overall AND time-evolving performance
✅ Compare cold vs. warm state:
- Exclude slice 0 from steady-state SLA calculations
- Report both cold-start and warm-state performance separately
✅ Monitor for degradation:
- Upward trend in latency = resource issue
- Flat or decreasing latency = healthy system
Summary
We’ve demonstrated 5 powerful AIPerf use cases:
- Simple Profiling + Pareto Analysis: Find the sweet spot between user experience and resource utilization
- Custom Percentile Analysis: Calculate any metric your organization needs from raw data
- Trace-Based Benchmarking: Test with realistic production workload patterns
- Goodput Analysis: Measure actual SLA compliance, not just raw throughput
- Time-Sliced Analysis: Understand performance evolution and detect warm-up/degradation
Key Takeaway: Synthetic benchmarks (Use Case 1) provide baseline capacity, but real-world validation requires traces (Use Case 3), goodput (Use Case 4), and time-series analysis (Use Case 5) to ensure production readiness.
Advanced Topics
In-Cluster Benchmarking
For high-scale testing, consider running AIPerf from within your Kubernetes cluster to:
- Eliminate network latency between client and server
- Avoid ephemeral port exhaustion on client machines at extreme concurrency
- Test true server capacity without client-side bottlenecks
Deploy a load-tester pod in the same cluster as your inference endpoint and use the internal ClusterIP service address for benchmarking.
Request Cancellation Testing
Simulate real-world user behavior where requests are cancelled mid-flight (e.g., users navigating away, timeouts). See the Request Cancellation Tutorial for detailed configuration.
Parameters:
- `--request-cancellation-rate 20`: Cancel 20% of requests
- `--request-cancellation-delay 0.5`: Wait 0.5 seconds before cancelling
Use Cases:
- Test server resource cleanup and connection pooling
- Measure impact of cancellations on remaining requests
- Validate graceful degradation under partial failures
Additional Features (v0.5.0)
Server-Side Metrics Collection
AIPerf can collect server-side metrics from Prometheus endpoints exposed by your inference server (e.g., vLLM, TensorRT-LLM). See the Server Metrics Guide for detailed configuration and the Server Metrics Reference for supported metrics.
Capabilities:
- Prometheus integration: Automatically scrape metrics from `/metrics` endpoints
- Multi-server support: Collect from multiple inference replicas
- Resource utilization: Track GPU memory, request queuing, and batch sizes
- GPU telemetry: See the GPU Telemetry Tutorial for DCGM integration
Automatic Plot Generation
Generate visualizations from your profiling results with the aiperf plot command. See the Plot Tutorial for all available plot types and configuration options.
Available Plot Types:
- Latency distributions: Histograms and percentile bands for TTFT, ITL, request latency
- Throughput curves: Token throughput and request throughput over time
- Pareto analysis: TPS/GPU vs TPS/User trade-off visualization
- Time-series analysis: Performance metrics over time slices
- Scatter plots: Request-level latency vs. sequence length
- Comparison views: Side-by-side analysis of multiple benchmark runs
Interactive Dashboard:
KV Cache Efficiency Testing
Test prefix caching and KV cache reuse patterns with trace synthesis and user-centric timing. See the Prefix Synthesis Tutorial for detailed configuration.
Trace Synthesis - Generate synthetic traces with controlled prefix-sharing patterns:
Synthesis Options:
- `--synthesis-speedup-ratio`: Scale trace timing (2.0 = 2x faster replay)
- `--synthesis-prefix-len-multiplier`: Scale shared prefix lengths
- `--synthesis-prefix-root-multiplier`: Distribute traces across N independent prefix trees
- `--synthesis-prompt-len-multiplier`: Scale unique prompt lengths
- `--synthesis-max-isl`: Filter out traces exceeding max input length
- `--synthesis-max-osl`: Cap output length for traces exceeding max
User-Centric Timing Mode
Simulate realistic multi-turn conversations with controlled per-user timing for KV cache TTL testing. See the User-Centric Timing Tutorial for detailed configuration.
Key Features:
- Controlled per-user turn gaps: Each user waits exactly `num_users / QPS` seconds between turns
- Shared system prompt: Test prefix caching benefits with `--shared-system-prompt-length`
- Virtual history: Immediate steady-state without cold-start transient
- Cache TTL testing: Verify KV cache retention at specific time intervals
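The turn-gap arithmetic above in a tiny sketch: fixing each user's gap at `num_users / QPS` keeps aggregate load at the target QPS while making per-user cache-idle time deterministic, which is what lets you probe a specific KV cache TTL:

```python
# Sketch: with user-centric timing, each user's KV cache sits idle for
# exactly num_users / QPS seconds between turns.

def turn_gap_seconds(num_users: int, qps: float) -> float:
    return num_users / qps

# 100 users at 20 QPS -> each cache entry is idle 5s between turns,
# so a 4s cache TTL would evict it and a 6s TTL would retain it.
print(turn_gap_seconds(100, 20))  # 5.0
```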
Coming Soon
AIPerf is actively developing new capabilities:
Kubernetes-Native Benchmarking
- Distributed load generation: Deploy multiple load-tester pods to simulate thousands of concurrent users
- Large-scale workloads: Test production-scale traffic patterns without client-side bottlenecks
- Automated orchestration: Kubernetes operators to manage benchmark lifecycles and resource allocation