--- title: Dynamo Benchmarking subtitle: Benchmark and compare performance across Dynamo deployment configurations --- This guide shows how to benchmark Dynamo deployments using [AIPerf](https://github.com/ai-dynamo/aiperf), a comprehensive tool for measuring generative AI inference performance. AIPerf provides detailed metrics, real-time dashboards, and automatic visualization — you call it directly against your endpoints. You can benchmark any combination of: - **DynamoGraphDeployments** - **External HTTP endpoints** (vLLM, llm-d, AIBrix, etc.) ## Choosing Your Benchmarking Approach **Client-side** runs benchmarks on your local machine via port-forwarding. **Server-side** runs benchmarks directly within the Kubernetes cluster using internal service URLs. **TLDR:** Need high performance/load testing? Server-side. Just quick testing/comparison? Client-side. ### Use Client-Side Benchmarking When: - You want to quickly test deployments - You want immediate access to results on your local machine - You're comparing external services or deployments (not necessarily just Dynamo deployments) - You need to run benchmarks from your laptop/workstation → **[Go to Client-Side Benchmarking (Local)](#client-side-benchmarking-local)** ### Use Server-Side Benchmarking When: - You have a development environment with kubectl access - You're doing performance validation with high load/speed requirements - You're experiencing timeouts or performance issues with client-side benchmarking - You want optimal network performance (no port-forwarding overhead) - You're running automated CI/CD pipelines - You need isolated execution environments - You want persistent result storage in the cluster → **[Go to Server-Side Benchmarking (In-Cluster)](#server-side-benchmarking-in-cluster)** ### Quick Comparison | Feature | Client-Side | Server-Side | |---------|-------------|-------------| | **Location** | Your local machine | Kubernetes cluster | | **Network** | Port-forwarding required | Direct service DNS | | **Setup** | Quick and simple | Requires cluster resources | | **Performance** | Limited by local resources, may timeout under high load | Optimal cluster performance, handles high load | | **Isolation** | Shared environment | Isolated job execution | | **Results** | Local filesystem | Persistent volumes | | **Best for** | Light load | High load | ## AIPerf Overview [AIPerf](https://github.com/ai-dynamo/aiperf) is a standalone benchmarking tool available on [PyPI](https://pypi.org/project/aiperf/). It is pre-installed in Dynamo container images. Key features: - Measures latency, throughput, TTFT, inter-token latency, and more - Multiple load modes: concurrency, request-rate, trace replay - Automatic visualization with `aiperf plot` (Pareto curves, time series, GPU telemetry) - Interactive dashboard mode for real-time exploration - Arrival patterns (Poisson, constant, gamma) for realistic traffic simulation - Warmup phases, gradual ramping, and multi-URL load balancing **Important**: The `--model` parameter must match the model deployed at the endpoint. For full documentation, see the [AIPerf docs](https://github.com/ai-dynamo/aiperf/tree/main/docs). --- # Client-Side Benchmarking (Local) Client-side benchmarking runs on your local machine and connects to Kubernetes deployments via port-forwarding. ## Prerequisites 1. **Dynamo container environment** - You must be running inside a Dynamo container with AIPerf pre-installed, or install it locally: ```bash pip install aiperf ``` 2. **HTTP endpoints** - Ensure you have HTTP endpoints available for benchmarking. These can be: - DynamoGraphDeployments exposed via HTTP endpoints - External services (vLLM, llm-d, AIBrix, etc.) - Any HTTP endpoint serving OpenAI-compatible models ## User Workflow ### Step 1: Set Up Cluster and Deploy Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform following the [installation guide](/dynamo/dev/kubernetes-deployment/deployment-guide/detailed-installation-guide). Then deploy your DynamoGraphDeployments using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). ### Step 2: Port-Forward and Run a Single Benchmark > **Wait for model readiness.** Before benchmarking, ensure your deployment has fully loaded the model. Check pod logs or hit the health endpoint (`curl http://localhost:8000/health`) — it should return `200 OK` before you proceed. ```bash # Port-forward the frontend service kubectl port-forward -n svc/ 8000:8000 > /dev/null 2>&1 & # Run a single benchmark aiperf profile \ --model \ --url http://localhost:8000 \ --endpoint-type chat \ --streaming \ --concurrency 10 \ --request-count 100 \ --synthetic-input-tokens-mean 2000 \ --output-tokens-mean 256 ``` This produces results in `artifacts/` and prints a summary table to the console: ```text NVIDIA AIPerf | LLM Metrics ┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓ ┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p50 ┃ std ┃ ┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩ │ Time to First Token │ 234.56 │ 189.23 │ 298.45 │ 289.34 │ 267.12 │ 231.12 │ 28.45 │ │ (ms) │ │ │ │ │ │ │ │ │ Request Latency │ 1234.56 │ 987.34 │ 1567.89 │ 1534.23 │ 1456.78 │ 1223.45 │ 156.78 │ │ (ms) │ │ │ │ │ │ │ │ │ Inter Token Latency │ 15.67 │ 12.34 │ 19.45 │ 19.01 │ 18.23 │ 15.45 │ 1.89 │ │ (ms) │ │ │ │ │ │ │ │ │ Request Throughput │ 31.45 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │ │ (requests/sec) │ │ │ │ │ │ │ │ └─────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘ ``` *Actual numbers will vary based on model size, hardware, batch size, and network conditions. Client-side benchmarks include port-forwarding overhead — use [server-side benchmarking](#server-side-benchmarking-in-cluster) for accurate performance measurement.* To stop the port-forward when done: `kill %1` (or `kill `). ### Step 3: Concurrency Sweep for Pareto Analysis To understand how your deployment behaves across load levels, run a concurrency sweep. Each concurrency level sends enough requests for stable measurements (`max(c*3, 10)`): ```bash MODEL="" URL="http://localhost:8000" for c in 1 2 5 10 50 100; do aiperf profile \ --model "$MODEL" \ --url "$URL" \ --endpoint-type chat \ --streaming \ --concurrency $c \ --request-count $(( c * 3 > 10 ? c * 3 : 10 )) \ --synthetic-input-tokens-mean 2000 \ --output-tokens-mean 256 \ --artifact-dir "artifacts/deployment-a/c$c" done ``` **Note**: Adjust concurrency levels to match your deployment's capacity. Very high concurrency on a small deployment (e.g., c250 on a single GPU) will cause server errors. Start with lower values and increase until you find the saturation point. ### Step 4: [If Comparative] Benchmark a Second Deployment Teardown deployment A and deploy deployment B with a different configuration. Kill the previous port-forward (`kill %1`), then repeat: ```bash kubectl port-forward -n svc/ 8000:8000 > /dev/null 2>&1 & for c in 1 2 5 10 50 100; do aiperf profile \ --model "$MODEL" \ --url "$URL" \ --endpoint-type chat \ --streaming \ --concurrency $c \ --request-count $(( c * 3 > 10 ? c * 3 : 10 )) \ --synthetic-input-tokens-mean 2000 \ --output-tokens-mean 256 \ --artifact-dir "artifacts/deployment-b/c$c" done ``` ### Step 5: Generate Visualizations ```bash # Compare all runs — auto-detects multi-run directories aiperf plot artifacts/deployment-a artifacts/deployment-b # Or compare all subdirectories under a parent aiperf plot artifacts/ # Launch interactive dashboard for exploration aiperf plot artifacts/ --dashboard ``` AIPerf automatically generates plots based on available data: - **TTFT vs Throughput** — find the sweet spot between responsiveness and capacity (always generated for multi-run comparisons) - **Pareto Curves** — throughput per GPU vs latency and interactivity (only generated when GPU telemetry data is available — add `--gpu-telemetry` during profiling if DCGM is running) - **Time series** — per-request TTFT, ITL, and latency over time (generated for single-run analysis) Here is an example Pareto frontier from a concurrency sweep of Qwen3-0.6B on 8x H200 with vLLM, showing the tradeoff between user experience (tokens/sec per user) and resource efficiency (tokens/sec per GPU): ![AIPerf Pareto Frontier](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/b7c4568cf0709ef464121da5844314308ca19c1401d6339e9f553d55b74f3580/pages-dev/assets/img/aiperf-pareto-frontier.png) See the [AIPerf Visualization Guide](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/plot.md) for full details on plot customization, experiment classification, and themes. ## Use Cases - **Compare DynamoGraphDeployments** (e.g., aggregated vs disaggregated configurations) - **Compare different backends** (e.g., SGLang vs TensorRT-LLM vs vLLM) - **Compare Dynamo vs other platforms** (e.g., Dynamo vs llm-d vs AIBrix) - **Compare different models** (e.g., Llama-3-8B vs Llama-3-70B vs Qwen-3-0.6B) - **Compare different hardware configurations** (e.g., H100 vs A100 vs H200) - **Compare different parallelization strategies** (e.g., different GPU counts or memory configurations) ## AIPerf Quick Reference ### Commonly Used Options ```text aiperf profile [OPTIONS] REQUIRED: --model MODEL Model name (must match the deployed model) --url URL Endpoint URL (e.g., http://localhost:8000) COMMON OPTIONS: --endpoint-type TYPE Endpoint type: chat, completions, embeddings (default: chat) --streaming Enable streaming responses --concurrency N Number of concurrent requests --request-rate N Target requests per second (alternative to --concurrency) --request-count N Total number of requests to send --benchmark-duration N Run for N seconds instead of a fixed request count --synthetic-input-tokens-mean N Average input sequence length in tokens --output-tokens-mean N Average output sequence length in tokens --artifact-dir DIR Output directory for results (default: artifacts/) --warmup-request-count N Warmup requests before measurement --ui TYPE UI mode: dashboard, simple, none (default: dashboard) ``` For the complete CLI reference, see `aiperf profile --help` or the [CLI docs](https://github.com/ai-dynamo/aiperf/blob/main/docs/cli-options.md). ### Output Sequence Length To enforce a specific output length, pass `ignore_eos` and `min_tokens` via `--extra-inputs`: ```bash aiperf profile \ --model \ --url http://localhost:8000 \ --endpoint-type chat \ --streaming \ --concurrency 10 \ --output-tokens-mean 256 \ --extra-inputs max_tokens:256 \ --extra-inputs min_tokens:256 \ --extra-inputs ignore_eos:true ``` ### Understanding Results Each `aiperf profile` run produces an artifact directory containing: - **`profile_export_aiperf.json`** — Structured metrics (latency, throughput, TTFT, ITL, etc.) - **`profile_export.jsonl`** — Per-request raw data - **`profile_export_aiperf.csv`** — CSV format metrics Results are organized by the `--artifact-dir` you specify. For concurrency sweeps, a common pattern is: ```text artifacts/ ├── deployment-a/ │ ├── c1/ │ │ ├── profile_export_aiperf.json │ │ └── profile_export.jsonl │ ├── c10/ │ ├── c50/ │ └── c100/ ├── deployment-b/ │ ├── c1/ │ ├── c10/ │ ├── c50/ │ └── c100/ └── plots/ # Generated by aiperf plot ├── ttft_vs_throughput.png ├── pareto_curve_throughput_per_gpu_vs_latency.png # If GPU telemetry available └── pareto_curve_throughput_per_gpu_vs_interactivity.png # If GPU telemetry available ``` --- # Server-Side Benchmarking (In-Cluster) Server-side benchmarking runs directly within the Kubernetes cluster, eliminating port-forwarding overhead and enabling high-load testing. ## Prerequisites 1. **Kubernetes cluster** with NVIDIA GPUs and Dynamo namespace setup (see [Dynamo Kubernetes Platform docs](/dynamo/dev/kubernetes-deployment/deployment-guide)) 2. **Storage**: PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md)) 3. **Docker image** containing AIPerf (Dynamo runtime images include it) ## Quick Start ### Step 1: Deploy Your DynamoGraphDeployment Deploy using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Ensure it has a frontend service exposed and the model is fully loaded before running benchmarks — check pod logs or verify the health endpoint returns `200 OK`. ### Step 2: Configure and Run Benchmark Job First, edit `benchmarks/incluster/benchmark_job.yaml` to match your deployment: - **Model name**: Update the `MODEL` variable - **Service URL**: Update the `URL` variable (use `..svc.cluster.local:port` for cross-namespace access) - **Concurrency levels**: Adjust the `for c in ...` loop - **Docker image**: Update the `image` field if needed Then deploy: ```bash export NAMESPACE=benchmarking # Deploy the benchmark job kubectl apply -f benchmarks/incluster/benchmark_job.yaml -n $NAMESPACE # Monitor the job kubectl logs -f job/dynamo-benchmark -n $NAMESPACE ``` ### Step 3: Retrieve Results ```bash # Create access pod (skip if already running) kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s # Download the results kubectl cp $NAMESPACE/pvc-access-pod:/data/results ./results # Cleanup kubectl delete pod pvc-access-pod -n $NAMESPACE ``` ### Step 4: Generate Plots ```bash aiperf plot ./results ``` ## Cross-Namespace Service Access When referencing services in other namespaces, use full Kubernetes DNS: ```bash # Same namespace --url http://vllm-agg-frontend:8000 # Different namespace --url http://vllm-agg-frontend.production.svc.cluster.local:8000 ``` ## Monitoring and Debugging ```bash # Check job status kubectl describe job dynamo-benchmark -n $NAMESPACE # Follow logs kubectl logs -f job/dynamo-benchmark -n $NAMESPACE # Check pod status kubectl get pods -n $NAMESPACE -l job-name=dynamo-benchmark # Debug failed pod kubectl describe pod -n $NAMESPACE ``` ### Troubleshooting 1. **Service not found**: Ensure your DynamoGraphDeployment frontend service is running 2. **PVC access**: Check that `dynamo-pvc` is properly configured and accessible 3. **Image pull issues**: Ensure the Docker image is accessible from the cluster 4. **Resource constraints**: Adjust resource limits if the job is being evicted ```bash # Check PVC status kubectl get pvc dynamo-pvc -n $NAMESPACE # Verify service exists and has endpoints kubectl get svc -n $NAMESPACE kubectl get endpoints -n $NAMESPACE ``` --- ## Testing with Mocker Backend For development and testing purposes, Dynamo provides a [mocker backend](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) that simulates LLM inference without requiring actual GPU resources. This is useful for: - **Testing deployments** without expensive GPU infrastructure - **Developing and debugging** router, planner, or frontend logic - **CI/CD pipelines** that need to validate infrastructure without model execution - **Benchmarking framework validation** to ensure your setup works before using real backends The mocker backend mimics the API and behavior of real backends (SGLang, TensorRT-LLM, vLLM) but generates mock responses instead of running actual inference. See the [mocker directory](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) for usage examples and configuration options. --- ## Advanced AIPerf Features AIPerf has many capabilities beyond basic profiling. Here are some particularly useful for Dynamo benchmarking: | Feature | Description | Docs | |---------|-------------|------| | Trace Replay | Replay production traces for deterministic benchmarking | [Trace Replay](https://github.com/ai-dynamo/aiperf/blob/main/docs/benchmark-modes/trace-replay.md) | | Arrival Patterns | Poisson, constant, gamma traffic distributions | [Arrival Patterns](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/arrival-patterns.md) | | Gradual Ramping | Smooth ramp-up of concurrency and request rate | [Ramping](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/ramping.md) | | Warmup Phase | Eliminate cold-start effects from measurements | [Warmup](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/warmup.md) | | Multi-URL Load Balancing | Distribute requests across multiple endpoints | [Multi-URL](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/multi-url-load-balancing.md) | | GPU Telemetry | Collect DCGM metrics during benchmarking | [GPU Telemetry](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/gpu-telemetry.md) | | Goodput Analysis | SLO-based throughput measurement | [Goodput](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/goodput.md) | | Timeslice Analysis | Per-timeslice performance breakdown | [Timeslices](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/timeslices.md) | | Multi-Turn Conversations | Benchmark multi-turn chat workloads | [Multi-Turn](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/multi-turn.md) | | Experiment Classification | Baseline vs treatment semantic colors in plots | [Plotting](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/plot.md) |