This guide shows how to benchmark Dynamo deployments using AIPerf, a comprehensive tool for measuring generative AI inference performance. AIPerf provides detailed metrics, real-time dashboards, and automatic visualization — you call it directly against your endpoints.
You can benchmark any combination of:
If you are benchmarking a supported model, backend, hardware target, or Dynamo feature, check Dynamo Recipes before writing deployment and benchmark manifests from scratch. Recipes provide known-good starting points, including:
deploy.yaml manifests for tuned DynamoGraphDeployment configurations
perf.yaml benchmark jobs for many recipes
Use recipes when you want a validated baseline or a feature comparison such as aggregated vs. disaggregated serving, KV-aware routing, or embedding cache. Use this guide when you need to benchmark a custom DGD, compare arbitrary HTTP endpoints, or adapt an existing recipe to your own environment.
Client-side runs benchmarks on your local machine via port-forwarding. Server-side runs benchmarks directly within the Kubernetes cluster using internal service URLs.
TLDR: Need high performance/load testing? Server-side. Just quick testing/comparison? Client-side.
→ Go to Client-Side Benchmarking (Local)
→ Go to Server-Side Benchmarking (In-Cluster)
AIPerf is a standalone benchmarking tool available on PyPI. It is pre-installed in Dynamo container images. Key features:
aiperf plot (Pareto curves, time series, GPU telemetry)
Important: The --model parameter must match the model deployed at the endpoint.
For full documentation, see the AIPerf docs.
Client-side benchmarking runs on your local machine and connects to Kubernetes deployments via port-forwarding.
Dynamo container environment - You must be running inside a Dynamo container with AIPerf pre-installed, or install it locally:
HTTP endpoints - Ensure you have HTTP endpoints available for benchmarking. These can be:
Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform following the installation guide. Then deploy your DynamoGraphDeployment.
Prefer Dynamo Recipes
when a recipe matches your model, backend, hardware, and serving mode. Recipes
include tuned deploy.yaml manifests and, in many cases, matching perf.yaml
benchmark jobs that you can run or adapt. If no recipe matches, start from the
Deployment Overview or the backend
examples in examples/backends.
Wait for model readiness. Before benchmarking, ensure your deployment has fully loaded the model. Check pod logs or hit the health endpoint (curl http://localhost:8000/health); it should return 200 OK before you proceed.
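The readiness check can also be scripted so that a benchmark never starts against a half-loaded model. The following is a minimal sketch, not part of Dynamo or AIPerf: the function names, timeout, and polling interval are illustrative assumptions, and the URL is the health endpoint mentioned above.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True only if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,   # assumed upper bound for model loading
    interval_s: float = 5.0,    # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example usage (assumes the port-forward to the frontend is already active):
# ready = wait_until_ready(lambda: http_health_probe("http://localhost:8000/health"))
```

The probe is passed in as a callable so the polling loop can be exercised without a live endpoint.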
This produces results in artifacts/ and prints a summary table to the console:
Actual numbers will vary based on model size, hardware, batch size, and network conditions. Client-side benchmarks include port-forwarding overhead — use server-side benchmarking for accurate performance measurement.
To stop the port-forward when done: kill %1 (or kill <PID>).
To understand how your deployment behaves across load levels, run a concurrency sweep. Each concurrency level c sends max(c*3, 10) requests, which is enough for stable measurements:
Note: Adjust concurrency levels to match your deployment’s capacity. Very high concurrency on a small deployment (e.g., c250 on a single GPU) will cause server errors. Start with lower values and increase until you find the saturation point.
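The request-count rule max(c*3, 10) can be worked through for a few concurrency levels. This is a small sketch; the function names and the example levels are illustrative, not taken from AIPerf.

```python
from typing import Iterable, List, Tuple


def request_count(concurrency: int) -> int:
    """Requests to send at one concurrency level: max(c*3, 10)."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return max(concurrency * 3, 10)


def sweep_plan(levels: Iterable[int]) -> List[Tuple[int, int]]:
    """Pair each concurrency level with its request count."""
    return [(c, request_count(c)) for c in levels]


# Illustrative levels only; choose values that suit your deployment's capacity.
for c, n in sweep_plan([1, 2, 5, 10, 50]):
    print(f"concurrency={c} requests={n}")
```

For these levels the plan is 10, 10, 15, 30, and 150 requests: the floor of 10 applies until c reaches 4, after which the count scales at three requests per concurrent slot.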
Tear down deployment A and deploy deployment B with a different configuration. Kill the previous port-forward (kill %1), then repeat:
AIPerf automatically generates plots based on available data:
--gpu-telemetry during profiling if DCGM is running)
Here is an example Pareto frontier from a concurrency sweep of Qwen3-0.6B on 8x H200 with vLLM, showing the tradeoff between user experience (tokens/sec per user) and resource efficiency (tokens/sec per GPU):

See the AIPerf Visualization Guide for full details on plot customization, experiment classification, and themes.
For the complete CLI reference, see aiperf profile --help or the CLI docs.
To enforce a specific output length, pass ignore_eos and min_tokens via --extra-inputs:
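A minimal sketch of how those two settings might be assembled into command-line arguments. It assumes that --extra-inputs accepts key:value pairs and can be repeated; confirm the exact syntax with aiperf profile --help. The helper name output_length_args is hypothetical.

```python
from typing import List


def output_length_args(tokens: int) -> List[str]:
    """Build --extra-inputs arguments that request a fixed output length.

    Assumption: --extra-inputs takes key:value pairs and may be repeated.
    """
    if tokens < 1:
        raise ValueError("tokens must be at least 1")
    return [
        "--extra-inputs", "ignore_eos:true",        # keep generating past EOS
        "--extra-inputs", f"min_tokens:{tokens}",   # lower bound on output length
    ]


# These arguments would be appended to an existing aiperf profile invocation.
print(" ".join(output_length_args(256)))
```

Setting ignore_eos prevents the model from stopping early, and min_tokens sets the lower bound, so the two together pin the output length from below.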
Each aiperf profile run produces an artifact directory containing:
profile_export_aiperf.json: Structured metrics (latency, throughput, TTFT, ITL, etc.)
profile_export.jsonl: Per-request raw data
profile_export_aiperf.csv: CSV format metrics
Results are organized by the --artifact-dir you specify. For concurrency sweeps, a common pattern is:
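One possible layout, sketched under assumptions: a base directory with one subdirectory per deployment and one per concurrency level, named c<N>. The directory names and helper functions are hypothetical and are not an AIPerf convention; only the three file names come from the list above.

```python
from pathlib import PurePosixPath
from typing import List

# File names listed above for each aiperf profile run.
ARTIFACT_FILES = (
    "profile_export_aiperf.json",
    "profile_export.jsonl",
    "profile_export_aiperf.csv",
)


def sweep_artifact_dir(base: str, deployment: str, concurrency: int) -> PurePosixPath:
    """Hypothetical convention: <base>/<deployment>/c<concurrency>."""
    return PurePosixPath(base) / deployment / f"c{concurrency}"


def expected_files(run_dir: PurePosixPath) -> List[PurePosixPath]:
    """Paths of the per-run outputs inside one artifact directory."""
    return [run_dir / name for name in ARTIFACT_FILES]


# Each path would be passed as --artifact-dir for the matching run.
for c in (1, 10, 50):
    print(sweep_artifact_dir("artifacts", "deployment-a", c))
```

Keeping one directory per concurrency level means each run's outputs stay separate and can be compared across deployments.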
Server-side benchmarking runs directly within the Kubernetes cluster, eliminating port-forwarding overhead and enabling high-load testing.
Deploy a DynamoGraphDeployment using a matching
Dynamo Recipe, the
Deployment Overview, or the backend
examples in examples/backends.
Ensure it has a frontend service exposed and the model is fully loaded before
running benchmarks — check pod logs or verify the health endpoint returns
200 OK.
If your recipe includes a perf.yaml, start from that benchmark job because it
already encodes the model, endpoint, workload shape, and result collection
expected by the recipe. Otherwise, use the generic job below.
First, edit benchmarks/incluster/benchmark_job.yaml to match your deployment:
MODEL variable
URL variable (use <svc_name>.<namespace>.svc.cluster.local:port for cross-namespace access)
for c in ... loop
image field if needed
Then deploy:
When referencing services in other namespaces, use full Kubernetes DNS:
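A minimal sketch of building that address from the <svc_name>.<namespace>.svc.cluster.local:port form given above. The helper name and the example service, namespace, and port values are hypothetical.

```python
def service_address(svc_name: str, namespace: str, port: int) -> str:
    """Fully qualified in-cluster address for a Kubernetes Service."""
    if not svc_name or not namespace:
        raise ValueError("svc_name and namespace must be non-empty")
    if not 1 <= port <= 65535:
        raise ValueError("port must be in the range 1-65535")
    return f"{svc_name}.{namespace}.svc.cluster.local:{port}"


# Hypothetical values; substitute your frontend service, namespace, and port.
print(service_address("my-frontend", "my-namespace", 8000))
```

The fully qualified form resolves from any namespace, whereas a bare service name only resolves from within the service's own namespace.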
dynamo-pvc is properly configured and accessible
For development and testing purposes, Dynamo provides DynoSim and the mocker backend to simulate LLM inference without requiring actual GPU resources. This is useful for:
Mocker is the live simulated engine in DynoSim: it mimics the API and behavior of real backends (SGLang, TensorRT-LLM, vLLM) but generates mock responses instead of running actual inference. Use DynoSim Runs for one simulated workload/config trial and DynoSim Sweeps when you want to search across many candidate configurations.
See Live Simulation with Mocker for usage examples and configuration options.
AIPerf has many capabilities beyond basic profiling. Here are some that are particularly useful for Dynamo benchmarking: