This guide shows how to benchmark Dynamo deployments using AIPerf, a comprehensive tool for measuring generative AI inference performance. AIPerf provides detailed metrics, real-time dashboards, and automatic visualization — you call it directly against your endpoints.
You can benchmark any combination of:
Client-side runs benchmarks on your local machine via port-forwarding. Server-side runs benchmarks directly within the Kubernetes cluster using internal service URLs.
TLDR: Need high performance/load testing? Server-side. Just quick testing/comparison? Client-side.
→ Go to Client-Side Benchmarking (Local)
→ Go to Server-Side Benchmarking (In-Cluster)
AIPerf is a standalone benchmarking tool available on PyPI. It is pre-installed in Dynamo container images. Key features:
aiperf plot (Pareto curves, time series, GPU telemetry)

Important: The --model parameter must match the model deployed at the endpoint.
For full documentation, see the AIPerf docs.
Client-side benchmarking runs on your local machine and connects to Kubernetes deployments via port-forwarding.
Dynamo container environment - Run inside a Dynamo container, which has AIPerf pre-installed, or install AIPerf locally:
HTTP endpoints - Ensure you have HTTP endpoints available for benchmarking. These can be:
Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform following the installation guide. Then deploy your DynamoGraphDeployments using the deployment documentation.
Wait for model readiness. Before benchmarking, ensure your deployment has fully loaded the model. Check pod logs or hit the health endpoint (curl http://localhost:8000/health); it should return 200 OK before you proceed.
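The readiness check above can also be scripted so a benchmark never starts against a half-loaded model. The following is a minimal sketch, assuming the health endpoint answers plain HTTP on the port-forwarded address; the function name, timeout values, and the `fetch` hook are illustrative and not part of Dynamo or AIPerf.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url, timeout_s=300.0, interval_s=5.0, fetch=None):
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses.

    `fetch` is a hypothetical hook (url -> status code or None) so the
    polling loop can be exercised without a live deployment. By default
    the function issues a real GET request with urllib.
    """

    def default_fetch(u):
        try:
            with urllib.request.urlopen(u, timeout=10) as resp:
                return resp.status
        except urllib.error.HTTPError as exc:
            # The server answered, but with a non-2xx status (e.g. 503).
            return exc.code
        except (urllib.error.URLError, OSError):
            # Connection refused, DNS failure, timeout: not reachable yet.
            return None

    fetch = fetch or default_fetch
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch(url) == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_health("http://localhost:8000/health")` returns `True` once the endpoint reports 200 and `False` if the timeout expires first, so a wrapper script can abort instead of collecting misleading numbers.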
This produces results in artifacts/ and prints a summary table to the console:
Actual numbers will vary based on model size, hardware, batch size, and network conditions. Client-side benchmarks include port-forwarding overhead — use server-side benchmarking for accurate performance measurement.
To stop the port-forward when done: kill %1 (or kill <PID>).
To understand how your deployment behaves across load levels, run a concurrency sweep. Each concurrency level c sends max(c*3, 10) requests, which is enough for stable measurements:
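The request-count rule max(c*3, 10) is easy to check by hand before launching a sweep. The sketch below only computes the plan; the concurrency levels shown are illustrative and should be replaced with values suited to your deployment.

```python
def requests_for(concurrency):
    """Number of requests to send at a given concurrency: max(c*3, 10)."""
    if concurrency < 1:
        raise ValueError("concurrency must be a positive integer")
    return max(concurrency * 3, 10)


def sweep_plan(levels):
    """Pair each concurrency level with its request count."""
    return [(c, requests_for(c)) for c in levels]


# Illustrative levels only; pick values that match your deployment.
for c, n in sweep_plan([1, 2, 5, 10, 50]):
    print(f"concurrency={c:<3} requests={n}")
```

The floor of 10 only matters at low concurrency (c of 3 or less); from c=4 upward the count is simply three requests per concurrent slot.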
Note: Adjust concurrency levels to match your deployment’s capacity. Very high concurrency on a small deployment (e.g., concurrency 250 on a single GPU) will cause server errors. Start with lower values and increase until you find the saturation point.
Tear down deployment A and deploy deployment B with a different configuration. Kill the previous port-forward (kill %1), then repeat:
AIPerf automatically generates plots based on available data:
--gpu-telemetry during profiling if DCGM is running)

Here is an example Pareto frontier from a concurrency sweep of Qwen3-0.6B on 8x H200 with vLLM, showing the tradeoff between user experience (tokens/sec per user) and resource efficiency (tokens/sec per GPU):

See the AIPerf Visualization Guide for full details on plot customization, experiment classification, and themes.
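To make the Pareto frontier idea concrete, here is a small sketch of how frontier points can be selected from sweep results when both axes (tokens/sec per user and tokens/sec per GPU) are to be maximized. It is not AIPerf's implementation, and the sample numbers are made up for illustration rather than measured.

```python
def pareto_frontier(points):
    """Return the points that no other point beats on both axes.

    `points` is an iterable of (tokens_per_sec_per_user,
    tokens_per_sec_per_gpu) pairs; both values are maximized.
    """
    frontier = []
    best_y = float("-inf")
    # Visit points from highest to lowest x. A point is kept only if its
    # y exceeds every y seen so far, i.e. no point with an equal or
    # larger x also has an equal or larger y.
    for x, y in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        if y > best_y:
            frontier.append((x, y))
            best_y = y
    return sorted(frontier)


# Made-up (per-user, per-GPU) throughput pairs, one per concurrency level.
samples = [(100, 500), (80, 900), (60, 1200), (70, 800), (60, 1000), (40, 1200)]
print(pareto_frontier(samples))
```

Points off the frontier, such as (70, 800) in the sample data, are configurations where another run delivers at least as much on both axes, so they are never the best tradeoff.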
For the complete CLI reference, see aiperf profile --help or the CLI docs.
To enforce a specific output length, pass ignore_eos and min_tokens via --extra-inputs:
Each aiperf profile run produces an artifact directory containing:
profile_export_aiperf.json - Structured metrics (latency, throughput, TTFT, ITL, etc.)
profile_export.jsonl - Per-request raw data
profile_export_aiperf.csv - CSV format metrics

Results are organized by the --artifact-dir you specify. For concurrency sweeps, a common pattern is:
Server-side benchmarking runs directly within the Kubernetes cluster, eliminating port-forwarding overhead and enabling high-load testing.
Deploy using the deployment documentation. Ensure it has a frontend service exposed and the model is fully loaded before running benchmarks — check pod logs or verify the health endpoint returns 200 OK.
First, edit benchmarks/incluster/benchmark_job.yaml to match your deployment:
MODEL variable
URL variable (use <svc_name>.<namespace>.svc.cluster.local:port for cross-namespace access)
for c in ... loop
image field if needed

Then deploy:
When referencing services in other namespaces, use full Kubernetes DNS:
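As a sketch of the <svc_name>.<namespace>.svc.cluster.local:port form, the helper below assembles an in-cluster URL from its parts. The service name, namespace, and port in the example are hypothetical placeholders, and the name check is only a loose sanity test, not the full Kubernetes validation.

```python
import re

# Loose check for a DNS label: lowercase alphanumerics and hyphens,
# starting and ending with an alphanumeric, at most 63 characters.
_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$")


def service_url(service, namespace, port, scheme="http"):
    """Build a URL using the full in-cluster DNS name of a service."""
    for label in (service, namespace):
        if not _LABEL.match(label):
            raise ValueError(f"not a valid DNS label: {label!r}")
    return f"{scheme}://{service}.{namespace}.svc.cluster.local:{port}"


# Hypothetical names; substitute your own frontend service and namespace.
print(service_url("my-frontend", "my-namespace", 8000))
```

The resulting string is what would go into the URL variable of the benchmark job when the target service lives in a different namespace than the job itself.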
dynamo-pvc is properly configured and accessible

For development and testing purposes, Dynamo provides a mocker backend that simulates LLM inference without requiring actual GPU resources. This is useful for:
The mocker backend mimics the API and behavior of real backends (SGLang, TensorRT-LLM, vLLM) but generates mock responses instead of running actual inference.
See the mocker directory for usage examples and configuration options.
AIPerf has many capabilities beyond basic profiling. Here are some particularly useful for Dynamo benchmarking: