This benchmarking framework lets you compare performance across any combination of:
Dynamo provides two benchmarking approaches to suit different use cases: client-side and server-side. Client-side refers to running benchmarks on your local machine and connecting to Kubernetes deployments via port-forwarding, while server-side refers to running benchmarks directly within the Kubernetes cluster using internal service URLs. Which method to use depends on your use case.
TLDR: Need high performance/load testing? Server-side. Just quick testing/comparison? Client-side.
→ Go to Client-Side Benchmarking (Local)
→ Go to Server-Side Benchmarking (In-Cluster)
The framework is a Python-based wrapper around aiperf that:
Default sequence lengths: Input: 2000 tokens, Output: 256 tokens (configurable with --isl and --osl)
Important: The --model parameter configures AIPerf for benchmarking and provides logging context. The default --model value in the benchmarking script is Qwen/Qwen3-0.6B, but it must match the model deployed at the endpoint(s).
Client-side benchmarking runs on your local machine and connects to Kubernetes deployments via port-forwarding.
Dynamo container environment - You must be running inside a Dynamo container with the benchmarking tools pre-installed.
HTTP endpoints - Ensure you have HTTP endpoints available for benchmarking. These can be:
Benchmark dependencies - Since benchmarks run locally, you need to install the required Python dependencies. Install them using:
Follow these steps to benchmark Dynamo deployments using client-side benchmarking:
Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Cloud platform. First follow the installation guide to install Dynamo Cloud, then use deploy/utils/README to set up benchmarking resources.
Deploy your DynamoGraphDeployments separately using the deployment documentation. Each deployment should have a frontend service exposed.
If comparing multiple deployments, tear down deployment A and deploy deployment B with a different configuration.
The benchmarking framework supports various comparative analysis scenarios:
plots is reserved.
--model parameter configures AIPerf for testing and logging, and must match the model deployed at the endpoint
The Python benchmarking module:
The Python plotting module:
<OUTPUT_DIR>/plots/
The plotting script supports several options for customizing which experiments to visualize:
Available Options:
--data-dir: Directory containing benchmark results (required)
--benchmark-name: Specific benchmark experiment name to plot (can be specified multiple times). Names must match subdirectory names under the data dir.
--output-dir: Custom output directory for plots (defaults to data-dir/plots)
Note: If --benchmark-name is not specified, the script will plot all subdirectories found in the data directory.
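The options above can be combined into a single invocation. The following is a minimal sketch that assembles such a command line. It assumes the plotting script is launched as a module with python3 -m benchmarks.utils.plot; the helper names build_plot_command and default_plot_dir are hypothetical and exist only for illustration.

```python
import shlex
from pathlib import Path
from typing import List, Optional, Sequence


def default_plot_dir(data_dir: str) -> Path:
    """Return the documented default output location: <data-dir>/plots."""
    return Path(data_dir) / "plots"


def build_plot_command(
    data_dir: str,
    benchmark_names: Sequence[str] = (),
    output_dir: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for the plotting script.

    Assumption: the script is run as `python3 -m benchmarks.utils.plot`.
    """
    cmd = ["python3", "-m", "benchmarks.utils.plot", "--data-dir", data_dir]
    # --benchmark-name may be given several times, once per experiment.
    for name in benchmark_names:
        cmd += ["--benchmark-name", name]
    # Omitting --output-dir leaves the default of <data-dir>/plots in effect.
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    return cmd


if __name__ == "__main__":
    command = build_plot_command(
        "./benchmarks/results", ["aggregated", "vllm-disagg"]
    )
    print(shlex.join(command))
```

Passing no benchmark names produces a command without --benchmark-name, which per the note above plots every subdirectory found in the data directory.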
The benchmarking framework supports any HuggingFace-compatible LLM. Specify your model with the benchmark script's --model parameter; it must match the model name of the deployment. To override the default sequence lengths (2000 input tokens, 256 output tokens), pass the --isl and --osl flags.
The benchmarking framework is built around Python modules that provide direct control over the benchmark workflow. The Python benchmarking module connects to your existing endpoints, runs the benchmarks, and can generate plots. Deployment is user-managed and out of scope for this tool.
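To show how the flags mentioned in this guide fit together, here is a minimal sketch that builds an argument list for the benchmarking module. It assumes the module is launched as python3 -m benchmarks.utils.benchmark and that --benchmark-name, --endpoint-url, --model, --isl, and --osl can be passed in one call. The helper name build_benchmark_command and the example endpoint http://localhost:8000 are hypothetical.

```python
import shlex
from typing import List

# Defaults quoted from this guide.
DEFAULT_MODEL = "Qwen/Qwen3-0.6B"
DEFAULT_ISL = 2000  # input sequence length, in tokens
DEFAULT_OSL = 256   # output sequence length, in tokens


def build_benchmark_command(
    benchmark_name: str,
    endpoint_url: str,
    model: str = DEFAULT_MODEL,
    isl: int = DEFAULT_ISL,
    osl: int = DEFAULT_OSL,
) -> List[str]:
    """Assemble an argument list for the benchmarking module.

    Assumption: the module is run as `python3 -m benchmarks.utils.benchmark`.
    The model must match whatever is deployed at the endpoint.
    """
    if isl <= 0 or osl <= 0:
        raise ValueError("isl and osl must be positive token counts")
    return [
        "python3", "-m", "benchmarks.utils.benchmark",
        "--benchmark-name", benchmark_name,
        "--endpoint-url", endpoint_url,
        "--model", model,
        "--isl", str(isl),
        "--osl", str(osl),
    ]


if __name__ == "__main__":
    # Hypothetical port-forwarded endpoint for a client-side run.
    print(shlex.join(build_benchmark_command("aggregated", "http://localhost:8000")))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if it is later handed to subprocess.run.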
The plotting system supports up to 12 different benchmarks in a single comparison.
You can customize the concurrency levels using the CONCURRENCIES environment variable:
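The value format of CONCURRENCIES is not shown in this section. The sketch below assumes a comma-separated list of positive integers (for example 1,5,20) and illustrates how such a value could be parsed and validated before a run. The function name parse_concurrencies and the fallback levels are hypothetical and are not the framework's actual defaults.

```python
import os
from typing import List, Mapping, Optional, Sequence


def parse_concurrencies(
    env: Optional[Mapping[str, str]] = None,
    fallback: Sequence[int] = (1, 2, 5, 10),  # hypothetical fallback levels
) -> List[int]:
    """Parse CONCURRENCIES, assumed to be comma-separated positive integers."""
    source = os.environ if env is None else env
    raw = source.get("CONCURRENCIES", "").strip()
    levels = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate empty entries such as a trailing comma
        value = int(token)  # raises ValueError on non-numeric input
        if value <= 0:
            raise ValueError(f"concurrency must be positive, got {value}")
        levels.append(value)
    if not levels:
        return list(fallback)
    # Deduplicate and sort so each level runs once, in ascending order.
    return sorted(set(levels))


if __name__ == "__main__":
    print(parse_concurrencies({"CONCURRENCIES": "1, 5,20,5"}))
```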
After benchmarking completes, check ./benchmarks/results/ (or your custom output directory):
The plotting script uses the --benchmark-name as the experiment name in all generated plots. For example:
--benchmark-name aggregated → plots will show “aggregated” as the label
--benchmark-name vllm-disagg → plots will show “vllm-disagg” as the label
This allows you to easily identify and compare different configurations in the visualization plots.
Raw data is organized by deployment/benchmark type and concurrency level:
For Any Benchmarking (uses your custom benchmark name):
Example with actual benchmark names:
Each concurrency directory contains:
profile_export_aiperf.json - Structured metrics from AIPerf
profile_export_aiperf.csv - CSV format metrics from AIPerf
profile_export.json - Raw AIPerf results
inputs.json - Generated test inputs
Server-side benchmarking runs directly within the Kubernetes cluster, eliminating the need for port forwarding and providing better resource utilization.
The server-side benchmarking solution:
benchmarks.utils.benchmark)
dynamo-pvc
svc_name.namespace.svc.cluster.local
Deploy your DynamoGraphDeployment using the deployment documentation. Ensure it has a frontend service exposed.
Note: The server-side benchmarking job requires a Docker image containing the Dynamo benchmarking tools. Before the 0.5.1 release, you must build your own Docker image using the container build instructions, push it to your container registry, then update the image field in benchmarks/incluster/benchmark_job.yaml to use your built image tag.
To customize the benchmark parameters, edit the benchmarks/incluster/benchmark_job.yaml file and modify:
"Qwen/Qwen3-0.6B" in the args section
"qwen3-0p6b-vllm-agg" to your desired benchmark name
"vllm-agg-frontend:8000" so the service URL matches your deployed service
Then deploy:
This will create visualization plots. For more details on interpreting these plots, see the Summary and Plots section above.
Server-side benchmarking can benchmark services across multiple namespaces from a single job using Kubernetes DNS. When referencing services in other namespaces, use the full DNS format:
DNS Format: <service-name>.<namespace>.svc.cluster.local:port
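As an illustration of the DNS format, the sketch below builds a cross-namespace endpoint string from its parts. The helper name cluster_service_url and the namespace prod are hypothetical. The label check follows the usual Kubernetes naming rules (lowercase alphanumerics and hyphens, at most 63 characters) and is a simplification, not a full validator.

```python
import re

# Simplified RFC 1123 label check: lowercase alphanumerics and '-',
# starting and ending with an alphanumeric character.
_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")


def cluster_service_url(service: str, namespace: str, port: int) -> str:
    """Return <service-name>.<namespace>.svc.cluster.local:port."""
    for kind, label in (("service", service), ("namespace", namespace)):
        if len(label) > 63 or not _LABEL.match(label):
            raise ValueError(f"invalid {kind} name: {label!r}")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"{service}.{namespace}.svc.cluster.local:{port}"


if __name__ == "__main__":
    # 'prod' is a hypothetical namespace used only for this example.
    print(cluster_service_url("vllm-agg-frontend", "prod", 8000))
```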
This allows you to:
The benchmark job is configured directly in the YAML file.
Model: Qwen/Qwen3-0.6B
Benchmark name: qwen3-0p6b-vllm-agg
Service URL: vllm-agg-frontend:8000
Docker image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1
To customize the benchmark, edit benchmarks/incluster/benchmark_job.yaml:
--model argument
--benchmark-name argument
--endpoint-url argument (use <svc_name>.<namespace>.svc.cluster.local:port for cross-namespace access)
To benchmark services across multiple namespaces, you would need to run separate benchmark jobs for each service since the format supports one benchmark per job. However, the results are stored in the same PVC and may be accessed together.
Results are stored in /data/results and follow the same structure as client-side benchmarking:
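Because both modes share one layout, a small script can gather the per-concurrency metrics from either location. The sketch below assumes a tree of the form <results>/<benchmark-name>/<concurrency-dir>/profile_export_aiperf.json. The function name collect_aiperf_metrics is hypothetical, and the contents of each JSON file are passed through unchanged because their schema is not described here.

```python
import json
from pathlib import Path
from typing import Any, Dict


def collect_aiperf_metrics(results_dir: str) -> Dict[str, Dict[str, Any]]:
    """Load every profile_export_aiperf.json under results_dir.

    Assumed layout: <results>/<benchmark-name>/<concurrency-dir>/<file>.
    Returns {benchmark_name: {concurrency_dir_name: parsed_json}}.
    """
    collected: Dict[str, Dict[str, Any]] = {}
    pattern = "*/*/profile_export_aiperf.json"
    for path in sorted(Path(results_dir).glob(pattern)):
        concurrency_dir = path.parent.name
        benchmark = path.parent.parent.name
        if benchmark == "plots":
            continue  # 'plots' is reserved for generated figures
        with path.open(encoding="utf-8") as handle:
            collected.setdefault(benchmark, {})[concurrency_dir] = json.load(handle)
    return collected


if __name__ == "__main__":
    # Client-side default location named in this guide.
    for name, runs in collect_aiperf_metrics("./benchmarks/results").items():
        print(name, sorted(runs))
```

Pointing the same function at /data/results would cover the in-cluster case, provided the PVC is mounted where the script runs.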
dynamo-pvc is properly configured and accessible
The built-in Python workflow connects to endpoints, benchmarks with aiperf, and generates plots. If you want to modify the behavior:
Extend the workflow: Modify benchmarks/utils/workflow.py to add custom deployment types or metrics collection
Generate different plots: Modify benchmarks/utils/plot.py to change which plots are generated.
Direct module usage: Use individual Python modules (benchmarks.utils.benchmark, benchmarks.utils.plot) for granular control over each step of the benchmarking process.
The Python benchmarking module provides a complete end-to-end benchmarking experience with full control over the workflow.
For development and testing purposes, Dynamo provides a mocker backend that simulates LLM inference without requiring actual GPU resources. This is useful for:
The mocker backend mimics the API and behavior of real backends (vLLM, SGLang, TensorRT-LLM) but generates mock responses instead of running actual inference.
See the mocker directory for usage examples and configuration options.