Dynamo Benchmarking Guide#
This benchmarking framework lets you compare performance across any combination of:
DynamoGraphDeployments (automatically deployed from your manifests)
External HTTP endpoints (existing services, vLLM, TensorRT-LLM, etc.)
You can mix and match these in a single benchmark run using custom labels. Configure your DynamoGraphDeployment manifests for your specific models, hardware, and parallelization needs.
What This Tool Does#
The framework is a wrapper around genai-perf that:
Deploys user-specified DynamoGraphDeployments automatically
Benchmarks any HTTP endpoints (no deployment needed)
Runs concurrency sweeps across configurable load levels
Generates comparison plots with your custom labels
Works with any HuggingFace-compatible model on NVIDIA GPUs (H200, H100, A100, etc.)
Runs locally and connects to your Kubernetes deployments/endpoints
Default sequence lengths: Input: 2000 tokens, Output: 256 tokens (configurable with --isl and --osl)
Important: The --model parameter configures GenAI-Perf for benchmarking and provides logging context. The actual model loaded is determined by your deployment manifests. Only one model can be benchmarked at a time across all inputs to ensure fair comparison. The default --model value in the benchmarking script is Qwen/Qwen3-0.6B, but it must match the model in the manifest(s) and the model deployed at the endpoint(s).
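For reference, each concurrency level corresponds to roughly one genai-perf call under the hood. A minimal sketch of an equivalent standalone invocation is shown below; the flag names assume a recent genai-perf release and the wrapper may pass additional options:
genai-perf profile \
  --model Qwen/Qwen3-0.6B \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --streaming \
  --synthetic-input-tokens-mean 2000 \
  --synthetic-input-tokens-stddev 10 \
  --output-tokens-mean 256 \
  --concurrency 10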
Prerequisites#
Kubernetes cluster with NVIDIA GPUs and Dynamo namespace setup - You need a Kubernetes cluster with eligible NVIDIA GPUs and a properly configured namespace for Dynamo benchmarking. See the deploy/utils/README for complete setup instructions.
kubectl access - You need kubectl installed and configured to access your Kubernetes cluster. All other required tools (GenAI-Perf, Python, etc.) are included in the Dynamo containers. If you are not working within a Dynamo container, you can install the necessary requirements using deploy/utils/requirements.txt. Note: if you are on Ubuntu 22.04 or lower, you will also need to build perf_analyzer from source.
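For example, outside a Dynamo container you might prepare and sanity-check your environment like this (the requirements path is the one referenced above):
# Install client-side requirements (only needed outside the Dynamo containers)
pip install -r deploy/utils/requirements.txt

# Confirm kubectl can reach your cluster
kubectl config current-context
kubectl get nodes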
Quick Start Examples#
The tool can deploy, benchmark, and compare Dynamo deployments (DynamoGraphDeployments) on a Kubernetes cluster, and it can also benchmark and compare servers deployed separately, given a URL. In the examples below, Dynamo deployments are specified by a YAML manifest and separately deployed servers by URL.
export NAMESPACE=benchmarking
# Compare multiple DynamoGraphDeployments of a single backend
./benchmarks/benchmark.sh --namespace $NAMESPACE \
--input agg=components/backends/vllm/deploy/agg.yaml \
--input disagg=components/backends/vllm/deploy/disagg.yaml
# Compare different backend types (vLLM vs TensorRT-LLM)
./benchmarks/benchmark.sh --namespace $NAMESPACE \
--input vllm-disagg=components/backends/vllm/deploy/disagg.yaml \
--input trtllm-disagg=components/backends/trtllm/deploy/disagg.yaml
# Compare Dynamo deployment vs existing deployment (external endpoint)
./benchmarks/benchmark.sh --namespace $NAMESPACE \
--input dynamo=components/backends/vllm/deploy/disagg.yaml \
--input vllm-baseline=http://localhost:8000
# Compare three different configurations
./benchmarks/benchmark.sh --namespace $NAMESPACE \
--input dynamo-agg=components/backends/vllm/deploy/agg.yaml \
--input dynamo-disagg=components/backends/vllm/deploy/disagg.yaml \
--input external-vllm=http://localhost:8000
# Benchmark single external endpoint
./benchmarks/benchmark.sh --namespace $NAMESPACE \
--input production-api=http://your-api:8000
# Custom model and sequence lengths
./benchmarks/benchmark.sh --namespace $NAMESPACE \
--input my-setup=my-custom-manifest.yaml \
--model "meta-llama/Meta-Llama-3-8B" --isl 512 --osl 256
Key: Configure your manifests for your specific models, hardware, and parallelization strategy before benchmarking.
Important: Image Accessibility#
Ensure container images in your DynamoGraphDeployment manifests are accessible:
Public images: Use Dynamo NGC public releases
Custom registries: Configure proper credentials in your Kubernetes namespace
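For a private registry, one common approach is to create an image pull secret in your benchmarking namespace and reference it from your manifests; the secret name and registry details below are placeholders:
kubectl create secret docker-registry my-registry-secret \
  --docker-server=<registry-host> \
  --docker-username=<username> \
  --docker-password=<password> \
  --namespace $NAMESPACE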
Configuration and Usage#
Command Line Options#
./benchmarks/benchmark.sh --namespace NAMESPACE --input <label>=<manifest_path_or_endpoint> [--input <label>=<manifest_path_or_endpoint>]... [OPTIONS]
REQUIRED:
-n, --namespace NAMESPACE Kubernetes namespace
--input <label>=<manifest_path_or_endpoint> Benchmark input with custom label
- <label>: becomes the name/label in plots
- <manifest_path_or_endpoint>: either a DynamoGraphDeployment manifest or HTTP endpoint URL
Can be specified multiple times for comparisons
OPTIONS:
-h, --help Show help message and examples
-m, --model MODEL Model name for GenAI-Perf configuration and logging (default: Qwen/Qwen3-0.6B)
NOTE: This must match the model configured in your deployment manifests and endpoints
-i, --isl LENGTH Input sequence length (default: 2000)
-s, --std STDDEV Input sequence standard deviation (default: 10)
-o, --osl LENGTH Output sequence length (default: 256)
-d, --output-dir DIR Output directory (default: ./benchmarks/results)
--verbose Enable verbose output
Important Notes#
Custom Labels: Each input must have a unique label that becomes the name in plots and results
Label Restrictions: Labels can only contain letters, numbers, hyphens, and underscores. The label plots is reserved.
Input Types: Supports DynamoGraphDeployment manifests for automatic deployment, or HTTP endpoints for existing services
Model Parameter: The --model parameter configures GenAI-Perf for testing and logging, not deployment (the deployed model is determined by the manifest files)
Standalone Deployments: For non-Dynamo backends (vLLM, TensorRT-LLM, SGLang, etc.), you must deploy them manually following their respective Kubernetes deployment guides. The benchmarking framework only supports automatic deployment of DynamoGraphDeployments.
Single Model Requirement: Only one model can be benchmarked at a time across all inputs to ensure fair comparison.
What Happens During Benchmarking#
The script automatically:
Deploys each DynamoGraphDeployment configuration to Kubernetes if manifests are passed in
Benchmarks using GenAI-Perf at various concurrency levels (default: 1, 2, 5, 10, 50, 100, 250)
Measures key metrics: latency, throughput, time-to-first-token
Generates comparison plots using your custom labels in ./benchmarks/results/plots/
Cleans up deployments when complete
GPU Resource Usage#
Important: Models are deployed and benchmarked sequentially, not in parallel. This means:
One deployment at a time: Each DynamoGraphDeployment is deployed, benchmarked, and cleaned up before the next one starts
Full GPU access: Each deployment gets exclusive access to all available GPUs during its benchmark run
Resource isolation: No resource conflicts between different deployment configurations
Fair comparison: Each configuration is tested under identical resource conditions
This sequential approach ensures:
Accurate performance measurements without interference between deployments
Consistent resource allocation for fair comparison across different configurations
Simplified resource management without complex GPU scheduling
Reliable cleanup between benchmark runs
If you need to benchmark multiple configurations simultaneously, consider using separate Kubernetes namespaces or running benchmarks on different clusters.
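For example, a rough sketch of two independent runs in separate namespaces (the namespace names and output directories here are placeholders, and each namespace must be set up per the prerequisites):
./benchmarks/benchmark.sh --namespace team-a \
  --input agg=components/backends/vllm/deploy/agg.yaml \
  --output-dir ./benchmarks/results-team-a &
./benchmarks/benchmark.sh --namespace team-b \
  --input disagg=components/backends/vllm/deploy/disagg.yaml \
  --output-dir ./benchmarks/results-team-b &
wait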
Results Clearing Behavior#
Important: The benchmark script automatically clears the output directory before each run to ensure clean, reproducible results. This means:
Previous benchmark results in the same output directory will be completely removed
Each benchmark run starts with a clean slate
Results from different runs are not mixed or accumulated
If you want to preserve results from previous runs, use different output directories with the --output-dir flag.
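For example, a timestamped output directory keeps each run's results separate:
./benchmarks/benchmark.sh --namespace $NAMESPACE \
  --input disagg=components/backends/vllm/deploy/disagg.yaml \
  --output-dir ./benchmarks/results-$(date +%Y%m%d-%H%M%S)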
Using Your Own Models and Configuration#
The benchmarking framework supports any HuggingFace-compatible LLM model. To benchmark your own custom deployment:
Edit your deployment YAML files to specify your model in the --model argument of the container command
Use the corresponding model name in the benchmark script's --model parameter
Note: You can override the default sequence lengths (2000/256 tokens) with --isl and --osl flags if needed for your specific workload.
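A quick way to keep the manifest and the script in sync (the manifest path and model name below are placeholders):
# Check which model the manifest actually deploys...
grep -n "model" my-custom-manifest.yaml

# ...and pass the same name to the benchmark script
./benchmarks/benchmark.sh --namespace $NAMESPACE \
  --input my-setup=my-custom-manifest.yaml \
  --model "meta-llama/Meta-Llama-3-8B" --isl 512 --osl 256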
Direct Python Execution#
For direct control over the benchmark workflow:
# Endpoint benchmarking
python3 -u -m benchmarks.utils.benchmark \
--input trtllm=http://your-endpoint:8000 \
--namespace $NAMESPACE \
--isl 2000 \
--std 10 \
--osl 256 \
--output-dir $OUTPUT_DIR
# Deployment benchmarking (any combination)
python3 -u -m benchmarks.utils.benchmark \
--input agg=$AGG_CONFIG \
--input disagg=$DISAGG_CONFIG \
--namespace $NAMESPACE \
--isl 2000 \
--std 10 \
--osl 256 \
--output-dir $OUTPUT_DIR
# Generate plots separately
python3 -m benchmarks.utils.plot --data-dir $OUTPUT_DIR
Comparison Limitations#
The plotting system supports up to 12 different inputs in a single comparison. If you need to compare more than 12 different deployments/endpoints, consider running separate benchmark sessions or grouping related comparisons together.
Concurrency Configuration#
You can customize the concurrency levels using the CONCURRENCIES environment variable:
# Custom concurrency levels
CONCURRENCIES="1,5,20,50" ./benchmarks/benchmark.sh --namespace $NAMESPACE --input my-test=components/backends/vllm/deploy/disagg.yaml
# Or set permanently
export CONCURRENCIES="1,2,5,10,25,50,100"
./benchmarks/benchmark.sh --namespace $NAMESPACE --input test=disagg.yaml
Understanding Your Results#
After benchmarking completes, check ./benchmarks/results/ (or your custom output directory):
Summary and Plots#
benchmarks/results/
├── SUMMARY.txt # Quick overview of all results
└── plots/ # Visual comparisons (these are what you want!)
├── p50_inter_token_latency_vs_concurrency.png # Token generation speed
├── avg_time_to_first_token_vs_concurrency.png # Response time
├── request_throughput_vs_concurrency.png # Requests per second
├── efficiency_tok_s_gpu_vs_user.png # GPU efficiency
└── avg_inter_token_latency_vs_concurrency.png # Average latency
Data Files#
Raw data is organized by deployment/benchmark type and concurrency level:
For Any Benchmarking (uses your custom labels):
benchmarks/results/
├── plots/ # Performance visualization plots
│ ├── SUMMARY.txt # Human-readable benchmark summary
│ ├── p50_inter_token_latency_vs_concurrency.png
│ ├── avg_inter_token_latency_vs_concurrency.png
│ ├── request_throughput_vs_concurrency.png
│ ├── efficiency_tok_s_gpu_vs_user.png
│ └── avg_time_to_first_token_vs_concurrency.png
├── <your-label-1>/ # Results for first input (uses your custom label)
│ ├── c1/ # Concurrency level 1
│ │ └── profile_export_genai_perf.json
│ ├── c2/ # Concurrency level 2
│ ├── c5/ # Concurrency level 5
│ └── ... # Other concurrency levels (10, 50, 100, 250)
├── <your-label-2>/ # Results for second input (if provided)
│ └── c*/ # Same structure as above
└── <your-label-N>/ # Results for additional inputs
└── c*/ # Same structure as above
Example with actual labels:
benchmarks/results/
├── plots/
├── dynamo-agg/ # --input dynamo-agg=agg.yaml
├── dynamo-disagg/ # --input dynamo-disagg=disagg.yaml
└── external-vllm/ # --input external-vllm=http://localhost:8000
Each concurrency directory contains:
profile_export_genai_perf.json - Structured metrics from GenAI-Perf
profile_export.json - Raw GenAI-Perf results
inputs.json - Generated test inputs
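If you want to pull numbers out of these files directly, a minimal sketch is below. It assumes the JSON exposes a time_to_first_token entry with an avg field, which holds for recent genai-perf versions but is worth verifying against your own output:
for f in benchmarks/results/*/c*/profile_export_genai_perf.json; do
  label=$(basename "$(dirname "$(dirname "$f")")")   # your custom input label
  conc=$(basename "$(dirname "$f")")                 # concurrency level (c1, c2, ...)
  ttft=$(python3 -c "import json,sys; print(json.load(open(sys.argv[1])).get('time_to_first_token', {}).get('avg'))" "$f")
  echo "$label $conc avg_ttft=$ttft"
done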
Customize Benchmarking Behavior#
The built-in workflow handles DynamoGraphDeployment deployment, benchmarking with genai-perf, and plot generation automatically. If you want to modify the behavior:
Extend the workflow: Modify benchmarks/utils/workflow.py to add custom deployment types or metrics collection
Generate different plots: Modify benchmarks/utils/plot.py to generate a different set of plots for whatever you wish to visualize
The benchmark.sh script provides a complete end-to-end benchmarking experience. For more granular control, use the Python modules directly.