Dynamo provides automated SLA-driven profiling through DynamoGraphDeploymentRequests (DGDR). Instead of manually running profiling scripts, you declare your performance requirements and let the Dynamo Operator handle profiling and deployment automatically.
Key Benefits:
Declarative: Specify SLAs, not implementation details
Automated: No manual job setup or result processing
Integrated: Seamlessly works with Dynamo Operator
Production-Ready: Generates optimized configurations with SLA planner
This document covers:
Technical details of online vs offline profiling
Profiling process internals (GPU usage, measurements, interpolation)
Direct script usage for advanced scenarios
Comprehensive troubleshooting
Support Matrix
Backend | Dense Models | MoE Models
vLLM | ✅ | 🚧
SGLang | ✅ | ✅
TensorRT-LLM | ✅ | 🚧
Specifically, the profiler sweeps over the following parallelization mapping for prefill and decode:
Exact model × parallelization mapping support depends on the backend. The profiler does not guarantee that the backend supports the recommended P/D engine configuration or runs it without bugs.
Using DGDR for Profiling (Recommended)
The recommended way to profile models is through DGDRs. Sample configurations are provided in deploy/:
Available Samples:
profile_sla_dgdr.yaml: Standard profiling with AIPerf on real engines
profile_sla_aic_dgdr.yaml: Fast profiling with AI Configurator simulation
Runs profiling (AIPerf on real engines or AI Configurator simulation)
Generates optimal DGD configuration with SLA planner
Deploys the DGD to your cluster
See the Quick Start Guide for prerequisites and detailed instructions.
Hardware Configuration
Hardware parameters have sensible defaults and are optional - you can override them if needed:
profilingConfig:
  config:
    # Override hardware defaults if needed
    hardware:
      minNumGpusPerEngine: 1
      maxNumGpusPerEngine: 8
      numGpusPerNode: 8

    # Only needed when using AI Configurator (sweep.useAiConfigurator: true)
    sweep:
      aicSystem: h200_sxm # GPU type for AI Configurator (h100_sxm, h200_sxm, etc.)
Automatic GPU Discovery (Optional Feature)
Cluster-scoped operators can optionally enable automatic GPU discovery to detect hardware from cluster nodes. When enabled, hardware config is auto-detected and overrides any manually specified values.
spec:
  enableGpuDiscovery: true
This feature is only available with cluster-scoped operators (namespaceRestriction.enabled=false) as it requires cluster-wide node access permissions. It is not available for namespace-restricted operators.
Profiling Method
Hardware Setup: Uses defaults or user-specified hardware configuration. Optionally, cluster-scoped operators can enable automatic GPU discovery to detect specifications from cluster nodes.
Identify Sweep Ranges: Automatically determine the minimum and maximum number of GPUs per engine. The minimum is determined by the model size and GPU VRAM. The maximum is set to one node for dense models and 4 nodes for MoE models.
Parallelization Mapping Sweep: Using the input ISL and OSL, test the performance of the engines with different parallelization mappings.
For dense models, we test different TP sizes for both prefill and decode.
For MoE models (SGLang), we evaluate both TEP and DEP as candidates for prefill and decode.
Prefill:
TP/TEP: We measure TTFT with batch size = 1 (assuming ISL is long enough to saturate compute) without KV reuse.
DEP: Attention uses data parallelism. We send a single burst with total concurrency attention_dp_size × attn_dp_num_req_ratio (defaults to 4) and compute the reported TTFT as time_to_first_token.max / attn_dp_num_req_ratio from the AIPerf summary of that burst. This stabilizes measurements when the first batch may launch before all requests arrive.
Decode: Because the ITL (or iteration time) depends on how many requests are in flight, we measure the ITL under different numbers of in-flight requests, ranging from 1 to the maximum number of requests that the engine's KV cache can hold. To measure the ITL without interference from piggy-backed prefill requests, the script enables KV reuse and warms up the engine by issuing the same prompts before measuring. Since the KV cache is large enough for all the requests, it can hold the KV cache of the pre-computed prompts and skip the prefill phase during measurement. For MoE models, however, this is not guaranteed because the KV cache differs across attention DP ranks. We are working on a framework-side change to fix this issue. For example, the plot below shows the decode parallelization mapping sweep results on H100 for deepseek-ai/DeepSeek-R1-Distill-Llama-8B.
Recommendation: Selects the optimal parallelization mapping for prefill and decode that achieves the highest per-GPU throughput while adhering to the SLA on TTFT and ITL. Specifically, the profiler chooses the point (or a point on the curve for decode) that lies to the left of the vertical red dashed line representing the SLA and has the highest y coordinate (throughput per GPU).
In-Depth Profiling on the Recommended P/D Engine: After finding the best TP size for prefill and decode, the script interpolates the TTFT against ISL, and the ITL against active KV cache and decode context length. This provides a more accurate performance estimate when ISL and OSL change, and is used by the sla-planner.
Prefill: Measures TTFT and throughput per GPU across different input lengths with batch size=1.
Decode: Measures ITL and throughput per GPU under various KV cache loads and decode context lengths. The active KV usage determines the complexity of the memory-bound attention kernel, while the active KV usage divided by the average context length determines the complexity of the compute-bound MLP kernel. For example, the figure below shows the ITL of the DS-Distilled Llama 8b model on H100 TP4. The ITL grows near-linearly with active KV usage under a fixed context length, and the slope increases as the context length decreases.
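The DEP prefill TTFT normalization and the recommendation rule from the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the profiler's actual code: the Candidate record, the function names, and the choice of "less than or equal to" for the SLA check are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


def dep_burst_concurrency(attention_dp_size: int, attn_dp_num_req_ratio: int = 4) -> int:
    """Total concurrency of the single burst sent to a DEP prefill engine."""
    return attention_dp_size * attn_dp_num_req_ratio


def dep_reported_ttft(max_ttft_ms: float, attn_dp_num_req_ratio: int = 4) -> float:
    """Reported TTFT: the burst's maximum TTFT divided by the request ratio."""
    return max_ttft_ms / attn_dp_num_req_ratio


@dataclass
class Candidate:
    """Hypothetical record for one measured point of the sweep."""
    mapping: str          # e.g. "TP4"
    latency_ms: float     # TTFT for prefill, ITL for decode
    thpt_per_gpu: float   # tokens/s/GPU


def pick_best(candidates: Sequence[Candidate], sla_ms: float) -> Optional[Candidate]:
    """Return the SLA-compliant point with the highest per-GPU throughput.

    Assumption: a point exactly on the SLA line counts as compliant.
    Returns None when no candidate meets the SLA.
    """
    feasible = [c for c in candidates if c.latency_ms <= sla_ms]
    return max(feasible, key=lambda c: c.thpt_per_gpu, default=None)
```

For instance, with an attention DP size of 8 and the default ratio of 4, the burst carries 32 concurrent requests, and a maximum TTFT of 800 ms in that burst is reported as 200 ms.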
To run the parallelization mapping sweep and the in-depth profiling on the recommended P/D engine, the profiler needs to know the engine's forward pass time under different loads. There are two ways to achieve this: run AIPerf on real engines or use AI Configurator to run simulations.
AIPerf on Real Engines
Profiles your model by creating real test deployments in Kubernetes and measuring their performance.
Characteristics:
Duration: 2-4 hours
Accuracy: Highest (real measurements)
GPU Requirements: Full access to test different parallelization mappings
Backends: vLLM, SGLang, TensorRT-LLM
DGDR Configuration:
profilingConfig:
  config:
    sweep:
      useAiConfigurator: false # Default
AI Configurator Simulation
Uses performance simulation to rapidly estimate optimal configurations without running real deployments.
Characteristics:
Duration: 20-30 seconds
Accuracy: Estimated (may have errors for unusual configurations)
GPU Requirements: None
Backends: TensorRT-LLM only (vLLM/SGLang coming soon)
When running the profiler with --pick-with-webui, an interactive web interface is launched that allows you to visually explore profiling results and manually select configurations.
Features:
Interactive Charts: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables
Pareto-Optimal Analysis: The GPU Hours table shows pareto-optimal configurations balancing latency and throughput
DGD Config Preview: Click “Show Config” on any row to view the corresponding DynamoGraphDeployment YAML
GPU Cost Estimation: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests)
SLA Visualization: Red dashed lines indicate your TTFT and ITL targets
Selection Methods:
GPU Hours Table (recommended): Click any row to select both prefill and decode configurations at once based on the pareto-optimal combination
Individual Selection: Click one row in the Prefill table AND one row in the Decode table to manually choose each
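To make "pareto-optimal" concrete, here is a small sketch of Pareto filtering over table rows. Which metrics the WebUI actually compares is not spelled out here, so the Row fields (one latency value and one GPU-hours value, both lower-is-better) and the function name are hypothetical.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    """Hypothetical table row; both metrics are lower-is-better."""
    name: str
    latency_ms: float
    gpu_hours: float


def pareto_front(rows: List[Row]) -> List[Row]:
    """Keep the rows that no other row beats on both metrics."""

    def dominates(a: Row, b: Row) -> bool:
        # a dominates b if it is no worse on both metrics and better on one.
        no_worse = a.latency_ms <= b.latency_ms and a.gpu_hours <= b.gpu_hours
        better = a.latency_ms < b.latency_ms or a.gpu_hours < b.gpu_hours
        return no_worse and better

    return [r for r in rows if not any(dominates(o, r) for o in rows)]
```

A row that is slower and also costs more GPU hours than some other row is dropped; the remaining rows each represent a different latency/cost trade-off.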
Example DGD Config Output:
When you click “Show Config”, you’ll see a DynamoGraphDeployment configuration like:
# DynamoGraphDeployment Configuration
# Prefill: 1 GPU(s), TP=1
# Decode: 4 GPU(s), TP=4
# Model: Qwen/Qwen3-32B-FP8
# Backend: trtllm
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    PrefillWorker:
      subComponentType: prefill
      replicas: 1
      extraPodSpec:
        mainContainer:
          args:
            - --tensor-parallel-size=1
    DecodeWorker:
      subComponentType: decode
      replicas: 1
      extraPodSpec:
        mainContainer:
          args:
            - --tensor-parallel-size=4
Usage:
python -m benchmarks.profiler.profile_sla \
  --backend trtllm \
  --config path/to/disagg.yaml \
  --pick-with-webui \
  --use-ai-configurator \
  --model Qwen/Qwen3-32B-FP8 \
  --aic-system h200_sxm \
  --ttft 200 --itl 15
Once you have selected a configuration, the full DynamoGraphDeployment CRD will be saved in your output folder as config_with_planner.yaml.
The WebUI launches on port 8000 by default (configurable with --webui-port).
Output Performance Plots
The profiler will generate the following plots to better visualize the performance data:
Parallelization Mapping Sweep Plots:
prefill_performance.png: TTFT vs Parallelization Mapping size
decode_performance.png: ITL vs Parallelization Mapping size and in-flight requests
Note these two plots are based on the input ISL and OSL.
In-Depth Profiling for the Recommended P/D Engine Plots:
selected_prefill_interpolation/prefill_ttft_interpolation.png: TTFT vs ISL for the recommended prefill engine
selected_prefill_interpolation/prefill_throughput_interpolation.png: Throughput vs ISL for the recommended prefill engine
selected_decode_interpolation/decode_itl_interplation.png: ITL vs KV usage and context length for the recommended decode engine
selected_decode_interpolation/decode_throughput_interpolation.png: Throughput vs KV usage and context length for the recommended decode engine
Output Interpolation Data
The profiler generates .npz files to store the performance data for the recommended P/D engine:
max_kv_tokens: Total KV tokens capacity in decode engine
x_kv_usage: 1D array of active KV usage percentages [0, 1]
y_context_length: 1D array of average context lengths tested
z_itl: 1D array of ITLs (ms) at each (KV usage, context length) point
z_thpt_per_gpu: 1D array of throughput (tokens/s/GPU) at each point
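As an illustration of how the decode arrays listed above could be consumed, the sketch below estimates the ITL at an arbitrary (KV usage, context length) point using inverse-distance weighting. This is not necessarily the interpolation the sla-planner uses; the function name and the weighting scheme are assumptions. In practice the three arrays would typically be read from the .npz file with numpy.load and passed in as sequences.

```python
from typing import Sequence


def estimate_itl(
    kv_usage: float,
    context_length: float,
    x_kv_usage: Sequence[float],
    y_context_length: Sequence[float],
    z_itl: Sequence[float],
) -> float:
    """Inverse-distance-weighted ITL estimate (ms) from scattered samples."""
    # Scale context length to roughly [0, 1] so both axes weigh similarly.
    scale = max(y_context_length)
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, z in zip(x_kv_usage, y_context_length, z_itl):
        d2 = (x - kv_usage) ** 2 + ((y - context_length) / scale) ** 2
        if d2 == 0.0:
            return z  # Query coincides with a measured point.
        w = 1.0 / d2
        weighted_sum += w * z
        weight_total += w
    return weighted_sum / weight_total
```

Because the estimate is a weighted average of measured ITLs, it always stays within the range of the sampled values, which makes it a conservative choice for a quick lookup.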
DGDR Configuration Reference
This section provides detailed explanations of all DGDR profilingConfig options. The DGDR controller passes this configuration to the profiler script, which is defined in benchmarks/profiler/utils/profiler_argparse.py.
Configuration Structure
All profiler configuration goes under spec.profilingConfig.config:
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-deployment
spec:
  model: "Qwen/Qwen3-0.6B" # High-level: model to deploy
Trade-offs: Tighter SLAs require more GPU resources
Hardware Configuration (Optional)
Control GPU search space and constraints:
profilingConfig:
  config:
    hardware:
      minNumGpusPerEngine: 2 # if not provided, will automatically determine based on model and VRAM size
      maxNumGpusPerEngine: 8 # Maximum GPUs to test
      numGpusPerNode: 8 # GPUs per node (for multi-node MoE)
      gpuType: h200_sxm # GPU type hint
When to use:
minNumGpusPerEngine: Skip small TP sizes if your model is large
maxNumGpusPerEngine: Limit search space or work around constraints (e.g., AIC attention heads)
numGpusPerNode: Determines the upper bound on the number of GPUs per node for dense models, and configures Grove for multi-node MoE engines.
gpuType: Informational, auto-detected by controller
If you don’t specify hardware constraints, the controller auto-detects based on your model size and available cluster resources.
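The sweep-range rule from the Profiling Method section (minimum from model size and VRAM, maximum of one node for dense models and 4 nodes for MoE models) can be sketched as follows. The exact formula the controller uses for the minimum is not given here, so the ceiling division and the usable_fraction headroom parameter are assumptions for illustration only.

```python
import math
from typing import Tuple


def gpus_per_engine_range(
    model_size_gb: float,
    vram_per_gpu_gb: float,
    num_gpus_per_node: int,
    is_moe: bool,
    usable_fraction: float = 0.9,  # assumed headroom for KV cache and activations
) -> Tuple[int, int]:
    """Return a (min, max) GPUs-per-engine range for the sweep."""
    # Minimum: enough GPUs for the weights to fit in the usable VRAM.
    min_gpus = max(1, math.ceil(model_size_gb / (vram_per_gpu_gb * usable_fraction)))
    # Maximum: one node for dense models, 4 nodes for MoE models.
    max_nodes = 4 if is_moe else 1
    return min_gpus, max_nodes * num_gpus_per_node
```

Setting minNumGpusPerEngine or maxNumGpusPerEngine explicitly replaces the corresponding end of this range.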
Sweep Configuration (Optional)
Control profiling behavior:
profilingConfig:
  config:
    sweep:
      useAiConfigurator: false # Use offline profiling (default: false)
      prefillInterpolationGranularity: 16 # Samples for prefill TTFT curve
      decodeInterpolationGranularity: 6 # Samples for decode ITL curve
Use cases:
useAiConfigurator: Set to true for 20-30 second profiling (TensorRT-LLM only)
prefillInterpolationGranularity: How many samples to benchmark for prefill TTFT curve (lower = faster but may be less accurate)
decodeInterpolationGranularity: How many samples to benchmark for decode ITL curve (lower = faster but may be less accurate). Since ITL interpolation is a 3D plot and takes longer to run, we default to a smaller number of samples. Increasing this value may increase the profiling time quadratically.
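To see why the decode granularity has a quadratic effect, consider a rough count of benchmark points. The sketch assumes decode samples form a full grid over KV usage and context length, which gives an upper bound; the profiler may skip some points, so treat the numbers as estimates.

```python
from typing import Dict


def benchmark_points(prefill_granularity: int = 16, decode_granularity: int = 6) -> Dict[str, int]:
    """Rough number of benchmark points implied by the two granularities."""
    return {
        # Prefill interpolates along one axis (ISL).
        "prefill": prefill_granularity,
        # Decode interpolates along two axes (KV usage x context length),
        # assumed here to be a full grid.
        "decode_upper_bound": decode_granularity ** 2,
    }
```

With the defaults this gives 16 prefill points and up to 36 decode points; doubling decodeInterpolationGranularity to 12 raises the decode bound to 144.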
AI Configurator Configuration (Required if useAiConfigurator: true)
Symptoms: Startup or runtime errors in the backend. For example, a prime number of attention heads restricts the TP size to 1 (e.g., falcon-7b with 71 attention heads), or a backend may not support different TP sizes for prefill and decode.
Solutions:
Ask the backend maintainers to add support for the use case, then bump the backend version in Dynamo.
Restrict the minimum and maximum number of GPUs per engine to the supported range.
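A quick way to find a supported range before profiling is to list the TP sizes that evenly divide the model's attention head count. The helper below is a hypothetical sketch, not part of the profiler, and backends may impose further constraints (for example on KV heads) that it does not check.

```python
from typing import List


def supported_tp_sizes(num_attention_heads: int, min_gpus: int, max_gpus: int) -> List[int]:
    """TP sizes in [min_gpus, max_gpus] that evenly divide the head count."""
    return [
        tp
        for tp in range(min_gpus, max_gpus + 1)
        if num_attention_heads % tp == 0
    ]
```

For falcon-7b with 71 attention heads and a range of 1 to 8 GPUs, only TP=1 qualifies, so maxNumGpusPerEngine should be set to 1; a model with 32 heads allows 1, 2, 4, and 8.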