Profiler | NVIDIA Dynamo Documentation

The Dynamo Profiler is an automated performance analysis tool that measures model inference characteristics to optimize deployment configurations. It determines optimal tensor parallelism (TP) settings for prefill and decode phases, generates performance interpolation data, and enables SLA-driven autoscaling through the Planner.
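
To make the idea of SLA-driven TP selection concrete, here is a minimal sketch. It is not the Profiler's implementation. It assumes the selection rule is "among the profiled TP sizes that meet the latency target, keep the one with the highest per-GPU throughput". The `Measurement` type, the `pick_tp` helper, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Measurement:
    tp: int                      # tensor-parallel size that was profiled
    latency_ms: float            # TTFT for prefill, ITL for decode
    tokens_per_s_per_gpu: float  # throughput normalized by GPU count


def pick_tp(measurements: List[Measurement], sla_ms: float) -> Optional[Measurement]:
    """Return the SLA-compliant measurement with the best per-GPU throughput.

    Assumed selection rule for illustration only; returns None when no
    profiled TP size meets the latency target.
    """
    compliant = [m for m in measurements if m.latency_ms <= sla_ms]
    if not compliant:
        return None
    return max(compliant, key=lambda m: m.tokens_per_s_per_gpu)


# Hypothetical prefill measurements (placeholders, not real benchmark data).
prefill = [
    Measurement(tp=1, latency_ms=310.0, tokens_per_s_per_gpu=9000.0),
    Measurement(tp=2, latency_ms=170.0, tokens_per_s_per_gpu=8200.0),
    Measurement(tp=4, latency_ms=95.0, tokens_per_s_per_gpu=6900.0),
]

best = pick_tp(prefill, sla_ms=200.0)
```

With a 200 ms TTFT target, TP=1 is rejected for missing the SLA, and TP=2 is preferred over TP=4 because it delivers more tokens per second per GPU.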

Feature Matrix

Feature	SGLang	TensorRT-LLM	vLLM
Dense Model Profiling	✅	✅	✅
MoE Model Profiling	✅	🚧	🚧
AI Configurator (Offline)	❌	✅	❌
Online Profiling (AIPerf)	✅	✅	✅
Interactive WebUI	✅	✅	✅
Runtime Profiling Endpoints	✅	❌	❌

Quick Start

Prerequisites

Dynamo platform installed (see Installation Guide)
Kubernetes cluster with GPU nodes (for DGDR-based profiling)
kube-prometheus-stack installed (required for the SLA Planner)

Using DynamoGraphDeploymentRequest (Recommended)

The recommended way to profile models is through DynamoGraphDeploymentRequests (DGDRs), which automate the entire profiling and deployment workflow.

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-model-profiling
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  image: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0"

  workload:
    isl: 3000      # Average input sequence length
    osl: 150       # Average output sequence length

  sla:
    ttft: 200.0    # Target Time To First Token (ms)
    itl: 20.0      # Target Inter-Token Latency (ms)

  autoApply: true

$ kubectl apply -f my-profiling-dgdr.yaml -n $NAMESPACE
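
When several workloads or SLA targets need to be profiled, it can be convenient to generate the manifest programmatically. The sketch below builds the same DGDR as a Python dictionary and serializes it to JSON, which `kubectl apply -f` accepts alongside YAML. The `build_dgdr` helper is hypothetical, and the field layout simply mirrors the example manifest above.

```python
import json


def build_dgdr(name, model, backend, image, isl, osl, ttft_ms, itl_ms, auto_apply=True):
    """Build a DGDR manifest as a plain dict.

    Hypothetical helper: the structure copies the example manifest in this
    section and does not validate against the actual CRD schema.
    """
    return {
        "apiVersion": "nvidia.com/v1beta1",
        "kind": "DynamoGraphDeploymentRequest",
        "metadata": {"name": name},
        "spec": {
            "model": model,
            "backend": backend,
            "image": image,
            "workload": {"isl": isl, "osl": osl},
            "sla": {"ttft": ttft_ms, "itl": itl_ms},
            "autoApply": auto_apply,
        },
    }


manifest = build_dgdr(
    name="my-model-profiling",
    model="Qwen/Qwen3-0.6B",
    backend="vllm",
    image="nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0",
    isl=3000,
    osl=150,
    ttft_ms=200.0,
    itl_ms=20.0,
)

# JSON text that could be written to a file and passed to `kubectl apply -f`.
manifest_json = json.dumps(manifest, indent=2)
```

Writing `manifest_json` to a file such as `my-profiling-dgdr.json` and applying it with the command above should be equivalent to applying the YAML version.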

Using AI Configurator (Fast Offline Profiling)

AI Configurator (AIC) enables rapid offline profiling (~30 seconds) and supports all backends (vLLM, SGLang, TensorRT-LLM). Because `searchStrategy: rapid` is the default, AIC is used automatically unless you explicitly set `searchStrategy: thorough`.

Configuration

Parameter	Default	Description
`workload.isl`	4000	Average input sequence length (tokens)
`workload.osl`	1000	Average output sequence length (tokens)
`sla.ttft`	2000	Target Time To First Token (milliseconds)
`sla.itl`	30	Target Inter-Token Latency (milliseconds)
`hardware.numGpusPerNode`	auto	Number of GPUs per node
`hardware.gpuSku`	auto	GPU SKU identifier
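
The table implies that any parameter omitted from a DGDR falls back to its default. The following sketch illustrates that merge for the parameters listed above. The `effective_config` helper is hypothetical, and representing auto-detected values as the string `"auto"` is an assumption made for illustration only.

```python
import copy

# Defaults taken from the table above; "auto" is a stand-in for
# auto-detection (an assumption about representation, not the real schema).
DEFAULTS = {
    "workload": {"isl": 4000, "osl": 1000},
    "sla": {"ttft": 2000, "itl": 30},
    "hardware": {"numGpusPerNode": "auto", "gpuSku": "auto"},
}


def effective_config(overrides):
    """Overlay user-supplied values on top of the documented defaults."""
    merged = copy.deepcopy(DEFAULTS)  # never mutate the shared defaults
    for section, values in overrides.items():
        merged.setdefault(section, {}).update(values)
    return merged


# Overrides matching the Quick Start manifest: only workload and SLA are set.
cfg = effective_config({
    "workload": {"isl": 3000, "osl": 150},
    "sla": {"ttft": 200.0, "itl": 20.0},
})
```

Here `cfg` carries the Quick Start workload and SLA values, while both hardware parameters keep their `auto` defaults.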

Profiling Methods

Method	Duration	Accuracy	GPU Required	Backends
Online (AIPerf)	2-4 hours	Highest	Yes	All
Offline (AI Configurator)	20-30 seconds	Estimated	No	TensorRT-LLM

Output

The profiler generates:

Optimal Configuration: Recommended TP sizes for prefill and decode engines
Performance Data: Interpolation models for the SLA Planner
Generated DGD: Complete DynamoGraphDeployment (DGD) manifest with optimized settings
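
To illustrate what "interpolation" means for the performance data, the sketch below estimates TTFT at an unprofiled input length from a few profiled points using piecewise-linear interpolation. This is an assumed, simplified model: the Profiler's actual data format and interpolation method may differ, and the sample points are hypothetical placeholders.

```python
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation over sorted xs, clamped to the profiled range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)          # first index with xs[i] >= x
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical profiled points: input sequence length -> TTFT in ms.
isl_points = [1000, 2000, 4000]
ttft_ms = [20.0, 40.0, 100.0]

estimate = interpolate(isl_points, ttft_ms, 3000)
```

An ISL of 3000 lies halfway between the 2000 and 4000 samples, so the estimate is the midpoint of their TTFT values.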

Example recommendations:

Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
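
If these recommendations need to be consumed by a script, they can be parsed into structured values. The sketch below assumes the output matches the two example lines above exactly. The real log format may vary between versions, and `parse_recommendations` is a hypothetical helper.

```python
import re

# Pattern derived from the example lines above (an assumption about the format).
PATTERN = re.compile(
    r"Suggested (?P<phase>prefill|decode) TP:(?P<tp>\d+) "
    r"\((?P<metric>TTFT|ITL) (?P<latency_ms>[\d.]+) ms, "
    r"throughput (?P<throughput>[\d.]+) tokens/s/GPU\)"
)


def parse_recommendations(text):
    """Return {phase: {...}} for every recommendation line found in text."""
    results = {}
    for match in PATTERN.finditer(text):
        results[match["phase"]] = {
            "tp": int(match["tp"]),
            "metric": match["metric"],
            "latency_ms": float(match["latency_ms"]),
            "tokens_per_s_per_gpu": float(match["throughput"]),
        }
    return results


sample = (
    "Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)\n"
    "Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)\n"
)

recs = parse_recommendations(sample)
```

The result maps `prefill` and `decode` to their suggested TP size, latency metric, and per-GPU throughput.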

Next Steps

Document	Description
Profiler Guide	Configuration, methods, and troubleshooting
Profiler Examples	Complete DGDR YAMLs, WebUI, script examples
SLA Planner Guide	End-to-end deployment workflow
SLA Planner Architecture	How the Planner uses profiling data