# Profiler Examples

Complete examples for profiling with DGDRs, the interactive WebUI, and direct script usage.
## DGDR Examples

### Dense Model: AIPerf on Real Engines

Standard online profiling with real GPU measurements:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: vllm-dense-online
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200.0
        itl: 20.0
      hardware:
        minNumGpusPerEngine: 1
        maxNumGpusPerEngine: 8
      sweep:
        useAiConfigurator: false
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
  autoApply: true
```
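A minimal sketch for submitting and watching the request, assuming the manifest above is saved as `vllm-dense-online.yaml` and the Dynamo operator is running in `$NAMESPACE`:

```bash
# Submit the DGDR; with autoApply: true, the operator deploys the
# best-performing configuration once profiling completes.
kubectl apply -f vllm-dense-online.yaml -n $NAMESPACE

# Watch the request status as the profiling sweep progresses.
kubectl get dynamographdeploymentrequest vllm-dense-online -n $NAMESPACE -w
```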
### Dense Model: AI Configurator Simulation

Fast offline profiling (~30 seconds, TensorRT-LLM only):
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: trtllm-aic-offline
spec:
  model: "Qwen/Qwen3-32B"
  backend: trtllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300.0
        itl: 10.0
      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm  # Also supports h100_sxm, b200_sxm, gb200_sxm, a100_sxm
        aicHfId: Qwen/Qwen3-32B
        aicBackendVersion: "0.20.0"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
  autoApply: true
```
### MoE Model

Multi-node MoE profiling with SGLang:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sglang-moe
spec:
  model: "deepseek-ai/DeepSeek-R1"
  backend: sglang
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
    config:
      sla:
        isl: 2048
        osl: 512
        ttft: 300.0
        itl: 25.0
      hardware:
        numGpusPerNode: 8
        maxNumGpusPerEngine: 32
      engine:
        isMoeModel: true
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
  autoApply: true
```
### Using Existing DGD Config (ConfigMap)

Reference a custom DGD configuration via a ConfigMap:

```bash
# Create a ConfigMap from your DGD config file
kubectl create configmap deepseek-r1-config \
  --from-file=/path/to/your/disagg.yaml \
  --namespace $NAMESPACE \
  --dry-run=client -o yaml | kubectl apply -f -
```

Then reference the ConfigMap from the DGDR via `configMapRef`:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: deepseek-r1
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
    configMapRef:
      name: deepseek-r1-config
      key: disagg.yaml
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300
        itl: 10
      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm
        aicHfId: deepseek-ai/DeepSeek-V3
        aicBackendVersion: "0.20.0"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
  autoApply: true
```
## Interactive WebUI

Launch an interactive configuration selection interface:

```bash
python -m benchmarks.profiler.profile_sla \
  --backend trtllm \
  --config path/to/disagg.yaml \
  --pick-with-webui \
  --use-ai-configurator \
  --model Qwen/Qwen3-32B-FP8 \
  --aic-system h200_sxm \
  --ttft 200 --itl 15
```

The WebUI launches on port 8000 by default (configurable with `--webui-port`).
### Features

- **Interactive Charts**: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables
- **Pareto-Optimal Analysis**: The GPU Hours table shows Pareto-optimal configurations balancing latency and throughput
- **DGD Config Preview**: Click “Show Config” on any row to view the corresponding DynamoGraphDeployment YAML
- **GPU Cost Estimation**: Toggle the GPU cost display to convert GPU hours to cost ($/1000 requests)
- **SLA Visualization**: Red dashed lines indicate your TTFT and ITL targets
### Selection Methods

- **GPU Hours Table (recommended)**: Click any row to select both the prefill and decode configurations at once, based on the Pareto-optimal combination
- **Individual Selection**: Click one row in the Prefill table and one row in the Decode table to choose each manually
### Example DGD Config Output

When you click “Show Config”, you see a DynamoGraphDeployment configuration:
```yaml
# DynamoGraphDeployment Configuration
# Prefill: 1 GPU(s), TP=1
# Decode: 4 GPU(s), TP=4
# Model: Qwen/Qwen3-32B-FP8
# Backend: trtllm
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    PrefillWorker:
      subComponentType: prefill
      replicas: 1
      extraPodSpec:
        mainContainer:
          args:
            - --tensor-parallel-size=1
    DecodeWorker:
      subComponentType: decode
      replicas: 1
      extraPodSpec:
        mainContainer:
          args:
            - --tensor-parallel-size=4
```
Once you select a configuration, the full DGD CRD is saved as `config_with_planner.yaml`.
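From there you can deploy the selected configuration directly; a minimal sketch, assuming a cluster with the Dynamo operator installed and `$NAMESPACE` set:

```bash
# Deploy the DynamoGraphDeployment produced by the WebUI selection
kubectl apply -f config_with_planner.yaml -n $NAMESPACE
```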
## Direct Script Examples

### Basic Profiling

```bash
python -m benchmarks.profiler.profile_sla \
  --backend vllm \
  --config path/to/disagg.yaml \
  --model meta-llama/Llama-3-8B \
  --ttft 200 --itl 15 \
  --isl 3000 --osl 150
```
### With GPU Constraints

```bash
python -m benchmarks.profiler.profile_sla \
  --backend sglang \
  --config examples/backends/sglang/deploy/disagg.yaml \
  --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --ttft 200 --itl 15 \
  --isl 3000 --osl 150 \
  --min-num-gpus 2 \
  --max-num-gpus 8
```
### AI Configurator (Offline)

```bash
python -m benchmarks.profiler.profile_sla \
  --backend trtllm \
  --config path/to/disagg.yaml \
  --use-ai-configurator \
  --model Qwen/Qwen3-32B-FP8 \
  --aic-system h200_sxm \
  --ttft 200 --itl 15 \
  --isl 4000 --osl 500
```
## SGLang Runtime Profiling

Profile SGLang workers at runtime via HTTP endpoints:

```bash
# Start profiling
curl -X POST http://localhost:9090/engine/start_profile \
  -H "Content-Type: application/json" \
  -d '{"output_dir": "/tmp/profiler_output"}'

# Run inference requests to generate profiling data...

# Stop profiling
curl -X POST http://localhost:9090/engine/stop_profile
```
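The same flow can be scripted; a minimal sketch using Python's `requests`, assuming the worker exposes the endpoints above on `localhost:9090`:

```python
import requests

BASE = "http://localhost:9090"

# Begin writing traces to output_dir on the worker.
requests.post(
    f"{BASE}/engine/start_profile",
    json={"output_dir": "/tmp/profiler_output"},
).raise_for_status()

# ... send inference requests here so the profiler captures real work ...

# Stop profiling and flush the traces.
requests.post(f"{BASE}/engine/stop_profile").raise_for_status()
```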
A test script is provided at `examples/backends/sglang/test_sglang_profile.py`:

```bash
python examples/backends/sglang/test_sglang_profile.py
```

View traces using Chrome's `chrome://tracing`, Perfetto UI, or TensorBoard.
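For TensorBoard, a minimal sketch, assuming the traces are standard PyTorch profiler output and you install the `torch-tb-profiler` plugin:

```bash
# Assumption: the files in /tmp/profiler_output are torch.profiler traces.
pip install torch-tb-profiler
tensorboard --logdir /tmp/profiler_output
```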