# Profiler Examples

Complete examples for profiling with DGDRs, the interactive WebUI, and direct script usage.

## DGDR Examples

### Dense Model: AIPerf on Real Engines

Standard online profiling with real GPU measurements:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: vllm-dense-online
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200.0
        itl: 20.0

      hardware:
        minNumGpusPerEngine: 1
        maxNumGpusPerEngine: 8

      sweep:
        useAiConfigurator: false

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"

  autoApply: true
```
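
To submit the request, apply the manifest and watch the DGDR status. A minimal sketch, assuming the manifest is saved as `vllm-dense-online.yaml` and that the CRD's plural resource name (`dynamographdeploymentrequests`) follows its kind:

```bash
# Submit the profiling request (file name is an assumption)
kubectl apply -f vllm-dense-online.yaml -n $NAMESPACE

# Watch progress; the plural resource name is assumed from the CRD kind
kubectl get dynamographdeploymentrequests vllm-dense-online -n $NAMESPACE -w
```

The same pattern applies to the other DGDR examples in this section.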

### Dense Model: AI Configurator Simulation

Fast offline profiling (~30 seconds, TensorRT-LLM only):

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: trtllm-aic-offline
spec:
  model: "Qwen/Qwen3-32B"
  backend: trtllm

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300.0
        itl: 10.0

      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm  # Also supports h100_sxm, b200_sxm, gb200_sxm, a100_sxm
        aicHfId: Qwen/Qwen3-32B
        aicBackendVersion: "0.20.0"

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"

  autoApply: true
```

### MoE Model

Multi-node MoE profiling with SGLang:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sglang-moe
spec:
  model: "deepseek-ai/DeepSeek-R1"
  backend: sglang

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
    config:
      sla:
        isl: 2048
        osl: 512
        ttft: 300.0
        itl: 25.0

      hardware:
        numGpusPerNode: 8
        maxNumGpusPerEngine: 32

      engine:
        isMoeModel: true

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"

  autoApply: true
```

### Using Existing DGD Config (ConfigMap)

Reference a custom DGD configuration via a ConfigMap. First, create the ConfigMap from your config file:

```bash
# Create ConfigMap from your DGD config file
kubectl create configmap deepseek-r1-config \
  --from-file=/path/to/your/disagg.yaml \
  --namespace $NAMESPACE \
  --dry-run=client -o yaml | kubectl apply -f -
```

Then reference the ConfigMap from the DGDR spec:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: deepseek-r1
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
    configMapRef:
      name: deepseek-r1-config
      key: disagg.yaml
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300
        itl: 10
      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm
        aicHfId: deepseek-ai/DeepSeek-V3
        aicBackendVersion: "0.20.0"

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"

  autoApply: true
```

## Interactive WebUI

Launch an interactive configuration selection interface:

```bash
python -m benchmarks.profiler.profile_sla \
  --backend trtllm \
  --config path/to/disagg.yaml \
  --pick-with-webui \
  --use-ai-configurator \
  --model Qwen/Qwen3-32B-FP8 \
  --aic-system h200_sxm \
  --ttft 200 --itl 15
```

The WebUI launches on port 8000 by default (configurable with `--webui-port`).
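
For example, to serve the UI on a different port, reuse the command above and set `--webui-port` explicitly:

```bash
python -m benchmarks.profiler.profile_sla \
  --backend trtllm \
  --config path/to/disagg.yaml \
  --pick-with-webui --webui-port 8080 \
  --use-ai-configurator \
  --model Qwen/Qwen3-32B-FP8 \
  --aic-system h200_sxm \
  --ttft 200 --itl 15
```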

### Features

- **Interactive Charts**: Visualize prefill TTFT, decode ITL, and GPU hours analysis, with hover-to-highlight synchronization between charts and tables
- **Pareto-Optimal Analysis**: The GPU Hours table shows Pareto-optimal configurations that balance latency and throughput
- **DGD Config Preview**: Click “Show Config” on any row to view the corresponding DynamoGraphDeployment YAML
- **GPU Cost Estimation**: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests); see the worked example after this list
- **SLA Visualization**: Red dashed lines indicate your TTFT and ITL targets
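
For example (illustrative numbers): a configuration that consumes 0.5 GPU hours per 1000 requests at a GPU price of $3.00/hour is displayed as 0.5 × 3.00 = $1.50 per 1000 requests.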

### Selection Methods

1. **GPU Hours Table** (recommended): Click any row to select both prefill and decode configurations at once, based on the Pareto-optimal combination
2. **Individual Selection**: Click one row in the Prefill table and one row in the Decode table to choose each manually

### Example DGD Config Output

When you click “Show Config”, you see a DynamoGraphDeployment configuration:

```yaml
# DynamoGraphDeployment Configuration
# Prefill: 1 GPU(s), TP=1
# Decode: 4 GPU(s), TP=4
# Model: Qwen/Qwen3-32B-FP8
# Backend: trtllm
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    PrefillWorker:
      subComponentType: prefill
      replicas: 1
      extraPodSpec:
        mainContainer:
          args:
          - --tensor-parallel-size=1
    DecodeWorker:
      subComponentType: decode
      replicas: 1
      extraPodSpec:
        mainContainer:
          args:
          - --tensor-parallel-size=4
```

Once you select a configuration, the full DGD CRD is saved as `config_with_planner.yaml`.
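
The saved manifest can then be deployed like any other DGD; a minimal sketch, assuming the standard `kubectl` workflow and that `$NAMESPACE` is set:

```bash
# Deploy the configuration selected in the WebUI
kubectl apply -f config_with_planner.yaml -n $NAMESPACE
```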

## Direct Script Examples

### Basic Profiling

```bash
python -m benchmarks.profiler.profile_sla \
  --backend vllm \
  --config path/to/disagg.yaml \
  --model meta-llama/Llama-3-8B \
  --ttft 200 --itl 15 \
  --isl 3000 --osl 150
```

### With GPU Constraints

```bash
python -m benchmarks.profiler.profile_sla \
  --backend sglang \
  --config examples/backends/sglang/deploy/disagg.yaml \
  --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --ttft 200 --itl 15 \
  --isl 3000 --osl 150 \
  --min-num-gpus 2 \
  --max-num-gpus 8
```

### AI Configurator (Offline)

```bash
python -m benchmarks.profiler.profile_sla \
  --backend trtllm \
  --config path/to/disagg.yaml \
  --use-ai-configurator \
  --model Qwen/Qwen3-32B-FP8 \
  --aic-system h200_sxm \
  --ttft 200 --itl 15 \
  --isl 4000 --osl 500
```

### SGLang Runtime Profiling

Profile SGLang workers at runtime via HTTP endpoints:

```bash
# Start profiling
curl -X POST http://localhost:9090/engine/start_profile \
  -H "Content-Type: application/json" \
  -d '{"output_dir": "/tmp/profiler_output"}'

# Run inference requests to generate profiling data...

# Stop profiling
curl -X POST http://localhost:9090/engine/stop_profile
```

A test script is provided at `examples/backends/sglang/test_sglang_profile.py`:

```bash
python examples/backends/sglang/test_sglang_profile.py
```

View traces using Chrome’s `chrome://tracing`, Perfetto UI, or TensorBoard.