Profiler Examples | NVIDIA Dynamo Documentation

Complete examples for profiling with DGDRs.

DGDR Examples

Dense Model: Rapid

Fast profiling (~30 seconds):

1 apiVersion: nvidia.com/v1beta1
2 kind: DynamoGraphDeploymentRequest
3 metadata:
4   name: qwen-0-6b
5 spec:
6   model: "Qwen/Qwen3-0.6B"
7   image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0

Dense Model: Thorough

Profiling with real GPU measurements:

1 apiVersion: nvidia.com/v1beta1
2 kind: DynamoGraphDeploymentRequest
3 metadata:
4   name: vllm-dense-online
5 spec:
6   model: "Qwen/Qwen3-0.6B"
7   backend: vllm
8   image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0
9   searchStrategy: thorough

MoE Model

Multi-node MoE profiling with SGLang:

The PVC referenced by modelCache.pvcName must already exist in the same namespace and contain the model weights at the specified pvcModelPath. The DGDR controller does not create or populate the PVC — it only mounts it into the profiling job and deployed workers.

1 apiVersion: nvidia.com/v1beta1
2 kind: DynamoGraphDeploymentRequest
3 metadata:
4   name: sglang-moe
5 spec:
6   model: "deepseek-ai/DeepSeek-R1"
7   backend: sglang
8   image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0
9 
10   hardware:
11     numGpusPerNode: 8
12 
13   modelCache:
14     pvcName: "model-cache"
15     pvcModelPath: "deepseek-r1"      # path within the PVC

Private Model

For gated or private HuggingFace models, pass your token via an environment variable injected into the profiling job. Create the secret first:

$ kubectl create secret generic hf-token-secret \
>   --from-literal=HF_TOKEN="${HF_TOKEN}" \
>   -n ${NAMESPACE}

Then reference it in your DGDR:

1 apiVersion: nvidia.com/v1beta1
2 kind: DynamoGraphDeploymentRequest
3 metadata:
4   name: llama-private
5 spec:
6   model: "meta-llama/Llama-3.1-8B-Instruct"
7   image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0
8 
9   overrides:
10     profilingJob:
11       template:
12         spec:
13           containers: []    # required placeholder; leave empty to inherit defaults
14           initContainers:
15             - name: profiler
16               env:
17                 - name: HF_TOKEN
18                   valueFrom:
19                     secretKeyRef:
20                       name: hf-token-secret
21                       key: HF_TOKEN

Custom SLA Targets

Control how the profiler optimizes your deployment by specifying latency targets and workload characteristics.

Explicit TTFT + ITL targets (default mode):

1 apiVersion: nvidia.com/v1beta1
2 kind: DynamoGraphDeploymentRequest
3 metadata:
4   name: low-latency-dense
5 spec:
6   model: "Qwen/Qwen3-0.6B"
7   image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0
8 
9   sla:
10     ttft: 500      # Time To First Token target in milliseconds
11     itl: 20        # Inter-Token Latency target in milliseconds
12 
13   workload:
14     isl: 2000      # expected input sequence length (tokens)
15     osl: 500       # expected output sequence length (tokens)

End-to-end latency target (alternative to ttft+itl):

1 spec:
2   ...
3   sla:
4     e2eLatency: 10000    # total request latency budget in milliseconds

Overrides

Use overrides to customize the profiling job pod spec — for example to add tolerations for GPU node taints or inject environment variables.

GPU node toleration (common on GKE and shared clusters):

1 apiVersion: nvidia.com/v1beta1
2 kind: DynamoGraphDeploymentRequest
3 metadata:
4   name: dense-with-tolerations
5 spec:
6   model: "Qwen/Qwen3-0.6B"
7   image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0
8 
9   overrides:
10     profilingJob:
11       template:
12         spec:
13           containers: []    # required placeholder; leave empty to inherit defaults
14           tolerations:
15             - key: nvidia.com/gpu
16               operator: Exists
17               effect: NoSchedule

Override the generated DynamoGraphDeployment (e.g., to inject worker environment variables):

1 spec:
2   ...
3   overrides:
4     dgd:
5       apiVersion: nvidia.com/v1alpha1
6       kind: DynamoGraphDeployment
7       spec:
8         envs:
9           - name: TRITON_PTXAS_PATH
10             value: "/usr/local/cuda/bin/ptxas"
11         services:
12           VllmWorker:
13             envs:
14               - name: CUSTOM_ENV
15                 value: "my-value"

SGLang Runtime Profiling

Profile SGLang workers at runtime via HTTP endpoints:

$ # Start profiling
$ curl -X POST http://localhost:9090/engine/start_profile \
>   -H "Content-Type: application/json" \
>   -d '{"output_dir": "/tmp/profiler_output"}'
$ 
$ # Run inference requests to generate profiling data...
$ 
$ # Stop profiling
$ curl -X POST http://localhost:9090/engine/stop_profile

A test script is provided at examples/backends/sglang/test_sglang_profile.py:

$ python examples/backends/sglang/test_sglang_profile.py

View traces using Chrome’s chrome://tracing, Perfetto UI, or TensorBoard.