Copyright © 2026, NVIDIA Corporation.

Profiler Examples

Complete examples for profiling with DGDRs.

DGDR Examples

Dense Model: Rapid

Fast profiling (~30 seconds):

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen-0-6b
spec:
  model: "Qwen/Qwen3-0.6B"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"

Dense Model: Thorough

Profiling with real GPU measurements:

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: vllm-dense-online
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
  searchStrategy: thorough

MoE Model

The PVC referenced by modelCache.pvcName must already exist in the same namespace and contain the model weights at the specified pvcModelPath. The DGDR controller does not create or populate the PVC; it only mounts it into the profiling job and deployed workers.

Multi-node MoE profiling with SGLang:

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sglang-moe
spec:
  model: "deepseek-ai/DeepSeek-R1"
  backend: sglang
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"

  hardware:
    numGpusPerNode: 8

  modelCache:
    pvcName: "model-cache"
    pvcModelPath: "deepseek-r1"  # path within the PVC

Private Model

For gated or private HuggingFace models, pass your token via an environment variable injected into the profiling job. Create the secret first:

kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}" \
  -n ${NAMESPACE}

Then reference it in your DGDR:

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: llama-private
spec:
  model: "meta-llama/Llama-3.1-8B-Instruct"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"

  overrides:
    profilingJob:
      template:
        spec:
          containers: []  # required placeholder; leave empty to inherit defaults
          initContainers:
            - name: profiler
              env:
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-token-secret
                      key: HF_TOKEN
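
If you generate DGDRs programmatically, the same manifest can be assembled as a plain dictionary and emitted as JSON, which kubectl apply accepts as well as YAML. The following is a minimal sketch, not part of Dynamo: the helper names hf_token_env and private_model_dgdr are hypothetical, and the field layout simply mirrors the example above.

```python
import json


def hf_token_env(secret_name: str = "hf-token-secret", key: str = "HF_TOKEN") -> dict:
    """Env entry that reads HF_TOKEN from a Kubernetes secret (hypothetical helper)."""
    return {
        "name": "HF_TOKEN",
        "valueFrom": {"secretKeyRef": {"name": secret_name, "key": key}},
    }


def private_model_dgdr(name: str, model: str, image: str) -> dict:
    """Build a DGDR dict with the same fields as the Private Model example."""
    return {
        "apiVersion": "nvidia.com/v1beta1",
        "kind": "DynamoGraphDeploymentRequest",
        "metadata": {"name": name},
        "spec": {
            "model": model,
            "image": image,
            "overrides": {
                "profilingJob": {
                    "template": {
                        "spec": {
                            # Required placeholder; left empty to inherit defaults.
                            "containers": [],
                            "initContainers": [
                                {"name": "profiler", "env": [hf_token_env()]}
                            ],
                        }
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = private_model_dgdr(
        name="llama-private",
        model="meta-llama/Llama-3.1-8B-Instruct",
        image="nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0",
    )
    print(json.dumps(manifest, indent=2))
```

Saved under a file name of your choosing, the script's output can be piped into kubectl apply -n ${NAMESPACE} -f - to submit the request.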

Custom SLA Targets

Control how the profiler optimizes your deployment by specifying latency targets and workload characteristics.

Explicit TTFT + ITL targets (default mode):

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: low-latency-dense
spec:
  model: "Qwen/Qwen3-0.6B"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"

  sla:
    ttft: 500  # Time To First Token target in milliseconds
    itl: 20    # Inter-Token Latency target in milliseconds

  workload:
    isl: 2000  # expected input sequence length (tokens)
    osl: 500   # expected output sequence length (tokens)

End-to-end latency target (alternative to ttft+itl):

spec:
  ...
  sla:
    e2eLatency: 10000  # total request latency budget in milliseconds
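
When choosing between the two styles of target, it can help to estimate what end-to-end latency a TTFT + ITL pair implies for a given output length. The sketch below uses a common back-of-the-envelope approximation (first token, then one ITL per remaining token); it is an assumption for illustration, not a description of how the profiler evaluates e2eLatency, and the function name is hypothetical.

```python
def estimated_e2e_ms(ttft_ms: float, itl_ms: float, osl: int) -> float:
    """Approximate end-to-end latency in milliseconds.

    Assumes the first token arrives after ttft_ms and each of the
    remaining (osl - 1) tokens takes itl_ms.
    """
    if osl < 1:
        raise ValueError("osl must be at least 1")
    return ttft_ms + itl_ms * (osl - 1)


if __name__ == "__main__":
    # Values from the explicit TTFT + ITL example: ttft=500, itl=20, osl=500.
    print(estimated_e2e_ms(ttft_ms=500, itl_ms=20, osl=500))  # 10480
```

Under this approximation, the 10000 ms budget in the e2eLatency example is slightly tighter than the 500 ms / 20 ms targets would be at an output length of 500 tokens.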

Optimization objective without explicit targets (maximize throughput or minimize latency):

spec:
  ...
  sla:
    optimizationType: throughput  # or: latency

Overrides

Use overrides to customize the profiling job pod spec — for example to add tolerations for GPU node taints or inject environment variables.

GPU node toleration (common on GKE and shared clusters):

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: dense-with-tolerations
spec:
  model: "Qwen/Qwen3-0.6B"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"

  overrides:
    profilingJob:
      template:
        spec:
          containers: []  # required placeholder; leave empty to inherit defaults
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule

Override the generated DynamoGraphDeployment, e.g., to use a custom worker image or, as in this example, to add an environment variable to a worker:

spec:
  ...
  overrides:
    dgd:
      apiVersion: nvidia.com/v1alpha1
      kind: DynamoGraphDeployment
      spec:
        services:
          VllmWorker:
            extraEnvs:
              - name: CUSTOM_ENV
                value: "my-value"

SGLang Runtime Profiling

Profile SGLang workers at runtime via HTTP endpoints:

# Start profiling
curl -X POST http://localhost:9090/engine/start_profile \
  -H "Content-Type: application/json" \
  -d '{"output_dir": "/tmp/profiler_output"}'

# Run inference requests to generate profiling data...

# Stop profiling
curl -X POST http://localhost:9090/engine/stop_profile
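
The same start/stop sequence can be driven from Python, which makes it easier to guarantee that profiling is stopped even if the inference traffic fails. This is a minimal sketch using only the standard library; it assumes the endpoints, port, and request body shown in the curl commands above, and the helper names are hypothetical rather than part of Dynamo or SGLang.

```python
import json
import urllib.request
from contextlib import contextmanager

BASE_URL = "http://localhost:9090"  # host and port taken from the curl example


def start_profile_request(output_dir: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request that starts profiling."""
    body = json.dumps({"output_dir": output_dir}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/engine/start_profile",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stop_profile_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the body-less POST request that stops profiling."""
    return urllib.request.Request(f"{base_url}/engine/stop_profile", method="POST")


@contextmanager
def profiling(output_dir: str, base_url: str = BASE_URL, timeout: float = 30.0):
    """Start profiling on entry and always stop it on exit."""
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(start_profile_request(output_dir, base_url), timeout=timeout):
        pass
    try:
        yield
    finally:
        with urllib.request.urlopen(stop_profile_request(base_url), timeout=timeout):
            pass


# Usage (send_inference_requests is a hypothetical placeholder for your own traffic):
#   with profiling("/tmp/profiler_output"):
#       send_inference_requests()
```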

A test script is provided at examples/backends/sglang/test_sglang_profile.py:

python examples/backends/sglang/test_sglang_profile.py

View traces using Chrome’s chrome://tracing, Perfetto UI, or TensorBoard.