This guide covers deployment, configuration, integration, and troubleshooting for the Dynamo Profiler.
A DynamoGraphDeploymentRequest (DGDR) is a Kubernetes Custom Resource that serves as the primary interface for users to request model deployments with specific performance and resource constraints. You specify:
Model to deploy (model)
SLA targets (ttft, itl)
Inference backend (backend: vllm, sglang, or trtllm)
Container images (profilingConfig.profilerImage, deploymentOverrides.workersImage)

The Dynamo Operator watches for DGDRs and automatically:
Relationship to DGD:
The profiler sweeps over the following parallelization mappings for prefill and decode:
Support for a given combination of model and parallelization mapping depends on the backend. The profiler does not guarantee that the backend supports the recommended P/D engine configuration or runs it without bugs.
The recommended deployment method is through DGDRs. Sample configurations are provided in benchmarks/profiler/deploy/:
Each DGDR requires container images for profiling and deployment:
profilingConfig.profilerImage (Required): Container image for the profiling job. Must contain the profiler code and dependencies.
deploymentOverrides.workersImage (Optional): Container image for DGD worker components (frontend, workers, planner). If omitted, uses image from the base config file.

Step 1: Create Your DGDR
Use a sample configuration or create your own:
Step 2: Apply the DGDR
Step 3: Monitor Progress
DGDR Status States:
Pending: Initial state, preparing to profile
Profiling: Running profiling job (20-30 seconds for AIC, 2-4 hours for online)
Deploying: Generating and applying DGD configuration
Ready: DGD successfully deployed and running
Failed: Error occurred (check events for details)

Step 4: Access Your Deployment
DGDRs are immutable. To update SLAs or configuration, delete the existing DGDR and create a new one.
For advanced use cases or local development:
The profiler follows a 5-step process:
attention_dp_size × attn_dp_num_req_ratio (defaults to 4) and compute the reported TTFT as time_to_first_token.max / attn_dp_num_req_ratio from the AIPerf summary of that burst.
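The attention-DP arithmetic in this step can be sketched in a few lines of Python. The function names and example numbers are hypothetical. Only the two formulas (the burst size, and dividing time_to_first_token.max by attn_dp_num_req_ratio) come from the text.

```python
def burst_size(attention_dp_size: int, attn_dp_num_req_ratio: int = 4) -> int:
    """Requests sent in one burst: attention_dp_size x attn_dp_num_req_ratio."""
    if attention_dp_size < 1 or attn_dp_num_req_ratio < 1:
        raise ValueError("both arguments must be positive integers")
    return attention_dp_size * attn_dp_num_req_ratio


def reported_ttft_ms(max_ttft_ms: float, attn_dp_num_req_ratio: int = 4) -> float:
    """Reported TTFT: time_to_first_token.max of the burst divided by the ratio."""
    if attn_dp_num_req_ratio < 1:
        raise ValueError("attn_dp_num_req_ratio must be a positive integer")
    return max_ttft_ms / attn_dp_num_req_ratio


# Hypothetical example: 8 attention-DP ranks with the default ratio of 4.
print(burst_size(8))             # 32
print(reported_ttft_ms(1200.0))  # 300.0
```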


Profiles your model by creating real test deployments in Kubernetes and measuring their performance.
Uses performance simulation to rapidly estimate optimal configurations without running real deployments.
aicBackendVersion specifies the TensorRT-LLM version that AI Configurator simulates. See the AI Configurator supported features for available versions.
Currently supports:
See AI Configurator documentation for the full list.
Cluster-scoped operators can optionally enable automatic GPU discovery:
This is only available with cluster-scoped operators (namespaceRestriction.enabled=false) as it requires cluster-wide node access permissions.
All profiler configuration goes under spec.profilingConfig.config:
aicSystem in the sweep configuration instead

If you don’t specify hardware constraints, the controller auto-detects based on your model size and available cluster resources.
true for 20-30 second profiling (TensorRT-LLM only)

Required if useAiConfigurator: true:
Pass arguments to the SLA planner:
Planner arguments use the planner_ prefix. See the SLA Planner documentation for the full list.
For large models, use a pre-populated PVC containing model weights instead of downloading from HuggingFace:
Requirements:
{mountPath}/{pvcPath}

The controller automatically injects these from high-level fields:
You should not manually set deployment.model or engine.backend in profilingConfig.config.
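As a rough illustration of that rule, the sketch below builds a DGDR manifest as a Python dict with model and backend set only as top-level spec fields, leaving profilingConfig.config free of deployment.model and engine.backend. The helper name, the apiVersion value, the grouping of ttft and itl under an sla key, and all example values are assumptions for illustration. Check them against the sample configurations and the CRD installed in your cluster.

```python
import json

SUPPORTED_BACKENDS = ("vllm", "sglang", "trtllm")


def build_dgdr(name, model, backend, profiler_image, ttft, itl):
    """Build a DGDR manifest as a plain dict (illustrative layout only)."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend}")
    return {
        "apiVersion": "nvidia.com/v1alpha1",  # assumption: verify against your CRD
        "kind": "DynamoGraphDeploymentRequest",
        "metadata": {"name": name},
        "spec": {
            # High-level fields: the controller injects these into the profiler config.
            "model": model,
            "backend": backend,
            "profilingConfig": {
                "profilerImage": profiler_image,
                # Assumption: SLA targets grouped under "sla".
                # Note: no deployment.model or engine.backend here.
                "config": {"sla": {"ttft": ttft, "itl": itl}},
            },
        },
    }


# Hypothetical values throughout.
manifest = build_dgdr(
    name="example-dgdr",
    model="example-org/example-model",
    backend="vllm",
    profiler_image="registry.example.com/profiler:tag",
    ttft=200.0,
    itl=20.0,
)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be saved to a file and applied once the field names have been confirmed.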
Reference an existing DGD config via ConfigMap:
The profiler uses the DGD config as a base template, then optimizes it based on your SLA targets.
CLI arguments map to DGDR config fields: --min-num-gpus = hardware.minNumGpusPerEngine, --max-num-gpus = hardware.maxNumGpusPerEngine, --use-ai-configurator = sweep.useAiConfigurator. See DGDR Configuration Structure for all field mappings.
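The three mappings listed above can be expressed as a small lookup table. This is a minimal sketch, not the profiler's own argument handling: the function name and the dict-based input are hypothetical, and only the three flag-to-field pairs come from the text.

```python
# Flag-to-field pairs taken from the text; other flags are not covered here.
CLI_TO_CONFIG = {
    "--min-num-gpus": "hardware.minNumGpusPerEngine",
    "--max-num-gpus": "hardware.maxNumGpusPerEngine",
    "--use-ai-configurator": "sweep.useAiConfigurator",
}


def cli_to_config(cli_args: dict) -> dict:
    """Translate {flag: value} pairs into a nested profilingConfig.config dict."""
    config = {}
    for flag, value in cli_args.items():
        path = CLI_TO_CONFIG.get(flag)
        if path is None:
            raise KeyError(f"no known mapping for {flag}")
        section, field = path.split(".")
        config.setdefault(section, {})[field] = value
    return config


print(cli_to_config({
    "--min-num-gpus": 1,
    "--max-num-gpus": 8,
    "--use-ai-configurator": True,
}))
# {'hardware': {'minNumGpusPerEngine': 1, 'maxNumGpusPerEngine': 8}, 'sweep': {'useAiConfigurator': True}}
```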
The Profiler generates interpolation data that the SLA Planner uses for autoscaling decisions.
Prefill Interpolation (selected_prefill_interpolation/raw_data.npz):
prefill_isl: 1D array of input sequence lengths tested
prefill_ttft: 1D array of TTFTs (ms) at each ISL
prefill_thpt_per_gpu: 1D array of throughput (tokens/s/GPU) at each ISL

Decode Interpolation (selected_decode_interpolation/raw_data.npz):
max_kv_tokens: Total KV token capacity of the decode engine
x_kv_usage: 1D array of active KV usage fractions in [0, 1]
y_context_length: 1D array of average context lengths tested
z_itl: 1D array of ITLs (ms) at each (KV usage, context length) point
z_thpt_per_gpu: 1D array of throughput (tokens/s/GPU) at each point

When using DGDR, the Dynamo Operator:
planner-profile-data)

The generated DGD is tracked via labels:
Monitor profiling jobs:
Disable auto-deployment to review the generated DGD before applying:
Then manually extract and apply:
Deploy a mocker deployment that simulates engines without GPUs:
Profiling still runs against the real backend to collect performance data. The mocker uses this data to simulate realistic timing behavior. Useful for large-scale experiments, testing Planner behavior, and validating configurations.
By default, profiling data is stored in ConfigMaps. For detailed artifacts (plots, logs, raw data), attach a PVC:
ConfigMaps (always created):
dgdr-output-<name>: Generated DGD configuration
planner-profile-data: Profiling data for Planner (JSON)

PVC artifacts (optional):
.npz files)

Access PVC results:
The profiler generates plots to visualize performance data:
Parallelization Mapping Sweep Plots:
prefill_performance.png: TTFT vs Parallelization Mapping size
decode_performance.png: ITL vs Parallelization Mapping size and in-flight requests

In-Depth Profiling Plots:
selected_prefill_interpolation/prefill_ttft_interpolation.png: TTFT vs ISL
selected_prefill_interpolation/prefill_throughput_interpolation.png: Throughput vs ISL
selected_decode_interpolation/decode_itl_interplation.png: ITL vs KV usage and context length
selected_decode_interpolation/decode_throughput_interpolation.png: Throughput vs KV usage and context length

SGLang workers expose profiling endpoints for runtime performance analysis:
View traces using Chrome’s chrome://tracing, Perfetto UI, or TensorBoard.
Solution 1: Use AI Configurator for rapid profiling (TensorRT-LLM only):
Solution 2: Reduce search space:
Symptoms: Profiler reports no configuration meets targets
Solutions:
Symptoms: Profiling fails with error:
Cause: AI Configurator requires ≥4 attention heads per GPU. Small models with few heads cannot use high TP sizes.
Affected Models:
Solution: Limit maxNumGpusPerEngine:
Calculate Max TP: max_tp = num_attention_heads / 4
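The calculation can be written as a short helper. This is a sketch under two assumptions: the function name is hypothetical, and rounding down when the head count is not a multiple of 4 is a choice made here, since the text only gives the division.

```python
def max_tp(num_attention_heads: int, min_heads_per_gpu: int = 4) -> int:
    """Largest TP size that keeps at least min_heads_per_gpu attention heads per GPU.

    Returns 0 when even TP=1 cannot satisfy the constraint.
    """
    if num_attention_heads < 1 or min_heads_per_gpu < 1:
        raise ValueError("both arguments must be positive integers")
    # Floor division is an assumption for head counts that are not multiples of 4.
    return num_attention_heads // min_heads_per_gpu


# Hypothetical model with 16 attention heads.
print(max_tp(16))  # 4
```

With 16 heads, the result suggests setting maxNumGpusPerEngine to at most 4 for this constraint.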
Symptoms: ErrImagePull or ImagePullBackOff
Solution: Ensure image pull secrets are configured:
Symptoms: OOM errors in profiling jobs
Solutions:
gpu_memory_utilization in engine config
--max-context-length

Symptoms: Startup/runtime error in the backend (e.g., prime number of attention heads constraining TP to 1, or backend not supporting different TP sizes for prefill and decode).
Solutions: