Copyright © 2026, NVIDIA Corporation.


Sizing with AIConfigurator

Compare aggregated and disaggregated layouts before deployment

This page focuses on using AIConfigurator to size aggregated and disaggregated Dynamo deployments. For the serving architecture and deployment-path overview, start with Disaggregated Serving.

AIConfigurator is a performance optimization tool that helps you find a strong starting configuration for deploying LLMs with Dynamo. Given a supported model, GPU system, backend, and SLA target, it searches aggregated and disaggregated layouts and can generate deployment artifacts for the selected target.

When to Use AIConfigurator

When deploying LLMs with Dynamo, you need to make several critical decisions:

  • Aggregated vs Disaggregated: Which architecture gives better performance for your workload?
  • Worker Configuration: How many prefill and decode workers to deploy?
  • Parallelism Settings: What tensor/pipeline parallel configuration to use?
  • SLA Compliance: How to meet your TTFT and TPOT targets?

AIConfigurator is useful when you want:

  • candidate configurations that are filtered against your SLA requirements
  • generated Dynamo configuration files and Kubernetes manifests
  • performance comparisons between aggregated and disaggregated strategies
  • a support check for a model/system/backend combination before you tune by hand

Exact runtime and throughput gains depend on the model, hardware, backend, traffic shape, and available performance data. Treat AIConfigurator output as a validated starting point, then benchmark the generated configuration in your cluster.

End-to-End Workflow

[Figure: AIConfigurator end-to-end workflow]

Aggregated vs Disaggregated Architecture

AIConfigurator evaluates two deployment architectures and recommends the best one for your workload:

[Figure: Aggregated vs Disaggregated architecture comparison]

When to Use Each Architecture

[Figure: Decision flowchart for choosing aggregated vs disaggregated]

Quick Start

# Install
pip3 install aiconfigurator

# Optional: check whether the model/system/backend is covered
aiconfigurator cli support \
  --model-path Qwen/Qwen3-32B-FP8 \
  --system h200_sxm \
  --backend vllm

# Find optimal configuration for vLLM backend
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 16.67 \
  --database-mode SILICON \
  --deployment-target dynamo-j2 \
  --save-dir ./results_vllm

# Deploy on Kubernetes (path follows the generated directory layout shown in Step 3)
kubectl apply -f ./results_vllm/agg/top1/k8s_deploy.yaml

Complete Walkthrough: vLLM on H200

This section walks through a validated example deploying Qwen3-32B-FP8 on 8× H200 GPUs using vLLM.

Step 1: Run AIConfigurator

aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --system h200_sxm \
  --total-gpus 8 \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 25 \
  --backend vllm \
  --backend-version 0.12.0 \
  --deployment-target dynamo-j2 \
  --generator-set K8sConfig.k8s_namespace=$YOUR_NAMESPACE \
  --generator-set K8sConfig.k8s_pvc_name=$YOUR_PVC \
  --save-dir ./results_vllm

Parameters explained:

  • --model-path: HuggingFace model ID or local path (e.g., Qwen/Qwen3-32B-FP8). --model is also accepted as an alias.
  • --system: GPU system type (h200_sxm, h100_sxm, a100_sxm)
  • --total-gpus: Number of GPUs available for deployment
  • --isl / --osl: Input/Output sequence lengths in tokens
  • --ttft / --tpot: SLA targets in milliseconds: Time To First Token (TTFT) and Time Per Output Token (TPOT)
  • --backend: Inference backend (vllm, trtllm, or sglang)
  • --backend-version: Backend version (e.g., 0.12.0 for vLLM)
  • --deployment-target: Artifact target. dynamo-j2 generates Dynamo Kubernetes manifests; other targets are available in the upstream CLI.
  • --save-dir: Directory to save generated deployment configs
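When you want to rerun the same search with different SLA targets, it can help to build the command from a parameter table. The following is a minimal sketch that only assembles an argument list from the flags documented above; `build_aic_command` is a hypothetical helper, not part of AIConfigurator.

```python
import shlex


def build_aic_command(params):
    """Turn a {flag: value} mapping into an argv list for `aiconfigurator cli default`."""
    argv = ["aiconfigurator", "cli", "default"]
    for flag, value in params.items():
        argv += [f"--{flag}", str(value)]
    return argv


# Values copied from the Step 1 command above.
params = {
    "model-path": "Qwen/Qwen3-32B-FP8",
    "system": "h200_sxm",
    "total-gpus": 8,
    "isl": 4000,
    "osl": 500,
    "ttft": 600,
    "tpot": 25,
    "backend": "vllm",
    "backend-version": "0.12.0",
    "deployment-target": "dynamo-j2",
    "save-dir": "./results_vllm",
}

command = build_aic_command(params)
# Print a shell-safe, copy-pastable command line.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` if you prefer to launch the search from Python.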

Step 2: Review the Results

AIConfigurator outputs a comparison of aggregated vs disaggregated deployment strategies:

********************************************************************************
* Dynamo aiconfigurator Final Results *
********************************************************************************
----------------------------------------------------------------------------
Input Configuration & SLA Target:
Model: Qwen/Qwen3-32B-FP8 (is_moe: False)
Total GPUs: 8
Best Experiment Chosen: disagg at 446.85 tokens/s/gpu (disagg 1.38x better)
----------------------------------------------------------------------------
Overall Best Configuration:
- Best Throughput: 3,574.80 tokens/s
- Per-GPU Throughput: 446.85 tokens/s/gpu
- Per-User Throughput: 53.58 tokens/s/user
- TTFT: 453.18ms
- TPOT: 18.66ms
- Request Latency: 9766.51ms
----------------------------------------------------------------------------
Pareto Frontier:
Qwen/Qwen3-32B-FP8 Pareto Frontier: tokens/s/gpu_cluster vs tokens/s/user
[ASCII Pareto plot: y-axis tokens/s/gpu_cluster (0.0 to 850.0), x-axis tokens/s/user (0 to 120); series: • agg, f disagg, x disagg best]
----------------------------------------------------------------------------
Deployment Details:
(p) stands for prefill, (d) stands for decode, bs stands for batch size, a replica stands for the smallest scalable unit xPyD of the disagg system
Some math: total gpus used = replicas * gpus/replica
gpus/replica = (p)gpus/worker * (p)workers + (d)gpus/worker * (d)workers; for Agg, gpus/replica = gpus/worker
gpus/worker = tp * pp * dp = etp * ep * pp for MoE models; tp * pp for dense models (underlined numbers are the actual values in math)
agg Top Configurations: (Sorted by tokens/s/gpu)
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
| Rank | backend | tokens/s/gpu | tokens/s/user | TTFT | request_latency | concurrency | total_gpus (used) | replicas | gpus/replica | gpus/worker | parallel | bs |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
| 1 | vllm | 322.69 | 41.78 | 546.92 | 12490.03 | 64 (=32x2) | 8 (8=2x4) | 2 | 4 | 4 (=4x1x1) | tp4pp1 | 32 |
| 2 | vllm | 293.94 | 44.43 | 593.10 | 11823.67 | 56 (=14x4) | 8 (8=4x2) | 4 | 2 | 2 (=2x1x1) | tp2pp1 | 14 |
| 3 | vllm | 208.87 | 42.90 | 460.58 | 12093.52 | 40 (=40x1) | 8 (8=1x8) | 1 | 8 | 8 (=8x1x1) | tp8pp1 | 40 |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
disagg Top Configurations: (Sorted by tokens/s/gpu)
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
| Rank | backend | tokens/s/gpu | tokens/s/user | TTFT | request_latency | concurrency | total_gpus (used) | replicas | gpus/replica | (p)workers | (p)gpus/worker | (p)parallel | (p)bs | (d)workers | (d)gpus/worker | (d)parallel | (d)bs |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
| 1 | vllm | 446.85 | 53.58 | 453.18 | 9766.51 | 76 (=76x1) | 8 (8=1x8) | 1 | 8 (=2x2+1x4) | 2 | 2 (=2x1) | tp2pp1 | 1 | 1 | 4 (=4x1) | tp4pp1 | 76 |
| 2 | vllm | 446.85 | 41.14 | 453.18 | 12581.87 | 144 (=72x2) | 8 (8=2x4) | 2 | 4 (=1x2+1x2) | 1 | 2 (=2x1) | tp2pp1 | 1 | 1 | 2 (=2x1) | tp2pp1 | 72 |
| 3 | vllm | 333.73 | 40.22 | 453.18 | 12860.32 | 72 (=36x2) | 8 (8=2x4) | 2 | 4 (=1x2+2x1) | 1 | 2 (=2x1) | tp2pp1 | 1 | 2 | 1 (=1x1) | tp1pp1 | 18 |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
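The GPU accounting under "Deployment Details" can be checked by hand. The sketch below restates those formulas for dense models and applies them to rows of the tables above; the function names are illustrative and not part of AIConfigurator.

```python
def gpus_per_worker(tp, pp=1, dp=1):
    """gpus/worker = tp * pp * dp (dp is 1 for the dense configurations shown)."""
    return tp * pp * dp


def gpus_per_replica_disagg(p_workers, p_gpus_per_worker, d_workers, d_gpus_per_worker):
    """gpus/replica = (p)gpus/worker * (p)workers + (d)gpus/worker * (d)workers."""
    return p_workers * p_gpus_per_worker + d_workers * d_gpus_per_worker


# disagg Rank 1: 2 prefill workers at tp2pp1, 1 decode worker at tp4pp1, 1 replica
disagg_rank1_replica = gpus_per_replica_disagg(2, gpus_per_worker(tp=2), 1, gpus_per_worker(tp=4))
disagg_rank1_total = 1 * disagg_rank1_replica

# disagg Rank 3: 1 prefill worker at tp2pp1, 2 decode workers at tp1pp1, 2 replicas
disagg_rank3_replica = gpus_per_replica_disagg(1, gpus_per_worker(tp=2), 2, gpus_per_worker(tp=1))
disagg_rank3_total = 2 * disagg_rank3_replica

# agg Rank 1: for aggregated serving, gpus/replica equals gpus/worker (tp4pp1), 2 replicas
agg_rank1_total = 2 * gpus_per_worker(tp=4)

print(disagg_rank1_replica, disagg_rank1_total)
print(disagg_rank3_replica, disagg_rank3_total)
print(agg_rank1_total)
```

Each total comes out to the 8 GPUs passed with --total-gpus, matching the "total_gpus (used)" column.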

Reading the output:

  • tokens/s/gpu: Overall throughput efficiency — higher is better
  • tokens/s/user: Per-request generation speed (inverse of TPOT)
  • TTFT: Predicted time to first token
  • concurrency: Total concurrent requests across all replicas (e.g., 56 (=14x4) means batch size 14 × 4 replicas)
  • agg Rank 1 recommends TP4 with 2 replicas — simpler to deploy
  • disagg Rank 1 recommends 2 prefill workers (TP2) + 1 decode worker (TP4) — higher throughput but requires RDMA
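The relationships in this list can be verified against the numbers in the report. The sketch below is illustrative: the function names are hypothetical, and the request-latency formula (TTFT plus TPOT for each remaining output token) is an assumption that happens to agree with the reported values, not a documented definition.

```python
def tpot_ms(tokens_per_s_per_user):
    """TPOT in milliseconds is the inverse of per-user generation speed."""
    return 1000.0 / tokens_per_s_per_user


def estimated_request_latency_ms(ttft_ms, osl, tokens_per_s_per_user):
    """Assumed model: the first token arrives at TTFT, the other osl - 1 tokens at TPOT each."""
    return ttft_ms + (osl - 1) * tpot_ms(tokens_per_s_per_user)


# disagg Rank 1 from the report: 53.58 tokens/s/user, TTFT 453.18 ms, OSL 500
disagg_tpot = tpot_ms(53.58)
latency = estimated_request_latency_ms(453.18, 500, 53.58)

# Cluster throughput is per-GPU throughput times the GPUs used.
cluster_throughput = 446.85 * 8

# agg Rank 2 concurrency: batch size 14 on each of 4 replicas
total_concurrency = 14 * 4

print(f"TPOT {disagg_tpot:.2f} ms, latency {latency:.0f} ms")
print(f"throughput {cluster_throughput:.2f} tokens/s, concurrency {total_concurrency}")
```

The computed TPOT and throughput match the "Overall Best Configuration" block, and the latency estimate lands within a millisecond of the reported 9766.51 ms.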

Step 3: Deploy on Kubernetes

AIConfigurator writes the generated artifacts, including Kubernetes manifests, under the --save-dir directory:

├── agg
│ ├── best_config_topn.csv
│ ├── exp_config.yaml
│ ├── pareto.csv
│ ├── top1
│ │ ├── agg_config.yaml
│ │ ├── bench_run.sh # aiperf benchmark sweep script (bare-metal)
│ │ ├── generator_config.yaml
│ │ ├── k8s_bench.yaml # aiperf benchmark sweep Job (Kubernetes)
│ │ ├── k8s_deploy.yaml # Kubernetes DynamoGraphDeployment
│ │ └── run_0.sh
│ ...
├── disagg
│ ├── best_config_topn.csv
│ ├── exp_config.yaml
│ ├── pareto.csv
│ ├── top1
│ │ ├── bench_run.sh # aiperf benchmark sweep script (bare-metal)
│ │ ├── decode_config.yaml
│ │ ├── generator_config.yaml
│ │ ├── k8s_bench.yaml # aiperf benchmark sweep Job (Kubernetes)
│ │ ├── k8s_deploy.yaml # Kubernetes DynamoGraphDeployment
│ │ ├── prefill_config.yaml
│ │ ├── run_0.sh
│ │ └── run_1.sh (for multi-node setups)
│ ...
└── pareto_frontier.png

Prerequisites

Before deploying, ensure you have:

  1. HuggingFace Token Secret (for gated models):

    kubectl create secret generic hf-token-secret \
      -n your-namespace \
      --from-literal=HF_TOKEN="your-huggingface-token"
  2. Model Cache PVC (recommended for faster restarts):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-cache
      namespace: your-namespace
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 100Gi

Deploy the Configuration

The generated k8s_deploy.yaml provides a starting point. You’ll typically need to customize it for your environment:

kubectl apply -f ./results_vllm/agg/top1/k8s_deploy.yaml

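Complete deployment example with model cache and production settings. It uses the TP2 × 4-replica aggregated layout (agg Rank 2 in the results above), which is also the layout validated in Step 4: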

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dynamo-agg
  namespace: your-namespace
spec:
  backendFramework: vllm
  pvcs:
    - name: model-cache
      create: false # Use existing PVC
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
          imagePullPolicy: IfNotPresent

    VLLMWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 4
      resources:
        limits:
          gpu: "2"
      sharedMemory:
        size: 16Gi # Required for vLLM
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
          workingDir: /workspace
          imagePullPolicy: IfNotPresent
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - "Qwen/Qwen3-32B-FP8"
            - "--no-enable-prefix-caching"
            - "--tensor-parallel-size"
            - "2"
            - "--pipeline-parallel-size"
            - "1"
            - "--data-parallel-size"
            - "1"
            - "--kv-cache-dtype"
            - "fp8"
            - "--max-model-len"
            - "6000"
            - "--max-num-seqs"
            - "1024"

Key deployment settings:

| Setting | Purpose | Notes |
|---|---|---|
| backendFramework: vllm | Tells Dynamo which runtime to use | Required at spec level |
| pvcs + volumeMounts | Caches model weights across restarts | Mount at /opt/models (not /root/) |
| HF_HOME env var | Points HuggingFace to cache location | Must match mountPoint |
| sharedMemory.size: 16Gi | IPC memory for vLLM | 16Gi for vLLM, 80Gi for TRT-LLM |
| envFromSecret | Injects HF_TOKEN | Required for gated models |
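When you hand-edit the manifest, the worker's GPU limit, its parallelism flags, and the replica count need to stay consistent with each other and with the GPU budget. The sketch below checks that for the worker settings in the example above; `flag_value` and `check_worker` are hypothetical helpers, and the rule that limits.gpu should equal tp × pp × dp is taken from the "gpus/worker" formula in the AIConfigurator report.

```python
def flag_value(args, flag, default=1):
    """Return the integer that follows `flag` in an argv-style list."""
    return int(args[args.index(flag) + 1]) if flag in args else default


def check_worker(replicas, gpu_limit, args, total_gpus):
    """List mismatches between parallelism flags, GPU limits, and the GPU budget."""
    tp = flag_value(args, "--tensor-parallel-size")
    pp = flag_value(args, "--pipeline-parallel-size")
    dp = flag_value(args, "--data-parallel-size")
    needed = tp * pp * dp
    problems = []
    if needed != gpu_limit:
        problems.append(f"worker needs {needed} GPUs but limits.gpu is {gpu_limit}")
    if replicas * gpu_limit > total_gpus:
        problems.append(f"{replicas} replicas x {gpu_limit} GPUs exceeds {total_gpus} GPUs")
    return problems


# Values copied from the VLLMWorker service in the example manifest.
vllm_args = [
    "--tensor-parallel-size", "2",
    "--pipeline-parallel-size", "1",
    "--data-parallel-size", "1",
]
problems = check_worker(replicas=4, gpu_limit=2, args=vllm_args, total_gpus=8)
print(problems or "manifest is consistent")
```

An empty list means 4 replicas × 2 GPUs fits the 8-GPU budget and matches TP2.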

Step 4: Validate with AIPerf

After deployment, validate the predictions against actual performance using AIPerf.

ℹ️ Run AIPerf inside the cluster to avoid network latency affecting measurements.

AIConfigurator (AIC) automatically generates AIPerf scripts alongside the Dynamo configs and stores them in the results folder when --save-dir is specified. For Kubernetes deployments, run benchmarks with k8s_bench.yaml; for bare-metal systems, use the bench_run.sh script. Both run AIPerf across a concurrency list: the default set (1 2 8 16 32 64 128) plus BenchConfig.estimated_concurrency and values within ±5% of it. You can customize this concurrency list as needed.
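To preview which concurrency values such a sweep might cover, the list can be approximated as follows. This is a sketch under a stated assumption: "within ±5%" is read here as the estimate scaled by 0.95 and 1.05 and rounded to the nearest integer, which may not match how the generated scripts compute it. `sweep_concurrency` is a hypothetical name.

```python
DEFAULT_CONCURRENCY = [1, 2, 8, 16, 32, 64, 128]


def sweep_concurrency(estimated, spread=0.05):
    """Merge the default list with the estimate and its assumed +/- spread neighbours."""
    around = {
        round(estimated * (1 - spread)),
        estimated,
        round(estimated * (1 + spread)),
    }
    # Drop non-positive values and duplicates, then sort for a stable sweep order.
    return sorted(set(DEFAULT_CONCURRENCY) | {c for c in around if c >= 1})


sweep = sweep_concurrency(56)
print(sweep)
```

Check the concurrency list in the generated bench_run.sh or k8s_bench.yaml for the values actually used.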

By default, AIPerf results are saved in /tmp/bench_artifacts inside the container. If a PVC name is specified with --generator-set K8sConfig.k8s_pvc_name=$YOUR_PVC, result artifacts are saved to the PVC volume mount instead.

AIC-to-AIPerf parameter mapping

| AIC Output | AIPerf Parameter | Notes |
|---|---|---|
| concurrency: 56 (=14x4) | --concurrency 56 | Use total concurrency when benchmarking via the frontend |
| ISL/OSL targets | --isl 4000 --osl 500 | Match your AIC inputs |
| - | --num-requests 800 | Use concurrency × 40 minimum for statistical stability |
| - | --extra-inputs "ignore_eos:true" | Ensures exact OSL tokens generated |

Note on concurrency: AIC reports concurrency as total (=bs × replicas). When benchmarking through the frontend (which routes to all replicas), use the total value. If benchmarking a single replica directly, use the per-replica bs value instead.
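The choice between total and per-replica concurrency can be made mechanical by parsing the concurrency cell from the AIC table. A minimal sketch, assuming the cell always has the form `total (=bs x replicas)` as in the report above; both function names are hypothetical.

```python
import re

_CONCURRENCY = re.compile(r"^\s*(\d+)\s*\(=(\d+)x(\d+)\)\s*$")


def parse_concurrency(cell):
    """Split a cell such as '56 (=14x4)' into (total, batch_size, replicas)."""
    match = _CONCURRENCY.match(cell)
    if match is None:
        raise ValueError(f"unrecognized concurrency cell: {cell!r}")
    total, batch_size, replicas = (int(group) for group in match.groups())
    if total != batch_size * replicas:
        raise ValueError(f"inconsistent concurrency cell: {cell!r}")
    return total, batch_size, replicas


def aiperf_concurrency(cell, via_frontend=True):
    """Total concurrency through the frontend, per-replica batch size otherwise."""
    total, batch_size, _ = parse_concurrency(cell)
    return total if via_frontend else batch_size


print(aiperf_concurrency("56 (=14x4)"))
print(aiperf_concurrency("56 (=14x4)", via_frontend=False))
```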

apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-benchmark
  namespace: your-namespace
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: aiperf
          image: python:3.10
          command:
            - /bin/bash
            - -c
            - |
              pip install aiperf
              aiperf profile \
                -m Qwen/Qwen3-32B-FP8 \
                --endpoint-type chat \
                -u http://dynamo-agg-frontend:8000 \
                --isl 4000 --isl-stddev 0 \
                --osl 500 --osl-stddev 0 \
                --num-requests 800 \
                --concurrency 56 \
                --streaming \
                --extra-inputs "ignore_eos:true" \
                --num-warmup-requests 40 \
                --ui-type simple

kubectl apply -f k8s_bench.yaml
kubectl logs -f -l job-name=aiperf-benchmark

Validated results (Qwen3-32B-FP8, 8× H200, TP2×4 replicas, aggregated):

| Metric | AIC Prediction | Actual (avg) | Status |
|---|---|---|---|
| TTFT (ms) | 509 | 209 | Better than target |
| ITL/TPOT (ms) | 16.49 | 15.06 | Within 10% |
| Throughput (req/s) | ~6.3 | 6.9 | Within 10% |
| Total Output TPS | ~3,178 | 3,462 | Within 10% |

The table above is a validation example, not a universal guarantee. Expect variance across clusters, backend versions, model cache settings, and network fabric. Run multiple benchmark passes and compare against the generated concurrency and sequence-length assumptions.
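To apply the same "within 10%" check to your own benchmark passes, compute the relative deviation of each measured value from its prediction. The sketch below uses the numbers from the validation table; the predictions marked with "~" there are approximate, so treat the resulting percentages as indicative. `pct_deviation` is a hypothetical helper.

```python
# metric: (AIC prediction, measured average), copied from the validation table
validation = {
    "TTFT (ms)": (509, 209),
    "ITL/TPOT (ms)": (16.49, 15.06),
    "Throughput (req/s)": (6.3, 6.9),
    "Total Output TPS": (3178, 3462),
}


def pct_deviation(predicted, actual):
    """Signed deviation of the measurement from the prediction, in percent."""
    return (actual - predicted) / predicted * 100.0


deviations = {
    metric: pct_deviation(predicted, actual)
    for metric, (predicted, actual) in validation.items()
}
for metric, deviation in deviations.items():
    print(f"{metric}: {deviation:+.1f}%")
```

A negative deviation is favorable for the latency metrics (TTFT, ITL/TPOT), and a positive one is favorable for the throughput metrics.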

Fine-Tuning Your Deployment

AIConfigurator provides a strong starting point. Here’s how to iterate for production:

Adjusting for Actual Workload

If your real workload differs from the benchmark parameters:

# For longer outputs (chat/code generation):
# increase OSL, relax TTFT target
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --isl 2000 \
  --osl 2000 \
  --ttft 1000 \
  --tpot 10 \
  --save-dir ./results_long_output

Exploring Alternative Configurations

Use exp mode to compare custom configurations:

# custom_exp.yaml
exps:
  - exp_tp2
  - exp_tp4

exp_tp2:
  mode: "patch"
  serving_mode: "agg"
  model_path: "Qwen/Qwen3-32B-FP8"
  total_gpus: 8
  system_name: "h200_sxm"
  backend_name: "vllm"
  backend_version: "0.12.0"
  isl: 4000
  osl: 500
  ttft: 600
  tpot: 16.67
  config:
    agg_worker_config:
      tp_list: [2]

exp_tp4:
  mode: "patch"
  serving_mode: "agg"
  model_path: "Qwen/Qwen3-32B-FP8"
  total_gpus: 8
  system_name: "h200_sxm"
  backend_name: "vllm"
  backend_version: "0.12.0"
  isl: 4000
  osl: 500
  ttft: 600
  tpot: 16.67
  config:
    agg_worker_config:
      tp_list: [4]

aiconfigurator cli exp --yaml-path custom_exp.yaml --save-dir ./results_custom
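Because the two experiments differ only in tp_list, the file can be generated when you want to compare more tensor-parallel sizes. The sketch below renders the same keys shown in custom_exp.yaml with plain string formatting; `render_exp` and `render_experiments` are hypothetical helpers, and the nesting under config is assumed from the example rather than taken from a schema.

```python
def render_exp(name, tp):
    """Render one aggregated experiment that pins tensor parallelism to `tp`."""
    lines = [
        f"{name}:",
        '  mode: "patch"',
        '  serving_mode: "agg"',
        '  model_path: "Qwen/Qwen3-32B-FP8"',
        "  total_gpus: 8",
        '  system_name: "h200_sxm"',
        '  backend_name: "vllm"',
        '  backend_version: "0.12.0"',
        "  isl: 4000",
        "  osl: 500",
        "  ttft: 600",
        "  tpot: 16.67",
        "  config:",
        "    agg_worker_config:",
        f"      tp_list: [{tp}]",
    ]
    return "\n".join(lines)


def render_experiments(tp_values):
    """Render the `exps` index followed by one block per tensor-parallel size."""
    names = [f"exp_tp{tp}" for tp in tp_values]
    header = "exps:\n" + "\n".join(f"  - {name}" for name in names)
    bodies = [render_exp(name, tp) for name, tp in zip(names, tp_values)]
    return header + "\n\n" + "\n\n".join(bodies) + "\n"


yaml_text = render_experiments([2, 4, 8])
print(yaml_text)
```

Write `yaml_text` to a file and pass it to aiconfigurator cli exp with --yaml-path as shown above.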

For production disaggregated deployments, validate the KV transfer path before tuning replica counts. See Disaggregated Serving for RDMA prerequisites, the DGD resource pattern, and NIXL/UCX verification.

Tuning vLLM-Specific Parameters

Override vLLM engine parameters with --generator-set:

aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --isl 4000 --osl 500 \
  --ttft 600 --tpot 16.67 \
  --save-dir ./results_tuned \
  --generator-set Workers.agg.kv_cache_free_gpu_memory_fraction=0.85 \
  --generator-set Workers.agg.max_num_seqs=2048

Run aiconfigurator cli default --generator-help to see all available parameters.

Prefix Caching Considerations

For workloads with repeated prefixes (e.g., system prompts):

  • Enable prefix caching when you have high prefix hit rates
  • Disable prefix caching (--no-enable-prefix-caching) for diverse prompts

AIConfigurator’s default predictions assume no prefix caching. Enable it post-deployment if your workload benefits.

Supported Configurations

Backends and Versions

For a comprehensive breakdown of which model/system/backend/version combinations are supported in both aggregated and disaggregated modes, refer to the support matrix. The raw data is available as per-system CSV files, which are automatically generated and tested to ensure accuracy across all supported configurations.

You can also check if a system / framework version is supported via the aiconfigurator cli support command. For example:

aiconfigurator cli support --model-path Qwen/Qwen3-32B-FP8 --system h100_sxm --backend-version 1.2.0rc5

Common Use Cases

# Strict latency SLAs (real-time chat)
aiconfigurator cli default \
  --model-path meta-llama/Llama-3.1-70B \
  --total-gpus 16 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --ttft 200 --tpot 8

# High throughput (batch processing)
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 32 \
  --system h200_sxm \
  --backend trtllm \
  --ttft 2000 --tpot 50

# Request latency constraint (end-to-end SLA)
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 16 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --request-latency 12000 \
  --isl 4000 --osl 500

Additional Options

# Web interface for interactive exploration
# (quotes keep shells such as zsh from expanding the brackets)
pip3 install "aiconfigurator[webapp]"
aiconfigurator webapp # Visit http://127.0.0.1:7860

# Quick config generation (no parameter sweep)
aiconfigurator cli generate \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm

# Check model/system support
aiconfigurator cli support \
  --model-path Qwen/Qwen3-32B-FP8 \
  --system h200_sxm \
  --backend vllm

Troubleshooting

AIConfigurator Issues

Model not found: Use the full HuggingFace path (e.g., Qwen/Qwen3-32B-FP8, not QWEN3_32B).

Backend version mismatch: Check supported versions with aiconfigurator cli support --model-path <model> --system <system> --backend <backend>

Deployment Issues

Pods crash with “Permission denied” on cache directory:

  • Mount the PVC at /opt/models instead of /root/.cache/huggingface
  • Set HF_HOME=/opt/models environment variable
  • Ensure the PVC has ReadWriteMany access mode

Workers stuck in CrashLoopBackOff:

  • Check logs: kubectl logs <pod-name> --previous
  • Verify sharedMemory.size is set (16Gi for vLLM, 80Gi for TRT-LLM)
  • Ensure HuggingFace token secret exists and is named correctly

Model download slow on every restart:

  • Add PVC for model caching (see deployment example above)
  • Verify volumeMounts and HF_HOME are configured on workers

“Context stopped or killed” errors (disaggregated only):

  • Deploy ETCD and NATS infrastructure (required for KV cache transfer)
  • See Dynamo Kubernetes Guide for platform setup

Performance Issues

OOM errors: Reduce --max-num-seqs or increase tensor parallelism

Performance below predictions:

  • Verify warmup requests are sufficient (40+ recommended)
  • Check for competing workloads on the cluster
  • Ensure KV cache memory fraction is optimized
  • Run benchmarks from inside the cluster to eliminate network latency

Disaggregated TTFT extremely high (10+ seconds): Start by checking the RDMA and KV transfer path. Without RDMA or another fast transfer path, KV cache transfer may fall back to TCP and become a severe bottleneck.

To diagnose:

# Check if RDMA resources are allocated
kubectl get pod <worker-pod> -o yaml | grep -A5 "resources:"

# Check UCX transport in logs
kubectl logs <worker-pod> | grep -i "UCX\|transport"

To fix:

  1. Ensure your cluster has RDMA device plugin installed
  2. Add rdma/ib resource requests to worker pods
  3. Add IPC_LOCK capability to security context
  4. Add UCX environment variables. See Disaggregated Serving for the deployment pattern and verification steps.

Disaggregated working but throughput lower than aggregated: For balanced workloads (ISL/OSL ratio between 2:1 and 10:1), aggregated is often better. Disaggregated shines for:

  • Very long inputs (ISL > 8000) with short outputs
  • Workloads needing independent prefill/decode scaling
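The rule of thumb above can be written down as a quick pre-check before running a full search. This is a sketch of the heuristic only, not AIConfigurator's selection logic; `architecture_hint` is a hypothetical name, and reading "short outputs" as an ISL/OSL ratio above 10:1 is an assumption made here.

```python
def architecture_hint(isl, osl):
    """Classify a workload by the ISL/OSL rule of thumb described above."""
    ratio = isl / osl
    # Assumption: "very long inputs with short outputs" means ISL > 8000 and ratio > 10:1.
    if isl > 8000 and ratio > 10:
        return "disaggregated candidate: very long inputs with short outputs"
    if 2 <= ratio <= 10:
        return "balanced: aggregated is often better"
    return "no rule of thumb applies: compare both with AIConfigurator"


print(architecture_hint(4000, 500))
print(architecture_hint(16000, 500))
print(architecture_hint(500, 2000))
```

The hint is only a prior. The walkthrough workload (ISL 4000, OSL 500) falls in the balanced band, yet AIConfigurator still found a disaggregated layout with higher per-GPU throughput, so run the search and benchmark both.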

Learn More

  • AIConfigurator CLI Guide
  • Dynamo Deployment Guide
  • Dynamo Installation Guide
  • Benchmarking Guide