> For clean Markdown content of this page, append .md to this URL. For the complete documentation index, see https://docs.nvidia.com/dynamo/llms.txt. For full content including API reference and SDK examples, see https://docs.nvidia.com/dynamo/llms-full.txt.

# Sizing with AIConfigurator

This page focuses on using
[AIConfigurator](https://github.com/ai-dynamo/aiconfigurator/tree/main) to size
aggregated and disaggregated Dynamo deployments. For the serving architecture
and deployment-path overview, start with [Disaggregated Serving](/dynamo/dev/user-guides/disaggregated-serving).

AIConfigurator is a performance optimization tool that helps you find a strong
starting configuration for deploying LLMs with Dynamo. Given a supported model,
GPU system, backend, and SLA target, it searches aggregated and disaggregated
layouts and can generate deployment artifacts for the selected target.

## When to Use AIConfigurator

When deploying LLMs with Dynamo, you need to make several critical decisions:
- **Aggregated vs Disaggregated**: Which architecture gives better performance for your workload?
- **Worker Configuration**: How many prefill and decode workers to deploy?
- **Parallelism Settings**: What tensor/pipeline parallel configuration to use?
- **SLA Compliance**: How to meet your TTFT and TPOT targets?

AIConfigurator is useful when you want:
- candidate configurations that are filtered against your SLA requirements
- generated Dynamo configuration files and Kubernetes manifests
- performance comparisons between aggregated and disaggregated strategies
- a support check for a model/system/backend combination before you tune by hand

Exact runtime and throughput gains depend on the model, hardware, backend,
traffic shape, and available performance data. Treat AIConfigurator output as a
validated starting point, then benchmark the generated configuration in your
cluster.

### End-to-End Workflow

![AIConfigurator end-to-end workflow](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/bc459161bf6b32c800fb7a1dd007c8134947303b4eed6bfa725ac02b32b6867b/pages-dev/assets/img/e2e-workflow.svg)

### Aggregated vs Disaggregated Architecture

AIConfigurator evaluates two deployment architectures and recommends the best one for your workload:

![Aggregated vs Disaggregated architecture comparison](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/a8e2cf7c5ff4855f4bd1e1ccc2a8076aad952d60dc16c77432769edbd1203a5d/pages-dev/assets/img/arch-comparison.svg)

### When to Use Each Architecture

![Decision flowchart for choosing aggregated vs disaggregated](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/ac64588ed77cb9588defcadd45b394b7a7e20e0b212ed6199bda37f5eda7a2d6/pages-dev/assets/img/decision-flowchart.svg)

## Quick Start

```bash
# Install
pip3 install aiconfigurator

# Optional: check whether the model/system/backend is covered
aiconfigurator cli support \
  --model-path Qwen/Qwen3-32B-FP8 \
  --system h200_sxm \
  --backend vllm

# Find optimal configuration for vLLM backend
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 16.67 \
  --database-mode SILICON \
  --deployment-target dynamo-j2 \
  --save-dir ./results_vllm

# Deploy on Kubernetes
kubectl apply -f ./results_vllm/agg/top1/agg/k8s_deploy.yaml
```

## Complete Walkthrough: vLLM on H200

This section walks through a validated example deploying Qwen3-32B-FP8 on 8× H200 GPUs using vLLM.

### Step 1: Run AIConfigurator

```bash
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --system h200_sxm \
  --total-gpus 8 \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 25 \
  --backend vllm \
  --backend-version 0.12.0 \
  --deployment-target dynamo-j2 \
  --generator-set K8sConfig.k8s_namespace=$YOUR_NAMESPACE \
  --generator-set K8sConfig.k8s_pvc_name=$YOUR_PVC \
  --save-dir ./results_vllm
```

**Parameters explained:**
- `--model-path`: HuggingFace model ID or local path (e.g., `Qwen/Qwen3-32B-FP8`). `--model` is also accepted as an alias.
- `--system`: GPU system type (`h200_sxm`, `h100_sxm`, `a100_sxm`)
- `--total-gpus`: Number of GPUs available for deployment
- `--isl` / `--osl`: Input/Output sequence lengths in tokens
- `--ttft` / `--tpot`: SLA targets - Time To First Token (ms) and Time Per Output Token (ms)
- `--backend`: Inference backend (`vllm`, `trtllm`, or `sglang`)
- `--backend-version`: Backend version (e.g., `0.12.0` for vLLM)
- `--deployment-target`: Artifact target. `dynamo-j2` generates Dynamo Kubernetes manifests; other targets are available in the upstream CLI.
- `--save-dir`: Directory to save generated deployment configs

### Step 2: Review the Results

AIConfigurator outputs a comparison of aggregated vs disaggregated deployment strategies:

```text
********************************************************************************
*                     Dynamo aiconfigurator Final Results                      *
********************************************************************************
  ----------------------------------------------------------------------------
  Input Configuration & SLA Target:
    Model: Qwen/Qwen3-32B-FP8 (is_moe: False)
    Total GPUs: 8
    Best Experiment Chosen: disagg at 446.85 tokens/s/gpu (disagg 1.38x better)
  ----------------------------------------------------------------------------
  Overall Best Configuration:
    - Best Throughput: 3,574.80 tokens/s
    - Per-GPU Throughput: 446.85 tokens/s/gpu
    - Per-User Throughput: 53.58 tokens/s/user
    - TTFT: 453.18ms
    - TPOT: 18.66ms
    - Request Latency: 9766.51ms
  ----------------------------------------------------------------------------
  Pareto Frontier:
      Qwen/Qwen3-32B-FP8 Pareto Frontier: tokens/s/gpu_cluster vs tokens/s/user
     ┌─────────────────────────────────────────────────────────────────────────┐
850.0┤ •• agg                                                                  │
     │ ff disagg                                                               │
     │ xx disagg best                                                          │
     │                                                                         │
708.3┤                                                                         │
     │         f                                                               │
     │         f                                                               │
     │          fff                                                            │
566.7┤             f                                                           │
     │             f                                                           │
     │              f                                                          │
     │    ••         fffffffffffffffffx                                        │
425.0┤     ••••                        ff                                      │
     │        •••                       f                                      │
     │           •••••                  f                                      │
     │                ••••••••••        f                                      │
283.3┤                          •••     f                                      │
     │                             ••    f                                     │
     │                               ••  f                                     │
     │                                ••••f                                    │
141.7┤                                   •f•                                   │
     │                                     f•••••                              │
     │                                      f    •••••••                       │
     │                                       fffff      ••••                   │
  0.0┤                                                      ••••               │
     └┬─────────────────┬─────────────────┬─────────────────┬─────────────────┬┘
      0                30                60                90               120
tokens/s/gpu_cluster                tokens/s/user

  ----------------------------------------------------------------------------
  Deployment Details:
    (p) stands for prefill, (d) stands for decode, bs stands for batch size, a replica stands for the smallest scalable unit xPyD of the disagg system
    Some math: total gpus used = replicas * gpus/replica
               gpus/replica = (p)gpus/worker * (p)workers + (d)gpus/worker * (d)workers; for Agg, gpus/replica = gpus/worker
               gpus/worker = tp * pp * dp = etp * ep * pp for MoE models; tp * pp for dense models (underlined numbers are the actual values in math)

agg Top Configurations: (Sorted by tokens/s/gpu)
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
| Rank | backend | tokens/s/gpu | tokens/s/user |  TTFT  | request_latency | concurrency | total_gpus (used) | replicas | gpus/replica | gpus/worker | parallel | bs |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
|  1   |   vllm  |    322.69    |     41.78     | 546.92 |     12490.03    |  64 (=32x2) |     8 (8=2x4)     |    2     |      4       |  4 (=4x1x1) |  tp4pp1  | 32 |
|  2   |   vllm  |    293.94    |     44.43     | 593.10 |     11823.67    |  56 (=14x4) |     8 (8=4x2)     |    4     |      2       |  2 (=2x1x1) |  tp2pp1  | 14 |
|  3   |   vllm  |    208.87    |     42.90     | 460.58 |     12093.52    |  40 (=40x1) |     8 (8=1x8)     |    1     |      8       |  8 (=8x1x1) |  tp8pp1  | 40 |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+

disagg Top Configurations: (Sorted by tokens/s/gpu)
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
| Rank | backend | tokens/s/gpu | tokens/s/user |  TTFT  | request_latency | concurrency | total_gpus (used) | replicas | gpus/replica | (p)workers | (p)gpus/worker | (p)parallel | (p)bs | (d)workers | (d)gpus/worker | (d)parallel | (d)bs |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
|  1   |   vllm  |    446.85    |     53.58     | 453.18 |     9766.51     |  76 (=76x1) |     8 (8=1x8)     |    1     | 8 (=2x2+1x4) |     2      |    2 (=2x1)    |    tp2pp1   |   1   |     1      |    4 (=4x1)    |    tp4pp1   |   76  |
|  2   |   vllm  |    446.85    |     41.14     | 453.18 |     12581.87    | 144 (=72x2) |     8 (8=2x4)     |    2     | 4 (=1x2+1x2) |     1      |    2 (=2x1)    |    tp2pp1   |   1   |     1      |    2 (=2x1)    |    tp2pp1   |   72  |
|  3   |   vllm  |    333.73    |     40.22     | 453.18 |     12860.32    |  72 (=36x2) |     8 (8=2x4)     |    2     | 4 (=1x2+2x1) |     1      |    2 (=2x1)    |    tp2pp1   |   1   |     2      |    1 (=1x1)    |    tp1pp1   |   18  |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
```

**Reading the output:**
- **tokens/s/gpu**: Overall throughput efficiency — higher is better
- **tokens/s/user**: Per-request generation speed (inverse of TPOT)
- **TTFT**: Predicted time to first token
- **concurrency**: Total concurrent requests across all replicas (e.g., `56 (=14x4)` means batch size 14 × 4 replicas)
- **agg Rank 1** recommends TP4 with 2 replicas — simpler to deploy
- **disagg Rank 1** recommends 2 prefill workers (TP2) + 1 decode worker (TP4) — higher throughput but requires RDMA

### Step 3: Deploy on Kubernetes

The `--save-dir` generates ready-to-use Kubernetes manifests:

```
├── agg
│   ├── best_config_topn.csv
│   ├── exp_config.yaml
│   ├── pareto.csv
│   ├── top1
│   │   ├── agg_config.yaml
│   │   ├── bench_run.sh          # aiperf benchmark sweep script (bare-metal)
│   │   ├── generator_config.yaml
│   │   ├── k8s_bench.yaml        # aiperf benchmark sweep Job (Kubernetes)
│   │   ├── k8s_deploy.yaml       # Kubernetes DynamoGraphDeployment
│   │   └── run_0.sh
│   ...
├── disagg
│   ├── best_config_topn.csv
│   ├── exp_config.yaml
│   ├── pareto.csv
│   ├── top1
│   │   ├── bench_run.sh          # aiperf benchmark sweep script (bare-metal)
│   │   ├── decode_config.yaml
│   │   ├── generator_config.yaml
│   │   ├── k8s_bench.yaml        # aiperf benchmark sweep Job (Kubernetes)
│   │   ├── k8s_deploy.yaml       # Kubernetes DynamoGraphDeployment
│   │   ├── prefill_config.yaml
│   │   ├── run_0.sh
│   │   └── run_1.sh  (for multi-node setups)
│   ...
└── pareto_frontier.png
```

#### Prerequisites

Before deploying, ensure you have:

1. **HuggingFace Token Secret** (for gated models):
   ```bash
   kubectl create secret generic hf-token-secret \
     -n your-namespace \
     --from-literal=HF_TOKEN="your-huggingface-token"
   ```

2. **Model Cache PVC** (recommended for faster restarts):
   ```yaml
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: model-cache
     namespace: your-namespace
   spec:
     accessModes:
       - ReadWriteMany
     resources:
       requests:
         storage: 100Gi
   ```

#### Deploy the Configuration

The generated `k8s_deploy.yaml` provides a starting point. You'll typically need to customize it for your environment:

```bash
kubectl apply -f ./results_vllm/agg/top1/agg/k8s_deploy.yaml
```

**Complete deployment example** with model cache and production settings:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dynamo-agg
  namespace: your-namespace
spec:
  backendFramework: vllm
  pvcs:
    - name: model-cache
      create: false           # Use existing PVC
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
          imagePullPolicy: IfNotPresent

    VLLMWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 4
      resources:
        limits:
          gpu: "2"
      sharedMemory:
        size: 16Gi            # Required for vLLM
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
          workingDir: /workspace
          imagePullPolicy: IfNotPresent
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - "Qwen/Qwen3-32B-FP8"
            - "--no-enable-prefix-caching"
            - "--tensor-parallel-size"
            - "2"
            - "--pipeline-parallel-size"
            - "1"
            - "--data-parallel-size"
            - "1"
            - "--kv-cache-dtype"
            - "fp8"
            - "--max-model-len"
            - "6000"
            - "--max-num-seqs"
            - "1024"
```

**Key deployment settings:**

| Setting | Purpose | Notes |
|---------|---------|-------|
| `backendFramework: vllm` | Tells Dynamo which runtime to use | Required at spec level |
| `pvcs` + `volumeMounts` | Caches model weights across restarts | Mount at `/opt/models` (not `/root/`) |
| `HF_HOME` env var | Points HuggingFace to cache location | Must match `mountPoint` |
| `sharedMemory.size: 16Gi` | IPC memory for vLLM | 16Gi for vLLM, 80Gi for TRT-LLM |
| `envFromSecret` | Injects HF_TOKEN | Required for gated models |

### Step 4: Validate with AIPerf

After deployment, validate the predictions against actual performance using [AIPerf](https://github.com/ai-dynamo/aiperf).

> ℹ️ Run AIPerf **inside the cluster** to avoid network latency affecting measurements.

AIC automatically generates AIPerf scripts along with Dynamo configs and stores them in the results folder (when `--save-dir ...` is specified). For Kubernetes deployments, you can run benchmarks using `k8s_bench.yaml`; while for bare-metal systems, use the `bench_run.sh` script. These scripts execute AIPerf across a concurrency list: the default set (`1 2 8 16 32 64 128`) along with `BenchConfig.estimated_concurrency` and its values within ±5%. You can also customize this concurrency list as needed.

By default, AIPerf results will be saved in `/tmp/bench_artifacts` of the containers. If PVC name is specified in `--generator-set K8sConfig.k8s_pvc_name=$YOUR_PVC`, result artifacts will be saved in the PVC volume mount instead.

![AIC-to-AIPerf parameter mapping](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/e399bacf9aea6e2fad40a86facb4af53fcaa3b7b974c42fcf0a075ecea1ab0c3/pages-dev/assets/img/param-mapping.svg)

| AIC Output | AIPerf Parameter | Notes |
|------------|-----------------|-------|
| `concurrency: 56 (=14x4)` | `--concurrency 56` | Use total concurrency when benchmarking via the frontend |
| ISL/OSL targets | `--isl 4000 --osl 500` | Match your AIC inputs |
| - | `--num-requests 800` | Use `concurrency × 40` minimum for statistical stability |
| - | `--extra-inputs "ignore_eos:true"` | Ensures exact OSL tokens generated |

> **Note on concurrency**: AIC reports concurrency as `total (=bs × replicas)`. When benchmarking through the frontend (which routes to all replicas), use the total value. If benchmarking a single replica directly, use the per-replica `bs` value instead.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-benchmark
  namespace: your-namespace
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: aiperf
        image: python:3.10
        command:
        - /bin/bash
        - -c
        - |
          pip install aiperf
          aiperf profile \
            -m Qwen/Qwen3-32B-FP8 \
            --endpoint-type chat \
            -u http://dynamo-agg-frontend:8000 \
            --isl 4000 --isl-stddev 0 \
            --osl 500 --osl-stddev 0 \
            --num-requests 800 \
            --concurrency 56 \
            --streaming \
            --extra-inputs "ignore_eos:true" \
            --num-warmup-requests 40 \
            --ui-type simple
```

```bash
kubectl apply -f k8s_bench.yaml
kubectl logs -f -l job-name=aiperf-benchmark
```

**Validated results** (Qwen3-32B-FP8, 8× H200, TP2×4 replicas, aggregated):

| Metric | AIC Prediction | Actual (avg) | Status |
|--------|---------------|--------------|--------|
| TTFT (ms) | 509 | 209 | Better than target |
| ITL/TPOT (ms) | 16.49 | 15.06 | Within 10% |
| Throughput (req/s) | ~6.3 | 6.9 | Within 10% |
| Total Output TPS | ~3,178 | 3,462 | Within 10% |

<Note>
The table above is a validation example, not a universal guarantee. Expect
variance across clusters, backend versions, model cache settings, and network
fabric. Run multiple benchmark passes and compare against the generated
concurrency and sequence-length assumptions.
</Note>

## Fine-Tuning Your Deployment

AIConfigurator provides a strong starting point. Here's how to iterate for production:

### Adjusting for Actual Workload

If your real workload differs from the benchmark parameters:

```bash
# For longer outputs (chat/code generation):
# increase OSL, relax TTFT target
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --isl 2000 \
  --osl 2000 \
  --ttft 1000 \
  --tpot 10 \
  --save-dir ./results_long_output
```

### Exploring Alternative Configurations

Use `exp` mode to compare custom configurations:

```yaml
# custom_exp.yaml
exps:
  - exp_tp2
  - exp_tp4

exp_tp2:
  mode: "patch"
  serving_mode: "agg"
  model_path: "Qwen/Qwen3-32B-FP8"
  total_gpus: 8
  system_name: "h200_sxm"
  backend_name: "vllm"
  backend_version: "0.12.0"
  isl: 4000
  osl: 500
  ttft: 600
  tpot: 16.67
  config:
    agg_worker_config:
      tp_list: [2]

exp_tp4:
  mode: "patch"
  serving_mode: "agg"
  model_path: "Qwen/Qwen3-32B-FP8"
  total_gpus: 8
  system_name: "h200_sxm"
  backend_name: "vllm"
  backend_version: "0.12.0"
  isl: 4000
  osl: 500
  ttft: 600
  tpot: 16.67
  config:
    agg_worker_config:
      tp_list: [4]
```

```bash
aiconfigurator cli exp --yaml-path custom_exp.yaml --save-dir ./results_custom
```

For production disaggregated deployments, validate the KV transfer path before
tuning replica counts. See [Disaggregated Serving](/dynamo/dev/user-guides/disaggregated-serving#deploying-disaggregated-with-rdma)
for RDMA prerequisites, the DGD resource pattern, and NIXL/UCX verification.

### Tuning vLLM-Specific Parameters

Override vLLM engine parameters with `--generator-set`:

```bash
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --isl 4000 --osl 500 \
  --ttft 600 --tpot 16.67 \
  --save-dir ./results_tuned \
  --generator-set Workers.agg.kv_cache_free_gpu_memory_fraction=0.85 \
  --generator-set Workers.agg.max_num_seqs=2048
```

Run `aiconfigurator cli default --generator-help` to see all available parameters.

### Prefix Caching Considerations

For workloads with repeated prefixes (e.g., system prompts):

- **Enable prefix caching** when you have high prefix hit rates
- **Disable prefix caching** (`--no-enable-prefix-caching`) for diverse prompts

AIConfigurator's default predictions assume no prefix caching. Enable it post-deployment if your workload benefits.

## Supported Configurations

### Backends and Versions

For a comprehensive breakdown of which model/system/backend/version combinations are supported in both aggregated and disaggregated modes, refer to the [**support matrix**](https://ai-dynamo.github.io/aiconfigurator/support-matrix/). The raw data is available as [per-system CSV files](https://github.com/ai-dynamo/aiconfigurator/tree/main/src/aiconfigurator/systems/support_matrix), which are automatically generated and tested to ensure accuracy across all supported configurations.

You can also check if a system / framework version is supported via the `aiconfigurator cli support` command. For example:
```bash
aiconfigurator cli support --model-path Qwen/Qwen3-32B-FP8 --system h100_sxm --backend-version 1.2.0rc5
```


## Common Use Cases

```bash
# Strict latency SLAs (real-time chat)
aiconfigurator cli default \
  --model-path meta-llama/Llama-3.1-70B \
  --total-gpus 16 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --ttft 200 --tpot 8

# High throughput (batch processing)
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 32 \
  --system h200_sxm \
  --backend trtllm \
  --ttft 2000 --tpot 50

# Request latency constraint (end-to-end SLA)
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 16 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --request-latency 12000 \
  --isl 4000 --osl 500
```

## Additional Options

```bash
# Web interface for interactive exploration
pip3 install aiconfigurator[webapp]
aiconfigurator webapp  # Visit http://127.0.0.1:7860

# Quick config generation (no parameter sweep)
aiconfigurator cli generate \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm

# Check model/system support
aiconfigurator cli support \
  --model-path Qwen/Qwen3-32B-FP8 \
  --system h200_sxm \
  --backend vllm
```

## Troubleshooting

### AIConfigurator Issues

**Model not found**: Use the full HuggingFace path (e.g., `Qwen/Qwen3-32B-FP8` not `QWEN3_32B`)

**Backend version mismatch**: Check supported versions with `aiconfigurator cli support --model-path <model> --system <system> --backend <backend>`

### Deployment Issues

**Pods crash with "Permission denied" on cache directory**:
- Mount the PVC at `/opt/models` instead of `/root/.cache/huggingface`
- Set `HF_HOME=/opt/models` environment variable
- Ensure the PVC has `ReadWriteMany` access mode

**Workers stuck in CrashLoopBackOff**:
- Check logs: `kubectl logs <pod-name> --previous`
- Verify `sharedMemory.size` is set (16Gi for vLLM, 80Gi for TRT-LLM)
- Ensure HuggingFace token secret exists and is named correctly

**Model download slow on every restart**:
- Add PVC for model caching (see deployment example above)
- Verify `volumeMounts` and `HF_HOME` are configured on workers

**"Context stopped or killed" errors (disaggregated only)**:
- Deploy ETCD and NATS infrastructure (required for KV cache transfer)
- See [Dynamo Kubernetes Guide](/dynamo/dev/kubernetes-deployment/start-here/kubernetes-quickstart) for platform setup

### Performance Issues

**OOM errors**: Reduce `--max-num-seqs` or increase tensor parallelism

**Performance below predictions**:
- Verify warmup requests are sufficient (40+ recommended)
- Check for competing workloads on the cluster
- Ensure KV cache memory fraction is optimized
- Run benchmarks from inside the cluster to eliminate network latency

**Disaggregated TTFT extremely high (10+ seconds)**:
Start by checking the **RDMA and KV transfer path**. Without RDMA or another
fast transfer path, KV cache transfer may fall back to TCP and become a severe
bottleneck.

To diagnose:
```bash
# Check if RDMA resources are allocated
kubectl get pod <worker-pod> -o yaml | grep -A5 "resources:"

# Check UCX transport in logs
kubectl logs <worker-pod> | grep -i "UCX\|transport"
```

To fix:
1. Ensure your cluster has RDMA device plugin installed
2. Add `rdma/ib` resource requests to worker pods
3. Add `IPC_LOCK` capability to security context
4. Add UCX environment variables. See [Disaggregated Serving](/dynamo/dev/user-guides/disaggregated-serving#deploying-disaggregated-with-rdma)
   for the deployment pattern and verification steps.

**Disaggregated working but throughput lower than aggregated**:
For balanced workloads (ISL/OSL ratio between 2:1 and 10:1), aggregated is often better. Disaggregated shines for:
- Very long inputs (ISL > 8000) with short outputs
- Workloads needing independent prefill/decode scaling

## Learn More

- [AIConfigurator CLI Guide](https://github.com/ai-dynamo/aiconfigurator/blob/main/docs/cli_user_guide.md)
- [Dynamo Deployment Guide](https://github.com/ai-dynamo/aiconfigurator/blob/main/docs/dynamo_deployment_guide.md)
- [Dynamo Installation Guide](/dynamo/dev/kubernetes-deployment/start-here/installation-guide)
- [Benchmarking Guide](/dynamo/dev/user-guides/benchmarking)