Disaggregated Serving | NVIDIA Dynamo Documentation

AIConfigurator is a performance optimization tool that helps you find the optimal configuration for deploying LLMs with Dynamo. It automatically determines the best number of prefill and decode workers, parallelism settings, and deployment parameters to meet your SLA targets while maximizing throughput.

Why Use AIConfigurator?

When deploying LLMs with Dynamo, you need to make several critical decisions:

Aggregated vs Disaggregated: Which architecture gives better performance for your workload?
Worker Configuration: How many prefill and decode workers to deploy?
Parallelism Settings: What tensor/pipeline parallel configuration to use?
SLA Compliance: How to meet your TTFT and TPOT targets?

AIConfigurator answers these questions in seconds, providing:

Recommended configurations that meet your SLA requirements
Ready-to-deploy Dynamo configuration files (including Kubernetes manifests)
Performance comparisons between different deployment strategies
Up to 1.7x better throughput compared to manual configuration

End-to-End Workflow

AIConfigurator end-to-end workflow

Aggregated vs Disaggregated Architecture

AIConfigurator evaluates two deployment architectures and recommends the best one for your workload:

Aggregated vs Disaggregated architecture comparison

When to Use Each Architecture

Decision flowchart for choosing aggregated vs disaggregated

Quick Start

$ # Install
$ pip3 install aiconfigurator
$ 
$ # Find optimal configuration for vLLM backend
$ aiconfigurator cli default \
>   --model Qwen/Qwen3-32B-FP8 \
>   --total-gpus 8 \
>   --system h200_sxm \
>   --backend vllm \
>   --backend-version 0.12.0 \
>   --isl 4000 \
>   --osl 500 \
>   --ttft 600 \
>   --tpot 16.67 \
>   --save-dir ./results_vllm
$ 
$ # Deploy on Kubernetes
$ kubectl apply -f ./results_vllm/agg/top1/agg/k8s_deploy.yaml

Complete Walkthrough: vLLM on H200

This section walks through a validated example deploying Qwen3-32B-FP8 on 8× H200 GPUs using vLLM.

Step 1: Run AIConfigurator

$ aiconfigurator cli default \
>   --model Qwen/Qwen3-32B-FP8 \
>   --system h200_sxm \
>   --total-gpus 8 \
>   --isl 4000 \
>   --osl 500 \
>   --ttft 600 \
>   --tpot 25 \
>   --backend vllm \
>   --backend-version 0.12.0 \
>   --generator-dynamo-version 1.0.0 \
>   --generator-set K8sConfig.k8s_namespace=$YOUR_NAMESPACE \
>   --generator-set K8sConfig.k8s_pvc_name=$YOUR_PVC \
>   --save-dir ./results_vllm

Parameters explained:

--model: HuggingFace model ID or local path (e.g., Qwen/Qwen3-32B-FP8)
--system: GPU system type (h200_sxm, h100_sxm, a100_sxm)
--total-gpus: Number of GPUs available for deployment
--isl / --osl: Input/Output sequence lengths in tokens
--ttft / --tpot: SLA targets - Time To First Token (ms) and Time Per Output Token (ms)
--backend: Inference backend (vllm, trtllm, or sglang)
--backend-version: Backend version (e.g., 0.12.0 for vLLM)
--save-dir: Directory to save generated deployment configs

Step 2: Review the Results

AIConfigurator outputs a comparison of aggregated vs disaggregated deployment strategies:

********************************************************************************
*                     Dynamo aiconfigurator Final Results                      *
********************************************************************************
  ----------------------------------------------------------------------------
  Input Configuration & SLA Target:
    Model: Qwen/Qwen3-32B-FP8 (is_moe: False)
    Total GPUs: 8
    Best Experiment Chosen: disagg at 446.85 tokens/s/gpu (disagg 1.38x better)
  ----------------------------------------------------------------------------
  Overall Best Configuration:
    - Best Throughput: 3,574.80 tokens/s
    - Per-GPU Throughput: 446.85 tokens/s/gpu
    - Per-User Throughput: 53.58 tokens/s/user
    - TTFT: 453.18ms
    - TPOT: 18.66ms
    - Request Latency: 9766.51ms
  ----------------------------------------------------------------------------
  Pareto Frontier:
      Qwen/Qwen3-32B-FP8 Pareto Frontier: tokens/s/gpu_cluster vs tokens/s/user
     ┌─────────────────────────────────────────────────────────────────────────┐
850.0┤ •• agg                                                                  │
     │ ff disagg                                                               │
     │ xx disagg best                                                          │
     │                                                                         │
708.3┤                                                                         │
     │         f                                                               │
     │         f                                                               │
     │          fff                                                            │
566.7┤             f                                                           │
     │             f                                                           │
     │              f                                                          │
     │    ••         fffffffffffffffffx                                        │
425.0┤     ••••                        ff                                      │
     │        •••                       f                                      │
     │           •••••                  f                                      │
     │                ••••••••••        f                                      │
283.3┤                          •••     f                                      │
     │                             ••    f                                     │
     │                               ••  f                                     │
     │                                ••••f                                    │
141.7┤                                   •f•                                   │
     │                                     f•••••                              │
     │                                      f    •••••••                       │
     │                                       fffff      ••••                   │
  0.0┤                                                      ••••               │
     └┬─────────────────┬─────────────────┬─────────────────┬─────────────────┬┘
      0                30                60                90               120
tokens/s/gpu_cluster                tokens/s/user
  ----------------------------------------------------------------------------
  Deployment Details:
    (p) stands for prefill, (d) stands for decode, bs stands for batch size, a replica stands for the smallest scalable unit xPyD of the disagg system
    Some math: total gpus used = replicas * gpus/replica
               gpus/replica = (p)gpus/worker * (p)workers + (d)gpus/worker * (d)workers; for Agg, gpus/replica = gpus/worker
               gpus/worker = tp * pp * dp = etp * ep * pp for MoE models; tp * pp for dense models (underlined numbers are the actual values in math)
agg Top Configurations: (Sorted by tokens/s/gpu)
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
| Rank | backend | tokens/s/gpu | tokens/s/user |  TTFT  | request_latency | concurrency | total_gpus (used) | replicas | gpus/replica | gpus/worker | parallel | bs |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
|  1   |   vllm  |    322.69    |     41.78     | 546.92 |     12490.03    |  64 (=32x2) |     8 (8=2x4)     |    2     |      4       |  4 (=4x1x1) |  tp4pp1  | 32 |
|  2   |   vllm  |    293.94    |     44.43     | 593.10 |     11823.67    |  56 (=14x4) |     8 (8=4x2)     |    4     |      2       |  2 (=2x1x1) |  tp2pp1  | 14 |
|  3   |   vllm  |    208.87    |     42.90     | 460.58 |     12093.52    |  40 (=40x1) |     8 (8=1x8)     |    1     |      8       |  8 (=8x1x1) |  tp8pp1  | 40 |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
disagg Top Configurations: (Sorted by tokens/s/gpu)
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
| Rank | backend | tokens/s/gpu | tokens/s/user |  TTFT  | request_latency | concurrency | total_gpus (used) | replicas | gpus/replica | (p)workers | (p)gpus/worker | (p)parallel | (p)bs | (d)workers | (d)gpus/worker | (d)parallel | (d)bs |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
|  1   |   vllm  |    446.85    |     53.58     | 453.18 |     9766.51     |  76 (=76x1) |     8 (8=1x8)     |    1     | 8 (=2x2+1x4) |     2      |    2 (=2x1)    |    tp2pp1   |   1   |     1      |    4 (=4x1)    |    tp4pp1   |   76  |
|  2   |   vllm  |    446.85    |     41.14     | 453.18 |     12581.87    | 144 (=72x2) |     8 (8=2x4)     |    2     | 4 (=1x2+1x2) |     1      |    2 (=2x1)    |    tp2pp1   |   1   |     1      |    2 (=2x1)    |    tp2pp1   |   72  |
|  3   |   vllm  |    333.73    |     40.22     | 453.18 |     12860.32    |  72 (=36x2) |     8 (8=2x4)     |    2     | 4 (=1x2+2x1) |     1      |    2 (=2x1)    |    tp2pp1   |   1   |     2      |    1 (=1x1)    |    tp1pp1   |   18  |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+

Reading the output:

tokens/s/gpu: Overall throughput efficiency — higher is better
tokens/s/user: Per-request generation speed (inverse of TPOT)
TTFT: Predicted time to first token
concurrency: Total concurrent requests across all replicas (e.g., 56 (=14x4) means batch size 14 × 4 replicas)
agg Rank 1 recommends TP4 with 2 replicas — simpler to deploy
disagg Rank 1 recommends 2 prefill workers (TP2) + 1 decode worker (TP4) — higher throughput but requires RDMA

Step 3: Deploy on Kubernetes

The --save-dir generates ready-to-use Kubernetes manifests:

├── agg
│   ├── best_config_topn.csv
│   ├── exp_config.yaml
│   ├── pareto.csv
│   ├── top1
│   │   ├── agg_config.yaml
│   │   ├── bench_run.sh          # aiperf benchmark sweep script (bare-metal)
│   │   ├── generator_config.yaml
│   │   ├── k8s_bench.yaml        # aiperf benchmark sweep Job (Kubernetes)
│   │   ├── k8s_deploy.yaml       # Kubernetes DynamoGraphDeployment
│   │   └── run_0.sh
│   ...
├── disagg
│   ├── best_config_topn.csv
│   ├── exp_config.yaml
│   ├── pareto.csv
│   ├── top1
│   │   ├── bench_run.sh          # aiperf benchmark sweep script (bare-metal)
│   │   ├── decode_config.yaml
│   │   ├── generator_config.yaml
│   │   ├── k8s_bench.yaml        # aiperf benchmark sweep Job (Kubernetes)
│   │   ├── k8s_deploy.yaml       # Kubernetes DynamoGraphDeployment
│   │   ├── prefill_config.yaml
│   │   ├── run_0.sh
│   │   └── run_1.sh  (for multi-node setups)
│   ...
└── pareto_frontier.png

Prerequisites

Before deploying, ensure you have:

HuggingFace Token Secret (for gated models):

$ kubectl create secret generic hf-token-secret \
>   -n your-namespace \
>   --from-literal=HF_TOKEN="your-huggingface-token"

Model Cache PVC (recommended for faster restarts):

1 apiVersion: v1
2 kind: PersistentVolumeClaim
3 metadata:
4   name: model-cache
5   namespace: your-namespace
6 spec:
7   accessModes:
8     - ReadWriteMany
9   resources:
10     requests:
11       storage: 100Gi

Deploy the Configuration

The generated k8s_deploy.yaml provides a starting point. You’ll typically need to customize it for your environment:

$ kubectl apply -f ./results_vllm/agg/top1/agg/k8s_deploy.yaml

Complete deployment example with model cache and production settings:

1 apiVersion: nvidia.com/v1alpha1
2 kind: DynamoGraphDeployment
3 metadata:
4   name: dynamo-agg
5   namespace: your-namespace
6 spec:
7   backendFramework: vllm
8   pvcs:
9     - name: model-cache
10       create: false           # Use existing PVC
11   services:
12     Frontend:
13       componentType: frontend
14       replicas: 1
15       volumeMounts:
16         - name: model-cache
17           mountPoint: /opt/models
18       envs:
19         - name: HF_HOME
20           value: /opt/models
21       extraPodSpec:
22         mainContainer:
23           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
24           imagePullPolicy: IfNotPresent
25 
26     VLLMWorker:
27       envFromSecret: hf-token-secret
28       componentType: worker
29       replicas: 4
30       resources:
31         limits:
32           gpu: "2"
33       sharedMemory:
34         size: 16Gi            # Required for vLLM
35       volumeMounts:
36         - name: model-cache
37           mountPoint: /opt/models
38       envs:
39         - name: HF_HOME
40           value: /opt/models
41       extraPodSpec:
42         mainContainer:
43           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
44           workingDir: /workspace
45           imagePullPolicy: IfNotPresent
46           command:
47             - python3
48             - -m
49             - dynamo.vllm
50           args:
51             - --model
52             - "Qwen/Qwen3-32B-FP8"
53             - "--no-enable-prefix-caching"
54             - "--tensor-parallel-size"
55             - "2"
56             - "--pipeline-parallel-size"
57             - "1"
58             - "--data-parallel-size"
59             - "1"
60             - "--kv-cache-dtype"
61             - "fp8"
62             - "--max-model-len"
63             - "6000"
64             - "--max-num-seqs"
65             - "1024"

Key deployment settings:

Setting	Purpose	Notes
`backendFramework: vllm`	Tells Dynamo which runtime to use	Required at spec level
`pvcs` + `volumeMounts`	Caches model weights across restarts	Mount at `/opt/models` (not `/root/`)
`HF_HOME` env var	Points HuggingFace to cache location	Must match `mountPoint`
`sharedMemory.size: 16Gi`	IPC memory for vLLM	16Gi for vLLM, 80Gi for TRT-LLM
`envFromSecret`	Injects HF_TOKEN	Required for gated models

Step 4: Validate with AIPerf

After deployment, validate the predictions against actual performance using AIPerf.

ℹ️ Run AIPerf inside the cluster to avoid network latency affecting measurements.

AIC automatically generates AIPerf scripts along with Dynamo configs and stores them in the results folder (when --save-dir ... is specified). For Kubernetes deployments, you can run benchmarks using k8s_bench.yaml; while for bare-metal systems, use the bench_run.sh script. These scripts execute AIPerf across a concurrency list: the default set (1 2 8 16 32 64 128) along with BenchConfig.estimated_concurrency and its values within ±5%. You can also customize this concurrency list as needed.

By default, AIPerf results will be saved in /tmp/bench_artifacts of the containers. If PVC name is specified in --generator-set K8sConfig.k8s_pvc_name=$YOUR_PVC, result artifacts will be saved in the PVC volume mount instead.

AIC-to-AIPerf parameter mapping

AIC Output	AIPerf Parameter	Notes
`concurrency: 56 (=14x4)`	`--concurrency 56`	Use total concurrency when benchmarking via the frontend
ISL/OSL targets	`--isl 4000 --osl 500`	Match your AIC inputs
-	`--num-requests 800`	Use `concurrency × 40` minimum for statistical stability
-	`--extra-inputs "ignore_eos:true"`	Ensures exact OSL tokens generated

Note on concurrency: AIC reports concurrency as total (=bs × replicas). When benchmarking through the frontend (which routes to all replicas), use the total value. If benchmarking a single replica directly, use the per-replica bs value instead.

1 apiVersion: batch/v1
2 kind: Job
3 metadata:
4   name: aiperf-benchmark
5   namespace: your-namespace
6 spec:
7   template:
8     spec:
9       restartPolicy: Never
10       containers:
11       - name: aiperf
12         image: python:3.10
13         command:
14         - /bin/bash
15         - -c
16         - |
17           pip install aiperf
18           aiperf profile \
19             -m Qwen/Qwen3-32B-FP8 \
20             --endpoint-type chat \
21             -u http://dynamo-agg-frontend:8000 \
22             --isl 4000 --isl-stddev 0 \
23             --osl 500 --osl-stddev 0 \
24             --num-requests 800 \
25             --concurrency 56 \
26             --streaming \
27             --extra-inputs "ignore_eos:true" \
28             --num-warmup-requests 40 \
29             --ui-type simple

$ kubectl apply -f k8s_bench.yaml
$ kubectl logs -f -l job-name=aiperf-benchmark

Validated results (Qwen3-32B-FP8, 8× H200, TP2×4 replicas, aggregated):

Metric	AIC Prediction	Actual (avg)	Status
TTFT (ms)	509	209	Better than target
ITL/TPOT (ms)	16.49	15.06	Within 10%
Throughput (req/s)	~6.3	6.9	Within 10%
Total Output TPS	~3,178	3,462	Within 10%

Actual throughput typically reaches ~85-90% of AIC predictions, with ITL/TPOT being the most accurate metric. Expect some variance between benchmark runs; running multiple times is recommended. Enable prefix caching (--enable-prefix-caching) for additional TTFT improvements with repeated prompts.

Fine-Tuning Your Deployment

AIConfigurator provides a strong starting point. Here’s how to iterate for production:

Adjusting for Actual Workload

If your real workload differs from the benchmark parameters:

$ # For longer outputs (chat/code generation):
$ # increase OSL, relax TTFT target
$ aiconfigurator cli default \
>   --model Qwen/Qwen3-32B-FP8 \
>   --total-gpus 8 \
>   --system h200_sxm \
>   --backend vllm \
>   --backend-version 0.12.0 \
>   --isl 2000 \
>   --osl 2000 \
>   --ttft 1000 \
>   --tpot 10 \
>   --save-dir ./results_long_output

Exploring Alternative Configurations

Use exp mode to compare custom configurations:

1 # custom_exp.yaml
2 exps:
3   - exp_tp2
4   - exp_tp4
5 
6 exp_tp2:
7   mode: "patch"
8   serving_mode: "agg"
9   model_path: "Qwen/Qwen3-32B-FP8"
10   total_gpus: 8
11   system_name: "h200_sxm"
12   backend_name: "vllm"
13   backend_version: "0.12.0"
14   isl: 4000
15   osl: 500
16   ttft: 600
17   tpot: 16.67
18   config:
19     agg_worker_config:
20       tp_list: [2]
21 
22 exp_tp4:
23   mode: "patch"
24   serving_mode: "agg"
25   model_path: "Qwen/Qwen3-32B-FP8"
26   total_gpus: 8
27   system_name: "h200_sxm"
28   backend_name: "vllm"
29   backend_version: "0.12.0"
30   isl: 4000
31   osl: 500
32   ttft: 600
33   tpot: 16.67
34   config:
35     agg_worker_config:
36       tp_list: [4]

$ aiconfigurator cli exp --yaml-path custom_exp.yaml --save-dir ./results_custom

Critical: Disaggregated deployments require RDMA for KV cache transfer. Without RDMA, performance degrades by 40x (TTFT increases from 355ms to 10+ seconds). See the Disaggregated Deployment section below.

Deploying Disaggregated (RDMA Required)

Disaggregated deployments transfer KV cache between prefill and decode workers. Without RDMA, this transfer becomes a severe bottleneck, causing 40x performance degradation.

Prerequisites for Disaggregated

RDMA-capable network (InfiniBand or RoCE)
RDMA device plugin installed on the cluster (provides rdma/ib resources)
ETCD and NATS deployed (for coordination)

Disaggregated DGD with RDMA

1 apiVersion: nvidia.com/v1alpha1
2 kind: DynamoGraphDeployment
3 metadata:
4   name: dynamo-disagg
5   namespace: your-namespace
6 spec:
7   backendFramework: vllm
8   pvcs:
9     - name: model-cache
10       create: false
11   services:
12     Frontend:
13       componentType: frontend
14       replicas: 1
15       volumeMounts:
16         - name: model-cache
17           mountPoint: /opt/models
18       envs:
19         - name: HF_HOME
20           value: /opt/models
21       extraPodSpec:
22         mainContainer:
23           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
24           imagePullPolicy: IfNotPresent
25 
26     VLLMPrefillWorker:
27       envFromSecret: hf-token-secret
28       componentType: worker
29       subComponentType: prefill
30       replicas: 2
31       resources:
32         limits:
33           gpu: "2"
34       sharedMemory:
35         size: 16Gi
36       volumeMounts:
37         - name: model-cache
38           mountPoint: /opt/models
39       envs:
40         - name: HF_HOME
41           value: /opt/models
42         - name: UCX_TLS
43           value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"  # Enable RDMA transports
44         - name: UCX_RNDV_SCHEME
45           value: "get_zcopy"
46         - name: UCX_RNDV_THRESH
47           value: "0"
48       extraPodSpec:
49         mainContainer:
50           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
51           workingDir: /workspace
52           imagePullPolicy: IfNotPresent
53           securityContext:
54             capabilities:
55               add: ["IPC_LOCK"]  # Required for RDMA memory registration
56           resources:
57             limits:
58               rdma/ib: "2"      # Request RDMA resources
59             requests:
60               rdma/ib: "2"
61           command: ["python3", "-m", "dynamo.vllm"]
62           args:
63             - --model
64             - "Qwen/Qwen3-32B-FP8"
65             - "--tensor-parallel-size"
66             - "2"
67             - "--kv-cache-dtype"
68             - "fp8"
69             - "--max-num-seqs"
70             - "1"               # Prefill workers use batch size 1
71             - --disaggregation-mode
72             - prefill
73 
74     VLLMDecodeWorker:
75       envFromSecret: hf-token-secret
76       componentType: worker
77       subComponentType: decode
78       replicas: 1
79       resources:
80         limits:
81           gpu: "4"
82       sharedMemory:
83         size: 16Gi
84       volumeMounts:
85         - name: model-cache
86           mountPoint: /opt/models
87       envs:
88         - name: HF_HOME
89           value: /opt/models
90         - name: UCX_TLS
91           value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
92         - name: UCX_RNDV_SCHEME
93           value: "get_zcopy"
94         - name: UCX_RNDV_THRESH
95           value: "0"
96       extraPodSpec:
97         mainContainer:
98           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
99           workingDir: /workspace
100           imagePullPolicy: IfNotPresent
101           securityContext:
102             capabilities:
103               add: ["IPC_LOCK"]
104           resources:
105             limits:
106               rdma/ib: "4"
107             requests:
108               rdma/ib: "4"
109           command: ["python3", "-m", "dynamo.vllm"]
110           args:
111             - --model
112             - "Qwen/Qwen3-32B-FP8"
113             - "--tensor-parallel-size"
114             - "4"
115             - "--kv-cache-dtype"
116             - "fp8"
117             - "--max-num-seqs"
118             - "1024"            # Decode workers handle high concurrency
119             - --disaggregation-mode
120             - decode

Critical RDMA settings:

Setting	Purpose
`rdma/ib: "N"`	Request N RDMA resources (match TP size)
`IPC_LOCK` capability	Required for RDMA memory registration
`UCX_TLS` env var	Enables RDMA transports (rc_x, dc_x)
`UCX_RNDV_SCHEME=get_zcopy`	Zero-copy RDMA transfers

Verifying RDMA is Active

After deployment, check the worker logs for UCX initialization:

$ kubectl logs <prefill-worker-pod> | grep -i "UCX\|NIXL"

You should see:

NIXL INFO Backend UCX was instantiated

If you see only TCP transports, RDMA is not active - check your RDMA device plugin and resource requests.

Tuning vLLM-Specific Parameters

Override vLLM engine parameters with --generator-set:

$ aiconfigurator cli default \
>   --model Qwen/Qwen3-32B-FP8 \
>   --total-gpus 8 \
>   --system h200_sxm \
>   --backend vllm \
>   --backend-version 0.12.0 \
>   --isl 4000 --osl 500 \
>   --ttft 600 --tpot 16.67 \
>   --save-dir ./results_tuned \
>   --generator-set Workers.agg.kv_cache_free_gpu_memory_fraction=0.85 \
>   --generator-set Workers.agg.max_num_seqs=2048

Run aiconfigurator cli default --generator-help to see all available parameters.

Prefix Caching Considerations

For workloads with repeated prefixes (e.g., system prompts):

Enable prefix caching when you have high prefix hit rates
Disable prefix caching (--no-enable-prefix-caching) for diverse prompts

AIConfigurator’s default predictions assume no prefix caching. Enable it post-deployment if your workload benefits.

Supported Configurations

Backends and Versions

For a comprehensive breakdown of which model/system/backend/version combinations are supported in both aggregated and disaggregated modes, refer to the support matrix CSV. This file is automatically generated and tested to ensure accuracy across all supported configurations.

You can also check if a system / framework version is supported via the aiconfigurator cli support command. For example:

$ aiconfigurator cli support --model Qwen/Qwen3-32B-FP8 --system h100_sxm --backend-version 1.2.0rc5

Common Use Cases

$ # Strict latency SLAs (real-time chat)
$ aiconfigurator cli default \
>   --model meta-llama/Llama-3.1-70B \
>   --total-gpus 16 \
>   --system h200_sxm \
>   --backend vllm \
>   --backend-version 0.12.0 \
>   --ttft 200 --tpot 8
$ 
$ # High throughput (batch processing)
$ aiconfigurator cli default \
>   --model Qwen/Qwen3-32B-FP8 \
>   --total-gpus 32 \
>   --system h200_sxm \
>   --backend trtllm \
>   --ttft 2000 --tpot 50
$ 
$ # Request latency constraint (end-to-end SLA)
$ aiconfigurator cli default \
>   --model Qwen/Qwen3-32B-FP8 \
>   --total-gpus 16 \
>   --system h200_sxm \
>   --backend vllm \
>   --backend-version 0.12.0 \
>   --request-latency 12000 \
>   --isl 4000 --osl 500

Additional Options

$ # Web interface for interactive exploration
$ pip3 install aiconfigurator[webapp]
$ aiconfigurator webapp  # Visit http://127.0.0.1:7860
$ 
$ # Quick config generation (no parameter sweep)
$ aiconfigurator cli generate \
>   --model Qwen/Qwen3-32B-FP8 \
>   --total-gpus 8 \
>   --system h200_sxm \
>   --backend vllm
$ 
$ # Check model/system support
$ aiconfigurator cli support \
>   --model Qwen/Qwen3-32B-FP8 \
>   --system h200_sxm \
>   --backend vllm

Troubleshooting

AIConfigurator Issues

Model not found: Use the full HuggingFace path (e.g., Qwen/Qwen3-32B-FP8 not QWEN3_32B)

Backend version mismatch: Check supported versions with aiconfigurator cli support --model <model> --system <system> --backend <backend>

Deployment Issues

Pods crash with “Permission denied” on cache directory:

Mount the PVC at /opt/models instead of /root/.cache/huggingface
Set HF_HOME=/opt/models environment variable
Ensure the PVC has ReadWriteMany access mode

Workers stuck in CrashLoopBackOff:

Check logs: kubectl logs <pod-name> --previous
Verify sharedMemory.size is set (16Gi for vLLM, 80Gi for TRT-LLM)
Ensure HuggingFace token secret exists and is named correctly

Model download slow on every restart:

Add PVC for model caching (see deployment example above)
Verify volumeMounts and HF_HOME are configured on workers

“Context stopped or killed” errors (disaggregated only):

Deploy ETCD and NATS infrastructure (required for KV cache transfer)
See Dynamo Kubernetes Guide for platform setup

Performance Issues

OOM errors: Reduce --max-num-seqs or increase tensor parallelism

Performance below predictions:

Verify warmup requests are sufficient (40+ recommended)
Check for competing workloads on the cluster
Ensure KV cache memory fraction is optimized
Run benchmarks from inside the cluster to eliminate network latency

Disaggregated TTFT extremely high (10+ seconds): This is almost always caused by missing RDMA configuration. Without RDMA, KV cache transfer falls back to TCP and becomes a severe bottleneck.

To diagnose:

$ # Check if RDMA resources are allocated
$ kubectl get pod <worker-pod> -o yaml | grep -A5 "resources:"
$ 
$ # Check UCX transport in logs
$ kubectl logs <worker-pod> | grep -i "UCX\|transport"

To fix:

Ensure your cluster has RDMA device plugin installed
Add rdma/ib resource requests to worker pods
Add IPC_LOCK capability to security context
Add UCX environment variables (see Disaggregated Deployment section)

Disaggregated working but throughput lower than aggregated: For balanced workloads (ISL/OSL ratio between 2:1 and 10:1), aggregated is often better. Disaggregated shines for:

Very long inputs (ISL > 8000) with short outputs
Workloads needing independent prefill/decode scaling