Disaggregated Serving
Find optimal prefill/decode configuration for disaggregated serving deployments
AIConfigurator is a performance optimization tool that helps you find the optimal configuration for deploying LLMs with Dynamo. It automatically determines the best number of prefill and decode workers, parallelism settings, and deployment parameters to meet your SLA targets while maximizing throughput.
When deploying LLMs with Dynamo, you need to make several critical decisions:
AIConfigurator answers these questions in seconds, providing:
AIConfigurator evaluates two deployment architectures and recommends the best one for your workload:
This section walks through a validated example deploying Qwen3-32B-FP8 on 8× H200 GPUs using vLLM.
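The invocation for this example can be sketched as follows. The snippet assembles an `aiconfigurator cli default` command from the flags described under "Parameters explained". The ISL/OSL values, the TTFT/TPOT targets, and the output directory are illustrative placeholders (assumptions, not the validated settings), and the helper name `build_aic_command` is hypothetical.

```python
import shlex


def build_aic_command(model, system, total_gpus, isl, osl,
                      ttft_ms, tpot_ms, backend, backend_version, save_dir):
    """Return the CLI invocation as an argv list (hypothetical helper)."""
    return [
        "aiconfigurator", "cli", "default",
        "--model", model,
        "--system", system,
        "--total-gpus", str(total_gpus),
        "--isl", str(isl),
        "--osl", str(osl),
        "--ttft", str(ttft_ms),
        "--tpot", str(tpot_ms),
        "--backend", backend,
        "--backend-version", backend_version,
        "--save-dir", save_dir,
    ]


# Model, system, GPU count, and backend come from the example above.
# The sequence lengths, SLA targets, and save directory are placeholders.
cmd = build_aic_command(
    model="Qwen/Qwen3-32B-FP8",
    system="h200_sxm",
    total_gpus=8,
    isl=4000,
    osl=500,
    ttft_ms=600,
    tpot_ms=25,
    backend="vllm",
    backend_version="0.12.0",
    save_dir="./results",
)

# Print a shell-safe command line that can be pasted into a terminal.
print(shlex.join(cmd))
```

Replace the placeholder values with the sequence lengths and SLA targets of your own workload before running the printed command.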
Parameters explained:
--model: HuggingFace model ID or local path (e.g., Qwen/Qwen3-32B-FP8)
--system: GPU system type (h200_sxm, h100_sxm, a100_sxm)
--total-gpus: Number of GPUs available for deployment
--isl / --osl: Input/Output sequence lengths in tokens
--ttft / --tpot: SLA targets - Time To First Token (ms) and Time Per Output Token (ms)
--backend: Inference backend (vllm, trtllm, or sglang)
--backend-version: Backend version (e.g., 0.12.0 for vLLM)
--save-dir: Directory to save generated deployment configs
AIConfigurator outputs a comparison of aggregated vs disaggregated deployment strategies:
Reading the output:
56 (=14x4) means batch size 14 × 4 replicas)
The --save-dir generates ready-to-use Kubernetes manifests:
Before deploying, ensure you have:
HuggingFace Token Secret (for gated models):
Model Cache PVC (recommended for faster restarts):
The generated k8s_deploy.yaml provides a starting point. You’ll typically need to customize it for your environment:
Complete deployment example with model cache and production settings:
Key deployment settings:
After deployment, validate the predictions against actual performance using AIPerf.
ℹ️ Run AIPerf inside the cluster to avoid network latency affecting measurements.
When --save-dir ... is specified, AIC automatically generates AIPerf scripts along with the Dynamo configs and stores them in the results folder. For Kubernetes deployments, run the benchmarks with k8s_bench.yaml; for bare-metal systems, use the bench_run.sh script. These scripts run AIPerf across a concurrency list: the default set (1 2 8 16 32 64 128) plus BenchConfig.estimated_concurrency and values within ±5% of it. You can customize this concurrency list as needed.
By default, AIPerf results are saved in /tmp/bench_artifacts inside the containers. If a PVC name is specified with --generator-set K8sConfig.k8s_pvc_name=$YOUR_PVC, the result artifacts are saved to the PVC volume mount instead.
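The concurrency list described above can be approximated with a short sketch. It assumes the ±5% neighbors are the estimated concurrency scaled by 0.95 and 1.05 and rounded outward; the exact values and rounding AIC uses are not specified here, and the function name `concurrency_sweep` is hypothetical.

```python
import math

# Default concurrency set quoted in the text above.
DEFAULT_CONCURRENCIES = [1, 2, 8, 16, 32, 64, 128]


def concurrency_sweep(estimated, tolerance=0.05):
    """Merge the default set with the estimate and its +/- tolerance neighbors.

    Assumption: neighbors are floor(estimated * 0.95) and
    ceil(estimated * 1.05); AIC may pick them differently.
    """
    low = max(1, math.floor(estimated * (1 - tolerance)))
    high = math.ceil(estimated * (1 + tolerance))
    return sorted(set(DEFAULT_CONCURRENCIES) | {low, estimated, high})


# Example with an estimated concurrency of 56 (batch size 14 x 4 replicas).
print(concurrency_sweep(56))
```

Duplicates are removed by the set union, so an estimate that already appears in the default set does not produce a repeated run.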
Note on concurrency: AIC reports concurrency as total (=bs × replicas). When benchmarking through the frontend (which routes to all replicas), use the total value. If benchmarking a single replica directly, use the per-replica bs value instead.
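As a worked example of this rule, the sketch below picks the concurrency to benchmark with, using the batch size 14 and 4 replicas from the sample output. The function name and the `through_frontend` parameter are hypothetical, chosen for illustration only.

```python
def benchmark_concurrency(batch_size, replicas, through_frontend=True):
    """Return the concurrency to pass to the benchmark.

    Through the frontend, requests fan out to every replica, so the
    total (batch_size * replicas) applies. Against a single replica,
    only the per-replica batch size applies.
    """
    if batch_size < 1 or replicas < 1:
        raise ValueError("batch_size and replicas must be positive")
    return batch_size * replicas if through_frontend else batch_size


# Batch size 14 across 4 replicas, benchmarked through the frontend.
print(benchmark_concurrency(14, 4))
# The same deployment, benchmarking one replica directly.
print(benchmark_concurrency(14, 4, through_frontend=False))
```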
Validated results (Qwen3-32B-FP8, 8× H200, TP2×4 replicas, aggregated):
Actual throughput typically reaches ~85-90% of AIC predictions, with ITL/TPOT being the most accurate metric. Expect some variance between benchmark runs; running multiple times is recommended. Enable prefix caching (--enable-prefix-caching) for additional TTFT improvements with repeated prompts.
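To compare a benchmark run against the prediction, a simple ratio check is enough. The sketch below uses made-up throughput numbers (1000 predicted, 870 measured) purely for illustration; they are not measured results, and the 0.85 floor is taken from the lower end of the ~85-90% range mentioned above. The function names are hypothetical.

```python
def prediction_ratio(measured, predicted):
    """Fraction of the predicted throughput that was actually achieved."""
    if predicted <= 0:
        raise ValueError("predicted throughput must be positive")
    return measured / predicted


def below_expectation(measured, predicted, floor=0.85):
    """True when a run falls under the lower end of the typical range."""
    return prediction_ratio(measured, predicted) < floor


# Hypothetical numbers, for illustration only.
predicted_tps = 1000.0
measured_tps = 870.0

ratio = prediction_ratio(measured_tps, predicted_tps)
print(f"achieved {ratio:.0%} of predicted throughput")
print("investigate" if below_expectation(measured_tps, predicted_tps) else "within typical range")
```

Because run-to-run variance is expected, apply the check to the average of several runs rather than to a single measurement.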
AIConfigurator provides a strong starting point. Here’s how to iterate for production:
If your real workload differs from the benchmark parameters:
Use exp mode to compare custom configurations:
Critical: Disaggregated deployments require RDMA for KV cache transfer. Without RDMA, performance degrades by 40x (TTFT increases from 355ms to 10+ seconds). See the Disaggregated Deployment section below.
Disaggregated deployments transfer KV cache between prefill and decode workers. Without RDMA, this transfer becomes a severe bottleneck, causing 40x performance degradation.
rdma/ib resources)
Critical RDMA settings:
After deployment, check the worker logs for UCX initialization:
You should see:
If you see only TCP transports, RDMA is not active - check your RDMA device plugin and resource requests.
Override vLLM engine parameters with --generator-set:
Run aiconfigurator cli default --generator-help to see all available parameters.
For workloads with repeated prefixes (e.g., system prompts):
--no-enable-prefix-caching) for diverse prompts
AIConfigurator’s default predictions assume no prefix caching. Enable it post-deployment if your workload benefits.
For a comprehensive breakdown of which model/system/backend/version combinations are supported in both aggregated and disaggregated modes, refer to the support matrix CSV. This file is automatically generated and tested to ensure accuracy across all supported configurations.
You can also check if a system / framework version is supported via the aiconfigurator cli support command. For example:
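A minimal sketch of such a check, assuming the `--model`, `--system`, and `--backend` flags shown for `aiconfigurator cli support` in the troubleshooting notes. The helper name `build_support_command` is hypothetical, and the argument values are taken from the Qwen3-32B-FP8 example.

```python
import shlex


def build_support_command(model, system, backend):
    """Return the support-check invocation as an argv list (hypothetical helper)."""
    return [
        "aiconfigurator", "cli", "support",
        "--model", model,
        "--system", system,
        "--backend", backend,
    ]


support_cmd = build_support_command(
    model="Qwen/Qwen3-32B-FP8",
    system="h200_sxm",
    backend="vllm",
)

# Print a shell-safe command line; run it where aiconfigurator is installed.
print(shlex.join(support_cmd))
```

The argv list can also be passed to `subprocess.run(support_cmd, check=True)` to run the check from a script.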
Model not found: Use the full HuggingFace path (e.g., Qwen/Qwen3-32B-FP8 not QWEN3_32B)
Backend version mismatch: Check supported versions with aiconfigurator cli support --model <model> --system <system> --backend <backend>
Pods crash with “Permission denied” on cache directory:
/opt/models instead of /root/.cache/huggingface
HF_HOME=/opt/models environment variable
ReadWriteMany access mode
Workers stuck in CrashLoopBackOff:
kubectl logs <pod-name> --previous
sharedMemory.size is set (16Gi for vLLM, 80Gi for TRT-LLM)
Model download slow on every restart:
volumeMounts and HF_HOME are configured on workers
“Context stopped or killed” errors (disaggregated only):
OOM errors: Reduce --max-num-seqs or increase tensor parallelism
Performance below predictions:
Disaggregated TTFT extremely high (10+ seconds): This is almost always caused by missing RDMA configuration. Without RDMA, KV cache transfer falls back to TCP and becomes a severe bottleneck.
To diagnose:
To fix:
rdma/ib resource requests to worker pods
IPC_LOCK capability to security context
Disaggregated working but throughput lower than aggregated: For balanced workloads (ISL/OSL ratio between 2:1 and 10:1), aggregated is often better. Disaggregated shines for: