Disaggregated Serving
Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with ModelType.Prefill, the frontend automatically detects them and activates an internal prefill router.
For the high-level deployment matrix, see the Router Guide. For the router flags used in this setup, see Configuration and Tuning.
Automatic Prefill Router Activation
The prefill router is created automatically when both of the following conditions hold:
- A decode model is registered, for example via register_model() with ModelType.Chat | ModelType.Completions.
- A prefill worker is detected with the same model name and ModelType.Prefill.
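The two activation conditions can be sketched as a small predicate. This is a minimal illustration, not Dynamo's implementation. ModelType, Registration, and should_activate_prefill_router are hypothetical local stand-ins. The sketch also assumes that any registration carrying Chat or Completions counts as a decode model.

```python
from dataclasses import dataclass
from enum import Flag, auto


class ModelType(Flag):
    # Hypothetical stand-in that mirrors the ModelType names used above;
    # this is not Dynamo's actual enum.
    Chat = auto()
    Completions = auto()
    Prefill = auto()


@dataclass(frozen=True)
class Registration:
    # Hypothetical record of one worker registration seen by the frontend.
    model_name: str
    model_type: ModelType


# Assumption: a registration with Chat or Completions is treated as a decode model.
DECODE_TYPES = ModelType.Chat | ModelType.Completions


def should_activate_prefill_router(registrations: list[Registration], model_name: str) -> bool:
    """Return True when a decode model and a prefill worker share model_name."""
    has_decode = any(
        r.model_name == model_name and bool(r.model_type & DECODE_TYPES)
        for r in registrations
    )
    has_prefill = any(
        r.model_name == model_name and ModelType.Prefill in r.model_type
        for r in registrations
    )
    return has_decode and has_prefill
```

A decode registration alone does not activate the prefill router. It activates only once a prefill worker with the same model name appears.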
Key characteristics of the prefill router:
- Always disables active block tracking (track_active_blocks=false), since prefill workers do not perform decode.
- Integrates into the request pipeline between preprocessing and decode routing.
- Falls back to decode-only mode if prefill fails or no prefill workers are available.
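The fallback behavior can be illustrated with a simplified routing function. This is a hypothetical sketch. route_request, the string worker IDs, and the run_prefill callback are illustrative names. The "pick the first worker" selection is a placeholder for the router's real scoring. The sketch shows only the control flow: preprocessed token IDs go through prefill routing first, and the request still reaches decode routing when prefill is unavailable or fails.

```python
from typing import Callable, Optional, Sequence

# Hypothetical callback type: given a prefill worker id and the preprocessed
# token ids, return an opaque handle describing where the prompt's KV cache lives.
PrefillFn = Callable[[str, list[int]], str]


def route_request(
    token_ids: list[int],
    prefill_workers: Sequence[str],
    decode_workers: Sequence[str],
    run_prefill: PrefillFn,
) -> tuple[str, Optional[str]]:
    """Return (decode_worker, prefill_handle).

    prefill_handle is None when the request falls back to decode-only mode.
    """
    if not decode_workers:
        raise RuntimeError("no decode workers available")

    prefill_handle: Optional[str] = None
    if prefill_workers:
        try:
            # Placeholder selection: a real router would score the workers.
            prefill_handle = run_prefill(prefill_workers[0], token_ids)
        except Exception:
            # Prefill failed: continue in decode-only mode.
            prefill_handle = None

    # Decode routing always runs, with or without a prefill result.
    return decode_workers[0], prefill_handle
```

With no prefill workers, or with a failing prefill call, the function still returns a decode worker and a None handle. This mirrors the decode-only fallback described above.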
Key characteristics of the decode routing stage in disaggregated mode:
- Disables overlap scoring (overlap_score_weight=0), because decode routing should not chase prefix reuse.
- Disables the KV reuse assumption (assume_kv_reuse=false) unless the backend can truly deduplicate transferred blocks.
- Disables prefill-token tracking (track_prefill_tokens=false), so decode-side load reflects decode work rather than prompt work that has already completed.
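The settings for the two routing stages can be summarized as configuration overrides applied to a shared base. The sketch assumes a hypothetical RouterStageConfig dataclass whose field names mirror the settings listed above. Its default values are placeholders, not Dynamo's defaults, and disaggregated_configs is an illustrative helper.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RouterStageConfig:
    # Hypothetical container; field names mirror the settings named above.
    # Default values are placeholders, not Dynamo's actual defaults.
    track_active_blocks: bool = True
    overlap_score_weight: float = 1.0
    assume_kv_reuse: bool = True
    track_prefill_tokens: bool = True


def disaggregated_configs(
    base: RouterStageConfig,
    backend_dedups_transferred_blocks: bool = False,
) -> tuple[RouterStageConfig, RouterStageConfig]:
    """Derive (prefill_stage, decode_stage) configs from a shared base."""
    # Prefill router: prefill workers do not decode, so active blocks are not tracked.
    prefill = replace(base, track_active_blocks=False)
    decode = replace(
        base,
        # Decode routing should not chase prefix reuse.
        overlap_score_weight=0.0,
        # Keep KV reuse only if the backend can deduplicate transferred blocks.
        assume_kv_reuse=base.assume_kv_reuse and backend_dedups_transferred_blocks,
        # Decode-side load should reflect decode work only.
        track_prefill_tokens=False,
    )
    return prefill, decode
```

Deriving both stages from one base keeps any unrelated settings identical across them. Only the fields discussed above are changed.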
Setup Example
When both the decode and prefill workers are registered under the same model name, the frontend automatically routes requests through the prefill router.
The unified frontend with automatic prefill routing is currently enabled for the vLLM and TensorRT-LLM backends. For SGLang, launch a separate standalone router to act as the prefill router, targeting the prefill endpoints. The standalone router (python -m dynamo.router) uses flags with the --router- prefix, such as --router-block-size and --router-kv-events. See the Standalone Router README and examples/backends/sglang/launch/disagg_router.sh.
Request Flow
The following diagram shows an overview of the major components in disaggregated serving: