Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with ModelType.Prefill, the frontend automatically detects them and activates an internal prefill router.
For the high-level deployment matrix, see the Router Guide. For the router flags used in this setup, see Configuration and Tuning.
The prefill router is automatically created when:
- A worker registers a model through register_model() with ModelType.Chat | ModelType.Completions.
- A prefill worker registers with ModelType.Prefill.

Key characteristics of the prefill router:
- Active block tracking is disabled (track_active_blocks=false) since prefill workers do not perform decode.

Key characteristics of the decode routing stage in disaggregated mode:
- Overlap scoring is disabled (overlap_score_weight=0) because decode routing should not chase prefix reuse.
- KV reuse is not assumed (assume_kv_reuse=false) unless the backend can truly deduplicate transferred blocks.
- Prefill token tracking is disabled (track_prefill_tokens=false) so decode-side load reflects decode work rather than already-completed prompt work.

When both workers are registered, requests are automatically routed.
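
The activation rule and the per-stage settings described above can be modeled with a small, self-contained sketch. This is not Dynamo's implementation or API: the `ModelType` enum, the `RouterSettings` dataclass, its default values, and the helper functions are hypothetical stand-ins, and only the setting names (`track_active_blocks`, `overlap_score_weight`, `assume_kv_reuse`, `track_prefill_tokens`) are taken from the text.

```python
from dataclasses import dataclass, replace
from enum import Flag, auto


class ModelType(Flag):
    # Hypothetical stand-in for the ModelType flags named in the text.
    CHAT = auto()
    COMPLETIONS = auto()
    PREFILL = auto()


@dataclass(frozen=True)
class RouterSettings:
    # Field names mirror the settings quoted in the text.
    # The default values here are illustrative assumptions.
    overlap_score_weight: float = 1.0
    track_active_blocks: bool = True
    assume_kv_reuse: bool = True
    track_prefill_tokens: bool = True


def prefill_router_active(registrations: list[ModelType]) -> bool:
    """Return True when both a decode-capable and a prefill worker are registered."""
    decode_types = ModelType.CHAT | ModelType.COMPLETIONS
    has_decode = any(bool(r & decode_types) for r in registrations)
    has_prefill = any(bool(r & ModelType.PREFILL) for r in registrations)
    return has_decode and has_prefill


def prefill_stage(base: RouterSettings) -> RouterSettings:
    """Prefill workers do not decode, so active block tracking is turned off."""
    return replace(base, track_active_blocks=False)


def decode_stage(base: RouterSettings, backend_dedups_blocks: bool = False) -> RouterSettings:
    """Decode routing ignores prefix overlap and completed prompt work."""
    return replace(
        base,
        overlap_score_weight=0.0,
        # Assume KV reuse only if the backend truly deduplicates transferred blocks.
        assume_kv_reuse=backend_dedups_blocks,
        track_prefill_tokens=False,
    )


if __name__ == "__main__":
    workers = [ModelType.CHAT | ModelType.COMPLETIONS, ModelType.PREFILL]
    if prefill_router_active(workers):
        base = RouterSettings()
        print("prefill stage:", prefill_stage(base))
        print("decode stage: ", decode_stage(base))
```

Deriving both stage configurations from one base object keeps the overrides explicit: each stage changes only the fields listed for it, and everything else is inherited unchanged.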
The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang, launch a separate standalone router as the prefill router targeting the prefill endpoints. The standalone router (python -m dynamo.router) uses --router-*-prefixed flags such as --router-block-size and --router-kv-events. See the Standalone Router README and examples/backends/sglang/launch/disagg_router.sh.
The following diagram shows an overview of the major components in disaggregated serving: