Disaggregated Serving

Prefill and decode routing with the Dynamo router

Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with ModelType.Prefill, the frontend automatically detects them and activates an internal prefill router.

For the high-level deployment matrix, see Router Guide. For the router flags used in this setup, see Configuration and Tuning.

Automatic Prefill Router Activation

The prefill router is automatically created when:

  1. A decode model is registered, for example via register_model() with ModelType.Chat | ModelType.Completions.
  2. A prefill worker is detected with the same model name and ModelType.Prefill.
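The two conditions above can be sketched as a small predicate. This is only an illustrative model, not Dynamo's implementation: the `ModelType` enum below is a local stand-in for the real one, and `Registration` and `prefill_router_active` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from enum import Flag, auto


class ModelType(Flag):
    # Local stand-in for Dynamo's ModelType (assumption: only the
    # members mentioned on this page are modeled here).
    Chat = auto()
    Completions = auto()
    Prefill = auto()


@dataclass(frozen=True)
class Registration:
    # Hypothetical record of one registered worker.
    model_name: str
    model_type: ModelType


def prefill_router_active(registrations, model_name):
    """Return True when both activation conditions hold for model_name."""
    decode_types = ModelType.Chat | ModelType.Completions
    # Condition 1: a decode model is registered under this name.
    has_decode = any(
        r.model_name == model_name and bool(r.model_type & decode_types)
        for r in registrations
    )
    # Condition 2: a prefill worker is registered under the same name.
    has_prefill = any(
        r.model_name == model_name and bool(r.model_type & ModelType.Prefill)
        for r in registrations
    )
    return has_decode and has_prefill
```

Note that the model name must match on both sides: a prefill worker registered under a different name does not activate the prefill router for the decode model.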

Key characteristics of the prefill router:

  • Always disables active block tracking (track_active_blocks=false) since prefill workers do not perform decode.
  • Sits in the request pipeline between preprocessing and decode routing.
  • Falls back gracefully to decode-only mode if prefill fails or no prefill workers are available.
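The fallback behavior in the last bullet can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not the router's actual code: `route_request`, `run_prefill`, and `run_decode` are hypothetical names, and picking the first worker is a placeholder for real worker selection.

```python
import asyncio


async def route_request(request, prefill_workers, run_prefill, run_decode):
    """Try prefill first; continue decode-only if it fails or no worker exists."""
    prefill_result = None
    if prefill_workers:
        try:
            # Placeholder selection: a real router would choose a worker
            # based on load and cache state, not simply take the first one.
            prefill_result = await run_prefill(prefill_workers[0], request)
        except Exception:
            # Prefill failed: drop the result and let decode handle
            # the whole request.
            prefill_result = None
    # With prefill_result=None the decode worker runs in decode-only mode.
    return await run_decode(request, prefill_result)
```

The point of the sketch is that the decode stage is always reached: a missing or failed prefill step changes what the decode worker receives, not whether the request is served.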

Key characteristics of the decode routing stage in disaggregated mode:

  • Disables overlap scoring (overlap_score_weight=0) because decode routing should not chase prefix reuse.
  • Disables KV reuse assumption (assume_kv_reuse=false) unless the backend can truly deduplicate transferred blocks.
  • Disables prefill-token tracking (track_prefill_tokens=false) so decode-side load reflects decode work rather than already-completed prompt work.
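As a compact summary, the overrides listed for the two routing stages can be written as plain dictionaries applied on top of a base configuration. The setting names come from this page; the helper `disaggregated_settings`, the dictionary layout, and the `backend_dedups_transferred_blocks` parameter are assumptions made for illustration and do not reflect how Dynamo stores its configuration.

```python
# Overrides named on this page for each stage in disaggregated mode.
PREFILL_OVERRIDES = {"track_active_blocks": False}
DECODE_OVERRIDES = {
    "overlap_score_weight": 0.0,
    "assume_kv_reuse": False,
    "track_prefill_tokens": False,
}


def disaggregated_settings(base, stage, backend_dedups_transferred_blocks=False):
    """Return a copy of base with the overrides for 'prefill' or 'decode'."""
    if stage == "prefill":
        overrides = dict(PREFILL_OVERRIDES)
    elif stage == "decode":
        overrides = dict(DECODE_OVERRIDES)
        if backend_dedups_transferred_blocks:
            # Keep the caller's assume_kv_reuse value when the backend
            # can deduplicate transferred blocks.
            overrides.pop("assume_kv_reuse")
    else:
        raise ValueError(f"unknown stage: {stage!r}")
    merged = dict(base)
    merged.update(overrides)
    return merged
```

For example, `disaggregated_settings({"overlap_score_weight": 1.0}, "decode")` yields a configuration with overlap scoring turned off, while the input dictionary is left unchanged.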

Setup Example

Once both a decode worker and a prefill worker are registered under the same model name, the frontend routes requests through them automatically.

# Decode worker registration (in your decode worker)
decode_endpoint = runtime.endpoint("dynamo.decode.generate")

await register_model(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Chat | ModelType.Completions,
    endpoint=decode_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",
    # ... other parameters
)

await decode_endpoint.serve_endpoint(decode_handler.generate)

# Prefill worker registration (in your prefill worker)
prefill_endpoint = runtime.endpoint("dynamo.prefill.generate")

await register_model(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Prefill,
    endpoint=prefill_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",
    # ... other parameters
)

await prefill_endpoint.serve_endpoint(prefill_handler.generate)

The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang, launch a separate standalone router as the prefill router targeting the prefill endpoints. The standalone router (python -m dynamo.router) uses --router-*-prefixed flags such as --router-block-size and --router-kv-events. See the Standalone Router README and examples/backends/sglang/launch/disagg_router.sh.

Request Flow

The following diagram shows an overview of the major components in disaggregated serving: