Planner
The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently.
Currently, the planner can scale the number of vLLM workers up and down based on the KV cache load and the prefill queue size.
Key features include:

- Load-based scaling that monitors KV cache utilization and prefill queue size to make scaling decisions
- SLA-based scaling that uses predictive modeling and performance interpolation to proactively meet TTFT and ITL targets
- Multi-backend support for both local (Circus) and Kubernetes environments
- Graceful scaling that ensures no requests are dropped during scale-down operations
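
To make the load-based scaling idea concrete, here is a minimal sketch of a single scaling decision. It is not the planner's actual implementation: the `LoadSnapshot` type, the `decide` function, the threshold values, and the worker limits are all hypothetical, and the mapping of prefill queue size to prefill workers and KV cache utilization to decode workers is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoadSnapshot:
    """One observation of system load (hypothetical structure)."""
    kv_cache_utilization: float  # fraction of the KV cache in use, 0.0 to 1.0
    prefill_queue_size: int      # number of requests waiting for prefill


# Hypothetical thresholds, chosen only for illustration.
KV_SCALE_UP = 0.9
KV_SCALE_DOWN = 0.5
QUEUE_SCALE_UP = 10
QUEUE_SCALE_DOWN = 2


def decide(snapshot, prefill_workers, decode_workers, min_workers=1, max_workers=8):
    """Return the new (prefill_workers, decode_workers) counts."""
    # Assumption: a long prefill queue means prefill capacity is short.
    if snapshot.prefill_queue_size > QUEUE_SCALE_UP:
        prefill_workers += 1
    elif snapshot.prefill_queue_size < QUEUE_SCALE_DOWN:
        prefill_workers -= 1

    # Assumption: high KV cache utilization means decode capacity is short.
    if snapshot.kv_cache_utilization > KV_SCALE_UP:
        decode_workers += 1
    elif snapshot.kv_cache_utilization < KV_SCALE_DOWN:
        decode_workers -= 1

    # Keep both counts inside the allowed range.
    def clamp(n):
        return max(min_workers, min(max_workers, n))

    return clamp(prefill_workers), clamp(decode_workers)
```

Calling `decide(LoadSnapshot(0.95, 20), 2, 2)` in this sketch adds one worker of each kind, while a snapshot that sits between the thresholds leaves both counts unchanged. Using separate up and down thresholds leaves a dead band between them, which helps avoid oscillating between scale-up and scale-down on small load changes.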
| Feature | Option | Supported |
|---|---|---|
| Backend | Local | ✅ |
| | Kubernetes | ✅ |
| LLM Framework | vLLM | ✅ |
| | TensorRT-LLM | ❌ |
| | SGLang | ❌ |
| | llama.cpp | ❌ |
| Serving Type | Aggregated | ✅ |
| | Disaggregated | ✅ |
| Planner Actions | Load-based scaling up/down prefill/decode workers | ✅ |
| | SLA-based scaling up/down prefill/decode workers [1] | ✅ |
| | Adjusting engine knobs | ❌ |
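
SLA-based scaling is described as relying on performance interpolation. The sketch below shows one way that idea could work for a TTFT target; it is not the planner's actual algorithm. The function names, the profiling data shape, and the assumption that TTFT rises strictly with per-worker request rate are all hypothetical, and the predicted load is taken as an input rather than modeled.

```python
import math
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation over ascending xs, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # first index with xs[i] >= x
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def workers_needed(predicted_rate, profiled_rates, profiled_ttft_ms, target_ttft_ms):
    """Estimate the worker count that keeps TTFT within the target.

    profiled_rates: per-worker request rates measured offline (hypothetical data).
    profiled_ttft_ms: TTFT observed at each rate, assumed strictly increasing.
    """
    if target_ttft_ms < profiled_ttft_ms[0]:
        raise ValueError("target TTFT is below the lowest profiled TTFT")
    # Invert the profile: find the highest per-worker rate that meets the target.
    rate_per_worker = interpolate(profiled_ttft_ms, profiled_rates, target_ttft_ms)
    return max(1, math.ceil(predicted_rate / rate_per_worker))


# Hypothetical profile: request rate per worker (req/s) against TTFT (ms).
rates = [1.0, 2.0, 4.0, 8.0]
ttft_ms = [100.0, 150.0, 300.0, 700.0]
print(workers_needed(10.0, rates, ttft_ms, target_ttft_ms=200.0))
```

With this made-up profile, a 200 ms target falls between the 150 ms and 300 ms points, so each worker is estimated to sustain roughly 2.67 req/s and a predicted load of 10 req/s needs 4 workers. An ITL target could be handled the same way with a separate profile.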