SLA-based Planner
New to SLA Planner? For a complete workflow including profiling and deployment, see the SLA Profiling + Planner Quick Start Guide.
This document describes the SLA-based planner in examples/common/utils/planner_core.py.
The SLA (Service Level Agreement)-based planner is an autoscaling system that monitors system performance and adjusts the number of prefill and decode workers to meet specified TTFT (time to first token) and ITL (inter-token latency) targets. Unlike the load-based planner, which scales based on resource utilization thresholds, the SLA planner uses predictive modeling and performance interpolation to scale the workers proactively.
Currently, the SLA-based planner supports only disaggregated setups.
Bare-metal deployment with the local connector is deprecated. Please deploy the SLA planner in Kubernetes.
Components:
/metrics
The adjustment interval can be defined in the planner manifest as an argument. The default interval value can be found in this file.
The SLA planner consists of several key components:
Prerequisite: SLA-based planner requires pre-deployment profiling to be completed before deployment. The profiling process analyzes your model’s performance characteristics to determine optimal tensor parallelism configurations and scaling parameters that the planner will use during operation.
See Pre-Deployment Profiling for detailed instructions on running the profiling process.
The SLA planner uses a load predictor to predict the number of requests, ISL (input sequence length), and OSL (output sequence length) in the next adjustment interval. Currently, three load prediction models are supported:
load-predictor: "constant"
load-predictor: "arima"
load-predictor: "prophet"
The SLA planner uses a sophisticated scaling algorithm. At each adjustment interval, the SLA planner performs the following operations:
Every adjustment interval, collect:
Using the collected metrics, the SLA planner applies the interpolator to find the expected TTFT/ITL and calibrates the interpolation model. This step is important because the actual TTFT/ITL often differ from the values expected under ideal conditions.
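The SLA planner calculates the correction factors as follows:
Prefill correction factor: actual_ttft / expected_ttft
Decode correction factor: actual_itl / expected_itl
The SLA planner then forecasts the number of requests, ISL, and OSL for the next interval using the load predictor.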
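The correction and forecasting steps above can be sketched in a few lines. This is a minimal illustration, not the planner's code: the `IntervalMetrics` container and both function names are hypothetical, and the behaviour of the "constant" predictor (repeating the most recent observation) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class IntervalMetrics:
    """Averages observed over one adjustment interval (hypothetical container)."""
    num_req: float
    isl: float
    osl: float
    ttft_ms: float
    itl_ms: float


def correction_factors(observed, expected_ttft_ms, expected_itl_ms):
    """Return (prefill_correction, decode_correction) as observed / expected."""
    if expected_ttft_ms <= 0 or expected_itl_ms <= 0:
        raise ValueError("expected latencies must be positive")
    return observed.ttft_ms / expected_ttft_ms, observed.itl_ms / expected_itl_ms


def predict_constant(history):
    """Assumed 'constant' behaviour: the next interval repeats the latest observation."""
    last = history[-1]
    return last.num_req, last.isl, last.osl


# Example numbers are made up for illustration.
observed = IntervalMetrics(num_req=120, isl=3000, osl=150, ttft_ms=300, itl_ms=50)
p_corr, d_corr = correction_factors(observed, expected_ttft_ms=200, expected_itl_ms=40)
next_load = predict_constant([observed])
```

A correction factor above 1 means the observed latency is worse than the interpolated expectation; a factor below 1 means the system is doing better than profiled.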
Prefill replicas: The SLA planner assumes the prefill correction factor has a linear effect on the prefill throughput per GPU, because prefill is single-batched.
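As a rough illustration of the prefill calculation, the sketch below turns a predicted load into a replica count. It shows one possible way to apply a linear correction (dividing the profiled per-GPU throughput by the prefill correction factor) and should not be read as the planner's exact formula. The function name and all parameter names are hypothetical.

```python
import math


def prefill_replicas(pred_num_req, pred_isl, interval_s,
                     profiled_tokens_per_s_per_gpu,
                     prefill_correction, gpus_per_engine):
    """Estimate prefill replicas for the next interval (illustrative only)."""
    if interval_s <= 0 or prefill_correction <= 0 or gpus_per_engine <= 0:
        raise ValueError("interval, correction and GPU count must be positive")
    # Prefill tokens per second that the predicted load would demand.
    demand = pred_num_req * pred_isl / interval_s
    # Assumption: a correction of 1.5 (TTFT 1.5x worse than expected)
    # lowers the usable per-GPU throughput by the same factor.
    effective_per_gpu = profiled_tokens_per_s_per_gpu / prefill_correction
    return math.ceil(demand / (effective_per_gpu * gpus_per_engine))


# 600 requests of 3000 input tokens in a 60 s interval = 30000 tokens/s.
replicas = prefill_replicas(600, 3000, 60, 10000, 1.5, 2)  # 3 replicas
```

With a correction factor of 1.0 the same load would need only 2 replicas, which shows how the calibration step feeds directly into the scaling decision.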
Decode replicas:
Finally, the SLA planner applies the change by scaling the number of prefill and decode workers up or down to the calculated number of replicas for the next interval.
The SLA planner scales the P/D engines up and down in a non-blocking manner. If adjustment-interval is too short, the previous scaling operations may not finish before new ones are issued. Make sure to set a sufficiently large adjustment-interval.
For complete deployment instructions, see the SLA Planner Quick Start Guide.
The SLA planner requires a frontend that reports metrics at the /metrics HTTP endpoint with the number of requests, ISL, OSL, TTFT, and ITL in the correct format. The dynamo frontend provides these metrics automatically.
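To make the requirement concrete, the sketch below derives an average TTFT from a metrics payload. It assumes the endpoint serves the Prometheus text exposition format; the metric names (`example_ttft_seconds_sum`, `example_ttft_seconds_count`) and the sample payload are hypothetical placeholders, not the names the Dynamo frontend exports.

```python
def parse_samples(payload):
    """Parse 'name value' lines of a Prometheus-style payload (labels not handled)."""
    samples = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, value = line.split()[:2]
        samples[name] = float(value)
    return samples


def average(samples, metric):
    """Mean of a metric exposed as <metric>_sum and <metric>_count."""
    count = samples[f"{metric}_count"]
    return samples[f"{metric}_sum"] / count if count else 0.0


# Hypothetical payload, for illustration only.
PAYLOAD = """\
# TYPE example_ttft_seconds summary
example_ttft_seconds_sum 30.0
example_ttft_seconds_count 120
"""
avg_ttft = average(parse_samples(PAYLOAD), "example_ttft_seconds")
```

Sum and count series in this format are cumulative, so a per-interval average would be computed from the difference between two consecutive scrapes rather than from a single payload.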
The SLA planner supports a virtual deployment mode for custom environments (e.g., a custom cluster) through the VirtualConnector. This connector enables the planner to communicate scaling decisions without directly managing the deployment infrastructure.
The VirtualConnector acts as a bridge between the SLA planner and external deployment environments. Instead of directly scaling Kubernetes resources, it writes scaling decisions and waits for the deployment environment to acknowledge completion.
"No scaling needed (prefill=X, decode=Y)"
scaled_decision_id >= decision_id
To use virtual deployment mode:
The external deployment environment must use VirtualConnectorClient:
await client.wait(). This blocks until there is a change.
num_prefill_workers and num_decode_workers values: decision = await client.get()
await client.complete(decision)
A scaling decision (returned by client.get()) contains the following fields, which are -1 if not set yet:
num_prefill_workers: Integer specifying the target number of prefill workers
num_decode_workers: Integer specifying the target number of decode workers
decision_id: Integer with incremental ID for each scaling decision
See components/planner/test/test_virtual_connector.py for a full example.
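The wait / get / complete sequence described above can be exercised without a cluster. The sketch below is an assumption-laden illustration: `StubClient` is a hypothetical stand-in that only mimics the three methods named in this document, `publish` is a test helper playing the planner's role, and `handle_one_decision` is an illustrative name. Constructing the real VirtualConnectorClient is not shown here.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Decision:
    # Fields are -1 until the planner has written a decision.
    num_prefill_workers: int = -1
    num_decode_workers: int = -1
    decision_id: int = -1


class StubClient:
    """Hypothetical stand-in exposing wait/get/complete as described above."""

    def __init__(self):
        self._decision = Decision()
        self._changed = asyncio.Event()
        self.completed = []

    def publish(self, prefill, decode):
        # Test helper acting as the planner; not part of the real client.
        self._decision = Decision(prefill, decode, self._decision.decision_id + 1)
        self._changed.set()

    async def wait(self):
        await self._changed.wait()  # blocks until there is a change
        self._changed.clear()

    async def get(self):
        return self._decision

    async def complete(self, decision):
        self.completed.append(decision.decision_id)


async def handle_one_decision(client, scale):
    """Wait for a change, apply it with the caller's scale(), then acknowledge."""
    await client.wait()
    decision = await client.get()
    if decision.decision_id != -1:  # skip decisions that are not set yet
        await scale(decision.num_prefill_workers, decision.num_decode_workers)
        await client.complete(decision)
    return decision


async def demo():
    client = StubClient()
    applied = []

    async def scale(prefill, decode):
        applied.append((prefill, decode))  # replace with real scaling logic

    client.publish(2, 4)
    decision = await handle_one_decision(client, scale)
    return decision, applied, client.completed


decision, applied, completed = asyncio.run(demo())
```

In a real environment, `handle_one_decision` would run in a loop, and `scale` would return only after the workers have actually reached the requested counts, so that the acknowledgement reflects a finished scaling operation.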