Copyright © 2026, NVIDIA Corporation.

Planner


The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently.

Currently, the planner can scale the number of vLLM workers up and down based on the KV cache load and the prefill queue size.
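
As a rough illustration of how these two signals could drive a scaling decision, consider the sketch below. The `LoadSnapshot` type, the `decide` function, every threshold value, and the mapping of queue size to prefill workers and KV cache load to decode workers are assumptions made for this example. None of them are part of the Dynamo API, and the actual planner logic may differ.

```python
from dataclasses import dataclass


@dataclass
class LoadSnapshot:
    """Hypothetical view of the two signals the planner watches."""
    kv_cache_utilization: float  # fraction of KV cache in use, 0.0 to 1.0
    prefill_queue_size: int      # requests waiting for prefill


def decide(snapshot: LoadSnapshot,
           kv_high: float = 0.9,
           kv_low: float = 0.5,
           queue_high: int = 10) -> dict:
    """Return per-pool worker deltas: +1 to add, -1 to remove, 0 to hold.

    All thresholds are illustrative assumptions, not Dynamo defaults.
    """
    delta = {"prefill": 0, "decode": 0}

    # Assumption: a long prefill queue means prefill workers are the bottleneck.
    if snapshot.prefill_queue_size > queue_high:
        delta["prefill"] = 1
    elif snapshot.prefill_queue_size == 0:
        delta["prefill"] = -1

    # Assumption: high KV cache usage means decode workers are saturated.
    if snapshot.kv_cache_utilization > kv_high:
        delta["decode"] = 1
    elif snapshot.kv_cache_utilization < kv_low:
        delta["decode"] = -1

    return delta
```

For example, `decide(LoadSnapshot(0.95, 20))` asks for one more worker in each pool, while a snapshot between the thresholds leaves both pools unchanged.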

Key features include:

  • SLA-based scaling that uses predictive modeling and performance interpolation to proactively meet TTFT and ITL targets
  • Graceful scaling that ensures no requests are dropped during scale-down operations
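
To make the idea of performance interpolation concrete, here is a minimal sketch. It assumes that pre-deployment profiling produced a table of per-worker request rates and the TTFT measured at each rate, and that a forecast of the upcoming request rate is available from elsewhere. The function names, the shape of the profile data, and the numbers are hypothetical and do not reflect the planner's actual implementation.

```python
import math
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation of y at x, clamped to the range of xs.

    xs must be sorted in strictly increasing order.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # xs[i - 1] < x <= xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def workers_needed(predicted_rate, rates, ttft_ms, ttft_target_ms):
    """Estimate how many workers keep TTFT at or below the target.

    rates and ttft_ms are hypothetical profiling results: ttft_ms[i] is the
    TTFT observed when one worker serves rates[i] requests per second.
    Both lists are assumed to be strictly increasing.
    """
    if ttft_target_ms < ttft_ms[0]:
        raise ValueError("TTFT target is below the best profiled latency")
    # Invert the profile: the highest per-worker rate that still meets the target.
    per_worker_rate = interpolate(ttft_ms, rates, ttft_target_ms)
    return max(1, math.ceil(predicted_rate / per_worker_rate))


# Made-up profile: per-worker request rate (req/s) and measured TTFT (ms).
rates = [1.0, 2.0, 4.0, 8.0]
ttft_ms = [100.0, 150.0, 300.0, 900.0]

# With a forecast of 10 req/s and a 200 ms TTFT target, one worker can
# sustain about 2.67 req/s, so the estimate is 4 workers.
print(workers_needed(10.0, rates, ttft_ms, 200.0))
```

The same approach could be applied to an ITL target by using a decode-side profile in place of the TTFT profile.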
🚀 Quick Start

New to SLA Planner? Start with the SLA Planner Quick Start Guide for a complete, step-by-step workflow.

Prerequisites: The SLA-based planner requires profiling before deployment (2-4 hours on real silicon, or a few minutes using the simulator). The Quick Start guide includes everything you need.

Feature Support Matrix

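Category | Status | Feature
Backend | ❌ | Local
Backend | ✅ | Kubernetes
LLM Framework | ✅ | vLLM
LLM Framework | ✅ | TensorRT-LLM
LLM Framework | ✅ | SGLang
Serving Type | ✅ | Aggregated
Serving Type | ✅ | Disaggregated
Planner Actions | ❌ | Load-based scaling up/down prefill/decode workers
Planner Actions | ✅ | SLA-based scaling up/down prefill/decode workers [1]
Planner Actions | ❌ | Adjusting engine knobs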

Footnotes

  1. Supported with some limitations.