Request Rejection (Load Shedding)
This document describes how Dynamo implements request rejection to prevent system overload and maintain service stability under high load conditions.
Request rejection (also known as load shedding) is a fault tolerance mechanism that proactively rejects new requests when workers are overloaded. This prevents:
When all workers exceed their configured busy thresholds, new requests receive an HTTP 503 (Service Unavailable) response, signaling clients to retry later.
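The rule above can be sketched as a small admission check. This is an illustration only, not Dynamo's implementation: the function name `admit` and the mapping from worker ID to busy flag are hypothetical, and routing only to non-busy workers is an assumption of this sketch.

```python
from http import HTTPStatus


def admit(busy_by_worker: dict[str, bool]) -> tuple[int, list[str]]:
    """Return (HTTP status, candidate workers) for a new request.

    Hypothetical helper: rejects with 503 only when no worker is free.
    """
    # Keep only the workers that are not currently marked busy.
    available = [worker for worker, busy in busy_by_worker.items() if not busy]
    if not available:
        # Every worker is busy. An empty mapping also lands here, which is
        # a simplification made by this sketch.
        return HTTPStatus.SERVICE_UNAVAILABLE.value, []
    return HTTPStatus.OK.value, available
```

With this shape, a single free worker is enough for the request to be accepted; rejection happens only when the candidate list is empty.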
Configure busy thresholds when starting the frontend:
Thresholds can be adjusted at runtime via the /busy_threshold endpoint:
Response:
Workers are marked as “busy” based on a dual-threshold system. A worker is considered busy when either threshold is exceeded.
Monitors the percentage of KV cache blocks in use:
Example: With active_decode_blocks_threshold=0.85, a worker using 87% of its KV cache blocks is marked busy.
Monitors the number of tokens currently being prefilled:
Example: With active_prefill_tokens_threshold=10000, a worker prefilling 12,000 tokens is marked busy.
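The two checks can be combined into one predicate. The following is a minimal sketch rather than Dynamo's code: the `RankLoad` type, its field names, and the function `rank_is_busy` are hypothetical, and "exceeded" is read as a strict greater-than comparison. Only the two threshold names come from this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RankLoad:
    """Hypothetical load snapshot for one rank."""
    active_decode_blocks: int
    total_blocks: int
    active_prefill_tokens: int


def rank_is_busy(
    load: RankLoad,
    active_decode_blocks_threshold: Optional[float] = None,
    active_prefill_tokens_threshold: Optional[int] = None,
) -> bool:
    """Busy when either configured threshold is exceeded."""
    # KV cache check: fraction of blocks in use versus the threshold.
    decode_busy = (
        active_decode_blocks_threshold is not None
        and load.total_blocks > 0
        and load.active_decode_blocks / load.total_blocks
        > active_decode_blocks_threshold
    )
    # Prefill check: absolute number of tokens being prefilled.
    prefill_busy = (
        active_prefill_tokens_threshold is not None
        and load.active_prefill_tokens > active_prefill_tokens_threshold
    )
    # An unset threshold (None) never marks the rank busy.
    return decode_busy or prefill_busy
```

Both examples above fall out of this predicate: 870 of 1,000 blocks in use (87%) exceeds 0.85, and 12,000 prefill tokens exceeds 10000.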
For workers with multiple data-parallel (DP) ranks, the worker is marked busy only if ALL ranks are busy:
This prevents false positives when only some ranks are temporarily loaded.
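As a sketch of the aggregation described above, assuming each rank has already been evaluated to a boolean busy flag (the function name `worker_is_busy` is hypothetical):

```python
def worker_is_busy(rank_busy: list[bool]) -> bool:
    """A worker is busy only when every one of its ranks is busy."""
    # all([]) is True in Python, so a worker with no reported ranks is
    # explicitly treated as not busy here (an assumption of this sketch).
    return bool(rank_busy) and all(rank_busy)
```

One free rank is therefore enough to keep the worker eligible for new requests.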
The KvWorkerMonitor runs as a background task that:
Workers publish these metrics for monitoring:
PipelineError::ServiceOverloaded
When requests are rejected, clients receive:
Clients should implement exponential backoff when receiving 503 responses:
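One way a client might do this is sketched below. It is transport-agnostic: `send` is a hypothetical callable supplied by the caller that performs one request and returns `(status_code, body)`. The default attempt count and delays are illustrative values, not recommendations from Dynamo. The sketch uses full jitter (a random delay up to the exponential cap) so that many clients do not retry in lockstep.

```python
import random
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Retry `send` on HTTP 503 with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = 0, b""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            # Success or a non-retryable error: hand it back unchanged.
            return status, body
        if attempt < max_attempts - 1:
            # The cap doubles each attempt: base, 2*base, 4*base, ...
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    # Still overloaded after the final attempt: return the last 503.
    return status, body
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting; in production code the default `time.sleep` applies.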
Track rejection behavior with these metrics:
Example alert for high rejection rate:
For applications prioritizing low latency:
For applications prioritizing throughput:
To disable request rejection entirely:
Without thresholds configured, all requests are accepted regardless of worker load.
Begin with conservative thresholds and increase based on observed behavior:
Observe worker load patterns before setting thresholds:
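If utilization samples have already been collected (how they are gathered from a deployment is not shown here), a short summary can help when choosing a threshold. This is a sketch under that assumption; the function name and the choice of the median, 95th percentile, and maximum are illustrative.

```python
import statistics


def summarize_utilization(samples: list[float]) -> dict[str, float]:
    """Summarize observed KV cache utilization samples (fractions 0..1)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "max": max(samples)}
```

Comparing the 95th percentile and the maximum against a candidate threshold gives a rough idea of how often that threshold would have marked the worker busy under the observed load.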
In disaggregated deployments:
active_prefill_tokens_threshold for prefill workers
active_decode_blocks_threshold for decode workers
If using Kubernetes HPA, ensure rejection thresholds trigger before autoscaling: