This document describes how Dynamo implements request rejection to prevent system overload and maintain service stability under high load conditions.
Request rejection (also known as load shedding) is a fault tolerance mechanism that proactively rejects new requests when workers are overloaded. This prevents:
When all workers exceed their configured busy thresholds, new requests receive an HTTP 503 (Service Unavailable) response, signaling clients to retry later.
Configure busy thresholds when starting the frontend:
Thresholds can be adjusted at runtime via the /busy_threshold endpoint:
Response:
Workers are marked as “busy” based on a dual-threshold system. A worker is considered busy when either threshold is exceeded.
Monitors the percentage of KV cache blocks in use:
Example: With active_decode_blocks_threshold=0.85, a worker using 87% of its KV cache blocks is marked busy.
Monitors the number of tokens currently being prefilled:
Example: With active_prefill_tokens_threshold=10000, a worker prefilling 12,000 tokens is marked busy.
For workers with multiple data-parallel (DP) ranks, the worker is only marked busy if ALL ranks are busy:
This prevents false positives when only some ranks are temporarily loaded.
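The busy rules above can be sketched in a few lines of Python. This is an illustrative model only, not Dynamo's implementation: the `RankLoad` structure and both function names are hypothetical, and treating "exceeded" as a strict greater-than comparison is an assumption. Only the two threshold names come from this document.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RankLoad:
    """Load snapshot for one data-parallel rank (hypothetical structure)."""
    kv_blocks_used: int
    kv_blocks_total: int
    prefill_tokens: int


def rank_is_busy(
    load: RankLoad,
    active_decode_blocks_threshold: Optional[float] = None,
    active_prefill_tokens_threshold: Optional[int] = None,
) -> bool:
    """A rank is busy when EITHER configured threshold is exceeded."""
    if active_decode_blocks_threshold is not None and load.kv_blocks_total > 0:
        usage = load.kv_blocks_used / load.kv_blocks_total
        # Assumption: "exceeded" means strictly greater than the threshold.
        if usage > active_decode_blocks_threshold:
            return True
    if active_prefill_tokens_threshold is not None:
        if load.prefill_tokens > active_prefill_tokens_threshold:
            return True
    # With no thresholds configured, a rank is never considered busy.
    return False


def worker_is_busy(ranks: Sequence[RankLoad], **thresholds) -> bool:
    """A worker is busy only if ALL of its data-parallel ranks are busy."""
    return bool(ranks) and all(rank_is_busy(r, **thresholds) for r in ranks)
```

With `active_decode_blocks_threshold=0.85`, a single-rank worker at 87 of 100 blocks is busy, while a two-rank worker with one rank at 87% and the other at 40% is not.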
The KvWorkerMonitor runs as a background task that:
Workers publish these metrics for monitoring:
When no free worker remains, the request fails with PipelineError::ServiceOverloaded.
When requests are rejected, clients receive:
Clients should implement exponential backoff when receiving 503 responses:
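A minimal client-side sketch of exponential backoff with jitter follows. It is transport-agnostic: `send` is a hypothetical callable that performs one request and returns `(status_code, body)`, so any HTTP client can be plugged in. The function name, default delays, and the use of full jitter are illustrative assumptions, not values recommended by Dynamo.

```python
import random
import time
from typing import Callable, Tuple


def send_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() and retry on HTTP 503 with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = 0, ""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            # Success or a non-retryable error: hand it back to the caller.
            return status, body
        if attempt < max_attempts - 1:
            # The delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
    # Every attempt returned 503; return the last response.
    return status, body
```

Injecting `sleep` keeps the function easy to test and lets callers substitute their own scheduling. If the server sends a `Retry-After` header, honoring it would be a reasonable extension.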
Track rejection behavior with these metrics:
dynamo_frontend_model_rejection_total: Counter tracking the total number of requests rejected due to resource exhaustion
model: The model name being served
endpoint: The API endpoint that received the request (e.g., chat_completions, completions, embeddings)
The counter is incremented when a request fails with a ResourceExhausted error because all workers are busy. The rejected request is surfaced to the client as an HTTP 503 response.
Example metrics output:
Endpoint: Available on the frontend HTTP service at /metrics.
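To consume this counter programmatically, the text returned by /metrics can be parsed directly. The sketch below assumes the standard Prometheus text exposition format and the `model` and `endpoint` labels described above. The function name is hypothetical, and fetching the text from the frontend is left to the caller.

```python
import re
from collections import defaultdict
from typing import Dict, Tuple

METRIC = "dynamo_frontend_model_rejection_total"

# Matches sample lines such as: name{label="value",...} 42
_SAMPLE = re.compile(
    r"^" + re.escape(METRIC) + r"\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)"
)
# Matches one key="value" pair inside the braces.
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def rejection_counts(metrics_text: str) -> Dict[Tuple[str, str], float]:
    """Return rejection counts keyed by (model, endpoint)."""
    counts: Dict[Tuple[str, str], float] = defaultdict(float)
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            # Skips "# HELP" and "# TYPE" comments and unrelated metrics.
            continue
        labels = dict(_LABEL.findall(match.group("labels")))
        key = (labels.get("model", ""), labels.get("endpoint", ""))
        counts[key] += float(match.group("value"))
    return dict(counts)
```

Because the metric is a counter, a single scrape only gives a running total. Comparing two scrapes taken some time apart gives the rejection rate over that interval.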
For applications prioritizing low latency:
For applications prioritizing throughput:
To disable request rejection entirely:
Without thresholds configured, all requests are accepted regardless of worker load.
Begin with conservative thresholds and increase based on observed behavior:
Observe worker load patterns before setting thresholds:
In disaggregated deployments:
Set active_prefill_tokens_threshold for prefill workers
Set active_decode_blocks_threshold for decode workers
If using Kubernetes HPA, ensure rejection thresholds trigger before autoscaling: