Copyright © 2026, NVIDIA Corporation.

Request Rejection (Load Shedding)

This document describes how Dynamo implements request rejection to prevent system overload and maintain service stability under high load conditions.

Overview

Request rejection (also known as load shedding) is a fault tolerance mechanism that proactively rejects new requests when workers are overloaded. This prevents:

  • Cascading failures from resource exhaustion
  • Degraded latency for all requests
  • Out-of-memory conditions on GPU workers

When all workers exceed their configured busy thresholds, new requests receive an HTTP 503 (Service Unavailable) response, signaling clients to retry later.

Architecture

                                  ┌─────────────────┐
                                  │ Worker Monitor  │
                                  │  (Background)   │
                                  └────────┬────────┘
                                           │ Updates busy list
                                           ▼
┌──────────┐    ┌──────────┐    ┌─────────────────────┐    ┌──────────┐
│  Client  │───▶│ Frontend │───▶│     Push Router     │───▶│  Worker  │
└──────────┘    └──────────┘    │ (checks busy list)  │    └──────────┘
                                └─────────────────────┘
                                           │
                                           │ If all workers busy
                                           ▼
                                ┌─────────────────────┐
                                │   HTTP 503 Error    │
                                │ "All workers busy"  │
                                └─────────────────────┘

Configuration

Frontend Arguments

Configure busy thresholds when starting the frontend:

    python -m dynamo.frontend \
      --active-decode-blocks-threshold 0.85 \
      --active-prefill-tokens-threshold 10000

| Argument | Type | Description |
|---|---|---|
| --active-decode-blocks-threshold | float (0.0-1.0) | KV cache block utilization threshold |
| --active-prefill-tokens-threshold | int | Prefill token count threshold |

Dynamic Configuration via API

Thresholds can be adjusted at runtime via the /busy_threshold endpoint:

Set Thresholds

    curl -X POST http://localhost:8000/busy_threshold \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen3-0.6B",
        "active_decode_blocks_threshold": 0.85,
        "active_prefill_tokens_threshold": 10000
      }'

Get Current Thresholds

    curl http://localhost:8000/busy_threshold

Response:

    {
      "thresholds": [
        {
          "model": "Qwen/Qwen3-0.6B",
          "active_decode_blocks_threshold": 0.85,
          "active_prefill_tokens_threshold": 10000
        }
      ]
    }
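The same update can be issued from Python using only the standard library. The following is a minimal sketch, not part of Dynamo: the helper names `build_set_threshold_request` and `set_thresholds` are hypothetical, while the endpoint path and JSON field names are copied from the curl example above. It assumes the frontend is reachable at the base URL you pass in, and it returns the raw response body because the format of the POST response is not documented here.

```python
import json
import urllib.request


def build_set_threshold_request(base_url, model, decode_blocks_threshold,
                                prefill_tokens_threshold):
    """Build (but do not send) a POST request for the /busy_threshold endpoint."""
    # The frontend argument table gives 0.0-1.0 as the valid range.
    if not 0.0 <= decode_blocks_threshold <= 1.0:
        raise ValueError("decode_blocks_threshold must be between 0.0 and 1.0")
    payload = {
        "model": model,
        "active_decode_blocks_threshold": decode_blocks_threshold,
        "active_prefill_tokens_threshold": prefill_tokens_threshold,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/busy_threshold",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def set_thresholds(base_url, model, decode_blocks_threshold,
                   prefill_tokens_threshold):
    """Send the request and return the raw response body as text."""
    request = build_set_threshold_request(
        base_url, model, decode_blocks_threshold, prefill_tokens_threshold
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling `set_thresholds("http://localhost:8000", "Qwen/Qwen3-0.6B", 0.85, 10000)` would send the same payload as the curl command. Splitting request construction from sending makes the payload easy to inspect without a running frontend.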

Busy Detection Logic

Workers are marked as “busy” based on a dual-threshold system. A worker is considered busy when either threshold is exceeded.

KV Cache Block Threshold

Monitors the percentage of KV cache blocks in use:

busy = active_decode_blocks / kv_total_blocks > threshold

Example: With active_decode_blocks_threshold=0.85, a worker using 87% of its KV cache blocks is marked busy.

Prefill Token Threshold

Monitors the number of tokens currently being prefilled:

busy = active_prefill_tokens > threshold

Example: With active_prefill_tokens_threshold=10000, a worker prefilling 12,000 tokens is marked busy.
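The two checks combine with a logical OR. The sketch below illustrates that rule in Python; it is not Dynamo's implementation. The function name `is_rank_busy` is hypothetical, and two behaviors are assumptions made for the illustration: a threshold passed as `None` is treated as "check disabled", and a worker reporting zero total blocks is never marked busy by the block check.

```python
def is_rank_busy(active_decode_blocks, kv_total_blocks, active_prefill_tokens,
                 decode_blocks_threshold=None, prefill_tokens_threshold=None):
    """Return True if either configured threshold is exceeded."""
    # KV cache block check: fraction of blocks in use vs. the threshold.
    # Skipped when unconfigured or when no blocks are reported (assumption).
    if decode_blocks_threshold is not None and kv_total_blocks > 0:
        if active_decode_blocks / kv_total_blocks > decode_blocks_threshold:
            return True
    # Prefill token check: absolute token count vs. the threshold.
    if prefill_tokens_threshold is not None:
        if active_prefill_tokens > prefill_tokens_threshold:
            return True
    return False


# 87% block utilization exceeds a 0.85 threshold
print(is_rank_busy(87, 100, 0, decode_blocks_threshold=0.85))        # True
# 12,000 prefill tokens exceed a 10,000 threshold
print(is_rank_busy(50, 100, 12000, prefill_tokens_threshold=10000))  # True
# Neither threshold exceeded
print(is_rank_busy(50, 100, 5000, 0.85, 10000))                      # False
```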

Data-Parallel Rank Aggregation

For workers with multiple data-parallel (DP) ranks, the worker is marked busy only if ALL ranks are busy:

    def is_busy(worker):
        return all(rank.is_busy() for rank in worker.dp_ranks)

This prevents false positives when only some ranks are temporarily loaded.
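To make the aggregation rule concrete, here is a small self-contained sketch. The `Rank` and `Worker` classes and the `is_worker_busy` name are hypothetical stand-ins for illustration only. One detail worth noting: Python's `all()` returns `True` for an empty sequence, so this sketch adds a guard for a worker that has not reported any ranks yet. Whether the real implementation treats that case the same way is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Rank:
    """Hypothetical data-parallel rank with a precomputed busy flag."""
    busy: bool

    def is_busy(self):
        return self.busy


@dataclass
class Worker:
    """Hypothetical worker holding one entry per data-parallel rank."""
    dp_ranks: list = field(default_factory=list)


def is_worker_busy(worker):
    # Busy only when there is at least one rank and every rank is busy.
    # The emptiness guard is an assumption; plain all() would return True.
    return bool(worker.dp_ranks) and all(r.is_busy() for r in worker.dp_ranks)


print(is_worker_busy(Worker([Rank(True), Rank(False)])))  # False
print(is_worker_busy(Worker([Rank(True), Rank(True)])))   # True
print(is_worker_busy(Worker([])))                         # False
```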

Worker Load Monitoring

The KvWorkerMonitor runs as a background task that:

  1. Subscribes to KV cache metrics events from workers
  2. Maintains load state for each worker instance
  3. Recalculates busy instances when metrics change
  4. Updates the router with the current busy list
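The four steps above can be sketched as a small in-memory class. This is a toy model, not the actual KvWorkerMonitor: the class name `ToyWorkerMonitor`, the `on_metrics` method, and the dictionary-based event format are all hypothetical. Only the metric names (`active_decode_blocks`, `kv_total_blocks`, `active_prefill_tokens`) come from this document. Each worker is treated as a single rank to keep the example short.

```python
class ToyWorkerMonitor:
    """In-memory stand-in for the background monitor (illustration only)."""

    def __init__(self, decode_blocks_threshold, prefill_tokens_threshold):
        self.decode_blocks_threshold = decode_blocks_threshold
        self.prefill_tokens_threshold = prefill_tokens_threshold
        self.load = {}     # worker_id -> latest metrics dict
        self.busy = set()  # worker ids currently considered busy

    def on_metrics(self, worker_id, metrics):
        """Handle one metrics event: store it, then recompute the busy set."""
        self.load[worker_id] = metrics
        self.busy = {wid for wid, m in self.load.items() if self._is_busy(m)}
        # The sorted list is what would be pushed to the router.
        return sorted(self.busy)

    def _is_busy(self, m):
        total = m["kv_total_blocks"]
        blocks_busy = (
            total > 0
            and m["active_decode_blocks"] / total > self.decode_blocks_threshold
        )
        tokens_busy = m["active_prefill_tokens"] > self.prefill_tokens_threshold
        return blocks_busy or tokens_busy


monitor = ToyWorkerMonitor(0.85, 10000)
print(monitor.on_metrics("w1", {"active_decode_blocks": 90,
                                "kv_total_blocks": 100,
                                "active_prefill_tokens": 0}))    # ['w1']
print(monitor.on_metrics("w2", {"active_decode_blocks": 10,
                                "kv_total_blocks": 100,
                                "active_prefill_tokens": 500}))  # ['w1']
print(monitor.on_metrics("w1", {"active_decode_blocks": 40,
                                "kv_total_blocks": 100,
                                "active_prefill_tokens": 0}))    # []
```

Recomputing the whole busy set on every event is the simplest approach for a sketch; a production monitor would more likely update only the worker whose metrics changed.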

Metrics Collected

Workers publish these metrics for monitoring:

| Metric | Description |
|---|---|
| active_decode_blocks | Number of KV cache blocks currently in use |
| kv_total_blocks | Total KV cache blocks available |
| active_prefill_tokens | Number of tokens currently being prefilled |

Rejection Behavior

Request Flow

  1. Request arrives at frontend
  2. Push router checks if busy threshold is configured
  3. If configured, router retrieves list of free (non-busy) instances
  4. If no free instances exist (but instances are registered):
    • Request is rejected with PipelineError::ServiceOverloaded
    • HTTP 503 response is returned to client
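The steps above can be sketched as a single selection function. This is an illustration under stated assumptions, not the push router's code: `select_instance` and the Python `ServiceOverloaded` exception are hypothetical stand-ins (the latter for `PipelineError::ServiceOverloaded`), random selection among free instances is a placeholder for whatever routing policy is actually in use, and the handling of an empty instance list is an assumption because this document does not describe that case.

```python
import random


class ServiceOverloaded(Exception):
    """Stand-in for PipelineError::ServiceOverloaded; surfaces as HTTP 503."""


def select_instance(instances, busy, thresholds_configured):
    """Pick a worker instance, or raise ServiceOverloaded if all are busy."""
    if not instances:
        # No registered instances is a different condition from overload.
        # The error type used here is an assumption.
        raise LookupError("no instances registered")
    if not thresholds_configured:
        # Rejection is disabled: every registered instance is eligible.
        return random.choice(instances)
    free = [i for i in instances if i not in busy]
    if not free:
        raise ServiceOverloaded("All workers are busy, please retry later")
    # Placeholder policy: pick any free instance at random.
    return random.choice(free)


print(select_instance(["a", "b"], busy={"a"}, thresholds_configured=True))  # b
```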

Error Response

When requests are rejected, clients receive:

    HTTP/1.1 503 Service Unavailable
    Content-Type: application/json

    {
      "message": "Service temporarily unavailable: All workers are busy, please retry later",
      "type": "service_unavailable",
      "code": 503
    }

Client Retry Strategy

Clients should implement exponential backoff when receiving 503 responses:

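    import random
    import time

    def send_with_retry(request, max_retries=5):
        """Send a request, retrying with exponential backoff on HTTP 503."""
        for attempt in range(max_retries):
            # send_request is the application's own HTTP call (not defined here)
            response = send_request(request)
            if response.status_code != 503:
                return response

            # Skip the sleep after the final attempt; there is nothing to wait for
            if attempt < max_retries - 1:
                # Exponential backoff with jitter, capped at 60 seconds
                wait_time = min(60, (2 ** attempt) + random.uniform(0, 1))
                time.sleep(wait_time)

        raise RuntimeError("Max retries exceeded")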

Monitoring

Prometheus Metrics

Track rejection behavior with these metrics:

| Metric | Type | Description |
|---|---|---|
| dynamo_tasks_rejected_total | Counter | Total number of rejected tasks |
| dynamo_queued_requests | Gauge | Requests waiting in HTTP queue |

Example Prometheus Queries

    # Rejection rate over 5 minutes
    rate(dynamo_tasks_rejected_total[5m])

    # Percentage of requests rejected
    sum(rate(dynamo_tasks_rejected_total[5m])) /
    sum(rate(dynamo_tasks_issued_total[5m])) * 100
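As a rough illustration of what the percentage query computes, the sketch below does the same arithmetic on two samples of each counter taken at the start and end of a window. The function name `rejection_percentage` is hypothetical, and the sample values are made up for the example. Unlike Prometheus `rate()`, this sketch does not handle counter resets or extrapolation.

```python
def rejection_percentage(rejected_start, rejected_end, issued_start, issued_end):
    """Percentage of issued tasks rejected between two counter samples."""
    issued = issued_end - issued_start
    if issued <= 0:
        # No traffic in the window, so there is nothing to report.
        return 0.0
    return (rejected_end - rejected_start) / issued * 100


# 25 rejections out of 100 issued tasks in the window
print(rejection_percentage(100, 125, 1000, 1100))  # 25.0
```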

Grafana Alerting

Example alert for high rejection rate:

    alert: HighRequestRejectionRate
    expr: |
      sum(rate(dynamo_tasks_rejected_total[5m])) /
      sum(rate(dynamo_tasks_issued_total[5m])) > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High request rejection rate"
      description: "More than 10% of requests are being rejected"

Tuning Thresholds

Conservative Settings (Latency-Focused)

For applications prioritizing low latency:

    --active-decode-blocks-threshold 0.70
    --active-prefill-tokens-threshold 5000

  • Rejects earlier, before workers become fully loaded
  • Maintains lower queue depths
  • Better tail latencies

Aggressive Settings (Throughput-Focused)

For applications prioritizing throughput:

    --active-decode-blocks-threshold 0.95
    --active-prefill-tokens-threshold 20000

  • Allows higher worker utilization
  • May increase latency variability
  • Better overall throughput

Disabled (No Rejection)

To disable request rejection entirely:

    # Simply don't set the threshold arguments
    python -m dynamo.frontend

Without thresholds configured, all requests are accepted regardless of worker load.

Best Practices

1. Start Conservative, Then Tune

Begin with conservative thresholds and increase based on observed behavior:

    # Start here
    --active-decode-blocks-threshold 0.75

    # Increase if rejection rate is too high
    --active-decode-blocks-threshold 0.85

2. Monitor Before Enabling

Observe worker load patterns before setting thresholds:

    # Watch KV cache utilization
    watch -n 1 'curl -s localhost:8000/metrics | grep kv_blocks'

3. Use Both Thresholds for Disaggregated Serving

In disaggregated deployments:

  • Use active_prefill_tokens_threshold for prefill workers
  • Use active_decode_blocks_threshold for decode workers

4. Coordinate with Autoscaling

If using Kubernetes HPA, set the autoscaling trigger below the rejection threshold so that new capacity starts coming online before requests are rejected:

    # HPA triggers at 70% utilization
    # Rejection at 85% provides buffer
    --active-decode-blocks-threshold 0.85

Related Documentation

  • Health Checks - Worker health monitoring
  • Metrics - Available Prometheus metrics
  • Request Migration - Handling failed requests