
Copyright © 2026, NVIDIA Corporation.


Known Issues and Mitigations


For general TensorRT-LLM features and configuration, see the Reference Guide.


KV Cache Exhaustion Causing Worker Deadlock (Disaggregated Serving)

Issue: In disaggregated serving mode, TensorRT-LLM workers can become stuck and unresponsive after sustained high-load traffic. Once in this state, workers require a pod/process restart to recover.

Symptoms:

  • Workers function normally at first but hang after heavy load testing
  • Inference requests get stuck and eventually time out
  • Logs show warnings: num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache
  • Error logs may contain: asyncio.exceptions.InvalidStateError: invalid state
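
To check whether a worker is showing the symptoms above, its logs can be scanned for the quoted messages. The following is a minimal sketch, not part of Dynamo or TensorRT-LLM: the function name `count_symptoms` and the marker labels are hypothetical, and it assumes the messages appear verbatim as substrings of individual log lines.

```python
from collections import Counter
from typing import Iterable

# Substrings copied from the symptoms listed above.
# Assumption: each message appears verbatim within a single log line.
SYMPTOM_MARKERS = {
    "no_fitting_requests": "num_fitting_reqs=0",
    "kv_cache_warning": "may not have enough kvCache",
    "invalid_state_error": "asyncio.exceptions.InvalidStateError",
}


def count_symptoms(lines: Iterable[str]) -> Counter:
    """Count how many log lines contain each known symptom marker."""
    counts = Counter()
    for line in lines:
        for name, marker in SYMPTOM_MARKERS.items():
            if marker in line:
                counts[name] += 1
    return counts
```

Passing an open log file (for example, `count_symptoms(open(path))` with a path of your choosing) returns a per-marker count. Non-zero counts after a load test suggest the worker may be affected by this issue, though they do not prove it.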

Root Cause: When max_tokens_in_buffer in the cache transceiver config is smaller than the maximum input sequence length (ISL) being processed, the KV cache can be exhausted under heavy load. Context transfers then time out, leaving workers waiting on phantom transfers that never complete, and the workers enter an unrecoverable deadlock.

Mitigation: Ensure max_tokens_in_buffer exceeds your maximum expected input sequence length. Update your engine configuration files (e.g., prefill.yaml and decode.yaml):

cache_transceiver_config:
  backend: DEFAULT
  max_tokens_in_buffer: 65536 # Must exceed max ISL

For example, see examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml.
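
The mitigation can also be enforced as a pre-deployment check. The sketch below is illustrative only: `check_transceiver_buffer` is a hypothetical helper, not a Dynamo or TensorRT-LLM API, and it assumes the engine config has already been loaded into a Python dict (for instance with a YAML parser) and that you know your maximum expected ISL.

```python
def check_transceiver_buffer(config: dict, max_isl: int) -> None:
    """Raise ValueError unless max_tokens_in_buffer exceeds max_isl.

    `config` is the parsed engine config (e.g. prefill.yaml or decode.yaml);
    `max_isl` is the maximum expected input sequence length in tokens.
    """
    transceiver = config.get("cache_transceiver_config") or {}
    buffer_tokens = transceiver.get("max_tokens_in_buffer")
    if buffer_tokens is None:
        raise ValueError(
            "cache_transceiver_config.max_tokens_in_buffer is not set"
        )
    # The mitigation requires the buffer to be strictly larger than max ISL.
    if buffer_tokens <= max_isl:
        raise ValueError(
            f"max_tokens_in_buffer ({buffer_tokens}) must exceed "
            f"the maximum ISL ({max_isl})"
        )


# Example: the config shown above, checked against an assumed 32768-token max ISL.
check_transceiver_buffer(
    {"cache_transceiver_config": {"backend": "DEFAULT", "max_tokens_in_buffer": 65536}},
    max_isl=32768,
)
```

Running the check against both the prefill and decode configs helps catch a mismatch before it appears under load.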

Related Issue: #4327