For general TensorRT-LLM features and configuration, see the Reference Guide.
Issue: In disaggregated serving mode, TensorRT-LLM workers can become stuck and unresponsive after sustained high-load traffic. Once in this state, workers require a pod/process restart to recover.
Symptoms:
num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache
asyncio.exceptions.InvalidStateError: invalid state
Root Cause: When max_tokens_in_buffer in the cache transceiver config is smaller than the maximum input sequence length (ISL) being processed, KV cache exhaustion can occur under heavy load. Context transfers then time out, leaving workers stuck waiting for phantom transfers and entering an irrecoverable deadlock state.
Mitigation: Ensure max_tokens_in_buffer exceeds your maximum expected input sequence length. Set this value in your engine configuration files (e.g., prefill.yaml and decode.yaml).
For example, see examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml.
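The mitigation might look like the fragment below. This is a sketch under stated assumptions: the text above says only that max_tokens_in_buffer belongs to the cache transceiver config, so the cache_transceiver_config key name and nesting are assumed here, and 65536 is an illustrative value rather than a recommendation. Check the referenced example file for the exact layout.

```yaml
# Hypothetical fragment for prefill.yaml and decode.yaml.
# Assumption: the cache transceiver settings live under a
# top-level cache_transceiver_config key.
cache_transceiver_config:
  # Illustrative value only. Choose a number larger than the
  # longest input sequence (ISL) you expect to serve.
  max_tokens_in_buffer: 65536
```

If your longest expected prompt is, say, 32,000 tokens, any value above that satisfies the condition described in the root cause. Apply the same setting to both the prefill and decode configurations so the two sides of the transfer agree.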
Related Issue: #4327