KV Cache
TensorRT supports KV caching for transformer-based LLM inference through the IKVCacheUpdateLayer (C++, Python).
The IKVCacheUpdateLayer has three inputs, one optional input, one output, and two attributes:
Inputs
(index 0) cache: The K/V cache tensor with shape [B, N, S_max, H]. Allocate this tensor and provide it as an input to the TensorRT network.
(index 1) update: The newly calculated K or V tensor. Its shape depends on the updateForm attribute:
    kPADDED_BHND (default): [B, N, S_new, H]
    kPACKED_NHD: [totalTokens, N, H]
(index 2) writeIndices: A tensor of shape [B] indicating where to start writing K/V updates for each sequence.
Optional input
(index 3) updateLengths: Cumulative token counts with shape [B + 1]. The first element must be 0 and the last element equals totalTokens, so the number of tokens for batch i is updateLengths[i + 1] - updateLengths[i]. Required when updateForm is kPACKED_NHD. Use setUpdateLengths to set this tensor.
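The cumulative layout of updateLengths can be illustrated with a small host-side sketch. It is plain Python and does not call TensorRT. The helper names build_update_lengths and tokens_for_batch are hypothetical, and the per-sequence token counts are made-up example values.

```python
from itertools import accumulate


def build_update_lengths(tokens_per_sequence):
    """Return cumulative token counts of length B + 1, starting at 0.

    Hypothetical helper: mirrors the layout described for updateLengths.
    """
    if any(n < 0 for n in tokens_per_sequence):
        raise ValueError("token counts must be non-negative")
    return list(accumulate(tokens_per_sequence, initial=0))


def tokens_for_batch(update_lengths, i):
    """Number of new tokens for batch i, recovered from the cumulative counts."""
    return update_lengths[i + 1] - update_lengths[i]


# Example: three sequences contributing 3, 1, and 4 new tokens.
update_lengths = build_update_lengths([3, 1, 4])
total_tokens = update_lengths[-1]  # first dimension of a packed update tensor
print(update_lengths, total_tokens)  # [0, 3, 4, 8] 8
```

In an application, the resulting values would be copied into the tensor that is passed to setUpdateLengths.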
Attributes
cacheMode: An enum that specifies the K/V cache update strategy. Currently, only kLINEAR is supported, which performs sequential updates to the cache based on the provided write indices.
updateForm: An enum (AttentionIOForm) that specifies the layout of the update tensor. The default is kPADDED_BHND. When set to kPACKED_NHD, the update tensor shape is [totalTokens, N, H] and the updateLengths tensor must be provided.
Output
The IKVCacheUpdateLayer has one output: the updated K/V cache tensor with shape [B, N, S_max, H].
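As a rough mental model of the kLINEAR update with the padded layout, the following pure-Python sketch uses nested lists in place of tensors. It is not TensorRT's implementation. It assumes that the S_new entries for sequence b are written to consecutive cache positions starting at writeIndices[b], and the function name linear_cache_update is hypothetical.

```python
def linear_cache_update(cache, update, write_indices):
    """Reference model of a linear K/V cache update (assumed semantics).

    cache:  nested lists shaped [B][N][S_max][H], modified in place
    update: nested lists shaped [B][N][S_new][H]
    write_indices: list of length B with the start position per sequence
    """
    for b, start in enumerate(write_indices):
        for n, head_update in enumerate(update[b]):
            s_new = len(head_update)
            s_max = len(cache[b][n])
            if start < 0 or start + s_new > s_max:
                raise IndexError("update does not fit in the cache")
            for s, vec in enumerate(head_update):
                # Copy the new vector into its slot along the sequence axis.
                cache[b][n][start + s] = list(vec)
    return cache


# Example with B=2, N=1, S_max=4, H=2 and S_new=2.
B, N, S_MAX, H = 2, 1, 4, 2
cache = [[[[0] * H for _ in range(S_MAX)] for _ in range(N)] for _ in range(B)]
update = [
    [[[1, 1], [2, 2]]],  # sequence 0
    [[[3, 3], [4, 4]]],  # sequence 1
]
linear_cache_update(cache, update, write_indices=[0, 2])
print(cache[0][0])  # [[1, 1], [2, 2], [0, 0], [0, 0]]
print(cache[1][0])  # [[0, 0], [0, 0], [3, 3], [4, 4]]
```

The sketch returns the same object it received, which loosely mirrors the requirement that the layer's output shares memory with the cache input.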
Note
IKVCacheUpdateLayer does not own or allocate the cache memory. Memory management is handled by higher-level frameworks (such as TensorRT Edge-LLM) or by your application.
Warning
The cache input tensor and the output tensor must share the same device memory to enable in-place updates. Use IExecutionContext::setTensorAddress to explicitly bind both tensors to the same memory address. If different addresses are assigned, TensorRT reports an API error.
To discover which input tensor an output tensor must be aliased with, use the engine API ICudaEngine::getAliasedInputTensor(char const* outputTensorName) (available in C++ and Python), which returns the name of the aliased input tensor.
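The in-place binding can be sketched in Python as follows. The helper bind_cache_in_place is hypothetical and relies only on the set_tensor_address(name, memory) method of the Python IExecutionContext. The tensor names "k_cache" and "k_cache_out", the address value, and the stand-in context class are assumptions made so the example runs without a GPU.

```python
def bind_cache_in_place(context, cache_input_name, cache_output_name, device_ptr):
    """Bind the cache input and its aliased output to one device address.

    context is expected to behave like a TensorRT IExecutionContext, whose
    set_tensor_address(name, memory) returns False on failure.
    """
    for name in (cache_input_name, cache_output_name):
        if not context.set_tensor_address(name, device_ptr):
            raise RuntimeError(f"failed to set address for tensor {name!r}")


class FakeContext:
    """Stand-in for an execution context, for illustration only."""

    def __init__(self):
        self.addresses = {}

    def set_tensor_address(self, name, memory):
        self.addresses[name] = memory
        return True


# Hypothetical tensor names and a made-up device address.
ctx = FakeContext()
bind_cache_in_place(ctx, "k_cache", "k_cache_out", 0x7F0000000000)
print(ctx.addresses["k_cache"] == ctx.addresses["k_cache_out"])  # True
```

With a real execution context, device_ptr would be the address of the cache buffer that your application or framework allocated, and the input name could be taken from the aliasing query described above instead of being hard-coded.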
The IKVCacheUpdateLayer has the same hardware support matrix as IAttention.