KV Cache

TensorRT supports KV caching for transformer-based LLM inference through the IKVCacheUpdateLayer (C++, Python).

The IKVCacheUpdateLayer has three inputs, one optional input, one output, and two attributes:

Inputs

(index 0) cache: The K/V cache tensor with shape [B, N, S_max, H]. Allocate this tensor and provide it as an input to the TensorRT network.
(index 1) update: The newly calculated K or V tensor. Its shape depends on the updateForm attribute:
- kPADDED_BHND (default): [B, N, S_new, H]
- kPACKED_NHD: [totalTokens, N, H]
(index 2) writeIndices: A tensor of shape [B] indicating where to start writing K/V updates for each sequence.
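The following pure-Python sketch illustrates one plausible reading of the update for the default kPADDED_BHND form: for each sequence b, the S_new new entries are written into the cache starting at writeIndices[b]. It is a reference model built on nested lists, not the TensorRT implementation. The function name kv_cache_update_padded is hypothetical, and the sketch assumes that every sequence writes all S_new positions.

```python
def kv_cache_update_padded(cache, update, write_indices):
    """Reference sketch (assumed semantics, not TensorRT code).

    cache:         nested lists shaped [B][N][S_max][H]
    update:        nested lists shaped [B][N][S_new][H]
    write_indices: list of B start positions along the S_max axis
    """
    for b, start in enumerate(write_indices):
        for n, new_rows in enumerate(update[b]):
            rows = cache[b][n]
            # Refuse writes that would fall outside the preallocated cache.
            if start < 0 or start + len(new_rows) > len(rows):
                raise ValueError(f"sequence {b}: update does not fit in the cache")
            for s, vec in enumerate(new_rows):
                rows[start + s] = list(vec)
    return cache  # the same object: the update happens in place


# Example with B=2, N=1, S_max=4, H=2 and one new position per sequence.
B, N, S_max, H = 2, 1, 4, 2
cache = [[[[0.0] * H for _ in range(S_max)] for _ in range(N)] for _ in range(B)]
update = [[[[1.0, 1.0]]], [[[2.0, 2.0]]]]
kv_cache_update_padded(cache, update, write_indices=[0, 2])
print(cache[0][0])
print(cache[1][0])
```

In a decode loop, the application would typically advance each write index by the number of positions written in the previous step. That bookkeeping sits outside the layer.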

Optional input

(index 3) updateLengths: Cumulative token counts with shape [B + 1]. The first element must be 0 and the last element must equal totalTokens, so the number of tokens for sequence i in the batch is updateLengths[i + 1] - updateLengths[i]. Required when updateForm is kPACKED_NHD. Use setUpdateLengths to set this tensor.
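As a small worked example of the cumulative layout described above, the sketch below builds updateLengths from per-sequence token counts and recovers a single count from it. The helper names build_update_lengths and tokens_for_sequence are hypothetical and exist only for illustration.

```python
from itertools import accumulate


def build_update_lengths(tokens_per_sequence):
    """Return the cumulative counts: [0, t0, t0 + t1, ..., totalTokens]."""
    return [0] + list(accumulate(tokens_per_sequence))


def tokens_for_sequence(update_lengths, i):
    """Number of new tokens for sequence i."""
    return update_lengths[i + 1] - update_lengths[i]


update_lengths = build_update_lengths([3, 1, 4])
print(update_lengths)                          # [0, 3, 4, 8]
print(tokens_for_sequence(update_lengths, 2))  # 4
```

For B = 3 the result has B + 1 = 4 elements, starts at 0, and ends at totalTokens = 8.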

Attributes

cacheMode: An enum that specifies the K/V cache update strategy. Currently, only kLINEAR is supported, which performs sequential updates to the cache based on the provided write indices.
updateForm: An enum (AttentionIOForm) that specifies the layout of the update tensor. The default is kPADDED_BHND. When set to kPACKED_NHD, the update tensor shape is [totalTokens, N, H] and the updateLengths tensor must be provided.
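For the kPACKED_NHD form, the sketch below shows one plausible interpretation of how the packed update tensor, writeIndices, and updateLengths combine under kLINEAR: the tokens of sequence b occupy rows updateLengths[b] through updateLengths[b + 1] - 1 of the update tensor and are written consecutively starting at writeIndices[b]. This is a pure-Python reference model under that assumption, not the TensorRT implementation, and the function name kv_cache_update_packed is hypothetical.

```python
def kv_cache_update_packed(cache, update, write_indices, update_lengths):
    """Reference sketch (assumed semantics, not TensorRT code).

    cache:          nested lists shaped [B][N][S_max][H]
    update:         nested lists shaped [totalTokens][N][H]
    write_indices:  list of B start positions along the S_max axis
    update_lengths: cumulative token counts, length B + 1
    """
    batch = len(write_indices)
    if len(update_lengths) != batch + 1 or update_lengths[0] != 0:
        raise ValueError("update_lengths must have B + 1 elements and start at 0")
    if update_lengths[-1] != len(update):
        raise ValueError("last element of update_lengths must equal totalTokens")

    for b in range(batch):
        first, last = update_lengths[b], update_lengths[b + 1]
        for offset, token in enumerate(update[first:last]):
            pos = write_indices[b] + offset
            for n, vec in enumerate(token):
                # Refuse writes that would fall outside the preallocated cache.
                if pos < 0 or pos >= len(cache[b][n]):
                    raise ValueError(f"sequence {b}: write position {pos} out of range")
                cache[b][n][pos] = list(vec)
    return cache  # the same object: the update happens in place


# Example with B=2, N=1, S_max=4, H=1: two tokens for sequence 0, one for sequence 1.
cache = [[[[0.0] for _ in range(4)]] for _ in range(2)]
update = [[[1.0]], [[2.0]], [[3.0]]]
kv_cache_update_packed(cache, update, write_indices=[1, 0], update_lengths=[0, 2, 3])
print(cache[0][0])
print(cache[1][0])
```

Packing avoids padding every sequence to a common S_new, at the cost of supplying the extra updateLengths tensor.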

Output

The IKVCacheUpdateLayer has one output: the updated K/V cache tensor with shape [B, N, S_max, H].

Note

IKVCacheUpdateLayer does not own or allocate the cache memory. Memory management is handled by higher-level frameworks (such as TensorRT Edge-LLM) or by your application.

Warning

The cache input tensor and the output tensor must share the same device memory to enable in-place updates. Use IExecutionContext::setTensorAddress to explicitly bind both tensors to the same memory address. If different addresses are assigned, TensorRT reports an API error.
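A minimal Python sketch of the binding step follows. It relies only on the execution context's set_tensor_address(name, address) method, the Python counterpart of IExecutionContext::setTensorAddress. The helper name bind_kv_cache_in_place and the tensor names k_cache and k_cache_out are hypothetical, and the small recording class stands in for a real execution context so the snippet runs without TensorRT installed.

```python
def bind_kv_cache_in_place(context, cache_input_name, cache_output_name, device_ptr):
    """Bind the cache input and the layer output to the same device address."""
    for name in (cache_input_name, cache_output_name):
        # set_tensor_address returns False when the binding is rejected.
        if not context.set_tensor_address(name, device_ptr):
            raise RuntimeError(f"failed to bind tensor {name!r}")


class RecordingContext:
    """Stand-in for an execution context (illustration only, not TensorRT)."""

    def __init__(self):
        self.bound = {}

    def set_tensor_address(self, name, address):
        self.bound[name] = address
        return True


# With TensorRT, `context` would be a real execution context and `device_ptr`
# the integer address of a device buffer allocated by the application.
context = RecordingContext()
bind_kv_cache_in_place(context, "k_cache", "k_cache_out", device_ptr=0x7F00)
print(context.bound)
```

Routing both names through one helper makes it harder to bind the input and output to different buffers by accident.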

TensorRT also exposes the engine API ICudaEngine::getAliasedInputTensor(char const* outputTensorName) (available in C++ and Python), which retrieves the name of the input tensor that a given output tensor is required to be aliased with.

The IKVCacheUpdateLayer has the same hardware support matrix as IAttention.