KV Cache Reuse (a.k.a. prefix caching)
How to use
Enable KV cache reuse by setting the environment variable NIM_ENABLE_KV_CACHE_REUSE to 1. See the configuration documentation for more information.
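If the microservice runs as a Docker container, the variable can be passed with `-e`. The sketch below is illustrative only: the image name, port, and GPU flags are placeholder assumptions, and any credentials or other settings your deployment requires are omitted.

```python
# Minimal sketch: start a NIM container with KV cache reuse enabled.
# The image name and port are placeholders, not values from this page.
import subprocess

subprocess.run(
    [
        "docker", "run", "--rm", "--gpus", "all",
        "-e", "NIM_ENABLE_KV_CACHE_REUSE=1",  # turn on prefix caching
        "-p", "8000:8000",
        "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest",  # placeholder image
    ],
    check=True,
)
```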
When to use
KV cache reuse helps most when the bulk of the initial prompt (roughly 90% or more) is identical across requests and only the final tokens differ. Because the key-value entries computed for the shared prefix can be reused, the engine only needs to process the part of the prompt that actually changes, which can substantially improve inference speed.
For example, when a user asks several questions about a large document, the document is repeated in every request while only the question at the end of the prompt changes. With this feature enabled, such workloads typically see about a 2x speedup in time-to-first-token (TTFT).
Example:
- Large table input followed by a question about the table
- Same large table input followed by a different question about the table
- Same large table input followed by a different question about the table
- and so forth…
KV cache reuse speeds up TTFT starting with the second request, once the key-value entries for the shared prefix are already in the cache.
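The following client sketch shows that request pattern against the microservice's OpenAI-compatible Chat Completions endpoint. It is illustrative only: the base URL, model name, questions, and table contents are assumptions, not values from this page.

```python
# Minimal sketch of a prompt pattern that benefits from KV cache reuse:
# an identical long prefix (the table) followed by a short, varying question.
from openai import OpenAI

# Placeholder base URL and API key; the OpenAI client requires a non-empty key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# Placeholder for the large table that is shared, unchanged, across requests.
large_table = "...large table contents, identical in every request..."

questions = [
    "How many rows have a value above 100?",
    "Which column has the highest average?",
    "Summarize the table in one sentence.",
]

for question in questions:
    # The shared prefix comes first and only the trailing question changes,
    # so from the second request onward the server can reuse the cached
    # key-value entries for the table instead of recomputing them.
    response = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",  # placeholder model name
        messages=[{"role": "user", "content": f"{large_table}\n\n{question}"}],
    )
    print(response.choices[0].message.content)
```

Keep the shared content at the very start of the prompt; requests that differ anywhere before the final tokens cannot reuse the cached prefix.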