KV Cache Reuse (a.k.a. prefix caching)
How to use
Enable KV cache reuse by setting the environment variable NIM_ENABLE_KV_CACHE_REUSE to 1. See the configuration documentation for more information.
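If the microservice runs as a Docker container, the variable can be passed with `-e`. The sketch below is illustrative only: the image name, port, and GPU flags are placeholder assumptions, and any credentials or other settings your deployment requires are omitted.

```python
# Minimal sketch: start a NIM container with KV cache reuse enabled.
# The image name and port are placeholders, not values from this page.
import subprocess

subprocess.run(
    [
        "docker", "run", "--rm", "--gpus", "all",
        "-e", "NIM_ENABLE_KV_CACHE_REUSE=1",  # turn on prefix caching
        "-p", "8000:8000",
        "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest",  # placeholder image
    ],
    check=True,
)
```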
When to use
KV cache reuse helps most when the bulk of the initial prompt (roughly 90% or more) is identical across requests and only the final tokens differ. Because the key-value entries computed for the shared prefix can be reused, the engine only needs to process the part of the prompt that actually changes, which can substantially improve inference speed.
For example, when a user asks several questions about a large document, the document is repeated in every request while only the question at the end of the prompt changes. With this feature enabled, such workloads typically see about a 2x speedup in time-to-first-token (TTFT).
Example:
- Large table input followed by a question about the table
- Same large table input followed by a different question about the table
- Same large table input followed by a different question about the table
- and so forth…
KV cache reuse speeds up TTFT starting with the second request, once the key-value entries for the shared prefix are already in the cache.
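The following client sketch shows that request pattern against the microservice's OpenAI-compatible Chat Completions endpoint. It is illustrative only: the base URL, model name, questions, and table contents are assumptions, not values from this page.

```python
# Minimal sketch of a prompt pattern that benefits from KV cache reuse:
# an identical long prefix (the table) followed by a short, varying question.
from openai import OpenAI

# Placeholder base URL and API key; the OpenAI client requires a non-empty key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# Placeholder for the large table that is shared, unchanged, across requests.
large_table = "...large table contents, identical in every request..."

questions = [
    "How many rows have a value above 100?",
    "Which column has the highest average?",
    "Summarize the table in one sentence.",
]

for question in questions:
    # The shared prefix comes first and only the trailing question changes,
    # so from the second request onward the server can reuse the cached
    # key-value entries for the table instead of recomputing them.
    response = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",  # placeholder model name
        messages=[{"role": "user", "content": f"{large_table}\n\n{question}"}],
    )
    print(response.choices[0].message.content)
```

Keep the shared content at the very start of the prompt; requests that differ anywhere before the final tokens cannot reuse the cached prefix.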