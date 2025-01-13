Enabled by setting the environment variable NIM_ENABLE_KV_CACHE_REUSE to 1. See the configuration documentation for more information.

When to use

Use KV cache reuse when more than 90% of the initial prompt is identical across multiple requests and only the final tokens differ. In that case, the key-value (KV) cache entries computed for the shared prefix can be reused, so only the differing tokens at the end of each prompt need to be processed, which can substantially improve inference speed.
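To judge whether a workload fits this pattern, it can help to measure how much of the prompt two requests actually share. The following is a minimal sketch, not part of NIM: the function name `shared_prefix_fraction`, the placeholder tokens, and the whitespace tokenization are all assumptions made for illustration. A real check would use the model's own tokenizer.

```python
from typing import Sequence


def shared_prefix_fraction(a: Sequence[str], b: Sequence[str]) -> float:
    """Return the share of the longer prompt covered by the common leading tokens."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    longest = max(len(a), len(b))
    return common / longest if longest else 0.0


# Hypothetical prompts: a long shared document followed by a short question.
# "tok" is a placeholder token; whitespace splitting stands in for a tokenizer.
document = ["tok"] * 950
first = document + "what is the total revenue ?".split()
second = document + "which row has the highest value ?".split()

fraction = shared_prefix_fraction(first, second)
print(f"shared prefix: {fraction:.1%}")
print("good candidate for KV cache reuse" if fraction > 0.9 else "limited benefit expected")
```

The fraction is taken relative to the longer prompt so that a short prompt that happens to be a prefix of a much longer one does not look like a near-perfect match.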

For example, when a user asks several questions about a large document, the document is repeated in every request and only the question at the end of the prompt changes. With this feature enabled, time-to-first-token (TTFT) typically improves by about 2x.

Example:

Large table input followed by a question about the table

Same large table input followed by a different question about the table

and so forth…

KV cache reuse speeds up TTFT starting with the second request, because the first request must populate the cache.
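Why the first request sees no benefit can be illustrated with a toy model. The sketch below is an assumption-laden simplification and does not reflect how NIM implements its KV cache: the class `ToyPrefixCache` and its `process` method are hypothetical names, and the "cache" only remembers earlier token sequences and counts how many tokens of a new request are not covered by a previously seen prefix.

```python
from typing import List, Sequence


def common_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the leading tokens that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Toy model of prefix reuse; real KV caches store attention state, not tokens."""

    def __init__(self) -> None:
        self._seen: List[List[str]] = []

    def process(self, tokens: Sequence[str]) -> int:
        """Return how many tokens still need computing for this request."""
        reused = max(
            (common_prefix_len(tokens, old) for old in self._seen), default=0
        )
        self._seen.append(list(tokens))
        return len(tokens) - reused


cache = ToyPrefixCache()
table = ["cell"] * 200  # placeholder for a large table input

first_cost = cache.process(table + ["question", "one"])
second_cost = cache.process(table + ["question", "two"])

print(f"tokens computed on first request:  {first_cost}")
print(f"tokens computed on second request: {second_cost}")
```

In this toy model the first request computes every token, while the second only computes the tokens after the shared prefix, which mirrors the TTFT behavior described above.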

You can use the following script to demonstrate the speedup: