Large Language Models (Latest)

KV Cache Reuse (a.k.a. prefix caching)

KV cache reuse is enabled by setting the environment variable NIM_ENABLE_KV_CACHE_REUSE to 1. See the configuration documentation for more information.
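
The variable is passed to the NIM container at launch, for example with docker run -e NIM_ENABLE_KV_CACHE_REUSE=1. If you manage containers from Python instead, the following is a minimal, hypothetical sketch using the Docker SDK for Python; the image name, API key, and port mapping are placeholders rather than values from this documentation.

# Hypothetical sketch: launch a NIM container with KV cache reuse enabled,
# using the Docker SDK for Python (pip install docker).
import docker

client = docker.from_env()
container = client.containers.run(
    "nvcr.io/nim/<org>/<model-image>:latest",   # placeholder image name
    detach=True,
    runtime="nvidia",                           # assumes the NVIDIA container runtime is configured
    environment={
        "NGC_API_KEY": "<your NGC API key>",    # placeholder credential
        "NIM_ENABLE_KV_CACHE_REUSE": "1",       # enable KV cache reuse (prefix caching)
    },
    ports={"8000/tcp": 8000},                   # expose the OpenAI-compatible API used below
)
print(container.id)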

When more than 90% of the initial prompt is identical across multiple requests, differing only in the final tokens, reusing the key-value (KV) cache computed for that shared prefix can substantially improve inference speed. Because the attention keys and values for the repeated portion are already cached, only the differing tokens at the end of each prompt need to be processed.

For example, when a user asks several questions about a large document, the document is repeated in every request while only the question at the end of the prompt changes. With this feature enabled, time-to-first-token (TTFT) typically improves by about 2x.

Example:

  • Large table input followed by a question about the table

  • Same large table input followed by a different question about the table

  • Same large table input followed by a different question about the table

  • and so forth…

KV cache reuse speeds up TTFT on the second and all subsequent requests that share the prefix.

You can use the following script to demonstrate the speedup:

import time
import requests
import json

# Define your model endpoint URL
API_URL = "http://0.0.0.0:8000/v1/chat/completions"

# Function to send a request to the API and return the response time
def send_request(model, messages, max_tokens=15):
    data = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "top_p": 1,
        "frequency_penalty": 1.0
    }
    headers = {
        "accept": "application/json",
        "Content-Type": "application/json"
    }
    start_time = time.time()
    response = requests.post(API_URL, headers=headers, data=json.dumps(data))
    end_time = time.time()
    output = response.json()
    print(f"Output: {output['choices'][0]['message']['content']}")
    print(f"Generation time: {end_time - start_time:.4f} seconds")
    return end_time - start_time

# Test function demonstrating caching with a long prompt
def test_prefix_caching():
    model = "your_model_name_here"

    # Long document to simulate complex input
    LONG_PROMPT = """# Table of People\n""" + \
        "| ID | Name | Age | Occupation | Country |\n" + \
        "|-----|---------------|-----|---------------|---------------|\n" + \
        "| 1 | John Doe | 29 | Engineer | USA |\n" + \
        "| 2 | Jane Smith | 34 | Doctor | Canada |\n" * 50  # Replicating rows to make the table long

    # First query (no caching)
    messages_1 = [{"role": "user", "content": LONG_PROMPT + "Question: What is the age of John Doe?"}]
    print("\nFirst query (no caching):")
    send_request(model, messages_1)

    # Second query (prefix caching enabled)
    messages_2 = [{"role": "user", "content": LONG_PROMPT + "Question: What is the occupation of Jane Smith?"}]
    print("\nSecond query (with prefix caching):")
    send_request(model, messages_2)

if __name__ == "__main__":
    test_prefix_caching()
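
The script above approximates TTFT by timing the entire request with a small max_tokens value. If you want to time the first token itself, the sketch below is one way to do it, assuming the endpoint supports the OpenAI-style stream parameter and returns server-sent events; the measure_ttft helper is illustrative and not part of the NIM API.

import json
import time
import requests

API_URL = "http://0.0.0.0:8000/v1/chat/completions"  # same endpoint as the script above

def measure_ttft(model, messages, max_tokens=15):
    # Request a streamed response; assumes the endpoint accepts the
    # OpenAI-style "stream" parameter and emits server-sent events.
    data = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": True,
    }
    headers = {"Content-Type": "application/json"}
    start_time = time.time()
    with requests.post(API_URL, headers=headers, data=json.dumps(data), stream=True) as response:
        for line in response.iter_lines():
            if line:
                # The first non-empty event approximates the arrival of the first token.
                ttft = time.time() - start_time
                print(f"TTFT: {ttft:.4f} seconds")
                return ttft
    return None

Calling measure_ttft twice with the same long prefix and different trailing questions, as in test_prefix_caching above, should show a lower value on the second call once the cached prefix is reused.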
