Large Language Models (Latest)

KV Cache Reuse (a.k.a. prefix caching)

KV cache reuse is enabled by setting the environment variable NIM_ENABLE_KV_CACHE_REUSE to 1. See the configuration documentation for more information.
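
The variable is passed to the NIM container at launch, for example with docker run -e NIM_ENABLE_KV_CACHE_REUSE=1. If you manage containers from Python instead, the following is a minimal, hypothetical sketch using the Docker SDK for Python; the image name, API key, and port mapping are placeholders rather than values from this documentation.

# Hypothetical sketch: launch a NIM container with KV cache reuse enabled,
# using the Docker SDK for Python (pip install docker).
import docker

client = docker.from_env()
container = client.containers.run(
    "nvcr.io/nim/<org>/<model-image>:latest",   # placeholder image name
    detach=True,
    runtime="nvidia",                           # assumes the NVIDIA container runtime is configured
    environment={
        "NGC_API_KEY": "<your NGC API key>",    # placeholder credential
        "NIM_ENABLE_KV_CACHE_REUSE": "1",       # enable KV cache reuse (prefix caching)
    },
    ports={"8000/tcp": 8000},                   # expose the OpenAI-compatible API used below
)
print(container.id)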

When more than 90% of the initial prompt is identical across multiple requests, differing only in the final tokens, reusing the key-value (KV) cache computed for that shared prefix can substantially improve inference speed. Because the attention keys and values for the repeated portion are already cached, only the differing tokens at the end of each prompt need to be processed.

For example, when a user asks several questions about a large document, the document is repeated in every request while only the question at the end of the prompt changes. With this feature enabled, time-to-first-token (TTFT) typically improves by about 2x.

Example:

  • Large table input followed by a question about the table

  • Same large table input followed by a different question about the table

  • Same large table input followed by a different question about the table

  • and so forth…

KV cache reuse speeds up TTFT on the second and all subsequent requests that share the prefix.

You can use the following script to demonstrate the speedup:

import time
import requests
import json

# Define your model endpoint URL
API_URL = "http://0.0.0.0:8000/v1/chat/completions"

# Function to send a request to the API and return the response time
def send_request(model, messages, max_tokens=15):
    data = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "top_p": 1,
        "frequency_penalty": 1.0
    }
    headers = {
        "accept": "application/json",
        "Content-Type": "application/json"
    }
    start_time = time.time()
    response = requests.post(API_URL, headers=headers, data=json.dumps(data))
    end_time = time.time()
    output = response.json()
    print(f"Output: {output['choices'][0]['message']['content']}")
    print(f"Generation time: {end_time - start_time:.4f} seconds")
    return end_time - start_time

# Test function demonstrating caching with a long prompt
def test_prefix_caching():
    model = "your_model_name_here"

    # Long document to simulate complex input
    LONG_PROMPT = """# Table of People\n""" + \
        "| ID | Name | Age | Occupation | Country |\n" + \
        "|-----|---------------|-----|---------------|---------------|\n" + \
        "| 1 | John Doe | 29 | Engineer | USA |\n" + \
        "| 2 | Jane Smith | 34 | Doctor | Canada |\n" * 50  # Replicating rows to make the table long

    # First query (no caching)
    messages_1 = [{"role": "user", "content": LONG_PROMPT + "Question: What is the age of John Doe?"}]
    print("\nFirst query (no caching):")
    send_request(model, messages_1)

    # Second query (prefix caching enabled)
    messages_2 = [{"role": "user", "content": LONG_PROMPT + "Question: What is the occupation of Jane Smith?"}]
    print("\nSecond query (with prefix caching):")
    send_request(model, messages_2)

if __name__ == "__main__":
    test_prefix_caching()
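
The script above approximates TTFT by timing the entire request with a small max_tokens value. If you want to time the first token itself, the sketch below is one way to do it, assuming the endpoint supports the OpenAI-style stream parameter and returns server-sent events; the measure_ttft helper is illustrative and not part of the NIM API.

import json
import time
import requests

API_URL = "http://0.0.0.0:8000/v1/chat/completions"  # same endpoint as the script above

def measure_ttft(model, messages, max_tokens=15):
    # Request a streamed response; assumes the endpoint accepts the
    # OpenAI-style "stream" parameter and emits server-sent events.
    data = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": True,
    }
    headers = {"Content-Type": "application/json"}
    start_time = time.time()
    with requests.post(API_URL, headers=headers, data=json.dumps(data), stream=True) as response:
        for line in response.iter_lines():
            if line:
                # The first non-empty event approximates the arrival of the first token.
                ttft = time.time() - start_time
                print(f"TTFT: {ttft:.4f} seconds")
                return ttft
    return None

Calling measure_ttft twice with the same long prefix and different trailing questions, as in test_prefix_caching above, should show a lower value on the second call once the cached prefix is reused.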
