# Caching Instructions and Prompts

> Configure in-memory caching for LLM calls and KV cache reuse to improve performance and reduce latency.

The NVIDIA NeMo Guardrails library provides two caching strategies to reduce inference latency.
The in-memory model cache stores LLM responses and returns them for repeated prompts without calling the LLM again.
KV cache reuse is a NIM-level optimization that avoids recomputing the system prompt on each NemoGuard NIM call.
You can enable either or both strategies independently.
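To make the first strategy concrete, here is a minimal sketch of an in-memory response cache with least-frequently-used (LFU) eviction. It is an illustration of the idea only, not the library's implementation: the `LFUPromptCache` class, its `get_or_call` method, the `maxsize` parameter, and the `fake_llm` stand-in are all hypothetical names chosen for this example.

```python
from collections import Counter
from typing import Callable, Dict, List


class LFUPromptCache:
    """Toy cache keyed by the exact prompt text, with LFU eviction.

    Hypothetical sketch; not the NeMo Guardrails implementation.
    """

    def __init__(self, maxsize: int) -> None:
        if maxsize < 1:
            raise ValueError("maxsize must be at least 1")
        self.maxsize = maxsize
        self._responses: Dict[str, str] = {}
        self._uses: Counter = Counter()

    def get_or_call(self, prompt: str, call_llm: Callable[[str], str]) -> str:
        # Cache hit: return the stored response without calling the LLM.
        if prompt in self._responses:
            self._uses[prompt] += 1
            return self._responses[prompt]

        # Cache miss: call the LLM, then make room if the cache is full.
        response = call_llm(prompt)
        if len(self._responses) >= self.maxsize:
            # Evict the entry with the fewest uses (ties: oldest inserted).
            victim = min(self._responses, key=lambda p: self._uses[p])
            del self._responses[victim]
            del self._uses[victim]

        self._responses[prompt] = response
        self._uses[prompt] = 1
        return response


# Stand-in for a real LLM call; records every prompt it receives.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"echo: {prompt}"


cache = LFUPromptCache(maxsize=2)
cache.get_or_call("a", fake_llm)  # miss, calls the LLM
cache.get_or_call("a", fake_llm)  # hit, served from the cache
cache.get_or_call("b", fake_llm)  # miss, calls the LLM
cache.get_or_call("c", fake_llm)  # miss, evicts "b" (used less often than "a")
```

In this sketch only identical prompt strings count as a hit, which mirrors the "repeated prompts" behavior described above. The prompt that is reused most often stays cached when space runs out.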

Configure in-memory caching with LFU eviction to avoid repeated LLM calls for identical prompts.


Enable KV cache reuse in NVIDIA NIM for LLMs to reduce inference latency for NemoGuard models.
