Caching Instructions and Prompts | NVIDIA NeMo Guardrails Library Developer Guide

The NVIDIA NeMo Guardrails library provides two caching strategies to reduce inference latency. The in-memory model cache stores LLM responses and returns them for repeated prompts without calling the LLM again. KV cache reuse is a NIM-level optimization that avoids recomputing the system prompt on each NemoGuard NIM call. The two strategies are independent: you can enable either one on its own or both together.

Memory Model Cache

Configure an in-memory cache with least-frequently-used (LFU) eviction to avoid repeated LLM calls for identical prompts.
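
To make the idea of LFU eviction concrete, here is a minimal Python sketch of a response cache keyed by the exact prompt text. It is an illustration only and not the library's implementation or configuration: the names `LFUCache`, `cached_call`, and `fake_llm` are hypothetical, and the tie-breaking rule (evict the oldest entry among the least used) is an assumption made for this example.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional


class LFUCache:
    """Toy least-frequently-used cache (hypothetical, for illustration)."""

    def __init__(self, maxsize: int) -> None:
        if maxsize < 1:
            raise ValueError("maxsize must be at least 1")
        self.maxsize = maxsize
        self._values: Dict[str, str] = {}
        self._uses: Counter = Counter()

    def get(self, prompt: str) -> Optional[str]:
        """Return the cached response, or None on a miss."""
        if prompt not in self._values:
            return None
        self._uses[prompt] += 1
        return self._values[prompt]

    def put(self, prompt: str, response: str) -> None:
        """Store a response, evicting the least-used entry when full."""
        if prompt not in self._values and len(self._values) >= self.maxsize:
            # min() scans in insertion order, so ties go to the oldest entry.
            victim = min(self._values, key=lambda key: self._uses[key])
            del self._values[victim]
            del self._uses[victim]
        self._values[prompt] = response
        self._uses[prompt] += 1


def cached_call(cache: LFUCache, prompt: str, call_llm: Callable[[str], str]) -> str:
    """Call the model only when the exact prompt is not already cached."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    response = call_llm(prompt)
    cache.put(prompt, response)
    return response


# Stand-in for a real model call; records every prompt it receives.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"response to {prompt!r}"


cache = LFUCache(maxsize=2)
cached_call(cache, "a", fake_llm)  # miss: the model is called
cached_call(cache, "a", fake_llm)  # hit: served from the cache
cached_call(cache, "b", fake_llm)  # miss
cached_call(cache, "c", fake_llm)  # miss: cache is full, so "b" is evicted
```

In this sketch the prompt "a" survives eviction because it has been used more often than "b". A cache of this kind only helps when prompts repeat exactly, which is why it is keyed on the full prompt string.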


KV Cache Reuse

Enable KV cache reuse in NVIDIA NIM for LLMs to reduce inference latency for NemoGuard models.
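
The saving from reusing work on a shared system prompt can be shown with a toy Python model. This is a conceptual sketch only: a real KV cache stores attention key/value tensors inside the inference server, whereas the hypothetical `ToyPrefixCache` below simply counts how many tokens it has to process. No NIM API or setting is shown here.

```python
from typing import Dict, List, Tuple


class ToyPrefixCache:
    """Counts tokens processed when a shared prefix is computed once (hypothetical)."""

    def __init__(self) -> None:
        self._states: Dict[Tuple[str, ...], List[int]] = {}
        self.tokens_computed = 0

    def _compute(self, tokens: List[str]) -> List[int]:
        # Stand-in for the per-token work a model does to build its cache.
        self.tokens_computed += len(tokens)
        return [len(token) for token in tokens]

    def run(self, system_tokens: List[str], user_tokens: List[str]) -> List[int]:
        key = tuple(system_tokens)
        if key not in self._states:
            # First request with this system prompt: pay the full cost once.
            self._states[key] = self._compute(system_tokens)
        # Later requests reuse the stored prefix state and compute only the rest.
        return self._states[key] + self._compute(user_tokens)


system_prompt = "You are a safety classifier .".split()  # 6 tokens
engine = ToyPrefixCache()
engine.run(system_prompt, "hello there".split())     # 6 + 2 tokens computed
engine.run(system_prompt, "is this safe ?".split())  # only 4 more tokens computed
```

Without reuse, the two requests would process 18 tokens in total; with the prefix stored after the first request, the toy engine processes 12. The longer the shared system prompt is relative to the user input, the larger the saving.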
