Caching

Intro

A RAG pipeline repeats expensive work on every query: embedding the question, searching the index, and generating an answer from an LLM. Caching eliminates that repetition by storing results at each stage so subsequent queries can skip the computation entirely. The payoff is lower latency, lower cost, and reduced load on embedding models, vector databases, and LLMs.

The correct model is layered caching: a separate cache at each pipeline stage, each with its own key design, TTL policy, and invalidation trigger. A single cache at one layer does not protect you; if you cache only responses, every miss still pays the full embedding and retrieval cost, and if you cache only embeddings, you give up the chance to serve whole answers without touching the vector database or the LLM at all.

The hard part specific to RAG is that cache correctness is a security problem, not just a freshness problem. If cache keys omit authorization context, a query from an authorized user can populate the cache with retrieved content that a second, unauthorized user later receives. Every cache layer must therefore include permission-scoping fields in its key.

sequenceDiagram
  participant App
  participant EC as Embedding Cache
  participant EM as Embedding Model
  participant RC as Retrieval Cache
  participant VDB as Vector DB
  participant LC as Response Cache
  participant LLM

  App->>EC: hash query + model ver
  alt Cache hit
    EC-->>App: stored vector
  else Cache miss
    App->>EM: embed query
    EM-->>App: vector
    App->>EC: store vector
  end

  App->>RC: hash query + filters + tenant + index ver
  alt Cache hit
    RC-->>App: doc IDs + scores
  else Cache miss
    App->>VDB: ANN search
    VDB-->>App: doc IDs + scores
    App->>RC: store results
  end

  Note over App: assemble context from docs

  App->>LC: hash prompt + context + model ver
  alt Cache hit
    LC-->>App: cached answer
  else Cache miss
    App->>LLM: generate
    LLM-->>App: answer
    App->>LC: store response
  end
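The permission-scoping rule above can be enforced with a shared key builder that every layer goes through. This is a minimal sketch, not a specific library's API; the field names (tenant_id, index_ver) are illustrative and should match your own auth model:

```python
import hashlib
import json

def cache_key(layer: str, **fields) -> str:
    """Build a cache key that always carries permission scope.

    Refusing to build an unscoped key makes the security rule a
    hard failure instead of a silent cross-tenant leak.
    """
    if "tenant_id" not in fields:
        raise ValueError("cache key must be permission-scoped")
    # Canonical JSON so field ordering never changes the key.
    payload = json.dumps(fields, sort_keys=True)
    return f"{layer}:{hashlib.sha256(payload.encode()).hexdigest()}"

# Same query, different tenants -> different keys, so no cross-tenant reuse.
k1 = cache_key("retrieval", query="q3 revenue?", tenant_id="acme", index_ver=7)
k2 = cache_key("retrieval", query="q3 revenue?", tenant_id="globex", index_ver=7)
assert k1 != k2
```

Centralizing key construction also keeps the per-layer fields (model version, filters, index version) in one auditable place.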

Embedding Cache

How it works: key on a hash of the normalized query text plus the embedding model name and version; the value is the query vector. Identical queries skip the embedding model call entirely. Vectors are deterministic for a given model version, so entries can carry a long TTL.

Main risk: serving vectors from a previous model version. If the model version is missing from the key, a model upgrade silently mixes old and new vector spaces and retrieval quality degrades. Keep the version in the key and flush the cache on upgrade.
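A minimal in-process sketch of that scheme, assuming a MODEL_VER identifier and an injected embedding function (a production version would sit in Redis with a TTL rather than a dict):

```python
import hashlib

MODEL_VER = "text-embed-v3"  # assumed model identifier, not a real product name

_embedding_cache: dict[str, list[float]] = {}

def embed_query(query: str, embed_fn) -> list[float]:
    """Return the query vector, computing it at most once per (query, model)."""
    # Normalize so trivially different spellings share one entry.
    normalized = query.strip().lower()
    key = hashlib.sha256(f"{MODEL_VER}:{normalized}".encode()).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(query)  # cache miss: call the model
    return _embedding_cache[key]

# Usage with a stand-in embedder:
fake_embed = lambda q: [float(len(q)), 0.0]
v1 = embed_query("What is RAG?", fake_embed)
v2 = embed_query("what is rag?  ", fake_embed)  # normalizes to the same key
assert v1 is v2
```

Because MODEL_VER is part of the key, bumping it on a model upgrade makes every old entry unreachable without an explicit flush.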

Retrieval Cache

How it works: key on a hash of the query (or its embedding), the metadata filters, the tenant or permission scope, and the index version; the value is the list of document IDs and scores. A hit skips the ANN search against the vector database entirely.

Main risk: staleness and leakage. Results cached before a reindex can point at documents that have changed or been deleted, so the key must include an index version (or the cache must be invalidated on ingestion). Omitting the tenant or filter fields from the key serves one user's results to another.
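A sketch of that layer with a TTL and the full key from the diagram (query + filters + tenant + index version); the class and parameter names are illustrative, and a production version would live in a shared store like Redis:

```python
import hashlib
import json
import time

class RetrievalCache:
    """In-process sketch of a retrieval-result cache.

    The key covers query, filters, tenant, and index version, so a
    reindex or a different tenant can never hit a stale entry.
    """

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, list]] = {}

    def _key(self, query, filters, tenant_id, index_ver) -> str:
        payload = json.dumps(
            {"q": query, "f": filters, "t": tenant_id, "iv": index_ver},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_search(self, query, filters, tenant_id, index_ver, search_fn):
        key = self._key(query, filters, tenant_id, index_ver)
        hit = self._store.get(key)
        if hit and time.monotonic() - hit[0] < self.ttl:
            return hit[1]                        # cache hit: skip the ANN search
        results = search_fn(query, filters)      # cache miss: real vector DB call
        self._store[key] = (time.monotonic(), results)
        return results
```

Bumping index_ver after ingestion is the invalidation trigger: old entries simply stop matching and age out via the TTL.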

LLM Response Cache

LLM response caching operates at two levels that solve different problems.

Provider-level prompt caching (KV cache reuse): the provider reuses the transformer KV cache for a prompt prefix that exactly matches a recent request, cutting latency and input-token cost for the shared prefix. It pays off when the system prompt and instructions are stable across requests, and it is managed by the provider, not by your application.

Application-level response caching (exact or semantic match): your application stores the final answer keyed on a hash of the prompt, the assembled context, and the model version. Exact match fires only on identical inputs; semantic match compares query embeddings against cached queries and returns a stored answer above a similarity threshold, trading precision for hit rate.

Main risk: serving a cached answer built from context the current user is not allowed to see, or from context that has since changed. The key must include the assembled context (or its hash) plus the same permission-scoping fields as the retrieval cache, and semantic matching needs a conservative threshold so it does not answer a subtly different question.
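An exact-match sketch of the application-level cache, keyed as in the diagram (prompt + context + model version); MODEL_VER and the class name are illustrative:

```python
import hashlib

MODEL_VER = "llm-2024-06"  # assumed model identifier for illustration

class ResponseCache:
    """Exact-match application-level response cache (sketch).

    Hashing the prompt, the assembled context, and the model version
    means a changed document or a model upgrade is automatically a miss.
    """

    def __init__(self):
        self._store: dict[str, str] = {}

    def _key(self, prompt: str, context: str) -> str:
        # NUL separators prevent (prompt="ab", context="c") colliding
        # with (prompt="a", context="bc").
        raw = f"{MODEL_VER}\x00{prompt}\x00{context}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get_or_generate(self, prompt: str, context: str, generate_fn) -> str:
        key = self._key(prompt, context)
        if key not in self._store:
            self._store[key] = generate_fn(prompt, context)  # miss: call the LLM
        return self._store[key]
```

A semantic-match variant would replace the exact hash lookup with a nearest-neighbor search over cached query embeddings, gated by a similarity threshold.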

Pitfalls

Questions

References


What's next