RAG

Intro

Retrieval-Augmented Generation (RAG) combines retrieval and generation: retrieve evidence from your corpus, then generate an answer grounded in that evidence. It matters because knowledge changes faster than model weights, and RAG lets you update knowledge without retraining the model.
In practice, strong RAG systems are pipelines, not prompts. The main engineering work is query processing, retrieval quality, context assembly, evaluation, and production operations.
Example: for a support assistant, a user asks "What changed in API v2 rate limits?". RAG retrieves release notes and policy docs first, then the model answers with citations to the exact source sections instead of guessing from stale parametric memory.

Core Flow

flowchart LR
    Q[User Query] --> T[Query Translation]
    T --> R[Retrieval and Fusion]
    R --> RR[Optional Reranking]
    RR --> C[Context Assembly]
    C --> G[LLM Generation]
    G --> V[Groundedness and Citation Checks]
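The stages above can be sketched as a thin pipeline of plain functions. This is an illustrative toy, not a specific library's API: the `Chunk` shape, the lexical-overlap scorer, and the stub `generate` are all assumptions standing in for real embedding, fusion, and LLM calls.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str
    score: float

def translate(query: str) -> str:
    # Placeholder for query translation (spelling, acronym expansion, rewriting).
    return query.strip().lower()

def retrieve(query: str, corpus: list[Chunk], top_k: int = 3) -> list[Chunk]:
    # Token-overlap scorer standing in for vector search / BM25 fusion.
    terms = set(query.split())
    scored = [Chunk(c.text, c.source, len(terms & set(c.text.lower().split())))
              for c in corpus]
    return sorted(scored, key=lambda c: c.score, reverse=True)[:top_k]

def assemble_context(chunks: list[Chunk]) -> str:
    # Keep source labels so the generator can cite them.
    return "\n".join(f"[{c.source}] {c.text}" for c in chunks if c.score > 0)

def generate(query: str, context: str) -> str:
    # Placeholder for the LLM call; a real system prompts with the context.
    return f"Answer to '{query}' grounded in:\n{context}"

corpus = [
    Chunk("API v2 lowers rate limits to 100 requests per minute", "release-notes", 0),
    Chunk("Refunds are processed within 14 days", "policy", 0),
]
query = translate("API v2 rate limits")
answer = generate(query, assemble_context(retrieve(query, corpus)))
```

Only chunks with nonzero relevance reach the context, so the generator sees the release notes but not the unrelated policy document.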

Advanced RAG Patterns

Basic retrieve-then-generate works for straightforward factual lookups, but production systems hit failure modes that need more sophisticated patterns: multi-hop reasoning, uncertain retrieval quality, cross-modal evidence, and dynamic tool selection. Each pattern below solves a specific class of failures. Adopt incrementally after baseline RAG metrics plateau — every pattern adds latency, cost, and operational complexity.

flowchart TD
    Q[Query] --> CL[Classify Complexity]
    CL -->|Simple fact| SR[Single-Pass RAG]
    CL -->|Multi-hop or ambiguous| IR[Iterative Retrieval]
    CL -->|Low retrieval confidence| CR[CRAG]
    CL -->|Relational or connected facts| GR[Graph RAG]
    CL -->|Multiple data sources| AR[Agentic RAG]
    CL -->|Mixed modalities| MR[Multimodal RAG]

Iterative Retrieval

How it works:

context = []
for step in range(max_steps):
    # Let the model reason over the query plus everything retrieved so far.
    thought = llm.reason(query, context)
    if has_sufficient_info(thought):
        break
    # Turn the reasoning gap into a follow-up retrieval query.
    follow_up = extract_retrieval_query(thought)
    new_docs = retrieve(follow_up, top_k=3)
    context.extend(new_docs)
answer = llm.generate(query, context)

Where it fits: Multi-hop questions where the answer requires chaining evidence across several documents, and each follow-up query depends on what earlier retrievals returned.

Main risk: Latency and cost grow with every loop iteration, since each step adds a retrieval round-trip plus an LLM call; cap max_steps and watch for loops that never reach sufficiency.

Self-RAG

How it works:

| Token | Purpose | Values |
| --- | --- | --- |
| Retrieve | Decide whether to call the retriever | yes, no, continue |
| IsRel | Grade if retrieved document is relevant to the query | relevant, irrelevant |
| IsSup | Check if the generation is supported by the document | fully supported, partially supported, no support |
| IsUse | Rate if the generation actually answers the query | 1 through 5 |
# The model first emits a Retrieve reflection token given the input and generation so far.
retrieve_token = model.predict_token(input_x, preceding_y)
if retrieve_token == "yes":
    passages = retriever.get_top_k(input_x)
    candidates = []
    for passage in passages:
        # Generate one candidate segment per passage, then grade it with reflection tokens.
        segment = model.generate(input_x, passage)
        is_rel = model.predict_token("IsRel", input_x, passage)
        is_sup = model.predict_token("IsSup", input_x, passage, segment)
        is_use = model.predict_token("IsUse", input_x, segment)
        candidates.append((segment, score(is_rel, is_sup, is_use)))
    # Keep the segment with the best combined critique score.
    best = max(candidates, key=lambda x: x[1])

Where it fits: Systems that need per-query control over when to retrieve, with built-in relevance and support checks that reduce hallucination.

Main risk: Requires training a model to emit the reflection tokens; it is not a drop-in pattern for off-the-shelf models.

CRAG — Corrective Retrieval-Augmented Generation

How it works:

| Confidence | Action | Behavior |
| --- | --- | --- |
| High — above upper threshold | Correct | Refine documents with decompose-then-recompose |
| Low — below lower threshold | Incorrect | Discard retrieved docs and fall back to web search |
| Middle — between thresholds | Ambiguous | Combine refined documents with web search results |
# Grade each retrieved document; the best score drives the corrective action.
scores = [evaluator.score(query, doc) for doc in retrieved_docs]
confidence = max(scores)
if confidence >= upper_threshold:    # Correct: keep and refine
    knowledge = refine_strips(query, retrieved_docs)
elif confidence < lower_threshold:   # Incorrect: discard and search the web
    knowledge = web_search(rewrite_query(query))
else:                                # Ambiguous: combine both sources
    knowledge = refine_strips(query, retrieved_docs) + web_search(rewrite_query(query))
answer = llm.generate(query, knowledge)

Where it fits: Pipelines with noisy or low-precision retrievers, especially when a web-search fallback can cover out-of-corpus queries.

Main risk: The pattern is only as good as its evaluator; a miscalibrated evaluator either triggers needless web searches or lets bad documents through, and the web fallback adds an external dependency.

Graph RAG

How it works:

1. Entity and relationship extraction. An LLM reads each chunk and extracts typed entities (Person, Organization, Concept, Event) and typed directed edges with properties. The main challenge is entity linking — disambiguating the same name across documents.

2. Community detection. The Leiden algorithm (successor to Louvain) partitions the graph into hierarchical communities. The number of levels depends on the graph and resolution parameters. As an illustrative example: fine-grained communities might represent individual subsystems, while coarser levels group those into broader domains. This precomputes semantic clusters so query-time retrieval does not need to traverse the full graph.

3. Community summarization. Each community gets an LLM-generated summary. Summaries are expensive to generate (multiple LLM calls across the hierarchy) but cheap to query.

At query time, the paper describes a map-reduce approach: community summaries are ranked for relevance, the LLM generates partial answers from each relevant summary (map), and a final synthesis step combines partial answers into a coherent response (reduce). The open-source GraphRAG library extends this with two query modes: global search, which runs the map-reduce over community summaries to answer corpus-wide questions, and local search, which starts from entities matched to the query and fans out to their neighbors and associated text units.
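The offline steps can be sketched end to end with the standard library only. The edges, the connected-components grouping (a crude stand-in for Leiden community detection, which also splits dense components), and the string-join "summary" are all illustrative assumptions.

```python
from collections import defaultdict

# Toy entity graph: edges an (imagined) LLM extraction pass produced from chunks.
edges = [
    ("AuthService", "TokenStore"), ("TokenStore", "Redis"),
    ("BillingService", "InvoiceDB"), ("InvoiceDB", "Postgres"),
]

adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

def communities(adj):
    """Connected components as a crude stand-in for Leiden communities."""
    seen, result = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(adj[n] - component)
        seen |= component
        result.append(component)
    return result

def summarize(component):
    # Placeholder for the LLM-generated community summary.
    return "Community: " + ", ".join(sorted(component))

summaries = [summarize(c) for c in communities(adjacency)]
```

At query time, the map step would run over `summaries` ranked by relevance, and a reduce step would combine the partial answers.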

Where it fits: Questions about relationships and connections that span many documents, and corpus-wide "main themes" queries that flat chunk retrieval cannot answer.

Main risk: High setup cost for entity extraction and graph construction, and the graph plus community summaries go stale when the underlying data changes frequently.

Agentic RAG

How it works:

for iteration in range(max_iterations):
    thought = llm.reason(query, scratchpad)
    if thought.is_final_answer:
        return thought.answer
    # Pick a tool (vector search, SQL, web API, ...) and record the observation.
    tool, args = llm.select_tool(thought, available_tools)
    observation = execute_tool(tool, args)
    scratchpad.append((thought, tool, observation))

Where it fits: Queries that must orchestrate multiple data sources (vector store, SQL, web APIs) and decide per step which tool to call.

Main risk: Many LLM calls per query drive up cost and latency, and reasoning loops can wander without converging; bound max_iterations and log the scratchpad for debugging.

Adaptive RAG

How it works:

| Complexity | Strategy | Example |
| --- | --- | --- |
| Simple | No retrieval — LLM answers from parametric memory | "What is 2+2?" |
| Moderate | Single-pass RAG | "What is the return policy?" |
| Complex | Iterative multi-hop RAG | "Compare implications across three papers" |
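The routing table above can be sketched as a classifier plus dispatch. The keyword heuristic and the `KNOWN_PARAMETRIC` set are toy assumptions; production systems train a small classifier for this step.

```python
# Toy set of questions the base model can answer from parametric memory (assumption).
KNOWN_PARAMETRIC = {"what is 2+2"}

def classify_complexity(query: str) -> str:
    """Heuristic stand-in for a trained complexity classifier."""
    multi_hop_markers = ("compare", "across", "implications", "relationship")
    q = query.rstrip("?").strip().lower()
    if any(m in q for m in multi_hop_markers):
        return "complex"
    if q in KNOWN_PARAMETRIC:
        return "simple"
    return "moderate"

def route(query: str) -> str:
    """Map complexity to the table's strategies."""
    return {"simple": "no-retrieval",
            "moderate": "single-pass RAG",
            "complex": "iterative multi-hop RAG"}[classify_complexity(query)]
```

The payoff is the dispatch, not the heuristic: simple queries skip retrieval entirely, which is where the cost savings in the table come from.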

Where it fits: Traffic that mixes trivial and complex queries, where routing simple questions past retrieval saves cost and latency.

Main risk: Routing errors: a complex query classified as simple gets an ungrounded parametric answer, so classifier accuracy needs monitoring.

Speculative RAG

How it works:

flowchart TD
    D[Retrieved Docs] --> P[Partition into Subsets]
    P --> S1[Subset 1]
    P --> S2[Subset 2]
    P --> S3[Subset 3]
    S1 --> D1[Draft plus Rationale from Small Model]
    S2 --> D2[Draft plus Rationale from Small Model]
    S3 --> D3[Draft plus Rationale from Small Model]
    D1 --> V[Verify Each Draft with Large Model]
    D2 --> V
    D3 --> V
    V --> B[Return Highest-Confidence Draft]
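The flow above can be sketched with a thread pool for the parallel drafting step. The drafter and verifier here are stubs (the real pattern uses a fine-tuned small model and a large verifier model); the length-based confidence score is purely illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def draft_with_rationale(subset):
    """Stub for the small specialist drafter: one draft plus rationale per subset."""
    text = " ".join(subset)
    return {"draft": f"Based on: {text}", "rationale": text}

def verify(candidate):
    """Stub verifier: the large model would score draft and rationale together.
    Here, a longer supporting rationale simply means higher confidence."""
    return len(candidate["rationale"])

docs = ["doc a", "doc b", "doc c", "doc d longer evidence", "doc e", "doc f"]
subsets = [docs[i::3] for i in range(3)]  # partition retrieved docs into 3 subsets

with ThreadPoolExecutor(max_workers=3) as pool:
    drafts = list(pool.map(draft_with_rationale, subsets))

best = max(drafts, key=verify)  # return the highest-confidence draft
```

Because the drafts run concurrently, wall-clock latency is roughly one small-model call plus one verification pass, rather than three sequential drafts.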

Where it fits: Latency-sensitive workloads where parallel small-model drafts beat a single large-model pass, and where drafts from different document subsets help surface conflicting evidence.

Main risk: Requires fine-tuning a specialist drafter, and the extra machinery only pays off at meaningful query volume.

Multimodal RAG

How it works:

Embedding strategies by modality:

| Modality | Approach | Tradeoff |
| --- | --- | --- |
| Text | Standard text embeddings | Mature and fast. Works with any vector DB |
| Tables | Whole-table as markdown or HTML with text embedding, or vision model reads table image directly | Vision avoids OCR errors but costs significantly more per table |
| Images | ColPali for page-level retrieval preserving layout, or CLIP for image-text alignment | ColPali uses multi-vector per page and needs custom indexing. CLIP uses single vector and works with standard DBs |
| Mixed pages | Semantic chunking with modality markers grouping related text and image and table into one unit | Keeps cross-modal context together but increases chunk size |

Chunking rules by modality: keep each table whole rather than splitting rows across chunks; treat each image or full page as its own retrieval unit; on mixed pages, group related text, table, and image into a single chunk so cross-modal context stays together.

Cross-modal alignment: generate potential queries from each chunk regardless of modality. A table and a chart can both answer "revenue growth" even though their raw formats differ. At retrieval time, match the user query against these generated queries rather than raw chunk content.
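A minimal sketch of that alignment idea, with assumptions labeled: the chunk records and their pre-generated queries are hardcoded stand-ins for LLM output, and token overlap stands in for embedding similarity.

```python
def generate_queries(chunk):
    """Stub for LLM-generated potential queries; here they are precomputed."""
    return chunk["queries"]

chunks = [
    {"id": "table-q3", "modality": "table",
     "queries": ["revenue growth", "q3 revenue by region"]},
    {"id": "chart-yoy", "modality": "image",
     "queries": ["revenue growth", "yoy trend"]},
    {"id": "policy-text", "modality": "text",
     "queries": ["return policy"]},
]

# Index: generated query -> ids of chunks that can answer it.
index = {}
for chunk in chunks:
    for q in generate_queries(chunk):
        index.setdefault(q, []).append(chunk["id"])

def match(user_query):
    """Match against generated queries; token overlap stands in for embeddings."""
    terms = set(user_query.lower().split())
    scored = []
    for q, ids in index.items():
        overlap = len(terms & set(q.split()))
        if overlap:
            scored.append((overlap, ids))
    scored.sort(key=lambda t: t[0], reverse=True)
    # Flatten and deduplicate while preserving rank order.
    return list(dict.fromkeys(i for _, ids in scored for i in ids))
```

Note that "revenue growth" retrieves both the table and the chart, even though their raw formats share nothing: the match happens in query space, not content space.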

Where it fits: Corpora where tables, images, charts, or mixed-format pages carry information that text-only extraction loses.

Main risk: Vision models and embeddings add cost and latency, and modality-aware ingestion is more complex to build and maintain; unnecessary for a text-only corpus.

Pattern Selection Guide

| Pattern | Best For | Runtime Latency | Setup Effort | Runtime Cost | When to Skip |
| --- | --- | --- | --- | --- | --- |
| Iterative Retrieval | Multi-hop questions | High — multiple retrieval round-trips | Low — works with existing retriever | High — multiple LLM calls per query | Simple single-hop lookups |
| Self-RAG | Adaptive retrieval and hallucination control | Medium — parallel passage scoring | High — requires custom model training | Medium — single model with reflection tokens | Cannot fine-tune models |
| CRAG | Noisy retrievers and web fallback | Medium — evaluator plus optional web search | Medium — train or configure evaluator | Medium — evaluator inference plus occasional web API | Retriever already has high precision |
| Graph RAG | Relational and connected-fact queries | Medium — community lookup plus generation | High — entity extraction and graph construction | High — many community summaries and partial answers in map-reduce | Simple fact lookups or frequently changing data |
| Agentic RAG | Multi-source orchestration | High — multi-turn reasoning loop | Medium — define tools and routing | High — many LLM calls per query | Single data source is sufficient |
| Adaptive RAG | Mixed-complexity query traffic | Low to High — depends on routed strategy | Medium — train or configure classifier | Low to High — saves on simple queries | Uniform query complexity |
| Speculative RAG | Latency-sensitive with conflict detection | Low — parallel drafting reduces wall-clock time | High — fine-tune specialist drafter | Medium — parallel small-model calls plus verifier | Low query volume |
| Multimodal RAG | Tables and images and mixed-format docs | Medium to High — vision model inference | Medium — modality-aware chunking pipeline | Medium to High — vision embeddings and models | Text-only corpus |

Adoption order: start with baseline single-pass RAG. Add CRAG or Adaptive RAG first (lowest integration effort). Move to Iterative or Graph RAG when metrics plateau on multi-hop queries. Add Agentic RAG only when multiple data sources are required.

Operational Baselines

RAG vs Fine-Tuning

RAG and fine-tuning optimize different parts of the system. RAG externalizes knowledge into retrievable sources, while fine-tuning changes model behavior in weights. Choosing correctly prevents expensive retraining for problems that retrieval can solve more safely.

Example: if product policy changes weekly, RAG can update by reindexing documents. Fine-tuning would require repeated retraining cycles and still provide weak source traceability.

| Axis | RAG | Fine-tuning |
| --- | --- | --- |
| Knowledge freshness | High | Low |
| Source traceability | High | Low |
| Behavioral consistency | Medium | High |
| Time to first value | Faster | Slower |
| Operational complexity | Retrieval and index ops | Training and eval and release ops |

Decision rules:

  1. Start with RAG when facts change often or citation is required.
  2. Add fine-tuning when output style or policy behavior remains unstable after prompt and retrieval tuning.
  3. Keep mutable facts in retrieval; keep behavior patterns in fine-tuned weights.

The combined pattern — fine-tune the model for behavior (format, tone, refusal policy) and use RAG for current factual knowledge — keeps updates fast while preserving behavioral control.
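The decision rules above can be encoded as a tiny helper. The boolean inputs and return labels are illustrative assumptions; the inputs are engineering judgments, not measurements.

```python
def choose_strategy(facts_change_often: bool,
                    citation_required: bool,
                    style_stable: bool) -> str:
    """Toy encoding of the RAG vs fine-tuning decision rules."""
    use_rag = facts_change_often or citation_required          # rule 1
    use_fine_tuning = not style_stable                         # rule 2
    if use_rag and use_fine_tuning:
        # Combined pattern: behavior in weights, mutable facts in retrieval (rule 3).
        return "rag + fine-tuning"
    if use_rag:
        return "rag"
    if use_fine_tuning:
        return "fine-tuning"
    return "prompting only"
```

For example, weekly-changing policy docs with citation requirements and an unstable output format land on the combined pattern.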

Questions

References


What's next