Chunking

Intro

Chunking decides the unit of retrieval. If chunks are too wide, retrieval returns noisy context that dilutes the answer; if chunks are too narrow, critical constraints get split across fragments and the model answers from incomplete evidence. The goal is to create chunks that are semantically coherent, operationally efficient, and traceable back to their source section.

Example: a policy doc states "Keep logs for 90 days. Exception: security investigations require 365 days." Splitting by raw character count can place the rule and exception in separate chunks. A retriever pulls only the first chunk, and the model answers without the exception clause — factually wrong, silently confident.

How to Choose a Strategy

Use retrieval failures, not preference, to pick a strategy.

| What you observe in evaluation | Start with | Why | Upgrade trigger |
| --- | --- | --- | --- |
| Split clauses, tables, or code blocks break answer correctness | Structure-aware | Preserves logical units and improves citation traceability | Parser misses important layouts or ingestion cost gets too high |
| Answers miss adjacent constraints even when relevant docs are found | Parent-child | Keeps retrieval precise while restoring broader synthesis context | Storage and orchestration overhead outweigh quality gains |
| Queries drift across topics inside long prose | Semantic | Places boundaries at topic shifts | Ingestion becomes too slow or unstable across model updates |
| Need a fast, predictable baseline now | Recursive | Better boundaries than fixed-size with low implementation cost | Quality plateaus on mixed-format or highly structured corpora |
| Tight latency and a simple, homogeneous corpus | Fixed-size | Most operationally predictable ingestion | Faithfulness drops from boundary cuts |

Chunking Strategies

Fixed-Size Chunking

How it works:

flowchart TD
  S[Source -- Keep logs 90 days -- Exception 365 days for security] --> SP[Fixed-size split at 200 tokens]
  SP --> C1[Chunk 1 -- Keep logs 90 days]
  SP --> C2[Chunk 2 -- Exception 365 days for security]
  C1 -. retrieval finds only chunk 1 .-> W[Answer misses exception clause]

Where it fits: homogeneous prose with uniform structure and tight latency budgets, where a predictable baseline matters more than boundary quality.

Main risk: arbitrary cuts split logical units, so a rule and its exception can land in different chunks and faithfulness drops.
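The split above can be sketched in a few lines. This is a minimal illustration, not a production implementation: whitespace words stand in for real tokenizer tokens, and the function name and defaults are illustrative.

```python
def fixed_size_chunks(text: str, window: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `window` tokens, sharing `overlap`
    tokens between neighbors so clauses cut at a boundary still appear
    (partially) in the next chunk. Tokens are approximated by whitespace
    words here; a real pipeline would use the embedding model's tokenizer."""
    tokens = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + window]))
        if start + window >= len(tokens):
            break
    return chunks
```

Overlap softens, but does not eliminate, the boundary-cut risk: a 90-day rule and its 365-day exception can still end up in different windows.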

Recursive Chunking

How it works:

flowchart TD
  D[Document] --> S1{Split by section breaks}
  S1 -->|Fits target| Done[Keep as one chunk]
  S1 -->|Too large| S2{Split by paragraph breaks}
  S2 -->|Fits target| Done
  S2 -->|Too large| S3{Split by sentence breaks}
  S3 --> Done

Where it fits: mixed prose without heavy tables or code, when you need a fast, predictable baseline with better boundaries than fixed-size.

Main risk: quality plateaus on mixed-format or highly structured corpora, where separator rules alone cannot recover the layout.
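The cascade in the diagram can be sketched as a short recursive function. A minimal sketch under stated assumptions: sizes are measured in whitespace tokens, the separator list is illustrative, and splitting on `". "` drops the separator itself, which a real implementation would reattach.

```python
def recursive_chunks(text, target=200, separators=("\n\n\n", "\n\n", ". ")):
    """Try the coarsest separator first; recurse with finer separators
    only on pieces that still exceed the target size."""
    if len(text.split()) <= target or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) == 1:  # separator absent: fall through to a finer one
        return recursive_chunks(text, target, rest)
    chunks = []
    for piece in pieces:
        if len(piece.split()) <= target:
            chunks.append(piece)
        else:
            chunks.extend(recursive_chunks(piece, target, rest))
    return chunks
```

Note the key property: a paragraph that fits the target is never split further, so sentence-level cuts happen only where section and paragraph breaks failed.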

Structure-Aware Chunking

How it works:

flowchart TD
  D[Markdown document] --> P[Structure parser]
  P --> C1[Heading + prose -- Data Retention then Policy Rules]
  P --> C2[Full table -- Data Retention then Retention Periods]
  P --> C3[Code block + docstring -- Data Retention then Implementation]

Where it fits: documents where layout carries meaning, such as policies, API docs, and contracts.

Main risk: parsers miss important layouts, and each source format needs its own parser with ongoing maintenance, which drives up ingestion cost.
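For markdown sources, the parser in the diagram can be approximated with a small state machine: split at headings, but never inside a fenced code block, and record the heading so each chunk stays traceable to its section. A sketch only; the function name is illustrative and a real parser would also keep tables atomic and track nested heading paths.

```python
def markdown_chunks(md: str) -> list[dict]:
    """Split markdown at ATX headings, treating fenced code blocks as
    atomic. Each chunk records its heading for citation traceability."""
    chunks, current, heading = [], [], "ROOT"
    in_fence = False
    for line in md.splitlines():
        if line.startswith("```"):
            in_fence = not in_fence
        if line.startswith("#") and not in_fence:
            if current:
                chunks.append({"heading": heading,
                               "text": "\n".join(current).strip()})
            heading, current = line.lstrip("# ").strip(), []
        else:
            current.append(line)
    if current:
        chunks.append({"heading": heading, "text": "\n".join(current).strip()})
    return chunks
```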

Semantic Chunking

How it works:

flowchart LR
  A[Span 1 -- sim 0.92] --> B[Span 2 -- sim 0.89]
  B --> C[Span 3 -- sim 0.91]
  C -->|sim drops to 0.43| D[Topic shift -- split here]
  D --> E[Span 5 -- sim 0.88]
  E --> F[Span 6 -- sim 0.90]

Where it fits: long narrative text without reliable structural markers, where topic shifts are the only usable boundaries.

Main risk: ingestion is slow (an embedding call per span), and tuned thresholds can become unstable across embedding model updates.
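The boundary rule in the diagram, split where adjacent-span similarity drops, can be sketched as follows. The embedding function is injected as a callable so any model can be plugged in; the names and the threshold value are illustrative, and real thresholds need tuning per corpus.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def semantic_chunks(spans, embed, threshold=0.6):
    """Group consecutive spans; start a new chunk whenever cosine
    similarity between adjacent span embeddings drops below `threshold`.
    `embed` is any callable mapping a span to a vector."""
    if not spans:
        return []
    vectors = [embed(s) for s in spans]
    chunks, current = [], [spans[0]]
    for prev, vec, span in zip(vectors, vectors[1:], spans[1:]):
        if cosine(prev, vec) < threshold:
            chunks.append(" ".join(current))
            current = [span]
        else:
            current.append(span)
    chunks.append(" ".join(current))
    return chunks
```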

Parent-Child Chunking

How it works:

flowchart TD
  P[Parent -- full Data Retention section -- 800 tokens] --> C1[Child -- clause 1 -- 120 tokens]
  P --> C2[Child -- clause 2 -- 90 tokens]
  P --> C3[Child -- clause 3 -- 110 tokens]
  Q[Query] --> C2
  C2 -. expand to parent .-> P
  P --> G[Generator gets full context]

Where it fits: domains where answers depend on adjacent constraints, such as policy, legal, and specification corpora.

Main risk: storage and orchestration overhead from two-layer indexing and the expansion/deduplication step can outweigh the quality gains.
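The two-layer index and the expand-to-parent step can be sketched as below. To keep the sketch self-contained, a word-overlap scorer stands in for vector search; the class and method names are illustrative, not a real library API.

```python
class ParentChildIndex:
    """Index small child chunks for precise matching, then hand the
    generator the full parent section the best children came from."""
    def __init__(self):
        self.parents = {}    # parent_id -> full section text
        self.children = []   # (parent_id, child chunk text)

    def add(self, parent_id, parent_text, child_texts):
        self.parents[parent_id] = parent_text
        self.children += [(parent_id, c) for c in child_texts]

    def retrieve(self, query, k=2):
        # Toy scorer: query-word overlap; real systems use embeddings.
        words = set(query.lower().split())
        scored = sorted(self.children,
                        key=lambda pc: len(words & set(pc[1].lower().split())),
                        reverse=True)
        # Expand each hit to its parent, deduplicating repeated parents.
        seen = {}
        for pid, _ in scored[:k]:
            seen[pid] = self.parents[pid]
        return list(seen.values())
```

The deduplication matters: when several top children share a parent, the generator should receive that parent once, not k near-copies of it.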

Agentic Chunking

How it works:

flowchart TD
  D[Document] --> LLM[LLM reasons about semantic intent]
  LLM --> C1[Chunk 1 -- eligibility rules]
  LLM --> C2[Chunk 2 -- exception handling]
  LLM --> C3[Chunk 3 -- audit requirements]

Where it fits: high-stakes documents where chunk quality directly affects business risk and per-document LLM cost at ingestion is acceptable.

Main risk: LLM inference per document is the most expensive ingestion path, and non-deterministic output demands prompt versioning, validation, and caching.
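An agentic pipeline reduces to: build a prompt, call a model, validate the structured output, and fall back safely when the model returns garbage. A sketch under stated assumptions: `call_llm` is any prompt-to-string callable (your API client of choice), and the prompt wording and `fallback` label are illustrative.

```python
import json

PROMPT = """Split the document below into self-contained chunks.
Group text by intent (rules, exceptions, audit duties), not by length.
Return JSON: a list of {{"label": ..., "text": ...}} objects.

DOCUMENT:
{doc}"""

def agentic_chunks(doc, call_llm):
    """Because LLM output is non-deterministic, validate it and degrade
    to a single whole-document chunk rather than trust it blindly."""
    raw = call_llm(PROMPT.format(doc=doc))
    try:
        chunks = json.loads(raw)
    except json.JSONDecodeError:
        return [{"label": "fallback", "text": doc}]  # degrade, never drop data
    if not isinstance(chunks, list) or not all(
            isinstance(c, dict) and {"label", "text"} <= set(c) for c in chunks):
        return [{"label": "fallback", "text": doc}]
    return chunks
```

The validation-plus-fallback step is the part teams most often skip, and it is what keeps a malformed model response from silently dropping a document at ingestion.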

Practical Baselines

Tradeoffs

| Strategy | Retrieval precision | Ingestion cost | Implementation complexity | Best corpus type |
| --- | --- | --- | --- | --- |
| Fixed-size | Low: arbitrary cuts split logical units | Lowest: no parsing, constant throughput | Trivial: two parameters (window + overlap) | Homogeneous prose with uniform structure |
| Recursive | Medium: respects paragraph/sentence boundaries | Low: rule-based, no model calls | Low: configurable separator list per format | Mixed prose without heavy tables or code |
| Structure-aware | High: preserves tables, code, clauses as atomic units | Medium: requires format-specific parsers | Medium-high: one parser per source format, ongoing maintenance | Documents where layout carries meaning (policies, API docs, contracts) |
| Semantic | High: boundaries align with actual topic shifts | High: embedding call per span during ingestion | Medium: threshold tuning per corpus and embedding model | Long narrative text without reliable structural markers |
| Parent-child | High: child precision plus parent context expansion | Medium: two-layer indexing, metadata mapping | Medium: retrieval pipeline needs expansion and deduplication step | Domains where answers depend on adjacent constraints (policy, legal, specs) |
| Agentic | Highest: LLM reasons about semantic intent per document | Highest: LLM inference per document at ingestion | High: prompt versioning, non-deterministic output, caching infra | High-stakes documents where chunk quality directly affects business risk |

The table compares steady-state characteristics. In practice, most teams start with recursive chunking to establish baseline metrics, then upgrade specific source types based on observed retrieval failures rather than switching the entire corpus at once.


What's next