Hallucinations
Intro
Hallucination is a correctness failure where an LLM output sounds fluent and confident but is not supported by evidence or reality. The mechanism matters: the model optimizes next-token likelihood, not truth, so it can produce a high-probability continuation even when the underlying claim is false. Three root causes show up repeatedly in production. Training data gaps leave weak signal for rare entities and post-cutoff facts, so the model fills missing details with plausible fabrication. RLHF reward misalignment can push the model toward convincing and agreeable answers over accurate ones. Decoding randomness at higher temperature amplifies low-probability token paths that inject invented specifics.
```mermaid
flowchart TD
    A[Query] --> B[Model generates claim]
    B --> C{Claim supported by context?}
    C -->|Yes| D[Grounded]
    C -->|No| E[Hallucination]
```

Concrete example: if your retrieved context says Austen wrote Pride and Prejudice and the model answers Dickens, the response is fluent but wrong. See Generation for how sampling and structure constraints influence this behavior.
Intrinsic vs Extrinsic
Ji et al. (2022) split hallucinations into two operational classes. Intrinsic hallucination contradicts facts already present in supplied context, such as claiming Dickens wrote Pride and Prejudice when the source states Austen. This is detectable with source-output comparison, commonly via NLI entailment checks. Extrinsic hallucination adds facts not present in source material, such as adding a completion year not in context; it may be true or false, but it is unsupported by provided evidence. Extrinsic errors are harder to detect because they require external verification, not only context alignment.
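The distinction can be sketched in code. This is a minimal illustration, not a production detector: naive word overlap stands in for a real NLI model, and the heuristics (full subset match for entailment, half-overlap for contradiction) are invented for the example.

```python
# Sketch: classify a claim as grounded, intrinsic, or extrinsic.
# Naive word matching stands in for a real NLI entailment model.

def classify_claim(claim: str, context: str) -> str:
    """Return 'grounded', 'intrinsic', or 'extrinsic' for one claim."""
    claim_words = set(claim.lower().split())
    context_words = set(context.lower().split())
    # Stand-in "entailment": every word of the claim appears in context.
    if claim_words <= context_words:
        return "grounded"
    # Stand-in "contradiction": high lexical overlap with the context
    # but not fully entailed, suggesting a conflicting restatement.
    overlap = claim_words & context_words
    if len(overlap) >= len(claim_words) // 2:
        return "intrinsic"   # contradicts what the context states
    return "extrinsic"       # adds facts absent from the context

context = "Jane Austen wrote Pride and Prejudice."
print(classify_claim("Jane Austen wrote Pride and Prejudice.", context))      # grounded
print(classify_claim("Charles Dickens wrote Pride and Prejudice.", context))  # intrinsic
print(classify_claim("The novel was adapted for television in 1995.", context))  # extrinsic
```

A real system would replace the overlap heuristics with an NLI classifier returning entailed / neutral / contradicted, but the three-way routing logic stays the same.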
Detection
Use multiple detectors because each catches different failure modes.
- NLI-based fact checking: decompose an answer into claims, then score each claim against source context as entailed, neutral, or contradicted. This is strong on intrinsic hallucinations where contradictions are explicit in context. Azure AI Content Safety Groundedness detection provides this as a managed path; lightweight open-source NLI classifiers offer a self-hosted alternative.
- Self-consistency (SelfCheckGPT): sample the same prompt multiple times and compare outputs. If the model has stable knowledge, core claims remain consistent; high variance and contradictions indicate potential hallucination. This is zero-resource and black-box (no logprobs or external KB), but it adds 3-5 extra inference calls.
- LLM-as-judge: score answer faithfulness against context using an evaluator LLM. Common metric: faithfulness = supported claims divided by total claims. Frameworks like RAGAS automate this decomposition.
- Atomic fact verification (FActScore): break text into atomic facts, retrieve evidence from a knowledge base, and validate each fact independently. This gives granular failure localization; on biography generation benchmarks, models score around 58% FActScore, illustrating how frequently atomic claims lack support.
For RAG stacks, pair these with Evaluation so retrieval quality and answer faithfulness are measured separately.
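The self-consistency idea can be sketched without model access at all. This toy version uses Jaccard word overlap as a stand-in for SelfCheckGPT's sentence-level scoring; in practice the inputs would be multiple sampled generations from the same prompt.

```python
# Self-consistency sketch in the spirit of SelfCheckGPT: sample the same
# prompt several times and flag low agreement across the samples.
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def consistency_score(samples: list[str]) -> float:
    """Mean pairwise similarity; low values suggest unstable (hallucinated) claims."""
    pairs = list(combinations(samples, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Stable knowledge: core claims repeat across samples.
stable = ["Austen wrote Pride and Prejudice in 1813.",
          "Pride and Prejudice was written by Austen in 1813.",
          "Austen published Pride and Prejudice in 1813."]
# Unstable knowledge: each sample invents a different specific.
unstable = ["It was completed in 1805.",
            "The book appeared in 1820.",
            "Its first draft dates to 1797."]
print(consistency_score(stable), consistency_score(unstable))
```

Production implementations score at sentence or claim granularity and use an NLI model or an LLM prompt for the pairwise comparison rather than lexical overlap.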
Mitigation
Start with grounding, then add targeted controls where risk justifies cost.
- Retrieval grounding (RAG): move from memory recall to source summarization. This is usually the single biggest reduction in fabricated claims because it gives explicit evidence boundaries. It is not a hard guarantee: RAG-based legal tools still report hallucination rates above 17%, so treat grounding as risk reduction, not elimination. See RAG.
- Chain-of-Verification (CoVe): run a factored loop of generate answer, plan verification questions, answer verification questions independently without original draft context, then revise. Independent verification interrupts the feedback loop where the model reuses its own hallucinated tokens as if they were evidence.
- Structured output with constrained decoding: enforce schema, enums, and field contracts so the model cannot invent arbitrary free-form structures. This shrinks the space of possible fabrications and is especially useful for downstream automation.
- Abstention policy: define a strict fallback phrase (for example, "I do not have enough evidence in the provided context") when evidence is insufficient. Explicit abstention is safer than confident guessing in high-stakes flows.
- Tool-augmented generation: route factual subproblems to tools (databases, calculators, APIs) and have the model synthesize tool outputs instead of inventing unsupported details.
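The CoVe loop above reduces to a small control-flow skeleton. Here `llm` is any prompt-to-text callable; the canned `stub_llm` responses are invented purely to keep the example self-contained and runnable.

```python
# Chain-of-Verification sketch: generate, plan checks, verify
# independently, revise. `llm` is a stand-in for any completion API.

def chain_of_verification(question: str, llm) -> str:
    draft = llm(f"Answer: {question}")
    # Plan verification questions from the draft.
    checks = llm(f"List verification questions for: {draft}").split("\n")
    # Answer each check WITHOUT the draft in context, so the model
    # cannot reuse its own possibly hallucinated tokens as evidence.
    evidence = [llm(f"Answer independently: {q}") for q in checks if q.strip()]
    # Revise the draft against the independent answers.
    return llm(f"Revise '{draft}' using evidence: {'; '.join(evidence)}")

# Canned stub illustrating the control flow only.
def stub_llm(prompt: str) -> str:
    if prompt.startswith("Answer:"):
        return "Dickens wrote Pride and Prejudice."
    if prompt.startswith("List verification"):
        return "Who wrote Pride and Prejudice?"
    if prompt.startswith("Answer independently"):
        return "Jane Austen wrote Pride and Prejudice."
    return "Austen wrote Pride and Prejudice."  # revised answer

print(chain_of_verification("Who wrote Pride and Prejudice?", stub_llm))
```

The key design choice is that the verification calls receive only the verification question, never the original draft, which is what makes the factored variant of CoVe effective.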
In practice, combine these with Guardrails so abstention, citation behavior, and output validation are enforced consistently.
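A minimal abstention gate might look like the sketch below, with a naive overlap check in place of a real groundedness model; the 0.5 threshold is illustrative, not calibrated.

```python
# Abstention-policy sketch: return a fixed fallback phrase when no
# retrieved passage supports the draft answer. `supports` is a naive
# lexical check standing in for a real NLI or groundedness model.

FALLBACK = "I do not have enough evidence in the provided context."

def supports(passage: str, answer: str, threshold: float = 0.5) -> bool:
    a, p = set(answer.lower().split()), set(passage.lower().split())
    return len(a & p) / len(a) >= threshold if a else False

def answer_or_abstain(draft: str, passages: list[str]) -> str:
    return draft if any(supports(p, draft) for p in passages) else FALLBACK

ctx = ["Jane Austen wrote Pride and Prejudice, published in 1813."]
print(answer_or_abstain("Austen wrote Pride and Prejudice.", ctx))
print(answer_or_abstain("Dickens finished it in 1805.", ctx))  # falls back
```

Keeping the fallback phrase fixed and exact matters: downstream guardrails and logging can then detect abstention reliably instead of parsing free-form hedges.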
Pitfalls
RAG Does Not Eliminate Hallucinations
- What goes wrong: teams ship RAG and assume hallucination is solved, then stop active monitoring.
- Why it happens: RAG introduces its own failure modes: retrieval miss, context overflow, and model additions beyond retrieved evidence.
- How to avoid or detect it: track retrieval recall and faithfulness separately; keep claim-to-context verification in place even after RAG rollout. Stanford and Yale findings on legal RAG tools (>17% hallucination) are the practical warning signal.
RLHF Makes Factuality Worse
- What goes wrong: model quality looks better to users while factual precision degrades.
- Why it happens: human preference signals reward confidence, detail, and agreeableness; RLHF then optimizes approval, not truth.
- How to avoid or detect it: include factuality-aware reward signals (for example FActScore-style objectives in preference optimization) and monitor calibration, not only user satisfaction. Reported RLHF rollbacks due to sycophancy are concrete examples of reward signals overpowering factuality safeguards.
Over-Aggressive Mitigation Causes Over-Refusal
- What goes wrong: the system refuses answerable questions, hedges excessively, or returns partial responses.
- Why it happens: aggressive abstention or safety tuning shifts the model from fabrication risk to under-answering risk.
- How to avoid or detect it: calibrate refusal thresholds by domain; evaluate faithfulness and helpfulness together, not independently. There is no universal optimum, only a domain-specific operating point.
Tradeoffs
| Approach | Hallucination reduction | Cost | Latency impact | Risk |
|---|---|---|---|---|
| RAG grounding | High -- shifts to summarization | Medium -- retrieval infra + embedding cost | +100-500ms retrieval | Retrieval failures become silent hallucination source |
| Self-consistency | Medium -- catches extrinsic | High -- 3-5x inference cost | 3-5x latency | Misses intrinsic hallucinations |
| NLI fact checking | Medium-High -- catches intrinsic | Low -- lightweight model | +50-100ms per claim | NLI model has its own error rate |
| LLM-as-judge | High -- semantic evaluation | Medium -- judge inference cost | +1-3s per response | Judge can itself hallucinate |
| Constrained output | Low-Medium -- limits format | Low -- built into decoding | Minimal | Prevents structural fabrication only, not factual errors |
| Abstention policy | Variable -- depends on calibration | None -- prompt change only | None | Over-refusal degrades helpfulness |
Decision rule: use RAG grounding + NLI fact checking as baseline. Add self-consistency only for high-stakes flows where latency budget allows it. Use LLM-as-judge primarily for offline evaluation, not as a strict real-time gate.
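The baseline decision rule can be expressed as a small gate. Claim support again uses naive lexical overlap as an NLI stand-in, and the 0.9/0.5 thresholds are hypothetical operating points, not recommendations.

```python
# Baseline gate: score per-claim support against retrieved context
# (naive overlap as an NLI stand-in), then pass, flag, or block.

def claim_supported(claim: str, context: str) -> bool:
    c = set(claim.lower().split())
    return len(c & set(context.lower().split())) / len(c) >= 0.6 if c else False

def faithfulness(claims: list[str], context: str) -> float:
    """RAGAS-style metric: supported claims / total claims."""
    return sum(claim_supported(c, context) for c in claims) / len(claims)

def gate(claims: list[str], context: str) -> str:
    score = faithfulness(claims, context)
    if score >= 0.9:
        return "pass"
    return "flag_for_review" if score >= 0.5 else "block"

ctx = "jane austen wrote pride and prejudice it was published in 1813"
claims = ["austen wrote pride and prejudice", "it was published in 1813"]
print(gate(claims, ctx))  # pass
```

In a production stack, claim decomposition would come from an LLM and `claim_supported` from an NLI model, but the thresholded pass/flag/block routing is the part that belongs in the serving path.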
Questions
Why does RAG not eliminate hallucination?
- RAG changes the task to summarizing retrieved evidence, but generation can still add unsupported claims beyond context.
- The model can misread or incorrectly compose facts from valid passages.
- Retrieval failures silently cap answer quality before generation starts.
- Legal RAG tools reporting >17% hallucination illustrate the gap between grounding and guaranteed correctness.
- Tradeoff: RAG adds retrieval and embedding cost but reduces rather than eliminates hallucination; invest according to the cost of undetected fabrication.
Why can RLHF make hallucination worse?
- Human raters generally reward confidence, verbosity, and polished style.
- RLHF optimizes approval signals, so "sounds good" can outrank "is true."
- The model becomes more overconfident on wrong answers, which worsens calibration.
- Factuality-aware optimization (for example FActScore-informed preference training) counterbalances this failure mode.
- Tradeoff: RLHF improves engagement and instruction-following but can hurt factual reliability unless factual rewards are part of training.
How do you tell whether a wrong answer is a retrieval failure or a generation failure?
- Check corpus coverage first: does the needed document exist at all?
- Check retrieval recall next: if the document exists, was it retrieved for this query?
- Check claim traceability: can each answer claim be grounded to retrieved passages?
- Not retrieved implies retrieval failure (fix chunking, embeddings, ranking); retrieved but unsupported claims imply generation hallucination (fix grounding prompt and verification).
- Tradeoff: attribution pipelines add NLI-per-claim cost and latency, but they prevent wasted effort by isolating the true bottleneck.
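The triage checklist above can be sketched as a single function. The overlap heuristic and the 0.6 threshold are invented stand-ins; a real pipeline would use retrieval logs plus per-claim NLI.

```python
# Triage sketch: locate the failure layer for a flagged claim
# before tuning anything. Overlap checks are naive stand-ins.

def diagnose(claim: str, corpus: list[str], retrieved: list[str]) -> str:
    """Classify why a claim failed (or passed) grounding checks."""
    def supported_by(docs: list[str]) -> bool:
        c = set(claim.lower().split())
        return any(len(c & set(d.lower().split())) / len(c) >= 0.6
                   for d in docs)
    if supported_by(retrieved):
        return "grounded"           # claim traces to retrieved passages
    if supported_by(corpus):
        return "retrieval_failure"  # evidence exists but was not retrieved
    # No supporting evidence anywhere: either a corpus gap or an
    # invented claim -- distinguishing them needs external verification.
    return "generation_hallucination"

corpus = ["austen wrote pride and prejudice in 1813",
          "dickens wrote oliver twist"]
retrieved = ["dickens wrote oliver twist"]
print(diagnose("austen wrote pride and prejudice", corpus, retrieved))
```

Routing each flagged claim through a function like this tells you whether to spend effort on chunking and ranking or on grounding prompts and verification.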
References
- Survey of hallucination in natural language generation -- canonical intrinsic and extrinsic taxonomy (Ji et al., ACM Computing Surveys 2022) - Anchor survey that defines the widely used taxonomy and detection framing.
- FActScore -- fine-grained atomic evaluation of factual precision in text generation (Min et al., EMNLP 2023) - Introduces atomic-fact factuality measurement and reports baseline model behavior.
- SelfCheckGPT -- zero-resource black-box hallucination detection (Manakul et al., EMNLP 2023) - Practical self-consistency method that does not require external knowledge bases.
- Towards understanding sycophancy in language models -- RLHF reward misalignment (Sharma et al., Anthropic, ICLR 2024) - Explains why preference optimization can push models toward agreement over correctness.
- Groundedness detection -- NLI-based claim verification as a managed service (Azure AI Content Safety) - Official service documentation for production groundedness checks.
- Reduce hallucinations -- grounding, citations, and abstention patterns (Anthropic Docs) - Practice-oriented guardrail patterns for grounded generation.
- Chain-of-Verification reduces hallucination in LLMs -- factored verification methodology (Dhuliawala et al., Meta AI 2023) - Core paper for generate-verify-revise decomposition.
- Hallucination in RAG-based legal AI tools -- Stanford and Yale study finding over 17% rate (Magesh et al., JELS 2025) - Domain-specific evidence that RAG meaningfully reduces but does not remove hallucinations.
- Extrinsic hallucinations in LLMs -- mechanistic causes and mitigation survey (Lilian Weng, July 2024) - Mechanism-focused practitioner synthesis with concrete mitigation patterns.