Hallucinations

Intro

Hallucination is a correctness failure where an LLM output sounds fluent and confident but is not supported by evidence or reality. The mechanism matters: the model optimizes next-token likelihood, not truth, so it can produce a high-probability continuation even when the underlying claim is false. Three root causes show up repeatedly in production. Training data gaps leave weak signal for rare entities and post-cutoff facts, so the model fills missing details with plausible fabrication. RLHF reward misalignment can push the model toward convincing and agreeable answers over accurate ones. Decoding randomness at higher temperature amplifies low-probability token paths that inject invented specifics.
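The temperature effect is easy to see numerically. Below is a minimal sketch of temperature-scaled softmax sampling over toy logits; all numbers are illustrative, and token 0 stands in for the well-supported continuation.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random.Random(0)):
    """Sample a token index after temperature scaling.

    Higher temperature flattens the distribution, raising the chance
    of low-probability (potentially fabricated) continuations.
    Returns (sampled_index, probabilities).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i, probs
    return len(probs) - 1, probs

# Toy logits: token 0 is the well-supported continuation.
logits = [5.0, 1.0, 0.5]
_, p_low = sample_with_temperature(logits, temperature=0.2)
_, p_high = sample_with_temperature(logits, temperature=2.0)
```

At temperature 0.2 nearly all mass sits on token 0; at temperature 2.0 the alternative tokens gain meaningful probability, which is exactly the path by which sampling randomness injects invented specifics.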

flowchart TD
    A[Query] --> B[Model generates claim]
    B --> C{Claim supported by context?}
    C -->|Yes| D[Grounded]
    C -->|No| E[Hallucination]

Concrete example: if your retrieved context says Austen wrote Pride and Prejudice and the model answers Dickens, the response is fluent but wrong. See Generation for how sampling and structure constraints influence this behavior.

Intrinsic vs Extrinsic

Ji et al. (2022) split hallucinations into two operational classes. Intrinsic hallucination contradicts facts already present in supplied context, such as claiming Dickens wrote Pride and Prejudice when the source states Austen. This is detectable with source-output comparison, commonly via NLI entailment checks. Extrinsic hallucination adds facts not present in source material, such as adding a completion year not in context; it may be true or false, but it is unsupported by provided evidence. Extrinsic errors are harder to detect because they require external verification, not only context alignment.
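The two classes can be sketched as a toy classifier. A real system would use an NLI entailment model for the source-output comparison; the exact-match lookup below is only an illustrative stand-in, and the fact keys are invented for the example.

```python
def classify_claim(claim, source_facts):
    """Toy intrinsic/extrinsic classifier.

    claim: (subject, value) pair extracted from the model output.
    source_facts: facts stated in the supplied context, keyed by subject.
    """
    subject, value = claim
    if subject not in source_facts:
        return "extrinsic"   # adds a fact not present in the source
    if source_facts[subject] == value:
        return "grounded"    # supported by the context
    return "intrinsic"       # contradicts the context

source = {"author": "Austen", "title": "Pride and Prejudice"}
classify_claim(("author", "Dickens"), source)  # intrinsic: contradicts source
classify_claim(("year", "1813"), source)       # extrinsic: not in source
classify_claim(("author", "Austen"), source)   # grounded
```

Note how the extrinsic case cannot be resolved from the context alone: "1813" happens to be unverifiable here regardless of whether it is true, which is why extrinsic detection requires external verification.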

Detection

Use multiple detectors because each catches different failure modes: NLI entailment checks compare the output against retrieved context and catch intrinsic contradictions, self-consistency sampling flags unstable extrinsic claims, and LLM-as-judge evaluation scores semantic faithfulness but is best reserved for offline use.

For RAG stacks, pair these with Evaluation so retrieval quality and answer faithfulness are measured separately.
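Self-consistency detection reduces to sampling the same prompt several times and measuring agreement. A minimal sketch, assuming a `generate` callable that wraps an LLM sampled at temperature > 0; the stub model, sample count, and 0.7 threshold are all illustrative.

```python
from collections import Counter
import itertools

def self_consistency_flag(generate, prompt, n=5, threshold=0.7):
    """Sample n answers and flag the response when agreement is low.

    Low agreement across samples suggests the claim is not firmly
    supported by the model's knowledge, a signature of extrinsic
    hallucination.
    """
    answers = [generate(prompt) for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / n
    return {"answer": top, "agreement": agreement,
            "suspect": agreement < threshold}

# Stub model that returns a fixed, unstable sequence of answers.
stub = itertools.cycle(["Austen", "Dickens", "Austen", "Bronte", "Austen"])
result = self_consistency_flag(lambda p: next(stub),
                               "Who wrote Pride and Prejudice?")
```

Here agreement is 3/5 = 0.6, below the threshold, so the response is flagged even though the majority answer happens to be correct. That conservatism is the point: instability is the signal, not the majority vote.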

Mitigation

Start with grounding: retrieve authoritative context and instruct the model to answer only from it. Then add targeted controls (constrained output formats, abstention policies, post-hoc fact checks) where risk justifies cost.

In practice, combine these with Guardrails so abstention, citation behavior, and output validation are enforced consistently.
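Grounding and abstention compose naturally in the prompt itself. A minimal sketch of a template that pairs both; the wording is illustrative, not a benchmarked prompt, and the helper name is invented for the example.

```python
def grounded_prompt(context, question):
    """Build a prompt that restricts answers to the supplied context
    and gives the model an explicit abstention path.
    """
    return (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, reply exactly "
        "\"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = grounded_prompt(
    "Austen wrote Pride and Prejudice.",
    "Who wrote Pride and Prejudice?",
)
```

The explicit abstention phrase matters: without a sanctioned way to decline, the model's most likely continuation of an unanswerable question is a fabricated answer.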

Pitfalls

RAG Does Not Eliminate Hallucinations

RLHF Can Make Factuality Worse

Over-Aggressive Mitigation Causes Over-Refusal

Tradeoffs

| Approach | Hallucination reduction | Cost | Latency impact | Risk |
|---|---|---|---|---|
| RAG grounding | High -- shifts to summarization | Medium -- retrieval infra + embedding cost | +100-500ms retrieval | Retrieval failures become a silent hallucination source |
| Self-consistency | Medium -- catches extrinsic | High -- 3-5x inference cost | 3-5x latency | Misses intrinsic hallucinations |
| NLI fact checking | Medium-High -- catches intrinsic | Low -- lightweight model | +50-100ms per claim | NLI model has its own error rate |
| LLM-as-judge | High -- semantic evaluation | Medium -- judge inference cost | +1-3s per response | Judge can itself hallucinate |
| Constrained output | Low-Medium -- limits format | Low -- built into decoding | Minimal | Only prevents structural fabrication, not factual errors |
| Abstention policy | Variable -- depends on calibration | None -- prompt change only | None | Over-refusal degrades helpfulness |

Decision rule: use RAG grounding + NLI fact checking as baseline. Add self-consistency only for high-stakes flows where latency budget allows it. Use LLM-as-judge primarily for offline evaluation, not as a strict real-time gate.
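The decision rule above can be expressed as a small routing function. A sketch with invented parameter names and an illustrative 2000ms latency threshold:

```python
def detection_plan(high_stakes, latency_budget_ms, offline_eval):
    """Choose detectors per the decision rule.

    Baseline is always RAG grounding + NLI fact checking; expensive
    detectors are gated on stakes, latency budget, and whether the
    run is offline evaluation rather than live serving.
    """
    plan = ["rag_grounding", "nli_fact_check"]       # always-on baseline
    if high_stakes and latency_budget_ms >= 2000:
        plan.append("self_consistency")              # 3-5x latency, so gate it
    if offline_eval:
        plan.append("llm_as_judge")                  # offline only, not a gate
    return plan
```

For example, a low-stakes chat flow keeps only the baseline, while a high-stakes flow with a 3-second budget adds self-consistency.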

Questions

References

Ji, Z., et al. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys 55(12), 2023 (arXiv:2202.03629, 2022).


What's next