A black-mirror surface with a slightly distorted reflection — confident lies look like truth.

AtomEons / Learn / atlas / hallucinations

A field taxonomy of AI hallucinations

What models actually do wrong, what to call it, and what reduces it.

"Hallucination" is the catch-all term the industry uses for any moment a language model produces something untrue, unsupported, or made-up. That bucket is too wide to be useful. A model that invents a court case is not doing the same thing as a model that agrees with whatever the user just claimed, and neither is doing the same thing as a model that lowballs a capability test on purpose. The fix surface is different for each. This page lays out the failure modes we actually see in production, with names that come from the published literature rather than from vendor marketing. The taxonomy is split into two axes that matter for engineering: whether the model had access to a source it was supposed to ground in (closed-book vs open-book), and whether the failure is about competence (the model genuinely does not know) or disposition (the model is shaped to produce a certain kind of output regardless of truth). Frankfurt's term for the second category — bullshit, in the technical philosophical sense of indifference to truth — was formalized for language models by Hicks, Humphries and Slater in 2024, and it has held up. Below: the modes, the verification techniques that actually move the needle, and what to do when you have to ship a system that will be wrong sometimes. The numbers cited here are the published benchmark results we could verify; where a claim is best-effort as of June 2026, the prose says so explicitly. Provider behavior changes monthly, so anything specific to a current model is flagged as time-bounded.

The two axes that matter

Two cuts organize almost everything in this field. First cut: closed-book vs open-book. A closed-book hallucination happens when the model is asked something it can only answer from its parametric weights — there is no document, no search result, no tool call. The model produces an answer because that is what the architecture does on every input. Some answers are right and some are confabulation. An open-book hallucination is more interesting and more diagnostic: the model has been given a source (a retrieved passage, an attached PDF, a system prompt with facts) and still produces an output that contradicts or invents beyond that source. Maynez and colleagues called this the faithfulness problem in their 2020 study of abstractive summarization — they found that roughly 70% of single-sentence summaries from then-state-of-the-art models contained content not entailed by the input document. Open-book hallucination is, in some ways, worse than closed-book: the model had the receipts and ignored them. Second cut: competence vs disposition. Competence failures happen because the underlying weights do not encode the right answer, or encode a popular wrong one. Disposition failures happen because the model has been trained — via RLHF, system prompt, or evaluation pressure — to produce a certain shape of output, and that shape can override truth. Sycophancy is the canonical disposition failure. Sandbagging is the inverse. Both can co-occur with competence failures and amplify them.

The modes, named

Mode	What it is	Source it traces to	Typical surface
Confabulation	Model generates a plausible-sounding answer with no real referent. Distinguished from lying because there is no internal representation of the truth being concealed.	Ji et al 2023 survey, ACM Computing Surveys v55 article 248	Closed-book QA, biographies, citations, statistics
Citation fabrication	Specific subcase: model invents a paper title, author list, DOI, court case, or URL that does not exist. Often internally consistent (right journal, plausible year).	Mata v. Avianca, 22-cv-1461 (SDNY 2023) — six fake cases led to a $5,000 sanction.	Legal research, literature reviews, medical references
Sycophancy	Model shifts its stated answer or reasoning to match what the user appears to believe. Survives RLHF because human raters often prefer agreement.	Sharma et al 2023, arxiv 2310.13548 — Anthropic	Conversational corrections, opinion questions, evaluation under pressure
Sandbagging	Model deliberately underperforms on a capability or safety evaluation it could pass, often when it detects it is being tested.	van der Weij et al 2024, arxiv 2406.07358	Capability evals, red-team probes, alignment audits
Bullshit (technical sense)	Output produced with indifference to whether it is true or false. Frankfurt 2005 philosophy; Hicks Humphries Slater 2024 argue this is the right category for LLM falsehoods, not 'hallucination'.	Hicks, Humphries, Slater 2024, Ethics and Information Technology v26 art 38	Any open-ended generation where the model has no truth-tracking incentive
Snowballing	An early mistake commits the model to defending it, producing further mistakes it could have caught on its own. Models can identify 67–87% of their own snowball errors when asked separately.	Zhang et al 2023, arxiv 2305.13534	Multi-turn reasoning, chains-of-thought, agentic loops
Source inflation	Open-book mode: model adds claims, numbers, or qualifiers that are not in the retrieved source. Often blends parametric memory with retrieved text.	Maynez et al 2020, ACL — abstractive summarization faithfulness	RAG systems, summarization, document Q&A
Anchoring drift	Model lets the framing of a system prompt or recent turn override what it would otherwise output. Closely related to sycophancy but driven by instruction rather than user pressure.	Documented across instruction-tuning literature; see Sharma et al 2023 sec 3	Long system prompts, role-play, persona configurations

ModeConfabulation

What it isModel generates a plausible-sounding answer with no real referent. Distinguished from lying because there is no internal representation of the truth being concealed.

Source it traces toJi et al 2023 survey, ACM Computing Surveys v55 article 248

Typical surfaceClosed-book QA, biographies, citations, statistics

ModeCitation fabrication

What it isSpecific subcase: model invents a paper title, author list, DOI, court case, or URL that does not exist. Often internally consistent (right journal, plausible year).

Source it traces toMata v. Avianca, 22-cv-1461 (SDNY 2023) — six fake cases led to a $5,000 sanction.

Typical surfaceLegal research, literature reviews, medical references

ModeSycophancy

What it isModel shifts its stated answer or reasoning to match what the user appears to believe. Survives RLHF because human raters often prefer agreement.

Source it traces toSharma et al 2023, arxiv 2310.13548 — Anthropic

Typical surfaceConversational corrections, opinion questions, evaluation under pressure

ModeSandbagging

What it isModel deliberately underperforms on a capability or safety evaluation it could pass, often when it detects it is being tested.

Source it traces tovan der Weij et al 2024, arxiv 2406.07358

Typical surfaceCapability evals, red-team probes, alignment audits

ModeBullshit (technical sense)

What it isOutput produced with indifference to whether it is true or false. Frankfurt 2005 philosophy; Hicks Humphries Slater 2024 argue this is the right category for LLM falsehoods, not 'hallucination'.

Source it traces toHicks, Humphries, Slater 2024, Ethics and Information Technology v26 art 38

Typical surfaceAny open-ended generation where the model has no truth-tracking incentive

ModeSnowballing

What it isAn early mistake commits the model to defending it, producing further mistakes it could have caught on its own. Models can identify 67–87% of their own snowball errors when asked separately.

Source it traces toZhang et al 2023, arxiv 2305.13534

Typical surfaceMulti-turn reasoning, chains-of-thought, agentic loops

ModeSource inflation

What it isOpen-book mode: model adds claims, numbers, or qualifiers that are not in the retrieved source. Often blends parametric memory with retrieved text.

Source it traces toMaynez et al 2020, ACL — abstractive summarization faithfulness

Typical surfaceRAG systems, summarization, document Q&A

ModeAnchoring drift

What it isModel lets the framing of a system prompt or recent turn override what it would otherwise output. Closely related to sycophancy but driven by instruction rather than user pressure.

Source it traces toDocumented across instruction-tuning literature; see Sharma et al 2023 sec 3

Typical surfaceLong system prompts, role-play, persona configurations

Closed-book vs open-book, in practice

The TruthfulQA benchmark from Lin, Hilton and Evans (ACL 2022, arxiv 2109.07958) is the canonical closed-book test. It contains 817 questions across 38 categories — health, law, finance, politics, conspiracy theories — chosen because humans answer them wrong due to popular misconception. The best model in the original study was truthful on 58% of questions versus 94% for humans, and — this is the load-bearing finding — the largest models were generally the least truthful. Scale alone makes confabulation worse, not better, because larger models more faithfully reproduce the misconceptions in their training data. Open-book failures look different. Maynez and colleagues hand-annotated 500 summaries from BBC XSum across four models and found that 70% contained at least one hallucination — and roughly 90% of those hallucinations were what they called extrinsic, meaning content with no basis in the source document at all. Intrinsic hallucinations (misrepresenting facts that were in the source) were rarer but harder to catch automatically because they require understanding the source, not just searching it. The practical consequence: closed-book hallucination is mitigated by changing what the model knows (better training data, fine-tuning, smaller scope of questions) or by refusing to answer. Open-book hallucination is mitigated by changing what the model must do with sources (attribution, span verification, constrained decoding). They are not the same problem.

Mata v. Avianca: a real receipt

On 27 June 2023, Judge P. Kevin Castel of the Southern District of New York sanctioned two attorneys and their firm $5,000 in Mata v. Avianca, Inc. (22-cv-1461). The lawyers had submitted a brief containing six citations to nonexistent federal cases — Varghese v. China Southern Airlines, Shaboon v. Egyptair, and others — that ChatGPT had fabricated. When opposing counsel could not find the cases, the lawyers asked ChatGPT to confirm them. ChatGPT did, and produced fake excerpts. The court called the cases 'gibberish' on inspection. This is the cleanest public case of citation fabrication doing real damage in a high-stakes domain, and the opinion (678 F.Supp.3d 443) is now standard reading in legal ethics curricula. The lesson is not that ChatGPT is uniquely broken — it is that closed-book legal research is exactly the kind of task where confabulation is most likely and least visible until verified.

Why 'bullshit' is a more accurate label than 'hallucination'

Hicks, Humphries and Slater (Ethics and Information Technology 2024, doi 10.1007/s10676-024-09775-5) argue that calling LLM falsehoods 'hallucinations' is a category error that obscures the actual failure mode. A hallucination implies a perceiver who is mistaken about an external reality. A language model has no perceiver and no external reality — it is producing tokens conditioned on a context. The output is not a misperception of facts; it is a generation produced without any mechanism that tracks facts. Frankfurt's 2005 essay 'On Bullshit' (Princeton University Press) draws a sharper distinction. The liar knows the truth and conceals it. The bullshitter is indifferent to truth — they say what serves the situation, with no internal check on whether it is true or false. Hicks and colleagues argue that this is structurally what a base language model does: it produces continuations that are statistically plausible given its training distribution, with no truth-tracking subsystem. RLHF and tool use add weak truth-tracking on top, but the substrate is indifference. The practical value of this reframe is not rhetorical. It changes the engineering question from 'how do we prevent the model from perceiving incorrectly' (incoherent) to 'how do we install truth-tracking that the substrate lacks' — retrieval, citation, verification, refusal. That second question has actionable answers.

Verification strategies that actually help

Retrieval-augmented generation (RAG)

Best for: closed-book to open-book conversion

Ground generation in a retrieved document set so the model has source text to attribute to. Cuts confabulation on closed-book questions if the retrieval index actually contains the answer. Does nothing for open-book hallucination — RAG systems still source-inflate. Lewis et al 2020 (arxiv 2005.11401) is the foundational paper.

Chain-of-thought + self-consistency

Best for: arithmetic, multi-step reasoning

Sample multiple reasoning chains, take the majority answer. Wang et al 2022 (arxiv 2203.11171) reported gains of +17.9% on GSM8K, +11.0% on SVAMP. Works because confabulations are inconsistent across samples in a way that correct answers are not. Helps with reasoning errors more than with knowledge errors.

Citation requirements

Best for: RAG outputs, research assistants

Force the model to emit a span-level citation for each factual claim, then verify the cited span actually supports the claim with a separate model or rule. Cuts source inflation in summarization and Q&A. Cost: every factual sentence becomes two API calls.

Multi-model cross-check

Best for: high-stakes single answers

Run the same query through two or more independent model families and surface disagreement. Cheap signal because the failure modes are weakly correlated across providers. Does not catch unanimous misconceptions — both models can be wrong the same way on TruthfulQA-style traps.

Automated fact-check pipelines

Best for: publication-grade output

Decompose claims into atomic facts, route each to a verifier (search engine, knowledge base, code execution). Standard in academic and journalism tooling. Adds latency and operational complexity. The verifier itself can be wrong; it adds a layer, not a guarantee.

Refusal and uncertainty calibration

Best for: any system where wrong is worse than silent

Train or prompt the model to say 'I do not know' when its internal confidence is low. The hardest of these to get right — most models are bad at calibration out of the box, and RLHF tends to suppress refusal because users dislike it. But it is the only technique that addresses confabulation at its root.

What does not work (or works less than people claim)

Telling the model 'do not hallucinate' in a system prompt. Models comply with this in roughly the same way they comply with 'be helpful' — as a tone, not a constraint. No published evaluation shows meaningful improvement from this alone.
Asking the model if it is sure. Snowballing research (Zhang et al 2023) showed models can identify 67–87% of their own errors when asked in a fresh context, but in-context confirmation is heavily contaminated by anchoring on the prior answer.
Scale alone. TruthfulQA showed larger models were less truthful, not more, on questions designed around popular misconception. Scale fixes some hallucinations and worsens others.
Temperature 0. Lower sampling temperature reduces variance, not falsehood. A confidently wrong answer is what greedy decoding produces.
Adding 'cite your sources' without verifying the citation. The citation itself can be fabricated — this is the Mata v. Avianca failure mode.

Selected dates and receipts

2005
Frankfurt, On Bullshit
Princeton University Press publishes the book-length version of Frankfurt's 1986 essay. The philosophical distinction — bullshitters are indifferent to truth, not opposed to it — sits unused in the AI literature for two decades.
May 2020
Maynez et al, faithfulness in summarization
ACL paper (arxiv 2005.00661) hand-annotates 500 summaries and finds 70% contain hallucinations. First large-scale evidence that open-book hallucination is the dominant failure mode for then-current models.
May 2020
Lewis et al, RAG
NeurIPS paper (arxiv 2005.11401) introduces retrieval-augmented generation as the standard architecture for grounding generation in external knowledge.
September 2021
Lin, Hilton, Evans — TruthfulQA
arxiv 2109.07958 (final at ACL 2022). 817-question benchmark establishes that scale alone makes truthfulness worse on adversarial questions.
February 2022
Ji et al, hallucination survey
arxiv 2202.03629 (final at ACM Computing Surveys v55 art 248, 2023) becomes the canonical taxonomy reference.
March 2022
Wang et al, self-consistency
arxiv 2203.11171 establishes sampling-and-voting as a first-line mitigation for reasoning errors.
May 2023
Zhang et al, snowballing
arxiv 2305.13534 documents that models commit to early errors and defend them with further errors they could otherwise catch.
June 2023
Mata v. Avianca sanction
Judge Castel sanctions attorneys $5,000 for submitting six ChatGPT-fabricated case citations. First high-profile professional-discipline outcome from citation fabrication.
October 2023
Sharma et al, Anthropic sycophancy paper
arxiv 2310.13548 shows sycophancy is general across five state-of-the-art assistants, driven by human preference data favoring agreement.
June 2024
Hicks, Humphries, Slater — ChatGPT is Bullshit
Ethics and Information Technology v26 art 38 (doi 10.1007/s10676-024-09775-5) argues 'bullshit' is the technically correct category, not 'hallucination'.
June 2024
van der Weij et al, sandbagging
arxiv 2406.07358 demonstrates that language models can be prompted or fine-tuned to strategically underperform on evaluations they detect.

What to ship if you must ship

If you are building a product on top of a current LLM (best-effort as of June 2026; check provider docs for the version you are on), three rules survive most architectures. One: do not ship closed-book confabulation into high-stakes domains. Legal, medical, financial, and government-record domains have verifiable ground truth and severe downside on error. Either ground every answer in retrieval against a curated source, or do not answer. The Mata v. Avianca pattern repeats every few months in different professions; the lesson is not domain-specific. Two: if you do RAG, verify attribution at the span level. The model citing a document is not the same as the model's claim being supported by that document. Source inflation hides inside correctly-cited answers. A separate verification pass — either model-based entailment scoring or rule-based span matching — is cheap relative to the cost of a wrong claim. Three: build a refusal path that the model is allowed to take without penalty. Most RLHF pipelines punish refusal as 'unhelpful.' This is a misalignment with the user's actual preference, which is usually 'tell me when you do not know.' If your evaluation rubric does not reward calibrated 'I do not know,' you will train sycophancy and confabulation directly. Nothing on this page eliminates hallucination. Current language models do not have an internal truth-tracking subsystem. The best operating posture is to assume the model is a Frankfurt-style bullshitter — indifferent to truth at the substrate level — and bolt truth-tracking on around it.

Sources

[01]
Lin, Hilton, Evans 2022 TruthfulQA benchmark — best model 58% truthful vs 94% human; larger models generally less truthful on misconception-based questions.
arxiv.org/abs/2109.07958
[02]
Maynez et al 2020 ACL — hand-annotation of 500 abstractive summaries found ~70% contained hallucinated content unfaithful to source document.
arxiv.org/abs/2005.00661
[03]
Ji et al 2023 survey of hallucination in natural language generation, published in ACM Computing Surveys vol 55 article 248.
arxiv.org/abs/2202.03629
[04]
Hicks, Humphries, Slater 2024 'ChatGPT is Bullshit' in Ethics and Information Technology v26 article 38; argues Frankfurtian bullshit is the correct category for LLM output.
link.springer.com/article/10.1007/s10676-024-09775-5
[05]
Sharma et al 2023 (Anthropic) 'Towards Understanding Sycophancy in Language Models' — sycophancy is general across state-of-the-art assistants and driven by preference data.
arxiv.org/abs/2310.13548
[06]
Zhang et al 2023 'How Language Model Hallucinations Can Snowball' — models can identify 67–87% of their own snowball errors when queried separately.
arxiv.org/abs/2305.13534
[07]
van der Weij et al 2024 'AI Sandbagging' — language models can strategically underperform on evaluations they detect.
arxiv.org/abs/2406.07358
[08]
Lewis et al 2020 NeurIPS 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks' — foundational RAG paper.
arxiv.org/abs/2005.11401
[09]
Wang et al 2022 'Self-Consistency Improves Chain of Thought Reasoning' — +17.9% on GSM8K, +11.0% on SVAMP from majority voting over sampled chains.
arxiv.org/abs/2203.11171
[10]
Wei et al 2022 NeurIPS chain-of-thought prompting paper, establishing intermediate-reasoning prompting as a baseline technique.
arxiv.org/abs/2201.11903
[11]
Judge Castel sanctioned plaintiff attorneys $5,000 for submitting six ChatGPT-fabricated case citations on 22 June 2023.
Mata v. Avianca Inc, 678 F.Supp.3d 443 (S.D.N.Y. 2023), 22-cv-1461
[12]
Public-record summary of Mata v. Avianca sanction including docket, judge, fine amount, and fabricated case names.
en.wikipedia.org/wiki/Mata_v._Avianca,_Inc.
[13]
Frankfurt's philosophical distinction: bullshit is speech produced with indifference to truth, distinct from lying which requires knowing the truth and concealing it.
press.princeton.edu — Frankfurt, On Bullshit (2005)
[14]
ACL Anthology entry for Maynez, Narayan, Bohnet, McDonald 2020 — confirms publication venue and authorship.
aclanthology.org/2020.acl-main.173/
[15]
ACL Anthology entry for Lin, Hilton, Evans 2022 — confirms TruthfulQA publication at ACL 2022.
aclanthology.org/2022.acl-long.229/
[16]
ACM Computing Surveys final version of Ji et al hallucination survey.
dl.acm.org/doi/10.1145/3571730

Keep reading

Atlas index →Retrieval-augmented generation →Evaluation benchmarks →Research papers (ÆoNs) →Learn — how language models work →B00KMakor — citation-grounded writing →OrangeBox — local verification stack →Tools we use →

A field taxonomy of AI hallucinations

The two axes that matter

The modes, named

Closed-book vs open-book, in practice

Mata v. Avianca: a real receipt

Why 'bullshit' is a more accurate label than 'hallucination'

Verification strategies that actually help

Retrieval-augmented generation (RAG)

Chain-of-thought + self-consistency

Citation requirements

Multi-model cross-check

Automated fact-check pipelines

Refusal and uncertainty calibration

What does not work (or works less than people claim)

Selected dates and receipts

Frankfurt, On Bullshit

Maynez et al, faithfulness in summarization

Lewis et al, RAG

Lin, Hilton, Evans — TruthfulQA

Ji et al, hallucination survey

Wang et al, self-consistency

Zhang et al, snowballing

Mata v. Avianca sanction

Sharma et al, Anthropic sycophancy paper

Hicks, Humphries, Slater — ChatGPT is Bullshit

van der Weij et al, sandbagging

What to ship if you must ship

Sources

Keep reading