
A field taxonomy of AI hallucinations
What models actually do wrong, what to call it, and what reduces it.
The two axes that matter
The modes, named
Closed-book vs open-book, in practice
Mata v. Avianca: a real receipt
On 22 June 2023, Judge P. Kevin Castel of the Southern District of New York sanctioned two attorneys and their firm $5,000 in Mata v. Avianca, Inc. (22-cv-1461). The lawyers had submitted a brief containing six citations to nonexistent federal cases (Varghese v. China Southern Airlines, Shaboon v. Egyptair, and others) that ChatGPT had fabricated. When opposing counsel could not find the cases, the lawyers asked ChatGPT to confirm them. ChatGPT did, and produced fake excerpts. The court called the cases 'gibberish' on inspection. This is the cleanest public case of citation fabrication doing real damage in a high-stakes domain, and the opinion (678 F.Supp.3d 443) is now standard reading in legal ethics curricula. The lesson is not that ChatGPT is uniquely broken. It is that closed-book legal research is exactly the kind of task where confabulation is most likely and least visible until verified.
Why 'bullshit' is a more accurate label than 'hallucination'
Verification strategies that actually help
Retrieval-augmented generation (RAG)
Best for: closed-book to open-book conversion
Ground generation in a retrieved document set so the model has source text to attribute to. Cuts confabulation on closed-book questions if the retrieval index actually contains the answer. Does nothing for open-book hallucination — RAG systems still source-inflate. Lewis et al 2020 (arxiv 2005.11401) is the foundational paper.
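The condition "if the retrieval index actually contains the answer" can be made explicit in code. The following is a minimal sketch, not the architecture from Lewis et al: it assumes a toy token-overlap retriever in place of a real embedding index, and the names `retrieve`, `build_prompt`, and `min_overlap` are hypothetical. The point it illustrates is that when retrieval finds nothing usable, the caller gets `None` and should abstain rather than silently fall back to closed-book generation.

```python
import re
from typing import Optional


def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; a stand-in for a real embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, index: list[str], k: int = 2, min_overlap: int = 2) -> list[str]:
    """Return up to k passages sharing at least min_overlap tokens with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(passage)), passage) for passage in index]
    scored = [(score, passage) for score, passage in scored if score >= min_overlap]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k]]


def build_prompt(query: str, passages: list[str]) -> Optional[str]:
    """Build a grounded prompt, or None when retrieval found nothing usable."""
    if not passages:
        return None  # assumption: the caller abstains instead of answering closed-book
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the numbered sources. "
        "If they do not contain the answer, say so.\n"
        f"Sources:\n{sources}\n"
        f"Question: {query}"
    )
```

The overlap threshold is arbitrary here. In a real system the equivalent knob is a similarity cutoff on the retriever, and it decides how often the system stays open-book.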
Chain-of-thought + self-consistency
Best for: arithmetic, multi-step reasoning
Sample multiple reasoning chains and take the majority answer. Wang et al 2022 (arxiv 2203.11171) reported absolute accuracy gains of +17.9% on GSM8K and +11.0% on SVAMP over standard chain-of-thought prompting. It works because confabulations are inconsistent across samples in a way that correct answers are not. It helps with reasoning errors more than with knowledge errors.
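The voting step can be sketched in a few lines. This assumes the final answers have already been extracted from each sampled chain (the extraction and the sampling calls are out of scope here), and the function name `self_consistent_answer` is hypothetical.

```python
from collections import Counter


def self_consistent_answer(final_answers: list[str]) -> tuple[str, float]:
    """Majority vote over final answers extracted from sampled reasoning chains.

    Returns the winning answer and the fraction of samples that agreed with it.
    """
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    # Normalize so that " 18 " and "18" count as the same vote.
    normalized = [answer.strip().lower() for answer in final_answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


# Five sampled chains, three of which end in the same answer.
print(self_consistent_answer(["18", "18", "26", " 18 ", "17"]))  # ('18', 0.6)
```

The agreement fraction is worth keeping alongside the answer: a narrow majority is a hint that the question may need a different mitigation.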
Citation requirements
Best for: RAG outputs, research assistants
Force the model to emit a span-level citation for each factual claim, then verify with a separate model or rule that the cited span actually supports the claim. Cuts source inflation in summarization and Q&A. Cost: when the verifier is a model, every factual sentence needs a second API call.
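A rule-based verifier for the second step might look like the sketch below. It is deliberately crude and rests on two assumptions: that the cited span must appear verbatim in the source, and that shared content words are an acceptable proxy for "supports". The names `check_citation` and `min_support` are hypothetical, and a production system would more likely use an entailment model for the support check.

```python
import re


def _norm(text: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting differences do not matter."""
    return " ".join(text.lower().split())


def _content_words(text: str) -> set[str]:
    """Words longer than three characters; a crude proxy for content terms."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def check_citation(claim: str, cited_span: str, source: str, min_support: float = 0.5) -> str:
    """Classify a (claim, cited span) pair against the source text.

    'fabricated'  : the span does not occur in the source at all
    'unsupported' : the span exists but shares too few content words with the claim
    'supported'   : the span exists and overlaps the claim enough to pass this rule
    """
    if _norm(cited_span) not in _norm(source):
        return "fabricated"
    claim_words = _content_words(claim)
    if not claim_words:
        return "unsupported"
    overlap = len(claim_words & _content_words(cited_span)) / len(claim_words)
    return "supported" if overlap >= min_support else "unsupported"
```

The existence check is the cheap part and catches outright invented quotations. The overlap check only approximates support and will pass a span that shares vocabulary with the claim while contradicting it.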
Multi-model cross-check
Best for: high-stakes single answers
Run the same query through two or more independent model families and surface disagreement. Cheap signal because the failure modes are weakly correlated across providers. Does not catch unanimous misconceptions — both models can be wrong the same way on TruthfulQA-style traps.
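Surfacing disagreement is mostly bookkeeping once the answers are in hand. The sketch below assumes the provider calls have already been made and that answers are short enough to compare after whitespace and case normalization; `cross_check` and the provider labels are hypothetical.

```python
from collections import defaultdict


def cross_check(answers: dict[str, str]) -> dict:
    """Group provider answers by normalized text and report whether all agree."""
    groups: dict[str, list[str]] = defaultdict(list)
    for provider, answer in answers.items():
        groups[" ".join(answer.lower().split())].append(provider)
    return {
        # Unanimity means the models agree, not that they are right.
        "unanimous": len(groups) == 1,
        "groups": dict(groups),
    }


result = cross_check({"model_a": "1986", "model_b": " 1986", "model_c": "2005"})
print(result["unanimous"])  # False
```

For free-text answers, exact matching after normalization is too strict, and some semantic comparison would be needed in its place.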
Automated fact-check pipelines
Best for: publication-grade output
Decompose claims into atomic facts, route each to a verifier (search engine, knowledge base, code execution). Standard in academic and journalism tooling. Adds latency and operational complexity. The verifier itself can be wrong; it adds a layer, not a guarantee.
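The routing step can be sketched as a dispatch table from claim kind to verifier. Everything named here is an assumption for illustration: the `Claim` type, the `kind` labels, and the two toy verifiers (one recomputes simple sums, one looks up exact strings in a set). Each verifier returns `True`, `False`, or `None` for "cannot tell", which keeps an unverified claim distinct from a refuted one.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Verifier = Callable[[str], Optional[bool]]


@dataclass
class Claim:
    kind: str  # hypothetical label assigned by the decomposition step
    text: str


def verify_arithmetic(text: str) -> Optional[bool]:
    """Check claims of the form 'a + b = c' by recomputing them."""
    try:
        left, right = text.split("=")
        a, b = left.split("+")
        return int(a) + int(b) == int(right)
    except ValueError:
        return None  # not in a form this verifier understands


def make_kb_verifier(known_facts: set[str]) -> Verifier:
    """Verifier that can only confirm facts it holds; absence is unknown, not false."""
    def verify(text: str) -> Optional[bool]:
        return True if text in known_facts else None
    return verify


def run_pipeline(claims: list[Claim], verifiers: dict[str, Verifier]) -> list[tuple[str, Optional[bool]]]:
    """Route each claim to the verifier registered for its kind."""
    results = []
    for claim in claims:
        verifier = verifiers.get(claim.kind)
        verdict = verifier(claim.text) if verifier else None
        results.append((claim.text, verdict))
    return results
```

Claims that come back `None` are the ones to escalate to a human or a slower verifier. Treating them as passed would defeat the purpose of the layer.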
Refusal and uncertainty calibration
Best for: any system where wrong is worse than silent
Train or prompt the model to say 'I do not know' when its internal confidence is low. The hardest of these to get right — most models are bad at calibration out of the box, and RLHF tends to suppress refusal because users dislike it. But it is the only technique that addresses confabulation at its root.
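Two pieces of this are easy to sketch: the abstention rule itself, and a way to measure whether the confidence feeding it is any good. The code below assumes a per-answer confidence in [0, 1] is available from somewhere (token log-probabilities or sample agreement are common proxies, neither of which is specified in this article). The names `answer_or_abstain` and `expected_calibration_error`, and the 0.75 threshold, are illustrative.

```python
def answer_or_abstain(answer: str, confidence: float, threshold: float = 0.75) -> str:
    """Return the answer only when confidence clears the threshold."""
    return answer if confidence >= threshold else "I do not know"


def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A model that reports 0.9 confidence while being right half the time scores about 0.4 on this metric. With a gap that large, a threshold-based abstention rule will refuse the wrong questions.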
What does not work (or works less than people claim)
- Telling the model 'do not hallucinate' in a system prompt. Models comply with this in roughly the same way they comply with 'be helpful' — as a tone, not a constraint. No published evaluation shows meaningful improvement from this alone.
- Asking the model if it is sure. Snowballing research (Zhang et al 2023) showed that ChatGPT and GPT-4 could identify 67% and 87% of their own errors, respectively, when asked in a fresh context, but in-context confirmation is heavily contaminated by anchoring on the prior answer.
- Scale alone. TruthfulQA showed larger models were less truthful, not more, on questions designed around popular misconceptions. Scale fixes some hallucinations and worsens others.
- Temperature 0. Lower sampling temperature reduces variance, not falsehood. If the most likely continuation is wrong, greedy decoding produces that wrong answer every time.
- Adding 'cite your sources' without verifying the citation. The citation itself can be fabricated — this is the Mata v. Avianca failure mode.
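The temperature item in the list above can be checked numerically. The sketch below applies the standard softmax with temperature to a made-up set of logits, on the assumption that index 0 stands for a wrong token that happens to score highest. Lowering the temperature moves probability mass onto that token without ever changing which token ranks first.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits / temperature, computed stably by subtracting the max."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; greedy decoding is the limit T -> 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: index 0 is the top-ranked token and, by assumption, wrong.
logits = [2.0, 1.0, 0.5]
for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    top = probs.index(max(probs))
    print(f"T={t}: top token index {top}, probability {probs[top]:.4f}")
```

At every temperature the top index stays the same. Only the sampling variance around it shrinks, so a wrong top-ranked token is simply emitted more reliably.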
Selected dates and receipts
2005
Frankfurt, On Bullshit
Princeton University Press publishes the book-length version of Frankfurt's 1986 essay. The philosophical distinction — bullshitters are indifferent to truth, not opposed to it — sits unused in the AI literature for two decades.
May 2020
Maynez et al, faithfulness in summarization
ACL paper (arxiv 2005.00661) hand-annotates 500 abstractive summaries and finds that roughly 70% contain hallucinated content. First large-scale evidence that open-book hallucination is the dominant failure mode for then-current models.
May 2020
Lewis et al, RAG
NeurIPS paper (arxiv 2005.11401) introduces retrieval-augmented generation as the standard architecture for grounding generation in external knowledge.
September 2021
Lin, Hilton, Evans — TruthfulQA
arxiv 2109.07958 (final at ACL 2022). 817-question benchmark shows that larger models were generally less truthful on adversarial, misconception-based questions, so scale alone does not fix truthfulness.
February 2022
Ji et al, hallucination survey
arxiv 2202.03629 (final at ACM Computing Surveys v55 art 248, 2023) becomes the canonical taxonomy reference.
March 2022
Wang et al, self-consistency
arxiv 2203.11171 establishes sampling-and-voting as a first-line mitigation for reasoning errors.
May 2023
Zhang et al, snowballing
arxiv 2305.13534 documents that models commit to early errors and defend them with further errors they could otherwise catch.
June 2023
Mata v. Avianca sanction
Judge Castel sanctions attorneys $5,000 for submitting six ChatGPT-fabricated case citations. First high-profile professional-discipline outcome from citation fabrication.
October 2023
Sharma et al, Anthropic sycophancy paper
arxiv 2310.13548 shows sycophancy is general across five state-of-the-art assistants, driven by human preference data favoring agreement.
June 2024
Hicks, Humphries, Slater — ChatGPT is Bullshit
Ethics and Information Technology v26 art 38 (doi 10.1007/s10676-024-09775-5) argues 'bullshit' is the technically correct category, not 'hallucination'.
June 2024
van der Weij et al, sandbagging
arxiv 2406.07358 demonstrates that language models can be prompted or fine-tuned to strategically underperform on evaluations they detect.
What to ship if you must ship
Sources
- [01]
Lin, Hilton, Evans 2022 TruthfulQA benchmark — best model 58% truthful vs 94% human; larger models generally less truthful on misconception-based questions.
arxiv.org/abs/2109.07958
- [02]
Maynez et al 2020 ACL — hand-annotation of 500 abstractive summaries found ~70% contained hallucinated content unfaithful to source document.
arxiv.org/abs/2005.00661
- [03]
Ji et al 2023 survey of hallucination in natural language generation, published in ACM Computing Surveys vol 55 article 248.
arxiv.org/abs/2202.03629
- [04]
Hicks, Humphries, Slater 2024 'ChatGPT is Bullshit' in Ethics and Information Technology v26 article 38; argues Frankfurtian bullshit is the correct category for LLM output.
link.springer.com/article/10.1007/s10676-024-09775-5
- [05]
Sharma et al 2023 (Anthropic) 'Towards Understanding Sycophancy in Language Models' — sycophancy is general across state-of-the-art assistants and driven by preference data.
arxiv.org/abs/2310.13548
- [06]
Zhang et al 2023 'How Language Model Hallucinations Can Snowball': ChatGPT and GPT-4 could identify 67% and 87% of their own snowballed errors, respectively, when queried separately.
arxiv.org/abs/2305.13534
- [07]
van der Weij et al 2024 'AI Sandbagging' — language models can strategically underperform on evaluations they detect.
arxiv.org/abs/2406.07358
- [08]
Lewis et al 2020 NeurIPS 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks' — foundational RAG paper.
arxiv.org/abs/2005.11401
- [09]
Wang et al 2022 'Self-Consistency Improves Chain of Thought Reasoning' — +17.9% on GSM8K, +11.0% on SVAMP from majority voting over sampled chains.
arxiv.org/abs/2203.11171
- [10]
Wei et al 2022 NeurIPS chain-of-thought prompting paper, establishing intermediate-reasoning prompting as a baseline technique.
arxiv.org/abs/2201.11903
- [11]
Judge Castel sanctioned plaintiff attorneys $5,000 for submitting six ChatGPT-fabricated case citations on 22 June 2023.
Mata v. Avianca Inc, 678 F.Supp.3d 443 (S.D.N.Y. 2023), 22-cv-1461
- [12]
Public-record summary of Mata v. Avianca sanction including docket, judge, fine amount, and fabricated case names.
en.wikipedia.org/wiki/Mata_v._Avianca,_Inc.
- [13]
Frankfurt's philosophical distinction: bullshit is speech produced with indifference to truth, distinct from lying which requires knowing the truth and concealing it.
press.princeton.edu — Frankfurt, On Bullshit (2005)
- [14]
ACL Anthology entry for Maynez, Narayan, Bohnet, McDonald 2020 — confirms publication venue and authorship.
aclanthology.org/2020.acl-main.173/
- [15]
ACL Anthology entry for Lin, Hilton, Evans 2022 — confirms TruthfulQA publication at ACL 2022.
aclanthology.org/2022.acl-long.229/
- [16]
ACM Computing Surveys final version of Ji et al hallucination survey.
dl.acm.org/doi/10.1145/3571730