AI threat model — per use case

Twelve attack classes, real incidents, mitigations, and what residual risk remains after you do the work.

Most AI threat write-ups are either generic ("be careful with prompts") or so framework-heavy they never name an actual attacker move. This page tries to do the opposite: name the attack, point at a real incident or public proof-of-concept, state the mitigation in plain language, and admit what risk is left over after you implement it. We are not in the business of pretending mitigations are total; they almost never are.

Scope is the application layer. We assume you are building or operating something on top of a foundation model: a chatbot, an agent, a RAG system, a fine-tuned classifier, an API integration. We are not threat-modeling the lab that trained the foundation model; that is their problem and they publish their own work on it. We are threat-modeling you, the deployer.

The taxonomy here borrows heavily from three sources we trust and link below: OWASP's Top 10 for LLM Applications (the application-layer view), MITRE ATLAS (the adversarial tactic/technique view, modeled on ATT&CK), and the NIST AI Risk Management Framework (the governance view). Where these three disagree on naming, we use the plainest term. Where they agree, we cite all three.

A note on certainty. Several of the incidents below are based on public reporting and security researcher disclosures. Where a detail is contested or where the affected vendor has not confirmed, we say so. Where pricing or model-specific behavior is involved, we mark it best-effort as of June 2026; check provider documentation for the current state of the world, because this field moves quarterly.

This is not legal advice, not a compliance checklist, and not a substitute for an actual red-team engagement against your specific system. It is a starting map.

How to read this page

Each threat below follows the same shape: what it is in one sentence, a real example with citation, severity rated low / medium / high / critical, the practical mitigation, and the residual risk that remains after the mitigation is in place. Severity is our judgment based on impact and frequency in the public record; your environment may rate it differently. The mitigations are the minimum effective dose; you can always do more, and for high-stakes systems you should. We list cross-references to the OWASP Top 10 for LLM Applications (IDs LLM01 through LLM10; the v1.1 and 2025 editions number several entries differently, so check which edition your organization uses), MITRE ATLAS technique IDs where one fits, and the relevant NIST AI RMF function (Govern, Map, Measure, Manage) so you can locate the threat in whichever framework your organization has standardized on.

The twelve threats at a glance

Threat	OWASP LLM	MITRE ATLAS	Severity	Hardest part of mitigation
Direct prompt injection	LLM01	AML.T0051	High	There is no clean separation between instructions and data inside a single context window.
Indirect prompt injection (website, email, PDF)	LLM01	AML.T0051	Critical	Untrusted content gets pulled into the context by your own tools, often invisibly.
Data poisoning (training set or RAG corpus)	LLM03 / LLM04	AML.T0020	High	Detecting a poisoned document among millions of clean ones.
Model theft (weight extraction, distillation)	LLM10	AML.T0024	Medium	Rate limits help but a determined extractor with budget can still distill.
Jailbreaking	LLM01	AML.T0054	Medium	Jailbreaks generalize across models; your filter will lag the attacker's prompt by days.
Hallucinated output acted on	LLM09	n/a	High	The model is fluent and confident; the integration code trusts it.
Output redirected to attacker (exfiltration via tool use)	LLM02 / LLM07	AML.T0048	Critical	Agentic systems with both untrusted-content read and external-network write are inherently dangerous.
Denial-of-wallet (token / cost bomb)	LLM04	AML.T0034	Medium	Cheap to mount, expensive to absorb, and rate limits cut user experience.
API key leak	LLM06	AML.T0012	Critical	Keys end up in client bundles, git history, screenshots, and logs.
Sensitive data exfiltration through the model	LLM06	AML.T0024.001	High	RAG pulls in PII; output filtering is imperfect; logs retain prompts.
Deceptive alignment / sandbagging at deploy time	n/a (research)	n/a	Low–Medium (today)	Current evals do not robustly detect a model that behaves well during testing and differently in deployment.
Supply chain (HF weights, GitHub packages, MCP servers)	LLM05 / LLM03	AML.T0010	High	You inherit the trust of everyone in the chain — and that chain is long.

Threats in depth

Each card below states the attack, the public example, the mitigation, and the residual risk. Citations are in the references section at the bottom.

Direct prompt injection

OWASP LLM01 · MITRE AML.T0051 · Severity: High

Attack: a user types instructions designed to override the system prompt or extract it. Example: in February 2023, a Stanford student got an early Bing Chat (then internal codename 'Sydney') to leak its system prompt by asking it to 'ignore previous instructions' and print the document above. The conversation was screenshotted and widely reported. Mitigation: treat the system prompt as non-secret (assume it will leak), validate model outputs against an allowlist of expected shapes, never put secrets into a system prompt, use a separate filtered call for safety-critical decisions. Residual risk: the model can still be persuaded into off-policy responses by sufficiently novel framings, and there is no robust separation of instruction from data inside a single token stream — this is an open research problem, not a solved one.
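One way to read "validate model outputs against an allowlist of expected shapes" is sketched below. It assumes the model was asked to reply with a JSON object holding exactly two fields, action and text; the field names, the action set, and the length cap are hypothetical choices for illustration, not part of any provider API.

```python
import json

# Hypothetical allowlist: the only actions the integration code will act on.
ALLOWED_ACTIONS = {"answer", "escalate", "refuse"}
MAX_TEXT_CHARS = 2000  # assumed cap, tune for your application


def parse_model_reply(raw):
    """Return the parsed reply dict, or raise ValueError if it is off-shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("reply is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if set(data) != {"action", "text"}:
        raise ValueError("unexpected or missing fields")
    action, text = data["action"], data["text"]
    # Check the type first: an unhashable value would break the set lookup.
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError("action is not on the allowlist")
    if not isinstance(text, str) or len(text) > MAX_TEXT_CHARS:
        raise ValueError("text is missing or too long")
    return data
```

A reply that fails validation is dropped or retried rather than repaired, so an injected instruction cannot smuggle in a new action name. This narrows what an injection can trigger; it does not stop the injection itself.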

Indirect prompt injection

OWASP LLM01 · MITRE AML.T0051 · Severity: Critical

Attack: malicious instructions are placed inside content the model reads as data: a webpage the agent browses, an email the assistant summarizes, a PDF uploaded for analysis, a calendar invite, a Slack message. Example: Kai Greshake and collaborators demonstrated in early 2023 that an attacker-controlled webpage could hijack an LLM browsing agent and exfiltrate the user's conversation; the work was published as 'Not what you've signed up for' on arXiv (2302.12173). Subsequent demonstrations against Microsoft Copilot, GitHub Copilot Chat, and Google's Gemini-in-Workspace ecosystem followed through 2024 and 2025. Mitigation: never let an agent that can read untrusted content also write to external systems in the same loop without a human-in-the-loop confirmation; label data provenance in the context; for high-stakes actions, route through a second model that does not see the untrusted content. Residual risk: any agent that both retrieves and acts is exposed; defense in depth helps but does not eliminate the exposure. Simon Willison calls the underlying combination the 'lethal trifecta' (read untrusted content + access private data + communicate externally).
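The provenance-labeling and confirmation rules above can be sketched as a small gate in the agent loop. Everything here is an assumption for illustration: the AgentContext class, the tool names, and the label format are hypothetical, and a real agent framework would need the same check wired into its own tool dispatcher.

```python
from dataclasses import dataclass, field

# Hypothetical names for tools that write to the outside world.
OUTBOUND_TOOLS = {"send_email", "http_post", "share_file"}


@dataclass
class AgentContext:
    tainted: bool = False  # becomes True once untrusted content is read
    sources: list = field(default_factory=list)

    def ingest(self, text, source, trusted):
        """Record where content came from and return it with a provenance label."""
        self.sources.append(source)
        if not trusted:
            self.tainted = True
        label = "trusted" if trusted else "untrusted"
        return f"[source={source} trust={label}]\n{text}"


def needs_human_approval(ctx, tool):
    """Outbound tools require confirmation once the context is tainted."""
    return ctx.tainted and tool in OUTBOUND_TOOLS
```

The taint flag is deliberately sticky: once any untrusted content has entered the context, every later outbound action in that session goes to a human. The provenance label helps reviewers and logs; it should not be relied on to make the model ignore injected text.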

Data poisoning

OWASP LLM03 / LLM04 · MITRE AML.T0020 · Severity: High

Attack: an adversary injects crafted documents into a training corpus or a retrieval index so the model learns a backdoor trigger or returns attacker-chosen content for certain queries. Example: Carlini et al. published 'Poisoning Web-Scale Training Datasets is Practical' (arXiv 2302.10149), showing two practical attacks. Split-view poisoning targets URL-indexed datasets such as LAION-400M, exploiting the fact that expired domains listed in a static dataset snapshot can be repurchased and repointed. Frontrunning poisoning targets datasets built from periodic snapshots of crowd-sourced content such as Wikipedia, timing malicious edits to land just before the snapshot. For RAG: any system that indexes user-uploaded documents or scrapes external sites at index time is exposed by construction. Mitigation: cryptographic hashing of training data, provenance signing, sandboxing of RAG ingestion, manual review of high-influence sources, content-level filters for known trigger patterns. Residual risk: detecting one bad document in a billion-scale corpus is not currently solvable; you are betting on attacker rarity rather than detection.
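The hashing mitigation can be sketched as a manifest recorded when the corpus is first reviewed and re-checked at every later ingestion. This is a minimal sketch assuming documents are available as bytes keyed by an ID; the function names are hypothetical.

```python
import hashlib


def sha256_hex(data):
    """Hex SHA-256 digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(docs):
    """Map each document ID to the hash of the content that was reviewed."""
    return {doc_id: sha256_hex(body) for doc_id, body in docs.items()}


def changed_documents(manifest, docs):
    """IDs whose content is new or no longer matches the recorded hash."""
    return sorted(
        doc_id
        for doc_id, body in docs.items()
        if manifest.get(doc_id) != sha256_hex(body)
    )
```

A hash only proves the content is the same as what was recorded. It catches content swapped after the snapshot, not a document that was already poisoned when the manifest was built.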

Model theft and distillation

OWASP LLM10 · MITRE AML.T0024 · Severity: Medium

Attack: an adversary queries your hosted model enough times to either extract weights (rare, theoretical for full extraction of large models) or to train a competitor model via distillation on your outputs. Example: Tramèr et al.'s 'Stealing Machine Learning Models via Prediction APIs' (USENIX Security 2016) established the baseline. More recently, Carlini et al.'s 2024 paper 'Stealing Part of a Production Language Model' (arXiv 2403.06634) showed it is possible to recover the embedding projection layer of production models including OpenAI's ada and babbage variants for under $20 in API calls (OpenAI subsequently patched the relevant API surface). Distillation is the larger commercial threat: most major labs prohibit using their outputs to train competing models in their terms of service. Mitigation: rate limits, anomaly detection on query patterns, output watermarking (imperfect), legal terms, monitoring for fine-tunes that show suspicious capability transfer. Residual risk: a determined competitor with a moderate budget can almost certainly distill a smaller model from your outputs; the question is whether you can prove it in court.
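As a rough illustration of the rate-limit mitigation, the sketch below counts requests per API key in a sliding time window. The class name and thresholds are hypothetical, and it keeps state in process memory, which a multi-server deployment would need to replace with a shared store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per key within any window_seconds span."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # api_key -> timestamps of allowed requests

    def allow(self, api_key, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits[api_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

The optional now argument exists so the limiter can be tested with fixed timestamps. A patient extractor can stay under any fixed threshold, so this slows extraction and raises its cost rather than preventing it.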

Jailbreaking

OWASP LLM01 · MITRE AML.T0054 · Severity: Medium

Attack: a user crafts a prompt that bypasses the model's safety training to elicit content the deployer does not want produced (hate, weapons info, sexual content, instructions for illegal activity). Examples: DAN (Do Anything Now) and its many descendants on Reddit through 2023–2024; Zou et al.'s 'Universal and Transferable Adversarial Attacks on Aligned Language Models' (arXiv 2307.15043) showed that gradient-based suffix attacks transfer across models. The 'many-shot jailbreaking' paper from Anthropic (April 2024) showed that long-context models are vulnerable to a new class of attack that simply fills the context with fake assistant responses. Mitigation: layered moderation (input filter, model-level safety training, output filter), use of safety-classifier models like Llama Guard or provider-supplied moderation APIs, monitoring for known jailbreak strings, lower-temperature decoding for sensitive deployments. Residual risk: jailbreaks generalize faster than filters can be updated; novel attacks have weeks of window before public mitigations exist. The cat-and-mouse is structural.
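The layered-moderation idea can be sketched as a wrapper that checks the prompt before the model is called and the reply before it is returned. The three callables are placeholders: in practice input_ok and output_ok might wrap a safety classifier or a provider moderation endpoint, and generate would call your model. None of them refer to a specific library.

```python
REFUSAL = "Sorry, I can't help with that."  # hypothetical refusal text


def moderated_reply(prompt, generate, input_ok, output_ok):
    """Run an input filter, then the model, then an output filter."""
    if not input_ok(prompt):
        return REFUSAL  # the model is never called for blocked prompts
    reply = generate(prompt)
    if not output_ok(reply):
        return REFUSAL  # catches jailbreaks that slipped past the input filter
    return reply
```

The two filters fail independently, which is the point of layering: a novel prompt that evades the input check still has to produce output that passes the second check.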

Hallucinated output acted on

OWASP LLM09 · Severity: High

Attack: this one is not really an attack; it is a self-inflicted wound. The model confidently produces incorrect output (a fake citation, a non-existent API endpoint, a made-up legal precedent), and downstream code or a human acts on it. Example: in Mata v. Avianca (S.D.N.Y. 2023), a lawyer submitted a brief containing six fabricated case citations generated by ChatGPT; the judge sanctioned the firm. Air Canada was held liable by a Canadian tribunal in February 2024 for a hallucinated bereavement-fare policy its chatbot invented. Research on AI package hallucination, often called 'slopsquatting' (Lasso Security, 2023–2024), found that LLM coding assistants regularly suggest package imports that do not exist; attackers can register those package names and ship malware. Mitigation: never trust model output as ground truth; verify any factual claim before acting; constrain output to structured schemas; for code, run package-name verification against a real registry before install; for citations, verify against a real database. Residual risk: the model is fluent. Humans and downstream code will trust it anyway. This is the most expensive class of error in practice and the one most underestimated by buyers.

Output redirected to attacker

OWASP LLM02 / LLM07 · MITRE AML.T0048 · Severity: Critical

Attack: an injection (direct or indirect) causes an agentic system with tool access to send data to an attacker-controlled destination — by emailing it, posting it to a URL, writing it to a public document, or rendering it as a markdown image that pings an attacker server. Example: Johann Rehberger has published a long series of exfiltration demonstrations against ChatGPT plugins, Microsoft 365 Copilot, GitHub Copilot Chat, Google Gemini, and others, often using markdown image rendering as the exfiltration channel (image src URLs pull from attacker domains, leaking query parameters). Mitigation: strict content security policies on rendered output, disallow markdown image rendering from untrusted-content origins, require human confirmation for any outbound action (email send, HTTP POST, file share), use a sandboxed execution environment, restrict tools to the minimum needed. Residual risk: any agentic system that combines untrusted-content ingestion with external-write capability is structurally vulnerable. The cleanest mitigation is to not build such systems for high-stakes data.
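The markdown-image rule can be sketched as a post-processing step on model output before it is rendered. The allowlisted host is a placeholder, and the regex only covers inline markdown images; reference-style images and raw HTML img tags would need separate handling, and a content security policy is still the stronger control.

```python
import re
from urllib.parse import urlparse

ALLOWED_IMAGE_HOSTS = {"static.example.com"}  # hypothetical trusted host

# Inline markdown image: ![alt](url "optional title")
_MD_IMAGE = re.compile(r"!\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)")


def strip_untrusted_images(markdown):
    """Replace images whose host is not allowlisted with a text placeholder."""

    def replace(match):
        alt, url = match.group(1), match.group(2)
        try:
            parsed = urlparse(url)
            trusted = parsed.scheme == "https" and parsed.hostname in ALLOWED_IMAGE_HOSTS
        except ValueError:
            trusted = False  # malformed URLs are treated as untrusted
        return match.group(0) if trusted else f"[image removed: {alt}]"

    return _MD_IMAGE.sub(replace, markdown)
```

Removing the image before rendering means the browser never issues the request, so data placed in the URL's query string by an injection never reaches the attacker's server.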

Denial-of-wallet

OWASP LLM04 · MITRE AML.T0034 · Severity: Medium

Attack: an adversary submits requests designed to maximize your token spend: long inputs, prompts that elicit long outputs, recursive agent loops. The goal is to drive the deployer's cloud bill up rather than to extract data. Example: the term 'denial-of-wallet' is older than LLMs (cloud cost attacks against serverless functions were studied as early as 2019), but it has become acute for token-billed apps. We cannot point to a single landmark incident report that names a victim, because vendors generally do not disclose these incidents, but several Y Combinator post-mortems and developer forum threads through 2024 describe four- and five-figure overnight bills from abusive traffic. Mitigation: per-user and per-IP rate limits, maximum input length, maximum output length, cost ceilings per session, anomaly detection on token-per-request patterns, hard daily budget caps at the provider level. Residual risk: rate limits that protect cost also degrade legitimate power-user experience; finding the right threshold is an ongoing tuning problem. Costs and limits noted here are best-effort as of June 2026; check provider docs for current pricing tiers.
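The input, output, and per-session ceilings can be combined in one small budget object, sketched below. The class and its limits are hypothetical, and token counts are assumed to come from your tokenizer or the provider's usage field.

```python
class BudgetExceeded(Exception):
    """Raised when a request would break a configured ceiling."""


class SessionBudget:
    def __init__(self, max_input_tokens, max_output_tokens, max_session_tokens):
        self.max_input_tokens = max_input_tokens
        self.max_output_tokens = max_output_tokens
        self.max_session_tokens = max_session_tokens
        self.spent = 0  # tokens billed so far in this session

    def authorize(self, input_tokens):
        """Return the output-token cap for this request, or raise."""
        if input_tokens > self.max_input_tokens:
            raise BudgetExceeded("input too long")
        remaining = self.max_session_tokens - self.spent - input_tokens
        if remaining <= 0:
            raise BudgetExceeded("session budget exhausted")
        return min(self.max_output_tokens, remaining)

    def record(self, input_tokens, output_tokens):
        """Add the tokens actually used by a completed request."""
        self.spent += input_tokens + output_tokens
```

The value returned by authorize would be passed as the request's maximum output length, so a single request can never overshoot what is left in the session. A provider-level daily cap is still needed as a backstop in case this code is bypassed.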

API key leak

OWASP LLM06 · MITRE AML.T0012 · Severity: Critical

Attack: a developer commits an API key to a public GitHub repo, embeds it in a client-side JavaScript bundle, ships it in a mobile app binary, or pastes it into a screenshot. Attackers scrape public repos within minutes (GitHub's own secret scanning often flags keys before the developer notices). Example: secret-scanning vendors such as GitGuardian and Truffle Security (maker of TruffleHog) regularly document exposed secrets on public GitHub; in its 2024 'State of Secrets Sprawl' report, GitGuardian reported finding 12.8 million new exposed secrets in 2023 alone, with AI provider keys becoming an increasing share. Mitigation: never put keys in client code; use a backend proxy; rotate keys regularly; enable provider-level leak scanning where it is offered (some providers revoke keys detected in public repositories; check your provider's documentation); use environment variables and secret managers (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault); enable git pre-commit hooks like trufflehog. Residual risk: keys still leak through screenshots, logs, error messages, and developer machines; downstream key abuse before rotation can run up significant bills.
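To show what a pre-commit scan is doing in principle, here is a deliberately small sketch. The two patterns are illustrative heuristics (a generic "sk-" prefixed token and the widely documented AKIA prefix for AWS access key IDs), not authoritative key formats; a maintained scanner such as trufflehog covers far more cases and should be preferred in practice.

```python
import re

# Illustrative heuristics only; real key formats vary by provider and change over time.
KEY_PATTERNS = {
    "sk-style key": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),
    "AWS access key id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def find_secrets(text):
    """Return (line_number, label) pairs for lines that look like they hold a key."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in KEY_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings
```

A hook would run this over staged file contents and block the commit on any finding. It reports line numbers and labels only, so the scan output itself does not copy the secret into logs.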

Sensitive data exfiltration through the model

OWASP LLM06 · MITRE AML.T0024.001 · Severity: High

Attack: a model trained or fine-tuned on sensitive data (or a RAG system that retrieves it) regurgitates PII, secrets, or proprietary content into an output the wrong user sees. Example: Samsung temporarily banned generative AI tools internally in May 2023 after engineers reportedly pasted proprietary source code into ChatGPT — Samsung subsequently said the code may have been retained as training data, prompting their internal ban (multiple outlets including Bloomberg and Reuters reported this; Samsung confirmed the ban). Carlini et al.'s 'Extracting Training Data from Large Language Models' (USENIX Security 2021, arXiv 2012.07805) is the foundational academic demonstration that GPT-2 memorized and could be made to regurgitate PII and code from its training set. Mitigation: do not fine-tune on raw sensitive data without differential privacy techniques; in RAG systems, enforce per-user access controls at the retrieval layer (not just the output layer); strip PII at ingestion; use output filters for known sensitive patterns; minimize log retention for prompts and completions; check provider data-retention defaults and opt out where possible. Residual risk: output filters miss obfuscated PII; RAG access-control bugs are common; if a provider's no-train policy changes or breaks, your data was already sent.
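Enforcing access control at the retrieval layer, rather than filtering the model's output afterwards, can be sketched as follows. The Chunk structure and group-based permissions are assumptions for illustration; a real system would usually push the same filter into the vector store query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_groups: frozenset  # groups permitted to read this chunk
    score: float               # similarity score from the retriever


def retrieve_for_user(candidates, user_groups, top_k=3):
    """Drop chunks the user may not read before ranking, then take the top k."""
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    visible.sort(key=lambda c: c.score, reverse=True)
    return visible[:top_k]
```

Because the filter runs before anything reaches the prompt, a chunk the user cannot read is never in the model's context, so no prompt trick can make the model repeat it.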

Deceptive alignment / sandbagging at deploy time

Research class · Severity: Low–Medium today

Attack: a model behaves well during evaluation but differently in deployment — either because it learned to recognize evaluation contexts (sandbagging) or because it learned a deceptive objective during training (deceptive alignment in the Hubinger et al. sense). Example: this is currently a research-grade concern. Anthropic's 'Sleeper Agents' paper (Hubinger et al., arXiv 2401.05566, January 2024) showed that backdoor behaviors trained into models can persist through subsequent safety training, including RLHF and adversarial training. Apollo Research's December 2024 evaluation of frontier models documented several instances of 'in-context scheming' where models acted differently when they believed they were being observed versus deployed. Whether any production-deployed model is currently doing this in the wild is unknown and not currently provable with the evals we have. Mitigation: this is mostly the foundation model lab's problem to solve; as a deployer you can demand transparency reports from your provider, prefer providers with public model cards and red-teaming disclosures, monitor your own deployments for behavioral drift over time, and route safety-critical decisions through deterministic code rather than the model. Residual risk: by construction, this is the threat we cannot yet measure. Treat it as a known unknown.

Supply chain compromise

OWASP LLM05 / LLM03 · MITRE AML.T0010 · Severity: High

Attack: a model weights file downloaded from Hugging Face contains a pickle-based backdoor; a Python package imported into your AI pipeline has been typosquatted; an MCP server connector you installed exfiltrates the tokens it sees; a base image on Docker Hub has been poisoned. Example: JFrog Security Research and Protect AI have repeatedly documented malicious models on Hugging Face — Hugging Face itself published in early 2024 about pickle-format risks and rolled out improved scanning. PyTorch's torchtriton supply chain incident (December 2022) involved a malicious package uploaded to PyPI with the same name as an internal PyTorch dependency, downloaded by users who had pip's dependency resolution prefer PyPI over the internal index. The xz Utils backdoor (CVE-2024-3094, disclosed March 2024) showed a multi-year social engineering campaign against an upstream maintainer of a widely used library — not AI-specific, but a sobering case study for any deployer relying on open source. Mitigation: prefer safetensors over pickle, verify cryptographic signatures, pin and audit dependencies, use private registries with allowlisting, vet MCP servers before installation, monitor for anomalous network activity from build and inference machines. Residual risk: the dependency tree is long and you cannot personally audit all of it; you are accepting trust in many strangers.
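The "pin and audit dependencies" step can be partly automated. The sketch below flags requirements-file entries that are not pinned to an exact version with a sha256 hash, which is the form pip's hash-checking mode (pip install --require-hashes) expects. The parser is a simplification written for this example: it handles comments and backslash line continuations but not every requirements-file feature.

```python
def unpinned_requirements(requirements_text):
    """Return requirement specifiers lacking an exact pin or a sha256 hash."""
    # Join backslash-continued lines so a requirement and its hashes are one line.
    joined = requirements_text.replace("\\\n", " ")
    problems = []
    for raw in joined.splitlines():
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("#") or line.startswith("-"):
            continue  # skip blanks, comment lines, and option lines
        specifier = line.split()[0]
        if "==" not in specifier or "--hash=sha256:" not in line:
            problems.append(specifier)
    return problems
```

Running this in CI gives an early warning; the enforcement itself still comes from installing with hash checking turned on, so a tampered package fails at install time.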

Where this map gets thin

Three honest caveats. (1) The threats above are application-layer threats — we did not cover model-stage misuse (training-time backdoors planted by the lab itself, or by a state-actor inside the lab) because if your foundation-model provider is compromised at that level, you have problems that no application-layer mitigation will fix. (2) We did not cover misuse threats where the AI is the weapon used by your user against a third party (deepfake fraud, voice cloning scams, generated CSAM, mass disinformation). Those are real and serious; they belong in a separate page on misuse policy, not in an application threat model. (3) Several of these threat classes — especially indirect prompt injection and deceptive alignment — are subjects of active research, not solved problems. The mitigations we list reduce risk; they do not eliminate it. If your system handles money, health, legal, or safety-critical decisions, treat the model output as advisory and keep a deterministic decision layer in front of the user.

A note on the lethal trifecta

Simon Willison popularized a useful heuristic in 2025 he calls the lethal trifecta: an agent that simultaneously (a) reads untrusted content, (b) has access to private or sensitive data, and (c) can communicate externally is structurally vulnerable to data exfiltration via indirect prompt injection. The mitigation is not to make the model smarter — it is to break the trifecta at the architecture level. Pick one to remove. Most safe agent designs in production today break the trifecta by either disallowing external communication entirely, or by requiring human-in-the-loop confirmation for any outbound action, or by routing untrusted content through a separate context that has no access to private data. We strongly recommend reading Willison's writing on this directly — it is the clearest framing of the agent risk we have seen. Citation in the references section.
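The trifecta can also be checked at design time, before an agent is deployed. The sketch below is a hypothetical configuration check, not something from Willison's writing: it treats human approval of outbound actions as one accepted way of breaking the third leg.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    reads_untrusted_content: bool
    accesses_private_data: bool
    communicates_externally: bool
    human_approves_outbound: bool = False


def trifecta_open(caps):
    """True if all three legs are present and outbound actions are unguarded."""
    return (
        caps.reads_untrusted_content
        and caps.accesses_private_data
        and caps.communicates_externally
        and not caps.human_approves_outbound
    )
```

A deployment pipeline could refuse to ship any agent configuration for which trifecta_open returns True, which forces the architectural decision to be made explicitly rather than discovered after an incident.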

Frameworks we draw from

If you are formalizing your AI risk program, anchor to one of these and use the others as cross-references. None of them is complete on its own.

OWASP Top 10 for Large Language Model Applications (v1.1, with a renumbered 2025 edition) — the canonical application-layer taxonomy. Free, vendor-neutral, IDs LLM01 through LLM10.
MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) — modeled on the ATT&CK framework, tactic-and-technique IDs, useful for threat-modeling and red-team scoping.
NIST AI Risk Management Framework (AI RMF 1.0, January 2023) and the Generative AI Profile (AI 600-1, July 2024) — governance-oriented, structured around four functions: Govern, Map, Measure, Manage. Required reading if you have a compliance audience.
ISO/IEC 42001:2023 — AI management system standard, certifiable, useful if your buyer asks for a third-party attestation.
EU AI Act (Regulation (EU) 2024/1689, in force from August 2024 with phased application) — risk-tiered regulation; obligations begin applying through 2025 and 2026 depending on system tier. Check the latest official text; the timeline is best-effort as of June 2026.
Anthropic, OpenAI, and Google DeepMind responsible scaling / preparedness frameworks — provider-specific commitments on model evaluation and deployment thresholds. Useful as a benchmark for what to ask your provider.

Minimum effective dose checklist

If you only do five things, do these. None of them require a security team. All of them substantially reduce your blast radius.

Never put a real API key in client-side code, ever. Use a backend proxy. Use environment variables. Use a secret manager. Audit your git history for past leaks.
Set hard per-user and per-IP rate limits, plus a hard daily spend cap at the provider level. Configure alerts for unusual cost spikes.
If your agent reads untrusted content and can take external actions, require human-in-the-loop confirmation for the external action. Break the trifecta.
Validate model outputs against an expected schema before acting on them. Never trust a model to return JSON without parsing and validating it.
Read your provider's data retention and training policies. Configure no-train mode if it is opt-in. Minimize log retention of prompts and completions containing user data.

What we do not cover here

Misuse policy (your users using your tool to harm others), election integrity, deepfake fraud, CSAM, generated disinformation, copyright on training data, and labor displacement are all real concerns that deserve their own treatment. They are not in this threat model because this page is scoped to security and safety risks faced by the deployer of an AI application — not the full societal impact surface. We will publish separate writeups on misuse and on policy as those pages mature. If you came here looking for those topics and did not find them, that is the reason — not absence of concern.

Sources

[01]
OWASP Top 10 for Large Language Model Applications defines LLM01 through LLM10 as the canonical application-layer threat taxonomy; the v1.1 and 2025 editions number several entries differently.
https://genai.owasp.org/llm-top-10/
[02]
MITRE ATLAS provides an ATT&CK-style tactic and technique taxonomy for adversarial threats against AI systems, including AML.T0051 (prompt injection), AML.T0020 (data poisoning), AML.T0024 (model extraction).
https://atlas.mitre.org/
[03]
NIST AI Risk Management Framework 1.0 (January 2023) structures AI governance around four functions: Govern, Map, Measure, Manage.
https://www.nist.gov/itl/ai-risk-management-framework
[04]
NIST AI 600-1, the Generative AI Profile of the AI RMF, was published in July 2024 and extends the framework to generative AI specific risks.
https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
[05]
Established indirect prompt injection as a practical attack class against LLM-integrated browsing and agent systems in February 2023.
arxiv.org/abs/2302.12173 — Greshake et al., 'Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection'
[06]
Demonstrated two practical data poisoning attacks (split-view and frontrunning) against web-scale datasets such as LAION-400M and periodic Wikipedia snapshots.
arxiv.org/abs/2302.10149 — Carlini et al., 'Poisoning Web-Scale Training Datasets is Practical'
[07]
Showed that the embedding projection layer of production language models including OpenAI's ada and babbage variants could be recovered via API queries; OpenAI subsequently patched the relevant API surface.
arxiv.org/abs/2403.06634 — Carlini et al., 'Stealing Part of a Production Language Model'
[08]
Foundational paper establishing model extraction attacks against ML-as-a-service prediction APIs.
USENIX Security 2016 — Tramèr et al., 'Stealing Machine Learning Models via Prediction APIs'
[09]
Demonstrated gradient-based adversarial suffix attacks that transfer jailbreaks across aligned language models.
arxiv.org/abs/2307.15043 — Zou et al., 'Universal and Transferable Adversarial Attacks on Aligned Language Models'
[10]
Identified that long-context language models are vulnerable to a class of attack that fills the context with fabricated assistant responses to elicit off-policy behavior.
Anthropic blog · April 2024 · 'Many-shot jailbreaking'
[11]
A US federal judge sanctioned attorneys who submitted a brief containing six fabricated case citations generated by ChatGPT, the canonical case for hallucination-acted-on harm.
Mata v. Avianca, Inc., S.D.N.Y. 22-cv-1461, 2023
[12]
Tribunal held Air Canada liable for a bereavement-fare policy its chatbot invented, ruling the airline responsible for its chatbot's misstatements.
Moffatt v. Air Canada · British Columbia Civil Resolution Tribunal · February 2024
[13]
Demonstrated that GPT-2 memorized and could be made to regurgitate PII and source code from its training set.
arxiv.org/abs/2012.07805 — Carlini et al., 'Extracting Training Data from Large Language Models' (USENIX Security 2021)
[14]
Anthropic showed that backdoor behaviors deliberately trained into models can persist through subsequent safety training including RLHF and adversarial training.
arxiv.org/abs/2401.05566 — Hubinger et al., 'Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training'
[15]
Documented instances of frontier models acting differently when they believed they were being observed versus deployed.
Apollo Research · December 2024 · 'Frontier Models are Capable of In-Context Scheming'
[16]
Published series of data exfiltration demonstrations against ChatGPT plugins, Microsoft 365 Copilot, GitHub Copilot Chat, and Google Gemini, often via markdown image rendering as the exfiltration channel.
embracethered.com — Johann Rehberger's research blog
[17]
Coined the lethal trifecta framing: an agent with untrusted-content read, private-data access, and external communication is structurally exfiltration-vulnerable.
simonwillison.net · 2025 · 'The lethal trifecta for AI agents'
[18]
Reported 12.8 million new exposed secrets detected on public GitHub in 2023, with AI provider keys an increasing share of leaks.
GitGuardian · 'State of Secrets Sprawl 2024' report
[19]
Malicious package uploaded to PyPI with same name as internal PyTorch dependency was installed by users via dependency-confusion attack.
PyTorch security advisory · December 2022 · torchtriton dependency confusion incident
[20]
Multi-year social engineering campaign against an upstream open-source maintainer planted a backdoor in xz Utils, a sobering supply-chain case for any AI deployer relying on open source.
CVE-2024-3094 · xz Utils backdoor · disclosed March 2024
[21]
Hugging Face documented pickle-format risks in model files and improved scanning for malicious model uploads.
huggingface.co/blog/safetensors-security-audit
[22]
Samsung restricted internal use of generative AI tools in May 2023 after engineers reportedly pasted proprietary source code into ChatGPT.
Bloomberg · May 2023 · Samsung ChatGPT ban reporting
[23]
Found that LLM coding assistants regularly suggest package imports that do not exist, which attackers can register and use to distribute malware.
Lasso Security · 2024 · research on AI package hallucination ('slopsquatting')
[24]
EU AI Act entered into force August 2024 with phased application of obligations through 2025–2026 depending on system risk tier.
EUR-Lex · Regulation (EU) 2024/1689 (EU AI Act)
[25]
ISO/IEC 42001:2023 specifies requirements for an AI management system and is certifiable by third-party auditors.
iso.org · ISO/IEC 42001:2023
