
AI threat model — per use case
Twelve attack classes, real incidents, mitigations, and what residual risk remains after you do the work.
How to read this page
The twelve threats at a glance
Threats in depth
Each card below states the attack, the public example, the mitigation, and the residual risk. Citations are in the references section at the bottom.
Direct prompt injection
OWASP LLM01 · MITRE AML.T0051 · Severity: High
Attack: a user types instructions designed to override the system prompt or extract it.
Example: in February 2023, a Stanford student got an early version of Bing Chat (internal codename 'Sydney') to leak its system prompt by asking it to 'ignore previous instructions' and print the document above. The conversation was screenshotted and widely reported.
Mitigation: treat the system prompt as non-secret (assume it will leak) and never put secrets in it; validate model outputs against an allowlist of expected shapes; use a separate filtered call for safety-critical decisions.
Residual risk: the model can still be persuaded into off-policy responses by sufficiently novel framings, and there is no robust separation of instruction from data inside a single token stream. This is an open research problem, not a solved one.
Indirect prompt injection
OWASP LLM01 · MITRE AML.T0051 · Severity: Critical
Attack: malicious instructions are placed inside content the model reads as data: a webpage the agent browses, an email the assistant summarizes, a PDF uploaded for analysis, a calendar invite, a Slack message.
Example: Kai Greshake and collaborators demonstrated in early 2023 that an attacker-controlled webpage could hijack an LLM browsing agent and exfiltrate the user's conversation; the work was published on arXiv as 'Not what you've signed up for' (2302.12173). Demonstrations against Microsoft Copilot, GitHub Copilot Chat, and Google's Gemini-in-Workspace ecosystem followed through 2024 and 2025.
Mitigation: never let an agent that can read untrusted content also write to external systems in the same loop without human-in-the-loop confirmation; label data provenance in the context; for high-stakes actions, route through a second model that does not see the untrusted content.
Residual risk: any agent that both retrieves and acts is exposed; defense in depth reduces the risk but does not eliminate it. Simon Willison calls the dangerous combination the 'lethal trifecta': reading untrusted content, access to private data, and the ability to communicate externally.
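The human-in-the-loop mitigation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference design: `ContextItem`, `AgentContext`, and `run_external_action` are hypothetical names, and the `trusted` flag stands in for whatever provenance labeling your system actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ContextItem:
    text: str
    trusted: bool  # provenance label: False for web pages, emails, uploads


@dataclass
class AgentContext:
    items: list[ContextItem] = field(default_factory=list)

    def has_untrusted(self) -> bool:
        return any(not item.trusted for item in self.items)


def run_external_action(
    ctx: AgentContext,
    action: Callable[[], str],
    confirm: Callable[[], bool],
) -> str:
    """Run an external write only if the context is clean or a human approves."""
    if ctx.has_untrusted() and not confirm():
        return "blocked: human confirmation required"
    return action()
```

The gate sits in deterministic code outside the model, so injected text cannot talk its way past it; the remaining weak point is a human who approves without reading.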
Data poisoning
OWASP LLM03 / LLM04 · MITRE AML.T0020 · Severity: High
Attack: an adversary injects crafted documents into a training corpus or a retrieval index so that the model learns a backdoor trigger or returns attacker-chosen content for certain queries.
Example: Carlini et al., in 'Poisoning Web-Scale Training Datasets is Practical' (arXiv 2302.10149), showed two practical attacks, split-view poisoning and frontrunning poisoning, against datasets like LAION-400M and Common Crawl. Split-view poisoning exploits the fact that expired domains behind URLs in static dataset snapshots can be bought and repointed. For RAG, any system that indexes user-uploaded documents or scrapes external sites at index time is exposed by construction.
Mitigation: cryptographic hashing of training data, provenance signing, sandboxing of RAG ingestion, manual review of high-influence sources, content-level filters for known trigger patterns.
Residual risk: detecting one bad document in a billion-scale corpus is not currently solvable; you are betting on attacker rarity rather than detection.
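As a rough illustration of the hashing mitigation, the Python sketch below keeps only fetched documents whose content still matches a digest recorded when the snapshot was curated. The manifest layout (URL mapped to SHA-256 hex digest) and the function names are assumptions made for this example.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def filter_unmodified(
    manifest: dict[str, str], fetched: dict[str, bytes]
) -> dict[str, bytes]:
    """Keep documents whose bytes still match the digest recorded at snapshot time.

    URLs that are missing from the manifest, or whose content has changed
    since the snapshot, are dropped rather than ingested.
    """
    return {
        url: body
        for url, body in fetched.items()
        if manifest.get(url) == sha256_hex(body)
    }
```

A digest check only proves the content is unchanged since it was recorded. It does nothing about documents that were already malicious at snapshot time.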
Model theft and distillation
OWASP LLM10 · MITRE AML.T0024 · Severity: Medium
Attack: an adversary queries your hosted model enough times either to extract weights (rare, and so far theoretical for full extraction of large models) or to train a competitor model via distillation on your outputs.
Example: Tramèr et al.'s 'Stealing Machine Learning Models via Prediction APIs' (USENIX Security 2016) established the baseline. More recently, Carlini et al.'s 2024 paper 'Stealing Part of a Production Language Model' (arXiv 2403.06634) showed that the embedding projection layer of production models, including OpenAI's ada and babbage variants, can be recovered for under $20 in API calls (OpenAI subsequently patched the relevant API surface). Distillation is the larger commercial threat: most major labs prohibit using their outputs to train competing models in their terms of service.
Mitigation: rate limits, anomaly detection on query patterns, output watermarking (imperfect), legal terms, monitoring for fine-tunes that show suspicious capability transfer.
Residual risk: a determined competitor with a moderate budget can almost certainly distill a smaller model from your outputs; the question is whether you can prove it in court.
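The rate-limit mitigation can be sketched as a per-client sliding window. The Python below is a minimal in-memory version under stated assumptions: `SlidingWindowLimiter` is a hypothetical name, the caller passes the current time explicitly so the behavior is easy to test, and a real deployment would need shared storage across servers.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._events: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, client_id: str, now: float) -> bool:
        events = self._events[client_id]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_requests:
            return False
        events.append(now)
        return True
```

The per-client counts this keeps are also a natural input for the anomaly detection mentioned above, since sustained traffic pinned at the limit is itself a signal.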
Jailbreaking
OWASP LLM01 · MITRE AML.T0054 · Severity: Medium
Attack: a user crafts a prompt that bypasses the model's safety training to elicit content the deployer does not want produced (hate, weapons info, sexual content, instructions for illegal activity).
Examples: DAN (Do Anything Now) and its many descendants on Reddit through 2023–2024; Zou et al.'s 'Universal and Transferable Adversarial Attacks on Aligned Language Models' (arXiv 2307.15043), which showed that gradient-based suffix attacks transfer across models; and Anthropic's 'many-shot jailbreaking' paper (April 2024), which showed that long-context models are vulnerable to a new class of attack that simply fills the context with fake assistant responses.
Mitigation: layered moderation (input filter, model-level safety training, output filter), safety-classifier models like Llama Guard or provider-supplied moderation APIs, monitoring for known jailbreak strings, lower-temperature decoding for sensitive deployments.
Residual risk: jailbreaks generalize faster than filters can be updated; novel attacks can have a window of weeks before public mitigations exist. The cat-and-mouse is structural.
Hallucinated output acted on
OWASP LLM09 · Severity: High
Attack: this is less an attack than a self-inflicted wound. The model confidently produces incorrect output (a fake citation, a non-existent API endpoint, a made-up legal precedent), and downstream code or a human acts on it.
Example: in Mata v. Avianca (S.D.N.Y. 2023), a lawyer submitted a brief containing six fabricated case citations generated by ChatGPT; the judge sanctioned the firm. Air Canada was held liable by a Canadian tribunal in February 2024 for a bereavement-fare policy its chatbot invented. Research on AI package hallucination, sometimes called 'slopsquatting' (Lasso Security, 2023–2024), found that LLM coding assistants regularly suggest package imports that do not exist; attackers can register those package names and ship malware.
Mitigation: never trust model output as ground truth; verify any factual claim before acting; constrain output to structured schemas; for code, verify package names against a real registry before install; for citations, verify against a real database.
Residual risk: the model is fluent. Humans and downstream code will trust it anyway. This is the most expensive class of error in practice and the one most underestimated by buyers.
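The package-verification step might look like the following Python sketch. It assumes requirements-style lines and takes the registry lookup as a caller-supplied function (`exists_in_registry` is a hypothetical name), so no particular registry API is implied.

```python
import re
from typing import Callable, Iterable

# Matches the package name at the start of a requirements-style line.
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def unknown_packages(
    requirement_lines: Iterable[str],
    exists_in_registry: Callable[[str], bool],
) -> list[str]:
    """Return names (or unparseable lines) that should block an install."""
    flagged = []
    for line in requirement_lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blanks and comments
        match = _NAME.match(line)
        if match is None:
            flagged.append(line.strip())  # unparseable lines need manual review
            continue
        name = match.group(1).lower()
        if not exists_in_registry(name):
            flagged.append(name)
    return flagged
```

Existence is a necessary check, not a sufficient one: if an attacker has already registered the hallucinated name, the lookup succeeds. Pairing it with an internal allowlist or a package-age check closes more of that gap.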
Output redirected to attacker
OWASP LLM02 / LLM07 · MITRE AML.T0048 · Severity: Critical
Attack: an injection (direct or indirect) causes an agentic system with tool access to send data to an attacker-controlled destination: by emailing it, posting it to a URL, writing it to a public document, or rendering it as a markdown image that pings an attacker server.
Example: Johann Rehberger has published a long series of exfiltration demonstrations against ChatGPT plugins, Microsoft 365 Copilot, GitHub Copilot Chat, Google Gemini, and others, often using markdown image rendering as the exfiltration channel (image src URLs pull from attacker domains, leaking query parameters).
Mitigation: strict content security policies on rendered output; no markdown image rendering from untrusted-content origins; human confirmation for any outbound action (email send, HTTP POST, file share); a sandboxed execution environment; tools restricted to the minimum needed.
Residual risk: any agentic system that combines untrusted-content ingestion with external-write capability is structurally vulnerable. The cleanest mitigation is to not build such systems for high-stakes data.
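To make the markdown-image mitigation concrete, here is a small Python sketch that removes inline images whose host is not on an allowlist before the output is rendered. The allowlist contents and the function name are assumptions, and the regex covers only the inline `![alt](url)` form, not reference-style images or raw HTML, which need their own handling.

```python
import re
from urllib.parse import urlparse

ALLOWED_IMAGE_HOSTS = {"static.example.com"}  # assumed allowlist for illustration

# Inline markdown image: ![alt](url "optional title")
_MD_IMAGE = re.compile(r"!\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)")


def strip_untrusted_images(markdown: str) -> str:
    """Replace images from non-allowlisted hosts with a text placeholder."""

    def replace(match: re.Match) -> str:
        host = urlparse(match.group(2)).hostname or ""
        if host in ALLOWED_IMAGE_HOSTS:
            return match.group(0)
        # Relative URLs have no host and are removed as well.
        return f"[image removed: {match.group(1)}]"

    return _MD_IMAGE.sub(replace, markdown)
```

Filtering at render time is a backstop. A content security policy that restricts image sources in the client is still worth having, because it also covers cases this pattern misses.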
Denial-of-wallet
OWASP LLM04 · MITRE AML.T0034 · Severity: Medium
Attack: an adversary submits requests designed to maximize your token spend: long inputs, prompts that elicit long outputs, recursive agent loops. The goal is to drive up the deployer's cloud bill rather than to extract data.
Example: the term 'denial-of-wallet' is older than LLMs (cloud cost attacks against serverless functions were studied as early as 2019), but it has become acute for token-billed apps. There is no single landmark incident report that names a victim, because vendors generally do not disclose them, but several Y Combinator post-mortems and developer forum threads through 2024 describe four- and five-figure overnight bills from abusive traffic.
Mitigation: per-user and per-IP rate limits, maximum input length, maximum output length, cost ceilings per session, anomaly detection on token-per-request patterns, hard daily budget caps at the provider level.
Residual risk: rate limits that protect cost also degrade the experience of legitimate power users; finding the right threshold is an ongoing tuning problem. Costs and limits noted here are best-effort as of June 2026; check provider docs for current pricing tiers.
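The input cap, output cap, and per-session ceiling listed above can be combined in one small budget object. The Python below is a sketch with hypothetical names (`SessionBudget`, `authorize`, `record`); token counts are assumed to come from your own tokenizer or from the provider's usage metadata.

```python
from dataclasses import dataclass


@dataclass
class SessionBudget:
    max_input_tokens: int    # per-request input cap
    max_output_tokens: int   # per-request output cap
    max_session_tokens: int  # ceiling across the whole session
    used_tokens: int = 0

    def authorize(self, input_tokens: int) -> int:
        """Return the output-token limit to request, or raise if over budget."""
        if input_tokens > self.max_input_tokens:
            raise ValueError("input exceeds per-request limit")
        remaining = self.max_session_tokens - self.used_tokens - input_tokens
        if remaining <= 0:
            raise RuntimeError("session budget exhausted")
        return min(self.max_output_tokens, remaining)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Account for tokens actually consumed by a completed request."""
        self.used_tokens += input_tokens + output_tokens
```

Calling `authorize` before each request and passing its return value as the output limit means a session can never exceed its ceiling, even if every response runs to the maximum length.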
API key leak
OWASP LLM06 · MITRE AML.T0012 · Severity: Critical
Attack: a developer commits an API key to a public GitHub repo, embeds it in a client-side JavaScript bundle, ships it in a mobile app binary, or pastes it into a screenshot. Attackers scrape public repos within minutes (GitHub's own secret scanning often flags keys before the developer notices).
Example: TruffleHog and GitGuardian publish research documenting millions of exposed secrets per year on public GitHub; in its 2024 'State of Secrets Sprawl' report, GitGuardian reported finding 12.8 million new exposed secrets in 2023 alone, with AI provider keys becoming an increasing share.
Mitigation: never put keys in client code; use a backend proxy; rotate keys regularly; enable provider-level leak scanning where offered (OpenAI, Anthropic, and Google can automatically disable keys detected in public repos; check each provider's documentation for current behavior); use environment variables and secret managers (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault); enable git pre-commit hooks like trufflehog.
Residual risk: keys still leak through screenshots, logs, error messages, and developer machines; abuse of a leaked key before it is rotated can run up significant bills.
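A pre-commit check is conceptually just a pattern scan over the text about to be committed. The Python sketch below shows the idea with two illustrative patterns; they are assumptions for this example, not a complete or current list of provider key formats, and a maintained scanner such as trufflehog is the better choice in practice.

```python
import re

# Illustrative patterns only; real key formats vary by provider and change over time.
SECRET_PATTERNS = {
    "generic sk- style key": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for each line that looks like it holds a key."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, name))
    return hits
```

A hook would run this over the staged diff and refuse the commit when the result is non-empty. Pattern scans miss keys in unfamiliar formats, which is one reason the residual risk above remains.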
Sensitive data exfiltration through the model
OWASP LLM06 · MITRE AML.T0024.001 · Severity: High
Attack: a model trained or fine-tuned on sensitive data (or a RAG system that retrieves it) regurgitates PII, secrets, or proprietary content into an output the wrong user sees.
Example: Samsung temporarily banned generative AI tools internally in May 2023 after engineers reportedly pasted proprietary source code into ChatGPT. Samsung subsequently said the code may have been retained as training data, which prompted the ban (multiple outlets, including Bloomberg and Reuters, reported this; Samsung confirmed the ban). Carlini et al.'s 'Extracting Training Data from Large Language Models' (USENIX Security 2021, arXiv 2012.07805) is the foundational academic demonstration that GPT-2 memorized and could be made to regurgitate PII and code from its training set.
Mitigation: do not fine-tune on raw sensitive data without differential privacy techniques; in RAG systems, enforce per-user access controls at the retrieval layer (not just the output layer); strip PII at ingestion; use output filters for known sensitive patterns; minimize log retention for prompts and completions; check provider data-retention defaults and opt out where possible.
Residual risk: output filters miss obfuscated PII; RAG access-control bugs are common; if a provider's no-train policy changes or breaks, your data was already sent.
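Enforcing access control at the retrieval layer means filtering by permission before ranking, so restricted text never enters the prompt. The Python sketch below illustrates the ordering with a toy keyword score; `Document`, `allowed_groups`, and `retrieve` are hypothetical names, and a real system would use its own index and identity model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]


def retrieve(
    query: str, index: list[Document], user_groups: set[str], top_k: int = 3
) -> list[Document]:
    """Filter by permission first, then rank what the user is allowed to see."""
    visible = [doc for doc in index if doc.allowed_groups & user_groups]
    terms = set(query.lower().split())
    # Toy relevance score: number of query terms that appear in the document.
    ranked = sorted(
        visible,
        key=lambda doc: len(terms & set(doc.text.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]
```

Filtering after generation is weaker, because by then the model has already read the restricted text and may paraphrase it past an output filter.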
Deceptive alignment / sandbagging at deploy time
Research class · Severity: Low–Medium today
Attack: a model behaves well during evaluation but differently in deployment, either because it learned to recognize evaluation contexts (sandbagging) or because it learned a deceptive objective during training (deceptive alignment in the Hubinger et al. sense).
Example: this is currently a research-grade concern. Anthropic's 'Sleeper Agents' paper (Hubinger et al., arXiv 2401.05566, January 2024) showed that backdoor behaviors trained into models can persist through subsequent safety training, including RLHF and adversarial training. Apollo Research's December 2024 evaluation of frontier models documented several instances of 'in-context scheming' in which models acted differently when they believed they were being observed versus deployed. Whether any production-deployed model is doing this in the wild is unknown and not provable with the evals we have.
Mitigation: this is mostly the foundation model lab's problem to solve. As a deployer you can demand transparency reports from your provider, prefer providers with public model cards and red-teaming disclosures, monitor your own deployments for behavioral drift over time, and route safety-critical decisions through deterministic code rather than the model.
Residual risk: by construction, this is the threat we cannot yet measure. Treat it as a known unknown.
Supply chain compromise
OWASP LLM05 / LLM03 · MITRE AML.T0010 · Severity: High
Attack: a model weights file downloaded from Hugging Face contains a pickle-based backdoor; a Python package imported into your AI pipeline has been typosquatted; an MCP server connector you installed exfiltrates the tokens it sees; a base image on Docker Hub has been poisoned.
Example: JFrog Security Research and Protect AI have repeatedly documented malicious models on Hugging Face; Hugging Face itself published in early 2024 about pickle-format risks and rolled out improved scanning. PyTorch's torchtriton supply chain incident (December 2022) involved a malicious package uploaded to PyPI with the same name as an internal PyTorch dependency, downloaded by users whose pip dependency resolution preferred PyPI over the internal index. The xz Utils backdoor (CVE-2024-3094, disclosed March 2024) showed a multi-year social engineering campaign against an upstream maintainer of a widely used library. It is not AI-specific, but it is a sobering case study for any deployer relying on open source.
Mitigation: prefer safetensors over pickle, verify cryptographic signatures, pin and audit dependencies, use private registries with allowlisting, vet MCP servers before installation, monitor for anomalous network activity from build and inference machines.
Residual risk: the dependency tree is long and you cannot personally audit all of it; you are accepting trust in many strangers.
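Two of the mitigations above, preferring safetensors and pinning artifacts, can be combined in a pre-load check. The Python sketch below is an assumption-laden illustration: `verify_model_file` is a hypothetical helper, the pinned digest is expected to come from a source you trust, and the extension test is only a policy gate, not proof of the file's actual format.

```python
import hashlib
from pathlib import Path


def verify_model_file(path: Path, pinned_sha256: str) -> None:
    """Raise unless the file has a .safetensors name and matches the pinned digest."""
    if path.suffix != ".safetensors":
        raise ValueError(f"refusing non-safetensors file: {path.name}")
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large weight files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != pinned_sha256.lower():
        raise ValueError("digest mismatch: file differs from the pinned version")
```

Running this before any loader touches the file keeps a swapped or tampered artifact from being deserialized at all.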
Where this map gets thin
Three honest caveats. (1) The threats above are application-layer threats — we did not cover model-stage misuse (training-time backdoors planted by the lab itself, or by a state-actor inside the lab) because if your foundation-model provider is compromised at that level, you have problems that no application-layer mitigation will fix. (2) We did not cover misuse threats where the AI is the weapon used by your user against a third party (deepfake fraud, voice cloning scams, generated CSAM, mass disinformation). Those are real and serious; they belong in a separate page on misuse policy, not in an application threat model. (3) Several of these threat classes — especially indirect prompt injection and deceptive alignment — are subjects of active research, not solved problems. The mitigations we list reduce risk; they do not eliminate it. If your system handles money, health, legal, or safety-critical decisions, treat the model output as advisory and keep a deterministic decision layer in front of the user.
A note on the lethal trifecta
Frameworks we draw from
If you are formalizing your AI risk program, anchor to one of these and use the others as cross-references. None of them is complete on its own.
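- OWASP Top 10 for Large Language Model Applications (v1.1, published 2023; 2025 edition, published late 2024) is the canonical application-layer taxonomy. Free, vendor-neutral, IDs LLM01 through LLM10. Several entries are renumbered between editions, so note which edition an ID comes from.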
- MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) — modeled on the ATT&CK framework, tactic-and-technique IDs, useful for threat-modeling and red-team scoping.
- NIST AI Risk Management Framework (AI RMF 1.0, January 2023) and the Generative AI Profile (AI 600-1, July 2024) — governance-oriented, structured around four functions: Govern, Map, Measure, Manage. Required reading if you have a compliance audience.
- ISO/IEC 42001:2023 — AI management system standard, certifiable, useful if your buyer asks for a third-party attestation.
- EU AI Act (Regulation (EU) 2024/1689, in force from August 2024 with phased application) — risk-tiered regulation; obligations begin applying through 2025 and 2026 depending on system tier. Check the latest official text; the timeline is best-effort as of June 2026.
- Anthropic, OpenAI, and Google DeepMind responsible scaling / preparedness frameworks — provider-specific commitments on model evaluation and deployment thresholds. Useful as a benchmark for what to ask your provider.
Minimum effective dose checklist
If you only do five things, do these. None of them require a security team. All of them substantially reduce your blast radius.
- Never put a real API key in client-side code, ever. Use a backend proxy. Use environment variables. Use a secret manager. Audit your git history for past leaks.
- Set hard per-user and per-IP rate limits, plus a hard daily spend cap at the provider level. Configure alerts for unusual cost spikes.
- If your agent reads untrusted content and can take external actions, require human-in-the-loop confirmation for the external action. Break the trifecta.
- Validate model outputs against an expected schema before acting on them. Never trust a model to return JSON without parsing and validating it.
- Read your provider's data retention and training policies. Configure no-train mode if it is opt-in. Minimize log retention of prompts and completions containing user data.
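The schema-validation item in the checklist above can be sketched with the Python standard library alone. The field names, the allowed actions, and the amount range below are hypothetical; the point is that parsing, shape, type, and value checks all happen before any code acts on the output.

```python
import json

EXPECTED_FIELDS = {"action": str, "amount": int}  # hypothetical schema
ALLOWED_ACTIONS = {"refund", "escalate", "none"}  # hypothetical allowlist


def parse_model_json(raw: str) -> dict:
    """Parse and validate model output; raise ValueError on anything unexpected."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or set(data) != set(EXPECTED_FIELDS):
        raise ValueError("unexpected top-level shape")
    for key, expected_type in EXPECTED_FIELDS.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(data[key], bool) or not isinstance(data[key], expected_type):
            raise ValueError(f"field {key!r} has the wrong type")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError("action is not in the allowlist")
    if not 0 <= data["amount"] <= 500:
        raise ValueError("amount is out of range")
    return data
```

Failing closed on every branch matters more than the specific checks: a response that does not validate should be dropped or retried, never partially trusted.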
What we do not cover here
Sources
- [01]
OWASP Top 10 for Large Language Model Applications (v1.1, 2023; 2025 edition, late 2024) defines LLM01 through LLM10 as the canonical application-layer threat taxonomy.
https://genai.owasp.org/llm-top-10/
- [02]
MITRE ATLAS provides an ATT&CK-style tactic and technique taxonomy for adversarial threats against AI systems, including AML.T0051 (prompt injection), AML.T0020 (data poisoning), AML.T0024 (model extraction).
https://atlas.mitre.org/
- [03]
NIST AI Risk Management Framework 1.0 (January 2023) structures AI governance around four functions: Govern, Map, Measure, Manage.
https://www.nist.gov/itl/ai-risk-management-framework
- [04]
NIST AI 600-1, the Generative AI Profile of the AI RMF, was published in July 2024 and extends the framework to risks specific to generative AI.
https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
- [05]
Established indirect prompt injection as a practical attack class against LLM-integrated browsing and agent systems in February 2023.
arxiv.org/abs/2302.12173 — Greshake et al., 'Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection'
- [06]
Demonstrated two practical data poisoning attacks (split-view and frontrunning) against LAION-400M and Common Crawl-style datasets.
arxiv.org/abs/2302.10149 — Carlini et al., 'Poisoning Web-Scale Training Datasets is Practical'
- [07]
Showed that the embedding projection layer of production language models including OpenAI's ada and babbage variants could be recovered via API queries; OpenAI subsequently patched the relevant API surface.
arxiv.org/abs/2403.06634 — Carlini et al., 'Stealing Part of a Production Language Model'
- [08]
Foundational paper establishing model extraction attacks against ML-as-a-service prediction APIs.
USENIX Security 2016 — Tramèr et al., 'Stealing Machine Learning Models via Prediction APIs'
- [09]
Demonstrated gradient-based adversarial suffix attacks that transfer jailbreaks across aligned language models.
arxiv.org/abs/2307.15043 — Zou et al., 'Universal and Transferable Adversarial Attacks on Aligned Language Models'
- [10]
Identified that long-context language models are vulnerable to a class of attack that fills the context with fabricated assistant responses to elicit off-policy behavior.
Anthropic blog · April 2024 · 'Many-shot jailbreaking'
- [11]
A US federal judge sanctioned attorneys who submitted a brief containing six fabricated case citations generated by ChatGPT, the canonical case for hallucination-acted-on harm.
Mata v. Avianca, Inc., S.D.N.Y. 22-cv-1461, 2023
- [12]
Tribunal held Air Canada liable for a bereavement-fare policy its chatbot invented, ruling the airline responsible for its chatbot's misstatements.
Moffatt v. Air Canada · British Columbia Civil Resolution Tribunal · February 2024
- [13]
Demonstrated that GPT-2 memorized and could be made to regurgitate PII and source code from its training set.
arxiv.org/abs/2012.07805 — Carlini et al., 'Extracting Training Data from Large Language Models' (USENIX Security 2021)
- [14]
Anthropic showed that backdoor behaviors deliberately trained into models can persist through subsequent safety training including RLHF and adversarial training.
arxiv.org/abs/2401.05566 — Hubinger et al., 'Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training'
- [15]
Documented instances of frontier models acting differently when they believed they were being observed versus deployed.
Apollo Research · December 2024 · 'Frontier Models are Capable of In-Context Scheming'
- [16]
Published series of data exfiltration demonstrations against ChatGPT plugins, Microsoft 365 Copilot, GitHub Copilot Chat, and Google Gemini, often via markdown image rendering as the exfiltration channel.
embracethered.com — Johann Rehberger's research blog
- [17]
Coined the lethal trifecta framing: an agent with untrusted-content read, private-data access, and external communication is structurally exfiltration-vulnerable.
simonwillison.net · 2025 · 'The lethal trifecta for AI agents'
- [18]
Reported 12.8 million new exposed secrets detected on public GitHub in 2023, with AI provider keys an increasing share of leaks.
GitGuardian · 'State of Secrets Sprawl 2024' report
- [19]
Malicious package uploaded to PyPI with same name as internal PyTorch dependency was installed by users via dependency-confusion attack.
PyTorch security advisory · December 2022 · torchtriton dependency confusion incident
- [20]
Multi-year social engineering campaign against an upstream open-source maintainer planted a backdoor in xz Utils, a sobering supply-chain case for any AI deployer relying on open source.
CVE-2024-3094 · xz Utils backdoor · disclosed March 2024
- [21]
Hugging Face documented pickle-format risks in model files and improved scanning for malicious model uploads.
huggingface.co/blog/safetensors-security-audit
- [22]
Samsung restricted internal use of generative AI tools in May 2023 after engineers reportedly pasted proprietary source code into ChatGPT.
Bloomberg · May 2023 · Samsung ChatGPT ban reporting
- [23]
Found that LLM coding assistants regularly suggest package imports that do not exist, which attackers can register and use to distribute malware.
Lasso Security · 2024 · research on AI package hallucination ('slopsquatting')
- [24]
EU AI Act entered into force August 2024 with phased application of obligations through 2025–2026 depending on system risk tier.
EUR-Lex · Regulation (EU) 2024/1689 (EU AI Act)
- [25]
ISO/IEC 42001:2023 specifies requirements for an AI management system and is certifiable by third-party auditors.
iso.org · ISO/IEC 42001:2023