::calculator · A rough, educational gauge of how likely a frontier LLM is to refuse your prompt.

Refusal Heuristic Estimator

Frontier LLMs in mid-2026, including Claude 3.5 Sonnet ($3/M input, $15/M output), GPT-4o ($2.50/M input, $10/M output), and Gemini 1.5 Pro ($1.25-$5/M input, $5-$15/M output), all ship with layered refusal behavior. There is the policy trained into the model weights, the system prompt the platform stacks on top, and the runtime classifiers that score your text before and after generation. None of these are publicly documented in detail, and the actual refusal surface drifts every few weeks as providers ship safety updates. This calculator does not measure any of that.

It is a heuristic: a simple two-variable estimator that lets you build intuition about why a given prompt might trip a refusal, without pretending to be a real classifier. Two signals dominate the public literature on refusal behavior: whether the prompt contains lexical triggers that pattern-match common policy categories (violence, self-harm, illicit instructions, regulated advice), and how strict the deployment's system prompt is about scope. Prompt length matters too, since longer prompts give classifiers more surface area to find something, but the effect is small compared to the other two, so we surface it as context only.

The output is a single percentage, refusal risk. Treat it as a directional reading, not a probability. A 60% reading means "this prompt has features that meaningfully raise the chance of refusal across most frontier models"; it does not mean "60 out of 100 generations will refuse." Real refusal is model-specific, version-specific, prompt-context-specific, and stochastic. A prompt that gets refused by Claude 3.5 Sonnet at temperature 0 with a strict system prompt may pass GPT-4o at temperature 0.7 with a permissive system prompt, and vice versa.

The intended use is educational: building developer intuition for why benign prompts sometimes refuse and adversarial prompts sometimes don't, and where to put effort if you're tuning a legitimate application's prompt to reduce false-refusal rates. It is not a jailbreak tool, an adversarial harness, or a substitute for the provider's own moderation API. If you need ground truth on whether a specific prompt will refuse, send it to the model and observe.

::inputs

Prompt length (chars)

Total characters in your prompt, including any few-shot examples. Context only — does not drive the score.

Contains trigger words

Lexical match against common policy-category keywords (violence, self-harm, illicit, regulated advice, identity).

System prompt strength

How tightly the deployment's system prompt constrains scope and topic.
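As a rough illustration of how the first two inputs could be derived from raw prompt text, here is a minimal sketch. The keyword sets, the `derive_inputs` helper, and the `matchedCategories` field are hypothetical placeholders, not the lists or logic this calculator (or any provider) actually uses; only the category names come from the description above.

```python
import re

# Hypothetical placeholder keywords, one small set per policy category.
# Real keyword lists are not public; these exist only to make the sketch run.
TRIGGER_KEYWORDS = {
    "violence": {"kill", "weapon"},
    "self-harm": {"overdose", "suicide"},
    "illicit": {"counterfeit", "exploit"},
    "regulated advice": {"diagnosis", "dosage"},
    "identity": {"ethnicity", "religion"},
}


def derive_inputs(prompt: str) -> dict:
    """Derive the length and trigger-word inputs from a raw prompt string."""
    # Lowercase and split into alphabetic words for a purely lexical match.
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    matched = sorted(
        category
        for category, keywords in TRIGGER_KEYWORDS.items()
        if words & keywords
    )
    return {
        "promptLengthChars": len(prompt),  # context only, never scored
        "triggerWords": bool(matched),
        "matchedCategories": matched,
    }


example = derive_inputs("How do I kill a zombie process on Linux?")
print(example)
```

The example prompt is an ordinary systems-administration question, yet it matches the placeholder "violence" set. That is the kind of benign false positive a purely lexical check produces.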

::result

Refusal risk

—

::how this calculates

The estimator adds two contributions. Trigger words add 50 points if present, 5 points if absent — reflecting that lexical pattern-matching dominates first-pass refusal classifiers. System prompt strength adds 5 (weak), 15 (standard), or 30 (strict) points, reflecting how much the deployment narrows acceptable output. Prompt length is captured as context for the user but does not enter the formula, because empirically length is a weak third-order signal next to lexical content and system-prompt scope. The two contributions sum to a single refusal risk percentage in the range 10-80.
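The arithmetic above can be sketched in a few lines. The point values are the ones stated in the paragraph; the function and constant names are illustrative assumptions, not the calculator's actual source.

```python
# Point values as described above; names are illustrative.
TRIGGER_POINTS = {True: 50, False: 5}
SYSTEM_PROMPT_POINTS = {"weak": 5, "standard": 15, "strict": 30}


def refusal_risk(trigger_words: bool, system_prompt_strength: str,
                 prompt_length_chars: int = 0) -> int:
    """Return the heuristic refusal-risk reading as a percentage (10-80)."""
    if system_prompt_strength not in SYSTEM_PROMPT_POINTS:
        raise ValueError(
            f"unknown system prompt strength: {system_prompt_strength!r}"
        )
    # prompt_length_chars is accepted for context only and deliberately unused.
    return (TRIGGER_POINTS[bool(trigger_words)]
            + SYSTEM_PROMPT_POINTS[system_prompt_strength])


print(refusal_risk(False, "weak"))    # floor of the estimator
print(refusal_risk(True, "strict"))   # ceiling of the estimator
```

Because both lookups are fixed tables, the floor (5 + 5) and ceiling (50 + 30) follow directly from the smallest and largest entries.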

::worked examples

Benign developer query on a permissive deployment

promptLengthChars: 240 · triggerWords: no · systemPromptStrength: weak

Short, neutral vocabulary, minimal system prompt. Lands at 10% — the floor of the estimator. This is the typical zone for everyday coding, summarization, and analysis prompts on a developer API key.

Neutral question on a narrowed enterprise deployment

promptLengthChars: 800 · triggerWords: no · systemPromptStrength: strict

No trigger words, but the system prompt aggressively narrows scope (e.g., 'only answer questions about our HR policies'). Lands at 35%. Most refusals in this zone are scope refusals ('I can only help with X'), not policy refusals.

Trigger-word prompt on a standard chat surface

promptLengthChars: 600 · triggerWords: yes · systemPromptStrength: standard

Lexical match against a policy category plus a default system prompt. Lands at 65%. This is the zone where benign prompts get caught (medical question, security-research context, fiction-writing) and where careful framing matters most.

Trigger-word prompt on a strict enterprise deployment

promptLengthChars: 1500 · triggerWords: yes · systemPromptStrength: strict

Both signals firing. Lands at 80% — the ceiling of the estimator. In practice this prompt will refuse on most frontier models without significant restructuring or platform-level allowlist configuration.
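The four examples above cover four of the six input combinations the estimator can see. A short sketch that enumerates the full grid, using the point values from the calculation section (variable names are illustrative assumptions):

```python
from itertools import product

# Point values from the "how this calculates" section; names are illustrative.
TRIGGER_POINTS = {"no": 5, "yes": 50}
SYSTEM_PROMPT_POINTS = {"weak": 5, "standard": 15, "strict": 30}

# Every (triggerWords, systemPromptStrength) pair mapped to its score.
grid = {
    (trigger, strength): TRIGGER_POINTS[trigger] + SYSTEM_PROMPT_POINTS[strength]
    for trigger, strength in product(TRIGGER_POINTS, SYSTEM_PROMPT_POINTS)
}

for (trigger, strength), risk in grid.items():
    print(f"triggerWords={trigger:<3}  systemPromptStrength={strength:<8}  "
          f"refusal risk={risk}%")
```

The two combinations not covered by the worked examples, no/standard and yes/weak, come out at 20% and 55%.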

::what this does NOT capture

○ Real refusal is model-specific, version-specific, and stochastic. This calculator outputs a single number; production refusal is a distribution over many factors.
○ Lexical trigger detection is a coarse stand-in for the multi-classifier pipelines that providers actually run. Modern systems typically rely on learned semantic classifiers rather than keyword lists alone.
○ System prompt strength is reduced to three buckets. Real deployments vary continuously and interact with the user prompt in non-linear ways.
○ Prompt length is shown as context but excluded from the score. Length does correlate weakly with refusal in some studies, but the effect is dwarfed by lexical content.
○ The score range (10-80%) is bounded by design. Real refusal rates can be 0% or 100% for specific prompts; this estimator deliberately avoids those extremes to discourage overconfidence.
○ This tool is educational, not adversarial. It builds intuition for developers tuning legitimate applications; it is not a jailbreak harness or a way to probe provider defenses.
○ Frontier model refusal behavior drifts. Providers ship safety updates every few weeks. A heuristic calibrated to June 2026 behavior will decay.
○ Ground truth requires sending the prompt to the actual model. No estimator substitutes for direct measurement against the provider's API.
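For the direct-measurement point, a minimal sketch of counting refusals over repeated generations when checking a legitimate application's false-refusal rate. Everything here is an assumption for illustration: `send` stands for whatever function calls your provider, `fake_model` is a stand-in so the sketch runs offline, and the refusal markers are guessed phrasings that will miss many real refusals.

```python
from typing import Callable

# Assumed refusal phrasings; real refusals vary by model and version.
REFUSAL_MARKERS = (
    "i can't help",
    "i cannot help",
    "i can only help",
    "i'm unable to",
)


def looks_like_refusal(reply: str) -> bool:
    """Crude substring check for a refusal-style reply."""
    text = reply.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def observed_refusal_rate(send: Callable[[str], str], prompt: str,
                          trials: int = 20) -> float:
    """Send the same prompt `trials` times and return the refused fraction."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    refusals = sum(looks_like_refusal(send(prompt)) for _ in range(trials))
    return refusals / trials


def fake_model(prompt: str) -> str:
    # Stand-in for a real API call: always gives a scope refusal.
    return "I can only help with questions about our HR policies."


rate = observed_refusal_rate(fake_model, "What's the weather today?", trials=5)
print(f"observed refusal rate: {rate:.0%}")
```

Unlike the estimator's single reading, this yields an empirical rate for one model, one version, and one system prompt. Rerun it whenever any of those change.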
