::calculator · A rough, educational gauge of how likely a frontier LLM is to refuse your prompt.
Refusal Heuristic Estimator
::inputs
Total characters in your prompt, including any few-shot examples. Context only — does not drive the score.
Lexical match against common policy-category keywords (violence, self-harm, illicit, regulated advice, identity).
How tightly the deployment's system prompt constrains scope and topic.
::result
Refusal risk
—
::how this calculates
The estimator sums two contributions. Trigger words add 50 points if present and 5 points if absent, reflecting the assumption that lexical pattern-matching dominates first-pass refusal classifiers. System prompt strength adds 5 (weak), 15 (standard), or 30 (strict) points, reflecting how much the deployment narrows acceptable output. Prompt length is shown as context for the user but does not enter the formula, because it is treated as a weak signal next to lexical content and system-prompt scope. The sum is reported as a single refusal risk percentage between 10 (5 + 5) and 80 (50 + 30).
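The scoring rule described here can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function name `refusal_risk`, the constant names, and the bucket labels `"weak"`, `"standard"`, and `"strict"` are assumptions. Only the point values come from the description.

```python
# Point values come from the description; every name here is illustrative.
TRIGGER_POINTS = {True: 50, False: 5}
SYSTEM_PROMPT_POINTS = {"weak": 5, "standard": 15, "strict": 30}


def refusal_risk(has_trigger_words: bool, system_prompt: str,
                 prompt_length: int = 0) -> int:
    """Return the heuristic refusal risk as a percentage between 10 and 80.

    prompt_length is accepted so callers can pass it along for display,
    but it is deliberately ignored: length does not enter the formula.
    """
    if system_prompt not in SYSTEM_PROMPT_POINTS:
        raise ValueError(f"unknown system prompt strength: {system_prompt!r}")
    return TRIGGER_POINTS[bool(has_trigger_words)] + SYSTEM_PROMPT_POINTS[system_prompt]


if __name__ == "__main__":
    print(refusal_risk(False, "weak"))    # lowest possible score: 5 + 5
    print(refusal_risk(True, "strict"))   # highest possible score: 50 + 30
```

Because both inputs are lookups into fixed tables, the score can only take six distinct values; the 10-80 bounds follow from the smallest and largest entries of each table.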
::worked examples
Benign developer query on a permissive deployment
Short, neutral vocabulary, minimal system prompt. Lands at 10% (5 + 5), the floor of the estimator. This is the typical zone for everyday coding, summarization, and analysis prompts on a developer API key.
Neutral question on a narrowed enterprise deployment
No trigger words, but the system prompt aggressively narrows scope (e.g., 'only answer questions about our HR policies'). Lands at 35% (5 + 30). Most refusals in this zone are scope refusals ('I can only help with X'), not policy refusals.
Trigger-word prompt on a standard chat surface
Lexical match against a policy category plus a default system prompt. Lands at 65% (50 + 15). This is the zone where benign prompts get caught (medical questions, security research, fiction writing) and where careful framing matters most.
Trigger-word prompt on a strict enterprise deployment
Both signals firing. Lands at 80% (50 + 30), the ceiling of the estimator. In practice most frontier models will refuse this prompt without significant restructuring or platform-level allowlist configuration.
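The four worked examples are four of the six combinations the formula can produce. As a rough sketch, the full grid can be enumerated as follows; the labels and the `score_grid` helper are illustrative assumptions, and mapping the "permissive" example to the weak bucket is inferred from its 10% score.

```python
from itertools import product

# Point values from the formula description; labels are illustrative.
TRIGGER_POINTS = {"no trigger words": 5, "trigger words": 50}
SYSTEM_PROMPT_POINTS = {"weak": 5, "standard": 15, "strict": 30}


def score_grid() -> dict:
    """Map every (trigger, system prompt) combination to its score."""
    return {
        (trigger, strength): TRIGGER_POINTS[trigger] + SYSTEM_PROMPT_POINTS[strength]
        for trigger, strength in product(TRIGGER_POINTS, SYSTEM_PROMPT_POINTS)
    }


if __name__ == "__main__":
    # Print the grid from lowest to highest score.
    for (trigger, strength), score in sorted(score_grid().items(), key=lambda kv: kv[1]):
        print(f"{trigger:<17} {strength:<9} {score:>3}%")
```

The two combinations without a worked example are no trigger words on a standard deployment (5 + 15) and trigger words on a weak deployment (50 + 5).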
::what this does NOT capture
- Real refusal is model-specific, version-specific, and stochastic. This calculator outputs a single number; production refusal is a distribution over many factors.
- Lexical trigger detection is a coarse stand-in for the multi-classifier pipelines that providers actually run. Modern systems typically rely on semantic signals such as embeddings rather than keyword lists alone.
- System prompt strength is reduced to three buckets. Real deployments vary continuously and interact with the user prompt in non-linear ways.
- Prompt length is shown as context but excluded from the score. Length does correlate weakly with refusal in some studies, but the effect is dwarfed by lexical content.
- The score range (10-80%) is bounded by design. Real refusal rates can be 0% or 100% for specific prompts; this estimator deliberately avoids those extremes to discourage overconfidence.
- This tool is educational, not adversarial. It builds intuition for developers tuning legitimate applications; it is not a jailbreak harness or a way to probe provider defenses.
- Frontier model refusal behavior drifts. Providers ship safety updates every few weeks. A heuristic calibrated to June 2026 behavior will decay.
- Ground truth requires sending the prompt to the actual model. No estimator substitutes for direct measurement against the provider's API.
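Since real refusal is stochastic, a direct measurement is a rate with an uncertainty band, not a single yes or no. The sketch below shows one way a developer might estimate that rate for a prompt in their own application. It assumes two caller-supplied callables, both hypothetical: `send_prompt`, which wraps whatever provider client you already use, and `is_refusal`, which decides whether a response counts as a refusal. No real provider API is shown. The interval is a standard Wilson score interval.

```python
import math
from typing import Callable, Tuple


def wilson_interval(refusals: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = refusals / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def measure_refusal_rate(send_prompt: Callable[[str], str],
                         is_refusal: Callable[[str], bool],
                         prompt: str,
                         trials: int = 20) -> Tuple[float, float, float]:
    """Send the same prompt `trials` times; return (rate, low, high).

    send_prompt and is_refusal are hypothetical hooks supplied by the caller.
    """
    refusals = sum(1 for _ in range(trials) if is_refusal(send_prompt(prompt)))
    low, high = wilson_interval(refusals, trials)
    return refusals / trials, low, high


if __name__ == "__main__":
    # Stand-in stubs so the sketch runs offline; replace with your own client.
    def fake_send(prompt: str) -> str:
        return "I can only help with HR policy questions."

    def looks_like_refusal(response: str) -> bool:
        return response.startswith("I can only help")

    rate, low, high = measure_refusal_rate(fake_send, looks_like_refusal, "example prompt")
    print(f"refusal rate {rate:.0%}, interval {low:.0%} to {high:.0%}")
```

A small number of trials gives a wide interval, which is a useful reminder that a handful of test calls says little about the underlying rate. Results should also be re-measured after each model or system prompt change.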