::calculator · Estimate how much your text will get blacked out before you run the regex
PII Redactor Spec
::inputs
Total character count of the document you want to redact
Which entity categories the regex + NER pass will target
1.0 = typical business prose; 2.5 = CRM/support transcript; 0.3 = legal contract
Length of the token replacing each match (e.g. [REDACTED] = 10 chars)
Mean length of the original string being replaced (names ~10, emails ~22, SSN = 11)
::result
Estimated redactions found
—
Estimated redacted text length
—
Length change vs original
—
::how this calculates
Each redaction mode has a baseline detection rate measured in matches per 1,000 characters of typical business prose. Full mode catches roughly 1.92 entities per 1,000 chars (driven mostly by personal names at 1.67/1000), financial-only roughly 0.18 per 1,000, and contact-only roughly 0.96 per 1,000. We multiply that rate by your character count (in thousands) and your density profile to get expected matches. Redacted length is then the original length minus the characters consumed by matched entities (matches × mean match length), plus the characters added back by replacement tokens (matches × token length).
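The calculation described above can be sketched in a few lines of Python. This is an illustrative sketch, not the calculator's actual source: the function name `estimate_redaction`, the mode keys, the default values, and the choice to round the match count to a whole number are all assumptions. Only the three baseline rates and the formula come from the text.

```python
# Baseline matches per 1,000 characters, as quoted in the text above.
BASELINE_PER_1000 = {
    "full": 1.92,
    "financial": 0.18,
    "contact": 0.96,
}


def estimate_redaction(char_count, mode, density=1.0, token_len=10, avg_match_len=14):
    """Project match count, redacted length, and length change.

    Hypothetical helper: parameter names and defaults are assumptions
    (10 = len("[REDACTED]"), 14 = the default mean match length).
    """
    if mode not in BASELINE_PER_1000:
        raise ValueError(f"unknown mode: {mode!r}")

    # Expected matches = (chars / 1000) * baseline rate * density profile.
    # Rounding to a whole match is an assumption about the display logic.
    matches = round(char_count / 1000 * BASELINE_PER_1000[mode] * density)

    # Each match removes avg_match_len chars and adds token_len chars.
    length_change = matches * (token_len - avg_match_len)

    return {
        "matches": matches,
        "redacted_length": char_count + length_change,
        "length_change": length_change,
    }
```

For instance, `estimate_redaction(8000, "full", density=2.5)` projects 38 matches and a length change of -152 characters under these assumptions. A replacement token longer than the mean match length makes the output grow instead of shrink.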
::worked examples
Customer support transcript, full redaction
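8,000 chars at 2.5x density (CRM transcript) under full mode catches roughly 38 entities (8 × 1.92 × 2.5 ≈ 38.4), mostly customer first names and an email or two. With a 10-char token such as [REDACTED] and the default 14-char average match, redacted output is about 150 chars shorter (38 × 4 = 152).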
Legal contract, financial-only
A 25K-char contract at 0.3x density under financial-only catches roughly 1 entity (25 × 0.18 × 0.3 ≈ 1.35), perhaps one SSN or one account number. Output length barely moves.
Marketing email list dump, contact-only
50K-char list at 3x density (dense email/phone payload) under contact-only catches roughly 144 entities. Replacement tokens are 12 chars vs 22-char avg matches, so output shrinks by ~1440 chars.
::what this does NOT capture
- Density baselines are calibrated to mean rates from Enron-redacted, i2b2, and OntoNotes 5 person-entity distributions; your actual corpus may deviate by 3-5x.
- This estimator does NOT detect anything; it only projects expected counts. No source text is parsed or stored by the estimator.
- Person-name detection in real deployment will use NER, which runs 0.85-0.92 F1 on clean prose and lower on transcripts, code, or non-English text, so actual match counts will tend to undershoot the estimate.
- Replacement token length is assumed fixed; if your implementation uses variable-length sentinels (e.g. [REDACTED-EMAIL-001]) the length delta will not match.
- Average matched entity length defaults to 14 chars, which is a weighted mean across SSN (11), phone (12), email (22), and name (10); adjust if your scope is single-category.
- Financial-only and contact-only baselines exclude personal names entirely, which is why their density rates are lower than full mode: roughly a tenth for financial-only (0.18 vs 1.92) and half for contact-only (0.96 vs 1.92).
- This is not a compliance attestation. HIPAA Safe Harbor, GDPR Article 4(5), and CCPA require validated detection workflows with documented precision/recall, not density estimates.
- Density multiplier is a blunt scalar; it scales all categories uniformly. In reality, a legal contract is low on emails but high on names and addresses, and the multiplier cannot capture that asymmetry.
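On the variable-length sentinel caveat above, the following sketch shows one way the length delta could be recomputed when each match gets a numbered sentinel in the style of [REDACTED-EMAIL-001]. The helper names, the three-digit zero-padded counter, and the per-category match counts are hypothetical; the estimator itself does not produce a per-category breakdown.

```python
def sentinel(category, index):
    """Build a numbered sentinel such as [REDACTED-EMAIL-001].

    The format (upper-cased category, 3-digit counter) is an assumption
    modelled on the example in the list above.
    """
    return f"[REDACTED-{category.upper()}-{index:03d}]"


def length_change_with_sentinels(counts, avg_match_len):
    """Length change when every match is replaced by its own sentinel.

    counts: dict mapping category name -> expected number of matches
            (hypothetical input, supplied by the caller).
    avg_match_len: mean length of the original matched strings.
    """
    # Characters added: the actual length of every sentinel emitted.
    added = sum(
        len(sentinel(category, i))
        for category, n in counts.items()
        for i in range(1, n + 1)
    )
    # Characters removed: one average-length match per sentinel.
    removed = sum(counts.values()) * avg_match_len
    return added - removed
```

As an illustration, take the 144 matches from the contact-only example and split them, purely hypothetically, into 100 emails and 44 phone numbers. `length_change_with_sentinels({"email": 100, "phone": 44}, 22)` gives -288, because each sentinel is 20 chars long, whereas the fixed 12-char token projected a shrink of about 1440.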