
::calculator · Estimate how much your text will get blacked out before you run the regex

PII Redactor Spec

This is a spec for a client-side PII redactor, not a hosted service. The intent is that the real implementation runs entirely in your browser using regex pattern matching plus lightweight named-entity extraction, never sending the source text over the network. This estimator exists for a narrower purpose: to tell you, before you paste anything sensitive in, roughly how many redactions to expect and how much shorter the output will be.

The model behind the estimate is density-based, not detection-based. We assume a typical "human business prose" corpus where personally identifying information appears at empirically observed rates: roughly 1 SSN per 12,000 characters, 1 email per 1,800 characters, 1 phone number per 2,500 characters, and 1 personal name per 600 characters. These are mean rates pulled from public PII benchmark corpora (Enron-redacted, the i2b2 deidentification challenges, and the OntoNotes 5 person-entity distribution). Your actual text will deviate: a customer support transcript will hit names and emails far above baseline; a legal contract will hit names and addresses but rarely emails.

Three redaction modes are available. The full mode (ssn-email-phone-name) catches the four canonical categories most regulatory regimes care about. Financial-only narrows to SSN, account numbers, and routing identifiers (the HIPAA/PCI overlap). Contact-only narrows to email, phone, and physical address (the GDPR/CCPA contactable-identifier slice).

The redacted length output assumes each detected entity is replaced with a fixed-width token like `[REDACTED]` (10 characters). If your implementation uses a shorter sentinel (e.g. `***`) or a longer one (e.g. `[REDACTED-EMAIL-001]`), the output length will scale linearly with that choice.

This is an estimator, not a guarantee. Real-world precision and recall depend entirely on the regex strictness and NER model quality you ship, and named-entity recognition for person names runs around 0.85-0.92 F1 on clean prose, lower on noisy data. Use the numbers below for capacity planning and UX preview, not for compliance attestation.
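To make the fixed-width replacement concrete, here is a minimal sketch of a regex-only pass. The three patterns and the `redact` helper are illustrative assumptions, not the spec's actual regexes; they are deliberately loose, and personal names are left untouched because they would need the NER step.

```python
import re

# Illustrative patterns only (assumptions, not the spec's real regexes).
# Production patterns need stricter validation and locale handling.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(text: str, token: str = "[REDACTED]") -> tuple[str, int]:
    """Replace every pattern match with `token`; return (redacted_text, match_count)."""
    total = 0
    for pattern in PATTERNS.values():
        # subn returns the new string and the number of substitutions made
        text, n = pattern.subn(token, text)
        total += n
    return text, total


sample = "Reach Dana at dana@example.com or 555-867-5309. SSN on file: 123-45-6789."
redacted, count = redact(sample)
print(redacted)
print(count, len(sample) - len(redacted))
```

On this sample the three matches (16, 12, and 11 characters long) are each swapped for a 10-character token, so the output is 9 characters shorter. The name "Dana" survives, which is the gap the NER pass is meant to cover.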

::inputs

Text length (chars)

Total character count of the document you want to redact

Redaction scope

Which entity categories the regex + NER pass will target

Density profile (x baseline)

1.0 = typical business prose; 2.5 = CRM/support transcript; 0.3 = legal contract

Replacement token length (chars)

Length of the token replacing each match (e.g. [REDACTED] = 10 chars)

Average matched entity length (chars)

Mean length of the original string being replaced (names ~10, emails ~22, SSN = 11)

::result

Estimated redactions found

—

Estimated redacted text length

—

Length change vs original

—

::how this calculates

Each redaction mode has a baseline detection rate measured in matches per 1,000 characters of typical business prose. Full mode catches roughly 1.92 entities per 1,000 chars (driven mostly by personal names at 1.67/1000), financial-only catches roughly 0.18 per 1,000, contact-only catches roughly 0.96 per 1,000. We multiply that rate by your character count and your density profile to get expected matches. Redacted length is then the original length minus the characters consumed by matched entities, plus the characters added back by replacement tokens.
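The calculation described above can be sketched in a few lines. This is a minimal reading of the description, not the page's actual source: the function name `estimate`, the `BASELINE_PER_1000` table, and the snake_case parameter names are assumptions, and the three baseline rates are taken exactly as stated here rather than derived from the per-category rates in the introduction. Whether the real calculator rounds intermediate values is not stated, so the sketch keeps fractions.

```python
# Baseline matches per 1,000 characters, as quoted in the text above.
BASELINE_PER_1000 = {
    "ssn-email-phone-name": 1.92,
    "financial-only": 0.18,
    "contact-only": 0.96,
}


def estimate(text_length, redact_types, density_multiplier=1.0,
             sentinel_length=10, avg_entity_length=14):
    """Project expected matches and redacted length (hypothetical helper)."""
    rate = BASELINE_PER_1000[redact_types]
    matches = text_length / 1000 * rate * density_multiplier
    removed = matches * avg_entity_length  # chars consumed by matched entities
    added = matches * sentinel_length      # chars added back by replacement tokens
    redacted_length = text_length - removed + added
    return {
        "matches": matches,
        "redacted_length": redacted_length,
        "length_change": redacted_length - text_length,
    }


result = estimate(8000, "ssn-email-phone-name", density_multiplier=2.5)
print(round(result["matches"]), round(result["length_change"]))
```

With these inputs the sketch projects roughly 38 matches and a length change of about -154 characters. A sentinel longer than the average match makes `length_change` positive, meaning the output grows.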

::worked examples

Customer support transcript, full redaction

textLength: 8000 · redactTypes: ssn-email-phone-name · densityMultiplier: 2.5 · sentinelLength: 10 · avgEntityLength: 14

8,000 chars at 2.5x density (CRM transcript) under full mode: 8 × 1.92 × 2.5 ≈ 38 entities, mostly customer first names and an email or two. Each 14-char match becomes a 10-char token, saving 4 chars per match, so the redacted output is about 150 chars shorter.

Legal contract, financial-only

textLength: 25000 · redactTypes: financial-only · densityMultiplier: 0.3 · sentinelLength: 10 · avgEntityLength: 14

25K-char contract at 0.3x density under financial-only: 25 × 0.18 × 0.3 ≈ 1.35, so roughly 1 entity, perhaps one SSN or one account number. Output length barely moves (about 5 chars shorter).

Marketing email list dump, contact-only

textLength: 50000 · redactTypes: contact-only · densityMultiplier: 3 · sentinelLength: 12 · avgEntityLength: 22

50K-char list at 3x density (dense email/phone payload) under contact-only: 50 × 0.96 × 3 = 144 entities. Replacement tokens are 12 chars vs 22-char average matches, so output shrinks by 144 × 10 = 1,440 chars.

::what this does NOT capture

○Density baselines are calibrated to mean rates from Enron-redacted, i2b2, and OntoNotes 5 person-entity distributions — your actual corpus may deviate by 3-5x.
○This estimator does NOT detect anything — it only projects expected counts. No source text is parsed or stored by the estimator.
○Person-name detection in real deployment will use NER, which runs 0.85-0.92 F1 on clean prose and lower on transcripts, code, or non-English text — actual recall will undershoot the estimate.
○Replacement token length is fixed; if your implementation uses variable-length sentinels (e.g. [REDACTED-EMAIL-001]) the length delta will not match.
○Average matched entity length defaults to 14 chars, which is close to the simple (unweighted) mean of 13.75 across SSN (11), phone (12), email (22), and name (10). Adjust it if your scope is single-category.
○Financial-only and contact-only baselines exclude personal names entirely, which is why their density rates are lower than full mode: about half for contact-only (0.96 vs 1.92 per 1,000 chars) and roughly a tenth for financial-only (0.18).
○This is not a compliance attestation. Meeting HIPAA Safe Harbor, GDPR Article 4(5) pseudonymisation, or CCPA deidentification standards calls for validated detection workflows with documented precision/recall, not density estimates.
○Density multiplier is a blunt scalar — it scales all categories uniformly. In reality, a legal contract is low on emails but high on names and addresses; the multiplier cannot capture that asymmetry.
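The 14-char default for average matched entity length in the list above can be checked against the per-category figures. The sketch below computes both a simple mean and a mean weighted by the baseline densities from the introduction. The helper names are hypothetical, and treating "1 match per N characters" as a weight of 1/N is an assumption about how a density weighting would work.

```python
# Typical match lengths in characters, from the list above.
ENTITY_LENGTH = {"ssn": 11, "phone": 12, "email": 22, "name": 10}

# Baseline densities from the introduction, as "one match per N characters".
CHARS_PER_MATCH = {"ssn": 12_000, "phone": 2_500, "email": 1_800, "name": 600}


def simple_mean(lengths):
    """Unweighted average of the per-category lengths."""
    return sum(lengths.values()) / len(lengths)


def density_weighted_mean(lengths, chars_per_match):
    """Average length weighted by how often each category occurs (assumed 1/N weights)."""
    weights = {k: 1 / chars_per_match[k] for k in lengths}
    total = sum(weights.values())
    return sum(lengths[k] * weights[k] for k in lengths) / total


print(round(simple_mean(ENTITY_LENGTH), 2))
print(round(density_weighted_mean(ENTITY_LENGTH, CHARS_PER_MATCH), 2))
```

Under these assumptions the simple mean is 13.75 and the density-weighted mean is about 12.8, because short, frequent names dominate. For a single-category scope, use that category's own length instead of either mean.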
