OpenAI's PII filter: small enough for a laptop, paranoid enough for prod
A 1.5B-parameter on-premise model that detects and masks personal data in a single forward pass, with tunable precision/recall tradeoffs.

What it does
Privacy Filter is a local, Apache 2.0-licensed tool for detecting and redacting personally identifiable information in text. It runs entirely on-premises via a CLI (opf) that supports one-shot redaction, batch file processing, piping from other tools, interactive mode, evaluation against labeled data, and fine-tuning on custom datasets. The model detects 8 categories: account numbers, addresses, emails, person names, phone numbers, URLs, dates, and secrets.
The interesting bit
The architecture is a transformer that started life as an autoregressive language model (similar to gpt-oss) and was surgically converted into a bidirectional token classifier. Instead of generating text, it labels every token in one forward pass, then runs a constrained Viterbi decoder to enforce coherent BIOES span boundaries. The 128K-token context window means you can throw entire documents at it without chunking.
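To make the decoding step concrete, here is a minimal sketch of constrained Viterbi decoding over BIOES labels. It is not the model's actual decoder: the function names, the single NAME entity type, and the toy scores are all assumptions made for illustration, and the transition-bias parameters are left out entirely. It only shows how a hard constraint on label transitions turns an incoherent greedy labeling into a well-formed span.

```python
NEG_INF = float("-inf")

def build_labels(entity_types):
    """Return ["O", "B-X", "I-X", "E-X", "S-X", ...] for the given types."""
    labels = ["O"]
    for t in entity_types:
        labels += [f"{p}-{t}" for p in "BIES"]
    return labels

def allowed(prev, curr):
    """True if BIOES permits label `curr` immediately after label `prev`."""
    p_tag, _, p_type = prev.partition("-")
    c_tag, _, c_type = curr.partition("-")
    if p_tag in ("B", "I"):
        # A span is open: it must continue or close, with the same type.
        return c_tag in ("I", "E") and c_type == p_type
    # After O, E, or S no span is open: stay outside or start a new one.
    return c_tag in ("O", "B", "S")

def viterbi_bioes(scores, labels):
    """scores[t][j] is the log-score of labels[j] at token t (toy input)."""
    n = len(labels)
    # A sequence may not begin in the middle of a span.
    best = [s if labels[j][0] in "OBS" else NEG_INF
            for j, s in enumerate(scores[0])]
    back = []
    for row in scores[1:]:
        new_best, ptr = [], []
        for j in range(n):
            prev_score, i = max(
                (best[i], i) for i in range(n) if allowed(labels[i], labels[j])
            )
            new_best.append(prev_score + row[j])
            ptr.append(i)
        back.append(ptr)
        best = new_best
    # A sequence may not end with a span still open.
    _, j = max((b, j) for j, b in enumerate(best) if labels[j][0] in "OES")
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [labels[k] for k in path]

labels = build_labels(["NAME"])  # hypothetical single-category label set
# Made-up scores, columns ordered as O, B-NAME, I-NAME, E-NAME, S-NAME.
scores = [
    [-2.0, -0.1, -5.0, -5.0, -1.5],
    [-0.5, -5.0, -0.9, -5.0, -5.0],
    [-3.0, -5.0, -5.0, -0.1, -2.0],
]
greedy = [labels[max(range(len(row)), key=row.__getitem__)] for row in scores]
decoded = viterbi_bioes(scores, labels)
```

On these toy scores, per-token argmax picks B, O, E, which is not a valid span. The constrained decoder instead returns the best-scoring sequence that obeys the transition rules.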
Key highlights
- 1.5B parameters total, 50M active — runs on CPU or GPU, even in a browser
- 128K context window, no chunking required
- Tunable operating points for precision/recall tradeoffs at runtime
- Fine-tunable on custom labeled data; CLI includes train, eval, and redact modes
- Constrained Viterbi decoding with 6 transition-bias parameters for span coherence
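The highlights above don't say how an operating point is chosen, so the following is only a generic illustration of the idea, not the tool's mechanism. It assumes each token has a margin (PII score minus background score) and a single hypothetical bias added before the decision. The margins and gold labels are invented for the example.

```python
def flag_tokens(margins, bias):
    """Flag a token as PII when its margin plus the bias is positive."""
    return [m + bias > 0 for m in margins]

def precision_recall(pred, gold):
    """Token-level precision and recall for boolean predictions."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Made-up margins and gold labels for six tokens.
margins = [2.0, 0.5, -0.3, -0.2, -1.0, -2.5]
gold = [True, True, True, False, False, False]

strict = precision_recall(flag_tokens(margins, 0.0), gold)   # fewer flags
lenient = precision_recall(flag_tokens(margins, 0.4), gold)  # more flags
```

With no bias, the toy data gives perfect precision but misses one true PII token. With a bias of 0.4, every PII token is caught at the cost of one false positive. That is the kind of tradeoff an operating point selects.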
Caveats
- Static label policy: you cannot add new PII categories at runtime; retraining required
- Primarily English; performance drops on non-English text, non-Latin scripts, and out-of-domain data
- Explicitly not a compliance or anonymization guarantee — OpenAI warns against over-reliance, especially in medical, legal, or financial contexts
- Known failure modes: under-detection of uncommon names and regional conventions; over-redaction of public entities; fragmented boundaries in messy text
Verdict
Worth evaluating if you need on-premise PII detection with a permissive license and don’t want to pipe sensitive data to an API. Skip it if you need dynamic label policies, guaranteed compliance, or strong multilingual coverage without fine-tuning.