openai/privacy-filter

OpenAI's PII filter: small enough for a laptop, paranoid enough for prod

A 1.5B-parameter on-premise model that detects and masks personal data in a single forward pass, with tunable precision/recall tradeoffs.

★ 2.4k stars · Python · Other AI

What it does

Privacy Filter is a local, Apache 2.0-licensed tool for detecting and redacting personally identifiable information in text. It runs entirely on-premises via a CLI (opf) that supports one-shot redaction, batch file processing, piping from other tools, interactive mode, evaluation against labeled data, and fine-tuning on custom datasets. The model detects 8 categories: account numbers, addresses, emails, person names, phone numbers, URLs, dates, and secrets.
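
To make the redaction step concrete, here is a minimal sketch of how detected spans could be turned into masked text. The `(start, end, label)` span format, the label names, and the `mask_spans` helper are all assumptions for illustration; they are not the opf CLI's output schema or the library's API.

```python
from typing import List, Tuple

# Hypothetical span format: (start_char, end_char, label).
# This is an assumption, not the tool's real output schema.
Span = Tuple[int, int, str]

def mask_spans(text: str, spans: List[Span]) -> str:
    """Replace each detected span with a bracketed label placeholder."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:  # skip spans that overlap one already masked
            continue
        out.append(text[cursor:start])   # keep the text before the span
        out.append(f"[{label.upper()}]")  # substitute the placeholder
        cursor = end
    out.append(text[cursor:])  # keep whatever follows the last span
    return "".join(out)

example = "Email jane@example.com or call 555-0100."
spans = [(6, 22, "email"), (31, 39, "phone")]
print(mask_spans(example, spans))  # Email [EMAIL] or call [PHONE].
```

Working from character offsets rather than tokens keeps the untouched text byte-for-byte identical, which matters when the redacted output feeds another tool.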

The interesting bit

The architecture is a transformer that started life as an autoregressive language model (similar to gpt-oss) and was surgically converted into a bidirectional token classifier. Instead of generating text, it labels every token in one forward pass, then runs a constrained Viterbi decoder so that the predicted tags form coherent BIOES (begin, inside, outside, end, single) span boundaries. With a 128K-token context window, most documents fit in a single pass without chunking.
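
The decoding idea can be sketched in a few lines. The following is a generic constrained Viterbi over BIOES tags, not the project's actual decoder: it uses a hard allowed/disallowed transition rule instead of learned transition biases, and the tag names, scores, and helper functions are made up for illustration.

```python
import math
from typing import Dict, List

NEG_INF = -math.inf

def build_tags(entity_types: List[str]) -> List[str]:
    """Return 'O' plus B-/I-/E-/S- tags for each entity type."""
    tags = ["O"]
    for t in entity_types:
        tags += [f"{p}-{t}" for p in "BIES"]
    return tags

def allowed(prev: str, nxt: str) -> bool:
    """BIOES rule: B/I must be followed by I/E of the same type; O/E/S by O/B/S."""
    if prev == "O" or prev[0] in "ES":
        return nxt == "O" or nxt[0] in "BS"
    return nxt[0] in "IE" and nxt[2:] == prev[2:]  # prev is B-x or I-x

def viterbi(scores: List[Dict[str, float]], tags: List[str]) -> List[str]:
    """scores[t][tag] is a per-token log-score; returns the best valid tag path."""
    # A sequence may only start in O, B-x, or S-x.
    best = {t: (scores[0][t] if t == "O" or t[0] in "BS" else NEG_INF) for t in tags}
    back = []
    for step in scores[1:]:
        new_best, pointers = {}, {}
        for nxt in tags:
            score, prev = max((best[p], p) for p in tags if allowed(p, nxt))
            new_best[nxt] = score + step[nxt]
            pointers[nxt] = prev
        best = new_best
        back.append(pointers)
    # A sequence may only end in O, E-x, or S-x.
    _, last = max((s, t) for t, s in best.items() if t == "O" or t[0] in "ES")
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

tags = build_tags(["person"])
# Made-up log-scores for three tokens, e.g. "Jane Doe called".
scores = [
    {"O": -3.0, "B-person": -0.2, "I-person": -4.0, "E-person": -4.0, "S-person": -2.0},
    {"O": -3.0, "B-person": -0.5, "I-person": -2.0, "E-person": -1.0, "S-person": -3.0},
    {"O": -0.1, "B-person": -4.0, "I-person": -4.0, "E-person": -4.0, "S-person": -4.0},
]
greedy = [max(s, key=s.get) for s in scores]
print(greedy)                 # ['B-person', 'B-person', 'O'] (invalid: B followed by B)
print(viterbi(scores, tags))  # ['B-person', 'E-person', 'O']
```

Picking the best tag per token independently yields two consecutive B tags here, which is not a valid span. The constrained search trades a slightly lower score on the second token for a well-formed two-token span.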

Key highlights

1.5B parameters total, 50M active — runs on CPU or GPU, even in a browser
128K context window, no chunking required
Tunable operating points for precision/recall tradeoffs at runtime
Fine-tunable on custom labeled data; CLI includes train, eval, and redact modes
Constrained Viterbi decoding with 6 transition-bias parameters for span coherence
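
To illustrate what an operating point means in practice, here is a generic sketch that sweeps a decision threshold over token-level PII scores and reports precision and recall. It does not claim to mirror how privacy-filter exposes its operating points; the `precision_recall` helper, the scores, and the labels are invented for the example.

```python
from typing import List, Tuple

def precision_recall(probs: List[float], gold: List[bool], threshold: float) -> Tuple[float, float]:
    """Token-level precision and recall when flagging tokens with prob >= threshold."""
    flagged = [p >= threshold for p in probs]
    tp = sum(f and g for f, g in zip(flagged, gold))      # correctly flagged
    fp = sum(f and not g for f, g in zip(flagged, gold))  # flagged but not PII
    fn = sum(g and not f for f, g in zip(flagged, gold))  # PII that was missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Made-up scores and gold labels, for illustration only.
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
gold = [True, True, False, True, False, False]

for threshold in (0.25, 0.5, 0.9):
    p, r = precision_recall(probs, gold, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
# threshold=0.25 precision=0.60 recall=1.00
# threshold=0.50 precision=0.67 recall=0.67
# threshold=0.90 precision=1.00 recall=0.33
```

A low threshold suits redaction pipelines where a missed identifier is costly; a high one suits cases where over-redaction destroys too much useful text.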

Caveats

Static label policy: you cannot add new PII categories at runtime; retraining required
Primarily English; performance drops on non-English text, non-Latin scripts, and out-of-domain data
Explicitly not a compliance or anonymization guarantee — OpenAI warns against over-reliance, especially in medical, legal, or financial contexts
Known failure modes: under-detection of uncommon names and regional conventions; over-redaction of public entities; fragmented boundaries in messy text

Verdict

Worth evaluating if you need on-premise PII detection with a permissive license and don’t want to pipe sensitive data to an API. Skip it if you need dynamic label policies, guaranteed compliance, or strong multilingual coverage without fine-tuning.

Frequently asked

What is openai/privacy-filter? A 1.5B-parameter on-premise model that detects and masks personal data in a single forward pass, with tunable precision/recall tradeoffs.
Is privacy-filter open source? Yes, openai/privacy-filter is open source, released under the Apache-2.0 license.
What language is privacy-filter written in? openai/privacy-filter is primarily written in Python.
How popular is privacy-filter? openai/privacy-filter has 2.4k stars on GitHub.
Where can I find privacy-filter? openai/privacy-filter is on GitHub at https://github.com/openai/privacy-filter.