Regex can't see a face. This local redactor can.
It exists to find and mask sensitive content in messy real-world documents—scanned PDFs, Word files, images—using local vision and language models, because regex never saw a seal.

What it does
RedactionEverything is a local-first document anonymization workbench. It ingests plain text, Word files, images, and scanned PDFs, then detects sensitive entities—names, addresses, faces, seals, signatures, and IDs—using a mix of semantic NER, OCR, and visual grounding models. A built-in review interface lets humans correct detections before batch-exporting redacted packages, keeping raw files inside the local network.
The interesting bit
Instead of relying on brittle regex, it routes documents through a text path (HaS Text semantic NER over OCR output) and a visual path (LocateAnything-3B for faces, cards, and seals, plus OpenCV for red ink edges). A particularly clever pass whitens seal ink to recover text crushed underneath, then deduplicates the results—addressing the reality that Chinese contracts and scanned forms rarely present clean, linear text.
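To make the seal-whitening idea concrete, here is a minimal sketch of the general technique, not the project's actual code. The project reportedly uses OpenCV for red ink; this version works on plain Python lists of RGB tuples so it runs without dependencies. The names is_red_ink and whiten_seal_ink and both thresholds are assumptions chosen for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel 0-255


def is_red_ink(p: Pixel, min_red: int = 150, min_gap: int = 60) -> bool:
    """Heuristic (assumed thresholds): red is strong and clearly
    dominates both green and blue."""
    r, g, b = p
    return r >= min_red and r - max(g, b) >= min_gap


def whiten_seal_ink(image: List[List[Pixel]]) -> List[List[Pixel]]:
    """Return a copy of the image with red-ink pixels set to white.

    Dark text pixels fail the red test and are left untouched, so
    text printed under a stamp survives for a second OCR pass.
    """
    white = (255, 255, 255)
    return [[white if is_red_ink(p) else p for p in row] for row in image]


# Tiny 2x3 "scan": seal ink, dark text, and blank paper.
sample = [
    [(200, 30, 40), (20, 20, 20), (255, 255, 255)],
    [(210, 40, 50), (120, 110, 100), (180, 60, 60)],
]
cleaned = whiten_seal_ink(sample)
```

In a real pipeline, OCR would run on both the original and the whitened image, which is why a deduplication step is needed afterwards.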
Key highlights
- Runs entirely offline on a local or intranet GPU workstation; no raw files hit a remote API.
- Uses semantic NER (HaS Text) as the default detection method, with regex only as a user-defined fallback.
- LocateAnything-3B grounds visual presets like faces, fingerprints, signatures, and QR codes, supplemented by an OpenCV detector for red binding seals.
- Configurable schemas cover general, legal, finance, and healthcare domains with domain-specific labels.
- Batch task management with human-in-the-loop review, progress tracking, and packaged export workflows.
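The "semantic first, regex as fallback" arrangement from the highlights can be sketched as a span merge. This is an illustrative sketch only: Span, regex_fallback, merge_spans, and redact are hypothetical names, and the semantic detection is hand-written here as a stand-in for model output rather than a call into HaS Text.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str
    source: str  # "semantic" or "regex"


def regex_fallback(text: str, patterns: Dict[str, str]) -> List[Span]:
    """Run user-defined patterns and return their matches as spans."""
    return [
        Span(m.start(), m.end(), label, "regex")
        for label, pattern in patterns.items()
        for m in re.finditer(pattern, text)
    ]


def merge_spans(semantic: List[Span], fallback: List[Span]) -> List[Span]:
    """Keep every semantic span; add a regex span only when it does not
    overlap a span that is already kept."""
    kept = list(semantic)
    for s in fallback:
        if not any(s.start < k.end and k.start < s.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)


def redact(text: str, spans: List[Span]) -> str:
    """Replace each span with a bracketed label placeholder."""
    out, cursor = [], 0
    for s in sorted(spans, key=lambda s: s.start):
        if s.start < cursor:  # skip anything overlapping a prior span
            continue
        out.append(text[cursor:s.start])
        out.append(f"[{s.label}]")
        cursor = s.end
    out.append(text[cursor:])
    return "".join(out)


text = "Contact Li Wei, ID 110101199001011234."
name_at = text.index("Li Wei")
# Stand-in for what a semantic NER model might return.
semantic = [Span(name_at, name_at + len("Li Wei"), "NAME", "semantic")]
fallback = regex_fallback(text, {"ID": r"\d{18}", "NAME": r"Li Wei"})
merged = merge_spans(semantic, fallback)
redacted = redact(text, merged)
```

Here the regex NAME match duplicates the semantic one and is dropped, while the 18-digit ID, which the stand-in model missed, is picked up by the fallback pattern.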
Caveats
- The license is a custom Personal Use License; commercial use, teams, and production deployments require a separate paid license, and some bundled components (LocateAnything-3B weights, PyMuPDF) carry their own non-commercial or copyleft terms.
- The full vision pipeline recommends 16 GB of VRAM and an NVIDIA GPU; running model services on CPU is treated as a fallback risk, not a supported mode.
- The README implies a complex local setup—multiple Python virtual environments, WSL path handling, and manual weight downloads—even for the “one-command” startup.
Verdict
Worth a look if you handle sensitive scanned archives, bilingual contracts, or compliance-heavy documents in an air-gapped environment. Skip it if you need a simple SaaS PII scrubber or if wrangling local vLLM and PaddleOCR stacks sounds like someone else’s job.
Frequently asked
- What is TracyWang95/DataInfra-RedactionEverything?
- A local-first document anonymization workbench that finds and masks sensitive content in scanned PDFs, Word files, and images using local vision and language models.
- Is DataInfra-RedactionEverything open source?
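- Only partly. The source of TracyWang95/DataInfra-RedactionEverything is public on GitHub and the project is tracked on heatdrop, but it ships under a custom Personal Use License rather than a standard open-source license; commercial use, teams, and production deployments require a separate paid license.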
- What language is DataInfra-RedactionEverything written in?
- TracyWang95/DataInfra-RedactionEverything is primarily written in TypeScript.
- How popular is DataInfra-RedactionEverything?
- TracyWang95/DataInfra-RedactionEverything has 1k stars on GitHub.
- Where can I find DataInfra-RedactionEverything?
- TracyWang95/DataInfra-RedactionEverything is on GitHub at https://github.com/TracyWang95/DataInfra-RedactionEverything.