TracyWang95/DataInfra-RedactionEverything

Regex can't see a face. This local redactor can.

It exists to find and mask sensitive content in messy real-world documents—scanned PDFs, Word files, images—using local vision and language models, because regex never saw a seal.

1k stars · TypeScript · Data Tooling

What it does

RedactionEverything is a local-first document anonymization workbench. It ingests plain text, Word files, images, and scanned PDFs, then detects sensitive entities—names, addresses, faces, seals, signatures, and IDs—using a mix of semantic NER, OCR, and visual grounding models. A built-in review interface lets humans correct detections before batch-exporting redacted packages, keeping raw files inside the local network.

The interesting bit

Instead of relying on brittle regex, it routes documents through a text path (HaS Text semantic NER over OCR output) and a visual path (LocateAnything-3B for faces, cards, and seals, plus OpenCV for red ink edges). A particularly clever pass whitens seal ink to recover text crushed underneath, then deduplicates the results—addressing the reality that Chinese contracts and scanned forms rarely present clean, linear text.
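
The seal-whitening and deduplication idea can be illustrated with a minimal sketch. This is not the project's code: the image is a plain list of RGB tuples rather than an OpenCV array, and the names `whiten_red_ink`, `Detection`, `deduplicate`, and the thresholds are all hypothetical choices made for illustration. It assumes that detections from the original scan and from the whitened copy are merged by dropping same-label boxes that overlap heavily.

```python
from dataclasses import dataclass

# An RGB image as rows of (r, g, b) tuples; a real pipeline would use image arrays.
Pixel = tuple[int, int, int]
Image = list[list[Pixel]]
Box = tuple[int, int, int, int]  # (x0, y0, x1, y1)


def whiten_red_ink(image: Image, min_red: int = 150, margin: int = 60) -> Image:
    """Turn strongly red pixels white so dark text under a seal is easier to read."""
    white = (255, 255, 255)
    return [
        [white if r >= min_red and r - max(g, b) >= margin else (r, g, b)
         for (r, g, b) in row]
        for row in image
    ]


@dataclass(frozen=True)
class Detection:
    label: str
    box: Box
    source: str  # which pass produced it, e.g. "plain" or "whitened"


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def deduplicate(detections: list[Detection], threshold: float = 0.5) -> list[Detection]:
    """Keep the first detection of each overlapping same-label group."""
    kept: list[Detection] = []
    for det in detections:
        if not any(det.label == k.label and iou(det.box, k.box) >= threshold for k in kept):
            kept.append(det)
    return kept


# Usage: one red seal pixel next to one dark text pixel.
scan = [[(200, 30, 30), (20, 20, 20)]]
cleaned = whiten_red_ink(scan)

# Detections from both passes, with one near-duplicate NAME box.
merged = deduplicate([
    Detection("NAME", (0, 0, 10, 10), "plain"),
    Detection("NAME", (1, 0, 10, 10), "whitened"),
    Detection("ID", (0, 20, 10, 30), "whitened"),
])
```

The red test here is a simple channel comparison; a production pass would more likely threshold in a color space such as HSV and handle faded or pinkish ink.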

Key highlights

  • Runs entirely offline on a local or intranet GPU workstation; no raw files hit a remote API.
  • Uses semantic NER (HaS Text) as the default detection method, with regex only as a user-defined fallback.
  • LocateAnything-3B grounds visual presets like faces, fingerprints, signatures, and QR codes, supplemented by an OpenCV detector for red binding seals.
  • Configurable schemas cover general, legal, finance, and healthcare domains with domain-specific labels.
  • Batch task management with human-in-the-loop review, progress tracking, and packaged export workflows.
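
One way to read the "semantic NER first, regex only as a user-defined fallback" highlight is sketched below. The function names, the span format, and the `fake_ner` stand-in are all hypothetical; the real HaS Text model and the project's schema format are not shown in this summary, so a stub callable takes the model's place.

```python
import re
from typing import Callable

Span = tuple[int, int, str]  # (start, end, label)


def detect(
    text: str,
    semantic_ner: Callable[[str], list[Span]],
    fallback_patterns: dict[str, str],
) -> list[Span]:
    """Run the semantic detector first, then add regex hits it did not cover."""
    spans = list(semantic_ner(text))
    for label, pattern in fallback_patterns.items():
        for match in re.finditer(pattern, text):
            overlaps = any(match.start() < end and start < match.end()
                           for start, end, _ in spans)
            if not overlaps:
                spans.append((match.start(), match.end(), label))
    return sorted(spans)


def redact(text: str, spans: list[Span]) -> str:
    """Replace each detected span with a bracketed label placeholder."""
    parts: list[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:  # skip spans that overlap one already redacted
            continue
        parts.append(text[cursor:start])
        parts.append(f"[{label}]")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


def fake_ner(text: str) -> list[Span]:
    """Stand-in for a semantic NER model: it only knows one name."""
    name = "Alice Zhang"
    index = text.find(name)
    return [(index, index + len(name), "PERSON")] if index >= 0 else []


# Usage: a hypothetical user-defined pattern for case numbers.
document = "Alice Zhang, case no. AB-1234"
found = detect(document, fake_ner, {"CASE_ID": r"[A-Z]{2}-\d{4}"})
masked = redact(document, found)
```

Giving the semantic spans priority keeps a user-defined pattern from overriding a label the model already assigned, which is one plausible meaning of "fallback" here.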

Caveats

  • The license is a custom Personal Use License; commercial use, teams, and production deployments require a separate paid license, and some bundled components (LocateAnything-3B weights, PyMuPDF) carry their own non-commercial or copyleft terms.
  • The full vision pipeline recommends 16 GB of VRAM and an NVIDIA GPU; running model services on CPU is treated as a fallback risk, not a supported mode.
  • The README implies a complex local setup—multiple Python virtual environments, WSL path handling, and manual weight downloads—even for the “one-command” startup.

Verdict

Worth a look if you handle sensitive scanned archives, bilingual contracts, or compliance-heavy documents in an air-gapped environment. Skip it if you need a simple SaaS PII scrubber or if wrangling local vLLM and PaddleOCR stacks sounds like someone else’s job.

Frequently asked

What is TracyWang95/DataInfra-RedactionEverything?
It is a local-first document anonymization workbench that finds and masks sensitive content in scanned PDFs, Word files, and images using local vision and language models.
Is DataInfra-RedactionEverything open source?
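The source is publicly available on GitHub, but it ships under a custom Personal Use License rather than a standard open-source license; commercial use, teams, and production deployments require a separate paid license.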
What language is DataInfra-RedactionEverything written in?
TracyWang95/DataInfra-RedactionEverything is primarily written in TypeScript.
How popular is DataInfra-RedactionEverything?
TracyWang95/DataInfra-RedactionEverything has 1k stars on GitHub.
Where can I find DataInfra-RedactionEverything?
TracyWang95/DataInfra-RedactionEverything is on GitHub at https://github.com/TracyWang95/DataInfra-RedactionEverything.
