Is DataInfra-RedactionEverything open source?

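Partly. The source of TracyWang95/DataInfra-RedactionEverything is public on GitHub and tracked on heatdrop, but it ships under a custom Personal Use License, so commercial use, teams, and production deployments require a separate paid license.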

What language is DataInfra-RedactionEverything written in?

TracyWang95/DataInfra-RedactionEverything is primarily written in TypeScript.

How popular is DataInfra-RedactionEverything?

TracyWang95/DataInfra-RedactionEverything has 1k stars on GitHub.

Where can I find DataInfra-RedactionEverything?

TracyWang95/DataInfra-RedactionEverything is on GitHub at https://github.com/TracyWang95/DataInfra-RedactionEverything.

TracyWang95/DataInfra-RedactionEverything

Regex can't see a face. This local redactor can.

It finds and masks sensitive content in messy real-world documents (scanned PDFs, Word files, images) using local vision and language models, because regex never saw a seal.

★ 1k stars · TypeScript · Data Tooling

What it does

RedactionEverything is a local-first document anonymization workbench. It ingests plain text, Word files, images, and scanned PDFs, then detects sensitive entities—names, addresses, faces, seals, signatures, and IDs—using a mix of semantic NER, OCR, and visual grounding models. A built-in review interface lets humans correct detections before batch-exporting redacted packages, keeping raw files inside the local network.
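
The review-then-export step described above can be pictured with a small sketch. This is not the project's code: the `Detection` class, its `accepted` flag, and the `[LABEL]` placeholder format are all hypothetical, chosen only to show how reviewer decisions might feed into the final redacted text. It assumes detections do not overlap.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    start: int             # inclusive character offset
    end: int               # exclusive character offset
    label: str             # illustrative labels such as "NAME" or "ADDRESS"
    accepted: bool = True  # a human reviewer can flip this to False


def redact(text: str, detections: list[Detection]) -> str:
    """Replace every accepted detection with a [LABEL] placeholder."""
    accepted = [d for d in detections if d.accepted]
    # Work right to left so earlier offsets stay valid after each replacement.
    for d in sorted(accepted, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"[{d.label}]" + text[d.end:]
    return text


text = "Contact Zhang San at 12 Elm Road."
name = Detection(8, 17, "NAME")
# The reviewer decides the address is not sensitive and rejects the detection.
address = Detection(21, 32, "ADDRESS", accepted=False)
print(redact(text, [name, address]))  # prints "Contact [NAME] at 12 Elm Road."
```

Replacing from the end of the string backwards avoids recomputing offsets after each substitution, which keeps the sketch short.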

The interesting bit

Instead of relying on brittle regex, it routes documents through a text path (HaS Text semantic NER over OCR output) and a visual path (LocateAnything-3B for faces, cards, and seals, plus OpenCV for red ink edges). A particularly clever pass whitens seal ink to recover text crushed underneath, then deduplicates the results—addressing the reality that Chinese contracts and scanned forms rarely present clean, linear text.
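
A rough sketch of the seal-whitening idea, under stated assumptions: the page does not say how the project implements this pass, so the red-dominance rule, the thresholds, the `OcrHit` type, and the IoU-based deduplication below are all illustrative choices. It uses plain Python lists rather than OpenCV so it stays self-contained. The idea is that pixels that are clearly red get painted white, OCR runs again on the cleaned image, and hits that repeat an earlier hit (same text, heavily overlapping box) are dropped.

```python
from dataclasses import dataclass

Pixel = tuple[int, int, int]       # (R, G, B)
Box = tuple[int, int, int, int]    # (x0, y0, x1, y1)


def whiten_red_ink(image: list[list[Pixel]], min_red: int = 150,
                   margin: int = 60) -> list[list[Pixel]]:
    """Return a copy of the image with red-dominant pixels set to white.

    A pixel counts as seal ink when its red channel is bright and exceeds
    both other channels by `margin` (assumed thresholds, not the project's).
    """
    white = (255, 255, 255)
    return [
        [white if (r >= min_red and r - max(g, b) >= margin) else (r, g, b)
         for (r, g, b) in row]
        for row in image
    ]


@dataclass(frozen=True)
class OcrHit:
    text: str
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0


def dedupe(hits: list[OcrHit], min_iou: float = 0.5) -> list[OcrHit]:
    """Keep the first of any hits that share text and overlap heavily."""
    kept: list[OcrHit] = []
    for hit in hits:
        if not any(hit.text == k.text and iou(hit.box, k.box) >= min_iou
                   for k in kept):
            kept.append(hit)
    return kept
```

Dark text under a seal is usually not red-dominant, so it would survive the whitening in this sketch while the surrounding ink disappears. Real scans would likely need a more tolerant color test than a fixed threshold.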

Key highlights

Runs entirely offline on a local or intranet GPU workstation; no raw files hit a remote API.
Uses semantic NER (HaS Text) as the default detection method, with regex only as a user-defined fallback.
LocateAnything-3B grounds visual presets like faces, fingerprints, signatures, and QR codes, supplemented by an OpenCV detector for red binding seals.
Configurable schemas cover general, legal, finance, and healthcare domains with domain-specific labels.
Batch task management with human-in-the-loop review, progress tracking, and packaged export workflows.
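
To make the "semantic NER first, regex only as a user-defined fallback" highlight concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `Schema` and `Span` classes, the label names, the `MRN-` pattern, and the `fake_ner` stand-in (a real setup would call the HaS Text model, whose interface this page does not describe). Regex matches are kept only where the NER pass found nothing.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str
    source: str  # "ner" or "regex"


@dataclass
class Schema:
    domain: str                        # e.g. "healthcare"
    labels: set[str]                   # labels the NER pass may emit
    fallback_patterns: dict[str, str] = field(default_factory=dict)


NerFn = Callable[[str], list[Span]]


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.end and b.start < a.end


def detect(text: str, schema: Schema, ner: NerFn) -> list[Span]:
    """Run NER first, then add regex matches that NER did not already cover."""
    spans = [s for s in ner(text) if s.label in schema.labels]
    for label, pattern in schema.fallback_patterns.items():
        for m in re.finditer(pattern, text):
            candidate = Span(m.start(), m.end(), label, "regex")
            if not any(overlaps(candidate, s) for s in spans):
                spans.append(candidate)
    return sorted(spans, key=lambda s: s.start)


def fake_ner(text: str) -> list[Span]:
    """Hypothetical stand-in for a semantic NER model."""
    start = text.index("Li Wei")
    return [Span(start, start + len("Li Wei"), "PERSON", "ner")]


schema = Schema(
    domain="healthcare",
    labels={"PERSON", "MEDICAL_RECORD_NO"},
    fallback_patterns={"MEDICAL_RECORD_NO": r"MRN-\d{6}"},
)
spans = detect("Patient Li Wei, record MRN-004217.", schema, fake_ner)
```

Here `spans` holds the NER hit for the name followed by the regex hit for the record number. Giving NER results priority over regex matches mirrors the ordering the highlight describes.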

Caveats

The license is a custom Personal Use License; commercial use, teams, and production deployments require a separate paid license, and some bundled components (LocateAnything-3B weights, PyMuPDF) carry their own non-commercial or copyleft terms.
The full vision pipeline recommends 16 GB of VRAM and an NVIDIA GPU; running model services on CPU is treated as a fallback risk, not a supported mode.
The README implies a complex local setup—multiple Python virtual environments, WSL path handling, and manual weight downloads—even for the “one-command” startup.

Verdict

Worth a look if you handle sensitive scanned archives, bilingual contracts, or compliance-heavy documents in an air-gapped environment. Skip it if you need a simple SaaS PII scrubber or if wrangling local vLLM and PaddleOCR stacks sounds like someone else’s job.
