Data Tooling

newcomers · gaining speed
01
microsoft/markitdown
+1166 ★/day · accelerating

A Python utility that converts PDFs, PowerPoints, Excel files, images, audio, and even YouTube videos into Markdown—optimized for feeding text to language models, not for pretty human reading.

150.4k Python Data Tooling · explained
02
roboflow/supervision
+530 ★/day · accelerating

A model-agnostic Python toolkit that handles the boring parts of computer vision: annotations, dataset juggling, and tracking.

43.6k Python Computer Vision · explained
03
PaddlePaddle/PaddleOCR
+332 ★/day · accelerating

PaddleOCR turns scans and PDFs into structured Markdown or JSON using a tiny vision-language model that punches above its weight class.

81.8k Python Computer Vision · explained
04
firecrawl/firecrawl
+456 ★/day · accelerating

Firecrawl turns the messy web into clean markdown and structured data so your AI agents don't have to squint at HTML.

131.2k TypeScript Data Tooling · explained
06
anomalyco/models.dev
+48 ★/day · accelerating

An open-source, community-maintained database that tracks AI model specs, pricing, and capabilities so you don't have to scrape provider docs.

4.9k TypeScript Other AI · explained
07
OpenDCAI/DataFlow
+37 ★/day · accelerating

DataFlow turns messy PDFs and raw text into training-ready datasets using composable LLM operators and a PyTorch-like pipeline API.

4.8k Python Data Tooling · explained
08
openags/paper-search-mcp
+15 ★/day · accelerating

An MCP server that searches 20+ academic sources and actually tells you when it can't download something instead of hallucinating a PDF.

1.8k Python Coding Assistants · explained
09
jivoi/awesome-ml-for-cybersecurity
+11 ★/day · accelerating

Someone finally collected all the ML-for-security papers, datasets, and books in one place so you don't have to hunt through conference proceedings at 2 AM.

8.9k Learning · explained
10
c2g-dev/city2graph
+9.1 ★/day · accelerating

A Python bridge that turns messy geospatial data—streets, transit feeds, building footprints—into PyTorch Geometric tensors without the usual hand-rolled pain.

1.3k Python Data Tooling · explained
11
kucherenko/jscpd
+6.7 ★/day · accelerating

A 5.7k-star duplication detector rebuilt itself for the agentic era: token-efficient reporters, MCP server, and skills your AI assistant can actually use.

5.8k TypeScript Coding Assistants · explained
12
adbar/trafilatura
+7.4 ★/day · accelerating

A Python tool that turns noisy HTML into clean, structured text for NLP pipelines and research corpora.

6.1k Python Data Tooling · explained
13
memgraph/memgraph
+6.4 ★/day · accelerating

Memgraph wants to be the single database operation your GraphRAG pipeline actually needs.

4.1k C++ RAG · Search · explained
14
hudson-and-thames/mlfinlab
+5.6 ★/day · accelerating

This repo exists solely for bug reports—because the actual code lives behind a paywall.

4.8k Python Domain Apps · explained
15
meizhong986/WhisperJAV
+5.0 ★/day · accelerating

A specialized ASR pipeline that treats JAV audio as an adversarial attack on speech recognition and fights back with scene segmentation, defensive decoding, and surgical audio processing.

1.7k Python Inference · Serving · explained
16
grobidOrg/grobid
+3.0 ★/day · accelerating

A battle-tested Java toolkit that extracts metadata, references, and full text from academic PDFs using a cascade of ML models.

4.9k Java Domain Apps · explained
20
pemistahl/lingua-rs
+1.0 ★/day · accelerating

A Rust library that identifies 75 languages in anything from single words to long documents, with no neural networks or API calls required.

1.1k Rust Data Tooling · explained