Data Tooling

heavyweights · gaining speed
01
microsoft/markitdown
+1166 ★/day · accelerating

A Python utility that converts PDFs, PowerPoints, Excel files, images, audio, and even YouTube videos into Markdown—optimized for feeding text to language models, not for pretty human reading.

150.4k Python Data Tooling · explained
02
roboflow/supervision
+530 ★/day · accelerating

A model-agnostic Python toolkit that handles the boring parts of computer vision: annotations, dataset juggling, and tracking.

43.6k Python Computer Vision · explained
03
PaddlePaddle/PaddleOCR
+332 ★/day · accelerating

PaddleOCR turns scans and PDFs into structured Markdown or JSON using a tiny vision-language model that punches above its weight class.

81.8k Python Computer Vision · explained
04
firecrawl/firecrawl
+456 ★/day · accelerating

Firecrawl turns the messy web into clean markdown and structured data so your AI agents don't have to squint at HTML.

131.2k TypeScript Data Tooling · explained
06
anomalyco/models.dev
+48 ★/day · accelerating

An open-source, community-maintained database that tracks AI model specs, pricing, and capabilities so you don't have to scrape provider docs.

4.9k TypeScript Other AI · explained
07
OpenDCAI/DataFlow
+37 ★/day · accelerating

DataFlow turns messy PDFs and raw text into training-ready datasets using composable LLM operators and a PyTorch-like pipeline API.

4.8k Python Data Tooling · explained
08
opendatalab/MinerU
+132 ★/day · accelerating

MinerU turns messy PDFs, Office files, and images into structured markdown so your RAG pipeline stops choking on scrambled text.

67.2k Python Data Tooling · explained
09
OpenSenseNova/SenseNova-Skills
+118 ★/day · accelerating

SenseNova-Skills bundles concrete office capabilities—slide decks, data analysis, infographics, and deep research—as modular agent plugins you drop into OpenClaw or Hermes.

4.1k Python Agents · explained
10
openags/paper-search-mcp
+15 ★/day · accelerating

An MCP server that searches 20+ academic sources and actually tells you when it can't download something instead of hallucinating a PDF.

1.8k Python Coding Assistants · explained
11
jivoi/awesome-ml-for-cybersecurity
+11 ★/day · accelerating

Someone finally collected all the ML-for-security papers, datasets, and books in one place so you don't have to hunt through conference proceedings at 2 AM.

8.9k Learning · explained
12
ScrapeGraphAI/Scrapegraph-ai
+54 ★/day · accelerating

A Python library that lets you point an LLM at a website and ask for what you want, instead of hand-crafting selectors.

27.1k Python RAG · Search · explained
13
c2g-dev/city2graph
+9.1 ★/day · accelerating

A Python bridge that turns messy geospatial data—streets, transit feeds, building footprints—into PyTorch Geometric tensors without the usual hand-rolled pain.

1.3k Python Data Tooling · explained
14
kucherenko/jscpd
+6.7 ★/day · accelerating

A 5.7k-star duplication detector rebuilt itself for the agentic era: token-efficient reporters, MCP server, and skills your AI assistant can actually use.

5.8k TypeScript Coding Assistants · explained
15
adbar/trafilatura
+7.4 ★/day · accelerating

A Python tool that turns noisy HTML into clean, structured text for NLP pipelines and research corpora.

6.1k Python Data Tooling · explained
16
memgraph/memgraph
+6.4 ★/day · accelerating

Memgraph wants to be the single database operation your GraphRAG pipeline actually needs.

4.1k C++ RAG · Search · explained
17
hudson-and-thames/mlfinlab
+5.6 ★/day · accelerating

This repo exists solely for bug reports—because the actual code lives behind a paywall.

4.8k Python Domain Apps · explained
18
meizhong986/WhisperJAV
+5.0 ★/day · accelerating

A specialized ASR pipeline that treats JAV audio as an adversarial attack on speech recognition and fights back with scene segmentation, defensive decoding, and surgical audio processing.

1.7k Python Inference · Serving · explained
19
OpenBB-finance/OpenBB
+54 ★/day · accelerating

OpenBB is an open-source data integration layer that normalizes financial data sources so quants, analysts, and AI agents don't have to write a new adapter every Monday.

68.9k Python Domain Apps · explained
20
grobidOrg/grobid
+3.0 ★/day · accelerating

A battle-tested Java toolkit that extracts metadata, references, and full text from academic PDFs using a cascade of ML models.

4.9k Java Domain Apps · explained