Data Tooling

newcomers · velocity + momentum
01
microsoft/markitdown
+1166 ★/day · accelerating

A Python utility that converts PDFs, PowerPoints, Excel files, images, audio, and even YouTube videos into Markdown—optimized for feeding text to language models, not for pretty human reading.

150.4k Python Data Tooling · explained
02
roboflow/supervision
+530 ★/day · accelerating

A model-agnostic Python toolkit that handles the boring parts of computer vision: annotations, dataset juggling, and tracking.

43.6k Python Computer Vision · explained
03
firecrawl/firecrawl
+456 ★/day · accelerating

Firecrawl turns the messy web into clean Markdown and structured data so your AI agents don't have to squint at HTML.

131.2k TypeScript Data Tooling · explained
04
PaddlePaddle/PaddleOCR
+332 ★/day · accelerating

PaddleOCR turns scans and PDFs into structured Markdown or JSON using a tiny vision-language model that punches above its weight class.

81.8k Python Computer Vision · explained
06
virgiliojr94/book-to-skill
+142 ★/day · accelerating

This tool turns any PDF or EPUB into a Claude Code skill, so you can query frameworks and patterns from the actual text instead of hallucinating chapter 7.

4.8k Python Coding Assistants · explained
07
OpenSenseNova/SenseNova-Skills
+118 ★/day · accelerating

SenseNova-Skills bundles concrete office capabilities—slide decks, data analysis, infographics, and deep research—as modular agent plugins you drop into OpenClaw or Hermes.

4.1k Python Agents · explained
08
run-llama/liteparse
+118 ★/day · accelerating

run-llama's Rust-core tool extracts text, bounding boxes, and screenshots locally, with an escape hatch to cloud OCR when documents get nasty.

9.8k Rust Data Tooling · explained
09
opendatalab/MinerU
+132 ★/day · accelerating

MinerU turns messy PDFs, Office files, and images into structured Markdown so your RAG pipeline stops choking on scrambled text.

67.2k Python Data Tooling · explained
10
unclecode/crawl4ai
+78 ★/day · cooling

The most-starred crawler on GitHub exists because its creator refused to pay $16 for a bad API.

68.2k Python Data Tooling · explained · Feature
12
docling-project/docling
+64 ★/day · cooling

Docling turns chaotic office documents into structured, AI-ready formats without sending your data to the cloud.

61.3k Python Data Tooling · explained
13
anomalyco/models.dev
+48 ★/day · accelerating

An open-source, community-maintained database that tracks AI model specs, pricing, and capabilities so you don't have to scrape provider docs.

4.9k TypeScript Other AI · explained
14
ScrapeGraphAI/Scrapegraph-ai
+54 ★/day · accelerating

A Python library that lets you point an LLM at a website and ask for what you want, instead of hand-crafting selectors.

27.1k Python RAG · Search · explained
15
OpenBB-finance/OpenBB
+54 ★/day · accelerating

OpenBB is an open-source data integration layer that normalizes financial data sources so quants, analysts, and AI agents don't have to write a new adapter every Monday.

68.9k Python Domain Apps · explained
16
wiltodelta/remove-ai-watermarks
+39 ★/day · cooling

A Python toolkit that reverse-engineers alpha-blended logos, strips C2PA manifests, and diffuses away invisible fingerprints like SynthID.

3.2k Python Computer Vision · explained
17
Kaelio/ktx
+35 ★/day · steady

ktx is a local context layer that ingests your data stack and business knowledge so Claude, Codex, and other agents query warehouses with approved metrics instead of inventing SQL.

1.1k TypeScript Agents · explained
18

A tool that translates scientific papers while keeping formulas, charts, and layout intact.

34.7k Python Other AI · explained
19
OpenDCAI/DataFlow
+37 ★/day · accelerating

DataFlow turns messy PDFs and raw text into training-ready datasets using composable LLM operators and a PyTorch-like pipeline API.

4.8k Python Data Tooling · explained
20
Ontos-AI/knowhere
+29 ★/day · steady

A pipeline that turns messy PDFs and slides into structured, navigable memory for AI agents instead of flat text shards.

1.2k Python RAG · Search · explained