Data Tooling

big names · picking up speed

+526 ★/day ↗ accelerating

A Python utility that converts office documents and media into structured Markdown built for LLM pipelines, not human eyeballs.

★ 169.2k Python Data Tooling · explained Feature

unclecode/crawl4ai

+259 ★/day ↗ accelerating

Built in a fit of rage after a $16 “open source” web-to-Markdown tool gated features behind API tokens.

★ 75.1k Python Data Tooling · explained Feature

PDFMathTranslate/PDFMathTranslate

+36 ★/day ↗ accelerating

It translates scientific papers into bilingual PDFs while keeping formulas, charts, and annotations exactly where they belong.

★ 35.8k Python Other AI · explained

firecrawl/firecrawl

+475 ★/day ↗ accelerating

Firecrawl turns search, scraping, and browser interaction into a single API so your agents can read the web without wrestling with proxies, rate limits, or JavaScript rendering.

★ 156.5k TypeScript Data Tooling · explained Feature

yamadashy/repomix

+29 ★/day ↗ accelerating

Repomix exists because copy-pasting twenty files into a chat window is a terrible way to ask an LLM for help.

★ 27.4k TypeScript Data Tooling · explained

fighting41love/funNLP

+25 ★/day ↗ accelerating

A maintainer cataloged every Chinese NLP repo they touched into a single, obsessively categorized list so others wouldn’t have to hunt.

★ 82.1k Python Learning · explained

PicoTrex/Awesome-Nano-Banana-images

+9.9 ★/day ↗ accelerating

It collects the best social-media experiments with a Gemini-2.5-flash-image derivative and releases a 150k identity-consistent dataset for the community.

★ 23.3k Image · Video · Audio · explained

chroma-core/chroma

+7.9 ★/day → steady

Chroma is an open-source search backend that handles the messy embedding pipeline so AI applications can store and retrieve documents with a minimal API.

★ 28.9k Rust RAG · Search · explained

HumanSignal/label-studio

+6.9 ★/day → steady

Because someone has to label the training data, and it might as well not be in a spreadsheet.

★ 27.9k TypeScript Data Tooling · explained

ScrapeGraphAI/Scrapegraph-ai

+23 ★/day ↘ cooling

ScrapeGraphAI lets you extract structured data from websites and documents by describing what you want in plain English, leaving the LLM to wrestle with the markup.

★ 28.6k Python RAG · Search · explained

PaddlePaddle/PaddleOCR

+68 ★/day ↘ cooling

It turns images and PDFs into structured JSON and Markdown so your RAG pipeline doesn't have to squint.

★ 86.3k Python Computer Vision · explained Feature

toon-format/toon

+9.9 ★/day ↘ cooling

It re-encodes JSON into a token-cheaper, schema-explicit format so you can fit more context into LLM prompts without losing structure.

★ 25k TypeScript LLMOps · Eval · explained

OpenBB-finance/OpenBB

+38 ★/day ↘ cooling

OpenBB normalizes proprietary and public financial data so engineers can feed the same sources to Python scripts, REST APIs, Excel, and AI agents without rebuilding integrations.

★ 71k Python Domain Apps · explained

roboflow/supervision

+34 ★/day ↘ cooling

It exists to handle the tedious wiring—annotations, dataset formats, tracking—that sits between a trained model and a useful application.

★ 48.4k Python Computer Vision · explained Feature

academic/awesome-datascience

+5.3 ★/day ↘ cooling

A curated awesome-list that tries to answer "What is Data Science, and what should I study?" by cataloging courses, tools, libraries, and communities in a single sprawling index.

★ 29.7k Learning · explained

opendataloader-project/opendataloader-pdf

+57 ★/day ↘ cooling

OpenDataLoader PDF exists to extract structured data from PDFs for AI pipelines while auto-tagging untagged documents for screen readers, all without proprietary dependencies.

★ 27.9k Java Data Tooling · explained

docling-project/docling

+49 ★/day ↘ cooling

Docling turns PDFs, Office files, images, and even audio into structured AI-ready formats, entirely on your own hardware.

★ 63.8k Python Data Tooling · explained

opendatalab/MinerU

+96 ★/day ↘ cooling

MinerU turns PDFs, Office files, and images into structured Markdown and JSON so LLM agents don’t drown in layout noise.

★ 75.8k Python Data Tooling · explained

google/langextract

+45 ★/day ↘ cooling

LangExtract exists because asking an LLM to pull names and dates out of a report is easy; proving exactly which sentence each came from is the hard part.

★ 37.9k Python Data Tooling · explained