A Python utility that converts PDFs, PowerPoints, Excel files, images, audio, and even YouTube videos into Markdown—optimized for feeding text to language models, not for pretty human reading.
Data Tooling
heavyweights · gaining speed
A model-agnostic Python toolkit that handles the boring parts of computer vision: annotations, dataset juggling, and tracking.
PaddleOCR turns scans and PDFs into structured Markdown or JSON using a tiny vision-language model that punches above its weight class.
Firecrawl turns the messy web into clean Markdown and structured data so your AI agents don't have to squint at HTML.
An open-source tool that extracts structured data from PDFs and auto-tags them for accessibility, backed by benchmark claims and a collaboration with the PDF Association.
An open-source, community-maintained database that tracks AI model specs, pricing, and capabilities so you don't have to scrape provider docs.
DataFlow turns messy PDFs and raw text into training-ready datasets using composable LLM operators and a PyTorch-like pipeline API.
MinerU turns messy PDFs, Office files, and images into structured Markdown so your RAG pipeline stops choking on scrambled text.
SenseNova-Skills bundles concrete office capabilities—slide decks, data analysis, infographics, and deep research—as modular agent plugins you drop into OpenClaw or Hermes.
An MCP server that searches 20+ academic sources and actually tells you when it can't download something instead of hallucinating a PDF.
Someone finally collected all the ML-for-security papers, datasets, and books in one place so you don't have to hunt through conference proceedings at 2 AM.
A Python library that lets you point an LLM at a website and ask for what you want, instead of hand-crafting selectors.
A Python bridge that turns messy geospatial data—streets, transit feeds, building footprints—into PyTorch Geometric tensors without the usual hand-rolled pain.
A 5.7k-star duplication detector rebuilt itself for the agentic era: token-efficient reporters, MCP server, and skills your AI assistant can actually use.
A Python tool that turns noisy HTML into clean, structured text for NLP pipelines and research corpora.
Memgraph wants to be the single database operation your GraphRAG pipeline actually needs.
This repo exists solely for bug reports—because the actual code lives behind a paywall.
A specialized ASR pipeline that treats JAV audio as an adversarial attack on speech recognition and fights back with scene segmentation, defensive decoding, and surgical audio processing.
OpenBB is an open-source data integration layer that normalizes financial data sources so quants, analysts, and AI agents don't have to write a new adapter every Monday.
A battle-tested Java toolkit that extracts metadata, references, and full text from academic PDFs using a cascade of ML models.


