A Python utility that converts PDFs, PowerPoints, Excel files, images, audio, and even YouTube videos into Markdown—optimized for feeding text to language models, not for pretty human reading.
Data Tooling
newcomers · gaining speed
A model-agnostic Python toolkit that handles the boring parts of computer vision: annotations, dataset juggling, and tracking.
PaddleOCR turns scans and PDFs into structured Markdown or JSON using a tiny vision-language model that punches above its weight class.
Firecrawl turns the messy web into clean markdown and structured data so your AI agents don't have to squint at HTML.
Open-source tool extracts structured data from PDFs and auto-tags them for accessibility, backed by benchmark claims and PDF Association collaboration.
An open-source, community-maintained database that tracks AI model specs, pricing, and capabilities so you don't have to scrape provider docs.
DataFlow turns messy PDFs and raw text into training-ready datasets using composable LLM operators and a PyTorch-like pipeline API.
An MCP server that searches 20+ academic sources and actually tells you when it can't download something instead of hallucinating a PDF.
Someone finally collected all the ML-for-security papers, datasets, and books in one place so you don't have to hunt through conference proceedings at 2 AM.
A Python bridge that turns messy geospatial data—streets, transit feeds, building footprints—into PyTorch Geometric tensors without the usual hand-rolled pain.
A 5.7k-star duplication detector rebuilt itself for the agentic era: token-efficient reporters, MCP server, and skills your AI assistant can actually use.
A Python tool that turns noisy HTML into clean, structured text for NLP pipelines and research corpora.
Memgraph wants to be the single database operation your GraphRAG pipeline actually needs.
This repo exists solely for bug reports—because the actual code lives behind a paywall.
A specialized ASR pipeline that treats JAV audio as an adversarial attack on speech recognition and fights back with scene segmentation, defensive decoding, and surgical audio processing.
A battle-tested Java toolkit that extracts metadata, references, and full text from academic PDFs using a cascade of ML models.
An opinionated reading list, arXiv radar, and link farm for anyone trying to break into quantitative trading—especially in Chinese markets.
A curated index of datasets, APIs, and AI competitions specifically for game research.
Go CLI that replaces guesswork with an interactive picker for model quantizations, branches, and diffusers components.
Rust library that identifies 75 languages from single words up to long documents, no neural networks or API calls required.



