A Python utility that converts PDFs, PowerPoints, Excel files, images, audio, and even YouTube videos into Markdown—optimized for feeding text to language models, not for pretty human reading.
Data Tooling
A model-agnostic Python toolkit that handles the boring parts of computer vision: annotations, dataset juggling, and tracking.
Firecrawl turns the messy web into clean Markdown and structured data so your AI agents don't have to squint at HTML.
PaddleOCR turns scans and PDFs into structured Markdown or JSON using a tiny vision-language model that punches above its weight class.
Open-source tool extracts structured data from PDFs and auto-tags them for accessibility, backed by benchmark claims and PDF Association collaboration.
This tool turns any PDF or EPUB into a Claude Code skill, so you can query frameworks and patterns from the actual text instead of hallucinating chapter 7.
SenseNova-Skills bundles concrete office capabilities—slide decks, data analysis, infographics, and deep research—as modular agent plugins you drop into OpenClaw or Hermes.
Run-LLama's Rust-core tool extracts text, bounding boxes, and screenshots locally, with an escape hatch to cloud OCR when documents get nasty.
MinerU turns messy PDFs, Office files, and images into structured Markdown so your RAG pipeline stops choking on scrambled text.
The most-starred crawler on GitHub exists because its creator refused to pay $16 for a bad API.
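A single Markdown file that lets Claude Code look up candlestick (K-line) charts, research reports, and the Dragon-Tiger List (龙虎榜), with zero third-party dependencies.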
Docling turns chaotic office documents into structured, AI-ready formats without sending your data to the cloud.
An open-source, community-maintained database that tracks AI model specs, pricing, and capabilities so you don't have to scrape provider docs.
A Python library that lets you point an LLM at a website and ask for what you want, instead of hand-crafting selectors.
OpenBB is an open-source data integration layer that normalizes financial data sources so quants, analysts, and AI agents don't have to write a new adapter every Monday.
A Python toolkit that reverse-engineers alpha-blended logos, strips C2PA manifests, and diffuses away invisible fingerprints like SynthID.
ktx is a local context layer that ingests your data stack and business knowledge so Claude, Codex, and other agents query warehouses with approved metrics instead of inventing SQL.
A tool with 34K+ stars that translates scientific papers into other languages while keeping formulas, charts, and layout intact.
DataFlow turns messy PDFs and raw text into training-ready datasets using composable LLM operators and a PyTorch-like pipeline API.
A pipeline that turns messy PDFs and slides into structured, navigable memory for AI agents instead of flat text shards.

