A Python utility that converts PDFs, PowerPoints, Excel files, images, audio, and even YouTube videos into Markdown—optimized for feeding text to language models, not for pretty human reading.
Data Tooling
heavyweights · velocity + momentum
Firecrawl turns the messy web into clean markdown and structured data so your AI agents don't have to squint at HTML.
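A single Markdown file that lets Claude Code look up candlestick (K-line) data, research reports, and the Dragon-Tiger list (龙虎榜), with zero third-party dependencies.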
This tool turns any PDF or EPUB into a Claude Code skill, so you can query frameworks and patterns from the actual text instead of hallucinating chapter 7.
LangExtract turns wall-of-text documents into structured, verifiable data by making the LLM show its work.
A new serialization format that trades braces for whitespace and turns uniform arrays into schema-aware tables, cutting token counts by ~40% without losing the JSON data model.
The most-starred crawler on GitHub exists because its creator refused to pay $16 for a bad API.
Run-LLama's Rust-core tool extracts text, bounding boxes, and screenshots locally, with an escape hatch to cloud OCR when documents get nasty.
Docling turns chaotic office documents into structured, AI-ready formats without sending your data to the cloud.
A community-curated gallery showing off the weird, useful, and surprisingly specific tricks possible with Gemini-2.5-flash-image.
SenseNova-Skills bundles concrete office capabilities—slide decks, data analysis, infographics, and deep research—as modular agent plugins you drop into OpenClaw or Hermes.
MinerU turns messy PDFs, Office files, and images into structured markdown so your RAG pipeline stops choking on scrambled text.
A CLI that points Playwright at any URL and emits Tailwind configs, Figma variables, shadcn themes, and even graded report cards.
An open-source tool that extracts structured data from PDFs and auto-tags them for accessibility, backed by benchmark claims and a collaboration with the PDF Association.
A tool that translates scientific papers while keeping formulas, charts, and layout intact; the project has 34K+ stars.
Pathway lets you write ETL pipelines in Python, then executes them in a Rust engine built on Differential Dataflow.
A Python toolkit that reverse-engineers alpha-blended logos, strips C2PA manifests, and diffuses away invisible fingerprints like SynthID.
ktx is a local context layer that ingests your data stack and business knowledge so Claude, Codex, and other agents query warehouses with approved metrics instead of inventing SQL.
Repomix collapses entire codebases into a single AI-friendly file, because context windows are hungry and copy-pasting is undignified.
OpenLake wants storage to bypass the host entirely and land straight in GPU memory.

