A Python utility that converts PDFs, PowerPoints, Excel files, images, audio, and even YouTube videos into Markdown—optimized for feeding text to language models, not for pretty human reading.
Data Tooling
A single Markdown file that lets Claude Code look up candlestick (K-line) charts, research reports, and the Dragon-Tiger list (龙虎榜), with zero third-party dependencies.
Firecrawl turns the messy web into clean markdown and structured data so your AI agents don't have to squint at HTML.
This tool turns any PDF or EPUB into a Claude Code skill, so you can query frameworks and patterns from the actual text instead of hallucinating chapter 7.
A new serialization format that trades braces for whitespace and turns uniform arrays into schema-aware tables, cutting token counts by ~40% without losing the JSON data model.
LangExtract turns wall-of-text documents into structured, verifiable data by making the LLM show its work.
Run-LLama's Rust-core tool extracts text, bounding boxes, and screenshots locally, with an escape hatch to cloud OCR when documents get nasty.
The most-starred crawler on GitHub exists because its creator refused to pay $16 for a bad API.
Docling turns chaotic office documents into structured, AI-ready formats without sending your data to the cloud.
A community-curated gallery showing off the weird, useful, and surprisingly specific tricks possible with Gemini-2.5-flash-image.
SenseNova-Skills bundles concrete office capabilities—slide decks, data analysis, infographics, and deep research—as modular agent plugins you drop into OpenClaw or Hermes.
MinerU turns messy PDFs, Office files, and images into structured markdown so your RAG pipeline stops choking on scrambled text.
A CLI that points Playwright at any URL and emits Tailwind configs, Figma variables, shadcn themes, and even graded report cards.
Open-source tool extracts structured data from PDFs and auto-tags them for accessibility, backed by benchmark claims and PDF Association collaboration.
A tool that translates scientific papers while keeping formulas, charts, and layout intact, with 34K+ stars to its name.
A Python toolkit that reverse-engineers alpha-blended logos, strips C2PA manifests, and diffuses away invisible fingerprints like SynthID.
Pathway lets you write ETL pipelines in Python, then executes them in a Rust engine built on Differential Dataflow.
ktx is a local context layer that ingests your data stack and business knowledge so Claude, Codex, and other agents query warehouses with approved metrics instead of inventing SQL.
OpenLake wants storage to bypass the host entirely and land straight in GPU memory.
Repomix collapses entire codebases into a single AI-friendly file, because context windows are hungry and copy-pasting is undignified.

