A web scraper that speaks fluent LLM
Prefix a URL with https://r.jina.ai/ and get back clean markdown your language model can actually use.

What it does
Reader is a hosted service (with an open-source core) that fetches web pages, PDFs, Office documents, and even images, then converts them into LLM-friendly markdown. The simplest use is almost insultingly simple: prepend https://r.jina.ai/ to any URL. There’s also a search endpoint at s.jina.ai that runs a web query, fetches the top five results, and converts those too — not just search snippets, but the full page content.
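To make the prefix trick concrete, here is a minimal sketch in Python using only the standard library. The helper names `reader_url` and `fetch_markdown` are made up for illustration, and treating the response body as UTF-8 text is an assumption rather than something the README is quoted on here.

```python
from urllib.request import Request, urlopen

# The prefix comes straight from the description above.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Build the Reader URL by prepending the prefix to the target URL."""
    return READER_PREFIX + target_url


def fetch_markdown(target_url: str, timeout: float = 30.0) -> str:
    """Fetch the converted markdown for a page (hypothetical helper, needs network)."""
    request = Request(reader_url(target_url))
    with urlopen(request, timeout=timeout) as response:
        # Assumption: the body is UTF-8 encoded markdown.
        return response.read().decode("utf-8")
```

Calling `fetch_markdown("https://example.com")` should hand back the page as markdown, subject to whatever rate limits the hosted API applies.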
The interesting bit
The real craft is in the control surface. Reader exposes a dense grid of HTTP headers: x-respond-timing lets you trade latency against completeness (return as soon as the raw HTML arrives, or wait for network idle), x-max-tokens truncates the output to fit your context window, and x-token-budget rejects the request if the result would exceed the budget. It’s a scraper designed by people who’ve actually hit token limits and paid for overruns.
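As a rough sketch of how those headers might be attached to a request, the Python below builds the header set and a `urllib` request object. The three header names are the ones mentioned above; the helper names are hypothetical, and the accepted values for x-respond-timing (for example, something like "network-idle") are an assumption to check against the README.

```python
from typing import Dict, Optional
from urllib.request import Request


def reader_headers(
    respond_timing: Optional[str] = None,
    max_tokens: Optional[int] = None,
    token_budget: Optional[int] = None,
) -> Dict[str, str]:
    """Collect only the control headers the caller actually set."""
    headers: Dict[str, str] = {}
    if respond_timing is not None:
        # Assumption: the value is a short keyword; see the README for the real list.
        headers["x-respond-timing"] = respond_timing
    if max_tokens is not None:
        headers["x-max-tokens"] = str(max_tokens)      # truncate to this many tokens
    if token_budget is not None:
        headers["x-token-budget"] = str(token_budget)  # reject if result is larger
    return headers


def build_reader_request(target_url: str, **controls) -> Request:
    """Build (but do not send) a Reader request with the chosen controls."""
    return Request("https://r.jina.ai/" + target_url, headers=reader_headers(**controls))
```

Passing the result to `urllib.request.urlopen` would send it. Keeping request construction separate from sending makes it easy to inspect the headers before spending any quota.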
Key highlights
- Dual fetching: headless Chrome for JS-heavy sites, curl-impersonate for lightweight pages, with auto-selection between them
- PDF/Word/Excel/PowerPoint ingestion via PDF.js and LibreOffice, plus direct file upload by POST since December 2025
- Image captioning via vision-language model so text-only LLMs get hints about visual content
- Optional MinIO/S3 bucket caching in the open-source branch; the SaaS adds MongoDB-backed storage not included here
- Semantic markdown chunking via headers or block-level splits, for direct feeding into RAG pipelines
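For a feel of what header-based chunking means in practice, here is a small local sketch in Python. It is not Reader's implementation: the function name `chunk_by_headers` is hypothetical, and it only splits on ATX-style headings (lines starting with one to six `#` characters), so a `#` comment inside a fenced code block would be mistaken for a heading.

```python
import re
from typing import List

# Matches markdown ATX headings such as "# Title" or "### Section".
HEADING = re.compile(r"^#{1,6}\s")


def chunk_by_headers(markdown: str) -> List[str]:
    """Split markdown into chunks, starting a new chunk at every heading."""
    chunks: List[str] = []
    current: List[str] = []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            # A new heading closes the chunk collected so far.
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that were only whitespace.
    return [chunk for chunk in chunks if chunk]
```

Each chunk keeps its heading as the first line, which tends to help retrieval in a RAG pipeline because the heading carries the topic of the text beneath it.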
Caveats
- The open-source branch is stateless by default; the MongoDB storage layer from the hosted service is stripped out
- Rate limits apply on the free hosted API, though the README is vague on exact numbers
Verdict
Anyone building agents or RAG systems that need to ingest the live web should look here. If your use case is already covered by static documents in S3, it’s probably overkill.