jina-ai/reader

A web scraper that speaks fluent LLM

Prefix a URL with https://r.jina.ai/ and get back clean markdown your language model can actually use.

11.1k stars · TypeScript · RAG · Search · Data Tooling · LLMOps · Eval
Velocity (7d): +14 ★ / day · Trend: steady

What it does

Reader is a hosted service (with an open-source core) that fetches web pages, PDFs, Office documents, and even images, then converts them into LLM-friendly markdown. The simplest use is almost insultingly simple: prepend https://r.jina.ai/ to any URL. There’s also a search endpoint at s.jina.ai that runs a web query, fetches the top five results, and converts those too, returning the full page content rather than just search snippets.
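
The prefix trick is easy to sketch in a few lines of Python. This is a minimal illustration, not Reader's client library: the helper names `reader_url`, `search_url`, and `fetch_markdown` are hypothetical, and the assumption that a search query travels percent-encoded in the URL path is a guess at the request shape that should be checked against the README. The hosted endpoints may also require an API key or apply rate limits.

```python
from urllib.parse import quote
from urllib.request import urlopen

READER_BASE = "https://r.jina.ai/"   # endpoint named in the write-up
SEARCH_BASE = "https://s.jina.ai/"   # endpoint named in the write-up


def reader_url(target: str) -> str:
    """Prefix a target URL with the Reader endpoint (hypothetical helper)."""
    return READER_BASE + target


def search_url(query: str) -> str:
    """Build a search URL.

    Assumption: the query is percent-encoded and appended to the path.
    """
    return SEARCH_BASE + quote(query, safe="")


def fetch_markdown(url: str, timeout: float = 30.0) -> str:
    """Fetch a Reader or search URL and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Performs a real network request when run as a script.
    print(fetch_markdown(reader_url("https://example.com"))[:500])
```

Keeping URL construction separate from the network call makes the prefixing logic testable offline.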

The interesting bit

The real craft is in the control surface. Reader exposes a dense set of HTTP headers: x-respond-timing lets you trade latency against completeness (return as soon as the raw HTML arrives, or wait for network idle), x-max-tokens truncates the output to fit your context window, and x-token-budget rejects the request if the result would exceed the budget. It’s a scraper designed by people who’ve actually hit token limits and paid for overruns.
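
As a rough sketch of how those headers might be attached to a request: the three header names come from the description above, but the helper names are hypothetical, and the `"network-idle"` value in the usage note is an assumption inferred from the phrase "wait for network idle" rather than a confirmed setting. The README is the authority on accepted values.

```python
from typing import Dict, Optional
from urllib.request import Request, urlopen


def reader_headers(
    respond_timing: Optional[str] = None,
    max_tokens: Optional[int] = None,
    token_budget: Optional[int] = None,
) -> Dict[str, str]:
    """Collect only the control headers the caller actually set."""
    headers: Dict[str, str] = {}
    if respond_timing is not None:
        # Accepted values are not listed in the write-up; pass them through as-is.
        headers["x-respond-timing"] = respond_timing
    if max_tokens is not None:
        headers["x-max-tokens"] = str(max_tokens)      # truncate the output
    if token_budget is not None:
        headers["x-token-budget"] = str(token_budget)  # reject if over budget
    return headers


def fetch_with_headers(target: str, headers: Dict[str, str], timeout: float = 30.0) -> str:
    """Request a page through Reader with the given control headers."""
    request = Request("https://r.jina.ai/" + target, headers=headers)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A call such as `fetch_with_headers("https://example.com", reader_headers(respond_timing="network-idle", max_tokens=4000))` would wait for a fuller render and cap the output, if the assumed value is valid. A request rejected over its token budget would presumably surface as a `urllib.error.HTTPError`; the write-up does not say which status code is returned.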

Key highlights

  • Dual fetching: headless Chrome for JS-heavy sites, curl-impersonate for lightweight pages, with auto-selection between them
  • PDF/Word/Excel/PowerPoint ingestion via PDF.js and LibreOffice, plus direct file upload by POST since December 2025
  • Image captioning via vision-language model so text-only LLMs get hints about visual content
  • Optional MinIO/S3 bucket caching in the open-source branch; the SaaS adds MongoDB-backed storage not included here
  • Semantic markdown chunking via headers or block-level splits, for direct feeding into RAG pipelines
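
To make the last highlight concrete, here is a minimal client-side sketch of header-based chunking. It is not Reader's implementation: `chunk_by_headers` is a hypothetical helper that shows the general idea of cutting markdown at heading lines so each chunk carries its own heading as context.

```python
import re
from typing import List

# ATX-style markdown heading: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s")


def chunk_by_headers(markdown: str) -> List[str]:
    """Split markdown into chunks, starting a new chunk at every heading line.

    Simplification: '#' lines inside fenced code blocks are treated as
    headings too, which a production splitter would need to avoid.
    """
    chunks: List[str] = []
    current: List[str] = []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]  # drop empty chunks
```

Each resulting chunk can then be embedded or indexed separately in a RAG pipeline.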

Caveats

  • The open-source branch is stateless by default; the MongoDB storage layer from the hosted service is stripped out
  • Rate limits apply on the free hosted API, though the README is vague on exact numbers

Verdict

Anyone building agents or RAG systems that need to ingest the live web should look here. If your use case is already covered by static documents in S3, it’s probably overkill.
