evanhu1/talk2arxiv

Turn any arXiv URL into a research chatbot with a five-character prefix

A RAG pipeline that lets you interrogate academic papers by swapping 'arxiv' for 'talk2arxiv' in the URL.

527 stars · TypeScript · RAG · Search · Chat Assistants
Velocity (7d): +0.6 ★/day · Trend: steady

What it does

Talk2Arxiv ingests any arXiv PDF, chunks it by logical section (abstract, intro, etc.) with recursive subdivision as a fallback, embeds the chunks via Cohere, and stores them in Qdrant. Users then chat with the paper through a Next.js frontend. The whole thing is triggered by prepending 'talk2' to an arxiv.org URL, a neat URL hack that sidesteps upload forms entirely.
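To make the URL trick concrete, here is a minimal Python sketch of how a talk2arxiv.org address could be mapped back to the arXiv PDF it points at. The function name `arxiv_pdf_url` and the path handling are assumptions for illustration; the project's own routing is not described here and may work differently.

```python
from urllib.parse import urlparse


def arxiv_pdf_url(talk2_url: str) -> str:
    """Map a talk2arxiv.org paper URL to the matching arxiv.org PDF URL (hypothetical helper)."""
    parsed = urlparse(talk2_url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host != "talk2arxiv.org":
        raise ValueError(f"not a talk2arxiv URL: {talk2_url}")

    # Assumed path shapes: /abs/<id> or /pdf/<id>[.pdf]; keep only the paper ID.
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2 or parts[0] not in ("abs", "pdf"):
        raise ValueError(f"unrecognised path: {parsed.path}")

    # Old-style IDs such as hep-th/9901001 contain a slash, so rejoin the rest.
    paper_id = "/".join(parts[1:])
    if paper_id.endswith(".pdf"):
        paper_id = paper_id[: -len(".pdf")]
    return f"https://arxiv.org/pdf/{paper_id}"
```

Calling `arxiv_pdf_url("https://talk2arxiv.org/abs/1706.03762")` gives the PDF address a backend would then download and parse.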

The interesting bit

The chunking strategy is the quietly thoughtful part: it tries semantic boundaries first, then recursively halves down to 128-character chunks when sections run long. The README also notes that the authors are eyeing a move to LaTeX source extraction, which would sidestep PDF parsing headaches for math formulas, a common RAG weak spot that most projects paper over.
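The section-first, halve-as-fallback idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: `split_recursive` and `chunk_paper` are hypothetical names, sections are assumed to arrive as a name-to-text mapping, and the same 128-character limit is used both as the "too long" threshold and as the chunk ceiling, since the exact threshold is not given.

```python
def split_recursive(text: str, max_chars: int = 128) -> list[str]:
    """Halve `text` until every piece is at most `max_chars` characters long."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    mid = len(text) // 2
    # Prefer the last space at or before the midpoint so words stay whole.
    cut = text.rfind(" ", 0, mid + 1)
    if cut <= 0:
        cut = mid  # no usable space: fall back to a hard cut at the midpoint
    return split_recursive(text[:cut], max_chars) + split_recursive(text[cut:], max_chars)


def chunk_paper(sections: dict[str, str], max_chars: int = 128) -> list[dict[str, str]]:
    """Chunk by logical section first; subdivide only the sections that are too long."""
    chunks = []
    for name, body in sections.items():
        for piece in split_recursive(body, max_chars):
            chunks.append({"section": name, "text": piece})
    return chunks
```

A short abstract comes back as a single chunk tagged with its section name, while a long introduction is split into several pieces that all carry the `"intro"` label, so retrieval can still report where in the paper a passage came from.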

Key highlights

  • URL-based activation: arxiv.org → talk2arxiv.org (no file uploads, no search)
  • GROBID for PDF parsing, Cohere EmbedV3 for embeddings, Qdrant for vector storage
  • Papers are cached after first embedding; repeat visits to the same paper skip re-embedding
  • Reranking step between retrieval and generation for relevance tuning
  • Open-source frontend + separate Flask backend
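
The cache-then-retrieve-then-rerank flow in the highlights can be illustrated with a small offline Python sketch. Everything here is a stand-in: `toy_embed` replaces the Cohere embedding call, the in-memory dictionary replaces Qdrant, and `toy_rerank` replaces the reranking model, so the scores are meaningless beyond showing the shape of the pipeline. All function names are hypothetical.

```python
import math
from collections import Counter

_index_cache = {}  # paper_id -> list of (chunk_text, vector); stand-in for a vector store


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def get_index(paper_id: str, chunks: list[str]) -> list:
    """Embed a paper's chunks once; later calls for the same ID reuse the cached vectors."""
    if paper_id not in _index_cache:
        _index_cache[paper_id] = [(c, toy_embed(c)) for c in chunks]
    return _index_cache[paper_id]


def toy_rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    """Stand-in reranker: order candidates by how many query terms they contain."""
    terms = set(query.lower().split())
    return sorted(candidates, key=lambda c: len(terms & set(c.lower().split())), reverse=True)[:top_n]


def retrieve(paper_id: str, chunks: list[str], query: str, k: int = 4, top_n: int = 2) -> list[str]:
    index = get_index(paper_id, chunks)
    q = toy_embed(query)
    # Stage 1: coarse vector search keeps the k nearest chunks.
    nearest = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)[:k]
    # Stage 2: the reranker reorders that shortlist before it reaches the generator.
    return toy_rerank(query, [text for text, _ in nearest], top_n)
```

The point of the two stages is that the first is cheap and broad while the second is more selective but only sees a handful of candidates; the cache means the embedding step runs once per paper rather than once per question.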

Caveats

  • Backend is explicitly single-threaded; concurrent requests queue behind one another and can stall the service
  • No mention of cost controls or rate limits on the Cohere/Qdrant calls
  • Math and non-standard text elements remain a known weakness until LaTeX source extraction lands

Verdict

Grad students who want to sanity-check a paper without reading all 47 pages should try the URL trick. Anyone needing reliable uptime or concurrent team access should wait for a backend that handles concurrency, or self-host and patch the threading themselves.
