← all repositories
neuml/paperai

Bulk LLM reports over scientific paper corpora

A configuration-driven tool that runs hundreds of RAG prompts against indexed research papers and spits out structured reports.

1.8k stars Python RAG · SearchDomain Apps
paperai
Velocity · 7d
+0.8
★ / day
Trend
steady
star history

What it does

paperai ingests scientific paper databases (pre-built with its sibling project paperetl), indexes them with vector embeddings, then runs bulk LLM inference via YAML configuration files. You define columns, queries, and prompt templates; it generates Markdown, CSV, or annotated PDF reports across the top-N matching articles. Think of it as a spreadsheet where some columns are filled by metadata and others by a medical LLM reading the relevant sections for you.

The interesting bit

The “generated columns” feature is the clever hook: each column can run its own retrieval query to rank document sections, then pipe the top matches into a RAG pipeline with a custom LLM prompt. This lets one report ask “what’s the sample size?” while another column independently asks “list possible causes” — all against the same paper, with context windows managed automatically.

Key highlights

  • Built on txtai embeddings + SQLite + LLM; not reinventing the wheel, wiring it together for research workflows
  • YAML report schemas define bulk operations: vector queries, standard metadata columns, and dynamic LLM-generated columns
  • Outputs Markdown, CSV, or annotated PDFs with answers overlaid on originals
  • Ships with CLI shell, single-query runner, and report builder entry points
  • Docker support; Python 3.10+; Colab notebooks for medical research examples

Caveats

  • Requires a separate paperetl step to build the input database; not a one-command end-to-end pipeline
  • The README’s architecture diagram has light/dark mode variants that may render oddly depending on GitHub’s current theme

Verdict

Worth a look if you’re doing systematic literature reviews, meta-analyses, or any repeatable extraction over large paper corpora. Skip it if you just need ad-hoc chat-with-PDF; this is for batch operations at scale.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.