karpathy/arxiv-sanity-lite

Your personal arXiv bouncer, trained on TF-IDF and $5/month

A dead-simple paper recommender that learns what you like from the abstracts you tag, then emails you daily so you stop drowning in preprints.

1.6k stars · Python · RAG · Search · LLMOps · Eval
Velocity (7d): +1.0 ★ / day · Trend: steady

What it does

arxiv-sanity-lite polls the arXiv API, downloads paper metadata, and lets you tag whatever catches your eye. For each tag, it trains an SVM on TF-IDF vectors of paper abstracts and surfaces similar new papers in a Flask web UI. You can search, sort, and filter results; a daily cron job can email you fresh recommendations via SendGrid. The whole thing runs on a $5/month Linode Nanode while indexing about 30K papers.
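To make the per-tag "SVM over TF-IDF" idea concrete, here is a minimal from-scratch sketch in plain Python. It is not the project's code: the whitespace tokenizer, the hyperparameters, and the toy abstracts are all assumptions chosen for illustration, and a real deployment would more likely lean on a library such as scikit-learn (TfidfVectorizer, LinearSVC).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn raw texts into L2-normalized TF-IDF dicts (term -> weight)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF keeps every weight strictly positive.
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def dot(weights, vec):
    """Sparse dot product between a weight dict and a TF-IDF dict."""
    return sum(weights.get(t, 0.0) * w for t, w in vec.items())

def train_linear_svm(vectors, labels, epochs=200, lr=0.1, reg=0.01):
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    weights, bias = {}, 0.0
    n = len(vectors)
    for _ in range(epochs):
        # Only examples inside the margin contribute to the subgradient.
        violators = [(v, y) for v, y in zip(vectors, labels)
                     if y * (dot(weights, v) + bias) < 1.0]
        for t in weights:
            weights[t] *= 1.0 - lr * reg  # shrinkage from the L2 penalty
        for v, y in violators:
            for t, w in v.items():
                weights[t] = weights.get(t, 0.0) + lr * y * w / n
            bias += lr * y / n
    return weights, bias

# Toy data (made up): the first two abstracts carry the user's tag (+1),
# the next two are untagged background papers (-1), the last two are "new".
abstracts = [
    "attention layers improve transformer language models",
    "scaling transformer language models with sparse attention",
    "protein folding dynamics in molecular simulation",
    "galaxy cluster mass from weak lensing surveys",
    "sparse attention for long context language models",
    "weak lensing constraints on galaxy cluster mass",
]
labels = [1, 1, -1, -1]

vectors = tfidf_vectors(abstracts)
weights, bias = train_linear_svm(vectors[:4], labels)
scores = {text: dot(weights, vec) + bias
          for text, vec in zip(abstracts[4:], vectors[4:])}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here `ranked` puts the new paper that shares vocabulary with the tagged abstracts first. Because the model is a single weight per term, retraining one classifier per tag stays cheap even on a small server.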

The interesting bit

The heavy lifting is classic ML: SVMs over TF-IDF features, not embeddings or transformers. That keeps the compute.py step cheap, and the step can be skipped entirely when no new papers arrive. The “lite” in the name is honest: this is a from-scratch rewrite of the original arxiv-sanity, stripped to polling, tagging, and linear classification.

Key highlights

  • Self-hosted, single-directory deployment (data/ holds everything)
  • Cron-friendly update pipeline: arxiv_daemon.py fetches, compute.py featurizes, serve.py hosts
  • Optional daily email digests via send_emails.py + SendGrid
  • Live demo running at arxiv-sanity-lite.com
  • MIT licensed
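
The fetch-then-featurize half of that pipeline, including the option to skip compute.py when nothing new arrived, could be wired up roughly as follows. This is a sketch under a loud assumption: it treats exit status 0 from arxiv_daemon.py as "new papers were added" and any other status as "nothing new", which should be checked against the repository before use. The `run_update` helper and its `run` parameter are hypothetical names introduced here; serve.py is left out because it is a long-running server rather than a cron step.

```python
import subprocess
import sys

def run_update(run=subprocess.call, python=sys.executable):
    """Fetch new papers, then featurize only if the fetch reports new ones.

    ASSUMPTION: arxiv_daemon.py exits with status 0 when it added papers
    and with a non-zero status otherwise. Verify this against the repo.
    """
    steps_run = ["arxiv_daemon.py"]
    if run([python, "arxiv_daemon.py"]) == 0:
        # New papers reported, so refresh the TF-IDF features.
        run([python, "compute.py"])
        steps_run.append("compute.py")
    return steps_run
```

A cron entry would call `run_update()` from the repository directory. Passing a stand-in for `run` makes the branching easy to exercise without touching the network.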

Caveats

  • Search iterates the full database; no reverse index yet
  • The metadata store uses sqlitedict instead of proper SQLite tables
  • Mobile UI needs media queries (per the author’s own todo list)
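
For context on the first caveat, the sketch below contrasts a full scan with the kind of reverse (inverted) index the project does not have yet. It is illustrative only: the function names, the whitespace tokenizer, the match-all-terms semantics, and the paper ids are assumptions, not the project's search code.

```python
from collections import defaultdict

def linear_search(papers, query):
    """Scan every paper and keep those containing all query terms."""
    terms = set(query.lower().split())
    if not terms:
        return []
    return sorted(pid for pid, text in papers.items()
                  if terms <= set(text.lower().split()))

def build_inverted_index(papers):
    """Map each term to the set of paper ids whose text contains it."""
    index = defaultdict(set)
    for pid, text in papers.items():
        for term in set(text.lower().split()):
            index[term].add(pid)
    return index

def indexed_search(index, query):
    """Intersect posting sets, touching only papers that share a term."""
    postings = [index.get(t, set()) for t in set(query.lower().split())]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical ids and texts, for illustration only.
papers = {
    "p1": "sparse attention for language models",
    "p2": "galaxy cluster mass from weak lensing",
    "p3": "attention models for protein folding",
}
index = build_inverted_index(papers)
hits = indexed_search(index, "attention models")
```

Both functions return the same ids for a query; the difference is that the scan's cost grows with the number of papers, while the indexed lookup depends on how many papers contain the query terms.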

Verdict

Good fit if you want a hackable, low-cost paper filter and don’t need semantic search or LLM summaries. Skip it if you’re looking for state-of-the-art NLP or a polished mobile experience.
