inukshuk/anystyle

A citation parser that learned to read bibliographies so you don't have to

AnyStyle uses machine learning to turn messy reference strings into structured data, with a focus on letting you train it on your own weird formatting conventions.

1.3k stars · Ruby · Data Tooling · Velocity (7d): +0.2 ★/day · Trend: steady

What it does

AnyStyle parses free-form bibliographic references (think citation strings copy-pasted from PDFs or web pages) into structured fields such as author, title, date, and publisher. It ships as a Ruby gem and a CLI tool, and it powers the web app at anystyle.io. The parser handles a Derrida citation in French as readily as an English journal article, extracting language and script metadata along the way.
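To make "structured fields" concrete, here is a minimal sketch in Ruby. It assumes the gem's `AnyStyle.parse` entry point returns one hash per reference, with most fields held as arrays; the `SAMPLE` hash is hand-written for illustration rather than captured output, and the `summarize` helper is hypothetical, not part of AnyStyle.

```ruby
# With the gem installed, the real call would look roughly like this (assumed API):
#   require 'anystyle'
#   records = AnyStyle.parse("Derrida, J. (1967). L'écriture et la différence. Paris: Éditions du Seuil.")

# Hand-written stand-in for one parsed reference (illustrative, not captured output).
SAMPLE = {
  author: [{ family: "Derrida", given: "J." }],
  date: ["1967"],
  title: ["L'écriture et la différence"],
  location: ["Paris"],
  publisher: ["Éditions du Seuil"],
  language: "fr",
  scripts: ["Common", "Latin"],
  type: "book"
}.freeze

# Hypothetical helper: collapse a parsed record into a one-line summary.
def summarize(record)
  authors = Array(record[:author]).map { |a| [a[:family], a[:given]].compact.join(", ") }
  year    = Array(record[:date]).first
  title   = Array(record[:title]).first
  "#{authors.join('; ')} (#{year}): #{title} [#{record[:language]}]"
end

puts summarize(SAMPLE)
```

Because fields such as `title` and `date` arrive as arrays under this assumption, downstream code typically has to decide whether to take the first value or keep them all.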

The interesting bit

The project doesn’t pretend one model fits all citation styles. It exposes training pipelines so you can build custom models on your own annotated data and check their quality against held-out “gold” sets. The default model is trained on a manually curated corpus, but the README is admirably upfront about its skew: 965 English references versus 54 French and a grab bag of others. The maintainers practically beg you to retrain if you’re working outside Anglophone science publishing.
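Checking a model against a gold set comes down to comparing predicted labels with human-annotated ones. The sketch below assumes the usual definitions: token error rate is the share of mislabelled tokens, and sequence error rate is the share of references containing at least one mislabelled token. The `error_rates` function and the sample data are illustrative and are not AnyStyle's own implementation.

```ruby
# Each reference is a sequence of [token, label] pairs.
# GOLD holds human-annotated labels; PREDICTED holds (made-up) model output.
GOLD = [
  [["Derrida,", :author], ["J.", :author], ["1967", :date]],
  [["Paris:", :location], ["Seuil", :publisher]]
].freeze

PREDICTED = [
  [["Derrida,", :author], ["J.", :author], ["1967", :date]],
  [["Paris:", :publisher], ["Seuil", :publisher]] # one wrong label
].freeze

def error_rates(gold, predicted)
  raise ArgumentError, "sequence counts differ" unless gold.size == predicted.size

  token_total = token_errors = sequence_errors = 0

  gold.zip(predicted).each do |g_seq, p_seq|
    raise ArgumentError, "token counts differ" unless g_seq.size == p_seq.size

    # Count positions where the predicted label disagrees with the gold label.
    wrong = g_seq.zip(p_seq).count { |(_, g_label), (_, p_label)| g_label != p_label }
    token_total     += g_seq.size
    token_errors    += wrong
    sequence_errors += 1 if wrong.positive?
  end

  {
    token_error_rate:    token_total.zero? ? 0.0 : token_errors.to_f / token_total,
    sequence_error_rate: gold.empty? ? 0.0 : sequence_errors.to_f / gold.size
  }
end

rates = error_rates(GOLD, PREDICTED)
puts format("token: %.2f, sequence: %.2f", rates[:token_error_rate], rates[:sequence_error_rate])
```

In this toy data one of five tokens is wrong and one of two references contains an error, which shows why the sequence rate is always at least as high as the token rate.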

Key highlights

  • CLI, Ruby API, and open-source web interface (anystyle.io)
  • Custom model training with anystyle train and quality checking via sequence/token error rates
  • Supports Latin scripts broadly, plus Cyrillic; explicitly incompatible with Chinese, Japanese, Arabic, and Indian languages that don’t whitespace-separate tokens
  • Pluggable dictionary backends: in-memory Ruby hash, GDBM, or Redis
  • BSD-licensed, volunteer-maintained since 2011
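
The "pluggable backend" idea from the list above can be sketched as a dictionary that delegates storage to an interchangeable object. Everything here is hypothetical and for illustration only: `MemoryBackend`, `TokenDictionary`, and the `get`/`put` interface are not AnyStyle's classes, and the sketch makes no claim about how the gem wires up GDBM or Redis.

```ruby
# Hypothetical interface: every backend answers `get(key)` and `put(key, value)`.
class MemoryBackend
  def initialize
    @store = {}
  end

  def get(key)
    @store[key]
  end

  def put(key, value)
    @store[key] = value
  end
end

# Illustrative dictionary that tags tokens with categories (e.g. :name, :place),
# delegating storage to whichever backend it is given.
class TokenDictionary
  def initialize(backend: MemoryBackend.new)
    @backend = backend
  end

  def add(token, category)
    key  = normalize(token)
    tags = @backend.get(key) || []
    @backend.put(key, (tags + [category]).uniq)
  end

  def tags_for(token)
    @backend.get(normalize(token)) || []
  end

  private

  # Lookups ignore case and punctuation so "Paris:" matches "paris".
  def normalize(token)
    token.downcase.gsub(/[[:punct:]]/, "")
  end
end

dict = TokenDictionary.new
dict.add("Paris", :place)
dict.add("Paris", :name)
p dict.tags_for("paris:")
```

A persistent backend would only need to implement the same two methods, which is what keeps the lookup code unchanged when the store is swapped.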

Caveats

  • The default training data is heavily English-biased; non-English results may need custom models
  • Finder model training data is partially withheld due to copyright restrictions

Verdict

Worth a look if you’re building bibliographic pipelines, cleaning up reference dumps, or need citation parsing you can retrain for domain-specific formats. Skip it if you’re processing CJK or Arabic script natively, or if you need a Python-native solution: this is Ruby territory.
