inspirehep/magpie

CERN's multi-label classifier: physics abstracts to keywords

A Keras wrapper that trains word2vec embeddings, then slaps labels on text—born from sorting High Energy Physics papers.

686 stars · Python · ML Frameworks, Language Models
Velocity (7d): +0.2 ★/day · Trend: steady

What it does

Magpie is a Python wrapper around a Keras neural network for multi-label text classification. You feed it paired .txt and .lab files; it trains word2vec embeddings, fits a scaler to normalize them, and trains a model that predicts multiple labels per document. It was built at CERN to auto-categorize physics abstracts and extract keywords.
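To make "multiple labels per document" concrete, here is a minimal sketch of how a multi-label decision differs from picking one winning class. It assumes the model emits one independent score per label and that every label above a cutoff is kept. The raw scores, the label names, and the 0.5 threshold are illustrative assumptions, not values or code taken from Magpie.

```python
import math


def sigmoid(x):
    """Squash a raw score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def pick_labels(logits, labels, threshold=0.5):
    """Return (label, confidence) pairs at or above threshold, best first.

    Each label is scored independently, so zero, one, or many labels
    can be returned for the same document.
    """
    scored = [(label, sigmoid(z)) for label, z in zip(labels, logits)]
    kept = [(label, p) for label, p in scored if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical label set and raw model outputs for one abstract.
labels = ["Theory-HEP", "Experiment-HEP", "Gravitation and Cosmology"]
logits = [2.0, -1.5, 0.3]

predicted = pick_labels(logits, labels)
print(predicted)
```

Here two of the three labels clear the cutoff, which a single-label (softmax plus argmax) setup could not express.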

The interesting bit

The whole pipeline is deliberately chunked into swappable pieces: word2vec, scaler, Keras model. You can pre-train embeddings on your full corpus, save them, and hot-swap them in later. There’s also a batch_train() mode for when your data won’t fit in RAM, which is unusual thoughtfulness for a small research tool.
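The idea behind training in batches when data won't fit in RAM can be sketched with a plain generator. This is not Magpie's batch_train() implementation, only an illustration of the general technique: pull a bounded number of documents at a time from a lazy stream. The names iter_batches and fake_document_stream are hypothetical.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield lists of at most batch_size items without materializing the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_document_stream(n):
    """Stand-in for a lazy reader over a large corpus (hypothetical)."""
    for i in range(n):
        yield f"doc_{i}.txt"


# Only one batch is held in memory at any moment.
batch_sizes = [len(batch) for batch in iter_batches(fake_document_stream(7), batch_size=3)]
print(batch_sizes)  # [3, 3, 1]
```

A training loop would consume each batch, update the model, and drop the batch before asking for the next one.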

Key highlights

  • Built on Yoon Kim’s CNN-for-text architecture, as adapted in Mark Berger’s follow-up work
  • Paired-file format: .txt for text, .lab for labels (one per line), matched by filename
  • init_word_vectors() combines word2vec training + scaler fitting in one call
  • batch_train() for memory-constrained training
  • Not on PyPI; install via pip install git+https://... and expect dependency version gotchas
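The .txt/.lab pairing described in the list above can be sketched with the standard library alone. This is a minimal reader written for illustration, assuming UTF-8 files and one label per line; the load_pairs helper, the file names, and the sample contents are hypothetical, and Magpie's own loader may behave differently.

```python
import tempfile
from pathlib import Path


def load_pairs(corpus_dir):
    """Map each document stem to (text, labels), pairing X.txt with X.lab."""
    pairs = {}
    for txt_path in sorted(Path(corpus_dir).glob("*.txt")):
        lab_path = txt_path.with_suffix(".lab")
        if not lab_path.exists():
            continue  # skip documents that have no label file
        text = txt_path.read_text(encoding="utf-8")
        lab_lines = lab_path.read_text(encoding="utf-8").splitlines()
        labels = [line.strip() for line in lab_lines if line.strip()]
        pairs[txt_path.stem] = (text, labels)
    return pairs


# Build a tiny throwaway corpus: one labeled abstract and one unlabeled one.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "1001.txt").write_text("We study black hole entropy.", encoding="utf-8")
    (root / "1001.lab").write_text("Gravitation and Cosmology\nTheory-HEP\n", encoding="utf-8")
    (root / "1002.txt").write_text("Unlabeled abstract.", encoding="utf-8")
    pairs = load_pairs(root)

print(pairs)
```

Matching on the filename stem keeps text and labels loosely coupled, so labels can be edited or regenerated without touching the documents.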

Caveats

  • Last tagged release is v2.1.1; unclear how actively maintained
  • Dependency versions are finicky enough that the README warns about checking setup.py
  • No GPU guidance, no modern embedding options (BERT, etc.)—this is firmly word2vec-era

Verdict

Grab it if you need a quick, hackable multi-label baseline with inspectable word embeddings. Skip it if you want SOTA or a batteries-included library—this is research glue code that happens to be well-documented glue code.
