CERN's multi-label classifier: physics abstracts to keywords
A Keras wrapper that trains word2vec embeddings, then slaps labels on text—born from sorting High Energy Physics papers.

What it does
Magpie is a Python wrapper around a Keras neural network for multi-label text classification. You feed it pairs of .txt and .lab files; it builds word2vec embeddings, normalizes them with a scaler, and trains a model to predict multiple labels per document. It was built at CERN to auto-categorize physics abstracts and extract keywords.
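To make the input format concrete, here is a minimal sketch of reading .txt/.lab pairs and turning each document's labels into a multi-hot target vector. It does not use Magpie itself: load_pairs(), multi_hot(), the file names, and the sample texts are all hypothetical, and the code assumes one label per line in each .lab file, with pairs matched by filename.

```python
import tempfile
from pathlib import Path


def load_pairs(corpus_dir):
    """Return {stem: (text, [labels])} for every .txt file with a matching .lab file."""
    pairs = {}
    for txt_path in sorted(Path(corpus_dir).glob("*.txt")):
        lab_path = txt_path.with_suffix(".lab")
        if not lab_path.exists():
            continue  # skip documents that have no label file
        text = txt_path.read_text(encoding="utf-8")
        # Assumption: one label per line, blank lines ignored.
        labels = [
            line.strip()
            for line in lab_path.read_text(encoding="utf-8").splitlines()
            if line.strip()
        ]
        pairs[txt_path.stem] = (text, labels)
    return pairs


def multi_hot(doc_labels, all_labels):
    """Encode a document's labels as a 0/1 vector over the full label list."""
    present = set(doc_labels)
    return [1 if label in present else 0 for label in all_labels]


# Hypothetical two-document corpus written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp)
    (corpus / "1001.txt").write_text("We study black hole entropy.", encoding="utf-8")
    (corpus / "1001.lab").write_text("Theory-HEP\nGravitation and Cosmology\n", encoding="utf-8")
    (corpus / "1002.txt").write_text("A measurement of the top quark mass.", encoding="utf-8")
    (corpus / "1002.lab").write_text("Experiment-HEP\n", encoding="utf-8")
    pairs = load_pairs(corpus)

all_labels = ["Experiment-HEP", "Gravitation and Cosmology", "Theory-HEP"]
targets = {stem: multi_hot(labels, all_labels) for stem, (_, labels) in pairs.items()}
print(targets)
```

A document can switch on any number of positions in its target vector, which is what separates multi-label classification from the single-class case.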
The interesting bit
The whole pipeline is deliberately chunked into swappable pieces: word2vec, scaler, Keras model. You can pre-train embeddings on your full corpus, save them, and hot-swap later. There’s also a batch_train() mode when your data won’t fit in RAM—unusual thoughtfulness for a small research tool.
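The idea behind training in batches when the corpus won't fit in RAM can be sketched in a few lines. This is a generic illustration, not Magpie's batch_train() implementation or signature: iter_batches(), train_in_batches(), and the train_step callback are hypothetical stand-ins for "load this slice of documents and update the model".

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield consecutive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def train_in_batches(doc_names, batch_size, train_step):
    """Call train_step once per batch and return the number of batches seen."""
    n_batches = 0
    for batch in iter_batches(doc_names, batch_size):
        train_step(batch)  # stand-in for loading the batch and updating the model
        n_batches += 1
    return n_batches


# The document names come from a generator, so only one batch is held at a time.
seen = []
n = train_in_batches((f"doc{i}" for i in range(5)), batch_size=2, train_step=seen.append)
print(n, seen)
```

Because the source is consumed lazily, peak memory depends on the batch size rather than on the size of the corpus.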
Key highlights
- Built on Yoon Kim’s CNN-for-text architecture, as adapted in Mark Berger’s follow-up work
- Two-file format: .txt for text, .lab for labels (one per line), matched by filename
- init_word_vectors() combines word2vec training + scaler fitting in one call
- batch_train() for memory-constrained training
- Not on PyPI; install via pip install git+https://... with dependency version gotchas
Caveats
- Last tagged release is v2.1.1; unclear how actively maintained
- Dependency versions are finicky enough that the README warns about checking setup.py
- No GPU guidance, no modern embedding options (BERT, etc.); this is firmly word2vec-era
Verdict
Grab it if you need a quick, hackable multi-label baseline with inspectable word embeddings. Skip it if you want SOTA or a batteries-included library—this is research glue code that happens to be well-documented glue code.