The original ELMo: when "contextualized" was still novel
TensorFlow 1.2 implementation of the bidirectional language model that made word embeddings depend on their neighbors.

What it does
This is AllenAI’s reference implementation of ELMo (deep contextualized word representations), built in TensorFlow. It trains bidirectional language models and exposes three inference modes: raw character-level encoding (slowest, most general), cached token embeddings with biLSTM context (middle ground), and pre-computing entire datasets to HDF5 (fastest, most rigid).
The interesting bit
The three-speed design is the practical hook. The README is refreshingly honest about trade-offs: character-level for unseen test data, cached tokens for fixed vocabularies like SNLI, full pre-computation when you want to escape TensorFlow entirely. It’s a snapshot of 2018 NLP engineering before transformers swallowed everything.
Key highlights
- Ships with pre-trained English models and the exact 1 Billion Word Benchmark setup used in the original paper
- Character-based inputs mean OOV words still get representations, at the cost of what the README calls “a slight decrease in run time” (read: somewhat slower inference)
- Training script bin/train_elmo.py documents the original hyperparameters: 3 GTX 1080s, 10 epochs, ~two weeks
- Includes checkpoint-to-HDF5 conversion for interoperability with AllenNLP’s PyTorch implementation
- Docker image available, but requires nvidia-docker; there is no CPU fallback
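To see why character-based inputs sidestep the OOV problem, here is a minimal sketch of the idea: each token becomes a fixed-width list of ids derived from its UTF-8 bytes, so no vocabulary lookup is needed. The marker ids, the padding width, and the function names are assumptions chosen for illustration; they are not read from the bilm-tf source.

```python
# Illustrative sketch only: the marker ids and padding width below are
# assumptions, not values taken from the bilm-tf code.
BOW, EOW, PAD = 256, 257, 258   # hypothetical ids outside the 0-255 byte range
MAX_WORD_LEN = 12               # hypothetical fixed width per token


def encode_word(word, max_len=MAX_WORD_LEN):
    """Map a word to a fixed-length list of ids built from its UTF-8 bytes."""
    body = list(word.encode("utf-8"))[: max_len - 2]  # leave room for BOW/EOW
    ids = [BOW] + body + [EOW]
    return ids + [PAD] * (max_len - len(ids))


def encode_sentence(tokens):
    """Encode every token; no vocabulary lookup happens, so nothing is OOV."""
    return [encode_word(t) for t in tokens]


# "zxqvy" is a made-up word that no training corpus would contain.
encoded = encode_sentence(["the", "zxqvy"])
```

Both tokens come out as same-shaped id lists, which is what lets a character-level model produce a representation for any input string. The price is the extra per-character computation that the cached-token and pre-computed modes avoid.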
Caveats
- Locked to TensorFlow 1.2, which is now archaeological; the README itself nudges users toward TensorFlow Hub or AllenNLP for new work
- Vocabulary file must be sorted by descending token frequency with exact <S>, </S>, <UNK> placement; a brittle convention that is easy to botch
- Fine-tuning on small corpora (<10M tokens) risks overfitting; the authors explicitly warn against training too long
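The vocabulary convention is easy to get wrong by hand, so it may help to generate and check the file programmatically. The sketch below assumes the three special tokens occupy the first three lines in the order shown, which should be verified against the README; `build_vocab_lines` and `check_vocab` are hypothetical helper names, not part of the repository.

```python
from collections import Counter

# Assumed order and position of the special tokens; verify against the README.
SPECIAL_TOKENS = ["<S>", "</S>", "<UNK>"]


def build_vocab_lines(sentences):
    """Return vocabulary lines: special tokens first, then tokens by descending count."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    for special in SPECIAL_TOKENS:
        counts.pop(special, None)  # never list a special token twice
    # Sort by count (descending), breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return SPECIAL_TOKENS + [tok for tok, _ in ranked]


def check_vocab(lines, counts):
    """Return a list of problems found in an existing vocabulary listing."""
    problems = []
    if lines[:3] != SPECIAL_TOKENS:
        problems.append("special tokens are missing or out of order")
    body = [tok for tok in lines if tok not in SPECIAL_TOKENS]
    freqs = [counts.get(tok, 0) for tok in body]
    if any(a < b for a, b in zip(freqs, freqs[1:])):
        problems.append("tokens are not sorted by descending frequency")
    return problems


corpus = ["the cat sat", "the dog sat", "the end"]  # toy whitespace-tokenized data
counts = Counter(" ".join(corpus).split())
vocab = build_vocab_lines(corpus)
problems = check_vocab(vocab, counts)
```

Writing the result is then a matter of joining `vocab` with newlines. Running `check_vocab` on a hand-edited file before training is a cheap way to catch a misplaced special token or a broken sort order.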
Verdict
Worth studying if you’re tracing the evolution of contextual embeddings or reproducing 2018 baselines. For production use, follow the README’s own advice and use the TensorFlow Hub or AllenNLP versions instead.