allenai/bilm-tf

The original ELMo: when "contextualized" was still novel

TensorFlow 1.2 implementation of the bidirectional language model that made word embeddings depend on their neighbors.

1.6k stars · Python · Language Models · ML Frameworks
Velocity (7d): +0.5 ★ / day · Trend: steady

What it does

This is AllenAI’s reference implementation of ELMo (deep contextualized word representations), built in TensorFlow. It trains bidirectional language models and exposes three inference modes: encoding raw text from character input on the fly (slowest, most general), caching token embeddings and running only the biLSTM for context (middle ground), and pre-computing representations for an entire dataset to HDF5 (fastest, most rigid).

The interesting bit

The three-speed design is the practical hook. The README is refreshingly honest about trade-offs: character-level for unseen test data, cached tokens for fixed vocabularies like SNLI, full pre-computation when you want to escape TensorFlow entirely. It’s a snapshot of 2018 NLP engineering before transformers swallowed everything.

Key highlights

  • Ships with pre-trained English models and the exact 1 Billion Word Benchmark setup used in the original paper
  • Character-based inputs mean out-of-vocabulary (OOV) words still get representations, at some cost in speed; the README's own phrasing is “a slight decrease in run time”
  • The training script bin/train_elmo.py records the original setup: 3 GTX 1080 GPUs, 10 epochs, about two weeks
  • Includes checkpoint-to-HDF5 conversion for interoperability with AllenNLP’s PyTorch implementation
  • Docker image available, but requires nvidia-docker—no CPU fallback
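
To make the character-input point concrete, here is a minimal sketch of byte-level word encoding, the general idea that lets an unseen word still map to valid model input. The function names, marker IDs, and word width are assumptions chosen for illustration; they are not taken from the repository's code.

```python
# Illustrative byte-level word encoder. The marker IDs and the fixed word
# width below are assumptions for this sketch, not values read from bilm-tf.
BOW, EOW, PAD = 256, 257, 258   # hypothetical IDs above the 0-255 byte range
MAX_WORD_LEN = 10               # hypothetical fixed width per word


def encode_word(word, max_len=MAX_WORD_LEN):
    """Map a word to a fixed-length list of integer character IDs."""
    # Truncate the UTF-8 bytes, leaving room for the two boundary markers.
    raw = word.encode("utf-8")[: max_len - 2]
    ids = [BOW] + list(raw) + [EOW]
    # Right-pad so every word has the same width.
    return ids + [PAD] * (max_len - len(ids))


def encode_sentence(tokens, max_len=MAX_WORD_LEN):
    """Encode each token independently; no vocabulary lookup is involved."""
    return [encode_word(tok, max_len) for tok in tokens]


if __name__ == "__main__":
    # A rare word encodes just like a common one: only its bytes matter.
    print(encode_sentence(["the", "zyzzyva"]))
```

Because the encoding depends only on a word's bytes, nothing needs to be looked up in a vocabulary file at inference time. That is also why this path is the slowest: every word has to pass through the character-level encoder.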

Caveats

  • Locked to TensorFlow 1.2, which is now archaeological; the README itself nudges users toward TensorFlow Hub or AllenNLP for new work
  • Vocabulary file must list the special tokens <S>, </S>, and <UNK> on its first three lines, followed by all other tokens sorted by descending frequency; a brittle convention that is easy to botch
  • Fine-tuning on small corpora (<10M tokens) risks overfitting; the authors explicitly warn against training too long
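
The vocabulary-file convention is easier to get right with a small script than by hand. The sketch below is a hedged example, not the project's own tooling: it assumes whitespace-tokenized input, one token per output line, and the special tokens on the first three lines in the order <S>, </S>, <UNK> (check the README before relying on this). The helper names are hypothetical.

```python
from collections import Counter

# Assumed order and placement of the special tokens; verify against the README.
SPECIAL_TOKENS = ["<S>", "</S>", "<UNK>"]


def build_vocab_lines(sentences):
    """Return vocabulary lines: special tokens first, then tokens by descending count."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    # Avoid listing a special token twice if it appears in the raw text.
    for special in SPECIAL_TOKENS:
        counts.pop(special, None)
    # Descending frequency; the alphabetical tie-break is this sketch's choice,
    # made only so the output is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return SPECIAL_TOKENS + [tok for tok, _ in ordered]


def write_vocab(sentences, path):
    """Write one token per line to `path`."""
    with open(path, "w", encoding="utf-8") as fout:
        fout.write("\n".join(build_vocab_lines(sentences)) + "\n")


if __name__ == "__main__":
    corpus = ["the cat sat", "the dog sat", "the end"]
    print(build_vocab_lines(corpus))
```

Generating the file from the same tokenized corpus used for training keeps the frequency order consistent with the data.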

Verdict

Worth studying if you’re tracing the evolution of contextual embeddings or reproducing 2018 baselines. For production use, follow the README’s own advice and use the TensorFlow Hub or AllenNLP versions instead.
