PrincetonML/SIF

Sentence embeddings that humiliate your fancy neural net

A 2017 sentence-embedding baseline that still gives modern models a run for their money, implemented in a few lines of Python.

1.1k stars · Python · Language Models · ML Frameworks
Velocity (7d): +0.3 ★ / day · Trend: steady

What it does

SIF (Smooth Inverse Frequency) generates sentence embeddings by averaging word vectors with a dead-simple twist: downweight common words like “the” and “a” using a smoothed inverse of their corpus frequency, then subtract each embedding's projection onto the first principal component of the sentence set to remove shared “noise.” The paper calls it “simple but tough-to-beat,” and the code lives up to the first half: the core weighting scheme is a handful of lines.
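In symbols, the two steps look roughly like this. The notation is chosen here for illustration: v_w is a word vector, p(w) is the word's estimated unigram probability, a is a small smoothing constant, and u is the first singular vector of the matrix whose rows are the sentence embeddings.

```latex
v_s \;=\; \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)}\, v_w,
\qquad
v_s \;\leftarrow\; v_s - u\,u^{\top} v_s
```

As a quick sanity check, assuming a = 10^{-3}: a word with p(w) = 0.05 gets weight 0.001 / 0.051 ≈ 0.02, while a rarer word with p(w) = 10^{-4} gets 0.001 / 0.0011 ≈ 0.91, so rare words dominate the average.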

The interesting bit

The trick isn’t the neural architecture; it’s the statistical hack. SIF treats sentences as a weighted bag of words, removes their common direction, and, per the paper, beats sophisticated supervised methods including RNNs and LSTMs on textual similarity tasks. The authors published this at ICLR 2017, and it remains a baseline worth checking before you reach for transformers.
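To make the weighted bag of words and the common-direction removal concrete, here is a minimal NumPy sketch. It is not the repository's code: the function name `sif_embeddings`, the toy vocabulary, the random stand-in vectors, and the word probabilities are all assumptions made for illustration, and words with no known frequency are simply given weight 1.

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_prob, a=1e-3):
    """Return one SIF-style embedding per tokenized sentence (as rows)."""
    dim = len(next(iter(word_vecs.values())))
    emb = np.zeros((len(sentences), dim))
    for i, tokens in enumerate(sentences):
        known = [t for t in tokens if t in word_vecs]
        if not known:
            continue  # sentences with no known words stay as the zero vector
        # Smooth inverse frequency weight: a / (a + p(w)).
        # Assumption: a missing frequency counts as p(w) = 0, i.e. weight 1.
        weights = np.array([a / (a + word_prob.get(t, 0.0)) for t in known])
        vecs = np.array([word_vecs[t] for t in known])
        emb[i] = weights @ vecs / len(known)
    # First right singular vector of the (uncentered) sentence matrix.
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    u = vt[0]
    # Subtract each row's projection onto that shared direction.
    return emb - np.outer(emb @ u, u)

# Toy stand-ins for pretrained vectors and corpus frequencies (hypothetical).
rng = np.random.default_rng(0)
vocab = ["the", "a", "cat", "dog", "sat", "ran"]
word_vecs = {w: rng.normal(size=8) for w in vocab}
word_prob = {"the": 0.05, "a": 0.03, "cat": 1e-4,
             "dog": 1e-4, "sat": 2e-4, "ran": 2e-4}
sentences = [["the", "cat", "sat"], ["a", "dog", "ran"], ["the", "dog", "sat"]]

emb = sif_embeddings(sentences, word_vecs, word_prob)
print(emb.shape)
```

The shared direction is estimated from the whole batch, so the sketch only makes sense with several sentences at once; with a single sentence, the removal step would zero out the embedding.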

Key highlights

  • Core algorithm fits in a few lines of Python (SIF_embedding.py)
  • Ships with demos for textual similarity and supervised projection tasks
  • Uses pretrained GloVe vectors; no training required for the basic embedding
  • Includes evaluation scripts and preprocessing pipelines from related work
  • Dependencies are a time capsule: Theano, Lasagne, and Python 2-era stack

Caveats

  • Dependencies (Theano, Lasagne) are effectively deprecated; getting this running in 2024 may require archaeology
  • README notes the code borrows preprocessing from a 2016 codebase, so the full pipeline isn’t self-contained

Verdict

Worth a look if you’re building sentence embeddings and need a fast, interpretable baseline to humble your fancier model. Skip it if you want production-ready code or modern GPU acceleration: this is research archaeology, not a framework.
