PrincetonML/SIF

Sentence embeddings that humiliate your fancy neural net

A 2017 sentence-embedding baseline that can still hold its own against modern models, implemented in a few lines of Python.

★ 1.1k stars · Python · Language Models · ML Frameworks

What it does

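SIF (Smooth Inverse Frequency) builds a sentence embedding by averaging the sentence's word vectors with a dead-simple twist: each word is weighted by a / (a + p(w)), where p(w) is the word's estimated frequency and a is a small constant, so common words like “the” and “a” count for very little. It then removes from every embedding its projection onto the first principal component of the sentence embeddings, which strips out shared “noise.” The paper calls it “simple but tough-to-beat,” and the code lives up to the first half: the core weighting scheme is a handful of lines.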
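The two steps described above can be sketched with NumPy alone. This is a minimal illustration, not the repository's code: the function names `sif_weight` and `sif_embeddings`, the dictionary inputs `word_vectors` and `word_prob`, and the default `a=1e-3` are all assumptions made for the example, as is the choice to give words with no frequency estimate a weight of 1.

```python
import numpy as np


def sif_weight(p, a=1e-3):
    """SIF weight a / (a + p) for a word with unigram probability p."""
    return a / (a + p)


def sif_embeddings(sentences, word_vectors, word_prob, a=1e-3):
    """Return one embedding per sentence as the rows of a 2-D array.

    sentences    : list of token lists
    word_vectors : dict token -> 1-D NumPy array (e.g. pretrained GloVe)
    word_prob    : dict token -> estimated unigram probability p(w)
    a            : smoothing constant (assumed default, tune as needed)
    """
    dim = len(next(iter(word_vectors.values())))
    emb = np.zeros((len(sentences), dim))

    # Step 1: frequency-weighted average of the word vectors.
    for i, tokens in enumerate(sentences):
        known = [t for t in tokens if t in word_vectors]
        if not known:
            continue  # no usable tokens: leave a zero vector
        # Assumption: a word without a frequency estimate gets p = 0,
        # i.e. weight 1.
        weights = np.array([sif_weight(word_prob.get(t, 0.0), a) for t in known])
        vectors = np.stack([word_vectors[t] for t in known])
        emb[i] = weights @ vectors / len(known)

    # Step 2: remove the projection onto the first principal direction
    # (the first right singular vector of the uncentered embedding matrix).
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    u = vt[0]
    return emb - np.outer(emb @ u, u)


if __name__ == "__main__":
    # Toy inputs, invented purely for illustration.
    vecs = {
        "the": np.array([1.0, 0.0, 0.0]),
        "cat": np.array([0.0, 1.0, 0.0]),
        "sat": np.array([1.0, 1.0, 0.0]),
    }
    probs = {"the": 0.05, "cat": 1e-4, "sat": 2e-4}
    print(sif_embeddings([["the", "cat"], ["the", "cat", "sat"]], vecs, probs).shape)
```

In real use the word vectors would come from a pretrained file and the probabilities from corpus counts; the principal direction is estimated from whatever batch of sentences is passed in, so very small batches give a noisy estimate.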

The interesting bit

The trick isn’t a neural architecture; it’s a statistical hack. SIF treats each sentence as a weighted bag of words, removes the direction the embeddings share, and still competes with supervised RNNs and LSTMs on textual similarity tasks. The authors published this at ICLR 2017, and the README still frames it as a baseline worth checking before you reach for transformers.
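Written out, the weighted bag of words and the removal of the common direction look roughly like this. The notation is an assumption for illustration: v_w is a word vector, p(w) the word's estimated unigram probability, a the smoothing constant, |s| the number of words in sentence s, and u the first singular vector of the matrix whose columns are the sentence embeddings.

```latex
v_s = \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)} \, v_w ,
\qquad
v_s \leftarrow v_s - u u^{\top} v_s
```

As a worked example with made-up frequencies and a = 10^{-3}: a very common word with p(w) = 0.05 gets weight 0.001 / 0.051 ≈ 0.0196, while a rare word with p(w) = 10^{-5} gets 0.001 / 0.00101 ≈ 0.990, so the rare word counts roughly fifty times as much in the average.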

Key highlights

Core algorithm fits in a few lines of Python (SIF_embedding.py)
Ships with demos for textual similarity and supervised projection tasks
Uses pretrained GloVe vectors; no training required for the basic embedding
Includes evaluation scripts and preprocessing pipelines from related work
Dependencies are a time capsule: Theano, Lasagne, and a Python 2-era stack

Caveats

Dependencies (Theano, Lasagne) are effectively deprecated; getting this running in 2024 may require archaeology
README notes the code borrows preprocessing from a 2016 codebase, so the full pipeline isn’t self-contained

Verdict

Worth a look if you’re building sentence embeddings and need a fast, interpretable baseline to humble your fancier model. Skip it if you want production-ready code or modern GPU acceleration — this is research archaeology, not a framework.

Frequently asked

What is PrincetonML/SIF?: A 2017 sentence-embedding baseline that can still hold its own against modern models, implemented in a few lines of Python.
Is SIF open source?: Yes — PrincetonML/SIF is open source, released under the MIT license.
What language is SIF written in?: PrincetonML/SIF is primarily written in Python.
How popular is SIF?: PrincetonML/SIF has 1.1k stars on GitHub.
Where can I find SIF?: PrincetonML/SIF is on GitHub at https://github.com/PrincetonML/SIF.