babylonhealth/fastText_multilingual

Make 78 languages share one vector space, no neural training required

Pre-computed linear transforms let you compare "chat" and "кот" directly in fastText space.

1.2k stars · Jupyter Notebook · Language Models

What it does

Facebook’s fastText word vectors are monolingual: each language is trained independently, so “chat” and “кот” (French and Russian for “cat”) sit in unrelated vector spaces and the cosine similarity between them is essentially noise. This repo ships 78 pre-computed alignment matrices, each a simple linear transform that rotates one language’s vectors into English space. Apply the matrix for each language, and cross-lingual nearest-neighbor search works.
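To make the effect concrete, here is a minimal NumPy sketch on toy data. It does not use the repo's own loader or files: the "French" vectors are simply the "English" ones in a rotated coordinate system, `fr_to_en` is a hypothetical stand-in for a pre-computed alignment matrix, and the row-vector convention (`vector @ matrix`) is an assumption.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 300  # fastText's published vectors are 300-dimensional

# Toy stand-ins: "English" vectors, and "French" vectors that are the same
# points expressed in a rotated coordinate system.
en = rng.standard_normal((5, dim))
hidden_rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
fr = en @ hidden_rotation.T

# Hypothetical alignment matrix. In practice it would be loaded from a
# pre-computed file rather than constructed like this.
fr_to_en = hidden_rotation

before = cosine(fr[0], en[0])            # unaligned: no meaningful relation
after = cosine(fr[0] @ fr_to_en, en[0])  # aligned: lands on the English point
print(f"before={before:.3f} after={after:.3f}")
```

With real vectors the aligned similarity will be well below 1, since translations are never exact rotations of each other; the toy data only illustrates the mechanism.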

The interesting bit

The trick is an orthogonal transformation learned with an SVD on a small bilingual dictionary: just 5,000 word pairs, translated with the Google Translate API. Because every language is aligned to English (English itself gets the identity matrix), the system generalizes to language pairs that were never paired directly. French-to-Russian translation works even though the training dictionaries were only French-English and English-Russian.
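The SVD step and the English pivot can be sketched in a few lines of NumPy. This is an illustration of the orthogonal Procrustes solution on synthetic data, not the repo's notebook code: `learn_orthogonal`, `unit_rows`, the toy sizes, and the noise level are all assumptions made for the example.

```python
import numpy as np

def unit_rows(m):
    """Scale every row to unit length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def learn_orthogonal(src, tgt):
    """Orthogonal W minimising ||unit_rows(src) @ W - unit_rows(tgt)||.

    Row i of src and row i of tgt are a dictionary pair.
    """
    u, _, vt = np.linalg.svd(unit_rows(src).T @ unit_rows(tgt))
    return u @ vt

rng = np.random.default_rng(1)
n_pairs, dim = 200, 50  # toy sizes, far smaller than a real dictionary

def random_rotation():
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

# Synthetic "languages": rotated, slightly noisy copies of the English rows.
en = rng.standard_normal((n_pairs, dim))
fr = en @ random_rotation() + 0.01 * rng.standard_normal((n_pairs, dim))
ru = en @ random_rotation() + 0.01 * rng.standard_normal((n_pairs, dim))

# Each language is aligned to English only.
w_fr = learn_orthogonal(fr, en)
w_ru = learn_orthogonal(ru, en)

# French and Russian rows were never paired, yet once both are mapped into
# English space the matching rows nearly coincide.
fr_in_en = unit_rows(fr @ w_fr)
ru_in_en = unit_rows(ru @ w_ru)
pivot_cosines = np.sum(fr_in_en * ru_in_en, axis=1)
print("lowest French-Russian cosine:", pivot_cosines.min())
```

Constraining `W` to be orthogonal is what lets the solution come straight out of one SVD, with no iterative training.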

Key highlights

  • 78 pre-computed alignment matrices for fastText’s original 89-language set
  • Monolingual similarity relationships are preserved exactly: an orthogonal transform leaves dot products unchanged, so nothing is distorted within a language
  • Precision@1 hits 0.73 for French, 0.72 for Spanish, 0.60 for Russian; drops toward 0.06 for low-resource languages like Cebuano
  • Includes align_your_own.ipynb to learn custom matrices for new language pairs
  • Based on the ICLR 2017 paper: Offline bilingual word vectors, orthogonal transformations and the inverted softmax
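Two of the highlights above are easy to check numerically. The sketch below, again on synthetic data, confirms that an orthogonal map leaves within-language cosine similarities unchanged and shows one way Precision@1 can be computed. The `precision_at_1` helper and its plain cosine nearest-neighbor retrieval are assumptions for illustration; the repo's exact evaluation procedure may differ.

```python
import numpy as np

def unit_rows(m):
    """Scale every row to unit length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def precision_at_1(src_aligned, tgt, gold):
    """Fraction of source rows whose nearest target row (by cosine) is gold[i]."""
    sims = unit_rows(src_aligned) @ unit_rows(tgt).T
    return float(np.mean(sims.argmax(axis=1) == gold))

rng = np.random.default_rng(2)
dim = 50
words = rng.standard_normal((100, dim))
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
rotated = words @ rotation

# Monolingual geometry: pairwise cosines before and after the orthogonal map.
cos_before = unit_rows(words) @ unit_rows(words).T
cos_after = unit_rows(rotated) @ unit_rows(rotated).T
print("similarities preserved:", np.allclose(cos_before, cos_after))

# Precision@1 on a toy test dictionary in which row i translates to row i.
noisy_target = rotated + 0.05 * rng.standard_normal(words.shape)
gold = np.arange(len(words))
score = precision_at_1(rotated, noisy_target, gold)
print("precision@1:", score)
```

On this clean toy data every query retrieves its own translation; real dictionaries are noisier, which is why the reported scores vary so much by language.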

Caveats

  • Babylon Health no longer maintains the repo; the paper’s authors are the contact for issues
  • Google Translate API coverage limits language selection (11 of Facebook’s original languages are excluded)
  • Performance degrades sharply for non-European and low-resource languages

Verdict

Worth a look if you need quick cross-lingual word similarity without training a multilingual model from scratch. Skip it if you need sentence-level embeddings or state-of-the-art performance on distant language pairs: this is word-level, 2017-era technology.
