INESCTEC/yake

Keyword extraction without the training-data treadmill

A statistical keyword extractor that works on single documents with zero training or external corpora.

1.9k stars · Jupyter Notebook · Data Tooling · RAG · Search
Velocity (7d): +0.7 ★/day · Trend: steady

What it does

YAKE! pulls keywords from a single document using only statistical features: word frequency, position, casing, and co-occurrence patterns. No neural models, no pre-trained embeddings, no labeled datasets. You feed it text; it returns scored n-grams. Lower scores mean higher relevance.
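
A minimal sketch of that flow, assuming the `yake` package from PyPI and the `KeywordExtractor` interface shown in its README (`lan`, `n`, and `top` arguments, with `extract_keywords` returning `(keyword, score)` pairs). The helper names `extract_with_yake` and `best_first` are illustrative and not part of the library.

```python
def extract_with_yake(text, language="en", max_ngram=3, top=10):
    """Run YAKE on one document and return (keyword, score) pairs.

    Assumes the third-party package is installed (pip install yake) and
    that extract_keywords yields (keyword, score) tuples, as in the README.
    """
    import yake  # imported lazily so the helper below works without it

    extractor = yake.KeywordExtractor(lan=language, n=max_ngram, top=top)
    return extractor.extract_keywords(text)


def best_first(scored):
    """Order (keyword, score) pairs so the most relevant come first.

    YAKE scores are "lower is better", so this is an ascending sort.
    """
    return sorted(scored, key=lambda pair: pair[1])
```

Because the score runs opposite to most ranking metrics, sorting descending or thresholding with `>` would silently keep the weakest candidates.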

The interesting bit

The method is deliberately collection-independent: it derives everything from the document itself. This makes it portable across languages and domains without retraining, a rarity in an era where most NLP tools ship with gigabyte-sized model weights. It won Best Short Paper at ECIR 2018, suggesting the academics found the simplicity defensible.
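
To make "derives everything from the document itself" concrete, here is a toy sketch that collects three of the signals named above (frequency, position, casing) from a single string. It is not YAKE's actual scoring formula; the function name `word_features` and the feature definitions are assumptions chosen for illustration.

```python
import re
from collections import Counter


def word_features(text):
    """Collect per-word statistics using nothing but the input text.

    Illustrative only: YAKE combines its own feature set into one score,
    which this sketch does not attempt to reproduce.
    """
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    total = len(tokens)
    frequency = Counter(token.lower() for token in tokens)

    first_index = {}
    capitalized = Counter()
    for index, token in enumerate(tokens):
        key = token.lower()
        first_index.setdefault(key, index)  # remember earliest occurrence
        if token[0].isupper():
            # Note: this also counts sentence-initial capitals.
            capitalized[key] += 1

    return {
        word: {
            "frequency": count,
            "first_position": first_index[word] / total,  # 0.0 = start
            "capitalized_ratio": capitalized[word] / count,
        }
        for word, count in frequency.items()
    }
```

Nothing here consults a corpus, stopword model, or trained weights, which is why the same code runs unchanged on text in any language that uses letter-based words.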

Key highlights

  • Unsupervised: no training data, no external dictionaries, no corpus statistics
  • Single-document focus: each document is self-contained; no batching required
  • Multilingual: supports multiple languages via a language parameter (Portuguese shown in docs)
  • Configurable deduplication: Levenshtein, Jaro, or sequence matcher to suppress near-duplicate phrases
  • Optional lemmatization (v0.6.0+) to collapse morphological variants like “tree/trees”
  • Includes a TextHighlighter utility for marking keywords in HTML output
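
The deduplication idea can be sketched with the standard library's `difflib.SequenceMatcher`. This is a simplified illustration under stated assumptions, not the library's implementation: the function name `dedupe` and the default threshold of 0.9 are hypothetical.

```python
from difflib import SequenceMatcher


def dedupe(scored, threshold=0.9):
    """Drop phrases that are near-duplicates of a better-scored phrase.

    scored: iterable of (phrase, score) pairs, lower score = more relevant.
    threshold: similarity ratio at or above which a phrase is suppressed.
    """
    kept = []
    # Visit candidates best-first so the strongest variant survives.
    for phrase, score in sorted(scored, key=lambda pair: pair[1]):
        is_duplicate = any(
            SequenceMatcher(None, phrase.lower(), other.lower()).ratio() >= threshold
            for other, _ in kept
        )
        if not is_duplicate:
            kept.append((phrase, score))
    return kept
```

A lower threshold suppresses more aggressively; a threshold above 1.0 disables suppression entirely, since `ratio()` never exceeds 1.0.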

Caveats

  • The README doesn’t quantify accuracy or compare against modern embedding-based methods (e.g., KeyBERT); effectiveness on long or highly technical documents is unclear
  • “Language and domain independent” is claimed but not benchmarked across domains in the provided docs
  • Command-line help contains a stray Portuguese word (“deduplication limiar”, where “limiar” means “threshold”)

Verdict

Useful for quick prototyping, low-resource environments, or when you can’t ship a transformer model. Skip it if you need state-of-the-art precision and have the GPU cycles for embedding-based or supervised alternatives.
