chartbeat-labs/textacy

The spaCy sidekick that cleans your text and computes your Flesch-Kincaid score

A Python library for the NLP grunt work that happens before tokenization and after parsing.

What it does

textacy wraps spaCy with higher-level helpers for the parts of an NLP pipeline that spaCy doesn’t touch: loading datasets, cleaning raw text, extracting structured info like keyterms and SVO triples, building topic models, and computing readability scores. It’s essentially the glue and utility layer between “we have text” and “we have vectors.”
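To make the "cleaning raw text" step concrete, here is a minimal standard-library sketch of the kind of boilerplate such a helper replaces. It is not textacy's implementation or API: `clean_text` and the `_URL_` placeholder are hypothetical names chosen purely for illustration.

```python
import re
import unicodedata

# Hypothetical helper for illustration only; not part of textacy's API.
_URL_RE = re.compile(r"https?://\S+")
_WS_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Normalize unicode, mask URLs, and collapse runs of whitespace."""
    # NFKC folds compatibility characters (ligatures, full-width forms).
    text = unicodedata.normalize("NFKC", raw)
    # Replace each URL with a placeholder token (an assumed convention).
    text = _URL_RE.sub("_URL_", text)
    # Collapse newlines, tabs, and repeated spaces into single spaces.
    return _WS_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_text("Visit  https://example.com\n\tfor \ufb01ne   details."))
```

Each step is a single substitution, so the order matters: URLs are masked before whitespace is collapsed so that a URL followed by a newline still ends at the right place.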

The interesting bit

The library ships with ready-made datasets (Congressional speeches, historical literature, Reddit comments), which is rarer than it should be in NLP tooling. It also extends spaCy’s Doc objects with custom methods, so you can call .to_bag_of_terms() or .to_semantic_network() directly on a parsed document rather than juggling converters yourself.
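For a sense of what a bag-of-terms conversion produces, a rough standard-library approximation follows. It assumes plain whitespace-split tokens instead of a spaCy Doc, and `ngrams` and `bag_of_terms` are illustrative functions written for this sketch, not textacy's methods or signatures.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all contiguous n-token windows, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_terms(tokens: List[str], max_n: int = 2) -> Counter:
    """Count every n-gram from unigrams up to max_n-grams (illustrative)."""
    bag: Counter = Counter()
    for n in range(1, max_n + 1):
        bag.update(ngrams(tokens, n))
    return bag


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    for term, count in bag_of_terms(tokens).most_common(3):
        print(term, count)
```

A real pipeline would typically build terms from lemmas, entities, and noun chunks on the parsed document; the counting step at the end looks much the same.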

Key highlights

  • Pre- and post-processing around spaCy: cleaning, normalization, n-gram extraction, acronym detection, keyterm ranking
  • Built-in datasets with metadata (no scraping required)
  • Topic modeling pipeline: tokenization, vectorization, training, visualization
  • Readability and lexical diversity stats, including multilingual Flesch Reading Ease
  • String/sequence similarity metrics beyond what spaCy provides
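As a sketch of the readability bullet above, the standard English Flesch Reading Ease formula is 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The version below uses a naive vowel-group syllable counter and regex sentence splitting, both assumptions made for brevity. The function names are hypothetical, and textacy's own implementation and its per-language coefficients are not shown here.

```python
import re


def count_syllables(word: str) -> int:
    """Naive heuristic: one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """English Flesch Reading Ease using crude regex tokenization."""
    # Sentences are approximated as spans ending in ., !, or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


if __name__ == "__main__":
    print(round(flesch_reading_ease("The cat sat."), 2))
```

Higher scores mean easier text. The syllable heuristic is the weak point: it overcounts silent vowels and undercounts some diphthong splits, which is one reason to prefer a maintained library over a hand-rolled version.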

Caveats

  • The README is light on specifics: no version requirements, no performance notes, no comparison to alternatives like spacy-transformers or gensim
  • “…and much more!” suggests breadth over depth; you’ll need to dig into the docs to see what’s actually well-supported

Verdict

Worth a look if you’re building spaCy-based pipelines and tired of rewriting text-cleaning boilerplate. Skip it if you need cutting-edge neural models or fine-grained control over every preprocessing step: this is convenience tooling, not research infrastructure.
