The spaCy sidekick that cleans your text and counts your Flesch-Kincaid
A Python library for the NLP grunt work that happens before tokenization and after parsing.

What it does
textacy wraps spaCy with higher-level helpers for the parts of an NLP pipeline that spaCy doesn’t touch: loading datasets, cleaning raw text, extracting structured info like keyterms and SVO triples, building topic models, and computing readability scores. It’s essentially the glue and utility layer between “we have text” and “we have vectors.”
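To make "cleaning raw text" concrete, here is a minimal standard-library sketch of the kind of boilerplate such a helper layer saves you from rewriting. It is not textacy's API: the `clean_text` function, the regexes, and the `_URL_` / `_EMAIL_` placeholders are all hypothetical names chosen for illustration, and the patterns are deliberately naive.

```python
import re
import unicodedata

# Deliberately simple patterns; a real cleaner handles many more edge cases
# (for instance, this URL pattern swallows trailing punctuation).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
WHITESPACE_RE = re.compile(r"\s+")

# Map curly quotes to their ASCII equivalents.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def clean_text(raw: str) -> str:
    """Normalize unicode, mask URLs and emails, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)  # e.g. folds ligatures like U+FB01
    text = text.translate(QUOTES)
    text = URL_RE.sub("_URL_", text)
    text = EMAIL_RE.sub("_EMAIL_", text)
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_text("Visit  https://example.com\tnow,  or mail bob@example.com!"))
```

The output of a step like this is what you would then hand to spaCy for tokenization and parsing.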
The interesting bit
The library ships with ready-made datasets — Congressional speeches, historical literature, Reddit comments — which is rarer than it should be in NLP tooling. It also extends spaCy’s Doc objects with custom methods, so you can call .to_bag_of_terms() or .to_semantic_network() directly on a parsed document rather than juggling converters yourself.
Key highlights
- Pre- and post-processing around spaCy: cleaning, normalization, n-gram extraction, acronym detection, keyterm ranking
- Built-in datasets with metadata (no scraping required)
- Topic modeling pipeline: tokenization, vectorization, training, visualization
- Readability and lexical diversity stats, including multilingual Flesch Reading Ease
- String/sequence similarity metrics beyond what spaCy provides
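For a sense of what the readability stats compute, here is a rough pure-Python sketch of Flesch Reading Ease and the Flesch-Kincaid grade level, using the standard English coefficients. It does not call textacy: `count_syllables` and `readability` are hypothetical helpers, the syllable counter is a crude vowel-group heuristic, and other languages need different coefficients, so treat the numbers as approximate.

```python
import re


def count_syllables(word: str) -> int:
    """Rough English estimate: count vowel groups, discount a silent final 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def readability(text: str) -> dict:
    """Return Flesch Reading Ease and Flesch-Kincaid grade for English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)

    return {
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }


if __name__ == "__main__":
    print(readability("The cat sat on the mat. It was happy."))
```

Higher Reading Ease means easier text, while the grade level moves the other way. A library implementation would typically take its sentence and word counts from the parsed document and use a proper syllable counter (hyphenation dictionaries, for example) rather than regexes.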
Caveats
- The README is light on specifics: no version requirements, no performance notes, no comparison to alternatives like spacy-transformers or gensim
- “…and much more!” suggests breadth over depth; you’ll need to dig into the docs to see what’s actually well-supported
Verdict
Worth a look if you’re building spaCy-based pipelines and tired of rewriting text-cleaning boilerplate. Skip it if you need cutting-edge neural models or fine-grained control over every preprocessing step — this is convenience tooling, not research infrastructure.