Stanford's NLP models, finally speaking spaCy
A compatibility wrapper that lets you drop Stanza's research-grade multilingual pipelines into spaCy's ecosystem without rewriting your code.
What it does
spacy-stanza is a bridge, not a model. It wraps Stanford’s Stanza library so its tokenization, tagging, parsing, and NER outputs populate standard spaCy Doc objects. You call spacy_stanza.load_pipeline("en") and get back something that behaves like any other spaCy nlp object — displaCy visualizations, custom components, nlp.pipe, the lot.
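A minimal sketch of that call, following the usage in the spacy-stanza README as I understand it. The helper names token_rows and run_demo and the sample sentence are my own illustrations, not part of the library. The imports sit inside run_demo so that defining the helpers needs neither package nor a model download.

```python
def token_rows(doc):
    """Return (text, lemma, POS tag, dependency label) for every token in a Doc."""
    return [(tok.text, tok.lemma_, tok.pos_, tok.dep_) for tok in doc]


def run_demo(text="Barack Obama was born in Hawaii."):
    """Hypothetical demo: load the English pipeline and inspect its annotations."""
    # Imported lazily: both packages are heavy and the models must be downloaded.
    import stanza
    import spacy_stanza

    stanza.download("en")                   # fetch the English Stanza models
    nlp = spacy_stanza.load_pipeline("en")  # used like any other spaCy nlp object
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return token_rows(doc), entities
```

Calling run_demo() downloads the English models on first use, so expect it to take a while and to need network access.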
The interesting bit
The clever part is where the work happens: everything runs inside a custom StanzaTokenizer, which means Stanza’s full pipeline executes at tokenization time and stuffs all annotations (lemmas, dependencies, entities) into the Doc before downstream components even see it. It’s a bit of a hack, but it lets you bolt spaCy-specific tools, say an EntityRuler or a text classifier, on top of Stanza’s outputs.
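Here is a rough sketch of bolting an EntityRuler onto the wrapped pipeline. The add_pipe("entity_ruler") call and the overwrite_ents setting are standard spaCy v3. The helpers build_patterns and load_with_ruler are hypothetical names of mine, and the sketch assumes the Stanza models for the chosen language are already downloaded.

```python
def build_patterns(terms_by_label):
    """Turn {"LABEL": ["phrase", ...]} into EntityRuler phrase patterns."""
    return [
        {"label": label, "pattern": phrase}
        for label, phrases in terms_by_label.items()
        for phrase in phrases
    ]


def load_with_ruler(terms_by_label, lang="en"):
    """Hypothetical helper: Stanza pipeline plus a spaCy EntityRuler on top."""
    import spacy_stanza  # lazy import; assumes models for `lang` are downloaded

    nlp = spacy_stanza.load_pipeline(lang)
    # Stanza's entities are already on the Doc when the ruler runs, so
    # overwrite_ents lets ruler matches replace overlapping Stanza entities.
    ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})
    ruler.add_patterns(build_patterns(terms_by_label))
    return nlp
```

For example, load_with_ruler({"ORG": ["spaCy"]}) would return a pipeline in which the ruler post-processes whatever entities Stanza produced.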
Key highlights
- Supports 68+ languages with Stanza’s pretrained models; falls back to spaCy’s xx (multi-language) class when spaCy lacks dedicated support
- Full spaCy API compatibility: doc.ents, token.dep_, displacy, custom pipeline components, serialization via nlp.to_disk()
- Stanza pipeline options (language packages, pretokenized input, GPU use) pass through as keyword arguments or spaCy config blocks
- spaCy v3.x only; v2.x users must pin to spacy-stanza<0.3.0
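To make the pass-through point concrete, here is a small sketch that assembles keyword arguments for load_pipeline. The keyword names package, tokenize_pretokenized, and use_gpu follow the README as I recall it, so check them against your installed version. The helpers pipeline_options and load_configured are hypothetical.

```python
def pipeline_options(package=None, pretokenized=False, use_gpu=True):
    """Assemble keyword arguments to forward to spacy_stanza.load_pipeline."""
    options = {"use_gpu": use_gpu}
    if package is not None:
        options["package"] = package             # a non-default Stanza language package
    if pretokenized:
        options["tokenize_pretokenized"] = True  # input is already split into tokens
    return options


def load_configured(lang, **choices):
    """Hypothetical helper: load a pipeline with the chosen Stanza options."""
    import spacy_stanza  # lazy import; assumes models for `lang` are downloaded

    return spacy_stanza.load_pipeline(lang, **pipeline_options(**choices))
```

A call such as load_configured("de", package="hdt", use_gpu=False) would then request the German hdt package on CPU, assuming that package has been downloaded.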
Caveats
- Serialization saves pipeline config but not Stanza model weights; you must re-download models separately via stanza.download()
- Swapping in spaCy’s own tokenizer is supported for English only
- Stanza models are “very large” (README’s words, not mine), so this is not a lightweight deployment option
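The serialization caveat is the one most likely to bite, so here is a hedged sketch of the round trip. It assumes a saved pipeline can be restored with spacy.load once the weights are back in Stanza's default model directory, which is my reading of the README rather than something I have verified. save_pipeline and reload_pipeline are hypothetical helper names.

```python
from pathlib import Path


def save_pipeline(nlp, out_dir):
    """Write the pipeline config and spaCy-side data; Stanza weights are not included."""
    out_dir = Path(out_dir)
    nlp.to_disk(out_dir)
    return out_dir


def reload_pipeline(saved_dir, lang="en"):
    """Hypothetical helper: restore the Stanza weights, then load the saved pipeline."""
    import spacy
    import stanza
    import spacy_stanza  # noqa: F401  imported so its tokenizer is registered with spaCy

    stanza.download(lang)         # bring back the weights that to_disk() left out
    return spacy.load(saved_dir)  # assumption: the saved config points at the Stanza tokenizer
```

On a fresh machine the download step has to happen before loading, which is why reload_pipeline runs it first.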
Verdict
Worth a look if you need Stanza’s multilingual accuracy or CoNLL-winning parsers but can’t abandon your spaCy-based tooling. Skip it if you’re building from scratch and don’t need both ecosystems; the wrapper adds friction and model bloat for no gain.