explosion/spacy-stanza

Stanford's NLP models, finally speaking spaCy

A compatibility wrapper that lets you drop Stanza's research-grade multilingual pipelines into spaCy's ecosystem without rewriting your code.

748 stars · Python · Other · AI
Velocity (7d): +0.3 ★/day · Trend: steady

What it does

spacy-stanza is a bridge, not a model. It wraps Stanford’s Stanza library so that its tokenization, tagging, parsing, and NER outputs populate standard spaCy Doc objects. Call spacy_stanza.load_pipeline("en") and you get back an object that behaves like any other spaCy nlp object: displaCy visualizations, custom components, nlp.pipe, the lot.
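
A minimal usage sketch, following the pattern in the project's README: download the English Stanza models once, load the wrapped pipeline, and read annotations off the Doc as usual. The helper names load_english, token_rows, and demo are illustrative and not part of either library. The imports are deferred so the file can be imported on a machine where the models are not installed yet.

```python
def load_english():
    """Return a spaCy-compatible pipeline backed by Stanza's English models."""
    # Deferred imports: both packages are heavy and need downloaded models.
    import stanza
    import spacy_stanza

    stanza.download("en")  # fetches the English models if they are missing
    return spacy_stanza.load_pipeline("en")


def token_rows(doc):
    """Collect (text, lemma, POS, dependency label) for each token in a Doc."""
    return [(t.text, t.lemma_, t.pos_, t.dep_) for t in doc]


def demo():
    """Run the pipeline on one sentence and print what Stanza filled in."""
    nlp = load_english()
    doc = nlp("Barack Obama was born in Hawaii.")
    for row in token_rows(doc):
        print(row)
    print([(ent.text, ent.label_) for ent in doc.ents])
```

Calling demo() needs both packages installed and triggers a model download on first use, so it is left for you to invoke explicitly.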

The interesting bit

The clever part is where the work happens. Everything runs inside a custom StanzaTokenizer, so Stanza’s full pipeline executes at tokenization time and writes all of its annotations (lemmas, dependencies, entities) into the Doc before any downstream component sees it. It’s a bit of a hack, but it lets you layer spaCy-specific tools, say an EntityRuler or a text classifier, on top of Stanza’s outputs.
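
Because the Stanza work is finished by the time the tokenizer returns, adding a spaCy component is the same call it would be on any other pipeline. A sketch, assuming the English Stanza models are already downloaded; ruler_patterns and build_pipeline are hypothetical helper names, and the default "PRODUCT" label is only an example.

```python
def ruler_patterns(terms, label):
    """Build phrase patterns in the dict format spaCy's EntityRuler accepts."""
    return [{"label": label, "pattern": term} for term in terms]


def build_pipeline(terms, label="PRODUCT"):
    """Load the Stanza-backed pipeline and add an EntityRuler after it."""
    import spacy_stanza  # deferred: needs Stanza's English models on disk

    nlp = spacy_stanza.load_pipeline("en")
    # Every Doc leaving nlp.tokenizer already carries Stanza's annotations,
    # so the ruler is an ordinary add-on component that runs afterwards.
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(ruler_patterns(terms, label))
    return nlp
```

With its default settings the EntityRuler does not overwrite entities that are already set, so where a rule overlaps a span Stanza has tagged, Stanza's entity should be the one that survives.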

Key highlights

  • Supports 68+ languages with Stanza’s pretrained models; falls back to spaCy’s xx language class when spaCy lacks dedicated support
  • Full spaCy API compatibility: doc.ents, token.dep_, displacy, custom pipeline components, serialization via nlp.to_disk()
  • Stanza pipeline options (language packages, pretokenized input, GPU use) pass through as keyword arguments or spaCy config blocks
  • spaCy v3.x only; v2.x users must pin to spacy-stanza<0.3.0
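
The keyword-argument pass-through can be sketched as a small table of presets. The option values mirror examples in the project's README; the PRESETS table, its keys, and load_preset are illustrative names, and each preset assumes the matching Stanza models have been downloaded.

```python
# Each preset pairs a spaCy language code with Stanza keyword arguments.
PRESETS = {
    # Coptic has no dedicated spaCy language class, so spaCy's "xx" is used.
    "coptic": ("xx", {"lang": "cop"}),
    # Select a non-default Stanza model package for German.
    "german_hdt": ("de", {"package": "hdt"}),
    # Run only the listed Stanza processors.
    "tag_only": ("en", {"processors": "tokenize,pos"}),
    # Treat the input as already tokenized.
    "pretokenized": ("en", {"tokenize_pretokenized": True}),
    # Keep Stanza on the CPU even when a GPU is available.
    "cpu_only": ("en", {"use_gpu": False}),
}


def load_preset(name):
    """Load a pipeline, forwarding the preset's options to Stanza."""
    import spacy_stanza  # deferred: requires the relevant Stanza models

    spacy_lang, stanza_kwargs = PRESETS[name]
    return spacy_stanza.load_pipeline(spacy_lang, **stanza_kwargs)
```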

Caveats

  • Serialization saves the pipeline config but not the Stanza model weights; you must download the models separately via stanza.download() wherever you reload the pipeline
  • Swapping in spaCy’s own tokenizer is supported for English only
  • Stanza models are “very large” (README’s words, not mine), so this is not a lightweight deployment option
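
The serialization caveat in practice, as a hedged sketch: the save step writes spaCy's side of the pipeline, so the Stanza models must already be on disk wherever the pipeline is loaded again. save_and_reload is a hypothetical helper, and the path is whatever directory you choose.

```python
def save_and_reload(lang, path):
    """Save a Stanza-backed pipeline to disk, then load it back with spaCy."""
    import spacy
    import stanza
    # Imported here so the package is loaded before spaCy rebuilds the pipeline.
    import spacy_stanza

    # The models must exist locally; to_disk() below will not bundle them.
    stanza.download(lang)

    nlp = spacy_stanza.load_pipeline(lang)
    nlp.to_disk(path)  # stores the pipeline config, not Stanza's weights
    return spacy.load(path)
```

On a fresh machine, the stanza.download() call is the step that cannot be skipped before spacy.load().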

Verdict

Worth a look if you need Stanza’s multilingual accuracy or CoNLL-winning parsers but can’t abandon your spaCy-based tooling. Skip it if you’re building from scratch and don’t need both ecosystems; the wrapper adds friction and model bloat for no gain.
