Persian NLP that handles the zero-width non-joiner so you don't have to
A Python toolkit for Persian text processing, from normalization to dependency parsing, with models fetched automatically from Hugging Face.

What it does
Hazm is a Python library for processing Persian (Farsi) text. It covers the standard NLP pipeline—normalization, tokenization, lemmatization, POS tagging, chunking, dependency parsing, and word/sentence embeddings—plus utilities for reading popular Persian corpora. Models download and cache automatically from Hugging Face Hub.
The interesting bit
Persian text normalization is fiddly work: diacritics, inconsistent spacing, and the half-space, encoded as the zero-width non-joiner (ZWNJ, U+200C), all need correction before anything else works. Hazm bakes this in as a first-class step rather than leaving it as an exercise. The lemmatizer also returns both verb stems, past and present, joined by # (e.g., نوشت#نویس for “می‌نویسیم”), which is more informative than a simple stem.
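To make the "fiddly" part concrete, here is a minimal sketch of three such fixes using only the standard library. It is not Hazm's implementation: the function name normalize_sketch and the rules it applies (mapping Arabic kaf/yeh to their Persian forms, stripping harakat, and attaching a standalone می or نمی prefix with a ZWNJ) are assumptions chosen for illustration, and a real normalizer covers far more cases.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, the Persian "half-space"

# Arabic kaf (U+0643) and yeh (U+064A) mapped to Persian keheh (U+06A9) and yeh (U+06CC).
ARABIC_TO_PERSIAN = str.maketrans({"\u0643": "\u06a9", "\u064a": "\u06cc"})

# Short-vowel and related marks (fathatan through sukun).
DIACRITICS = re.compile("[\u064b-\u0652]")

# A standalone "mi" (U+0645 U+06CC) or "nemi" (with leading U+0646) prefix
# followed by a plain space and another word.
MI_PREFIX = re.compile(r"(?<!\S)(\u0646?\u0645\u06cc) (?=\S)")


def normalize_sketch(text: str) -> str:
    """Illustrative only: apply three simple Persian normalization rules."""
    text = text.translate(ARABIC_TO_PERSIAN)  # unify letter variants first
    text = DIACRITICS.sub("", text)           # drop diacritics
    # Replace the space after the verbal prefix with a ZWNJ.
    text = MI_PREFIX.sub(lambda m: m.group(1) + ZWNJ, text)
    return text


if __name__ == "__main__":
    raw = "\u0645\u064a \u0631\u0648\u0645"  # "mi ravam" typed with Arabic yeh and a space
    print(repr(normalize_sketch(raw)))
```

Even this toy version shows why order matters: the letter mapping has to run before the prefix rule, otherwise a prefix typed with the Arabic yeh would be missed.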
Key highlights
- POS tagger hits 98.8% accuracy; dependency parser at 85.6% on the project’s own evaluation
- Hugging Face integration means no manual model downloads: just pass repo_id and model_filename
- Supports FastText word embeddings and sentence vectors via sent2vec
- Includes ready-made corpus readers for common Persian datasets
- Requires Python 3.12+
Caveats
- The README lists both legacy modules and “Spacy”-prefixed ones (SpacyPOSTagger, SpacyChunker, etc.) with different metrics, but doesn’t clarify whether the latter are spaCy wrappers or independent implementations
- Dependency parser output is raw nested dictionaries, not a graph object—usable, but you’ll do your own traversal
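
As a rough idea of what "your own traversal" might involve, the sketch below walks a dependency parse stored as nested dictionaries. The shape used here (token index mapped to a dict with "word", "head", and "rel" keys, with head 0 marking the root) is an assumption for illustration, not Hazm's documented output, and the helper names children_index and walk are hypothetical.

```python
from collections import defaultdict

# Hypothetical parse: token index -> {"word", "head", "rel"}; head 0 is the root.
parse = {
    1: {"word": "I", "head": 2, "rel": "nsubj"},
    2: {"word": "read", "head": 0, "rel": "root"},
    3: {"word": "books", "head": 2, "rel": "obj"},
}


def children_index(nodes):
    """Map each head index to the list of its dependents' indices."""
    children = defaultdict(list)
    for idx, node in nodes.items():
        children[node["head"]].append(idx)
    return children


def walk(nodes, head=0, depth=0, children=None):
    """Yield (depth, word, rel) tuples depth-first, starting below `head`."""
    if children is None:
        children = children_index(nodes)
    for idx in sorted(children.get(head, [])):
        node = nodes[idx]
        yield depth, node["word"], node["rel"]
        yield from walk(nodes, idx, depth + 1, children)


if __name__ == "__main__":
    for depth, word, rel in walk(parse):
        print("  " * depth + f"{word} ({rel})")
```

Building the head-to-children index once keeps the walk linear in the number of tokens; whatever the real output shape turns out to be, adapting the key names should be the only change needed.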
Verdict
Worth a look if you’re building Persian-language pipelines and want batteries-included preprocessing. Skip it if you’re already invested in spaCy or transformers and prefer to roll your own Persian normalization.