roshan-research/hazm

Persian NLP that handles the zero-width non-joiner so you don't have to

A Python toolkit for Persian text processing, from normalization to dependency parsing, with models fetched automatically from Hugging Face.

1.4k stars · Python · ML Frameworks · RAG · Search
Velocity (7d): +0.3 ★/day · Trend: steady

What it does

Hazm is a Python library for processing Persian (Farsi) text. It covers the standard NLP pipeline—normalization, tokenization, lemmatization, POS tagging, chunking, dependency parsing, and word/sentence embeddings—plus utilities for reading popular Persian corpora. Models download and cache automatically from Hugging Face Hub.

The interesting bit

Persian text normalization is fiddly work: diacritics, inconsistent spacing, and half-spaces (typed as the zero-width non-joiner, ZWNJ) all need correcting before any downstream step gives reliable results. Hazm treats normalization as a first-class step rather than leaving it as an exercise. The lemmatizer also returns both stems of a verb joined by "#" (e.g., نوشت#نویس, past stem then present stem, for “می‌نویسیم”), which is more informative than a single stem.
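To make the normalization problem concrete, here is a minimal standard-library sketch of the kind of cleanup involved. It is not Hazm's implementation: the function names are hypothetical, the prefix rule is deliberately naive (it would also rewrite a standalone noun spelled می), and the "past#present" reading of the lemma string is inferred from the example above.

```python
import re
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner, the Persian "half-space"

def strip_diacritics(text: str) -> str:
    """Drop combining marks such as fatha, kasra, and shadda."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

def join_verb_prefix(text: str) -> str:
    """Replace the plain space after a standalone می or نمی with a ZWNJ.

    \u0645\u06cc is می and \u0646 is the optional negative ن.
    """
    pattern = r"(?<!\S)(\u0646?\u0645\u06cc) +(?=\S)"
    return re.sub(pattern, r"\1" + ZWNJ, text)

def split_lemma(lemma: str) -> tuple[str, str]:
    """Split a 'past#present' lemma string into its two stems (assumed order)."""
    past, present = lemma.split("#")
    return past, present

# "می نویسیم" typed with a plain space instead of a half-space
text = "\u0645\u06cc \u0646\u0648\u06cc\u0633\u06cc\u0645"
fixed = join_verb_prefix(strip_diacritics(text))
print(ZWNJ in fixed, " " in fixed)
```

A real normalizer has to cover many more affixes and character variants, which is the argument for using a library that ships these rules.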

Key highlights

  • POS tagger hits 98.8% accuracy; dependency parser at 85.6% on the project’s own evaluation
  • Hugging Face integration means no manual model downloads—just pass repo_id and model_filename
  • Supports FastText word embeddings and sentence vectors via sent2vec
  • Includes ready-made corpus readers for common Persian datasets
  • Requires Python 3.12+

Caveats

  • The README lists legacy modules alongside “Spacy”-prefixed ones (SpacyPOSTagger, SpacyChunker, etc.) with different metrics, but doesn’t clarify whether the latter are spaCy wrappers or independent implementations
  • Dependency parser output is raw nested dictionaries, not a graph object—usable, but you’ll do your own traversal
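
The traversal mentioned in the last caveat could look roughly like the sketch below. The dictionary shape (token index mapped to "word", "head", and "rel", with head 0 as the root) and the relation labels are assumptions made for illustration, not Hazm's documented output format, so the keys will need adapting to what the parser actually returns.

```python
from collections import defaultdict

# Hypothetical parse of "ما نامه می‌نویسیم": index -> {"word", "head", "rel"}.
# Head 0 marks the root. This shape is an assumption, not Hazm's real output.
parse = {
    1: {"word": "ما", "head": 3, "rel": "nsubj"},
    2: {"word": "نامه", "head": 3, "rel": "obj"},
    3: {"word": "می‌نویسیم", "head": 0, "rel": "root"},
}

def children_by_head(nodes):
    """Group token indices under the index of their head."""
    children = defaultdict(list)
    for index, node in sorted(nodes.items()):
        children[node["head"]].append(index)
    return dict(children)

def walk(nodes, children, head=0, depth=0):
    """Yield (depth, index, relation) depth-first, starting below `head`."""
    for index in children.get(head, []):
        yield depth, index, nodes[index]["rel"]
        yield from walk(nodes, children, index, depth + 1)

children = children_by_head(parse)
for depth, index, rel in walk(parse, children):
    print("  " * depth + f"{rel}: {parse[index]['word']}")
```

Building the head-to-children map once keeps each later walk linear in the number of tokens.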

Verdict

Worth a look if you’re building Persian-language pipelines and want batteries-included preprocessing. Skip it if you’re already invested in spaCy or transformers and prefer to roll your own Persian normalization.
