Hazm: Persian NLP Toolkit

Hazm is a comprehensive Python library for processing the Persian language. It provides tools for text normalization, sentence and word tokenization, stemming, lemmatization, part-of-speech tagging, syntactic dependency parsing, and more.

Compatible with Python 3.12+

Hazm is built on top of the NLTK library and is specifically optimized for the Persian language. It is fully compatible with Python 3.12+.

Maintained by Roshan

Originally started as a personal project, Hazm is now developed and maintained by the Roshan AI team.


Installation

You can install Hazm using pip:

$ pip install hazm

Pretrained Models

Hazm requires pretrained models for advanced tasks such as POS tagging, chunking, and dependency parsing. There are two ways to use these models:

1. Automatic Loading (Hugging Face Hub)

The latest version of Hazm integrates directly with the Hugging Face Hub. You can load models automatically by providing the repo_id and model_filename:

from hazm import POSTagger

# This will automatically download and cache the model from Hugging Face
tagger = POSTagger(
    repo_id="roshan-research/hazm-postagger", 
    model_filename="pos_tagger.model"
)

2. Manual Loading

If you prefer to work offline, you can download the models manually and provide the local path to the constructor:

from hazm import POSTagger

# Provide the local path to the downloaded model file
tagger = POSTagger(model="path/to/your/pos_tagger.model")
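If you want to fetch the files yourself, one option is the huggingface_hub client (a separate package, not part of Hazm). This is a minimal sketch assuming the POS tagger is published in the roshan-research/hazm-postagger repository shown above:

from huggingface_hub import hf_hub_download
from hazm import POSTagger

# Download the model file once; hf_hub_download returns the cached local path
local_path = hf_hub_download(
    repo_id="roshan-research/hazm-postagger",
    filename="pos_tagger.model",
)

tagger = POSTagger(model=local_path)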

Quick Start

Import Hazm into your project and start processing Persian text immediately:

from hazm import *

# ===============================
# Stemming
# ===============================
stemmer = Stemmer()
stem = stemmer.stem('کتاب‌ها')
print(stem) # کتاب

# ===============================
# Normalizing
# ===============================
normalizer = Normalizer()
normalized_text = normalizer.normalize('من کتاب های زیــــادی دارم .')
print(normalized_text) # من کتاب‌های زیادی دارم.

# ===============================
# Lemmatizing
# ===============================
lemmatizer = Lemmatizer()
lem = lemmatizer.lemmatize('می‌نویسیم')
print(lem) # نوشت#نویس

# ===============================
# Sentence tokenizing
# ===============================
sentence_tokenizer = SentenceTokenizer()
sent_tokens = sentence_tokenizer.tokenize('ما کتاب می‌خوانیم. یادگیری خوب است.')
print(sent_tokens) # ['ما کتاب می\u200cخوانیم.', 'یادگیری خوب است.']

# ===============================
# Word tokenizing
# ===============================
word_tokenizer = WordTokenizer()
word_tokens = word_tokenizer.tokenize('ما کتاب می‌خوانیم')
print(word_tokens) # ['ما', 'کتاب', 'می\u200cخوانیم']

# ===============================
# Part of speech tagging
# ===============================
tagger = POSTagger(repo_id="roshan-research/hazm-postagger", model_filename="pos_tagger.model")
tagged_words = tagger.tag(word_tokens)
print(tagged_words) # [('ما', 'PRON'), ('کتاب', 'NOUN'), ('می\u200cخوانیم', 'VERB')]

# ===============================
# Chunking
# ===============================
chunker = Chunker(repo_id="roshan-research/hazm-chunker", model_filename="chunker.model")
chunked_tree = tree2brackets(chunker.parse(tagged_words))
print(chunked_tree) # [ما NP] [کتاب NP] [می‌خوانیم VP]

# ===============================
# Word embedding
# ===============================
word_embedding = WordEmbedding.load(repo_id='roshan-research/hazm-word-embedding', model_filename='fasttext_skipgram_300.bin', model_type='fasttext')
odd_word = word_embedding.doesnt_match(['کتاب', 'دفتر', 'قلم', 'پنجره'])
print(odd_word) # پنجره

# ===============================
# Sentence embedding
# ===============================
sent_embedding = SentEmbedding.load(repo_id='roshan-research/hazm-sent-embedding', model_filename='sent2vec-naab.model')
sentence_similarity = sent_embedding.similarity('او شیر میخورد', 'شیر غذا می‌خورد')
print(sentence_similarity) # 0.4643607437610626

# ===============================
# Dependency parsing
# ===============================
parser = DependencyParser(tagger=tagger, lemmatizer=lemmatizer, repo_id="roshan-research/hazm-dependency-parser", model_filename="langModel.mco")
dependency_graph = parser.parse(word_tokens)
print(dependency_graph)
"""
{0:  {'address': 0,
      'ctag': 'TOP',
      'deps': defaultdict(<class 'list'>, {'root': [3]}),
      'feats': None,
      'head': None,
      'lemma': None,
      'rel': None,
      'tag': 'TOP',
      'word': None},
  1: {'address': 1,
      'ctag': 'PRON',
      'deps': defaultdict(<class 'list'>, {}),
      'feats': '_',
      'head': 3,
      'lemma': 'ما',
      'rel': 'SBJ',
      'tag': 'PRON',
      'word': 'ما'},
  2: {'address': 2,
      'ctag': 'NOUN',
      'deps': defaultdict(<class 'list'>, {}),
      'feats': '_',
      'head': 3,
      'lemma': 'کتاب',
      'rel': 'OBJ',
      'tag': 'NOUN',
      'word': 'کتاب'},
  3: {'address': 3,
      'ctag': 'VERB',
      'deps': defaultdict(<class 'list'>, {'SBJ': [1], 'OBJ': [2]}),
      'feats': '_',
      'head': 0,
      'lemma': 'خواند#خوان',
      'rel': 'root',
      'tag': 'VERB',
      'word': 'می\u200cخوانیم'}}

"""

Next Steps