Hazm: Persian NLP Toolkit¶
Hazm is a comprehensive Python library for processing the Persian language. It provides tools for text normalization, sentence and word tokenization, stemming, lemmatization, part-of-speech tagging, syntactic dependency parsing, and more.
Compatible with Python 3.12+
Hazm is built on top of the NLTK library and is specifically optimized for the Persian language. It is fully compatible with Python 3.12+.
Maintained by Roshan
Originally started as a personal project, Hazm is now developed and maintained by the Roshan AI team.
Installation¶
You can install Hazm using pip:
$ pip install hazm
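To verify the installation, you can run a quick one-liner from the command line (any Hazm class will do; here the Normalizer simply echoes back an already-normalized word):

$ python -c "from hazm import Normalizer; print(Normalizer().normalize('سلام'))"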
Pretrained Models¶
Hazm requires pretrained models for advanced tasks such as part-of-speech tagging, chunking, and dependency parsing. There are two ways to load these models:
1. Automatic Loading (Hugging Face Hub)¶
The latest version of Hazm integrates directly with the Hugging Face Hub. You can load models automatically by providing the repo_id and model_filename:
from hazm import POSTagger
# This will automatically download and cache the model from Hugging Face
tagger = POSTagger(
    repo_id="roshan-research/hazm-postagger",
    model_filename="pos_tagger.model"
)
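The model file is downloaded once and cached locally (typically under the Hugging Face cache directory), so subsequent runs reuse the cached copy instead of downloading it again.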
2. Manual Loading¶
If you prefer to work offline, you can download the models manually and provide the local path to the constructor:
from hazm import POSTagger
# Provide the local path to the downloaded model file
tagger = POSTagger(model="path/to/your/pos_tagger.model")
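If you prefer to script the download itself rather than fetching the file through a browser, one option is the huggingface_hub client (a minimal sketch, assuming huggingface_hub is installed; hf_hub_download fetches the file once, caches it, and returns its local path):

from huggingface_hub import hf_hub_download
from hazm import POSTagger

# Download the model file from the Hub (or reuse the locally cached copy)
# and get back its path on disk.
local_path = hf_hub_download(
    repo_id="roshan-research/hazm-postagger",
    filename="pos_tagger.model",
)

tagger = POSTagger(model=local_path)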
Quick Start¶
Import Hazm into your project and start processing Persian text immediately:
from hazm import *
# ===============================
# Stemming
# ===============================
stemmer = Stemmer()
stem = stemmer.stem('کتابها')
print(stem) # کتاب
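# The stemmer strips suffixes (here the plural marker ها).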
# ===============================
# Normalizing
# ===============================
normalizer = Normalizer()
normalized_text = normalizer.normalize('من کتاب های زیــــادی دارم .')
print(normalized_text) # من کتاب‌های زیادی دارم.
# ===============================
# Lemmatizing
# ===============================
lemmatizer = Lemmatizer()
lem = lemmatizer.lemmatize('می‌نویسیم')
print(lem) # نوشت#نویس
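# For verbs the lemma is returned as past_stem#present_stem (نوشت = past stem, نویس = present stem).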
# ===============================
# Sentence tokenizing
# ===============================
sentence_tokenizer = SentenceTokenizer()
sent_tokens = sentence_tokenizer.tokenize('ما کتاب می‌خوانیم. یادگیری خوب است.')
print(sent_tokens) # ['ما کتاب می\u200cخوانیم.', 'یادگیری خوب است.']
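# \u200c is the zero-width non-joiner (ZWNJ) that separates the parts of Persian compound words such as می‌خوانیم.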
# ===============================
# Word tokenizing
# ===============================
word_tokenizer = WordTokenizer()
word_tokens = word_tokenizer.tokenize('ما کتاب می‌خوانیم')
print(word_tokens) # ['ما', 'کتاب', 'می\u200cخوانیم']
# ===============================
# Part of speech tagging
# ===============================
tagger = POSTagger(repo_id="roshan-research/hazm-postagger", model_filename="pos_tagger.model")
tagged_words = tagger.tag(word_tokens)
print(tagged_words) # [('ما', 'PRON'), ('کتاب', 'NOUN'), ('می\u200cخوانیم', 'VERB')]
# ===============================
# Chunking
# ===============================
chunker = Chunker(repo_id="roshan-research/hazm-chunker", model_filename="chunker.model")
chunked_tree = tree2brackets(chunker.parse(tagged_words))
print(chunked_tree) # [ما NP] [کتاب NP] [می‌خوانیم VP]
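# tree2brackets flattens the chunker's parse tree into a bracketed string of NP/VP/... chunks.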
# ===============================
# Word embedding
# ===============================
word_embedding = WordEmbedding.load(repo_id='roshan-research/hazm-word-embedding', model_filename='fasttext_skipgram_300.bin', model_type='fasttext')
odd_word = word_embedding.doesnt_match(['کتاب', 'دفتر', 'قلم', 'پنجره'])
print(odd_word) # پنجره
# ===============================
# Sentence embedding
# ===============================
sent_embedding = SentEmbedding.load(repo_id='roshan-research/hazm-sent-embedding', model_filename='sent2vec-naab.model')
sentence_similarity = sent_embedding.similarity('او شیر میخورد', 'شیر غذا میخورد')
print(sentence_similarity) # 0.4643607437610626
# ===============================
# Dependency parsing
# ===============================
parser = DependencyParser(tagger=tagger, lemmatizer=lemmatizer, repo_id="roshan-research/hazm-dependency-parser", model_filename="langModel.mco")
dependency_graph = parser.parse(word_tokens)
print(dependency_graph)
"""
{0: {'address': 0,
     'ctag': 'TOP',
     'deps': defaultdict(<class 'list'>, {'root': [3]}),
     'feats': None,
     'head': None,
     'lemma': None,
     'rel': None,
     'tag': 'TOP',
     'word': None},
 1: {'address': 1,
     'ctag': 'PRON',
     'deps': defaultdict(<class 'list'>, {}),
     'feats': '_',
     'head': 3,
     'lemma': 'ما',
     'rel': 'SBJ',
     'tag': 'PRON',
     'word': 'ما'},
 2: {'address': 2,
     'ctag': 'NOUN',
     'deps': defaultdict(<class 'list'>, {}),
     'feats': '_',
     'head': 3,
     'lemma': 'کتاب',
     'rel': 'OBJ',
     'tag': 'NOUN',
     'word': 'کتاب'},
 3: {'address': 3,
     'ctag': 'VERB',
     'deps': defaultdict(<class 'list'>, {'SBJ': [1], 'OBJ': [2]}),
     'feats': '_',
     'head': 0,
     'lemma': 'خواند#خوان',
     'rel': 'root',
     'tag': 'VERB',
     'word': 'می\u200cخوانیم'}}
"""
Next Steps¶
- Explore detailed documentation for each module in the Classes and Functions section.
- Learn how to work with various Persian corpora in the Corpus Readers section.
- If you are looking for Hazm in other programming languages, check out the Ports in Other Languages section.
