Word Tokenizer

This module includes classes and functions for extracting words from text.

WordTokenizer

Bases: TokenizerI, TokenizerProtocol

This class includes methods for extracting words from text.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| words_file | str \| Path | Path to the file containing the word list. Hazm ships a default file, but you can supply your own; refer to the default file for the expected structure. | default_words |
| verbs_file | str \| Path | Path to the file containing the verb list. Hazm ships a default file, but you can supply your own; refer to the default file for the expected structure. | default_verbs |
| join_verb_parts | bool | If True, joins the parts of multi-part verbs with underscores; for example, 'گفته شده است' becomes 'گفته_شده_است'. | True |
| join_abbreviations | bool | If True, keeps abbreviations intact and returns each one as a single token. | False |
| separate_emoji | bool | If True, separates adjacent emojis with a space so that each becomes its own token. | False |
| replace_links | bool | If True, replaces links with the word LINK. | False |
| replace_ids | bool | If True, replaces IDs (such as @username) with the word ID. | False |
| replace_emails | bool | If True, replaces email addresses with the word EMAIL. | False |
| replace_numbers | bool | If True, replaces decimal numbers with NUMF and integers with NUM followed by the number of digits (for example, a four-digit integer becomes NUM4). | False |
| replace_hashtags | bool | If True, replaces the # symbol with TAG. | False |

__init__(words_file=default_words, verbs_file=default_verbs, join_verb_parts=True, join_abbreviations=False, separate_emoji=False, replace_links=False, replace_ids=False, replace_emails=False, replace_numbers=False, replace_hashtags=False)

Initializes the WordTokenizer with the specified configurations.
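
The replace_* options can be pictured as a series of regular-expression substitutions applied before the text is split. The sketch below is a standalone illustration and not Hazm's implementation: the function name `replace_placeholders` and every pattern in it are assumptions chosen for brevity, and Hazm's own patterns may match different inputs.

```python
import re

# Hypothetical, simplified patterns (assumptions, not Hazm's own).
LINK_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ID_RE = re.compile(r"(?<![\w.])@\w+")
FLOAT_RE = re.compile(r"[\d۰-۹]+[.٫][\d۰-۹]+")
INT_RE = re.compile(r"[\d۰-۹]+")


def replace_placeholders(text: str) -> str:
    """Swap links, emails, IDs, and numbers for placeholder words."""
    # Links and emails go first because they may contain digits or '@'.
    text = LINK_RE.sub(" LINK ", text)
    text = EMAIL_RE.sub(" EMAIL ", text)
    text = ID_RE.sub(" ID ", text)
    # Decimals are handled before integers so '4.8' is not seen as two integers.
    text = FLOAT_RE.sub(" NUMF ", text)
    # Integers become NUM followed by their digit count.
    text = INT_RE.sub(lambda m: f" NUM{len(m.group())} ", text)
    # Collapse the padding spaces introduced by the substitutions.
    return " ".join(text.split())
```

The order of the substitutions matters in this sketch: running the integer pattern first would break up decimals, links, and IDs that contain digits.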

join_verb_parts(tokens)

Joins multi-part verbs with an underscore.

Examples:

>>> tokenizer = WordTokenizer()
>>> tokenizer.join_verb_parts(['خواهد', 'رفت'])
['خواهد_رفت']
>>> tokenizer.join_verb_parts(['رفته', 'است'])
['رفته_است']
>>> tokenizer.join_verb_parts(['گفته', 'شده', 'است'])
['گفته_شده_است']
>>> tokenizer.join_verb_parts(['گفته', 'خواهد', 'شد'])
['گفته_خواهد_شد']
>>> tokenizer.join_verb_parts(['خسته', 'شدید'])
['خسته', 'شدید']

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tokens | list[str] | A list of tokens, some of which may be parts of a multi-part verb. | required |

Returns:

| Type | Description |
| --- | --- |
| list[str] | A list where parts of multi-part verbs are joined by underscores if necessary. |
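
To make the joining step concrete, here is a minimal sketch that uses a different, simpler technique than Hazm may use: a greedy longest-match lookup against a small lexicon. The lexicon `MULTI_PART_VERBS` and the function `join_known_verbs` are hypothetical names introduced for illustration; Hazm builds its own verb data from verbs_file.

```python
# Hypothetical toy lexicon of multi-part verbs (an assumption for this sketch).
MULTI_PART_VERBS = {
    ("خواهد", "رفت"),
    ("رفته", "است"),
    ("گفته", "شده", "است"),
    ("گفته", "خواهد", "شد"),
}
MAX_PARTS = max(len(verb) for verb in MULTI_PART_VERBS)


def join_known_verbs(tokens: list[str]) -> list[str]:
    """Join runs of tokens that form a known multi-part verb."""
    result = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for size in range(min(MAX_PARTS, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in MULTI_PART_VERBS:
                result.append("_".join(candidate))
                i += size
                break
        else:
            # No multi-part verb starts here; keep the token unchanged.
            result.append(tokens[i])
            i += 1
    return result
```

Tokens that do not start a known multi-part verb pass through untouched, which is why the return value is described as joined only "if necessary".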

tokenize(text)

Extracts tokens from the given text.

Examples:

>>> tokenizer = WordTokenizer()
>>> tokenizer.tokenize('این جمله (خیلی) پیچیده نیست!!!')
['این', 'جمله', '(', 'خیلی', ')', 'پیچیده', 'نیست', '!!!']
>>> tokenizer = WordTokenizer(join_verb_parts=False)
>>> print(' '.join(tokenizer.tokenize('سلام.')))
سلام .
>>> tokenizer = WordTokenizer(join_verb_parts=False, replace_links=True)
>>> print(' '.join(tokenizer.tokenize('در قطر هک شد https://t.co/tZOurPSXzi https://t.co/vtJtwsRebP')))
در قطر هک شد LINK LINK
>>> tokenizer = WordTokenizer(join_verb_parts=False, replace_ids=True, replace_numbers=True)
>>> print(' '.join(tokenizer.tokenize('زلزله ۴.۸ ریشتری در هجدک کرمان @bourse24ir')))
زلزله NUMF ریشتری در هجدک کرمان ID
>>> tokenizer = WordTokenizer(join_verb_parts=False, separate_emoji=True)
>>> print(' '.join(tokenizer.tokenize('دیگه میخوام ترک تحصیل کنم 😂😂😂')))
دیگه میخوام ترک تحصیل کنم 😂 😂 😂
>>> tokenizer = WordTokenizer(join_abbreviations=True)
>>> tokenizer.tokenize('امام علی (ع) فرمود: برترین زهد، پنهان داشتن زهد است')
['امام', 'علی', '(ع)', 'فرمود', ':', 'برترین', 'زهد', '،', 'پنهان', 'داشتن', 'زهد', 'است']

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | The text from which tokens should be extracted. | required |

Returns:

| Type | Description |
| --- | --- |
| list[str] | A list of extracted tokens. |
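
The first tokenize example shows punctuation being split from words while a run such as '!!!' stays together. A rough way to picture that behavior is to pad punctuation with spaces and then split on whitespace. The pattern `PUNCT_RE` and the function `split_punctuation` below are assumptions made for this sketch, not Hazm's actual pattern or API.

```python
import re

# Simplified, hypothetical pattern: runs of ?, ! and ؟ stay together,
# while the other marks are split off one character at a time.
PUNCT_RE = re.compile(r'([؟!?]+|[:.،؛»\])}"«\[({])')


def split_punctuation(text: str) -> list[str]:
    """Pad punctuation with spaces, then split on whitespace."""
    spaced = PUNCT_RE.sub(r" \1 ", text)
    return spaced.split()
```

This sketch ignores everything else the tokenizer handles; for example, it would also split the decimal point in a number such as 4.8, which a fuller implementation would need to treat separately.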

word_tokenize(text)

A helper function to tokenize text into words.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | The input text. | required |

Returns:

| Type | Description |
| --- | --- |
| list[str] | A list of tokens. |