Word Tokenizer
This module includes classes and functions for extracting words from text.
WordTokenizer
Bases: TokenizerI, TokenizerProtocol
This class includes methods for extracting words from text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
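| words_file | str \| Path | Path to the file containing the word list. Hazm ships a default file, but you can supply your own; refer to the default file for the expected structure. | default_words |
| verbs_file | str \| Path | Path to the file containing the verbs. Hazm ships a default file, but you can supply your own; refer to the default file for the expected structure. | default_verbs |
| join_verb_parts | bool | If True, joins the parts of multi-part verbs with an underscore. | True |
| join_abbreviations | bool | If True, keeps abbreviations such as (ع) as single tokens. | False |
| separate_emoji | bool | If True, splits consecutive emoji into separate tokens. | False |
| replace_links | bool | If True, replaces links with the token LINK. | False |
| replace_ids | bool | If True, replaces IDs such as @bourse24ir with the token ID. | False |
| replace_emails | bool | If True, replaces email addresses with a placeholder token. | False |
| replace_numbers | bool | If True, replaces numbers with a placeholder token (for example, NUMF for a decimal number). | False |
| replace_hashtags | bool | If True, replaces hashtags with a placeholder token. | False |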
__init__(words_file=default_words, verbs_file=default_verbs, join_verb_parts=True, join_abbreviations=False, separate_emoji=False, replace_links=False, replace_ids=False, replace_emails=False, replace_numbers=False, replace_hashtags=False)
Initializes the WordTokenizer with the specified configurations.
join_verb_parts(tokens)
Joins multi-part verbs with an underscore.
Examples:
>>> tokenizer = WordTokenizer()
>>> tokenizer.join_verb_parts(['خواهد', 'رفت'])
['خواهد_رفت']
>>> tokenizer.join_verb_parts(['رفته', 'است'])
['رفته_است']
>>> tokenizer.join_verb_parts(['گفته', 'شده', 'است'])
['گفته_شده_است']
>>> tokenizer.join_verb_parts(['گفته', 'خواهد', 'شد'])
['گفته_خواهد_شد']
>>> tokenizer.join_verb_parts(['خسته', 'شدید'])
['خسته_شدید']
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tokens | list[str] | A list of word components of a multi-part verb. | required |
Returns:
| Type | Description |
|---|---|
| list[str] | A list where parts of multi-part verbs are joined by underscores if necessary. |
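To make the joining behaviour concrete, here is a deliberately naive sketch of the idea. It is not Hazm's implementation: the function name `join_parts_sketch` and the two hand-picked word sets are assumptions made for illustration, whereas the real tokenizer derives its verb data from `verbs_file`.

```python
def join_parts_sketch(tokens, attach_to_next, attach_to_prev):
    """Join tokens with '_' using two hypothetical word sets.

    attach_to_next: tokens that glue onto the token that follows them.
    attach_to_prev: tokens that glue onto the token that precedes them.
    """
    result = []
    glue_next = False  # True when the previous token asked to be joined forward
    for token in tokens:
        if result and (glue_next or token in attach_to_prev):
            result[-1] = result[-1] + "_" + token
        else:
            result.append(token)
        glue_next = token in attach_to_next
    return result


# Toy sets covering only the words used in this example (assumed, not Hazm data).
ATTACH_TO_NEXT = {"خواهد"}
ATTACH_TO_PREV = {"است"}

print(join_parts_sketch(["او", "خواهد", "رفت"], ATTACH_TO_NEXT, ATTACH_TO_PREV))
print(join_parts_sketch(["رفته", "است"], ATTACH_TO_NEXT, ATTACH_TO_PREV))
```

This toy rule joins an auxiliary to whatever token precedes it, so it would over-join in real text. A practical version also has to check that the neighbouring token is actually a verb form.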
tokenize(text)
Extracts tokens from the given text.
Examples:
>>> tokenizer = WordTokenizer()
>>> tokenizer.tokenize('این جمله (خیلی) پیچیده نیست!!!')
['این', 'جمله', '(', 'خیلی', ')', 'پیچیده', 'نیست', '!!!']
>>> tokenizer = WordTokenizer(join_verb_parts=False)
>>> print(' '.join(tokenizer.tokenize('سلام.')))
سلام .
>>> tokenizer = WordTokenizer(join_verb_parts=False, replace_links=True)
>>> print(' '.join(tokenizer.tokenize('در قطر هک شد https://t.co/tZOurPSXzi https://t.co/vtJtwsRebP')))
در قطر هک شد LINK LINK
>>> tokenizer = WordTokenizer(join_verb_parts=False, replace_ids=True, replace_numbers=True)
>>> print(' '.join(tokenizer.tokenize('زلزله ۴.۸ ریشتری در هجدک کرمان @bourse24ir')))
زلزله NUMF ریشتری در هجدک کرمان ID
>>> tokenizer = WordTokenizer(join_verb_parts=False, separate_emoji=True)
>>> print(' '.join(tokenizer.tokenize('دیگه میخوام ترک تحصیل کنم 😂😂😂')))
دیگه میخوام ترک تحصیل کنم 😂 😂 😂
>>> tokenizer = WordTokenizer(join_abbreviations=True)
>>> tokenizer.tokenize('امام علی (ع) فرمود: برترین زهد، پنهان داشتن زهد است')
['امام', 'علی', '(ع)', 'فرمود', ':', 'برترین', 'زهد', '،', 'پنهان', 'داشتن', 'زهد', 'است']
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text from which tokens should be extracted. | required |
Returns:
| Type | Description |
|---|---|
| list[str] | A list of extracted tokens. |
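The options shown in the examples above can be approximated with a few regular expressions. The sketch below is only an illustration of the general technique, not Hazm's code: the function `sketch_tokenize` and all four patterns are assumptions, and the patterns are far less thorough than a production tokenizer's would need to be.

```python
import re

# Hypothetical patterns for illustration only; Hazm's own patterns may differ.
LINK_PATTERN = re.compile(r"https?://\S+")
ID_PATTERN = re.compile(r"(?<!\S)@\w+")  # '@name' at the start of a token
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
PUNCT_PATTERN = re.compile(r"([!?؟]+|[()،:.])")


def sketch_tokenize(text, replace_links=False, replace_ids=False, separate_emoji=False):
    """Split text into tokens, optionally substituting placeholder tokens."""
    if replace_links:
        text = LINK_PATTERN.sub(" LINK ", text)
    if replace_ids:
        text = ID_PATTERN.sub(" ID ", text)
    if separate_emoji:
        # Surround each emoji with spaces so that runs of emoji are split apart.
        text = EMOJI_PATTERN.sub(lambda m: f" {m.group(0)} ", text)

    tokens = []
    for chunk in text.split():
        # Pad punctuation with spaces, keeping runs such as '!!!' together.
        tokens.extend(PUNCT_PATTERN.sub(r" \1 ", chunk).split())
    return tokens


print(sketch_tokenize("این جمله (خیلی) پیچیده نیست!!!"))
print(sketch_tokenize("هک شد https://t.co/tZOurPSXzi", replace_links=True))
```

The sketch does not handle numbers, emails, hashtags, or abbreviations. For instance, it would split a decimal number at its decimal point, which is one reason a real tokenizer has to apply its substitutions before splitting on punctuation.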
word_tokenize(text)
A helper function to tokenize text into words.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The input text. | required |
Returns:
| Type | Description |
|---|---|
| list[str] | A list of tokens. |