Normalizer
This module contains classes and functions for text normalization.
Normalizer
Bases: NormalizerProtocol
Provides methods for normalizing Persian text.
__init__(correct_spacing=True, remove_diacritics=True, remove_specials_chars=True, decrease_repeated_chars=True, persian_style=True, persian_numbers=True, unicodes_replacement=True, seperate_mi=True)
Constructor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| correct_spacing | bool | If True, corrects spacing in text, punctuation, prefixes, and suffixes. | True |
| remove_diacritics | bool | If True, removes diacritics from characters. | True |
| remove_specials_chars | bool | If True, removes special characters not useful for text processing. | True |
| decrease_repeated_chars | bool | If True, reduces character repetitions greater than 2 to 2. | True |
| persian_style | bool | If True, applies Persian-specific style corrections (e.g., replacing quotes with guillemets). | True |
| persian_numbers | bool | If True, replaces English numbers with Persian numbers. | True |
| unicodes_replacement | bool | If True, replaces certain Unicode characters with their normalized equivalents. | True |
| seperate_mi | bool | If True, separates the 'mi' prefix in verbs. | True |
correct_spacing(text)
Corrects spacing in text.
Examples:
>>> normalizer = Normalizer()
>>> normalizer.correct_spacing("سلام دنیا")
'سلام دنیا'
>>> normalizer.correct_spacing("به طول ۹متر و عرض۶")
'به طول ۹ متر و عرض ۶'
>>> normalizer.correct_spacing("کاروانسرا")
'کاروانسرا'
>>> normalizer.correct_spacing("سلام به همه")
'سلام به همه'
>>> normalizer.correct_spacing("سلام دنیـــا")
'سلام دنیا'
>>> normalizer.correct_spacing("جمعهها که کار نمی کنم مطالعه می کنم")
'جمعه‌ها که کار نمی‌کنم مطالعه می‌کنم'
>>> normalizer.correct_spacing(' "سلام به همه" ')
'"سلام به همه"'
>>> normalizer.correct_spacing('')
''
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to correct spacing for. | required |
Returns:
| Type | Description |
|---|---|
| str | The text with corrected spacing. |
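One rule visible in the examples above is the separation of digits from adjacent letters. The following is a minimal sketch of that single rule, not the library's implementation: `space_digits`, `DIGIT`, and `LETTER` are hypothetical names, and the letter class is an assumed approximation of Persian letters.

```python
import re

# Assumed character classes (hypothetical, for illustration only).
DIGIT = "0-9\u06F0-\u06F9"  # ASCII and Persian digits
LETTER = "\u0621-\u064A\u067E\u0686\u0698\u06A9\u06AF\u06CC"  # Arabic-script letters


def space_digits(text: str) -> str:
    """Insert a space between a digit and a neighbouring letter, then trim."""
    text = re.sub(f"([{DIGIT}])([{LETTER}])", r"\1 \2", text)
    text = re.sub(f"([{LETTER}])([{DIGIT}])", r"\1 \2", text)
    return text.strip()
```

The real method clearly does more (the examples also show prefix and suffix handling), so this only illustrates how one spacing rule can be expressed as a regular expression.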
decrease_repeated_chars(text)
Reduces runs of more than two repeated characters. As the examples show, a run may collapse to a single character rather than two.
Examples:
>>> normalizer = Normalizer()
>>> normalizer.decrease_repeated_chars('سلامممم به همه')
'سلام به همه'
>>> normalizer.decrease_repeated_chars('سلامم به همه')
'سلامم به همه'
>>> normalizer.decrease_repeated_chars('سلامم را برسان')
'سلامم را برسان'
>>> normalizer.decrease_repeated_chars('سلاممم را برسان')
'سلام را برسان'
>>> normalizer.decrease_repeated_chars('')
''
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to reduce repeated characters in. | required |
Returns:
| Type | Description |
|---|---|
| str | The text with reduced character repetitions. |
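The plain "cap each run at two" rule can be sketched with a backreference. This is an illustrative assumption, not the library's code: `cap_runs` and its `keep` parameter are hypothetical, and a fixed cap cannot reproduce the documented examples where a run collapses to one character, which suggests the library uses additional information to choose.

```python
import re

# A character followed by at least two more copies of itself.
_RUN = re.compile(r"(.)\1{2,}")


def cap_runs(text: str, keep: int = 2) -> str:
    """Replace every run of 3+ identical characters with `keep` copies."""
    return _RUN.sub(lambda m: m.group(1) * keep, text)
```

Runs of exactly two characters are left untouched, since the pattern only matches three or more.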
normalize(text)
Normalizes the text.
Examples:
>>> normalizer = Normalizer()
>>> normalizer.normalize('اِعلاممممم کَرد : « زمین لرزه ای به بُزرگیِ 6 دهم ریشتر ...»')
'اعلام کرد: «زمین‌لرزه‌ای به بزرگی ۶ دهم ریشتر …»'
>>> normalizer.normalize('')
''
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to be normalized. | required |
Returns:
| Type | Description |
|---|---|
| str | The normalized text. |
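The constructor flags suggest that `normalize` chains whichever steps are enabled. A minimal sketch of that pattern follows, assuming steps run in a fixed order. `build_pipeline`, `run_pipeline`, and the two toy steps are hypothetical stand-ins, not the library's real steps or ordering.

```python
from typing import Callable, Dict, List

Step = Callable[[str], str]

# Toy steps for illustration only (insertion order defines execution order).
REGISTRY: Dict[str, Step] = {
    "strip_tatweel": lambda t: t.replace("\u0640", ""),  # drop kashida
    "collapse_spaces": lambda t: " ".join(t.split()),    # squeeze whitespace
}


def build_pipeline(flags: Dict[str, bool]) -> List[Step]:
    """Keep the steps whose flag is True; missing flags default to True."""
    return [step for name, step in REGISTRY.items() if flags.get(name, True)]


def run_pipeline(text: str, steps: List[Step]) -> str:
    """Apply each step to the output of the previous one."""
    for step in steps:
        text = step(text)
    return text
```

Passing `{"collapse_spaces": False}` to `build_pipeline` would skip that step while leaving the other enabled, mirroring how a boolean constructor argument can switch a stage off.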
persian_number(text)
Replaces English numbers with Persian numbers.
Examples:
>>> normalizer = Normalizer()
>>> normalizer.persian_number('5 درصد')
'۵ درصد'
>>> normalizer.persian_number('۵ درصد')
'۵ درصد'
>>> normalizer.persian_number('')
''
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to replace English numbers in. | required |
Returns:
| Type | Description |
|---|---|
| str | The text with Persian numbers. |
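Digit replacement of this kind is a one-to-one character mapping, which Python's `str.translate` handles directly. The sketch below is an assumption about how such a step could be written, with `to_persian_digits` as a hypothetical name. It is not taken from the library.

```python
# Persian digits occupy U+06F0..U+06F9, in the same order as 0..9.
PERSIAN_DIGITS = "".join(chr(0x06F0 + d) for d in range(10))
_TABLE = str.maketrans("0123456789", PERSIAN_DIGITS)


def to_persian_digits(text: str) -> str:
    """Map ASCII digits to Persian digits; all other characters pass through."""
    return text.translate(_TABLE)
```

Because characters outside the table are untouched, applying the function to text that already uses Persian digits leaves it unchanged, consistent with the second example above.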
persian_style(text)
Applies Persian style corrections to the text.
Examples:
>>> normalizer = Normalizer()
>>> normalizer.persian_style('"نرمالسازی"')
'«نرمالسازی»'
>>> normalizer.persian_style('و ...')
'و …'
>>> normalizer.persian_style('10.450')
'10٫450'
>>> normalizer.persian_style('')
''
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to apply Persian style corrections to. | required |
Returns:
| Type | Description |
|---|---|
| str | The text with Persian style corrections. |
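The three corrections shown in the examples (guillemets, ellipsis, decimal separator) can each be written as a regular-expression substitution. This is a rough sketch under that assumption. `apply_style` is a hypothetical name, and the library may apply further rules or different patterns.

```python
import re


def apply_style(text: str) -> str:
    """Apply three illustrative Persian style rules."""
    # "quoted" -> «quoted»
    text = re.sub(r'"([^"\n]+)"', "\u00AB\\g<1>\u00BB", text)
    # A dot between two digits becomes the Arabic decimal separator (U+066B).
    text = re.sub(r"(\d)\.(\d)", "\\g<1>\u066B\\g<2>", text)
    # Three or more dots become a single ellipsis character (U+2026).
    text = re.sub(r"\.{3,}", "\u2026", text)
    return text
```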
remove_diacritics(text)
Removes diacritics from the text.
Examples:
>>> normalizer = Normalizer()
>>> normalizer.remove_diacritics('حَذفِ اِعراب')
'حذف اعراب'
>>> normalizer.remove_diacritics('آمدند')
'آمدند'
>>> normalizer.remove_diacritics('متن بدون اعراب')
'متن بدون اعراب'
>>> normalizer.remove_diacritics('')
''
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to remove diacritics from. | required |
Returns:
| Type | Description |
|---|---|
| str | The text without diacritics. |
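Arabic-script short-vowel marks are combining characters in a contiguous Unicode range, so stripping them can be sketched with one character class. The range U+064B to U+0652 used here is an assumption about which marks count as diacritics, and `strip_diacritics` is a hypothetical name. The library's own set may be wider.

```python
import re

# Fathatan (U+064B) through sukun (U+0652): the common harakat.
_DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Delete combining vowel marks while keeping base letters intact."""
    return _DIACRITICS.sub("", text)
```

Precomposed letters such as آ (U+0622) are single code points outside this range, which is why a word like the second example above comes back unchanged.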
remove_specials_chars(text)
Removes special characters from the text.
Examples:
>>> normalizer = Normalizer()
>>> normalizer.remove_specials_chars('پیامبر اکرم ﷺ')
'پیامبر اکرم '
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to remove special characters from. | required |
Returns:
| Type | Description |
|---|---|
| str | The text without special characters. |
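Deleting a fixed set of characters can be sketched with a translation table that maps each one to `None`. The set below is a hypothetical example, containing only the ligature from the example above plus the two ornate parentheses. The library's actual list is not given in this page.

```python
# Hypothetical set of characters to drop (not the library's list).
SPECIALS = "\uFDFA\uFD3E\uFD3F"  # ﷺ and the ornate parentheses
_DELETE = {ord(ch): None for ch in SPECIALS}


def drop_specials(text: str) -> str:
    """Remove every character listed in SPECIALS; keep everything else."""
    return text.translate(_DELETE)
```

Surrounding whitespace is not trimmed, which matches the trailing space in the documented output.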
seperate_mi(text)
Separates the 'mi' (می) prefix from verbs by inserting a zero-width non-joiner (ZWNJ, U+200C) after it.
Examples:
>>> normalizer = Normalizer()
>>> normalizer.seperate_mi('نمیدانم چه میگفت')
'نمی‌دانم چه می‌گفت'
>>> normalizer.seperate_mi('میز')
'میز'
>>> normalizer.seperate_mi('')
''
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to separate 'mi' in. | required |
Returns:
| Type | Description |
|---|---|
| str | The text with 'mi' separated. |
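The examples above show that the prefix is split only when what follows is a verb (میز, "table", is left alone). One way to sketch this is a regular expression gated by a lexicon lookup. The lexicon `KNOWN_VERB_FORMS`, the function `separate_mi`, and the use of ZWNJ as the separator are all assumptions for illustration, not the library's code.

```python
import re

ZWNJ = "\u200C"  # zero-width non-joiner
# Hypothetical mini-lexicon of verb forms that may follow the prefix.
KNOWN_VERB_FORMS = {
    "\u062F\u0627\u0646\u0645",  # دانم
    "\u06AF\u0641\u062A",        # گفت
}
# Word start, optional ن, then می, then the rest of the word.
_MI = re.compile("\\b(\u0646?\u0645\u06CC)(\\S+)")


def separate_mi(text: str) -> str:
    """Insert ZWNJ after (ن)می when the remainder is a known verb form."""
    def fix(match: re.Match) -> str:
        prefix, rest = match.group(1), match.group(2)
        if rest in KNOWN_VERB_FORMS:
            return prefix + ZWNJ + rest
        return match.group(0)  # not a known verb: leave the word as is
    return _MI.sub(fix, text)
```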
token_spacing(tokens)
Merges tokens that should be joined.
Examples:
>>> normalizer = Normalizer()
>>> normalizer.token_spacing(['کتاب', 'ها'])
['کتاب‌ها']
>>> normalizer.token_spacing(['او', 'می', 'رود'])
['او', 'می‌رود']
>>> normalizer.token_spacing(['ماه', 'می', 'سال', 'جدید'])
['ماه', 'می', 'سال', 'جدید']
>>> normalizer.token_spacing(['اخلال', 'گر'])
['اخلال‌گر']
>>> normalizer.token_spacing(['زمین', 'لرزه', 'ای'])
['زمین‌لرزه‌ای']
>>> normalizer.token_spacing([])
[]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tokens | list[str] | The tokens to process. | required |
Returns:
| Type | Description |
|---|---|
| list[str] | A list of processed tokens. |
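The suffix cases in the examples can be sketched as a single pass that glues a suffix token onto the previous token. This is a simplified illustration: `SUFFIXES` is a hypothetical three-item list, `join_suffixes` is a hypothetical name, joining with ZWNJ is an assumption, and the prefix case (می before a verb) and multi-token compounds are deliberately not handled.

```python
from typing import List

ZWNJ = "\u200C"  # zero-width non-joiner
# Hypothetical suffix list: ها, گر, ای
SUFFIXES = {"\u0647\u0627", "\u06AF\u0631", "\u0627\u06CC"}


def join_suffixes(tokens: List[str]) -> List[str]:
    """Attach each suffix token to the token before it, separated by ZWNJ."""
    merged: List[str] = []
    for token in tokens:
        if merged and token in SUFFIXES:
            merged[-1] = merged[-1] + ZWNJ + token
        else:
            merged.append(token)
    return merged
```

A suffix at the very start of the list has nothing to attach to, so it is kept as its own token.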
unicodes_replacement(text)
Replaces certain Unicode characters with normalized equivalents.
Examples:
>>> normalizer = Normalizer()
>>> normalizer.remove_specials_chars('پیامبر اکرم ﷺ')
'پیامبر اکرم '
>>> normalizer.remove_specials_chars('')
''
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to replace Unicode characters in. | required |
Returns:
| Type | Description |
|---|---|
| str | The text with normalized Unicode characters. |
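Replacing presentation-form characters with their spelled-out equivalents can be sketched as a lookup table applied with `str.replace`. The two mappings below are assumed examples chosen for illustration, and `replace_unicodes` is a hypothetical name. The library's actual replacement list is not shown in this page.

```python
# Hypothetical mapping from single-code-point forms to plain letters.
REPLACEMENTS = {
    "\uFDFC": "\u0631\u06CC\u0627\u0644",  # rial sign -> ریال
    "\uFEFB": "\u0644\u0627",              # lam-alef ligature -> لا
}


def replace_unicodes(text: str) -> str:
    """Expand each mapped character into its multi-letter equivalent."""
    for source, target in REPLACEMENTS.items():
        text = text.replace(source, target)
    return text
```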