Normalizer

This module contains classes and functions for text normalization.

`Normalizer`

Bases: NormalizerProtocol

This class provides methods for normalizing Persian text.

`__init__(correct_spacing=True, remove_diacritics=True, remove_specials_chars=True, decrease_repeated_chars=True, persian_style=True, persian_numbers=True, unicodes_replacement=True, seperate_mi=True)`

Constructor.

Parameters:

Name	Type	Description	Default
`correct_spacing`	`bool`	If True, corrects spacing in text, punctuation, prefixes, and suffixes.	`True`
`remove_diacritics`	`bool`	If True, removes diacritics from characters.	`True`
`remove_specials_chars`	`bool`	If True, removes special characters not useful for text processing.	`True`
`decrease_repeated_chars`	`bool`	If True, shortens runs of more than two repeated characters to at most two.	`True`
`persian_style`	`bool`	If True, applies Persian-specific style corrections (e.g., replacing quotes with guillemets).	`True`
`persian_numbers`	`bool`	If True, replaces English numbers with Persian numbers.	`True`
`unicodes_replacement`	`bool`	If True, replaces certain Unicode characters with their normalized equivalents.	`True`
`seperate_mi`	`bool`	If True, separates the 'mi' prefix in verbs.	`True`

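`correct_spacing(text)`

Corrects spacing in the text: collapses repeated spaces and zero-width non-joiners, removes kashida (ـ), trims surrounding whitespace, and fixes spacing around numbers, punctuation, prefixes, and suffixes.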

Examples:

>>> normalizer = Normalizer()
>>> normalizer.correct_spacing("سلام   دنیا")
'سلام دنیا'
>>> normalizer.correct_spacing("به طول ۹متر و عرض۶")
'به طول ۹ متر و عرض ۶'
>>> normalizer.correct_spacing("کاروان‌‌سرا")
'کاروان‌سرا'
>>> normalizer.correct_spacing("‌سلام‌ به ‌همه‌")
'سلام به همه'
>>> normalizer.correct_spacing("سلام دنیـــا")
'سلام دنیا'
>>> normalizer.correct_spacing("جمعهها که کار نمی کنم مطالعه می کنم")
'جمعه‌ها که کار نمی‌کنم مطالعه می‌کنم'
>>> normalizer.correct_spacing(' "سلام به همه"   ')
'"سلام به همه"'
>>> normalizer.correct_spacing('')
''

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to correct spacing for.	required

Returns:

Type	Description
`str`	The text with corrected spacing.
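
A few of the rules visible in the examples above can be approximated with plain regular expressions. The following is a minimal sketch, not the library's implementation: `correct_spacing_sketch` is a hypothetical name, and it covers only whitespace, zero-width non-joiner, and kashida cleanup. The prefix, suffix, and number rules are left out.

```python
import re

ZWNJ = "\u200c"     # zero-width non-joiner
TATWEEL = "\u0640"  # kashida, the elongation character


def correct_spacing_sketch(text: str) -> str:
    """Hypothetical helper: whitespace, ZWNJ, and kashida cleanup only."""
    text = text.replace(TATWEEL, "")          # drop kashida
    text = re.sub(ZWNJ + "{2,}", ZWNJ, text)  # collapse repeated ZWNJ
    # Remove a ZWNJ that touches whitespace or either end of the string.
    text = re.sub(rf"(?<=\s){ZWNJ}|{ZWNJ}(?=\s)|^{ZWNJ}|{ZWNJ}$", "", text)
    text = re.sub(r" {2,}", " ", text)        # collapse repeated spaces
    return text.strip()


print(correct_spacing_sketch("سلام   دنیا"))
```

The rules run in a fixed order so that removing a kashida or a stray non-joiner cannot leave a double space behind.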

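`decrease_repeated_chars(text)`

Reduces runs of more than two repeated characters. A run is cut to two characters at most and, as the examples show, may be cut to one where that yields the intended word (`سلامممم` becomes `سلام`). A character repeated exactly twice is left unchanged.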

Examples:

>>> normalizer = Normalizer()
>>> normalizer.decrease_repeated_chars('سلامممم به همه')
'سلام به همه'
>>> normalizer.decrease_repeated_chars('سلامم به همه')
'سلامم به همه'
>>> normalizer.decrease_repeated_chars('سلامم را برسان')
'سلامم را برسان'
>>> normalizer.decrease_repeated_chars('سلاممم را برسان')
'سلام را برسان'
>>> normalizer.decrease_repeated_chars('')
''

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to reduce repeated characters in.	required

Returns:

Type	Description
`str`	The text with reduced character repetitions.
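
The examples suggest that a long run is cut to one character when the result is a recognizable word, and to two otherwise. The sketch below illustrates that idea under stated assumptions: `decrease_repeated_chars_sketch` and its `known_words` argument are hypothetical, and the real class may decide differently.

```python
import re

# A character repeated three or more times in a row.
RUN = re.compile(r"(.)\1{2,}")


def decrease_repeated_chars_sketch(text: str, known_words: set) -> str:
    """Hypothetical helper: shorten long runs, preferring known words."""

    def fix(word: str) -> str:
        if not RUN.search(word):
            return word                    # nothing repeated three times
        single = RUN.sub(r"\1", word)      # candidate with runs cut to one
        if single in known_words:
            return single
        return RUN.sub(r"\1\1", word)      # fall back to runs of two

    return " ".join(fix(word) for word in text.split(" "))


print(decrease_repeated_chars_sketch("سلامممم به همه", {"سلام"}))
```

Falling back to two characters keeps legitimately doubled letters intact when the vocabulary cannot confirm the shorter form.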

`normalize(text)`

Normalizes the text by applying the steps enabled in the constructor.

Examples:

>>> normalizer = Normalizer()
>>> normalizer.normalize('اِعلاممممم کَرد : « زمین لرزه ای به بُزرگیِ 6 دهم ریشتر ...»')
'اعلام کرد: «زمین‌لرزه‌ای به بزرگی ۶ دهم ریشتر …»'
>>> normalizer.normalize('')
''

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to be normalized.	required

Returns:

Type	Description
`str`	The normalized text.
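
One way to picture the relationship between the constructor flags and `normalize` is a list of steps assembled once and applied in order. The sketch below assumes that design; `PipelineSketch`, `collapse_spaces`, and `to_persian_digits` are hypothetical names, and only two of the eight flags are modeled.

```python
import re
from typing import Callable, List

# Persian digits U+06F0..U+06F9, in the same order as "0123456789".
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))


def collapse_spaces(text: str) -> str:
    return re.sub(r" {2,}", " ", text).strip()


def to_persian_digits(text: str) -> str:
    return text.translate(str.maketrans("0123456789", PERSIAN_DIGITS))


class PipelineSketch:
    """Hypothetical stand-in showing flags that switch steps on or off."""

    def __init__(self, correct_spacing: bool = True, persian_numbers: bool = True):
        self.steps: List[Callable[[str], str]] = []
        if correct_spacing:
            self.steps.append(collapse_spaces)
        if persian_numbers:
            self.steps.append(to_persian_digits)

    def normalize(self, text: str) -> str:
        for step in self.steps:  # apply each enabled step in order
            text = step(text)
        return text


print(PipelineSketch(persian_numbers=False).normalize("a  6"))
```

Disabling a flag simply leaves its step out of the list, so the remaining steps still run unchanged.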

`persian_number(text)`

Replaces Western (ASCII) digits with Persian digits. Digits that are already Persian are left unchanged.

Examples:

>>> normalizer = Normalizer()
>>> normalizer.persian_number('5 درصد')
'۵ درصد'
>>> normalizer.persian_number('۵ درصد')
'۵ درصد'
>>> normalizer.persian_number('')
''

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to replace English numbers in.	required

Returns:

Type	Description
`str`	The text with Persian numbers.
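
Digit replacement is a one-to-one character mapping, which Python's `str.translate` handles directly. The sketch below is an illustration under that assumption, with the hypothetical name `persian_number_sketch`; it is not taken from the library's source.

```python
# Persian digits U+06F0..U+06F9, in the same order as "0123456789".
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
TO_PERSIAN = str.maketrans("0123456789", PERSIAN_DIGITS)


def persian_number_sketch(text: str) -> str:
    """Hypothetical helper: map each ASCII digit to its Persian digit."""
    return text.translate(TO_PERSIAN)


print(persian_number_sketch("5 درصد"))
```

Because only ASCII digits appear in the table, running the function twice gives the same result as running it once.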

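`persian_style(text)`

Applies Persian typographic conventions to the text: straight double quotes become guillemets, three dots become an ellipsis character, and a decimal point between digits becomes the decimal separator `٫`.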

Examples:

>>> normalizer = Normalizer()
>>> normalizer.persian_style('"نرمال‌سازی"')
'«نرمال‌سازی»'
>>> normalizer.persian_style('و ...')
'و …'
>>> normalizer.persian_style('10.450')
'10٫450'
>>> normalizer.persian_style('')
''

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to apply Persian style corrections to.	required

Returns:

Type	Description
`str`	The text with Persian style corrections.
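
The three substitutions shown in the examples can each be written as a single regular-expression rule. The sketch below is one possible formulation, with hypothetical names (`STYLE_RULES`, `persian_style_sketch`); the patterns are assumptions chosen to reproduce the examples, not the library's own rule set.

```python
import re

STYLE_RULES = [
    # Straight double quotes around a phrase become guillemets.
    (re.compile(r'"([^"\n]+)"'), "\u00ab\\1\u00bb"),
    # A dot between two digits becomes the Arabic decimal separator U+066B.
    (re.compile(r"(\d)\.(\d)"), "\\1\u066b\\2"),
    # Three dots, with an optional leading space, become " " plus U+2026.
    (re.compile(r" ?\.\.\."), " \u2026"),
]


def persian_style_sketch(text: str) -> str:
    """Hypothetical helper: apply each style rule in order."""
    for pattern, replacement in STYLE_RULES:
        text = pattern.sub(replacement, text)
    return text


print(persian_style_sketch('"abc" 10.450 ...'))
```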

`remove_diacritics(text)`

Removes diacritics from the text.

Examples:

>>> normalizer = Normalizer()
>>> normalizer.remove_diacritics('حَذفِ اِعراب')
'حذف اعراب'
>>> normalizer.remove_diacritics('آمدند')
'آمدند'
>>> normalizer.remove_diacritics('متن بدون اعراب')
'متن بدون اعراب'
>>> normalizer.remove_diacritics('')
''

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to remove diacritics from.	required

Returns:

Type	Description
`str`	The text without diacritics.
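
The short vowel marks removed in the first example occupy a contiguous Unicode range, so a single character class is enough to illustrate the idea. This sketch assumes the range U+064B to U+0652 (fathatan through sukun); `remove_diacritics_sketch` is a hypothetical name and the library may cover a different set.

```python
import re

# Arabic-script diacritics from fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064b-\u0652]")


def remove_diacritics_sketch(text: str) -> str:
    """Hypothetical helper: delete every character in the diacritic range."""
    return DIACRITICS.sub("", text)


# "\u062d\u064e\u0630\u0641\u0650" is the first word of the example above.
print(remove_diacritics_sketch("\u062d\u064e\u0630\u0641\u0650"))
```

Precomposed letters such as `آ` (U+0622) are single code points outside this range, which is why the second example in the list above comes back unchanged.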

`remove_specials_chars(text)`

Removes special characters from the text.

Examples:

>>> normalizer = Normalizer()
>>> normalizer.remove_specials_chars('پیامبر اکرم ﷺ')
'پیامبر اکرم '

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to remove special characters from.	required

Returns:

Type	Description
`str`	The text without special characters.
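
Removal of special characters can be pictured as deleting every code point in a fixed set. The set below is an illustrative assumption (honorific signs U+0610 to U+061A, Quranic annotation marks U+06D6 to U+06ED, and the ligatures U+FDFA and U+FDFB), not the library's actual list; `remove_specials_chars_sketch` is a hypothetical name.

```python
import re

# Illustrative set only: honorific signs, Quranic annotation marks,
# and two honorific ligatures (U+FDFA is the character in the example).
SPECIALS = re.compile("[\u0610-\u061a\u06d6-\u06ed\ufdfa\ufdfb]")


def remove_specials_chars_sketch(text: str) -> str:
    """Hypothetical helper: delete every character in the set above."""
    return SPECIALS.sub("", text)


print(repr(remove_specials_chars_sketch("پیامبر اکرم \ufdfa")))
```

Only the matched character is deleted, so the space before it survives, which matches the trailing space in the documented output.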

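`seperate_mi(text)`

Separates the 'mi' (`می`) prefix, including its negated form `نمی`, from the verb that follows it by inserting a zero-width non-joiner. Words that merely begin with the same letters, such as `میز`, are left unchanged.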

Examples:

>>> normalizer = Normalizer()
>>> normalizer.seperate_mi('نمیدانم چه میگفت')
'نمی‌دانم چه می‌گفت'
>>> normalizer.seperate_mi('میز')
'میز'
>>> normalizer.seperate_mi('')
''

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to separate 'mi' in.	required

Returns:

Type	Description
`str`	The text with 'mi' separated.
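
Since `میز` is left alone while `میگفت` is split, the method has to tell verbs apart from other words. A minimal sketch of one way to do that follows, assuming a lookup set of valid verb forms; `separate_mi_sketch` and `known_verbs` are hypothetical and not part of the documented API.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner

# A word that starts with an optional negation letter plus the prefix,
# followed by at least one more word character.
MI_WORD = re.compile(r"\b(ن?می)(\w+)")


def separate_mi_sketch(text: str, known_verbs: set) -> str:
    """Hypothetical helper: split the prefix only for recognized verbs."""

    def fix(match):
        candidate = match.group(1) + ZWNJ + match.group(2)
        # Keep the original word unless the split form is a known verb.
        return candidate if candidate in known_verbs else match.group(0)

    return MI_WORD.sub(fix, text)


verbs = {"نمی" + ZWNJ + "دانم", "می" + ZWNJ + "گفت"}
print(separate_mi_sketch("نمیدانم چه میگفت", verbs))
```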

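`token_spacing(tokens)`

Joins adjacent tokens that belong to the same word, such as a word and its suffix or the prefix `می` and its verb, with a zero-width non-joiner. Tokens that do not form a single word are returned unchanged.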

Examples:

>>> normalizer = Normalizer()
>>> normalizer.token_spacing(['کتاب', 'ها'])
['کتاب‌ها']
>>> normalizer.token_spacing(['او', 'می', 'رود'])
['او', 'می‌رود']
>>> normalizer.token_spacing(['ماه', 'می', 'سال', 'جدید'])
['ماه', 'می', 'سال', 'جدید']
>>> normalizer.token_spacing(['اخلال', 'گر'])
['اخلال‌گر']
>>> normalizer.token_spacing(['زمین', 'لرزه', 'ای'])
['زمین‌لرزه‌ای']
>>> normalizer.token_spacing([])
[]

Parameters:

Name	Type	Description	Default
`tokens`	`list[str]`	The tokens to process.	required

Returns:

Type	Description
`list[str]`	A list of processed tokens.
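
The examples above are consistent with a left-to-right pass that joins a token to the previous one when the pair is a known word or the token is a known suffix. The sketch below implements that simplified rule; `token_spacing_sketch`, `known_words`, and `suffixes` are hypothetical, and the real method may apply further checks.

```python
ZWNJ = "\u200c"  # zero-width non-joiner


def token_spacing_sketch(tokens: list, known_words: set, suffixes: set) -> list:
    """Hypothetical helper: merge a token into the previous one when it fits."""
    result = []
    for token in tokens:
        if result:
            pair = result[-1] + ZWNJ + token
            if pair in known_words or token in suffixes:
                result[-1] = pair  # replace the previous token with the pair
                continue
        result.append(token)
    return result


print(token_spacing_sketch(["کتاب", "ها"], set(), {"ها"}))
```

Comparing against the last element of `result` rather than the previous input token is what lets three tokens chain into one word.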

`unicodes_replacement(text)`

Replaces certain Unicode characters with normalized equivalents.

Examples:

>>> normalizer = Normalizer()
>>> normalizer.remove_specials_chars('پیامبر اکرم ﷺ')
'پیامبر اکرم '
>>> normalizer.remove_specials_chars('')
''

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to replace Unicode characters in.	required

Returns:

Type	Description
`str`	The text with normalized Unicode characters.
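
Replacing a character with a normalized equivalent, as opposed to deleting it, can be illustrated with a small table of pattern and replacement pairs. The entries below are assumptions chosen for illustration (the rial sign, the ligature for the word Allah, and the lam-alef presentation forms); they are not the library's table, and `unicodes_replacement_sketch` is a hypothetical name.

```python
import re

# Illustrative mapping only: each pattern expands to plain letters.
REPLACEMENTS = [
    ("\ufdfc", "ریال"),                 # RIAL SIGN
    ("\ufdf2", "الله"),                 # ligature for the word Allah
    ("[\ufefb\ufefc]", "\u0644\u0627"),  # lam-alef, isolated and final forms
]


def unicodes_replacement_sketch(text: str) -> str:
    """Hypothetical helper: expand each listed character to plain letters."""
    for pattern, replacement in REPLACEMENTS:
        text = re.sub(pattern, replacement, text)
    return text


print(unicodes_replacement_sketch("100 \ufdfc"))
```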
