Normalizer

This module contains classes and functions for text normalization.

Normalizer

Bases: NormalizerProtocol

Provides methods for normalizing Persian text.

__init__(correct_spacing=True, remove_diacritics=True, remove_specials_chars=True, decrease_repeated_chars=True, persian_style=True, persian_numbers=True, unicodes_replacement=True, seperate_mi=True)

Constructor.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| correct_spacing | bool | If True, corrects spacing within the text, around punctuation, and between words and their prefixes or suffixes. | True |
| remove_diacritics | bool | If True, removes diacritics from characters. | True |
| remove_specials_chars | bool | If True, removes special characters that are not useful for text processing. | True |
| decrease_repeated_chars | bool | If True, shortens runs of three or more identical characters (see decrease_repeated_chars). | True |
| persian_style | bool | If True, applies Persian-specific style corrections (e.g., replacing quotes with guillemets). | True |
| persian_numbers | bool | If True, replaces Western digits with Persian digits (e.g., 5 becomes ۵). | True |
| unicodes_replacement | bool | If True, replaces certain Unicode characters with their normalized equivalents. | True |
| seperate_mi | bool | If True, separates the 'mi' prefix (می) from verbs. | True |

correct_spacing(text)

Corrects spacing in text.

Examples:

>>> normalizer = Normalizer()
>>> normalizer.correct_spacing("سلام   دنیا")
'سلام دنیا'
>>> normalizer.correct_spacing("به طول ۹متر و عرض۶")
'به طول ۹ متر و عرض ۶'
>>> normalizer.correct_spacing("کاروان‌‌سرا")
'کاروان‌سرا'
>>> normalizer.correct_spacing("‌سلام‌ به ‌همه‌")
'سلام به همه'
>>> normalizer.correct_spacing("سلام دنیـــا")
'سلام دنیا'
>>> normalizer.correct_spacing("جمعهها که کار نمی کنم مطالعه می کنم")
'جمعه‌ها که کار نمی‌کنم مطالعه می‌کنم'
>>> normalizer.correct_spacing(' "سلام به همه"   ')
'"سلام به همه"'
>>> normalizer.correct_spacing('')
''

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to correct spacing for. | required |

Returns:

| Type | Description |
|---|---|
| str | The text with corrected spacing. |
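Several of the clean-ups visible in the examples (collapsing spaces, collapsing repeated zero-width non-joiners, dropping kashida, trimming the ends) can be approximated with a few regular expressions. The following is a minimal sketch under that assumption; `tidy_spacing` is a hypothetical helper, not the library's implementation, and it does not attempt the prefix, suffix, or digit spacing rules.

```python
import re

ZWNJ = "\u200c"     # zero-width non-joiner
TATWEEL = "\u0640"  # kashida, the elongation character


def tidy_spacing(text: str) -> str:
    """Hypothetical helper: a few whitespace clean-ups, not the library's rules."""
    text = text.replace(TATWEEL, "")                      # drop elongation marks
    text = re.sub(f"{ZWNJ}{{2,}}", ZWNJ, text)            # collapse repeated ZWNJ
    text = re.sub(f"{ZWNJ}(?= )|(?<= ){ZWNJ}", "", text)  # ZWNJ next to a space is redundant
    text = re.sub(r" {2,}", " ", text)                    # collapse runs of spaces
    return text.strip(" " + ZWNJ)                         # trim both ends
```

The order matters: ZWNJ characters next to spaces are removed before spaces are collapsed, so the removal cannot leave a double space behind.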

decrease_repeated_chars(text)

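Reduces runs of more than two identical characters. In the examples below, a run of three or more collapses to a single character, while a doubled character is left unchanged.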

Examples:

>>> normalizer = Normalizer()
>>> normalizer.decrease_repeated_chars('سلامممم به همه')
'سلام به همه'
>>> normalizer.decrease_repeated_chars('سلامم به همه')
'سلامم به همه'
>>> normalizer.decrease_repeated_chars('سلامم را برسان')
'سلامم را برسان'
>>> normalizer.decrease_repeated_chars('سلاممم را برسان')
'سلام را برسان'
>>> normalizer.decrease_repeated_chars('')
''

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to reduce repeated characters in. | required |

Returns:

| Type | Description |
|---|---|
| str | The text with reduced character repetitions. |
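The behaviour in the examples can be approximated with a back-reference pattern that matches a character repeated three or more times. This is only a sketch: `collapse_runs` and its `keep` parameter are hypothetical, and the real method may consult a word list to decide how many copies to keep.

```python
import re

# A word character followed by at least two more copies of itself.
RUN = re.compile(r"(\w)\1{2,}")


def collapse_runs(text: str, keep: int = 1) -> str:
    """Hypothetical helper: shrink runs of 3+ identical characters to `keep` copies."""
    return RUN.sub(lambda m: m.group(1) * keep, text)
```

A doubled character never matches the pattern, which is why words with a legitimate double letter pass through unchanged.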

normalize(text)

Normalizes the text.

Examples:

>>> normalizer = Normalizer()
>>> normalizer.normalize('اِعلاممممم کَرد : « زمین لرزه ای به بُزرگیِ 6 دهم ریشتر ...»')
'اعلام کرد: «زمین‌لرزه‌ای به بزرگی ۶ دهم ریشتر …»'
>>> normalizer.normalize('')
''

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to be normalized. | required |

Returns:

| Type | Description |
|---|---|
| str | The normalized text. |
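One plausible way to read `normalize` is as a pipeline: each constructor flag enables one step, and the enabled steps run in sequence. The sketch below illustrates that pattern with stand-in steps. `build_normalizer` and the step names are hypothetical, and the order in which the real class applies its steps is not stated here.

```python
from typing import Callable, Dict

Step = Callable[[str], str]


def build_normalizer(steps: Dict[str, Step], **flags: bool) -> Step:
    """Return a function that applies every step whose flag is True (the default)."""
    enabled = [fn for name, fn in steps.items() if flags.get(name, True)]

    def normalize(text: str) -> str:
        for fn in enabled:  # steps run in dictionary insertion order
            text = fn(text)
        return text

    return normalize


# Stand-in steps for illustration only.
steps = {"strip": str.strip, "lower": str.lower}
normalize_all = build_normalizer(steps)
normalize_keep_case = build_normalizer(steps, lower=False)
```

Here `normalize_all("  ABC ")` strips and lowercases the text, while `normalize_keep_case` skips the lowercasing step, mirroring how a flag set to False would disable one normalization stage.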

persian_number(text)

Replaces Western digits (0-9) with Persian digits.

Examples:

>>> normalizer = Normalizer()
>>> normalizer.persian_number('5 درصد')
'۵ درصد'
>>> normalizer.persian_number('۵ درصد')
'۵ درصد'
>>> normalizer.persian_number('')
''

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text in which to replace Western digits. | required |

Returns:

| Type | Description |
|---|---|
| str | The text with Persian digits. |
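Digit replacement of this kind is a one-to-one character mapping, so it can be sketched with `str.translate`. The helper name `to_persian_digits` is hypothetical; the sketch relies only on the fact that Persian digits occupy U+06F0 to U+06F9 in the same order as 0 to 9.

```python
# Persian digits occupy U+06F0..U+06F9, in the same order as 0..9.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
TO_PERSIAN = str.maketrans("0123456789", PERSIAN_DIGITS)


def to_persian_digits(text: str) -> str:
    """Hypothetical helper: map ASCII digits to Persian digits, leave the rest alone."""
    return text.translate(TO_PERSIAN)
```

Text that already uses Persian digits is unaffected, which matches the second example above.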

persian_style(text)

Applies Persian style corrections to the text.

Examples:

>>> normalizer = Normalizer()
>>> normalizer.persian_style('"نرمال‌سازی"')
'«نرمال‌سازی»'
>>> normalizer.persian_style('و ...')
'و …'
>>> normalizer.persian_style('10.450')
'10٫450'
>>> normalizer.persian_style('')
''

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to apply Persian style corrections to. | required |

Returns:

| Type | Description |
|---|---|
| str | The text with Persian style corrections. |
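The three corrections shown in the examples (straight quotes to guillemets, a dot between digits to the Arabic decimal separator U+066B, and three or more dots to an ellipsis) can be sketched as regular-expression substitutions. `persian_style_sketch` is a hypothetical name, and the patterns are assumptions inferred from the examples rather than the library's own rules.

```python
import re


def persian_style_sketch(text: str) -> str:
    """Hypothetical helper reproducing the three corrections shown in the examples."""
    # "quoted" -> «quoted» (U+00AB and U+00BB)
    text = re.sub(r'"([^"\n]+)"', "\u00ab\\1\u00bb", text)
    # A dot between two digits becomes the Arabic decimal separator (U+066B).
    text = re.sub(r"(?<=\d)\.(?=\d)", "\u066b", text)
    # Three or more dots become a single ellipsis character (U+2026).
    text = re.sub(r"\.{3,}", "\u2026", text)
    return text
```

The decimal rule uses lookarounds so that the digits themselves are not consumed and consecutive separators such as 1.2.3 are all handled.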

remove_diacritics(text)

Removes diacritics from the text.

Examples:

>>> normalizer = Normalizer()
>>> normalizer.remove_diacritics('حَذفِ اِعراب')
'حذف اعراب'
>>> normalizer.remove_diacritics('آمدند')
'آمدند'
>>> normalizer.remove_diacritics('متن بدون اعراب')
'متن بدون اعراب'
>>> normalizer.remove_diacritics('')
''

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to remove diacritics from. | required |

Returns:

| Type | Description |
|---|---|
| str | The text without diacritics. |
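Arabic-script short-vowel marks are separate combining code points, so removing them amounts to deleting a small character range. The sketch below assumes the range U+064B to U+0652 (fathatan through sukun); the exact set the library removes may differ, and `strip_diacritics` is a hypothetical name.

```python
import re

# Combining marks U+064B..U+0652: tanwin forms, fatha, damma, kasra, shadda, sukun.
DIACRITICS = re.compile("[\u064b-\u0652]")


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: delete short-vowel and related combining marks."""
    return DIACRITICS.sub("", text)
```

Precomposed letters such as alef with madda (U+0622) are single code points outside this range, so they are left intact, as in the second example above.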

remove_specials_chars(text)

Removes special characters from the text.

Examples:

>>> normalizer = Normalizer()
>>> normalizer.remove_specials_chars('پیامبر اکرم ﷺ')
'پیامبر اکرم '

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to remove special characters from. | required |

Returns:

| Type | Description |
|---|---|
| str | The text without special characters. |
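Removal of this kind can be sketched as deleting every character in a fixed class. The class below is an assumption for illustration: it contains the ligature from the example (U+FDFA) plus the honorific and Quranic annotation signs U+0610 to U+061A. The library's actual list is not given here, and `strip_specials` is a hypothetical name.

```python
import re

# Assumed character class: ligature U+FDFA and the signs U+0610..U+061A.
SPECIALS = re.compile("[\ufdfa\u0610-\u061a]")


def strip_specials(text: str) -> str:
    """Hypothetical helper: delete the characters in SPECIALS, nothing else."""
    return SPECIALS.sub("", text)
```

Only the matched character is deleted, so a space before it survives, which is consistent with the trailing space in the example output.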

seperate_mi(text)

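Separates the 'mi' prefix (می, including the negative form نمی) from the verb that follows it. Words that merely begin with the same letters, such as میز, are left unchanged.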

Examples:

>>> normalizer = Normalizer()
>>> normalizer.seperate_mi('نمیدانم چه میگفت')
'نمی‌دانم چه می‌گفت'
>>> normalizer.seperate_mi('میز')
'میز'
>>> normalizer.seperate_mi('')
''

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to separate 'mi' in. | required |

Returns:

| Type | Description |
|---|---|
| str | The text with 'mi' separated. |
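Because some ordinary words start with the same letters as the prefix, a pattern alone cannot decide when to split; some form of verb lookup is needed. The sketch below assumes a lexicon check and inserts a zero-width non-joiner only when the remainder of the word is a known verb form. `separate_mi_sketch` and the two-entry `VERBS` set are hypothetical and stand in for whatever lexicon the library uses.

```python
import re

ZWNJ = "\u200c"       # zero-width non-joiner
MI = "\u0645\u06cc"   # می
NA = "\u0646"         # ن, the negative marker in نمی

# Hypothetical mini-lexicon of verb forms; a real system would need a full one.
VERBS = {
    "\u062f\u0627\u0646\u0645",  # دانم
    "\u06af\u0641\u062a",        # گفت
}

# Optional negative marker + prefix at a word start, then the rest of the word.
PATTERN = re.compile(rf"\b({NA}?{MI})(\w+)")


def separate_mi_sketch(text: str, verbs=VERBS) -> str:
    """Insert a ZWNJ after the prefix only when the remainder is a known verb form."""
    def fix(match):
        prefix, rest = match.groups()
        return prefix + ZWNJ + rest if rest in verbs else match.group(0)

    return PATTERN.sub(fix, text)
```

With this lexicon, the remainder of میز is not a verb form, so the word is returned unchanged.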

token_spacing(tokens)

Merges adjacent tokens that belong to a single word, such as a stem followed by a suffix or the 'mi' prefix followed by a verb.

Examples:

>>> normalizer = Normalizer()
>>> normalizer.token_spacing(['کتاب', 'ها'])
['کتاب‌ها']
>>> normalizer.token_spacing(['او', 'می', 'رود'])
['او', 'می‌رود']
>>> normalizer.token_spacing(['ماه', 'می', 'سال', 'جدید'])
['ماه', 'می', 'سال', 'جدید']
>>> normalizer.token_spacing(['اخلال', 'گر'])
['اخلال‌گر']
>>> normalizer.token_spacing(['زمین', 'لرزه', 'ای'])
['زمین‌لرزه‌ای']
>>> normalizer.token_spacing([])
[]

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tokens | list[str] | The tokens to process. | required |

Returns:

| Type | Description |
|---|---|
| list[str] | A list of processed tokens. |
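The suffix case can be sketched as a single pass that attaches a token to its predecessor when the token is in a suffix list. This sketch covers only that case. `join_suffix_tokens` and the two-entry `SUFFIXES` set are hypothetical, and merges that depend on a verb or word lexicon (such as the 'mi' and compound examples above) are out of scope.

```python
ZWNJ = "\u200c"  # zero-width non-joiner

# Hypothetical short suffix list: ها and ای.
SUFFIXES = {"\u0647\u0627", "\u0627\u06cc"}


def join_suffix_tokens(tokens, suffixes=SUFFIXES):
    """Attach each suffix token to the token before it, joined by a ZWNJ."""
    result = []
    for token in tokens:
        if result and token in suffixes:
            result[-1] = result[-1] + ZWNJ + token
        else:
            result.append(token)
    return result
```

A suffix at the very start of the list has nothing to attach to and is kept as its own token.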

unicodes_replacement(text)

Replaces certain Unicode characters with normalized equivalents.

Examples:

>>> normalizer = Normalizer()
>>> normalizer.remove_specials_chars('پیامبر اکرم ﷺ')
'پیامبر اکرم '
>>> normalizer.remove_specials_chars('')
''

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to replace Unicode characters in. | required |

Returns:

| Type | Description |
|---|---|
| str | The text with normalized Unicode characters. |
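Replacement of look-alike or ligature code points can be sketched with a translation table whose values may be longer than one character. The three mappings below (Arabic kaf and yeh to their Persian forms, and the ligature U+FDF2 to its spelled-out letters) are assumptions chosen for illustration, not the library's documented table, and `replace_unicodes` is a hypothetical name.

```python
# Assumed replacement table for illustration only.
REPLACEMENTS = str.maketrans({
    "\u0643": "\u06a9",                    # Arabic kaf -> Persian kaf
    "\u064a": "\u06cc",                    # Arabic yeh -> Persian yeh
    "\ufdf2": "\u0627\u0644\u0644\u0647",  # ligature ﷲ -> الله
})


def replace_unicodes(text: str) -> str:
    """Hypothetical helper: apply the replacement table character by character."""
    return text.translate(REPLACEMENTS)
```

Characters that are not keys in the table pass through unchanged, so the function is safe to apply to text that is already normalized.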