Informal Normalizer

This module contains classes and functions for normalizing informal Persian text.

InformalNormalizer

Bases: Normalizer

This class contains functions for normalizing informal text.

Examples:

>>> normalizer = InformalNormalizer()
>>> normalizer.normalize('بابا یه شغل مناسب واسه بچه هام پیدا کردن')
[[['بابا'], ['یک'], ['شغل'], ['مناسب'], ['برای'], ['بچه'], ['هایم'], ['پیدا'], ['کردن', 'کردند']]]

__init__(verb_file=informal_verbs, word_file=informal_words, seperation_flag=False, **kargs)

Constructor.

Parameters:

Name | Type | Description | Default
verb_file | str | Path to the file containing informal verbs. | informal_verbs
word_file | str | Path to the file containing informal words. | informal_words
seperation_flag | bool | If True, adds spaces where necessary in parts of the text. | False
**kargs | str | Optional keyword arguments. | {}

informal_conjugations(verb)

Generates informal conjugations of a verb.

Examples:

>>> normalizer = InformalNormalizer()
>>> normalizer.informal_conjugations('رفت')
['رفتم', 'رفتی', 'رفته', 'رفتیم', 'رفتین', 'رفتن', ...]

Parameters:

Name | Type | Description | Default
verb | str | The verb to be conjugated. | required

Returns:

Type | Description
list[str] | A list of informal conjugations.

normalize(text)

Converts informal text to standard Persian text.

Examples:

>>> normalizer = InformalNormalizer()
>>> normalizer.normalize('بچه هام پیدا کردن که به جایی برنمیخوره !')
[[['بچه'], ['هایم'], ['پیدا'], ['کردن', 'کردند'], ['که'], ['به'], ['جایی'], ['برنمی‌خورد', 'برنمی‌خوره'], ['!']]]

Parameters:

Name | Type | Description | Default
text | str | The informal text to be normalized. | required

Returns:

Type | Description
list[list[list[str]]] | A nested list: one list per sentence, one list per token within that sentence, and one or more candidate normalized forms per token.
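The nested return value can be awkward to consume directly, so the following is a minimal sketch of two ways to flatten it. It does not call the library: `normalized` is a truncated copy of the output shown in the example above, and `first_candidates` and `all_readings` are hypothetical helper names, not part of InformalNormalizer. The sketch assumes every token carries at least one candidate.

```python
from itertools import product

# Truncated copy of the documented output: one sentence, four tokens,
# where the last token has two candidate forms.
normalized = [[['بچه'], ['هایم'], ['پیدا'], ['کردن', 'کردند']]]


def first_candidates(normalized):
    """Join the first candidate of every token into one string per sentence."""
    return [
        ' '.join(candidates[0] for candidates in sentence)
        for sentence in normalized
    ]


def all_readings(sentence):
    """Enumerate every combination of candidates for a single sentence."""
    return [' '.join(choice) for choice in product(*sentence)]


sentences = first_candidates(normalized)
readings = all_readings(normalized[0])
```

Taking the first candidate is the simplest policy. Enumerating all readings grows multiplicatively with the number of ambiguous tokens, so it is only practical for short sentences.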

normalized_word(word)

Returns the normalized forms of the word.

Examples:

>>> normalizer = InformalNormalizer()
>>> normalizer.normalized_word('می‌رم')
['می‌روم', 'می‌رم']

Parameters:

Name | Type | Description | Default
word | str | The word to be normalized. | required

Returns:

Type | Description
list[str] | A list of normalized forms of the word.
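In the example above, the returned list contains the input form itself alongside a rewritten form. Whether that always holds is not stated here, so the sketch below treats it as an assumption and simply filters out any candidate equal to the input. `formal_candidates` is a hypothetical helper, not part of InformalNormalizer, and the sample values are copied from the example rather than produced by a live call.

```python
def formal_candidates(word, candidates):
    """Return the candidates that differ from the input word, keeping their order."""
    return [c for c in candidates if c != word]


# Values copied from the example above; \u200c is the zero-width non-joiner.
word = 'می\u200cرم'
candidates = ['می\u200cروم', 'می\u200cرم']

changed = formal_candidates(word, candidates)
```

An empty result would mean that no candidate differs from the input, in which case the original word can be kept unchanged.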

split_token_words(token)

Inserts spaces where necessary in the token.

Examples:

>>> normalizer = InformalNormalizer(seperation_flag=True)
>>> normalizer.split_token_words('تورادوست‌دارم')
'تو را دوست دارم'

Parameters:

Name | Type | Description | Default
token | str | The token to be processed. | required

Returns:

Type | Description
str | The token with correct spacing.