Token Splitter

This module includes classes and functions for splitting a token into two smaller tokens.

TokenSplitter

This class includes methods for splitting a token into two smaller tokens.

__init__()

Initializes the TokenSplitter and loads the necessary lemmatizer data.

split_token_words(token)

Splits the input token into two smaller tokens.

If the token can be split in more than one way, every candidate is returned. For example, 'داستان‌سرا' is valid both as the two words 'داستان' and 'سرا' and as the single word 'داستان‌سرا', so the result contains both: [('داستان', 'سرا'), ('داستان‌سرا',)].

Examples:

>>> splitter = TokenSplitter()
>>> splitter.split_token_words('صداوسیماجمهوری')
[('صداوسیما', 'جمهوری')]
>>> splitter.split_token_words('صداو')
[('صد', 'او'), ('صدا', 'و')]
>>> splitter.split_token_words('داستان‌سرا')
[('داستان', 'سرا'), ('داستان‌سرا',)]
>>> splitter.split_token_words('دستان‌سرا')
[('دستان', 'سرا')]
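
The behaviour shown in the examples above can be approximated with a small dictionary lookup. The following is only a minimal sketch under stated assumptions, not the class's actual implementation: the function name `split_candidates`, the `vocabulary` argument, and the toy word set are all hypothetical, and the real class relies on lemmatizer data rather than a plain set of words.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, used inside Persian compound words


def split_candidates(token, vocabulary):
    """Return candidate splits of `token` whose parts are all in `vocabulary`.

    Hypothetical helper for illustration only; `vocabulary` is assumed to be
    a set of known words.
    """
    candidates = []

    # A ZWNJ inside the token marks a natural boundary, so try it first.
    if ZWNJ in token:
        parts = tuple(token.split(ZWNJ))
        if all(part in vocabulary for part in parts):
            candidates.append(parts)

    # Try every two-way cut that does not fall next to a ZWNJ.
    for i in range(1, len(token)):
        if token[i - 1] == ZWNJ or token[i] == ZWNJ:
            continue
        left, right = token[:i], token[i:]
        if left in vocabulary and right in vocabulary:
            candidates.append((left, right))

    # The unsplit token is also a candidate when it is a known word.
    if token in vocabulary:
        candidates.append((token,))

    return candidates


# Toy vocabulary (an assumption for this sketch, not the library's data).
toy_vocabulary = {
    "صد", "او", "صدا", "و",
    "داستان", "سرا", "داستان" + ZWNJ + "سرا",
}

print(split_candidates("صداو", toy_vocabulary))
print(split_candidates("داستان" + ZWNJ + "سرا", toy_vocabulary))
```

With this toy vocabulary, the sketch produces the same candidates as the documented examples for 'صداو' and 'داستان‌سرا'. Results on other tokens depend entirely on the contents of the vocabulary.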

Parameters:

Name     Type    Description                   Default
token    str     The token to be processed.    required

Returns:

Type                     Description
list[tuple[str, str]]    A list of tuples, one per candidate split. Each tuple holds the parts of the token, or the whole token as a single element when it is also valid unsplit.
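
Because the returned tuples can have one or two elements, as in the 'داستان‌سرا' example, calling code may want to tell the two cases apart. The helper below is a hedged sketch of one way to do that; `partition_candidates` is a hypothetical name and is not part of the module.

```python
def partition_candidates(candidates):
    """Separate real splits from the 'token kept whole' candidate.

    Hypothetical helper: `candidates` is assumed to be a list of tuples in
    the shape returned by split_token_words.
    """
    # Tuples with more than one element are genuine splits.
    splits = [parts for parts in candidates if len(parts) > 1]
    # A single-element tuple means the token is also valid as one word.
    unsplit = [parts[0] for parts in candidates if len(parts) == 1]
    return splits, unsplit


# Sample input copied from the documented example output.
sample = [("داستان", "سرا"), ("داستان\u200cسرا",)]
splits, unsplit = partition_candidates(sample)
print(splits)
print(unsplit)
```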