Token Splitter
This module includes classes and functions for splitting a token into two smaller tokens.
TokenSplitter
This class includes methods for splitting a token into two smaller tokens.
__init__()
Initializes the TokenSplitter and loads the necessary lemmatizer data.
split_token_words(token)
Splits the input token into two smaller tokens.
If the token can be split in more than one way, all possible readings are returned;
for example, 'داستانسرا' can be read either as the two tokens 'داستان' and 'سرا' or as
the single token 'داستانسرا', so both are returned: [('داستان', 'سرا'), ('داستانسرا',)].
Examples:
>>> splitter = TokenSplitter()
>>> splitter.split_token_words('صداوسیماجمهوری')
[('صداوسیما', 'جمهوری')]
>>> splitter.split_token_words('صداو')
[('صد', 'او'), ('صدا', 'و')]
>>> splitter.split_token_words('داستانسرا')
[('داستان', 'سرا'), ('داستانسرا',)]
>>> splitter.split_token_words('دستانسرا')
[('دستان', 'سرا')]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `token` | `str` | The token to be processed. | required |
Returns:
| Type | Description |
|---|---|
| `list[tuple[str, str]]` | A list of tuples, each containing the parts of one possible split. A tuple has a single element when the token is kept whole, as in ('داستانسرا',). |
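
The behaviour shown in the examples can be approximated with a simple vocabulary lookup. The following is a minimal sketch, not the library's implementation: `split_candidates` and `toy_vocabulary` are hypothetical names introduced here for illustration, and the real class relies on the lemmatizer data loaded in `__init__()` rather than a hand-written word set.

```python
def split_candidates(token, vocabulary):
    """Return every way to read `token` as one or two known words.

    Hypothetical helper for illustration only; it is not part of TokenSplitter.
    """
    candidates = []
    # Try every cut position that leaves both halves non-empty.
    for cut in range(1, len(token)):
        left, right = token[:cut], token[cut:]
        if left in vocabulary and right in vocabulary:
            candidates.append((left, right))
    # The unsplit token is also a valid reading when it is itself a known word.
    if token in vocabulary:
        candidates.append((token,))
    return candidates


# Toy vocabulary (an assumption), chosen only to mirror the examples above.
toy_vocabulary = {"صد", "او", "صدا", "و", "داستان", "سرا", "داستانسرا"}

print(split_candidates("صداو", toy_vocabulary))
print(split_candidates("داستانسرا", toy_vocabulary))
```

With this toy vocabulary, two-part splits are listed in order of cut position and the whole token, when it is a known word, comes last, which matches the ordering in the examples above.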