Bijankhan Reader
This module includes classes and functions for reading the Bijankhan corpus.
The Bijankhan Corpus is a collection of Persian texts containing more than 2.6 million words, annotated with 550 distinct POS tags. Prepared at the Intelligent Signal Processing Research Center, it also labels the texts with more than 4,300 thematic tags, such as political and historical.
BijankhanReader
This class includes methods for reading the Bijankhan corpus.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `bijankhan_file` | `str` | Path to the corpus file. | required |
| `joined_verb_parts` | `bool` | If `True`, the parts of a multi-part verb are joined with an underscore into a single token. | `True` |
| `pos_map` | `str \| None` | A dictionary for converting fine-grained to coarse-grained POS tags. | `None` |
__init__(bijankhan_file, joined_verb_parts=True, pos_map=None)
Initializes the BijankhanReader with the corpus file and settings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `bijankhan_file` | `str` | Path to the corpus file. | required |
| `joined_verb_parts` | `bool` | If `True`, the parts of a multi-part verb are joined with an underscore into a single token. | `True` |
| `pos_map` | `str \| None` | A dictionary for converting fine-grained to coarse-grained POS tags. | `None` |
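To make the idea behind `pos_map` concrete, here is a minimal sketch of a fine-grained to coarse-grained conversion. It does not use the reader itself: the mapping `EXAMPLE_POS_MAP`, the fine-grained tag names in it, and the helper `coarsen` are all hypothetical and chosen only for illustration. How `BijankhanReader` applies its own map internally may differ.

```python
# Hypothetical fine-grained -> coarse-grained mapping. The tag names are
# illustrative only and are not taken from the corpus documentation.
EXAMPLE_POS_MAP = {
    "N_SING": "N",
    "N_PL": "N",
    "ADJ_SIM": "ADJ",
    "V_PA": "V",
}


def coarsen(tagged_sentence, pos_map):
    """Replace each fine-grained tag with its coarse-grained counterpart.

    Tags that are missing from pos_map are left unchanged.
    """
    return [(token, pos_map.get(tag, tag)) for token, tag in tagged_sentence]


# A made-up sentence fragment tagged with the hypothetical fine-grained tags.
fine_grained = [("سیاره", "N_SING"), ("شمسی", "ADJ_SIM"), (".", "PUNC")]
coarse = coarsen(fine_grained, EXAMPLE_POS_MAP)
print(coarse)
```

Falling back to the original tag for unmapped entries is a design choice of this sketch; a stricter variant could raise an error instead.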
sents()
Yields the sentences of the corpus one at a time, each as a list of (token, tag) tuples.
Examples:
>>> bijankhan = BijankhanReader(bijankhan_file='bijankhan.txt')
>>> next(bijankhan.sents())
[('اولین', 'ADJ'), ('سیاره', 'N'), ('خارج', 'ADJ'), ('از', 'PREP'), ('منظومه', 'N'), ('شمسی', 'ADJ'), ('دیده_شد', 'V'), ('.', 'PUNC')]
Yields:
| Type | Description |
|---|---|
| `list[tuple[str, str]]` | The next sentence as a list of (token, tag) tuples. |
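Because `sents()` yields one sentence at a time, it can be consumed lazily. The sketch below counts POS tags over a limited number of sentences. The helper `tag_counts` and the stand-in generator `sample_sents` are assumptions made for illustration. `sample_sents` replays the sentence shown in the example above, so the block runs without the corpus file. With a real reader, `bijankhan.sents()` would presumably be passed in its place.

```python
from collections import Counter
from itertools import islice


def tag_counts(sentences, limit=None):
    """Count POS tags over an iterable of [(token, tag), ...] sentences.

    `limit` caps how many sentences are consumed, which helps when the
    source is a generator over a large corpus. None means no cap.
    """
    counts = Counter()
    for sentence in islice(sentences, limit):
        counts.update(tag for _token, tag in sentence)
    return counts


def sample_sents():
    """Hypothetical stand-in for `bijankhan.sents()`; yields one sentence."""
    yield [
        ("اولین", "ADJ"), ("سیاره", "N"), ("خارج", "ADJ"), ("از", "PREP"),
        ("منظومه", "N"), ("شمسی", "ADJ"), ("دیده_شد", "V"), (".", "PUNC"),
    ]


counts = tag_counts(sample_sents(), limit=1)
print(counts.most_common(2))  # [('ADJ', 3), ('N', 2)]
```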