Bijankhan Reader

This module includes classes and functions for reading the Bijankhan corpus.

The Bijankhan corpus is a collection of Persian texts containing more than 2.6 million words, annotated with 550 distinct POS tags. The corpus was prepared at the Intelligent Signal Processing Research Center, and its texts are also labeled with more than 4,300 topic tags, such as political and historical.

BijankhanReader

This class includes methods for reading the Bijankhan corpus.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| bijankhan_file | str | Path to the corpus file. | required |
| joined_verb_parts | bool | If True, joins multi-part verbs with an underscore. | True |
| pos_map | str \| None | A dictionary for converting fine-grained to coarse-grained POS tags. | None |
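To make the role of pos_map more concrete, here is a minimal sketch of converting fine-grained tags to coarse-grained ones with a dictionary. The helper coarsen_tags, the tag names in example_pos_map, and the choice to leave unknown tags unchanged are all assumptions made for illustration. They are not taken from BijankhanReader itself.

```python
def coarsen_tags(sentence, pos_map):
    """Replace each fine-grained tag with its coarse-grained counterpart.

    Tags missing from pos_map are left unchanged. That fallback is an
    assumption of this sketch, not necessarily what BijankhanReader does.
    """
    return [(token, pos_map.get(tag, tag)) for token, tag in sentence]


# Hypothetical fine-grained tags, chosen only for illustration.
example_pos_map = {"N_SING": "N", "N_PL": "N", "ADJ_SIM": "ADJ", "V_PA": "V"}

tagged = [("سیاره", "N_SING"), ("شمسی", "ADJ_SIM"), (".", "PUNC")]

print(coarsen_tags(tagged, example_pos_map))
```

A plain dictionary lookup keeps the mapping easy to inspect and to replace with a different tag set.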

__init__(bijankhan_file, joined_verb_parts=True, pos_map=None)

Initializes the BijankhanReader with the corpus file and settings.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| bijankhan_file | str | Path to the corpus file. | required |
| joined_verb_parts | bool | If True, joins multi-part verbs with an underscore. | True |
| pos_map | str \| None | A dictionary for converting fine-grained to coarse-grained POS tags. | None |
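The effect of joined_verb_parts on a single token can be sketched as follows. The helpers join_verb_parts and split_verb_parts are hypothetical names introduced here. The sketch only shows the underscore convention, not how the reader decides which tokens belong to one verb.

```python
def join_verb_parts(parts):
    """Join the parts of one multi-part verb into a single token."""
    return "_".join(parts)


def split_verb_parts(token):
    """Recover the individual parts from an underscore-joined verb token."""
    return token.split("_")


joined = join_verb_parts(["دیده", "شد"])
print(joined)                    # دیده_شد
print(split_verb_parts(joined))  # ['دیده', 'شد']
```

With joining enabled, a multi-part verb occupies one (token, tag) pair, so code that needs the separate parts can split on the underscore.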

sents()

Yields the corpus sentences one at a time, each as a list of (token, tag) tuples.

Examples:

>>> bijankhan = BijankhanReader(bijankhan_file='bijankhan.txt')
>>> next(bijankhan.sents())
[('اولین', 'ADJ'), ('سیاره', 'N'), ('خارج', 'ADJ'), ('از', 'PREP'), ('منظومه', 'N'), ('شمسی', 'ADJ'), ('دیده_شد', 'V'), ('.', 'PUNC')]

Yields:

| Type | Description |
| --- | --- |
| list[tuple[str, str]] | The next sentence as a list of (token, tag) tuples. |
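Because sents() yields sentences lazily, a large corpus can be processed without loading it fully into memory. The sketch below counts tags over the first few sentences. It uses a stand-in generator, stand_in_sents, in place of a real reader so that it runs without the corpus file. The second stand-in sentence and the helper tag_counts are hypothetical.

```python
from collections import Counter
from itertools import islice


def stand_in_sents():
    """Stand-in for BijankhanReader.sents(): yields lists of (token, tag)."""
    yield [("اولین", "ADJ"), ("سیاره", "N"), ("خارج", "ADJ"), ("از", "PREP"),
           ("منظومه", "N"), ("شمسی", "ADJ"), ("دیده_شد", "V"), (".", "PUNC")]
    # Hypothetical second sentence, included only to give the loop more input.
    yield [("خبر", "N"), ("تأیید_شد", "V"), (".", "PUNC")]


def tag_counts(sents, limit=None):
    """Count POS tags over at most `limit` sentences (all if limit is None)."""
    counts = Counter()
    for sentence in islice(sents, limit):
        counts.update(tag for _, tag in sentence)
    return counts


print(tag_counts(stand_in_sents(), limit=1))
print(tag_counts(stand_in_sents()))
```

With a real reader, the same function would presumably be called as tag_counts(bijankhan.sents(), limit=1000), since it relies only on the sentence shape shown in the example above.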