Peykare Reader

This module includes classes and functions for reading the Peykare corpus.

Peykare is a collection of formal written and spoken Persian texts gathered from real sources such as newspapers, websites, and pre-typed documents, then corrected and tagged. The corpus contains approximately 100 million words drawn from a highly diverse range of sources. Of these, 10 million words have been manually tagged by linguistics students using 882 syntactic-semantic tags, and each file is classified by subject and source. Prepared by the Intelligent Signal Processing Research Center, the corpus is suitable for training language models and for other natural language processing projects.

PeykareReader

A reader for the Peykare corpus.

Parameters:

Name | Type | Description | Default
root | str | Path to the root folder containing the corpus files. | required
joined_verb_parts | bool | If True, multi-part verbs are returned as joined tokens. | True
pos_map | callable | A function that maps fine-grained tags to coarse-grained ones. | coarse_pos_e
universal_pos | bool | If True, uses the universal POS tagset. | False

__init__(root, joined_verb_parts=True, pos_map=coarse_pos_e, universal_pos=False)

Initializes the PeykareReader.

Parameters:

Name | Type | Description | Default
root | str | Path to the folder containing the corpus files. | required
joined_verb_parts | bool | If True, multi-part verbs are joined using an underscore. | True
pos_map | callable | A function that maps fine-grained tags to coarse-grained ones. | coarse_pos_e
universal_pos | bool | If True, uses universal POS tags. | False

doc_to_sents(document)

Splits an input document into sentences and yields them one at a time.

Each sentence is a list of (word, tag) tuples.

Parameters:

Name | Type | Description | Default
document | str | The raw document text to be converted. | required

Yields:

Type | Description
list[tuple[str, str]] | The next sentence in the form of a list of (word, tag) tuples.

docs()

Yields the documents of the corpus as raw text, one at a time.

Yields:

Type | Description
str | The raw text of the next document.

sents()

Yields the sentences of the corpus, each as a list of (token, tag) tuples.

Examples:

>>> peykare = PeykareReader(root='peykare')
>>> next(peykare.sents())
[('دیرزمانی', 'N'), ('از', 'P'), ('راه‌اندازی', 'N,EZ'), ('شبکه‌ی', 'N,EZ'), ('خبر', 'N,EZ'), ('الجزیره', 'N'), ('نمی‌گذرد', 'V'), ('،', 'PUNC'), ('اما', 'CONJ'), ('این', 'DET'), ('شبکه‌ی', 'N,EZ'), ('خبری', 'AJ,EZ'), ('عربی', 'N'), ('بسیار', 'ADV'), ('سریع', 'ADV'), ('توانسته', 'V'), ('در', 'P'), ('میان', 'N,EZ'), ('شبکه‌های', 'N,EZ'), ('عظیم', 'AJ,EZ'), ('خبری', 'AJ'), ('و', 'CONJ'), ('بنگاه‌های', 'N,EZ'), ('چندرسانه‌ای', 'AJ,EZ'), ('دنیا', 'N'), ('خودی', 'N'), ('نشان', 'N'), ('دهد', 'V'), ('.', 'PUNC')]

Yields:

Type | Description
list[tuple[str, str]] | The next sentence in the form of a list of (token, tag) tuples.
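Because each sentence is a plain list of (token, tag) tuples, it can be processed with ordinary Python tools. The following is a minimal sketch that counts tag frequencies. The sample data is the first part of the example sentence shown for sents(), and `tag_counts` is a hypothetical helper, not part of the module. With the corpus available, the same function could presumably be fed the generator returned by sents() directly.

```python
from collections import Counter

# Sample data copied from the start of the example sentence in this section.
# With the real corpus, this would instead come from PeykareReader(...).sents().
sample_sents = [
    [
        ("دیرزمانی", "N"),
        ("از", "P"),
        ("راه‌اندازی", "N,EZ"),
        ("شبکه‌ی", "N,EZ"),
        ("خبر", "N,EZ"),
        ("الجزیره", "N"),
        ("نمی‌گذرد", "V"),
        ("،", "PUNC"),
    ],
]


def tag_counts(sents):
    """Count how often each tag occurs across an iterable of tagged sentences."""
    counts = Counter()
    for sentence in sents:
        # Each item is a (token, tag) pair; only the tag is counted.
        counts.update(tag for _token, tag in sentence)
    return counts


counts = tag_counts(sample_sents)
print(counts["N,EZ"], counts["N"])  # 3 2
```

Since the function only iterates once over its input, it works with a generator as well as with a list.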

coarse_pos_e(tags, word)

Converts fine-grained tags to coarse-grained POS tags.

Examples:

>>> coarse_pos_e(['N','COM','SING'],'الجزیره')
'N'

Parameters:

Name | Type | Description | Default
tags | list[str] | List of fine-grained tags. | required
word | str | The word associated with the tags. | required

Returns:

Type | Description
str | The coarse-grained tag, returned as a single string as in the example above.
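To illustrate the idea of collapsing a fine-grained tag list into one coarse tag, here is a minimal sketch. It is not the library's implementation: `pick_coarse_tag` and `COARSE_TAGS` are hypothetical names, the tag set is simply the base tags that appear in the sents() example, and the fallback to "N" is an arbitrary assumption.

```python
# Hypothetical coarse tagset, taken from the tags visible in the sents() example.
# The set actually used by coarse_pos_e may differ.
COARSE_TAGS = {"N", "V", "AJ", "ADV", "P", "DET", "CONJ", "PUNC"}


def pick_coarse_tag(tags, word):
    """Return the first fine-grained tag that is also a coarse tag.

    `word` is accepted only to mirror the (tags, word) signature;
    this sketch does not use it.
    """
    for tag in tags:
        if tag in COARSE_TAGS:
            return tag
    # Assumption: fall back to noun when nothing matches.
    return "N"


print(pick_coarse_tag(["N", "COM", "SING"], "الجزیره"))  # N
```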

coarse_pos_u(tags, word)

Converts fine-grained tags to coarse-grained universal POS tags.

Examples:

>>> coarse_pos_u(['N','COM','SING'], 'الجزیره')
'NOUN'

Parameters:

Name | Type | Description | Default
tags | list[str] | List of fine-grained tags. | required
word | str | The word to be converted to a universal tag. | required

Returns:

Type | Description
str | The coarse-grained universal POS tag, returned as a single string as in the example above.
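A conversion of this kind can be pictured as a dictionary lookup. The sketch below is an illustration under stated assumptions, not the library's code: `pick_universal_tag` and `UNIVERSAL_TAGS` are hypothetical names, and only the N to NOUN pair is confirmed by the example above. The remaining pairs are assumed from the Universal POS tagset, and "X" is that tagset's label for "other".

```python
# Hypothetical mapping; only "N" -> "NOUN" is confirmed by the documented example.
UNIVERSAL_TAGS = {
    "N": "NOUN",
    "V": "VERB",
    "AJ": "ADJ",
    "ADV": "ADV",
    "P": "ADP",
    "DET": "DET",
    "CONJ": "CCONJ",
    "PUNC": "PUNCT",
}


def pick_universal_tag(tags, word):
    """Map the first recognised fine-grained tag to a universal POS tag.

    `word` mirrors the (tags, word) signature and is unused in this sketch.
    """
    for tag in tags:
        if tag in UNIVERSAL_TAGS:
            return UNIVERSAL_TAGS[tag]
    # Assumption: unknown tag lists map to the universal "other" tag.
    return "X"


print(pick_universal_tag(["N", "COM", "SING"], "الجزیره"))  # NOUN
```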

join_verb_parts(sentence)

Joins multi-part verbs with an underscore character.

Takes a sentence in the form of a list of (token, tag) tuples and joins tokens belonging to multi-part verbs using an underscore (_).

Examples:

>>> join_verb_parts([('اولین', 'AJ'), ('سیاره', 'Ne'), ('خارج', 'AJ'), ('از', 'P'), ('منظومه', 'Ne'), ('شمسی', 'AJ'), ('دیده', 'AJ'), ('شد', 'V'), ('.', 'PUNC')])
[('اولین', 'AJ'), ('سیاره', 'Ne'), ('خارج', 'AJ'), ('از', 'P'), ('منظومه', 'Ne'), ('شمسی', 'AJ'), ('دیده_شد', 'V'), ('.', 'PUNC')]

Parameters:

Name | Type | Description | Default
sentence | list[tuple[str, str]] | Sentence as a list of (token, tag) tuples. | required

Returns:

Type | Description
list[tuple[str, str]] | A list of (token, tag) tuples where multi-part verbs are joined into a single token.
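The joining step can be sketched as a single pass that merges a token into the auxiliary verb that follows it. This is a deliberately simplified illustration, not the library's algorithm: `join_with_auxiliary` and `AUXILIARIES` are hypothetical, the auxiliary list contains only the one form from the example above, and the sketch will over-join whenever that form follows a word that is not part of a verb.

```python
# Hypothetical, deliberately tiny list of auxiliary verb forms.
# A real implementation would need a much fuller inventory.
AUXILIARIES = {"شد"}


def join_with_auxiliary(sentence):
    """Merge each V-tagged auxiliary with the token immediately before it."""
    result = []
    for token, tag in sentence:
        if result and tag == "V" and token in AUXILIARIES:
            # Replace the previous token with "previous_auxiliary",
            # keeping the verb tag of the auxiliary.
            previous_token, _previous_tag = result.pop()
            result.append((previous_token + "_" + token, tag))
        else:
            result.append((token, tag))
    return result


sample = [("دیده", "AJ"), ("شد", "V"), (".", "PUNC")]
joined = join_with_auxiliary(sample)
print(len(joined))  # 2
```

The merged token keeps the tag of the auxiliary, which matches the example above, where the joined token is tagged V.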