Peykare Reader

This module includes classes and functions for reading the Peykare corpus.

Peykare is a collection of formal written and spoken Persian texts drawn from real sources such as newspapers, websites, and pre-typed documents, all corrected and tagged. The corpus contains approximately 100 million words from a wide variety of sources. Of these, 10 million words have been manually tagged by linguistics students using 882 syntactic-semantic tags, and each file is classified by subject and source. Prepared by the Intelligent Signal Processing Research Center, the corpus is suitable for training language models and for other natural language processing projects.

`PeykareReader`

A reader for the Peykare corpus.

Parameters:

Name	Type	Description	Default
`root`	`str`	Path to the root folder containing corpus files.	required
`joined_verb_parts`	`bool`	If `True`, multi-part verbs will be returned as joined tokens.	`True`
`pos_map`	`Callable`	A function that maps fine-grained tags to coarse-grained ones.	`coarse_pos_e`
`universal_pos`	`bool`	If `True`, uses the universal POS tagset.	`False`

`__init__(root, joined_verb_parts=True, pos_map=coarse_pos_e, universal_pos=False)`

Initializes the PeykareReader.

Parameters:

Name	Type	Description	Default
`root`	`str`	Path to the folder containing the corpus files.	required
`joined_verb_parts`	`bool`	If `True`, multi-part verbs will be joined using an underscore.	`True`
`pos_map`	`Callable`	A function that maps fine-grained tags to coarse-grained ones.	`coarse_pos_e`
`universal_pos`	`bool`	If `True`, uses universal POS tags.	`False`

`doc_to_sents(document)`

Converts an input document into a list of sentences.

Each sentence is a list of (word, tag) tuples.

Parameters:

Name	Type	Description	Default
`document`	`str`	The raw document text to be converted.	required

Yields:

Type	Description
`list[tuple[str, str]]`	The next sentence in the form of a list of `(word, tag)` tuples.

`docs()`

Yields the documents of the corpus as raw text.

Yields:

Type	Description
`str`	The raw text of the next document.

`sents()`

Yields the sentences of the corpus, each as a list of (token, tag) tuples.

Examples:

>>> peykare = PeykareReader(root='peykare')
>>> next(peykare.sents())
[('دیرزمانی', 'N'), ('از', 'P'), ('راه‌اندازی', 'N,EZ'), ('شبکه‌ی', 'N,EZ'), ('خبر', 'N,EZ'), ('الجزیره', 'N'), ('نمی‌گذرد', 'V'), ('،', 'PUNC'), ('اما', 'CONJ'), ('این', 'DET'), ('شبکه‌ی', 'N,EZ'), ('خبری', 'AJ,EZ'), ('عربی', 'N'), ('بسیار', 'ADV'), ('سریع', 'ADV'), ('توانسته', 'V'), ('در', 'P'), ('میان', 'N,EZ'), ('شبکه‌های', 'N,EZ'), ('عظیم', 'AJ,EZ'), ('خبری', 'AJ'), ('و', 'CONJ'), ('بنگاه‌های', 'N,EZ'), ('چندرسانه‌ای', 'AJ,EZ'), ('دنیا', 'N'), ('خودی', 'N'), ('نشان', 'N'), ('دهد', 'V'), ('.', 'PUNC')]

Yields:

Type	Description
`list[tuple[str, str]]`	The next sentence in the form of a list of `(token, tag)` tuples.
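
As a rough illustration of consuming this output, the sketch below counts tag frequencies over sentences shaped like the ones `sents()` yields. The `tag_counts` helper and the `sample_sents` list are hypothetical stand-ins, not part of the module, so that the snippet runs without the corpus files; with the corpus available, `peykare.sents()` could presumably be passed in place of `sample_sents`.

```python
from collections import Counter
from typing import Iterable

Sentence = list[tuple[str, str]]


def tag_counts(sentences: Iterable[Sentence]) -> Counter:
    """Count how often each tag occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tag for _token, tag in sentence)
    return counts


# Hypothetical stand-in for peykare.sents(): a few tokens from the example above.
sample_sents = [
    [("دیرزمانی", "N"), ("از", "P"), ("راه‌اندازی", "N,EZ"), ("نمی‌گذرد", "V"), ("،", "PUNC")],
]

counts = tag_counts(sample_sents)
```

Because `sents()` is a generator, a helper like this processes one sentence at a time and never needs the whole corpus in memory.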

`coarse_pos_e(tags, word)`

Converts fine-grained tags to coarse-grained POS tags.

Examples:

>>> coarse_pos_e(['N','COM','SING'],'الجزیره')
'N'

Parameters:

Name	Type	Description	Default
`tags`	`list[str]`	List of fine-grained tags.	required
`word`	`str`	The word associated with the tags.	required

Returns:

Type	Description
`str`	The coarse-grained tag.
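
To make the idea of coarsening concrete, here is a minimal sketch of one way such a mapping could work. It is not the library's implementation: the function name `coarse_tag_sketch`, the contents of `COARSE_TAGS`, the `,EZ` suffix (modelled on the tags in the `sents()` example), and the `'N'` fallback are all assumptions made for illustration.

```python
# Assumed set of coarse-grained tags; the library's actual set may differ.
COARSE_TAGS = {"N", "V", "AJ", "ADV", "PRO", "DET", "P", "POSTP", "NUM", "CONJ", "PUNC"}


def coarse_tag_sketch(tags: list[str], word: str) -> str:
    """Return the first coarse tag found in `tags`, marking Ezafe if present.

    `word` is unused here; it is kept only to mirror the documented signature.
    """
    for tag in tags:
        if tag in COARSE_TAGS:
            # Assumption: Ezafe is kept as a ",EZ" suffix on the coarse tag.
            return tag + (",EZ" if "EZ" in tags else "")
    return "N"  # assumed fallback when no coarse tag is recognised


plain = coarse_tag_sketch(["N", "COM", "SING"], "الجزیره")
with_ezafe = coarse_tag_sketch(["N", "COM", "SING", "EZ"], "شبکه‌ی")
```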

`coarse_pos_u(tags, word)`

Converts fine-grained tags to coarse-grained universal POS tags.

Examples:

>>> coarse_pos_u(['N','COM','SING'], 'الجزیره')
'NOUN'

Parameters:

Name	Type	Description	Default
`tags`	`list[str]`	List of fine-grained tags.	required
`word`	`str`	The word to be converted to a universal tag.	required

Returns:

Type	Description
`str`	The coarse-grained universal POS tag.
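
A universal mapping can be pictured as a dictionary lookup from a corpus tag to a universal tag. The sketch below is illustrative only: `universal_tag_sketch`, the `COARSE_TO_UNIVERSAL` table, and the `'NOUN'` fallback are assumptions, and the library's own table may differ.

```python
# Assumed coarse-to-universal table; illustrative, not the library's mapping.
COARSE_TO_UNIVERSAL = {
    "N": "NOUN",
    "V": "VERB",
    "AJ": "ADJ",
    "ADV": "ADV",
    "PRO": "PRON",
    "DET": "DET",
    "P": "ADP",
    "NUM": "NUM",
    "CONJ": "CCONJ",
    "PUNC": "PUNCT",
}


def universal_tag_sketch(tags: list[str], word: str) -> str:
    """Return the universal tag for the first fine-grained tag found in the table.

    `word` is unused here; it is kept only to mirror the documented signature.
    """
    for tag in tags:
        if tag in COARSE_TO_UNIVERSAL:
            return COARSE_TO_UNIVERSAL[tag]
    return "NOUN"  # assumed fallback


universal = universal_tag_sketch(["N", "COM", "SING"], "الجزیره")
```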

`join_verb_parts(sentence)`

Joins multi-part verbs with an underscore character.

Takes a sentence in the form of a list of (token, tag) tuples and joins tokens belonging to multi-part verbs using an underscore (_).

Examples:

>>> join_verb_parts([('اولین', 'AJ'), ('سیاره', 'Ne'), ('خارج', 'AJ'), ('از', 'P'), ('منظومه', 'Ne'), ('شمسی', 'AJ'), ('دیده', 'AJ'), ('شد', 'V'), ('.', 'PUNC')])
[('اولین', 'AJ'), ('سیاره', 'Ne'), ('خارج', 'AJ'), ('از', 'P'), ('منظومه', 'Ne'), ('شمسی', 'AJ'), ('دیده_شد', 'V'), ('.', 'PUNC')]

Parameters:

Name	Type	Description	Default
`sentence`	`list[tuple[str, str]]`	Sentence as a list of `(token, tag)` tuples.	required

Returns:

Type	Description
`list[tuple[str, str]]`	A list of `(token, tag)` tuples where multi-part verbs are joined into a single token.
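
The joining step can be sketched as a single pass that merges a token into the previous one when the pair looks like a multi-part verb. This is a simplified illustration rather than the library's algorithm: `join_verb_parts_sketch` and the two tiny lookup sets `FIRST_PARTS` and `AUXILIARIES` are hypothetical, and a real implementation would need far larger lists of verb parts.

```python
# Hypothetical lookup sets, kept tiny for illustration only.
FIRST_PARTS = {"دیده", "گفته", "رفته"}  # participles that may start a multi-part verb
AUXILIARIES = {"شد", "است", "بود"}  # auxiliary forms that may follow them


def join_verb_parts_sketch(sentence: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Join an auxiliary to a preceding participle with an underscore."""
    result: list[tuple[str, str]] = []
    for token, tag in sentence:
        if result and token in AUXILIARIES and result[-1][0] in FIRST_PARTS:
            prev_token, _prev_tag = result.pop()
            # The merged token keeps the tag of the auxiliary (here 'V').
            result.append((f"{prev_token}_{token}", tag))
        else:
            result.append((token, tag))
    return result


example = [("منظومه", "Ne"), ("شمسی", "AJ"), ("دیده", "AJ"), ("شد", "V"), (".", "PUNC")]
joined = join_verb_parts_sketch(example)
```

Keeping the auxiliary's tag on the merged token matches the documented example, where `دیده_شد` is tagged `V`.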
