Peykare Reader
This module includes classes and functions for reading the Peykare corpus.
Peykare is a collection of formal written and spoken Persian texts gathered from real sources such as newspapers, websites, and pre-typed documents, which have been corrected and tagged. The corpus contains approximately 100 million words drawn from highly diverse sources. Of these, 10 million words have been manually tagged by linguistics students using 882 syntactic-semantic tags, and each file is classified by subject and source. Prepared by the Intelligent Signal Processing Research Center, the corpus is suitable for training language models and for other natural language processing projects.
PeykareReader
A reader for the Peykare corpus.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `root` | `str` | Path to the root folder containing corpus files. | required |
| `joined_verb_parts` | `bool` | If `True`, multi-part verbs are joined with an underscore. | `True` |
| `pos_map` | `Callable` | A function to map fine-grained tags to coarse-grained ones. | `coarse_pos_e` |
| `universal_pos` | `bool` | If `True`, tags are mapped to universal POS tags. | `False` |
__init__(root, joined_verb_parts=True, pos_map=coarse_pos_e, universal_pos=False)
Initializes the PeykareReader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `root` | `str` | Path to the folder containing the corpus files. | required |
| `joined_verb_parts` | `bool` | If `True`, multi-part verbs are joined with an underscore. | `True` |
| `pos_map` | `Callable` | A mapper for fine-grained to coarse-grained tags. | `coarse_pos_e` |
| `universal_pos` | `bool` | If `True`, tags are mapped to universal POS tags. | `False` |
doc_to_sents(document)
Converts an input document into a list of sentences.
Each sentence is a list of (word, tag) tuples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `document` | `str` | The raw document text to be converted. | required |
Yields:
| Type | Description |
|---|---|
| `list[tuple[str, str]]` | The next sentence in the form of a list of `(word, tag)` tuples. |
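To illustrate the output shape described above, here is a minimal sketch; it is not the library's implementation. It assumes a hypothetical, simplified input in which each non-empty line holds `word<TAB>tag` and a sentence ends at a token tagged `PUNC` whose text is `.`, `!`, or `؟`. The actual Peykare file layout is not described on this page and may differ. The names `doc_to_sents_sketch` and `SENTENCE_END` are made up for this example.

```python
from typing import Iterator

# Assumed sentence-final punctuation marks (an assumption of this sketch).
SENTENCE_END = {".", "!", "؟"}


def doc_to_sents_sketch(document: str) -> Iterator[list[tuple[str, str]]]:
    """Yield sentences as lists of (word, tag) tuples.

    Assumes a hypothetical format: one "word<TAB>tag" pair per line.
    """
    sentence: list[tuple[str, str]] = []
    for line in document.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, tag = line.split("\t", 1)
        sentence.append((word, tag))
        # Close the sentence on sentence-final punctuation.
        if tag == "PUNC" and word in SENTENCE_END:
            yield sentence
            sentence = []
    if sentence:
        yield sentence  # flush a trailing sentence without final punctuation


sample = "این\tDET\nشبکه\tN\n.\tPUNC\nاما\tCONJ\n"
sentences = list(doc_to_sents_sketch(sample))
```

Writing the function as a generator matches the "Yields" contract: each sentence is handed to the caller as soon as it is complete, so a large document never has to be held as a full list of sentences.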
docs()
Yields the documents of the corpus as raw text.
Yields:
| Type | Description |
|---|---|
| `str` | The raw text of the next document. |
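The idea of yielding one raw document at a time can be sketched with the standard library alone. This is an illustration, not the reader's actual code: the function name `iter_docs_sketch`, the `utf-8` default encoding, and the choice to skip empty files are all assumptions.

```python
import os
from typing import Iterator


def iter_docs_sketch(root: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield the raw text of every non-empty file found under root.

    The default encoding is an assumption; pass the one your copy uses.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding=encoding) as handle:
                text = handle.read()
            if text:
                yield text  # empty files are skipped
```

Because the function is a generator, files are opened only as the caller asks for the next document, which keeps memory use flat on a large corpus.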
sents()
Yields the sentences of the corpus, each as a list of (token, tag) tuples.
Examples:
>>> peykare = PeykareReader(root='peykare')
>>> next(peykare.sents())
[('دیرزمانی', 'N'), ('از', 'P'), ('راهاندازی', 'N,EZ'), ('شبکهی', 'N,EZ'), ('خبر', 'N,EZ'), ('الجزیره', 'N'), ('نمیگذرد', 'V'), ('،', 'PUNC'), ('اما', 'CONJ'), ('این', 'DET'), ('شبکهی', 'N,EZ'), ('خبری', 'AJ,EZ'), ('عربی', 'N'), ('بسیار', 'ADV'), ('سریع', 'ADV'), ('توانسته', 'V'), ('در', 'P'), ('میان', 'N,EZ'), ('شبکههای', 'N,EZ'), ('عظیم', 'AJ,EZ'), ('خبری', 'AJ'), ('و', 'CONJ'), ('بنگاههای', 'N,EZ'), ('چندرسانهای', 'AJ,EZ'), ('دنیا', 'N'), ('خودی', 'N'), ('نشان', 'N'), ('دهد', 'V'), ('.', 'PUNC')]
Yields:
| Type | Description |
|---|---|
| `list[tuple[str, str]]` | The next sentence in the form of a list of `(token, tag)` tuples. |
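Since `sents()` yields one sentence at a time, downstream code can treat it as a stream. The sketch below counts tag frequencies over any iterable of sentences with that shape. `tag_counts` and `sample_sents` are hypothetical names, and the sample is a short in-memory stand-in built from tokens in the example output above, not real reader output.

```python
from collections import Counter
from typing import Iterable

Sentence = list[tuple[str, str]]


def tag_counts(sentences: Iterable[Sentence]) -> Counter:
    """Count how often each tag occurs across all sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(tag for _token, tag in sentence)
    return counts


# A short in-memory stand-in for what a sentence generator might yield.
sample_sents = [
    [("از", "P"), ("خبر", "N,EZ"), ("الجزیره", "N"), (".", "PUNC")],
    [("این", "DET"), ("دنیا", "N"), (".", "PUNC")],
]
counts = tag_counts(sample_sents)
```

With a reader instance, one would presumably pass `peykare.sents()` in place of `sample_sents`; the function never materializes the whole corpus.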
coarse_pos_e(tags, word)
Converts fine-grained tags to coarse-grained POS tags.
Examples:
>>> coarse_pos_e(['N','COM','SING'],'الجزیره')
'N'
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tags` | `list[str]` | List of fine-grained tags. | required |
| `word` | `str` | The word associated with the tags. | required |
Returns:
| Type | Description |
|---|---|
| `str` | The coarse-grained tag. |
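One way such a mapping could work is sketched below. It is an assumption-laden illustration, not the library's code: `coarse_tag_sketch` and `COARSE_TAGS` are hypothetical names, the coarse tag set is limited to tags that appear in the examples on this page, and the `,EZ` suffix rule is inferred from tags such as `N,EZ` in the `sents()` example.

```python
# Coarse tags that appear in the examples on this page; the full set is assumed.
COARSE_TAGS = ("N", "V", "AJ", "ADV", "P", "PUNC", "CONJ", "DET")


def coarse_tag_sketch(tags: list[str], word: str) -> str:
    """Return the first coarse tag found in tags, plus ',EZ' if EZ is present.

    `word` is accepted only to mirror the documented signature; it is unused.
    """
    coarse = next((tag for tag in tags if tag in COARSE_TAGS), None)
    if coarse is None:
        raise ValueError(f"no coarse tag found in {tags!r}")
    # The ',EZ' suffix is an inferred convention, not a documented rule.
    return (coarse + ",EZ") if "EZ" in tags else coarse
```

Raising an error on an unknown tag list is a design choice of this sketch; a production mapper might prefer a fallback tag instead.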
coarse_pos_u(tags, word)
Converts fine-grained tags to coarse-grained universal POS tags.
Examples:
>>> coarse_pos_u(['N','COM','SING'], 'الجزیره')
'NOUN'
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tags` | `list[str]` | List of fine-grained tags. | required |
| `word` | `str` | The word to be converted to a universal tag. | required |
Returns:
| Type | Description |
|---|---|
| `str` | The coarse-grained universal POS tag. |
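A table lookup is one plausible way to express this conversion. In the sketch below, only the `N` to `NOUN` pair comes from the example above; every other pair is an assumption based on the Universal POS tag inventory, and `universal_tag_sketch`, `UNIVERSAL_MAP`, and the `default` parameter are hypothetical.

```python
# Only N -> NOUN is taken from the example above; every other pair is an
# assumption based on the Universal POS tag inventory.
UNIVERSAL_MAP = {
    "N": "NOUN",
    "V": "VERB",
    "AJ": "ADJ",
    "ADV": "ADV",
    "P": "ADP",
    "PUNC": "PUNCT",
    "CONJ": "CCONJ",
    "DET": "DET",
}


def universal_tag_sketch(tags: list[str], word: str, default: str = "X") -> str:
    """Return the universal tag for the first mappable fine-grained tag.

    `word` mirrors the documented signature and is unused here. When no tag
    can be mapped, the Universal POS catch-all tag "X" is returned.
    """
    for tag in tags:
        if tag in UNIVERSAL_MAP:
            return UNIVERSAL_MAP[tag]
    return default
```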
join_verb_parts(sentence)
Joins multi-part verbs with an underscore character.
Takes a sentence in the form of a list of (token, tag) tuples and joins
tokens belonging to multi-part verbs using an underscore (_).
Examples:
>>> join_verb_parts([('اولین', 'AJ'), ('سیاره', 'Ne'), ('خارج', 'AJ'), ('از', 'P'), ('منظومه', 'Ne'), ('شمسی', 'AJ'), ('دیده', 'AJ'), ('شد', 'V'), ('.', 'PUNC')])
[('اولین', 'AJ'), ('سیاره', 'Ne'), ('خارج', 'AJ'), ('از', 'P'), ('منظومه', 'Ne'), ('شمسی', 'AJ'), ('دیده_شد', 'V'), ('.', 'PUNC')]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `sentence` | `list[tuple[str, str]]` | Sentence as a list of `(token, tag)` tuples. | required |
Returns:
| Type | Description |
|---|---|
| `list[tuple[str, str]]` | A list of `(token, tag)` tuples. |
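The joining step can be sketched as a single pass over the sentence. This is a deliberately naive illustration, not the library's algorithm: `join_verb_parts_sketch` and `AFTER_VERBS` are hypothetical, the set contains only a few auxiliary forms chosen for the example, and a real implementation would need a much fuller verb lexicon and stricter conditions on the preceding token.

```python
# Hypothetical auxiliary forms that attach to the preceding token.
# A real implementation would need a much fuller verb lexicon.
AFTER_VERBS = {"شد", "شده", "شود"}


def join_verb_parts_sketch(sentence: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Join an auxiliary verb to the token before it using an underscore."""
    result: list[tuple[str, str]] = []
    for token, tag in sentence:
        if result and tag == "V" and token in AFTER_VERBS:
            previous_token, _previous_tag = result[-1]
            # The merged token keeps the tag of the verb part.
            result[-1] = (previous_token + "_" + token, tag)
        else:
            result.append((token, tag))
    return result


joined = join_verb_parts_sketch([("دیده", "AJ"), ("شد", "V"), (".", "PUNC")])
```

The merged token carries the `V` tag of the auxiliary, which matches the shape of the documented example output; the sketch would also join pairs that are not true multi-part verbs, which is why a lexicon-based check matters in practice.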