Degarbayan Reader
This module includes classes and functions for reading the Degarbayan corpus.
The Degarbayan corpus contains 1,523 instances labeled for paraphrase. Paraphrases are sentences or phrases that express the same concept in different words. The data was collected from news agencies and is divided into three categories: 'Paraphrase', 'Semi-Paraphrase', and 'Not Paraphrase'. The labels were assigned through crowdsourcing on Telegram.
DegarbayanReader
This class includes methods for reading the Degarbayan corpus.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | str | Path to the folder containing the corpus files. | required |
| corpus_file | str | The corpus information file. No need to change this if using the standard version. | 'corpus_pair.xml' |
| judge_type | str | Determines the labeling scheme. Can be 'three_class' or 'two_class'. In 'three_class', labels are 'Paraphrase', 'SemiParaphrase', and 'NotParaphrase'. In 'two_class', 'SemiParaphrase' is also labeled as 'Paraphrase'. | 'three_class' |
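The effect of judge_type on the labels can be illustrated with a small sketch. The function collapse_label is a hypothetical helper written for this example and is not part of DegarbayanReader; it only mirrors the mapping described for 'three_class' and 'two_class', assuming the raw labels are the three strings 'Paraphrase', 'SemiParaphrase', and 'NotParaphrase'.

```python
VALID_JUDGE_TYPES = ("three_class", "two_class")


def collapse_label(label: str, judge_type: str = "three_class") -> str:
    """Hypothetical helper: map a raw label to the label reported under judge_type."""
    if judge_type not in VALID_JUDGE_TYPES:
        raise ValueError(f"judge_type must be one of {VALID_JUDGE_TYPES}")
    # In 'two_class' mode, 'SemiParaphrase' is merged into 'Paraphrase'.
    if judge_type == "two_class" and label == "SemiParaphrase":
        return "Paraphrase"
    # In 'three_class' mode (and for all other labels) the label is unchanged.
    return label


three = [collapse_label(l, "three_class") for l in ("Paraphrase", "SemiParaphrase", "NotParaphrase")]
two = [collapse_label(l, "two_class") for l in ("Paraphrase", "SemiParaphrase", "NotParaphrase")]
```

Under these assumptions, `three` keeps all three labels, while `two` contains only 'Paraphrase' and 'NotParaphrase'.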
__init__(root, corpus_file='corpus_pair.xml', judge_type='three_class')
Initializes the DegarbayanReader with the root folder and settings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | str | Path to the folder containing the corpus files. | required |
| corpus_file | str | The corpus data file. Defaults to 'corpus_pair.xml'. | 'corpus_pair.xml' |
| judge_type | str | The classification mode ('three_class' or 'two_class'). Defaults to 'three_class'. | 'three_class' |
docs()
Iterates over the documents available in the corpus, yielding one document at a time.
Yields:
| Type | Description |
|---|---|
| dict[str, Any] | The next document as a dictionary containing pair information. |
pairs()
Iterates over the paraphrase pairs in the corpus, yielding each one as a tuple of (original_text, paraphrase_text, label).
Examples:
>>> degarbayan = DegarbayanReader(root='degarbayan')
>>> next(degarbayan.pairs())
('24 نفر نهایی تیم ملی بدون تغییری خاص معرفی شد', 'کی روش 24 بازیکن را به تیم ملی فوتبال دعوت کرد', 'Paraphrase')
Yields:
| Type | Description |
|---|---|
| tuple[str, str, str] | The next paraphrase pair as a tuple of (sentence1, sentence2, judge). |
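Because pairs() yields plain (sentence1, sentence2, judge) tuples, its output can be consumed by any code that accepts an iterable of such tuples. The sketch below counts how often each label occurs. The function label_counts is a hypothetical helper, and sample_pairs is made-up placeholder data standing in for the output of pairs(), so the example runs without the corpus files.

```python
from collections import Counter
from typing import Iterable, Tuple


def label_counts(pairs: Iterable[Tuple[str, str, str]]) -> Counter:
    """Hypothetical helper: count the third element (the label) of each tuple."""
    return Counter(label for _, _, label in pairs)


# Placeholder tuples shaped like the output of pairs(); not real corpus data.
sample_pairs = [
    ("sentence A", "sentence A, reworded", "Paraphrase"),
    ("sentence B", "sentence B, partly reworded", "SemiParaphrase"),
    ("sentence C", "an unrelated sentence", "NotParaphrase"),
    ("sentence D", "sentence D, reworded", "Paraphrase"),
]

counts = label_counts(sample_pairs)
for label, count in counts.most_common():
    print(label, count)
```

With a real reader, the same function could presumably be called as label_counts(degarbayan.pairs()), since it consumes the generator only once.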