Degarbayan Reader

This module includes classes and functions for reading the Degarbayan corpus.

The Degarbayan corpus contains 1,523 labeled instances, each a pair of sentences or phrases. Paraphrases are different expressions of the same concept. The data was collected from news agencies and is presented in three categories: 'Paraphrase', 'SemiParaphrase', and 'NotParaphrase'. The labels were assigned through crowdsourcing on Telegram.

DegarbayanReader

This class includes methods for reading the Degarbayan corpus.

Parameters:

root (str, required): Path to the folder containing the corpus files.

corpus_file (str, default 'corpus_pair.xml'): The corpus information file. No need to change this if using the standard version.

judge_type (str, default 'three_class'): Determines the labeling scheme. Can be 'three_class' or 'two_class'. In 'three_class', labels are 'Paraphrase', 'SemiParaphrase', and 'NotParaphrase'. In 'two_class', 'SemiParaphrase' is also labeled as 'Paraphrase'.
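The effect of judge_type on a label can be sketched as below. This is only an illustration of the documented behaviour: apply_judge_type is a hypothetical helper, not part of the reader, and the actual implementation may differ.

```python
# Hypothetical helper (not part of the library): mirrors the documented
# effect of judge_type on a single label.
VALID_JUDGE_TYPES = ("three_class", "two_class")


def apply_judge_type(label: str, judge_type: str = "three_class") -> str:
    """Return the label as it would appear under the given judge_type."""
    if judge_type not in VALID_JUDGE_TYPES:
        raise ValueError(f"judge_type must be one of {VALID_JUDGE_TYPES}")
    # In 'two_class' mode, 'SemiParaphrase' is folded into 'Paraphrase'.
    if judge_type == "two_class" and label == "SemiParaphrase":
        return "Paraphrase"
    # In 'three_class' mode (and for all other labels) nothing changes.
    return label
```

Under this sketch, 'NotParaphrase' stays unchanged in both modes, so 'two_class' leaves exactly two possible labels.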

__init__(root, corpus_file='corpus_pair.xml', judge_type='three_class')

Initializes the DegarbayanReader with the root folder and settings.

Parameters:

root (str, required): Path to the folder containing the corpus files.

corpus_file (str, default 'corpus_pair.xml'): The corpus data file.

judge_type (str, default 'three_class'): The classification mode ('three_class' or 'two_class').

docs()

Yields the documents in the corpus one at a time.

Yields:

dict[str, Any]: The next document as a dictionary containing pair information.
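Since the fields of each document are not listed here, one way to find them is to look at the first yielded dictionary. The sketch below assumes only that docs() yields dictionaries; peek_keys is a hypothetical helper, and the keys in the stand-in data are placeholders rather than the corpus's real field names.

```python
from typing import Any, Iterable


def peek_keys(docs: Iterable[dict[str, Any]]) -> list[str]:
    """Return the sorted field names of the first document, or [] if there is none."""
    first = next(iter(docs), None)
    return sorted(first) if first is not None else []


# Stand-in for reader.docs(); the keys below are assumed for illustration only.
sample_docs = iter(
    [{"sentence1": "first text", "sentence2": "second text", "judge": "Paraphrase"}]
)
sample_keys = peek_keys(sample_docs)
print(sample_keys)
```

With a real reader, the same call would be peek_keys(degarbayan.docs()). Note that it consumes the first item of the generator it is given.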

pairs()

Yields the sentence pairs in the corpus as (sentence1, sentence2, judge) tuples, where judge is the label assigned to the pair.

Examples:

>>> degarbayan = DegarbayanReader(root='degarbayan')
>>> next(degarbayan.pairs())
('24 نفر نهایی تیم ملی بدون تغییری خاص معرفی شد', 'کی روش 24 بازیکن را به تیم ملی فوتبال دعوت کرد', 'Paraphrase')

Yields:

tuple[str, str, str]: The next paraphrase pair as a tuple of (sentence1, sentence2, judge).
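Because each yielded tuple ends with its label, the label distribution can be tallied in a single pass. The sketch below is a minimal example under that assumption: label_counts is a hypothetical helper, and sample_pairs is made-up placeholder data standing in for degarbayan.pairs(), not corpus content.

```python
from collections import Counter
from typing import Iterable


def label_counts(pairs: Iterable[tuple[str, str, str]]) -> Counter:
    """Count how many pairs carry each label (the third tuple element)."""
    return Counter(label for _, _, label in pairs)


# Placeholder data in the documented (sentence1, sentence2, judge) shape.
sample_pairs = [
    ("text a", "text b", "Paraphrase"),
    ("text c", "text d", "NotParaphrase"),
    ("text e", "text f", "Paraphrase"),
    ("text g", "text h", "SemiParaphrase"),
]
counts = label_counts(sample_pairs)
print(counts.most_common())
```

With a real reader, label_counts(degarbayan.pairs()) would work the same way, since the helper accepts any iterable of three-element tuples.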