PN Summary Reader

This module includes classes and functions for reading the pn-summary corpus.

The pn-summary corpus was built to help train deep learning models for more accurate Persian text summarization. It contains 93,207 cleaned news articles, selected from approximately 200,000 news items collected from 6 Persian news agencies.

PnSummaryReader

This class includes functions for reading the pn-summary corpus.

Parameters:

Name           Type  Description                                      Default
corpus_folder  str   Path to the folder containing the corpus files.  required
subset         str   The dataset subset; can be test, train, or dev.  'train'

__init__(corpus_folder, subset='train')

Initializes the PnSummaryReader.

Parameters:

Name           Type  Description                                      Default
corpus_folder  str   Path to the folder containing the corpus files.  required
subset         str   The dataset subset; can be test, train, or dev.  'train'

docs()

Yields news articles one by one.

Examples:

>>> pn_summary = PnSummaryReader("pn-summary", "test")
>>> next(pn_summary.docs())
(
    'ff49386698b87be4fc3943bd3cf88987157e1d47',
    'کاهش ۵۸ درصدی مصرف نفت کوره منطقه سبزوار',
    'مدیر شرکت ملی پخش فرآورده‌های نفتی منطقه سبزوار به خبرنگار شانا، گفت...',
    'مصرف نفت کوره منطقه سبزوار در بهار امسال، نسبت به مدت مشابه پارسال، ۵۸ درصد کاهش یافت.',
    'Oil-Energy',
    ['پالایش و پخش'],
    'Shana',
    'https://www.shana.ir/news/243726/%DA%A9%D8%A7%D9%87%D8...'
)

Yields:

Type                                                 Description
tuple[str, str, str, str, str, list[str], str, str]  The next news entry in the format (id, title, article, summary, category_en, [category_fa1, category_fa2, ...], source, link).
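
Because docs() yields plain 8-field tuples, downstream code is easier to read if each field is given a name. The following is a minimal sketch, not part of this module: NewsEntry, as_entries, and count_by_category are hypothetical helpers, and sample_docs is made-up stand-in data shaped like the documented tuple so the snippet runs without the corpus files.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List, NamedTuple, Optional


class NewsEntry(NamedTuple):
    """Hypothetical named wrapper for the 8-field tuple yielded by docs()."""
    id: str
    title: str
    article: str
    summary: str
    category_en: str
    categories_fa: List[str]
    source: str
    link: str


def as_entries(docs: Iterable[tuple]) -> Iterator[NewsEntry]:
    """Lazily wrap each raw tuple in a NewsEntry, preserving order."""
    for doc in docs:
        yield NewsEntry(*doc)


def count_by_category(docs: Iterable[tuple], limit: Optional[int] = None) -> Counter:
    """Count entries per English category, reading at most `limit` entries."""
    entries = islice(as_entries(docs), limit)
    return Counter(entry.category_en for entry in entries)


# Stand-in data (not real corpus entries), in the documented field order.
sample_docs = [
    ("id-1", "title 1", "article 1", "summary 1", "Oil-Energy", ["fa-1"], "Shana", "https://example.com/1"),
    ("id-2", "title 2", "article 2", "summary 2", "Oil-Energy", ["fa-1", "fa-2"], "Shana", "https://example.com/2"),
    ("id-3", "title 3", "article 3", "summary 3", "Sport", ["fa-3"], "OtherAgency", "https://example.com/3"),
]

counts = count_by_category(sample_docs)
print(counts.most_common())
```

With the corpus available, the same helper should accept the generator directly, for example count_by_category(PnSummaryReader("pn-summary", "test").docs(), limit=1000); the limit keeps a quick inspection from reading the whole subset.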