PN Summary Reader

This module includes classes and functions for reading the pn-summary corpus.

The pn-summary corpus was prepared to help deep learning systems build better models for more accurate Persian text summarization. It contains 93,207 cleaned news texts, selected from approximately 200,000 news items collected from 6 Persian news agencies.

`PnSummaryReader`

This class provides methods for reading the pn-summary corpus.

Parameters:

Name	Type	Description	Default
`corpus_folder`	`str`	Path to the folder containing the corpus files.	required
`subset`	`str`	The dataset subset; can be `test`, `train`, or `dev`.	`'train'`

`__init__(corpus_folder, subset='train')`

Initializes the PnSummaryReader.

Parameters:

Name	Type	Description	Default
`corpus_folder`	`str`	Path to the folder containing the corpus files.	required
`subset`	`str`	The dataset subset; can be `test`, `train`, or `dev`.	`'train'`

`docs()`

Yields news articles one by one.

Examples:

>>> pn_summary = PnSummaryReader("pn-summary", "test")
>>> next(pn_summary.docs())
(
    'ff49386698b87be4fc3943bd3cf88987157e1d47',
    'کاهش ۵۸ درصدی مصرف نفت کوره منطقه سبزوار',
    'مدیر شرکت ملی پخش فرآورده‌های نفتی منطقه سبزوار به خبرنگار شانا، گفت...',
    'مصرف نفت کوره منطقه سبزوار در بهار امسال، نسبت به مدت مشابه پارسال، ۵۸ درصد کاهش یافت.',
    'Oil-Energy',
    ['پالایش و پخش'],
    'Shana',
    'https://www.shana.ir/news/243726/%DA%A9%D8%A7%D9%87%D8...'
)

Yields:

Type	Description
`tuple[str, str, str, str, str, list[str], str, str]`	The next news entry in the format `(id, title, article, summary, category_en, [category_fa1, category_fa2, ...], source, link)`.
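
Because each entry is a plain 8-tuple, reading fields by position is easy to get wrong. The sketch below shows one possible way to name the fields and to tally entries per English category. `NewsEntry`, `as_entries`, and `count_by_category` are hypothetical helpers introduced only for illustration; they are not part of the reader, and they assume nothing beyond the tuple layout described above.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List, NamedTuple, Optional


class NewsEntry(NamedTuple):
    """Hypothetical wrapper that names the eight fields yielded by docs()."""

    id: str
    title: str
    article: str
    summary: str
    category_en: str
    categories_fa: List[str]
    source: str
    link: str


def as_entries(docs: Iterable[tuple]) -> Iterator[NewsEntry]:
    """Lazily wrap each 8-tuple so its fields can be read by name."""
    for doc in docs:
        yield NewsEntry(*doc)


def count_by_category(docs: Iterable[tuple], limit: Optional[int] = None) -> Counter:
    """Count entries per English category, over the first `limit` entries if given."""
    return Counter(entry.category_en for entry in islice(as_entries(docs), limit))


# Assumed usage with the reader documented above (not executed here):
# pn_summary = PnSummaryReader("pn-summary", "test")
# first = next(as_entries(pn_summary.docs()))
# print(first.title, first.source)
# print(count_by_category(pn_summary.docs(), limit=1000))
```

Both helpers consume the generator lazily, so passing `limit` avoids reading the whole subset when only a quick look at the data is needed.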
