PN Summary Reader
This module includes classes and functions for reading the pn-summary corpus.
The pn-summary corpus was prepared to help build deep learning models for more accurate Persian text summarization. It includes 93,207 cleaned news texts, drawn from approximately 200,000 news items extracted from 6 Persian news agencies.
PnSummaryReader
This class includes functions for reading the pn-summary corpus.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| corpus_folder | str | Path to the folder containing the corpus files. | required |
| subset | str | The dataset subset, e.g. 'train' or 'test'. | 'train' |
__init__(corpus_folder, subset='train')
Initializes the PnSummaryReader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| corpus_folder | str | Path to the folder containing the corpus files. | required |
| subset | str | The dataset subset, e.g. 'train' or 'test'. | 'train' |
docs()
Yields news articles one by one.
Examples:
>>> pn_summary = PnSummaryReader("pn-summary", "test")
>>> next(pn_summary.docs())
(
'ff49386698b87be4fc3943bd3cf88987157e1d47',
'کاهش ۵۸ درصدی مصرف نفت کوره منطقه سبزوار',
'مدیر شرکت ملی پخش فرآوردههای نفتی منطقه سبزوار به خبرنگار شانا، گفت...',
'مصرف نفت کوره منطقه سبزوار در بهار امسال، نسبت به مدت مشابه پارسال، ۵۸ درصد کاهش یافت.',
'Oil-Energy',
['پالایش و پخش'],
'Shana',
'https://www.shana.ir/news/243726/%DA%A9%D8%A7%D9%87%D8...'
)
Yields:
| Type | Description |
|---|---|
| tuple[str, str, str, str, str, list[str], str, str] | The next news entry, as an 8-tuple laid out like the example above. |
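Because `docs()` yields plain 8-tuples, downstream code may be easier to read if each entry is wrapped in a named record. The following is a minimal sketch, not part of the library: the `NewsEntry` class, its field names (guessed from the order of the example output), and the helpers `iter_entries` and `count_categories` are all hypothetical. The `sample_docs` list is made-up placeholder data standing in for `PnSummaryReader(...).docs()`, so the sketch runs without the corpus.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List, NamedTuple, Optional


class NewsEntry(NamedTuple):
    """Hypothetical field names, inferred from the order of the example output."""
    id: str
    title: str
    article: str
    summary: str
    category: str
    tags: List[str]
    source: str
    link: str


def iter_entries(docs: Iterable[tuple], limit: Optional[int] = None) -> Iterator[NewsEntry]:
    """Lazily wrap raw 8-tuples as NewsEntry records, stopping after `limit` if given."""
    for doc in islice(docs, limit):
        yield NewsEntry(*doc)


def count_categories(entries: Iterable[NewsEntry]) -> Counter:
    """Count how many entries fall into each category."""
    return Counter(entry.category for entry in entries)


# Placeholder data (not real corpus records) with the same 8-field shape.
sample_docs = [
    ("id-1", "title 1", "article 1", "summary 1", "Oil-Energy", ["tag a"], "Shana", "https://example.com/1"),
    ("id-2", "title 2", "article 2", "summary 2", "Oil-Energy", ["tag b"], "Shana", "https://example.com/2"),
    ("id-3", "title 3", "article 3", "summary 3", "Other", [], "Example", "https://example.com/3"),
]

first_two = list(iter_entries(iter(sample_docs), limit=2))
category_counts = count_categories(iter_entries(sample_docs))
```

With the real reader, `sample_docs` would be replaced by the generator returned from `docs()`. Since `iter_entries` is itself a generator, only as many articles are read as are actually consumed.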