Persica Reader
This module includes classes and functions for reading the Persica corpus.
The Persica corpus contains news articles extracted from the ISNA news agency in eleven categories: sports, economics, culture, religion, history, politics, science, social, education, judicial law, and health. This data has been preprocessed and is ready for use in various natural language processing and data mining applications.
PersicaReader
This class provides methods for reading the Persica corpus.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| csv_file | str | Path to the corpus file with a .csv extension. | required |
__init__(csv_file)
Initializes the Persica reader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| csv_file | str | Path to the corpus file with a .csv extension. | required |
docs()
Yields news articles one by one.
Each news article is a dictionary with the following keys:
- id: Unique identifier.
- title: Title of the news.
- text: Main body of the news.
- date: Publication date.
- time: Publication time.
- category: Primary category.
- category2: Secondary category.
Examples:
>>> persica = PersicaReader('persica.csv')
>>> next(persica.docs())['id']
843656
Yields:
| Type | Description |
|---|---|
| dict[str, str] | A dictionary containing the next news article's metadata and content. |
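Because docs() yields plain dictionaries, the articles can be aggregated with standard-library tools. The sketch below counts articles per primary category. The helper `category_counts` and the `sample_docs` list are hypothetical and not part of this module; the sample values are placeholders that only mirror the keys listed above.

```python
from collections import Counter
from itertools import islice


def category_counts(docs, limit=None):
    """Count articles per primary category.

    Hypothetical helper for illustration; not part of the module.
    `docs` is any iterable of article dictionaries with a 'category' key.
    """
    # Optionally stop after `limit` articles so a large corpus is not read in full.
    articles = docs if limit is None else islice(docs, limit)
    return Counter(article["category"] for article in articles)


# Placeholder articles with the documented keys; every value here is made up.
sample_docs = [
    {"id": 1, "title": "Title A", "text": "Body A", "date": "", "time": "",
     "category": "sports", "category2": ""},
    {"id": 2, "title": "Title B", "text": "Body B", "date": "", "time": "",
     "category": "science", "category2": ""},
    {"id": 3, "title": "Title C", "text": "Body C", "date": "", "time": "",
     "category": "sports", "category2": ""},
]

counts = category_counts(sample_docs)
print(counts.most_common())
```

With the real corpus, `PersicaReader('persica.csv').docs()` would presumably be passed in place of `sample_docs`, with `limit` set to keep a first pass quick.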
texts()
Yields only the text content of the news articles.
This method is provided for convenience; the same result can be achieved by calling docs() and reading the text key of each article.
Examples:
>>> persica = PersicaReader('persica.csv')
>>> next(persica.texts()).startswith('وزير علوم در جمع استادان نمونه كشور گفت')
True
Yields:
| Type | Description |
|---|---|
| str | The text content of the next news article. |
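The equivalence between texts() and docs() described above can be sketched as a small generator. This is an illustration under stated assumptions, not the module's actual implementation: `texts_from_docs` and `sample_docs` are hypothetical names, and the sample values are placeholders.

```python
def texts_from_docs(docs):
    """Yield only the 'text' value of each article dictionary.

    Hypothetical stand-in that mirrors what texts() is documented to return.
    """
    for article in docs:
        yield article["text"]


# Placeholder articles; only the 'text' key matters for this sketch.
sample_docs = [
    {"id": 1, "title": "Title A", "text": "Body A"},
    {"id": 2, "title": "Title B", "text": "Body B"},
]

bodies = list(texts_from_docs(sample_docs))
print(bodies)
```

Like docs(), the generator is lazy, so it reads one article at a time rather than loading every text into memory.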