Persica Reader

This module includes classes and functions for reading the Persica corpus.

The Persica corpus contains news articles extracted from the ISNA news agency in eleven categories: sports, economics, culture, religion, history, politics, science, social, education, judicial law, and health. This data has been preprocessed and is ready for use in various natural language processing and data mining applications.

PersicaReader

This class provides methods for reading the Persica corpus.

Parameters:

Name      Type  Description                                     Default
csv_file  str   Path to the corpus file with a .csv extension.  required

__init__(csv_file)

Initializes the Persica reader.

Parameters:

Name      Type  Description                                     Default
csv_file  str   Path to the corpus file with a .csv extension.  required

docs()

Yields news articles one by one.

Each news article is a dictionary with the following keys:

- id: Unique identifier (shown as an integer in the example below).
- title: Title of the news.
- text: Main body of the news.
- date: Publication date.
- time: Publication time.
- category: Primary category.
- category2: Secondary category.

Examples:

>>> persica = PersicaReader('persica.csv')
>>> next(persica.docs())['id']
843656

Yields:

Type            Description
dict[str, str]  A dictionary containing the next news article's metadata and content.
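
Because docs() yields one dictionary per article, the corpus can be summarized in a single pass without loading it into memory. The sketch below counts articles per primary category. It is only an illustration: sample_docs() is a hypothetical stand-in for PersicaReader('persica.csv').docs(), and its records are made-up placeholders that mirror the documented keys, not real corpus data.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def sample_docs() -> Iterator[Dict[str, object]]:
    """Hypothetical stand-in for PersicaReader('persica.csv').docs()."""
    # Placeholder records; only the key names come from the documentation.
    yield {"id": 1, "title": "title 1", "text": "body 1", "date": "date 1",
           "time": "time 1", "category": "sports", "category2": "health"}
    yield {"id": 2, "title": "title 2", "text": "body 2", "date": "date 2",
           "time": "time 2", "category": "science", "category2": "education"}
    yield {"id": 3, "title": "title 3", "text": "body 3", "date": "date 3",
           "time": "time 3", "category": "sports", "category2": "social"}


def count_categories(docs: Iterable[Dict[str, object]]) -> Counter:
    """Count articles per primary category, consuming the iterable once."""
    return Counter(doc["category"] for doc in docs)


counts = count_categories(sample_docs())
print(counts.most_common())  # [('sports', 2), ('science', 1)]
```

With the real reader, the same function would presumably be called as count_categories(persica.docs()), assuming the yielded dictionaries carry the keys listed above.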

texts()

Yields only the text content of the news articles.

This method is provided for convenience; the same result can be obtained by iterating over docs() and reading the text key of each article.

Examples:

>>> persica = PersicaReader('persica.csv')
>>> next(persica.texts()).startswith('وزير علوم در جمع استادان نمونه كشور گفت')
True

Yields:

Type  Description
str   The text content of the next news article.
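
The relationship between texts() and docs() described above can be sketched as a small generator. This is not the reader's actual implementation, only an illustration of the documented equivalence; texts_from_docs is a hypothetical helper, and the sample records are made-up placeholders rather than corpus content.

```python
from typing import Dict, Iterable, Iterator


def texts_from_docs(docs: Iterable[Dict[str, object]]) -> Iterator[object]:
    """Yield only the 'text' value of each article dictionary."""
    for doc in docs:
        yield doc["text"]


# Placeholder records standing in for the output of docs().
sample = [
    {"id": 1, "title": "first title", "text": "first body"},
    {"id": 2, "title": "second title", "text": "second body"},
]

bodies = list(texts_from_docs(sample))
print(bodies)  # ['first body', 'second body']
```

Under this reading, list(persica.texts()) and list(texts_from_docs(persica.docs())) would be expected to produce the same sequence of strings.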