
Hamshahri Reader

This module includes classes and functions for reading the Hamshahri corpus.

The Hamshahri Corpus contains 318,000 news items published in the Hamshahri newspaper between 1996 and 2007 (1375 to 1386 AP). The data was collected by crawling the Hamshahri website and then passed through several stages of preprocessing and labeling. Every news item has a CAT label that gives its thematic classification. The corpus was prepared by the Database Research Group of the University of Tehran with the support of the Iran Telecommunication Research Center (ITRC).

HamshahriReader

This class provides methods for reading the Hamshahri corpus.

Parameters:

Name  Type  Description                                                Default
root  str   Path to the folder containing the Hamshahri corpus files.  required

__init__(root)

Initializes the Hamshahri reader.

Parameters:

Name  Type  Description                                                Default
root  str   Path to the folder containing the Hamshahri corpus files.  required

docs()

Yields news documents from the corpus.

Each news item is a dictionary containing the following keys:

- id: Unique identifier.
- title: News title.
- text: News content.
- issue: Issue number.
- categories_{lang}: Thematic categories (e.g., categories_fa).
- date: News date (Persian).

Examples:

>>> hamshahri = HamshahriReader(root='hamshahri')
>>> next(hamshahri.docs())['id']
'HAM2-750403-001'

Yields:

Type            Description
dict[str, str]  The next news document dictionary.
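
Because docs() yields plain dictionaries, ordinary Python tools are enough to summarize the corpus. The following is a minimal sketch of counting documents per category. The helper count_by_key and the sample_docs records are hypothetical and are not part of the reader; with the real corpus, the iterable would be hamshahri.docs(). It assumes categories_fa holds a single string per document, as the dict[str, str] return type suggests.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable


def count_by_key(docs: Iterable[dict[str, str]], key: str = "categories_fa") -> Counter:
    """Count documents per value of `key`, skipping documents that lack it."""
    counts: Counter = Counter()
    for doc in docs:
        value = doc.get(key)
        if value is not None:
            counts[value] += 1
    return counts


# Hypothetical stand-in records (not real corpus data); with the corpus
# available, pass hamshahri.docs() instead of this list.
sample_docs = [
    {"id": "doc-1", "title": "t1", "text": "first", "categories_fa": "sport"},
    {"id": "doc-2", "title": "t2", "text": "second", "categories_fa": "economy"},
    {"id": "doc-3", "title": "t3", "text": "third", "categories_fa": "sport"},
    {"id": "doc-4", "title": "t4", "text": "fourth"},  # no category key
]

category_counts = count_by_key(sample_docs)
print(category_counts.most_common())
```

The helper consumes the iterable one document at a time, so it works with a generator without loading the whole corpus into memory.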

texts()

Yields only the text content of the news items.

This method is provided for convenience. The same result can be achieved by iterating over docs() and reading the text field of each document.

Yields:

Type  Description
str   The text content of the next news item.
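
The relationship between texts() and docs() can be illustrated with a short generator. This is a sketch of the equivalence described above, not the reader's actual implementation; texts_from_docs and sample_docs are hypothetical names introduced for illustration.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def texts_from_docs(docs: Iterable[dict[str, str]]) -> Iterator[str]:
    """Lazily yield the 'text' field of each document dictionary."""
    for doc in docs:
        yield doc["text"]


# Hypothetical stand-in records (not real corpus data); with the corpus
# available, pass hamshahri.docs() instead of this list.
sample_docs = [
    {"id": "doc-1", "title": "t1", "text": "first"},
    {"id": "doc-2", "title": "t2", "text": "second"},
]

collected = list(texts_from_docs(sample_docs))
print(collected)
```

Like docs(), the generator produces one item at a time, so a caller can stop early without reading the rest of the corpus.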