Hamshahri Reader
This module includes classes and functions for reading the Hamshahri corpus.
The Hamshahri corpus contains 318,000 news items published in the Hamshahri newspaper between 1996 and 2007 (1375 to 1386 AP). The data was collected by crawling the Hamshahri website and then passed through several stages of preprocessing and labeling. Every news item carries a CAT label that specifies its thematic category. The corpus was prepared by the Database Research Group of the University of Tehran with the support of the Iran Telecommunication Research Center (ITRC).
HamshahriReader
This class provides methods for reading the Hamshahri corpus.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `root` | `str` | Path to the folder containing the Hamshahri corpus files. | required |
__init__(root)
Initializes the Hamshahri reader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `root` | `str` | Path to the folder containing the Hamshahri corpus files. | required |
docs()
Yields news documents from the corpus.
Each news item is a dictionary containing the following keys:
- id: Unique identifier.
- title: News title.
- text: News content.
- issue: Issue number.
- categories_{lang}: Thematic categories (e.g., categories_fa).
- date: News date (Persian).
Examples:
>>> hamshahri = HamshahriReader(root='hamshahri')
>>> next(hamshahri.docs())['id']
'HAM2-750403-001'
Yields:
| Type | Description |
|---|---|
| `dict[str, str]` | The next news document dictionary. |
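Because `docs()` yields one dictionary per news item, the corpus can be processed lazily without loading it all into memory. The following is a minimal sketch of one possible use: tallying category labels over the first few items. The helper `count_categories`, its `key` and `limit` parameters, and the sample records are hypothetical and not part of this module. The sketch also assumes a category value may be either a single string or a list of labels, and handles both.

```python
from collections import Counter
from itertools import islice


def count_categories(docs, key="categories_fa", limit=None):
    """Tally category labels over an iterable of news-item dictionaries.

    `docs` is any iterable of dictionaries shaped like the ones described
    above; `limit` caps how many items are read (None reads them all).
    """
    counts = Counter()
    for doc in islice(docs, limit):
        value = doc.get(key)
        if value is None:
            continue  # this item has no category under the requested key
        # Assumption: the value is one label (str) or several (list/tuple).
        labels = [value] if isinstance(value, str) else list(value)
        counts.update(labels)
    return counts


# Hypothetical stand-in records (not real corpus data) using the documented keys.
sample_docs = [
    {"id": "sample-1", "title": "t1", "text": "...", "issue": "1",
     "categories_fa": "politics", "date": "d1"},
    {"id": "sample-2", "title": "t2", "text": "...", "issue": "1",
     "categories_fa": "politics", "date": "d2"},
    {"id": "sample-3", "title": "t3", "text": "...", "issue": "2",
     "categories_fa": "sport", "date": "d3"},
    {"id": "sample-4", "title": "t4", "text": "...", "issue": "2", "date": "d4"},
]

print(count_categories(sample_docs))
```

With the corpus files available, the same helper could presumably be fed the generator directly, for example `count_categories(HamshahriReader(root='hamshahri').docs(), limit=1000)`; passing a `limit` keeps a first exploratory run short.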