TNews Reader
This module includes classes and functions for reading the TNews corpus.
TNewsReader
A class to read and iterate over the TNews corpus files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `root` | `str` | Path to the directory containing the corpus files. | required |
__init__(root)
Initializes the TNewsReader with the root directory and a regex cleaner.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `root` | `str` | Path to the root folder of the corpus. | required |
docs()
Returns news articles as an iterator of dictionaries.
Each news article is represented as a dictionary with the following keys:

- id: Unique identifier.
- title: The main title of the news.
- pre-title: Text appearing before the title.
- post-title: Text appearing after the title.
- text: The full content of the article.
- brief: A short summary of the article.
- url: The source URL.
- category: The news category or topic.
- datetime: The publication date and time.
Examples:
>>> tnews = TNewsReader(root='tnews')
>>> next(tnews.docs())['id']
'14092303482300013653'
Yields:
| Type | Description |
|---|---|
| `dict[str, str]` | A dictionary containing the metadata and content of the next news article. |
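
Because each article is a plain dictionary, the iterator returned by docs() can be passed straight to ordinary Python tools. The following is a minimal sketch, not part of the module: `count_by_category` is a hypothetical helper, and `sample_docs` is made-up stand-in data that carries only a few of the keys listed above so the snippet runs without the corpus on disk. With the real corpus, the same function would presumably accept `TNewsReader(root='tnews').docs()` directly.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_category(docs: Iterable[Dict[str, str]]) -> Counter:
    """Count how many articles fall into each category (hypothetical helper)."""
    return Counter(doc["category"] for doc in docs)


# Made-up stand-in records, not real corpus data. Real records would
# come from TNewsReader(root='tnews').docs() and carry all nine keys.
sample_docs = [
    {"id": "1", "title": "t1", "text": "first body", "category": "sport"},
    {"id": "2", "title": "t2", "text": "second body", "category": "economy"},
    {"id": "3", "title": "t3", "text": "third body", "category": "sport"},
]

category_counts = count_by_category(sample_docs)
print(category_counts.most_common())
```

The helper consumes the iterator lazily, one article at a time, so it does not need to hold the whole corpus in memory.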
texts()
Returns only the text content of the news articles.
This is a convenience method. The same result can be achieved by iterating through docs() and accessing the 'text' key.
Examples:
>>> tnews = TNewsReader(root='tnews')
>>> next(tnews.texts()).startswith('به گزارش " شبکه اطلاع رسانی اینترنتی بوتیا " به نقل از ارگ نیوز')
True
Yields:
| Type | Description |
|---|---|
| `str` | The text content of the next news article. |
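
The equivalence between texts() and docs() described above can be sketched as a small generator. This is an illustration under assumptions, not the module's actual implementation: `texts_from_docs` is a hypothetical name, and `sample_docs` is made-up stand-in data used so the snippet runs without the corpus files.

```python
from typing import Dict, Iterable, Iterator


def texts_from_docs(docs: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield only the 'text' field of each article dictionary (hypothetical helper)."""
    for doc in docs:
        yield doc["text"]


# Made-up stand-in records, not real corpus data. With the corpus available,
# TNewsReader(root='tnews').docs() would take the place of this list.
sample_docs = [
    {"id": "1", "text": "first body"},
    {"id": "2", "text": "second body"},
]

first_text = next(texts_from_docs(sample_docs))
all_texts = list(texts_from_docs(sample_docs))
```

Writing it as a generator keeps the same lazy, one-article-at-a-time behavior that the `Yields` sections describe for both methods.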