TNews Reader

This module includes classes and functions for reading the TNews corpus.

TNewsReader

A class to read and iterate over the TNews corpus files.

Parameters:

Name | Type | Description | Default
root | str | Path to the directory containing the corpus files. | required

__init__(root)

Initializes the TNewsReader with the root directory and a regex cleaner.

Parameters:

Name | Type | Description | Default
root | str | Path to the root folder of the corpus. | required

docs()

Returns news articles as an iterator of dictionaries.

Each news article is represented as a dictionary with the following keys:

- id: Unique identifier.
- title: The main title of the news.
- pre-title: Text appearing before the title.
- post-title: Text appearing after the title.
- text: The full content of the article.
- brief: A short summary of the article.
- url: The source URL.
- category: The news category or topic.
- datetime: The publication date and time.

Examples:

>>> tnews = TNewsReader(root='tnews')
>>> next(tnews.docs())['id']
'14092303482300013653'

Yields:

Type | Description
dict[str, str] | A dictionary containing the metadata and content of the next news article.
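
Because docs() yields plain dictionaries, the keys listed above can be aggregated with standard-library tools. The following is a minimal sketch, not part of the reader itself: count_categories is a hypothetical helper, and sample_docs is made-up stand-in data so the snippet runs without the corpus on disk.

```python
from collections import Counter
from typing import Dict, Iterable


def count_categories(docs: Iterable[Dict[str, str]]) -> Counter:
    """Count articles per category from an iterable of article dicts.

    Hypothetical helper for illustration; it only assumes each record is a
    dict that may carry a 'category' key, as described for docs().
    """
    # .get() with a fallback keeps the tally going if a record lacks the key.
    return Counter(doc.get("category", "unknown") for doc in docs)


# Made-up stand-in records. With the real corpus one would pass
# TNewsReader(root='tnews').docs() instead of this list.
sample_docs = [
    {"id": "1", "title": "A", "category": "sport", "text": "..."},
    {"id": "2", "title": "B", "category": "economy", "text": "..."},
    {"id": "3", "title": "C", "category": "sport", "text": "..."},
]

counts = count_categories(sample_docs)
print(counts.most_common(1))  # [('sport', 2)]
```

The helper consumes the iterator one record at a time, so it should work on the full corpus without loading every article into memory.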

texts()

Returns only the text content of the news articles.

This is a convenience method. The same result can be achieved by iterating through docs() and accessing the 'text' key.

Examples:

>>> tnews = TNewsReader(root='tnews')
>>> next(tnews.texts()).startswith('به گزارش "  شبکه اطلاع رسانی اینترنتی بوتیا  " به نقل از ارگ نیوز')
True

Yields:

Type | Description
str | The text content of the next news article.
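
The equivalence between texts() and iterating through docs() can be sketched as follows. This is an illustration under stated assumptions rather than the library's actual implementation: texts_from_docs is a hypothetical function, and sample_docs is made-up data standing in for what docs() would yield.

```python
from typing import Dict, Iterable, Iterator


def texts_from_docs(docs: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield only the 'text' value of each article dict.

    Hypothetical stand-in showing what a texts()-style convenience method
    amounts to; the real method may differ in detail.
    """
    for doc in docs:
        yield doc["text"]


# Made-up records. With the real corpus one would pass
# TNewsReader(root='tnews').docs() instead of this list.
sample_docs = [
    {"id": "1", "title": "A", "text": "first article body"},
    {"id": "2", "title": "B", "text": "second article body"},
]

first_text = next(texts_from_docs(sample_docs))
print(first_text)  # first article body
```

Iterating docs() directly is the better choice when the text is needed together with other fields such as id or category; the text-only form is enough when feeding raw article bodies into a tokenizer or language model pipeline.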