MirasText Reader

This module includes classes and functions for reading the MirasText corpus.

MirasText contains 2,835,414 news items from 250 Persian news agencies.

MirasTextReader

This class provides methods for reading the MirasText corpus.

Parameters:

Name      Type  Description               Default
filename  str   Path to the corpus file.  required

__init__(filename)

Initializes the MirasText reader.

Parameters:

Name      Type  Description               Default
filename  str   Path to the corpus file.  required

docs()

Yields news documents.

Yields:

Type            Description
dict[str, str]  The next news document.
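
Because docs() is a generator and the corpus holds millions of items, it is usually better to consume it lazily than to build a full list. The following is a minimal sketch of that pattern, not the library's own code: sample_docs() is a hypothetical stand-in for MirasTextReader(...).docs(), and the only dictionary key it assumes is text, which is the one key this page mentions.

```python
from itertools import islice


def sample_docs():
    """Hypothetical stand-in for MirasTextReader(...).docs().

    Yields one dict per news item; only the 'text' key is assumed here.
    """
    for i in range(1_000_000):
        yield {"text": f"news item {i}"}


def head(docs, n):
    """Return the first n documents without exhausting the generator."""
    return list(islice(docs, n))


# Only three items are produced; the rest of the stream is never read.
first_three = head(sample_docs(), 3)
print([doc["text"] for doc in first_three])
```

With a real reader, the same head() helper should work by passing it the generator returned by docs().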

texts()

Yields only the text of the news articles.

This method is provided for convenience; the same result can be achieved by using docs() and accessing the text key.

Examples:

>>> mirastext = MirasTextReader(filename='mirastext.txt')
>>> next(mirastext.texts())[:42]  # first 42 characters of first text
'ایرانی‌ها چقدر از اینترنت استفاده می‌کنند؟'

Yields:

Type  Description
str   The text of the next news article.
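
The note that texts() is equivalent to reading the text key from docs() can be sketched as below. This is an illustrative helper under stated assumptions, not the library's implementation: texts_from_docs() is a hypothetical name, and the sample list stands in for the stream that docs() would yield.

```python
def texts_from_docs(docs):
    """Yield only the 'text' value of each document dict.

    Hypothetical helper that mirrors what texts() is described as doing.
    """
    for doc in docs:
        yield doc["text"]


# Stand-in data; a real call would pass MirasTextReader(...).docs() instead.
sample = [{"text": "first article"}, {"text": "second article"}]
collected = list(texts_from_docs(sample))
print(collected)
```

Writing the helper as a generator keeps it as lazy as docs() itself, so no more of the corpus is read than the caller actually consumes.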