MirasText Reader
This module includes classes and functions for reading the MirasText corpus.
MirasText contains 2,835,414 news items from 250 Persian news agencies.
MirasTextReader
This class includes functions for reading the MirasText corpus.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `filename` | `str` | Path to the corpus file. | required |
__init__(filename)
Initializes the MirasText reader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `filename` | `str` | Path to the corpus file. | required |
docs()
Yields news documents.
Yields:
| Type | Description |
|---|---|
| `dict[str, str]` | The next news document. |
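
Because docs() yields documents one at a time and the corpus holds millions of items, it can be handy to preview only the first few. The following is a minimal sketch of that pattern, not part of the library: `first_docs` and `sample_docs` are hypothetical names, and `sample_docs` is a stand-in for the iterator returned by docs() so that the snippet runs without the corpus file. The only key assumed on each document is `text`.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def first_docs(docs: Iterable[Dict[str, str]], n: int) -> List[Dict[str, str]]:
    """Return at most the first n documents without reading the rest."""
    return list(islice(docs, n))


def sample_docs() -> Iterator[Dict[str, str]]:
    # Hypothetical stand-in for MirasTextReader(...).docs(); real documents
    # may carry other keys in addition to "text".
    yield {"text": "first article"}
    yield {"text": "second article"}
    yield {"text": "third article"}


# Only two documents are pulled from the generator here.
preview = first_docs(sample_docs(), 2)
print([doc["text"] for doc in preview])
```

With a real reader, the same helper would presumably be called as `first_docs(reader.docs(), 2)`.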
texts()
Yields only the text of the news articles.
This method is provided for convenience; the same result can be achieved by calling `docs()` and accessing the `text` key of each document.
Examples:
>>> mirastext = MirasTextReader(filename='mirastext.txt')
>>> next(mirastext.texts())[:42] # first 42 characters of first text
'ایرانیها چقدر از اینترنت استفاده میکنند؟'
Yields:
| Type | Description |
|---|---|
| `str` | The text of the next news article. |
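
The relationship between texts() and docs() described above can be sketched as a small generator. This is an illustration under stated assumptions rather than the library's actual implementation: `texts_from_docs` and `stand_in_docs` are hypothetical names, and the stand-in list replaces the iterator returned by docs() so the snippet runs on its own.

```python
from typing import Dict, Iterable, Iterator


def texts_from_docs(docs: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield the "text" value of each document, one at a time."""
    for doc in docs:
        yield doc["text"]


# Hypothetical stand-in for the documents yielded by docs().
stand_in_docs = [{"text": "first article"}, {"text": "second article"}]

# Taking only the first text leaves the remaining documents unread.
first_text = next(texts_from_docs(stand_in_docs))
print(first_text)
```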