Wikipedia Reader
This module includes classes and functions for reading the Wikipedia corpus.
The Wikipedia corpus is a large corpus containing every Persian Wikipedia article, and it is updated every two months. For more information, see the corpus's main page.
WikipediaReader
This class provides methods for reading the Wikipedia corpus.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fawiki_dump` | `str` | Path to the corpus dump file. | required |
| `n_jobs` | `int` | Number of CPU cores for parallel processing. | `2` |
__init__(fawiki_dump, n_jobs=2)
Initializes the Wikipedia reader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fawiki_dump` | `str` | Path to the corpus dump file. | required |
| `n_jobs` | `int` | Number of CPU cores for parallel processing. | `2` |
docs()
Yields articles from the corpus.
Each article is a dictionary with the following keys:
- id: The article identifier.
- title: The title of the article.
- text: The content of the article.
- date: The web version date.
- url: The page URL.
Examples:
>>> wikipedia = WikipediaReader('fawiki-latest-pages-articles.xml.bz2')
>>> next(wikipedia.docs())['id']
Yields:
| Type | Description |
|---|---|
| `dict[str, str]` | A dictionary containing the next article's data. |
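Because `docs()` yields articles one at a time, a consumer can stop early instead of reading the whole dump. The following is a minimal sketch of that pattern, not part of the library: `first_titles` and `fake_docs` are hypothetical names, and `fake_docs` is a stand-in generator that only mimics the documented dictionary keys so the example runs without a dump file.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def first_titles(docs: Iterable[Dict[str, str]], n: int) -> List[str]:
    """Collect the titles of the first n articles from any article iterator."""
    # islice stops pulling from the iterator after n items,
    # so the rest of the corpus is never read.
    return [doc["title"] for doc in islice(docs, n)]


def fake_docs() -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for WikipediaReader.docs(), using the documented keys."""
    for i in range(1, 1000):
        yield {
            "id": str(i),
            "title": f"Article {i}",
            "text": f"Body of article {i}",
            "date": "",
            "url": "",
        }


titles = first_titles(fake_docs(), 3)
print(titles)
```

With a real reader, the same helper would presumably be called as `first_titles(wikipedia.docs(), 3)`.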
texts()
Yields only the text of the articles.
This method is provided for convenience. It is equivalent to calling the docs() method and reading the text key of each article.
Examples:
>>> wikipedia = WikipediaReader('fawiki-latest-pages-articles.xml.bz2')
>>> next(wikipedia.texts())[:30]
Yields:
| Type | Description |
|---|---|
| `str` | The text of the next article. |
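The equivalence between `texts()` and `docs()` described above can be sketched as a small generator. This is an illustration under stated assumptions rather than the library's actual implementation: `texts_from_docs` and `sample_docs` are hypothetical names, and the sample articles are invented so the example runs without a dump file.

```python
from typing import Dict, Iterable, Iterator


def texts_from_docs(docs: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield only the 'text' value of each article dictionary."""
    for doc in docs:
        yield doc["text"]


# Hypothetical sample data shaped like the dictionaries docs() is documented to yield.
sample_docs = [
    {"id": "1", "title": "First", "text": "First article body.", "date": "", "url": ""},
    {"id": "2", "title": "Second", "text": "Second article body.", "date": "", "url": ""},
]

# Take the first text and slice it, mirroring the doctest above.
preview = next(texts_from_docs(sample_docs))[:13]
print(preview)
```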