Wikipedia Reader

This module includes classes and functions for reading the Wikipedia corpus.

The Wikipedia corpus is a massive corpus containing all Persian Wikipedia articles, updated every two months. For more information about this corpus, you can visit its main page.

WikipediaReader

This class provides methods for reading the Wikipedia corpus.

Parameters:

Name | Type | Description | Default
fawiki_dump | str | Path to the corpus dump file. | required
n_jobs | int | Number of CPU cores for parallel processing. | 2

__init__(fawiki_dump, n_jobs=2)

Initializes the Wikipedia reader.

Parameters:

Name | Type | Description | Default
fawiki_dump | str | Path to the corpus dump file. | required
n_jobs | int | Number of CPU cores for parallel processing. | 2

docs()

Yields articles from the corpus.

Each article is a dictionary with the following keys:

- id: The article identifier.
- title: The title of the article.
- text: The content of the article.
- date: The web version date.
- url: The page URL.

Examples:

>>> wikipedia = WikipediaReader('fawiki-latest-pages-articles.xml.bz2')
>>> next(wikipedia.docs())['id']

Yields:

Type | Description
dict[str, str] | A dictionary containing the next article's data.
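
Because docs() yields articles one at a time, a caller can stop early without reading the whole dump. The sketch below illustrates that pattern with itertools.islice. The helper head_titles and the generator fake_docs are hypothetical names introduced only for this illustration; fake_docs stands in for wikipedia.docs() and its values are made up, so the snippet runs without the corpus file.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def head_titles(docs: Iterable[Dict[str, str]], n: int) -> List[str]:
    """Return the titles of the first n articles, consuming no more than n."""
    return [doc["title"] for doc in islice(docs, n)]


def fake_docs() -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for wikipedia.docs(); all values are invented."""
    for i in range(1, 6):
        yield {
            "id": str(i),
            "title": f"Article {i}",
            "text": f"Body of article {i}.",
            "date": "2020-01-01",
            "url": f"https://example.org/wiki/{i}",
        }


# With the real reader this would be head_titles(wikipedia.docs(), 3).
titles = head_titles(fake_docs(), 3)
print(titles)
```

Passing the generator directly, rather than building a list first, keeps memory use flat regardless of how many articles the dump contains.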

texts()

Yields only the text of the articles.

This method is provided for convenience. It is equivalent to iterating over docs() and taking the text value of each article.

Examples:

>>> wikipedia = WikipediaReader('fawiki-latest-pages-articles.xml.bz2')
>>> next(wikipedia.texts())[:30]

Yields:

Type | Description
str | The text of the next article.
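
The equivalence between texts() and docs() described above can be sketched as a small generator. This is an illustration of the stated behaviour, not the library's actual implementation; texts_from_docs is a hypothetical helper, and the sample list stands in for the output of wikipedia.docs() with invented values.

```python
from typing import Dict, Iterable, Iterator


def texts_from_docs(docs: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield only the 'text' value of each article dictionary."""
    for doc in docs:
        yield doc["text"]


# Hypothetical sample data shaped like the dictionaries docs() yields.
sample = [
    {"id": "1", "title": "First", "text": "First body", "date": "", "url": ""},
    {"id": "2", "title": "Second", "text": "Second body", "date": "", "url": ""},
]

first_text = next(texts_from_docs(sample))
print(first_text[:30])
```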