Naab Reader

This module includes classes and functions for reading the Naab corpus.

The Naab corpus consists of 130 GB of cleaned Persian text, comprising 250 million paragraphs and 15 billion words.

NaabReader

This class provides methods for reading the Naab corpus.

Parameters:

Name           Type  Description                                       Default
corpus_folder  str   Path to the folder containing the corpus files.  required
subset         str   The dataset subset: test or train.                'train'

__init__(corpus_folder, subset='train')

Initializes the Naab reader.

Parameters:

Name           Type  Description                                       Default
corpus_folder  str   Path to the folder containing the corpus files.  required
subset         str   The dataset subset: test or train.                'train'

sents()

Yields sentences from the corpus one by one.

Examples:

>>> naab = NaabReader("naab", "test")
>>> next(naab.sents())
این وبلاگ زیر نظر وب‌های زیر به کار خود ادامه می‌دهد

Yields:

Type  Description
str   The next sentence in the corpus.
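
Because sents() yields sentences one at a time, a caller can read a small sample without loading the whole corpus. The sketch below shows one way to do that with itertools.islice. The helper take and the generator fake_sents are hypothetical names introduced only for illustration; fake_sents stands in for NaabReader(...).sents() so that the example runs without the corpus files.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def take(sentences: Iterable[str], n: int) -> List[str]:
    """Return at most the first n sentences, leaving the rest unread."""
    return list(islice(sentences, n))


def fake_sents() -> Iterator[str]:
    # Hypothetical stand-in for NaabReader(...).sents(); it is not part of
    # the library and only mimics a generator of sentence strings.
    yield "sentence one"
    yield "sentence two"
    yield "sentence three"


# Only two items are pulled from the generator; the third is never produced.
first_two = take(fake_sents(), 2)
print(first_two)
```

With the real reader, the same helper would presumably be called as take(naab.sents(), 2), assuming sents() returns an iterator as described above.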