Naab Reader
This module includes classes and functions for reading the Naab corpus.
The Naab corpus consists of 130 GB of cleaned Persian text comprising 250 million paragraphs and 15 billion words.
NaabReader
This class includes functions for reading the Naab corpus.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| corpus_folder | str | Path to the folder containing the corpus files. | required |
| subset | str | The dataset subset: | 'train' |
__init__(corpus_folder, subset='train')
Initializes the Naab reader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| corpus_folder | str | Path to the folder containing the corpus files. | required |
| subset | str | The dataset subset: | 'train' |
sents()
Yields sentences from the corpus one by one.
Examples:
>>> naab = NaabReader("naab", "test")
>>> next(naab.sents())
این وبلاگ زیر نظر وبهای زیر به کار خود ادامه میدهد
Yields:
| Type | Description |
|---|---|
| str | The next sentence in the corpus. |
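Because sents() yields sentences lazily, a corpus of this size is usually sampled rather than read into memory at once. The following is a minimal sketch of that pattern, not part of the library: take_sentences and StubReader are hypothetical names introduced here for illustration, and StubReader only mimics the sents() method so the snippet runs without the corpus files.

```python
from itertools import islice


def take_sentences(reader, n):
    """Return at most the first n sentences from any object exposing sents().

    Hypothetical helper: islice stops pulling from the generator after n
    items, so the rest of the corpus is never read.
    """
    return list(islice(reader.sents(), n))


class StubReader:
    """Hypothetical stand-in for a corpus reader; it only mimics sents()."""

    def __init__(self, sentences):
        self._sentences = list(sentences)

    def sents(self):
        # Yield sentences one by one, as the reader documented above does.
        yield from self._sentences


stub = StubReader(["first sentence", "second sentence", "third sentence"])
first_two = take_sentences(stub, 2)
print(first_two)
```

With the real corpus available, the same helper could presumably be given a reader built as in the example above, NaabReader("naab", "test"), in place of the stub.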