Persian Plain Text Reader

This module reads raw text corpora.

PersianPlainTextReader

Bases: PlaintextCorpusReader

A reader for Persian raw text corpora.

This class extends NLTK's PlaintextCorpusReader to provide default tokenization suitable for the Persian language.

Attributes:

CorpusView: The class used to create a stream-backed view of the corpus.
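To make the idea of a stream-backed view concrete, here is a minimal sketch using only the standard library. `LazyTokenView` is a hypothetical, heavily simplified stand-in and not the class that NLTK or this reader actually uses: it only illustrates that such a view re-reads tokens from the file on demand instead of holding the whole corpus in memory.

```python
import os
import tempfile


class LazyTokenView:
    """Hypothetical, simplified stand-in for a stream-backed corpus view."""

    def __init__(self, path, encoding="utf8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Re-open the file on every iteration and yield tokens line by line,
        # so the full corpus is never loaded into memory at once.
        with open(self.path, encoding=self.encoding) as stream:
            for line in stream:
                yield from line.split()


# Write a tiny two-line demo corpus to a temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", encoding="utf8", delete=False
) as handle:
    handle.write("سلام دنیا\nاین یک جمله است\n")
    demo_path = handle.name

view = LazyTokenView(demo_path)
first_pass = list(view)   # reads the file
second_pass = list(view)  # reads the file again; nothing was cached
os.remove(demo_path)

print(len(first_pass))
```

A real corpus view also supports indexing and slicing; this sketch covers iteration only.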

__init__(root, fileids, word_tokenizer=WordTokenizer.tokenize, sent_tokenizer=SentenceTokenizer.tokenize, para_block_reader=read_blankline_block, encoding='utf8')

Initializes the Persian text corpus reader.

Parameters:

root (str, required): The root directory of the corpus.

fileids (list, required): A list of file identifiers, or a regular expression that matches the files.

word_tokenizer (Callable, default: WordTokenizer.tokenize): A function used to tokenize words.

sent_tokenizer (Callable, default: SentenceTokenizer.tokenize): A function used to tokenize sentences.

para_block_reader (Callable, default: read_blankline_block): A function used to read paragraph blocks.

encoding (str, default: 'utf8'): The character encoding of the corpus files.
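The three callables above work together: the paragraph block reader splits the stream into blocks, the sentence tokenizer splits each block into sentences, and the word tokenizer splits each sentence into words. The sketch below shows that composition with the standard library only. Every function in it (`iter_paragraph_blocks`, `simple_sent_tokenize`, `simple_word_tokenize`, `read_paras`) is a hypothetical stand-in written for illustration; none of them is the hazm or NLTK implementation, and the real tokenizers handle Persian text far more carefully.

```python
import io
import re


def iter_paragraph_blocks(stream):
    """Stand-in block reader: yield runs of non-blank lines as paragraphs."""
    lines = []
    for line in stream:
        if line.strip():
            lines.append(line.strip())
        elif lines:
            yield " ".join(lines)
            lines = []
    if lines:
        yield " ".join(lines)


def simple_sent_tokenize(text):
    """Stand-in sentence tokenizer: split after '.', '!', '?' or '؟'."""
    return [s for s in re.split(r"(?<=[.!?؟])\s+", text.strip()) if s]


def simple_word_tokenize(sentence):
    """Stand-in word tokenizer: runs of word characters or one punctuation mark."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def read_paras(
    stream,
    word_tokenizer=simple_word_tokenize,
    sent_tokenizer=simple_sent_tokenize,
    para_block_reader=iter_paragraph_blocks,
):
    """Return paragraphs as lists of sentences, each a list of word strings."""
    return [
        [word_tokenizer(sent) for sent in sent_tokenizer(block)]
        for block in para_block_reader(stream)
    ]


sample = io.StringIO("سلام دنیا. این یک جمله است.\n\nاین پاراگراف دوم است.\n")
paras = read_paras(sample)

print(len(paras))     # number of paragraphs
print(len(paras[0]))  # number of sentences in the first paragraph
print(paras[0][0])    # words of the first sentence
```

Because each step is passed in as a plain callable, any one of them can be replaced without touching the others, which is the same design the constructor's `word_tokenizer`, `sent_tokenizer`, and `para_block_reader` parameters expose.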