Persian Plain Text Reader
This module reads raw text corpora.
PersianPlainTextReader
Bases: PlaintextCorpusReader
A reader for Persian raw text corpora.
This class extends NLTK's PlaintextCorpusReader to provide default tokenization suitable for the Persian language.
Attributes:
| Name | Type | Description |
|---|---|---|
| CorpusView | | The class used to create a stream-backed view of the corpus. |
__init__(root, fileids, word_tokenizer=WordTokenizer.tokenize, sent_tokenizer=SentenceTokenizer.tokenize, para_block_reader=read_blankline_block, encoding='utf8')
Initializes the Persian text corpus reader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | str | The root directory of the corpus. | required |
| fileids | list | A list of file identifiers or a glob pattern for the files. | required |
| word_tokenizer | Callable | A function used to tokenize words. Defaults to WordTokenizer.tokenize. | WordTokenizer.tokenize |
| sent_tokenizer | Callable | A function used to tokenize sentences. Defaults to SentenceTokenizer.tokenize. | SentenceTokenizer.tokenize |
| para_block_reader | Callable | A function used to read paragraph blocks. Defaults to read_blankline_block. | read_blankline_block |
| encoding | str | The character encoding of the corpus files. Defaults to "utf8". | 'utf8' |
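
To make the para_block_reader parameter more concrete, the sketch below mimics in plain Python what a blank-line block reader does: it reads lines from a stream until a blank line or end of file and returns the accumulated text as one paragraph block. This is only an illustration. The names read_blankline_block_sketch and iter_paragraphs are hypothetical and are not part of this module or of NLTK, and the real read_blankline_block may differ in detail.

```python
import io


def read_blankline_block_sketch(stream):
    """Read one blank-line-delimited block from a text stream.

    Hypothetical helper for illustration only. Returns a one-element
    list holding the block, or an empty list at end of file.
    """
    block = ""
    while True:
        line = stream.readline()
        if not line:
            # End of file: return whatever was collected, if anything.
            return [block] if block else []
        if not line.strip():
            # A blank line closes the block, but only if it is non-empty;
            # leading blank lines are skipped.
            if block:
                return [block]
        else:
            block += line


def iter_paragraphs(stream):
    """Yield paragraph blocks until the stream is exhausted."""
    while True:
        blocks = read_blankline_block_sketch(stream)
        if not blocks:
            return
        yield blocks[0]


# Two paragraphs separated by two blank lines.
sample = io.StringIO("سلام دنیا.\nاین خط دوم است.\n\n\nپاراگراف دوم.\n")
paragraphs = list(iter_paragraphs(sample))
print(len(paragraphs))
```

A callable with this shape (takes an open stream, returns a list of blocks) is the kind of object the para_block_reader parameter is described as accepting. Whether a custom reader can be swapped in without further changes is an assumption to verify against the installed version.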