Arman Reader
This module includes classes and functions for reading the Arman corpus.
The Arman corpus is a Named Entity Recognition (NER) corpus containing 250,015 tagged tokens in 7,682 sentences, stored in IOB format.
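To illustrate what IOB-formatted data looks like once parsed, here is a minimal sketch. It assumes a common layout of one token and its tag per line, separated by whitespace, with a blank line between sentences; the actual layout of the Arman files may differ. The helper `parse_iob_lines` is hypothetical and not part of this module, and the tag names in the sample are illustrative only.

```python
from typing import Iterable, Iterator


def parse_iob_lines(lines: Iterable[str]) -> Iterator[list[tuple[str, str]]]:
    """Group 'token tag' lines into sentences (hypothetical helper).

    Assumption: a blank line marks the end of a sentence.
    """
    sentence: list[tuple[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line: emit the sentence collected so far, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Split on the last run of whitespace: token on the left, tag on the right.
        token, tag = line.rsplit(maxsplit=1)
        sentence.append((token, tag))
    # Emit the final sentence when the input does not end with a blank line.
    if sentence:
        yield sentence


# Illustrative sample, not taken from the corpus.
sample = [
    "Ali B-pers",
    "lives O",
    "in O",
    "Tehran B-loc",
    "",
    "Hello O",
]
parsed = list(parse_iob_lines(sample))
```

In IOB tagging, a `B-` prefix marks the first token of an entity, `I-` marks a continuation token, and `O` marks a token outside any entity.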
ArmanReader
This class includes methods for reading the Arman corpus.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| corpus_folder | str | Path to the folder containing the corpus files. | required |
| subset | str | The dataset subset: 'test' or 'train'. | 'train' |
__init__(corpus_folder, subset='train')
Initializes the ArmanReader with the corpus folder and subset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| corpus_folder | str | Path to the folder containing the corpus files. | required |
| subset | str | The dataset subset: 'test' or 'train'. Defaults to 'train'. | 'train' |
sents()
Yields sentences one at a time, each as a list of (token, tag) tuples.
Examples:
>>> arman = ArmanReader("arman")
>>> next(arman.sents())
[('همین', 'O'), ('فکر', 'O'), ('،', 'O'), ('این', 'O'), ('احساس', 'O'), ('را', 'O'), ('به', 'O'), ('من', 'O'), ('میداد', 'O'), ('که', 'O'), ('آزاد', 'O'), ('هستم', 'O'), ('.', 'O')]
Yields:
| Type | Description |
|---|---|
| list[tuple[str, str]] | The next sentence as a list of (token, tag) tuples. |
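Because each sentence arrives as a list of (token, tag) tuples, entity spans can be recovered by walking the IOB tags. The following is a rough sketch of one way to do that; `extract_entities` is a hypothetical helper, not part of this module, and the sample sentence and its tag names are made up for illustration.

```python
def extract_entities(sentence):
    """Collect (entity_text, entity_type) spans from IOB-tagged tokens."""
    entities = []
    current_tokens = []
    current_type = None
    for token, tag in sentence:
        # "B-loc" -> ("B", "loc"); "O" -> ("O", "")
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A new entity starts here (a stray I- tag is treated as a start).
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label
        elif prefix == "I":
            # Continuation of the entity currently being built.
            current_tokens.append(token)
        else:
            # An "O" tag closes any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Illustrative sentence, not taken from the corpus.
tagged = [("Ali", "B-pers"), ("Rezaei", "I-pers"), ("visited", "O"), ("Tehran", "B-loc")]
found = extract_entities(tagged)
```

With a reader instance, the same function could be applied to each sentence produced by `sents()`. A sentence tagged entirely with 'O', like the one in the example above, yields an empty list.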