Arman Reader

This module includes classes and functions for reading the Arman corpus.

The Arman corpus is a Named Entity Recognition (NER) corpus containing 250,015 tagged tokens in 7,682 sentences, stored in IOB format.
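For readers unfamiliar with the IOB layout, the following is a minimal sketch of how such a file can be grouped into sentences. It assumes each non-empty line holds a token and its tag separated by whitespace, and that a blank line ends a sentence. The helper name `parse_iob_lines`, the sample lines, and the `B-loc` tag are illustrative assumptions, not part of this module or a description of how the reader is implemented.

```python
from typing import Iterable, Iterator


def parse_iob_lines(lines: Iterable[str]) -> Iterator[list[tuple[str, str]]]:
    """Group 'token tag' lines into sentences (hypothetical helper)."""
    sentence: list[tuple[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Split on the last run of whitespace so the tag is the final field.
        token, tag = line.rsplit(maxsplit=1)
        sentence.append((token, tag))
    # Emit a trailing sentence that is not followed by a blank line.
    if sentence:
        yield sentence


# Made-up sample input; the tag name "B-loc" is an assumption.
sample_lines = ["آزاد O", "هستم O", ". O", "", "تهران B-loc"]
sentences = list(parse_iob_lines(sample_lines))
```

Here `sentences` holds two lists of (token, tag) tuples, the same shape that `sents()` is documented to yield.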

ArmanReader

This class includes methods for reading the Arman corpus.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| corpus_folder | str | Path to the folder containing the corpus files. | required |
| subset | str | The dataset subset: 'test' or 'train'. | 'train' |

__init__(corpus_folder, subset='train')

Initializes the ArmanReader with the corpus folder and subset.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| corpus_folder | str | Path to the folder containing the corpus files. | required |
| subset | str | The dataset subset: 'test' or 'train'. Defaults to 'train'. | 'train' |

sents()

Yields the sentences of the corpus one at a time, each as a list of (token, tag) tuples.

Examples:

>>> arman = ArmanReader("arman")
>>> next(arman.sents())
[('همین', 'O'), ('فکر', 'O'), ('،', 'O'), ('این', 'O'), ('احساس', 'O'), ('را', 'O'), ('به', 'O'), ('من', 'O'), ('می‌داد', 'O'), ('که', 'O'), ('آزاد', 'O'), ('هستم', 'O'), ('.', 'O')]

Yields:

| Type | Description |
|---|---|
| list[tuple[str, str]] | The next sentence as a list of (token, tag) tuples. |
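Since each sentence is a list of (token, tag) tuples in IOB format, a common next step is to merge the tagged tokens into entity spans. The following is a minimal sketch under stated assumptions: the helper `iob_to_entities` is hypothetical and not part of this module, and the sample sentence and its tag names (`B-org`, `I-org`, `B-loc`) are made up for illustration rather than taken from the corpus.

```python
def iob_to_entities(sentence):
    """Collect (entity_text, label) pairs from one IOB-tagged sentence.

    Hypothetical helper; expects a list of (token, tag) tuples.
    """
    entities = []
    current_tokens = []
    current_label = None

    def flush():
        # Store the entity being built, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, tag in sentence:
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the open entity of the same label.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open entity: close the span.
            flush()
            current_tokens = []
            current_label = None
    flush()
    return entities


# Made-up sentence in the shape yielded by sents(); tag names are assumptions.
sample_sentence = [
    ("دانشگاه", "B-org"),
    ("تهران", "I-org"),
    ("در", "O"),
    ("ایران", "B-loc"),
    (".", "O"),
]
entities = iob_to_entities(sample_sentence)
```

In this sketch an I- tag that does not continue an open entity of the same label is treated like an outside token; stricter or more lenient handling may suit other uses.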