NER Reader

This module includes classes and functions for reading the Named Entity Recognition (NER) corpus.

The Named Entity Recognition corpus contains 25 million tagged tokens from Persian Wikipedia, organized into about one million sentences.

NerReader

This class provides methods for reading the Named Entity Recognition (NER) corpus.

Parameters:

Name           Type  Description                                      Default
corpus_folder  str   Path to the folder containing the corpus files.  required

__init__(corpus_folder)

Initializes the NER reader.

Parameters:

Name           Type  Description                                      Default
corpus_folder  str   Path to the folder containing the corpus files.  required

sents()

Yields sentences one at a time, each as a list of (token, tag) tuples.

Examples:

>>> ner = NerReader("ner")
>>> next(ner.sents())
[('ویکی‌پدیای', 'O'), ('انگلیسی', 'O'), ('در', 'B-DAT'), ('تاریخ', 'I-DAT'), ('۱۵', 'I-DAT'), ('ژانویه', 'I-DAT'), ('۲۰۰۱', 'I-DAT'), ('(', 'O'), ('میلادی', 'B-DAT'), (')', 'O'), ('۲۶', 'B-DAT'), ('دی', 'I-DAT'), ('۱۳۷۹', 'I-DAT'), (')', 'O'), ('به', 'O'), ('صورت', 'O'), ('مکملی', 'O'), ('برای', 'O'), ('دانشنامه', 'O'), ('تخصصی', 'O'), ('نوپدیا', 'O'), ('نوشته', 'O'), ('شد', 'O'), ('.', 'O')]

Yields:

Type                   Description
list[tuple[str, str]]  The next sentence in the form of a list of (token, tag) tuples.
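
The tags in the example output (B-DAT, I-DAT, O) look like a BIO scheme, so a common next step is to group consecutive tokens into entity spans. The following is a minimal sketch under that assumption. The helper `extract_entities` is hypothetical and not part of this module; it only operates on the list of (token, tag) tuples that sents() yields.

```python
# Hypothetical helper, not part of NerReader: groups BIO-tagged tokens
# into (entity_text, entity_type) pairs. Assumes tags look like
# "B-XXX", "I-XXX", or "O", as in the example output of sents().

Sentence = list[tuple[str, str]]


def extract_entities(sentence: Sentence) -> list[tuple[str, str]]:
    entities: list[tuple[str, str]] = []
    current_tokens: list[str] = []
    current_type = None

    for token, tag in sentence:
        if tag.startswith("B-"):
            # A new entity starts; close the previous one, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open entity:
            # close the open entity and skip this token.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = []
            current_type = None

    # Close an entity that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# A fragment taken from the example sentence above.
sample = [("۲۶", "B-DAT"), ("دی", "I-DAT"), ("۱۳۷۹", "I-DAT"), (")", "O")]
entities = extract_entities(sample)
```

With a reader instance, the same helper could be applied as `extract_entities(next(ner.sents()))`. Because sents() is a generator, wrapping it in `itertools.islice` is one way to inspect only the first few sentences without reading the whole corpus.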