POS Tagger

The accuracy of the POS tagger in the current version is 98.8%.

This module contains classes and functions for POS tagging.

POSTagger

Bases: SequenceTagger, TaggerProtocol

Class for POS tagging.

Examples:

>>> # Load from Hugging Face Hub
>>> tagger = POSTagger(repo_id="roshan-research/hazm-postagger", model_filename="pos_tagger.model")
>>> # Or load from a local model file
>>> # tagger = POSTagger(model="pos_tagger.model")

__init__(model=None, data_maker=None, universal_tag=False, repo_id=None, model_filename=None)

Initializes the tagger from a local model file or from a Hugging Face Hub repository.

Examples:

>>> # Loading from Hugging Face Hub
>>> tagger = POSTagger(repo_id="roshan-research/hazm-postagger", model_filename="pos_tagger.model")
>>> # Loading from a local model file
>>> # tagger = POSTagger(model="resources/pos_tagger.model")

Parameters:

Name            Type                Description                                                           Default
model           str | Path | None   Path to the local model file.                                         None
data_maker      Any                 Custom data maker function.                                           None
universal_tag   bool                Whether to use universal POS tags.                                    False
repo_id         str | None          Hugging Face repository ID (e.g., "roshan-research/hazm-postagger").  None
model_filename  str | None          Filename inside the repository (e.g., "pos_tagger.model").            None
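
The two loading paths described by these parameters can be wrapped in one small helper. This is a minimal sketch: `load_tagger` is a hypothetical name that is not part of hazm, and it assumes `universal_tag` can be combined with either loading path, as the signature suggests.

```python
from pathlib import Path


def load_tagger(local_model=None, universal_tag=False):
    """Build a POSTagger from a local file if given, else from the Hub.

    Hypothetical helper, not part of hazm.
    """
    # Imported lazily so this module can be imported without hazm installed.
    from hazm import POSTagger

    if local_model is not None:
        # Local path: only the `model` parameter is needed.
        return POSTagger(model=Path(local_model), universal_tag=universal_tag)

    # Hub path: repository ID and filename taken from the examples above.
    return POSTagger(
        repo_id="roshan-research/hazm-postagger",
        model_filename="pos_tagger.model",
        universal_tag=universal_tag,
    )
```

Calling `load_tagger("resources/pos_tagger.model")` would use the local file, while `load_tagger()` would fall back to the Hub repository.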

__is_punc(word)

Checks if a word is punctuation.
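
Since the method is private, the following is only an illustrative sketch of what such a check might look like. The character set is an assumption (ASCII punctuation plus a few common Persian marks), and the library's own check may use a different list.

```python
import string

# Assumption: ASCII punctuation plus a few common Persian marks.
PERSIAN_PUNCTUATION = "،؛؟«»٪"
PUNCTUATION = set(string.punctuation) | set(PERSIAN_PUNCTUATION)


def is_punc(word: str) -> bool:
    """Return True when `word` is non-empty and every character is punctuation."""
    return bool(word) and all(ch in PUNCTUATION for ch in word)
```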

__universal_converter(tagged_list)

Converts POS tags to universal tags.
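
As a rough illustration, and assuming that fine-grained tags append extra markers after a comma (as in the 'NOUN,EZ' tag shown in the examples on this page), a conversion could keep only the part before the first comma. The function name `to_universal` is hypothetical, and the private method may work differently.

```python
def to_universal(tagged_list):
    """Keep only the part of each tag before the first comma."""
    return [(word, tag.split(",")[0]) for word, tag in tagged_list]


# Sample pairs taken from the tag() example output on this page.
tagged = [("مدرسه", "NOUN,EZ"), ("ایران", "NOUN"), (".", "PUNCT")]
universal = to_universal(tagged)
```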

data_maker(tokens)

Converts tokens into features.

Examples:

>>> tokens = [['دلم', 'اینجا', 'مانده‌است', '.']]
>>> features = tagger.data_maker(tokens)
>>> features[0][0]['word']
'دلم'

Parameters:

Name    Type            Description                                                     Default
tokens  list[Sentence]  A list of sentences, where each sentence is a list of tokens.   required

Returns:

Type                        Description
list[list[dict[str, Any]]]  A list of lists of feature dictionaries.

features(sentence, index)

Extracts features for a word at a given index.

Parameters:

Name      Type      Description                         Default
sentence  Sentence  The sentence containing the word.   required
index     int       The index of the word.              required

Returns:

Type            Description
dict[str, Any]  A dictionary of features.
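
To make the relationship between features and data_maker concrete, here is a minimal sketch of a per-token feature extractor and a wrapper that applies it to every token of every sentence. Only the 'word' key is confirmed by the example above. Every other key, and both function names, are assumptions chosen for illustration and do not reflect the library's actual feature set.

```python
from typing import Any


def sketch_features(sentence: list[str], index: int) -> dict[str, Any]:
    """Hypothetical feature extractor for the token at `index`."""
    word = sentence[index]
    last = len(sentence) - 1
    return {
        "word": word,  # the only key confirmed by the data_maker example
        # The keys below are illustrative assumptions.
        "is_first": index == 0,
        "is_last": index == last,
        "prefix-1": word[:1],
        "suffix-1": word[-1:],
        "prev_word": sentence[index - 1] if index > 0 else "",
        "next_word": sentence[index + 1] if index < last else "",
    }


def sketch_data_maker(tokens: list[list[str]]) -> list[list[dict[str, Any]]]:
    """Build one feature dict per token, keeping the sentence nesting."""
    return [
        [sketch_features(sentence, i) for i in range(len(sentence))]
        for sentence in tokens
    ]


features = sketch_data_maker([["دلم", "اینجا", "مانده‌است", "."]])
```

The result has the same shape as the documented return type: one inner list per sentence and one dictionary per token.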

tag(tokens)

Tags a single sentence.

Examples:

>>> tagger.tag(['من', 'به', 'مدرسه', 'ایران', 'رفته_بودم', '.'])
[('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN,EZ'), ('ایران', 'NOUN'), ('رفته_بودم', 'VERB'), ('.', 'PUNCT')]

Parameters:

Name    Type      Description                                  Default
tokens  Sentence  A list of tokens representing a sentence.    required

Returns:

Type            Description
TaggedSentence  A tagged sentence (list of (word, tag) tuples).
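
Because the result is a plain list of (word, tag) tuples, it can be processed with ordinary Python. The sketch below reuses the output from the example above instead of loading a model, and the variable names are illustrative.

```python
from collections import Counter

# Output copied from the tag() example; no model is loaded here.
tagged = [
    ("من", "PRON"), ("به", "ADP"), ("مدرسه", "NOUN,EZ"),
    ("ایران", "NOUN"), ("رفته_بودم", "VERB"), (".", "PUNCT"),
]

# Words whose tag is NOUN, with or without a suffix such as ",EZ".
nouns = [word for word, tag in tagged if tag.split(",")[0] == "NOUN"]

# Frequency of each full tag string.
tag_counts = Counter(tag for _, tag in tagged)
```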

tag_sents(sentences)

Tags multiple sentences.

Examples:

>>> tagger.tag_sents([['من', 'به', 'مدرسه', 'ایران', 'رفته_بودم', '.']])
[[('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN,EZ'), ('ایران', 'NOUN'), ('رفته_بودم', 'VERB'), ('.', 'PUNCT')]]

Parameters:

Name       Type            Description                    Default
sentences  list[Sentence]  A list of sentences to tag.    required

Returns:

Type                  Description
list[TaggedSentence]  A list of tagged sentences.

SpacyPOSTagger

Bases: POSTagger

POS Tagger based on spaCy.

__init__(model_path=None, using_gpu=False, gpu_id=0, repo_id=None, model_filename=None)

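Initializes the spaCy-based tagger from a local model directory or from a Hugging Face Hub repository, optionally on a GPU.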

Parameters:

Name            Type                Description                           Default
model_path      str | Path | None   Path to the local model directory.    None
using_gpu       bool                Whether to use GPU.                   False
gpu_id          int                 The ID of the GPU to use.             0
repo_id         str | None          Hugging Face repository ID.           None
model_filename  str | None          Filename (unused for spaCy models).   None

evaluate(test_sents, batch_size=128)

Evaluates the model.

Parameters:

Name        Type                  Description                                Default
test_sents  list[TaggedSentence]  A list of tagged sentences for testing.    required
batch_size  int                   Batch size for processing.                 128
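
This page does not say which metrics evaluate reports. As one common choice, token-level accuracy over gold and predicted tags can be computed as in the sketch below. The function name `token_accuracy` and the sample data are illustrative assumptions, not the library's implementation.

```python
def token_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, predicted in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    # Avoid dividing by zero on an empty test set.
    return correct / total if total else 0.0


# Made-up data: the third token differs between gold and prediction.
gold = [[("من", "PRON"), ("به", "ADP"), ("مدرسه", "NOUN,EZ"), (".", "PUNCT")]]
predicted = [[("من", "PRON"), ("به", "ADP"), ("مدرسه", "NOUN"), (".", "PUNCT")]]
accuracy = token_accuracy(gold, predicted)
```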

tag(tokens, universal_tag=True)

Tags a single sentence.

Parameters:

Name           Type      Description                                  Default
tokens         Sentence  A list of tokens representing a sentence.    required
universal_tag  bool      Whether to use universal POS tags.           True

Returns:

Type            Description
TaggedSentence  A tagged sentence.

tag_sents(sents, universal_tag=True, batch_size=128)

Tags multiple sentences.

Parameters:

Name           Type            Description                           Default
sents          list[Sentence]  A list of sentences to tag.           required
universal_tag  bool            Whether to use universal POS tags.    True
batch_size     int             Batch size for processing.            128

Returns:

Type                  Description
list[TaggedSentence]  A list of tagged sentences.
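
How the tagger batches sentences internally is not described here. The sketch below only illustrates what a batch_size of 128 typically means: the input is processed in consecutive chunks of at most that many sentences. The helper `batched` and the sample data are hypothetical.

```python
def batched(sentences, batch_size=128):
    """Yield consecutive slices of at most `batch_size` sentences."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size]


# 300 dummy two-token sentences split with the default batch size.
sentences = [["جمله", str(i)] for i in range(300)]
batch_sizes = [len(batch) for batch in batched(sentences, batch_size=128)]
```

With 300 sentences, this yields two full batches and one partial batch.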

train(train_dataset, test_dataset, data_directory, base_config_file, train_config_path, output_dir, use_direct_config=False)

Trains the spaCy model.

Parameters:

Name               Type                  Description                                        Default
train_dataset      list[TaggedSentence]  The training dataset.                              required
test_dataset       list[TaggedSentence]  The testing dataset.                               required
data_directory     str                   Directory to save processed data.                  required
base_config_file   str                   Path to the base configuration file.               required
train_config_path  str                   Path to the training configuration file.           required
output_dir         str                   Directory to save the trained model.               required
use_direct_config  bool                  Whether to use the configuration file directly.    False

StanfordPOSTagger

Bases: nltk.tag.stanford.StanfordPOSTagger

Wrapper for Stanford POS Tagger.

__init__(model_filename, path_to_jar, *args, **kwargs)

Initializes the wrapper with the paths to the model file and the Stanford POS Tagger JAR file.

Parameters:

Name            Type  Description                                  Default
model_filename  str   Path to the model file.                      required
path_to_jar     str   Path to the Stanford POS Tagger JAR file.    required
*args           Any   Variable length argument list.               ()
**kwargs        Any   Arbitrary keyword arguments.                 {}

tag(tokens)

Tags a single sentence.

Parameters:

Name    Type      Description                                  Default
tokens  Sentence  A list of tokens representing a sentence.    required

Returns:

Type            Description
TaggedSentence  A tagged sentence.

tag_sents(sentences)

Tags multiple sentences.

Parameters:

Name       Type            Description                    Default
sentences  list[Sentence]  A list of sentences to tag.    required

Returns:

Type                  Description
list[TaggedSentence]  A list of tagged sentences.