POS Tagger

The accuracy of the POS tagger in the current version is 98.8%.

This module contains classes and functions for POS tagging.

`POSTagger` ¶

Bases: SequenceTagger, TaggerProtocol

Class for POS tagging.

Examples:

>>> # Load from Hugging Face Hub
>>> tagger = POSTagger(repo_id="roshan-research/hazm-postagger", model_filename="pos_tagger.model")
>>> # Or load from a local model file
>>> # tagger = POSTagger(model="pos_tagger.model")

`__init__(model=None, data_maker=None, universal_tag=False, repo_id=None, model_filename=None)` ¶

Constructor.

Examples:

>>> # Loading from Hugging Face Hub
>>> tagger = POSTagger(repo_id="roshan-research/hazm-postagger", model_filename="pos_tagger.model")
>>> # Loading from a local model file
>>> # tagger = POSTagger(model="resources/pos_tagger.model")

Parameters:

Name	Type	Description	Default
`model`	`str | Path | None`	Path to the local model file.	`None`
`data_maker`	`Any`	Custom data maker function.	`None`
`universal_tag`	`bool`	Whether to use universal POS tags.	`False`
`repo_id`	`str | None`	Hugging Face repository ID (e.g., "roshan-research/hazm-postagger").	`None`
`model_filename`	`str | None`	Filename inside the repository (e.g., "pos_tagger.model").	`None`

`__is_punc(word)` ¶

Checks if a word is punctuation.
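The source does not show how this private helper decides what counts as punctuation. As a minimal sketch of one possible approach, the hypothetical function `is_punctuation` below treats a token as punctuation when every character falls in a Unicode punctuation category. It is an illustration only and may differ from the library's own check.

```python
import unicodedata


def is_punctuation(word: str) -> bool:
    """Hypothetical helper: True when `word` is non-empty and every
    character belongs to a Unicode punctuation category (P*)."""
    return bool(word) and all(
        unicodedata.category(ch).startswith("P") for ch in word
    )


# Latin and Persian punctuation marks are both recognised.
print(is_punctuation("."))
print(is_punctuation("،"))
print(is_punctuation("مدرسه"))
```

A category-based check of this kind avoids maintaining a hand-written list of marks, at the cost of treating symbols such as `+` (category `Sm`) as non-punctuation.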

`__universal_converter(tagged_list)` ¶

Converts POS tags to universal tags.
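In the tagging examples on this page, fine-grained tags such as `NOUN,EZ` carry an extra marker after a comma. A minimal sketch of a conversion, assuming the universal tag is simply the part before the first comma, could look like the following. The function name `to_universal` is hypothetical and the rule is an assumption, not a statement of what the private method does.

```python
def to_universal(tagged_list):
    """Hypothetical converter: keep only the part of each tag that
    precedes the first comma, e.g. 'NOUN,EZ' -> 'NOUN'."""
    return [(word, tag.split(",")[0]) for word, tag in tagged_list]


tagged = [
    ("من", "PRON"),
    ("به", "ADP"),
    ("مدرسه", "NOUN,EZ"),
    ("ایران", "NOUN"),
    ("رفته_بودم", "VERB"),
    (".", "PUNCT"),
]
universal = to_universal(tagged)
print(universal[2])
```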

`data_maker(tokens)` ¶

Converts tokens into features.

Examples:

>>> tokens = [['دلم', 'اینجا', 'مانده‌است', '.']]
>>> features = tagger.data_maker(tokens)
>>> features[0][0]['word']
'دلم'

Parameters:

Name	Type	Description	Default
`tokens`	`list[Sentence]`	A list of sentences, where each sentence is a list of tokens.	required

Returns:

Type	Description
`list[list[dict[str, Any]]]`	A list of lists of feature dictionaries.

`features(sentence, index)` ¶

Extracts features for a word at a given index.

Parameters:

Name	Type	Description	Default
`sentence`	`Sentence`	The sentence containing the word.	required
`index`	`int`	The index of the word.	required

Returns:

Type	Description
`dict[str, Any]`	A dictionary of features.
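To make the shape of the returned dictionary more concrete, here is a minimal sketch of a per-token feature extractor and of a data maker that applies it to every position of every sentence. Only the `word` key is confirmed by the example on this page; every other key, and the names `sketch_features` and `sketch_data_maker`, are assumptions chosen for illustration.

```python
from typing import Any

Sentence = list[str]  # assumed alias: a sentence is a list of token strings


def sketch_features(sentence: Sentence, index: int) -> dict[str, Any]:
    """Hypothetical feature extractor for the token at `index`."""
    word = sentence[index]
    last = len(sentence) - 1
    return {
        "word": word,  # the only key shown in the documented example
        "is_first": index == 0,
        "is_last": index == last,
        "prefix-1": word[:1],
        "suffix-1": word[-1:],
        "prev_word": "" if index == 0 else sentence[index - 1],
        "next_word": "" if index == last else sentence[index + 1],
        "is_numeric": word.isdigit(),
    }


def sketch_data_maker(sentences: list[Sentence]) -> list[list[dict[str, Any]]]:
    """Build one feature dictionary per token, per sentence."""
    return [
        [sketch_features(sentence, i) for i in range(len(sentence))]
        for sentence in sentences
    ]


tokens = [["دلم", "اینجا", "مانده‌است", "."]]
feats = sketch_data_maker(tokens)
print(feats[0][0]["word"])
```

Context features such as the previous and next word are a common choice for sequence taggers because they let the model condition a tag on its neighbours.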

`tag(tokens)` ¶

Tags a single sentence.

Examples:

>>> tagger.tag(['من', 'به', 'مدرسه', 'ایران', 'رفته_بودم', '.'])
[('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN,EZ'), ('ایران', 'NOUN'), ('رفته_بودم', 'VERB'), ('.', 'PUNCT')]

Parameters:

Name	Type	Description	Default
`tokens`	`Sentence`	A list of tokens representing a sentence.	required

Returns:

Type	Description
`TaggedSentence`	A tagged sentence (list of (word, tag) tuples).

`tag_sents(sentences)` ¶

Tags multiple sentences.

Examples:

>>> tagger.tag_sents([['من', 'به', 'مدرسه', 'ایران', 'رفته_بودم', '.']])
[[('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN,EZ'), ('ایران', 'NOUN'), ('رفته_بودم', 'VERB'), ('.', 'PUNCT')]]

Parameters:

Name	Type	Description	Default
`sentences`	`list[Sentence]`	A list of sentences to tag.	required

Returns:

Type	Description
`list[TaggedSentence]`	A list of tagged sentences.
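The two examples above return the same tags, once as a flat list and once wrapped in an outer list. The sketch below illustrates that relationship with a toy dictionary-based tagger. `LookupTagger` and the aliases `Sentence` and `TaggedSentence` are hypothetical stand-ins written for this example, not the library's classes, and a real tagger predicts tags from features rather than from a fixed lexicon.

```python
Sentence = list[str]                     # assumed alias
TaggedSentence = list[tuple[str, str]]   # assumed alias


class LookupTagger:
    """Toy stand-in that tags tokens from a fixed lexicon."""

    def __init__(self, lexicon: dict[str, str], default: str = "NOUN") -> None:
        self.lexicon = lexicon
        self.default = default

    def tag_sents(self, sentences: list[Sentence]) -> list[TaggedSentence]:
        # Tag every token of every sentence; unknown tokens get the default tag.
        return [
            [(token, self.lexicon.get(token, self.default)) for token in sentence]
            for sentence in sentences
        ]

    def tag(self, tokens: Sentence) -> TaggedSentence:
        # A single sentence is handled as a batch of one.
        return self.tag_sents([tokens])[0]


toy = LookupTagger({"من": "PRON", "به": "ADP", ".": "PUNCT"})
sentence = ["من", "به", "مدرسه", "."]
single = toy.tag(sentence)
batch = toy.tag_sents([sentence])
print(single == batch[0])
```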

`SpacyPOSTagger` ¶

Bases: POSTagger

POS Tagger based on spaCy.

`__init__(model_path=None, using_gpu=False, gpu_id=0, repo_id=None, model_filename=None)` ¶

Constructor.

Parameters:

Name	Type	Description	Default
`model_path`	`str | Path | None`	Path to the local model directory.	`None`
`using_gpu`	`bool`	Whether to use GPU.	`False`
`gpu_id`	`int`	The ID of the GPU to use.	`0`
`repo_id`	`str | None`	Hugging Face repository ID.	`None`
`model_filename`	`str | None`	Filename (unused for spaCy models).	`None`

`evaluate(test_sents, batch_size=128)` ¶

Evaluates the model.

Parameters:

Name	Type	Description	Default
`test_sents`	`list[TaggedSentence]`	A list of tagged sentences for testing.	required
`batch_size`	`int`	Batch size for processing.	`128`
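This page does not state which metric `evaluate` reports. Assuming it is a token-level accuracy comparable to the figure quoted at the top of the page, the computation can be sketched as follows. The function `token_accuracy` is hypothetical and compares gold tags with predicted tags position by position.

```python
def token_accuracy(gold_sents, predicted_sents):
    """Hypothetical metric: share of tokens whose predicted tag equals
    the gold tag, counted over all sentences."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


gold = [[("من", "PRON"), ("به", "ADP"), ("مدرسه", "NOUN,EZ"),
         ("ایران", "NOUN"), ("رفته_بودم", "VERB"), (".", "PUNCT")]]
# Same sentence with one deliberately wrong tag on the third token.
predicted = [[("من", "PRON"), ("به", "ADP"), ("مدرسه", "NOUN"),
              ("ایران", "NOUN"), ("رفته_بودم", "VERB"), (".", "PUNCT")]]
accuracy = token_accuracy(gold, predicted)
print(f"{accuracy:.3f}")
```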

`tag(tokens, universal_tag=True)` ¶

Tags a single sentence.

Parameters:

Name	Type	Description	Default
`tokens`	`Sentence`	A list of tokens representing a sentence.	required
`universal_tag`	`bool`	Whether to use universal POS tags.	`True`

Returns:

Type	Description
`TaggedSentence`	A tagged sentence.

`tag_sents(sents, universal_tag=True, batch_size=128)` ¶

Tags multiple sentences.

Parameters:

Name	Type	Description	Default
`sents`	`list[Sentence]`	A list of sentences to tag.	required
`universal_tag`	`bool`	Whether to use universal POS tags.	`True`
`batch_size`	`int`	Batch size for processing.	`128`

Returns:

Type	Description
`list[TaggedSentence]`	A list of tagged sentences.
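The `batch_size` parameter suggests that sentences are processed in fixed-size groups. A minimal sketch of such chunking, using a hypothetical generator named `batched`, is shown below. How the library actually forms its batches is not described here, so treat this only as an illustration of the idea.

```python
from itertools import islice


def batched(sentences, batch_size=128):
    """Hypothetical helper: yield consecutive lists of at most
    `batch_size` sentences until the input is exhausted."""
    iterator = iter(sentences)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


sentences = [["token"]] * 300
sizes = [len(batch) for batch in batched(sentences, batch_size=128)]
print(sizes)  # [128, 128, 44]
```

Larger batches usually improve throughput, especially on a GPU, while smaller ones reduce peak memory use.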

`train(train_dataset, test_dataset, data_directory, base_config_file, train_config_path, output_dir, use_direct_config=False)` ¶

Trains the spaCy model.

Parameters:

Name	Type	Description	Default
`train_dataset`	`list[TaggedSentence]`	The training dataset.	required
`test_dataset`	`list[TaggedSentence]`	The testing dataset.	required
`data_directory`	`str`	Directory to save processed data.	required
`base_config_file`	`str`	Path to the base configuration file.	required
`train_config_path`	`str`	Path to the training configuration file.	required
`output_dir`	`str`	Directory to save the trained model.	required
`use_direct_config`	`bool`	Whether to use the configuration file directly.	`False`

`StanfordPOSTagger` ¶

Bases: StanfordPOSTagger (the NLTK class of the same name)

Wrapper for Stanford POS Tagger.

`__init__(model_filename, path_to_jar, *args, **kwargs)` ¶

Constructor.

Parameters:

Name	Type	Description	Default
`model_filename`	`str`	Path to the model file.	required
`path_to_jar`	`str`	Path to the Stanford POS Tagger JAR file.	required
`*args`	`Any`	Variable length argument list.	`()`
`**kwargs`	`Any`	Arbitrary keyword arguments.	`{}`

`tag(tokens)` ¶

Tags a single sentence.

Parameters:

Name	Type	Description	Default
`tokens`	`Sentence`	A list of tokens representing a sentence.	required

Returns:

Type	Description
`TaggedSentence`	A tagged sentence.

`tag_sents(sentences)` ¶

Tags multiple sentences.

Parameters:

Name	Type	Description	Default
`sentences`	`list[Sentence]`	A list of sentences to tag.	required

Returns:

Type	Description
`list[TaggedSentence]`	A list of tagged sentences.
