POS Tagger
The accuracy of the POS tagger in the current version is 98.8%.
This module contains classes and functions for POS tagging.
POSTagger
Bases: SequenceTagger, TaggerProtocol
Class for POS tagging.
Examples:
>>> # Load from Hugging Face Hub
>>> tagger = POSTagger(repo_id="roshan-research/hazm-postagger", model_filename="pos_tagger.model")
>>> # Or load from a local model file
>>> # tagger = POSTagger(model="pos_tagger.model")
__init__(model=None, data_maker=None, universal_tag=False, repo_id=None, model_filename=None)
Initializes the tagger from a local model file or from a Hugging Face Hub repository.
Examples:
>>> # Loading from Hugging Face Hub
>>> tagger = POSTagger(repo_id="roshan-research/hazm-postagger", model_filename="pos_tagger.model")
>>> # Loading from a local model file
>>> # tagger = POSTagger(model="resources/pos_tagger.model")
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| Path \| None | Path to the local model file. | None |
| data_maker | Any | Custom data maker function. | None |
| universal_tag | bool | Whether to use universal POS tags. | False |
| repo_id | str \| None | Hugging Face repository ID (e.g., "roshan-research/hazm-postagger"). | None |
| model_filename | str \| None | Filename inside the repository (e.g., "pos_tagger.model"). | None |
__is_punc(word)
Checks if a word is punctuation.
__universal_converter(tagged_list)
Converts POS tags to universal tags.
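The example outputs on this page contain composite tags such as 'NOUN,EZ'. The following is a minimal sketch of one way such tags could be reduced to a coarse tag, assuming the main category is the part before the first comma. The function name `to_universal` is hypothetical and is not the library's private method; the actual conversion may differ.

```python
def to_universal(tagged_list):
    """Keep only the part of each tag before the first comma (assumed rule)."""
    return [(word, tag.split(",")[0]) for word, tag in tagged_list]


tagged = [("مدرسه", "NOUN,EZ"), ("ایران", "NOUN"), (".", "PUNCT")]
universal = to_universal(tagged)
print(universal)
```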
data_maker(tokens)
Converts tokens into features.
Examples:
>>> tokens = [['دلم', 'اینجا', 'مانده‌است', '.']]
>>> features = tagger.data_maker(tokens)
>>> features[0][0]['word']
'دلم'
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tokens | list[Sentence] | A list of sentences, where each sentence is a list of tokens. | required |

Returns:

| Type | Description |
|---|---|
| list[list[dict[str, Any]]] | A list of lists of feature dictionaries. |
features(sentence, index)
Extracts features for a word at a given index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sentence | Sentence | The sentence containing the word. | required |
| index | int | The index of the word. | required |

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | A dictionary of features. |
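To make the shape of the returned data concrete, here is a rough sketch of per-token feature extraction and of mapping it over sentences. Only the 'word' key is confirmed by the `data_maker` example on this page; every other key, and the names `sketch_features` and `sketch_data_maker`, are assumptions for illustration and do not describe the tagger's real feature set.

```python
from typing import Any


def sketch_features(sentence: list[str], index: int) -> dict[str, Any]:
    """Build a feature dictionary for the token at `index` (hypothetical features)."""
    word = sentence[index]
    last = len(sentence) - 1
    return {
        "word": word,  # the only key confirmed by the page's example
        "is_first": index == 0,  # assumed feature
        "is_last": index == last,  # assumed feature
        "prev_word": "" if index == 0 else sentence[index - 1],  # assumed
        "next_word": "" if index == last else sentence[index + 1],  # assumed
        "is_numeric": word.isdigit(),  # assumed feature
    }


def sketch_data_maker(sentences: list[list[str]]) -> list[list[dict[str, Any]]]:
    """Apply the per-token extractor to every token of every sentence."""
    return [
        [sketch_features(sentence, i) for i in range(len(sentence))]
        for sentence in sentences
    ]


feature_rows = sketch_data_maker([["دلم", "اینجا", "."]])
print(feature_rows[0][0]["word"])
```

The result has one inner list per sentence and one dictionary per token, which matches the documented return type `list[list[dict[str, Any]]]`.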
tag(tokens)
Tags a single sentence.
Examples:
>>> tagger.tag(['من', 'به', 'مدرسه', 'ایران', 'رفته_بودم', '.'])
[('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN,EZ'), ('ایران', 'NOUN'), ('رفته_بودم', 'VERB'), ('.', 'PUNCT')]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tokens | Sentence | A list of tokens representing a sentence. | required |

Returns:

| Type | Description |
|---|---|
| TaggedSentence | A tagged sentence (list of (word, tag) tuples). |
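Because the result is a plain list of (word, tag) tuples, it can be processed with ordinary Python. The sketch below reuses the output shown in the example above as literal data, so it runs without a model. It assumes that the part of a tag before the comma is the main category, which the source does not state explicitly.

```python
from collections import Counter

# Output copied from the tag() example above, used here as literal data.
tagged = [
    ("من", "PRON"),
    ("به", "ADP"),
    ("مدرسه", "NOUN,EZ"),
    ("ایران", "NOUN"),
    ("رفته_بودم", "VERB"),
    (".", "PUNCT"),
]

# Collect words whose main category (assumed: text before the comma) is NOUN.
nouns = [word for word, tag in tagged if tag.split(",")[0] == "NOUN"]

# Count how often each main category occurs in the sentence.
tag_counts = Counter(tag.split(",")[0] for _, tag in tagged)

print(nouns)
print(tag_counts)
```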
tag_sents(sentences)
Tags multiple sentences.
Examples:
>>> tagger.tag_sents([['من', 'به', 'مدرسه', 'ایران', 'رفته_بودم', '.']])
[[('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN,EZ'), ('ایران', 'NOUN'), ('رفته_بودم', 'VERB'), ('.', 'PUNCT')]]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sentences | list[Sentence] | A list of sentences to tag. | required |

Returns:

| Type | Description |
|---|---|
| list[TaggedSentence] | A list of tagged sentences. |
SpacyPOSTagger
Bases: POSTagger
POS Tagger based on spaCy.
__init__(model_path=None, using_gpu=False, gpu_id=0, repo_id=None, model_filename=None)
Initializes the spaCy-based tagger from a local model directory or from a Hugging Face Hub repository.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_path | str \| Path \| None | Path to the local model directory. | None |
| using_gpu | bool | Whether to use GPU. | False |
| gpu_id | int | The ID of the GPU to use. | 0 |
| repo_id | str \| None | Hugging Face repository ID. | None |
| model_filename | str \| None | Filename (unused for spaCy models). | None |
evaluate(test_sents, batch_size=128)
Evaluates the model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| test_sents | list[TaggedSentence] | A list of tagged sentences for testing. | required |
| batch_size | int | Batch size for processing. | 128 |
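This page does not say which metric `evaluate` reports. As an illustration only, the sketch below computes token-level accuracy between gold and predicted tagged sentences, which is a common way to score a POS tagger. The function name `token_accuracy` and the sample data are hypothetical.

```python
def token_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


gold = [
    [("من", "PRON"), ("به", "ADP")],
    [("مدرسه", "NOUN"), (".", "PUNCT")],
]
predicted = [
    [("من", "PRON"), ("به", "ADP")],
    [("مدرسه", "VERB"), (".", "PUNCT")],  # one deliberate mistake
]

accuracy = token_accuracy(gold, predicted)
print(accuracy)
```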
tag(tokens, universal_tag=True)
Tags a single sentence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tokens | Sentence | A list of tokens representing a sentence. | required |
| universal_tag | bool | Whether to use universal POS tags. | True |

Returns:

| Type | Description |
|---|---|
| TaggedSentence | A tagged sentence. |
tag_sents(sents, universal_tag=True, batch_size=128)
Tags multiple sentences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sents | list[Sentence] | A list of sentences to tag. | required |
| universal_tag | bool | Whether to use universal POS tags. | True |
| batch_size | int | Batch size for processing. | 128 |

Returns:

| Type | Description |
|---|---|
| list[TaggedSentence] | A list of tagged sentences. |
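How the tagger batches its input internally is not documented here. The sketch below only illustrates what a `batch_size` parameter typically controls: the input is cut into consecutive chunks of at most that many sentences. The helper name `batched` and the sample data are assumptions.

```python
from itertools import islice


def batched(sentences, batch_size=128):
    """Yield consecutive chunks of at most `batch_size` sentences."""
    iterator = iter(sentences)
    while chunk := list(islice(iterator, batch_size)):
        yield chunk


# 300 dummy sentences, each a short list of tokens.
sentences = [["جمله", str(i), "."] for i in range(300)]
sizes = [len(chunk) for chunk in batched(sentences, batch_size=128)]
print(sizes)
```

A larger batch size usually means fewer calls into the model at the cost of more memory per call.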
train(train_dataset, test_dataset, data_directory, base_config_file, train_config_path, output_dir, use_direct_config=False)
Trains the spaCy model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| train_dataset | list[TaggedSentence] | The training dataset. | required |
| test_dataset | list[TaggedSentence] | The testing dataset. | required |
| data_directory | str | Directory to save processed data. | required |
| base_config_file | str | Path to the base configuration file. | required |
| train_config_path | str | Path to the training configuration file. | required |
| output_dir | str | Directory to save the trained model. | required |
| use_direct_config | bool | Whether to use the configuration file directly. | False |
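Both `train_dataset` and `test_dataset` are lists of tagged sentences. One possible way to prepare them from a single tagged corpus is a shuffled hold-out split, sketched below. The helper `split_tagged_sentences`, the 80/20 ratio, and the dummy corpus are assumptions for illustration and are not part of the library.

```python
import random


def split_tagged_sentences(sentences, test_ratio=0.2, seed=0):
    """Shuffle a copy of the corpus and hold out a fraction for testing."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Ten dummy tagged sentences standing in for a real corpus.
corpus = [[(f"w{i}", "NOUN"), (".", "PUNCT")] for i in range(10)]
train_dataset, test_dataset = split_tagged_sentences(corpus)
print(len(train_dataset), len(test_dataset))
```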
StanfordPOSTagger
Bases: nltk.tag.stanford.StanfordPOSTagger
Wrapper for Stanford POS Tagger.
__init__(model_filename, path_to_jar, *args, **kwargs)
Initializes the wrapper with a model file and the Stanford POS Tagger JAR file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_filename | str | Path to the model file. | required |
| path_to_jar | str | Path to the Stanford POS Tagger JAR file. | required |
| *args | Any | Variable length argument list. | () |
| **kwargs | Any | Arbitrary keyword arguments. | {} |
tag(tokens)
Tags a single sentence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tokens | Sentence | A list of tokens representing a sentence. | required |

Returns:

| Type | Description |
|---|---|
| TaggedSentence | A tagged sentence. |
tag_sents(sentences)
Tags multiple sentences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sentences | list[Sentence] | A list of sentences to tag. | required |

Returns:

| Type | Description |
|---|---|
| list[TaggedSentence] | A list of tagged sentences. |