Chunker
This module contains classes and functions for shallow parsing (chunking) of text into noun, verb, and prepositional phrases.
The shallow parser in the current version reaches an accuracy of 93.4%.
Chunker
Bases: IOBTagger
Class for chunking text and for training and evaluating chunker models.
__init__(model=None, data_maker=None, repo_id=None, model_filename=None)
Constructor. Loads a model either from a local file or from the Hugging Face Hub.
Examples:
>>> # Loading from Hugging Face Hub
>>> chunker = Chunker(repo_id="roshan-research/hazm-chunker", model_filename="chunker.model")
>>> # Loading from a local model file
>>> # chunker = Chunker(model='resources/chunker.model')
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| Path \| None | Path to the local model file. | None |
| data_maker | Any | Custom data maker function. | None |
| repo_id | str \| None | Hugging Face repository ID (e.g., "roshan-research/hazm-chunker"). | None |
| model_filename | str \| None | Filename inside the repository (e.g., "chunker.model"). | None |
data_maker(tokens)
Converts tokens into features.
Examples:
>>> tokens = [[('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN,EZ'), ('ایران', 'NOUN'), ('رفته_بودم', 'VERB'), ('.', 'PUNCT')]]
>>> features = chunker.data_maker(tokens)
>>> features[0][0]['pos']
'PRON'
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tokens | list[TaggedSentence] | A list of tagged sentences. | required |

Returns:

| Type | Description |
|---|---|
| list[list[dict[str, Any]]] | A list of lists of feature dictionaries. |
evaluate(trees)
Evaluates the accuracy of the chunker.
Examples:
>>> trees = [chunker.parse([('نامه', 'NOUN,EZ'), ('ایشان', 'PRON'), ('را', 'ADP'), ('دریافت', 'NOUN'), ('داشتم', 'VERB'), ('.', 'PUNCT')])]
>>> chunker.evaluate(trees)
1.0
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| trees | list[Tree] | A list of gold standard parse trees. | required |

Returns:

| Type | Description |
|---|---|
| float | The accuracy of the chunker. |
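As a rough illustration of what an accuracy figure like the one returned here can mean, the sketch below computes token-level accuracy over IOB chunk tags. This is an assumption about the metric, not a description of how the library computes it; `iob_accuracy` and the tag sequences are hypothetical.

```python
def iob_accuracy(gold, predicted):
    """Fraction of tokens whose predicted IOB tag equals the gold tag.

    Both arguments are lists of sentences; each sentence is a list of IOB tags.
    """
    total = 0
    correct = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("gold and predicted sentences must have equal length")
        total += len(gold_tags)
        correct += sum(g == p for g, p in zip(gold_tags, pred_tags))
    # An empty evaluation set is reported as 0.0 rather than raising.
    return correct / total if total else 0.0


# Hypothetical tags: the prediction gets one of six tokens wrong.
gold = [["B-NP", "I-NP", "B-POSTP", "B-VP", "I-VP", "O"]]
predicted = [["B-NP", "I-NP", "B-POSTP", "B-VP", "B-VP", "O"]]
accuracy = iob_accuracy(gold, predicted)
```

Evaluating a chunker on trees it produced itself, as in the doctest above, trivially yields 1.0 under any such metric; a meaningful score needs gold trees from held-out data.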
features(words, pos_tags, index)
Extracts features for a word at a given index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| words | list[str] | List of words in the sentence. | required |
| pos_tags | list[str] | List of POS tags for the words. | required |
| index | int | The index of the word to extract features for. | required |

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | A dictionary of features. |
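The exact feature set is not listed in this reference; only the `pos` key appears in the `data_maker` example. The sketch below shows one plausible shape for such a function, using a window of one token on each side. The function name `window_features` and every key other than `pos` are assumptions made for illustration.

```python
from typing import Any


def window_features(words: list[str], pos_tags: list[str], index: int) -> dict[str, Any]:
    """Hypothetical feature set: the current token plus its immediate neighbours."""
    last = len(words) - 1
    return {
        "word": words[index],
        "pos": pos_tags[index],
        # Empty strings mark positions that fall outside the sentence.
        "prev_word": words[index - 1] if index > 0 else "",
        "prev_pos": pos_tags[index - 1] if index > 0 else "",
        "next_word": words[index + 1] if index < last else "",
        "next_pos": pos_tags[index + 1] if index < last else "",
    }


words = ["من", "به", "مدرسه", "رفتم"]
pos_tags = ["PRON", "ADP", "NOUN", "VERB"]
first = window_features(words, pos_tags, 0)
# One feature dictionary per token, which is the per-sentence shape data_maker returns.
sentence_features = [window_features(words, pos_tags, i) for i in range(len(words))]
```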
parse(sentence)
Parses a tagged sentence into a chunk tree.
Examples:
>>> tree = chunker.parse([('نامه', 'NOUN,EZ'), ('ایشان', 'PRON'), ('را', 'ADP'), ('دریافت', 'NOUN'), ('داشتم', 'VERB'), ('.', 'PUNCT')])
>>> print(tree)
(S
  (NP نامه/NOUN,EZ ایشان/PRON)
  (POSTP را/ADP)
  (VP دریافت/NOUN داشتم/VERB)
  ./PUNCT)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sentence | TaggedSentence | A tagged sentence. | required |

Returns:

| Type | Description |
|---|---|
| Tree | The parsed chunk tree. |
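Because `Chunker` derives from `IOBTagger`, the chunk tree is presumably assembled from per-token IOB tags. The sketch below illustrates that grouping step on plain Python tuples rather than a `Tree` object. The helper `iob_to_chunks` and the IOB tags in the example are assumptions chosen to match the tree printed above, not the library's internals.

```python
def iob_to_chunks(tagged_tokens):
    """Group (word, pos, iob) triples into (label, tokens) chunks.

    Tokens tagged "O" stay outside any chunk and are returned as (word, pos).
    """
    result = []
    current_label = None
    current_tokens = []
    for word, pos, iob in tagged_tokens:
        prefix, _, label = iob.partition("-")
        continues = prefix == "I" and label == current_label
        if not continues and current_tokens:
            result.append((current_label, current_tokens))  # close the open chunk
            current_label, current_tokens = None, []
        if iob == "O":
            result.append((word, pos))
        else:
            # "B-X" always opens a chunk; a stray "I-X" is treated the same way.
            current_label = label
            current_tokens.append((word, pos))
    if current_tokens:
        result.append((current_label, current_tokens))
    return result


tagged = [
    ("نامه", "NOUN,EZ", "B-NP"),
    ("ایشان", "PRON", "I-NP"),
    ("را", "ADP", "B-POSTP"),
    ("دریافت", "NOUN", "B-VP"),
    ("داشتم", "VERB", "I-VP"),
    (".", "PUNCT", "O"),
]
chunks = iob_to_chunks(tagged)
```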
parse_sents(sentences)
Parses a list of tagged sentences into chunk trees.
Examples:
>>> sentences = [[('نامه', 'NOUN,EZ'), ('ایشان', 'PRON')], [('من', 'PRON'), ('رفتم', 'VERB')]]
>>> trees = list(chunker.parse_sents(sentences))
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sentences | list[TaggedSentence] | A list of tagged sentences. | required |

Yields:

| Type | Description |
|---|---|
| Tree | The parsed chunk tree for each sentence. |
train(trees, c1=0.4, c2=0.04, max_iteration=400, verbose=True, file_name='chunker_crf.model', report_duration=True)
Trains the chunker model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| trees | list[Tree] | A list of parse trees for training. | required |
| c1 | float | Coefficient for L1 regularization. | 0.4 |
| c2 | float | Coefficient for L2 regularization. | 0.04 |
| max_iteration | int | Maximum number of iterations for training. | 400 |
| verbose | bool | Whether to print verbose output. | True |
| file_name | str | The name of the file to save the trained model. | 'chunker_crf.model' |
| report_duration | bool | Whether to report the training duration. | True |
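The `c1` and `c2` arguments are described only as L1 and L2 regularization coefficients. Assuming the underlying model is a conditional random field trained with an elastic-net penalty (an assumption suggested by the default file name, not stated in this reference), the training objective would take roughly the following form, where $\theta$ denotes the model weights and $(x^{(i)}, y^{(i)})$ the training sentences with their IOB tag sequences:

```latex
\min_{\theta} \; -\sum_{i=1}^{N} \log p_{\theta}\bigl(y^{(i)} \mid x^{(i)}\bigr)
  \;+\; c_1 \lVert \theta \rVert_1
  \;+\; c_2 \lVert \theta \rVert_2^2
```

Under this reading, a larger `c1` drives more weights to exactly zero and yields a sparser model, while a larger `c2` shrinks all weights toward zero. The exact scaling of each term may differ between training libraries.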
RuleBasedChunker
SpacyChunker
Bases: Chunker
Chunker based on the spaCy library.
__init__(model_path=None, using_gpu=False, gpu_id=0, repo_id=None)
Constructor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_path | str \| Path \| None | Path to the local spaCy model. | None |
| using_gpu | bool | Whether to use GPU. | False |
| gpu_id | int | The ID of the GPU to use. | 0 |
| repo_id | str \| None | Hugging Face repository ID. | None |
evaluate(test_sents)
Evaluates the accuracy of the chunker.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| test_sents | list[ChunkedSentence] | A list of chunked sentences for testing. | required |

Returns:

| Type | Description |
|---|---|
| ChunkScore | The chunk score. |
parse(sentence)
Parses a single sentence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sentence | TaggedSentence | A tagged sentence. | required |

Returns:

| Type | Description |
|---|---|
| Tree | The parsed tree. |
parse_sents(sentences, batch_size=128)
Parses multiple sentences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sentences | list[TaggedSentence] | A list of tagged sentences. | required |
| batch_size | int | Batch size for processing. | 128 |

Yields:

| Type | Description |
|---|---|
| Tree | The parsed tree for each sentence. |
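To make the role of `batch_size` concrete, the sketch below shows what such a parameter typically controls: the input is consumed in groups of at most `batch_size` sentences, so the model processes one group at a time. This is a generic illustration with a hypothetical `batched` helper, not the way `SpacyChunker` is implemented.

```python
from itertools import islice


def batched(items, batch_size=128):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 300 short tagged sentences split with the default batch size of 128.
sentences = [[("من", "PRON"), ("رفتم", "VERB")]] * 300
batch_sizes = [len(batch) for batch in batched(sentences, batch_size=128)]
```

Larger batches usually improve throughput, especially on a GPU, at the cost of more memory per step.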
train(train_dataset, test_dataset, data_directory, base_config_file, train_config_path, output_dir, use_direct_config=False)
Trains the spaCy chunker model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| train_dataset | list[ChunkedSentence] | The training dataset. | required |
| test_dataset | list[ChunkedSentence] | The testing dataset. | required |
| data_directory | str | Directory to save processed data. | required |
| base_config_file | str | Path to the base configuration file. | required |
| train_config_path | str | Path to the training configuration file. | required |
| output_dir | str | Directory to save the trained model. | required |
| use_direct_config | bool | Whether to use the configuration file directly. | False |
tree2brackets(tree)
Converts a tree object to a bracketed string representation.
Examples:
>>> chunker = Chunker(repo_id="roshan-research/hazm-chunker", model_filename="chunker.model")
>>> tree = chunker.parse([('نامه', 'NOUN,EZ'), ('ایشان', 'PRON'), ('را', 'ADP'), ('دریافت', 'NOUN'), ('داشتم', 'VERB'), ('.', 'PUNCT')])
>>> tree2brackets(tree)
'[نامه ایشان NP] [را POSTP] [دریافت داشتم VP] .'
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tree | Tree | The parse tree to be converted. | required |

Returns:

| Type | Description |
|---|---|
| str | A bracketed string representation of the tree. |
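The bracketed format places the chunk label after the words of each chunk and leaves tokens outside any chunk bare. The sketch below reproduces that format from a plain list of tuples instead of a `Tree` object, so it runs without the library. The helper name `chunks_to_brackets` and the input structure are assumptions for illustration, not the actual implementation.

```python
def chunks_to_brackets(chunks):
    """Render chunks as '[word word LABEL]' and keep unchunked words bare."""
    parts = []
    for item in chunks:
        first, second = item
        if isinstance(second, list):
            # (label, [(word, pos), ...]) is a labelled chunk.
            words = " ".join(word for word, _pos in second)
            parts.append(f"[{words} {first}]")
        else:
            # (word, pos) is a token outside any chunk.
            parts.append(first)
    return " ".join(parts)


chunks = [
    ("NP", [("نامه", "NOUN,EZ"), ("ایشان", "PRON")]),
    ("POSTP", [("را", "ADP")]),
    ("VP", [("دریافت", "NOUN"), ("داشتم", "VERB")]),
    (".", "PUNCT"),
]
brackets = chunks_to_brackets(chunks)
```

For the sentence used in the doctest above, `brackets` holds the same string that `tree2brackets` is documented to return.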