Chunker

The accuracy of the shallow parser in the current version is 93.4%.

This module contains classes and functions for shallow parsing (chunking) of text into noun, verb, and prepositional phrases.

Chunker

Bases: IOBTagger

Class for chunking text, training, and evaluating chunker models.

__init__(model=None, data_maker=None, repo_id=None, model_filename=None)

Creates a chunker, loading a model either from a local file or from the Hugging Face Hub.

Examples:

>>> # Loading from Hugging Face Hub
>>> chunker = Chunker(repo_id="roshan-research/hazm-chunker", model_filename="chunker.model")
>>> # Loading from a local model file
>>> # chunker = Chunker(model='resources/chunker.model')

Parameters:

model (str | Path | None, default: None): Path to the local model file.
data_maker (Any, default: None): Custom data maker function.
repo_id (str | None, default: None): Hugging Face repository ID (e.g., "roshan-research/hazm-chunker").
model_filename (str | None, default: None): Filename inside the repository (e.g., "chunker.model").

data_maker(tokens)

Converts tagged sentences into per-token feature dictionaries.

Examples:

>>> tokens = [[('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN,EZ'), ('ایران', 'NOUN'), ('رفته_بودم', 'VERB'), ('.', 'PUNCT')]]
>>> features = chunker.data_maker(tokens)
>>> features[0][0]['pos']
'PRON'

Parameters:

tokens (list[TaggedSentence], required): A list of tagged sentences.

Returns:

list[list[dict[str, Any]]]: A list of lists of feature dictionaries.
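
To make the shape of this return value concrete, here is a minimal sketch that maps each tagged sentence to a list of per-token dictionaries. The function name `make_features` and the `word` key are assumptions made for illustration. Only the `pos` key is confirmed by the example above, and the library's actual feature set may differ.

```python
from typing import Any

TaggedSentence = list[tuple[str, str]]


def make_features(tokens: list[TaggedSentence]) -> list[list[dict[str, Any]]]:
    """Hypothetical stand-in for a data maker: one feature dict per token."""
    return [
        [{"word": word, "pos": pos} for word, pos in sentence]
        for sentence in tokens
    ]


tokens = [[("من", "PRON"), ("رفتم", "VERB")]]
features = make_features(tokens)
print(features[0][0]["pos"])  # PRON
```

The outer list has one entry per sentence and each inner list has one dictionary per token, which is why the example indexes `features[0][0]`.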

evaluate(trees)

Evaluates the accuracy of the chunker.

Examples:

>>> trees = [chunker.parse([('نامه', 'NOUN,EZ'), ('ایشان', 'PRON'), ('را', 'ADP'), ('دریافت', 'NOUN'), ('داشتم', 'VERB'), ('.', 'PUNCT')])]
>>> chunker.evaluate(trees)
1.0

Parameters:

trees (list[Tree], required): A list of gold standard parse trees.

Returns:

float: The accuracy of the chunker.
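
This page does not state exactly how the accuracy is computed. Since the class derives from IOBTagger, one plausible reading is per-token agreement between predicted and gold chunk tags. The sketch below computes that quantity on hand-written IOB tags. The function name `tag_accuracy` and the tag sequences are illustrative assumptions, not the library's implementation.

```python
def tag_accuracy(gold: list[list[str]], predicted: list[list[str]]) -> float:
    """Fraction of tokens whose predicted chunk tag equals the gold tag."""
    total = 0
    correct = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        for gold_tag, pred_tag in zip(gold_sent, pred_sent):
            total += 1
            correct += gold_tag == pred_tag
    return correct / total if total else 0.0


gold = [["B-NP", "I-NP", "B-POSTP", "B-VP", "I-VP", "O"]]
pred = [["B-NP", "I-NP", "B-POSTP", "B-NP", "B-VP", "O"]]
score = tag_accuracy(gold, pred)  # 4 of 6 tags agree
```

Evaluating a model on trees it produced itself, as in the doctest above, trivially gives 1.0 under this definition.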

features(words, pos_tags, index)

Extracts features for a word at a given index.

Parameters:

words (list[str], required): List of words in the sentence.
pos_tags (list[str], required): List of POS tags for the words.
index (int, required): The index of the word to extract features for.

Returns:

dict[str, Any]: A dictionary of features.
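
As a rough sketch of what a per-index feature extractor with this signature can look like, the function below combines the current token with the POS tags of its neighbours. The name `window_features` and every key other than `pos` are hypothetical. The features the library actually extracts are not listed on this page.

```python
from typing import Any


def window_features(words: list[str], pos_tags: list[str], index: int) -> dict[str, Any]:
    """Hypothetical feature extractor: current token plus its neighbours."""
    last = len(words) - 1
    return {
        "word": words[index],
        "pos": pos_tags[index],
        # Empty strings mark a missing neighbour at the sentence boundary.
        "prev_pos": pos_tags[index - 1] if index > 0 else "",
        "next_pos": pos_tags[index + 1] if index < last else "",
        "is_first": index == 0,
        "is_last": index == last,
    }


words = ["من", "رفتم", "."]
pos_tags = ["PRON", "VERB", "PUNCT"]
feats = window_features(words, pos_tags, 1)
print(feats["prev_pos"], feats["next_pos"])  # PRON PUNCT
```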

parse(sentence)

Parses a tagged sentence into a chunk tree.

Examples:

>>> tree = chunker.parse([('نامه', 'NOUN,EZ'), ('ایشان', 'PRON'), ('را', 'ADP'), ('دریافت', 'NOUN'), ('داشتم', 'VERB'), ('.', 'PUNCT')])
>>> print(tree)
(S
  (NP نامه/NOUN,EZ ایشان/PRON)
  (POSTP را/ADP)
  (VP دریافت/NOUN داشتم/VERB)
  ./PUNCT)

Parameters:

sentence (TaggedSentence, required): A tagged sentence.

Returns:

Tree: The parsed chunk tree.
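
The step from per-token tags to the chunk structure printed above can be sketched with plain Python data. This assumes the tagger emits IOB labels such as `B-NP`, `I-NP`, and `O`. The tag sequence is hand-written to match the example tree, and `iob_to_chunks` is a hypothetical helper, not part of the library.

```python
from typing import Optional

Token = tuple[str, str]


def iob_to_chunks(
    tagged: list[Token], iob_tags: list[str]
) -> list[tuple[Optional[str], list[Token]]]:
    """Group tokens into (label, tokens) chunks; label is None outside chunks."""
    chunks: list = []
    for token, tag in zip(tagged, iob_tags):
        label = tag[2:] if tag != "O" else None
        # An I- tag extends the previous chunk only if the labels match.
        continues = tag.startswith("I-") and chunks and chunks[-1][0] == label
        if continues:
            chunks[-1][1].append(token)
        else:
            chunks.append((label, [token]))
    return chunks


tagged = [("نامه", "NOUN,EZ"), ("ایشان", "PRON"), ("را", "ADP"),
          ("دریافت", "NOUN"), ("داشتم", "VERB"), (".", "PUNCT")]
iob_tags = ["B-NP", "I-NP", "B-POSTP", "B-VP", "I-VP", "O"]
chunks = iob_to_chunks(tagged, iob_tags)
print([label for label, _ in chunks])  # ['NP', 'POSTP', 'VP', None]
```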

parse_sents(sentences)

Parses a list of tagged sentences into chunk trees.

Examples:

>>> sentences = [[('نامه', 'NOUN,EZ'), ('ایشان', 'PRON')], [('من', 'PRON'), ('رفتم', 'VERB')]]
>>> trees = list(chunker.parse_sents(sentences))

Parameters:

sentences (list[TaggedSentence], required): A list of tagged sentences.

Yields:

Tree: The parsed chunk tree for each sentence.

train(trees, c1=0.4, c2=0.04, max_iteration=400, verbose=True, file_name='chunker_crf.model', report_duration=True)

Trains the chunker model.

Parameters:

trees (list[Tree], required): A list of parse trees for training.
c1 (float, default: 0.4): Coefficient for L1 regularization.
c2 (float, default: 0.04): Coefficient for L2 regularization.
max_iteration (int, default: 400): Maximum number of iterations for training.
verbose (bool, default: True): Whether to print verbose output.
file_name (str, default: 'chunker_crf.model'): The name of the file to save the trained model.
report_duration (bool, default: True): Whether to report the training duration.
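
To give some intuition for `c1` and `c2`, the sketch below computes a combined L1 and L2 penalty on a toy weight vector. It assumes the common elastic-net form, `c1 * sum(|w|) + c2 * sum(w^2)`. The exact scaling used by the underlying trainer is not stated on this page, and `elastic_net_penalty` is a hypothetical helper.

```python
def elastic_net_penalty(weights: list[float], c1: float = 0.4, c2: float = 0.04) -> float:
    """Penalty c1 * sum(|w|) + c2 * sum(w^2); an assumed form, for illustration."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return c1 * l1 + c2 * l2


weights = [1.0, -2.0, 0.0]
penalty = elastic_net_penalty(weights)  # 0.4 * 3 + 0.04 * 5, roughly 1.4
```

A larger `c1` pushes more feature weights to exactly zero, while a larger `c2` shrinks all weights more smoothly.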

RuleBasedChunker

Bases: RegexpParser

Rule-based chunker using regular expressions.

__init__()

Constructor.
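
The grammar used by this class is not shown on this page. As a minimal sketch of the general technique, the example below encodes a sentence's POS tags as a string of `<TAG>` items and finds noun-phrase candidates with a regular expression, in the spirit of tag patterns used by regex chunkers. The rule `NP_RULE` and the helper `chunk_noun_phrases` are hypothetical, and only the standard `re` module is used.

```python
import re

# Hypothetical rule: a noun phrase is a run of NOUN or PRON tags.
NP_RULE = re.compile(r"(?:<(?:NOUN|PRON)>)+")


def chunk_noun_phrases(tagged: list[tuple[str, str]]) -> list[list[str]]:
    """Return the words of every tag run that matches NP_RULE."""
    tag_string = "".join(f"<{pos}>" for _, pos in tagged)
    # Character offset in tag_string at which each token's tag begins.
    starts = []
    offset = 0
    for _, pos in tagged:
        starts.append(offset)
        offset += len(pos) + 2  # two extra characters for "<" and ">"
    spans = []
    for match in NP_RULE.finditer(tag_string):
        first = starts.index(match.start())
        last = first + match.group().count("<")
        spans.append([word for word, _ in tagged[first:last]])
    return spans


tagged = [("من", "PRON"), ("به", "ADP"), ("مدرسه", "NOUN"),
          ("ایران", "NOUN"), ("رفتم", "VERB")]
noun_phrases = chunk_noun_phrases(tagged)
print(len(noun_phrases))  # 2
```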

SpacyChunker

Bases: Chunker

Chunker based on the spaCy library.

__init__(model_path=None, using_gpu=False, gpu_id=0, repo_id=None)

Creates a spaCy-based chunker, loading a model from a local path or from the Hugging Face Hub.

Parameters:

model_path (str | Path | None, default: None): Path to the local spaCy model.
using_gpu (bool, default: False): Whether to use GPU.
gpu_id (int, default: 0): The ID of the GPU to use.
repo_id (str | None, default: None): Hugging Face repository ID.

evaluate(test_sents)

Evaluates the accuracy of the chunker.

Parameters:

test_sents (list[ChunkedSentence], required): A list of chunked sentences for testing.

Returns:

ChunkScore: The chunk score.
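
Unlike a single accuracy figure, a chunk score is usually reported at the level of whole chunks. The sketch below computes precision, recall, and F1 over sets of `(label, start, end)` spans. It assumes the returned score follows this conventional definition, which this page does not confirm. The helper `chunk_prf` and the spans are illustrative.

```python
def chunk_prf(gold: set, predicted: set) -> tuple[float, float, float]:
    """Precision, recall, and F1 over sets of (label, start, end) chunk spans."""
    hits = len(gold & predicted)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


gold = {("NP", 0, 2), ("POSTP", 2, 3), ("VP", 3, 5)}
pred = {("NP", 0, 2), ("POSTP", 2, 3), ("VP", 4, 5)}
precision, recall, f1 = chunk_prf(gold, pred)  # 2 of 3 spans match exactly
```

A chunk counts as correct only when both its label and its boundaries match, so the truncated VP span above is scored as a miss.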

parse(sentence)

Parses a single sentence.

Parameters:

sentence (TaggedSentence, required): A tagged sentence.

Returns:

Tree: The parsed tree.

parse_sents(sentences, batch_size=128)

Parses multiple sentences.

Parameters:

sentences (list[TaggedSentence], required): A list of tagged sentences.
batch_size (int, default: 128): Batch size for processing.

Yields:

Tree: The parsed tree for each sentence.
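
The role of `batch_size` can be sketched independently of any model: sentences are grouped into batches, each batch is processed in one call, and results are still yielded one per sentence. The helpers `batched`, `parse_in_batches`, and `fake_parse_batch` are hypothetical stand-ins, not the library's internals.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(items: Iterable, batch_size: int = 128) -> Iterator[list]:
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def parse_in_batches(sentences: Iterable, parse_batch: Callable, batch_size: int = 128) -> Iterator:
    """Yield one result per sentence, calling parse_batch once per batch."""
    for batch in batched(sentences, batch_size):
        yield from parse_batch(batch)


calls = []


def fake_parse_batch(batch: list) -> list:
    """Stand-in for a model call: records the batch size, returns token counts."""
    calls.append(len(batch))
    return [len(sentence) for sentence in batch]


sentences = [[("من", "PRON")]] * 5
results = list(parse_in_batches(sentences, fake_parse_batch, batch_size=2))
print(calls)  # [2, 2, 1]
```

Because the function is a generator, nothing is processed until the caller iterates, which is why the examples on this page wrap `parse_sents` in `list(...)`.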

train(train_dataset, test_dataset, data_directory, base_config_file, train_config_path, output_dir, use_direct_config=False)

Trains the spaCy chunker model.

Parameters:

train_dataset (list[ChunkedSentence], required): The training dataset.
test_dataset (list[ChunkedSentence], required): The testing dataset.
data_directory (str, required): Directory to save processed data.
base_config_file (str, required): Path to the base configuration file.
train_config_path (str, required): Path to the training configuration file.
output_dir (str, required): Directory to save the trained model.
use_direct_config (bool, default: False): Whether to use the configuration file directly.

tree2brackets(tree)

Converts a tree object to a bracketed string representation.

Examples:

>>> chunker = Chunker(repo_id="roshan-research/hazm-chunker", model_filename="chunker.model")
>>> tree = chunker.parse([('نامه', 'NOUN,EZ'), ('ایشان', 'PRON'), ('را', 'ADP'), ('دریافت', 'NOUN'), ('داشتم', 'VERB'), ('.', 'PUNCT')])
>>> tree2brackets(tree)
'[نامه ایشان NP] [را POSTP] [دریافت داشتم VP] .'

Parameters:

tree (Tree, required): The parse tree to be converted.

Returns:

str: A bracketed string representation of the tree.
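
The bracketed format in the example above can be reproduced on plain Python data, which may help when post-processing chunker output. This is a sketch only: it represents the tree as a list of `(label, tokens)` chunks and bare `(word, pos)` tokens rather than the library's Tree class, and `to_brackets` is a hypothetical name.

```python
def to_brackets(tree: list) -> str:
    """Render chunks as '[words LABEL]' and leave unchunked tokens bare."""
    parts = []
    for first, second in tree:
        if isinstance(second, list):
            # Chunk node: (label, [(word, pos), ...])
            words = " ".join(word for word, _ in second)
            parts.append(f"[{words} {first}]")
        else:
            # Token outside any chunk: (word, pos)
            parts.append(first)
    return " ".join(parts)


tree = [
    ("NP", [("نامه", "NOUN,EZ"), ("ایشان", "PRON")]),
    ("POSTP", [("را", "ADP")]),
    ("VP", [("دریافت", "NOUN"), ("داشتم", "VERB")]),
    (".", "PUNCT"),
]
brackets = to_brackets(tree)
print(brackets)
```

Each chunk is written with its words first and its label last, and tokens outside any chunk, such as the final punctuation mark, are emitted without brackets.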