Dadegan Reader

This module includes classes and functions for reading the PerDT corpus.

PerDT (the Persian Dependency Treebank) contains a large number of sentences annotated with syntactic and morphological information.

DadeganReader

This class includes methods for reading the PerDT corpus.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| conll_file | str | Path to the corpus file in CoNLL format. | required |
| pos_map | Any | A function to map fine-grained tags to coarse-grained ones. | coarse_pos_e |
| universal_pos | bool | If True, uses universal POS tags. | False |

__init__(conll_file, pos_map=coarse_pos_e, universal_pos=False)

Initializes the DadeganReader.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| conll_file | str | Path to the corpus file. | required |
| pos_map | Any | Function for mapping tags. Defaults to coarse_pos_e. | coarse_pos_e |
| universal_pos | bool | Whether to use universal POS mapping. Defaults to False. | False |
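
The pos_map parameter accepts a function, so a custom mapping can stand in for the default. The following is a minimal sketch, assuming pos_map is called with the same (tags, word) arguments as the documented coarse_pos_e. The name custom_pos_map and the labels in CUSTOM_TAGS are hypothetical and are not part of the library.

```python
# Hypothetical lookup table; these labels are illustrative, not the library's mapping.
CUSTOM_TAGS = {"N": "NOUN_LIKE", "V": "VERB_LIKE"}


def custom_pos_map(tags, word):
    """Map a list of fine-grained tags to a single coarse label.

    Assumes the first element of `tags` is the main category, as in the
    documented call coarse_pos_e(['N', 'IANM'], 'امروز').
    The `word` argument is accepted for signature compatibility but unused here.
    """
    if not tags:
        return "OTHER"
    # Fall back to a catch-all label for categories missing from the table.
    return CUSTOM_TAGS.get(tags[0], "OTHER")


print(custom_pos_map(["N", "IANM"], "امروز"))
```

Under that assumption, such a function would be passed at construction time, for example DadeganReader(conll_file='dadegan.conll', pos_map=custom_pos_map).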

chunked_trees()

Yields the sentences of the corpus as chunked trees, with tokens grouped into phrases such as NP, PP, and VP.

Examples:

>>> from hazm.chunker import tree2brackets
>>> dadegan = DadeganReader(conll_file='dadegan.conll')
>>> tree2brackets(next(dadegan.chunked_trees()))
'[این میهمانی NP] [به PP] [منظور آشنایی هم‌تیمی‌های او NP] [با PP] [غذاهای ایرانی NP] [ترتیب داده_شد VP] .'

Yields:

| Type | Description |
|---|---|
| type[Tree] | The next sentence as a chunked tree structure. |
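
The bracketed string shown in the example can be split back into (text, label) pairs when only the chunk spans are needed. This is a rough sketch that works on the string form alone, assuming chunk labels are uppercase ASCII as in the example output. The helper parse_brackets is hypothetical and not part of the library.

```python
import re

# Matches "[<chunk text> <LABEL>]"; assumes labels are uppercase ASCII (NP, PP, VP, ...).
CHUNK_PATTERN = re.compile(r"\[([^\[\]]+) ([A-Z]+)\]")


def parse_brackets(text):
    """Return (chunk_text, label) pairs found in a bracketed chunk string.

    Tokens outside any bracket, such as the final period, are ignored.
    """
    return [(m.group(1), m.group(2)) for m in CHUNK_PATTERN.finditer(text)]


# Bracketed string copied from the documented chunked_trees() example.
bracketed = '[این میهمانی NP] [به PP] [منظور آشنایی هم‌تیمی‌های او NP] [با PP] [غذاهای ایرانی NP] [ترتیب داده_شد VP] .'
chunks = parse_brackets(bracketed)

for chunk_text, label in chunks:
    print(label, chunk_text)
```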

sents()

Yields the sentences of the corpus one at a time, each as a list of (token, tag) tuples.

Examples:

>>> dadegan = DadeganReader(conll_file='dadegan.conll')
>>> next(dadegan.sents())
[('این', 'DET'), ('میهمانی', 'N'), ('به', 'P'), ('منظور', 'Ne'), ('آشنایی', 'Ne'), ('هم‌تیمی‌های', 'Ne'), ('او', 'PRO'), ('با', 'P'), ('غذاهای', 'Ne'), ('ایرانی', 'AJ'), ('ترتیب', 'N'), ('داده_شد', 'V'), ('.', 'PUNC')]

Yields:

| Type | Description |
|---|---|
| list[tuple[str, str]] | The next sentence as a list of (token, tag) tuples. |
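
Because each sentence is a plain list of (token, tag) tuples, ordinary Python tools are enough to summarise the output. The sketch below counts tag frequencies over any iterable of such sentences. To stay self-contained it runs on the single sentence from the example above rather than on a real corpus file, and the helper tag_counts is hypothetical.

```python
from collections import Counter


def tag_counts(sentences):
    """Count how often each POS tag occurs across an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tag for _token, tag in sentence)
    return counts


# Stand-in for dadegan.sents(): the one sentence shown in the documented example.
sample = [
    [('این', 'DET'), ('میهمانی', 'N'), ('به', 'P'), ('منظور', 'Ne'),
     ('آشنایی', 'Ne'), ('هم‌تیمی‌های', 'Ne'), ('او', 'PRO'), ('با', 'P'),
     ('غذاهای', 'Ne'), ('ایرانی', 'AJ'), ('ترتیب', 'N'), ('داده_شد', 'V'),
     ('.', 'PUNC')],
]

counts = tag_counts(sample)
print(counts.most_common(3))
```

Since sents() yields sentences lazily, the same function could be given dadegan.sents() directly without loading the whole corpus into memory.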

trees()

Yields the dependency tree of each sentence in the corpus.

Yields:

| Type | Description |
|---|---|
| type[Tree] | The dependency tree of the next sentence. |

coarse_pos_e(tags, word)

Converts fine-grained tags to coarse-grained POS tags.

Examples:

>>> coarse_pos_e(['N', 'IANM'], 'امروز')
'N'

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tags | list[str] | A list of fine-grained tags. | required |
| word | str | The word associated with the tags. | required |

Returns:

| Type | Description |
|---|---|
| str | The corresponding coarse-grained POS tag. |

coarse_pos_u(tags, word)

Converts fine-grained tags to coarse-grained universal POS tags.

Examples:

>>> coarse_pos_u(['N', 'IANM'], 'امروز')
'NOUN'

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tags | list[str] | A list of fine-grained tags. | required |
| word | str | The word associated with the tags. | required |

Returns:

| Type | Description |
|---|---|
| str | The corresponding coarse-grained universal POS tag. |

node_deps(node)

Returns the values found in the 'deps' field of the input node.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| node | dict[str, Any] | The node dictionary. | required |

Returns:

| Type | Description |
|---|---|
| list[Any] | A list of dependency addresses. |
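
To illustrate what collecting the values of a 'deps' field can look like, here is a minimal sketch on a plain dictionary. It assumes the node follows the layout used by NLTK dependency graphs, where 'deps' maps each relation label to a list of dependent addresses; this page does not state that layout. The function flatten_deps, the relation labels, and the node contents are all hypothetical.

```python
def flatten_deps(node):
    """Collect every dependent address listed under the node's 'deps' field.

    Assumes node['deps'] maps relation labels to lists of addresses.
    """
    addresses = []
    for dependents in node["deps"].values():
        addresses.extend(dependents)
    return addresses


# Hypothetical node: relation labels and addresses are made up for illustration.
node = {"address": 2, "word": "w2", "deps": {"rel_a": [1], "rel_b": [3, 4]}}

dependents = flatten_deps(node)
print(dependents)
```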

word_nodes(tree)

Returns the nodes of the tree, sorted by address.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tree | type[Tree] | The dependency tree object. | required |

Returns:

| Type | Description |
|---|---|
| list[dict[str, Any]] | A sorted list of node dictionaries. |
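
Sorting by address restores the surface word order of a sentence, since addresses number the tokens from left to right. The sketch below shows the idea on plain dictionaries, assuming each node carries an integer 'address' key. The helper nodes_by_address and the sample nodes are hypothetical, and whether the library's function also returns an artificial root node is not stated here.

```python
def nodes_by_address(nodes):
    """Return node dictionaries ordered by their 'address' value.

    `nodes` is assumed to be a mapping from address to node dictionary.
    """
    return sorted(nodes.values(), key=lambda node: node["address"])


# Hypothetical nodes, deliberately stored out of order.
nodes = {
    3: {"address": 3, "word": "c"},
    1: {"address": 1, "word": "a"},
    2: {"address": 2, "word": "b"},
}

ordered = nodes_by_address(nodes)
print([node["word"] for node in ordered])
```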