Treebank Reader

This module includes classes and functions for reading the Treebank corpus.

The Treebank corpus contains thousands of tagged sentences with syntactic and morphological information.

TreebankReader

This class provides methods for reading the Treebank corpus.

Parameters:

root (str, required): Path to the folder containing corpus files.

pos_map (dict or callable, default: coarse_pos_e): A dictionary or function that converts fine-grained tags to coarse-grained tags.

join_clitics (bool, default: False): If True, joins clitics to their parent word.

join_verb_parts (bool, default: False): If True, joins the parts of multi-part verbs with an underscore.
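Because pos_map is described as either a dictionary or a function, it can help to picture both forms behind one callable. The sketch below is illustrative only: `as_tag_converter`, the `PosMap` alias, and the fallback rule for unknown tags are hypothetical and are not part of this module.

```python
from typing import Callable, Mapping, Sequence, Union

# Either a {fine_tag: coarse_tag} dictionary or a function over a tag list.
PosMap = Union[Mapping[str, str], Callable[[Sequence[str]], str]]


def as_tag_converter(pos_map: PosMap) -> Callable[[Sequence[str]], str]:
    """Return a function that maps a list of fine-grained tags to one coarse tag."""
    if callable(pos_map):
        return pos_map  # a function is used as-is

    def convert(tags: Sequence[str]) -> str:
        # Assumption: the first tag found in the dictionary decides the result;
        # if no tag is found, the first tag is returned unchanged.
        for tag in tags:
            if tag in pos_map:
                return pos_map[tag]
        return tags[0]

    return convert


from_dict = as_tag_converter({"Nasp---": "N"})
from_func = as_tag_converter(lambda tags: tags[0][0])
print(from_dict(["Nasp---", "pers", "prop"]))
print(from_func(["Nasp---", "pers", "prop"]))
```

Both calls print `N` here. How the reader itself treats a dictionary argument is not shown on this page, so the dictionary branch should be read as one possible convention.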

__init__(root, pos_map=coarse_pos_e, join_clitics=False, join_verb_parts=False)

Initializes the TreebankReader.

chunked_trees()

Yields the tree structure of each sentence in chunked form.

Examples:

>>> from hazm.chunker import tree2brackets
>>> treebank = TreebankReader(root='treebank')
>>> tree2brackets(next(treebank.chunked_trees()))
'[دنیای آدولف بورن NP] [دنیای اتفاقات رویایی NP] [است VP] .'

Yields:

Tree: The next chunked tree structure, which tree2brackets can render as a bracketed string.
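The bracketed string in the example above groups tokens into labeled chunks and leaves other tokens bare. As a rough illustration of that output format, the sketch below builds the same string from plain Python data. It is a stand-in, not hazm's `tree2brackets`: the function `chunks_to_brackets` and the `(tokens, label)` representation are assumptions made for this example.

```python
from typing import List, Sequence, Tuple, Union

# A chunk is (tokens, label); a bare string is a token outside any chunk.
Chunk = Union[Tuple[Sequence[str], str], str]


def chunks_to_brackets(chunks: Sequence[Chunk]) -> str:
    """Render chunks as '[tok tok LABEL]' groups separated by spaces."""
    parts: List[str] = []
    for chunk in chunks:
        if isinstance(chunk, str):
            parts.append(chunk)  # unchunked token, e.g. punctuation
        else:
            tokens, label = chunk
            parts.append("[" + " ".join(tokens) + " " + label + "]")
    return " ".join(parts)


sample = [
    (["دنیای", "آدولف", "بورن"], "NP"),
    (["دنیای", "اتفاقات", "رویایی"], "NP"),
    (["است"], "VP"),
    ".",
]
print(chunks_to_brackets(sample))
```

The printed line has the same shape as the doctest output: two NP chunks, one VP chunk, and a bare final period.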

docs()

Yields the documents available in the corpus.

Yields:

Any: The next document (XML object).

sents()

Yields sentences, each as a list of (token, tag) tuples.

Examples:

>>> treebank = TreebankReader(root='treebank')
>>> next(treebank.sents())
[('دنیای', 'Ne'), ('آدولف', 'N'), ('بورن', 'N'), ('دنیای', 'Ne'), ('اتفاقات', 'Ne'), ('رویایی', 'AJ'), ('است', 'V'), ('.', 'PUNC')]

Yields:

list[tuple[str, str]]: The next sentence in the corpus.
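Since each sentence is a plain list of (token, tag) tuples, ordinary Python tools apply directly. The sketch below counts tag frequencies over an iterable of such sentences. It uses the example sentence above as inline sample data so that it runs without the corpus; `tag_counts` is a hypothetical helper, not part of the module.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def tag_counts(sentences: Iterable[Sentence]) -> Counter:
    """Count how often each tag occurs across all sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(tag for _token, tag in sentence)
    return counts


sample = [[
    ("دنیای", "Ne"), ("آدولف", "N"), ("بورن", "N"), ("دنیای", "Ne"),
    ("اتفاقات", "Ne"), ("رویایی", "AJ"), ("است", "V"), (".", "PUNC"),
]]
print(tag_counts(sample).most_common(2))  # [('Ne', 3), ('N', 2)]
```

With a real reader, the same function could presumably be given `treebank.sents()` in place of `sample`, since it only needs an iterable of sentences.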

trees()

Yields the tree structures available in the corpus.

Examples:

>>> treebank = TreebankReader(root='treebank')
>>> print(next(treebank.trees()))
(S
  (VPS
    (NPC (N دنیای/Ne) (MN (N آدولف/N) (N بورن/N)))
    (VPC
      (NPC (N دنیای/Ne) (NPA (N اتفاقات/Ne) (ADJ رویایی/AJ)))
      (V است/V)))
  (PUNC ./PUNC))

Yields:

Tree: The next tree structure in the corpus.
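In the printed tree above, every leaf has the form `(LABEL token/tag)`, and reading the leaves left to right gives the same (token, tag) pairs that sents() shows for this sentence. The sketch below recovers those pairs from the printed text with a regular expression. It works on the text only and is an illustration of the layout; `leaves_from_text` is a hypothetical helper, and a tree object would normally be traversed through its own methods instead.

```python
import re
from typing import List, Tuple

# A leaf looks like "(N token/Ne)": a label, a space, then token/tag,
# with no nested parentheses inside.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)/([^\s()]+)\)")


def leaves_from_text(tree_text: str) -> List[Tuple[str, str]]:
    """Return (token, tag) pairs for the leaves, in left-to-right order."""
    return [(token, tag) for _label, token, tag in LEAF.findall(tree_text)]


tree_text = """(S
  (VPS
    (NPC (N دنیای/Ne) (MN (N آدولف/N) (N بورن/N)))
    (VPC
      (NPC (N دنیای/Ne) (NPA (N اتفاقات/Ne) (ADJ رویایی/AJ)))
      (V است/V)))
  (PUNC ./PUNC))"""

for token, tag in leaves_from_text(tree_text):
    print(token, tag)
```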

coarse_pos_e(tags)

Converts a list of fine-grained POS tags to a single coarse-grained POS tag.

Examples:

>>> coarse_pos_e(['Nasp---', 'pers', 'prop'])
'N'

Parameters:

tags (list[str], required): A list of fine-grained POS tags.

Returns:

str: The corresponding coarse-grained POS tag.
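A function with this signature can also be written by hand and passed as pos_map. The sketch below shows one possible shape: it derives the coarse tag from the first letter of the first fine-grained tag. Everything specific in it is an assumption for illustration, including the name `coarse_tag_sketch`, the three-entry lookup table, and the rule that an 'ezafe' feature adds an 'e' suffix (suggested by tags such as 'Ne' in the examples above). It should not be read as the behavior of coarse_pos_e itself.

```python
from typing import Sequence

# Hypothetical first-letter table; the module's real mapping may differ.
FIRST_LETTER_TO_COARSE = {"N": "N", "V": "V", "A": "AJ"}


def coarse_tag_sketch(tags: Sequence[str]) -> str:
    """Map a list of fine-grained tags to one coarse tag (illustrative only)."""
    first_letter = tags[0][0]
    # Unknown letters are passed through unchanged.
    coarse = FIRST_LETTER_TO_COARSE.get(first_letter, first_letter)
    # Assumption: an 'ezafe' feature is marked with an 'e' suffix, as in 'Ne'.
    return coarse + ("e" if "ezafe" in tags else "")


print(coarse_tag_sketch(["Nasp---", "pers", "prop"]))
print(coarse_tag_sketch(["Nasp---", "ezafe"]))
```

The first call prints `N`, matching the doctest input above, and the second prints `Ne` under the assumed suffix rule.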