Treebank Reader
This module includes classes and functions for reading the Treebank corpus.
The Treebank corpus contains thousands of tagged sentences with syntactic and morphological information.
TreebankReader
This class includes functions for reading the Treebank corpus.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `root` | `str` | Path to the folder containing corpus files. | required |
| `pos_map` | `str` | A dictionary or function that converts fine-grained tags to coarse-grained ones. | `coarse_pos_e` |
| `join_clitics` | `bool` | If `True`, clitics are joined to their host word. | `False` |
| `join_verb_parts` | `bool` | If `True`, the parts of a multi-part verb are joined. | `False` |
__init__(root, pos_map=coarse_pos_e, join_clitics=False, join_verb_parts=False)
Initializes the TreebankReader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `root` | `str` | Path to the folder containing corpus files. | required |
| `pos_map` | `str` | A dictionary or function that converts fine-grained tags to coarse-grained ones. | `coarse_pos_e` |
| `join_clitics` | `bool` | If `True`, clitics are joined to their host word. | `False` |
| `join_verb_parts` | `bool` | If `True`, the parts of a multi-part verb are joined. | `False` |
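Because `pos_map` defaults to `coarse_pos_e`, a replacement presumably needs the same shape: a callable that takes a list of fine-grained tags and returns one coarse tag. The sketch below is only an illustration of that shape. `first_letter_pos` and `COARSE_BY_LETTER` are hypothetical names that are not part of the library, and the lookup table is invented for the example rather than taken from the corpus tag set.

```python
# Hypothetical lookup table: first letter of the fine-grained tag -> coarse tag.
# The entries are illustrative only, not the corpus's actual tag inventory.
COARSE_BY_LETTER = {"N": "N", "V": "V", "A": "AJ"}


def first_letter_pos(tags):
    """Map a list of fine-grained tags to a single coarse tag (illustrative)."""
    if not tags or not tags[0]:
        return "UNK"  # nothing to inspect
    # Look only at the first character of the first tag.
    return COARSE_BY_LETTER.get(tags[0][0], "UNK")


print(first_letter_pos(["Nasp---", "pers", "prop"]))  # N

# Assumed usage, mirroring the documented signature:
# treebank = TreebankReader(root="treebank", pos_map=first_letter_pos)
```

Whether the reader calls `pos_map` with exactly this argument is an assumption based on the `coarse_pos_e(tags)` signature documented on this page.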
chunked_trees()
Yields the trees of the corpus in chunked form, one per sentence.
Examples:
>>> from hazm.chunker import tree2brackets
>>> treebank = TreebankReader(root='treebank')
>>> tree2brackets(next(treebank.chunked_trees()))
'[دنیای آدولف بورن NP] [دنیای اتفاقات رویایی NP] [است VP] .'
Yields:
| Type | Description |
|---|---|
| `str` | The next chunked tree structure. |
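The bracketed string in the example above is easy to take apart again if the chunks are needed as data. The following is a minimal sketch that works only on that textual form; `parse_brackets` is a hypothetical helper, not a library function, and it assumes every chunk is written as `[token ... LABEL]` with the label as the last whitespace-separated item.

```python
import re

# Bracketed output copied from the chunked_trees() example.
bracketed = "[دنیای آدولف بورن NP] [دنیای اتفاقات رویایی NP] [است VP] ."


def parse_brackets(text):
    """Split a bracketed chunk string into (tokens, label) pairs."""
    chunks = []
    for body in re.findall(r"\[([^\]]+)\]", text):
        # The last item inside the brackets is the chunk label.
        *tokens, label = body.split()
        chunks.append((tokens, label))
    return chunks


chunks = parse_brackets(bracketed)
for tokens, label in chunks:
    print(label, len(tokens))
```

Tokens outside any bracket, such as the final period, are ignored by this sketch.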
docs()
Yields the documents available in the corpus.
Yields:
| Type | Description |
|---|---|
| `Any` | The next document (XML object). |
sents()
Yields the sentences of the corpus, each as a list of (token, tag) tuples.
Examples:
>>> treebank = TreebankReader(root='treebank')
>>> next(treebank.sents())
[('دنیای', 'Ne'), ('آدولف', 'N'), ('بورن', 'N'), ('دنیای', 'Ne'), ('اتفاقات', 'Ne'), ('رویایی', 'AJ'), ('است', 'V'), ('.', 'PUNC')]
Yields:
| Type | Description |
|---|---|
| `list[tuple[str, str]]` | The next sentence in the corpus. |
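Since each sentence is a plain list of (token, tag) tuples, standard Python tools apply directly. As a small sketch, the snippet below counts tag frequencies for the sample sentence shown above; the sentence is copied in as a literal so the code runs without the corpus.

```python
from collections import Counter

# Sample sentence copied from the sents() example.
sentence = [
    ("دنیای", "Ne"), ("آدولف", "N"), ("بورن", "N"), ("دنیای", "Ne"),
    ("اتفاقات", "Ne"), ("رویایی", "AJ"), ("است", "V"), (".", "PUNC"),
]

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in sentence)

# Separate the tokens from their tags.
tokens = [token for token, _ in sentence]

print(tag_counts.most_common(2))  # [('Ne', 3), ('N', 2)]
```

With a real corpus, the same loop could run over every sentence yielded by `treebank.sents()`.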
trees()
Yields the tree structures available in the corpus.
Examples:
>>> treebank = TreebankReader(root='treebank')
>>> print(next(treebank.trees()))
(S
  (VPS
    (NPC (N دنیای/Ne) (MN (N آدولف/N) (N بورن/N)))
    (VPC
      (NPC (N دنیای/Ne) (NPA (N اتفاقات/Ne) (ADJ رویایی/AJ)))
      (V است/V)))
  (PUNC ./PUNC))
Yields:
| Type | Description |
|---|---|
| `str` | The next tree structure in the corpus. |
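The leaves of the printed tree carry the same `token/tag` information that `sents()` reports. The sketch below makes that relationship concrete by pulling the leaves out of the printed text with a regular expression. It operates on the printed representation only and is not how the library derives sentences; `LEAF` is a hypothetical pattern that assumes tokens contain no whitespace, parentheses, or slashes.

```python
import re

# Printed tree copied from the trees() example.
printed_tree = """(S
  (VPS
    (NPC (N دنیای/Ne) (MN (N آدولف/N) (N بورن/N)))
    (VPC
      (NPC (N دنیای/Ne) (NPA (N اتفاقات/Ne) (ADJ رویایی/AJ)))
      (V است/V)))
  (PUNC ./PUNC))"""

# A leaf looks like "(LABEL token/tag)"; capture the token and the tag.
LEAF = re.compile(r"\(\S+ ([^\s()/]+)/([^\s()]+)\)")

pairs = LEAF.findall(printed_tree)
for token, tag in pairs:
    print(token, tag)
```

For this tree, `pairs` holds eight (token, tag) tuples in sentence order.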
coarse_pos_e(tags)
Converts fine-grained POS tags to coarse-grained POS tags.
Examples:
>>> coarse_pos_e(['Nasp---', 'pers', 'prop'])
'N'
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tags` | `list[str]` | A list of fine-grained POS tags. | required |
Returns:
| Type | Description |
|---|---|
| `str` | The corresponding coarse-grained POS tag. |