Dadegan Reader
This module includes classes and functions for reading the PerDT corpus.
PerDT (the Persian Dependency Treebank) contains a large number of sentences annotated with syntactic and morphological information.
DadeganReader
This class includes methods for reading the PerDT corpus.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `conll_file` | `str` | Path to the corpus file in CoNLL format. | required |
| `pos_map` | `Any` | A function to map fine-grained tags to coarse-grained ones. | `coarse_pos_e` |
| `universal_pos` | `bool` | If `True`, use universal POS mapping. | `False` |
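Since `conll_file` points to a CoNLL-format file, it may help to see how such text is laid out: one token per line with tab-separated columns, and a blank line between sentences. The following is a minimal sketch of that layout only, not the way `DadeganReader` reads the file. The function name `split_conll_sentences` and the three-column sample are hypothetical and made up for illustration.

```python
def split_conll_sentences(text):
    """Split CoNLL-style text into sentences, each a list of column lists."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Columns are tab-separated; keep them as plain strings.
        current.append(line.split("\t"))
    if current:
        # The last sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences


# Made-up sample with only three columns (ID, FORM, POS) for brevity;
# real CoNLL files carry more columns per token.
sample = "1\tاو\tPR\n2\tآمد\tV\n\n1\tسلام\tN\n"
sentences = split_conll_sentences(sample)
print(len(sentences))      # number of sentences found
print(sentences[0][1])     # columns of the second token in the first sentence
```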
__init__(conll_file, pos_map=coarse_pos_e, universal_pos=False)
Initializes the DadeganReader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `conll_file` | `str` | Path to the corpus file. | required |
| `pos_map` | `Any` | Function for mapping tags. Defaults to `coarse_pos_e`. | `coarse_pos_e` |
| `universal_pos` | `bool` | Whether to use universal POS mapping. Defaults to `False`. | `False` |
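A custom `pos_map` can presumably be any callable with the same signature as `coarse_pos_e(tags, word)`. The sketch below rests on that assumption; the function `first_tag_pos` and its `"X"` fallback are hypothetical and not part of the library.

```python
def first_tag_pos(tags, word):
    """Hypothetical pos_map: keep only the first fine-grained tag."""
    # `word` is unused here; it is kept so the signature matches
    # coarse_pos_e(tags, word) as documented on this page.
    # The "X" fallback for an empty tag list is this sketch's own choice.
    return tags[0] if tags else "X"


print(first_tag_pos(["N", "IANM"], "امروز"))

# Assumed usage (requires hazm and a local corpus file):
# dadegan = DadeganReader(conll_file='dadegan.conll', pos_map=first_tag_pos)
```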
chunked_trees()
Yields dependency trees of sentences with chunking information.
Examples:
>>> from hazm.chunker import tree2brackets
>>> from hazm import DadeganReader
>>> dadegan = DadeganReader(conll_file='dadegan.conll')
>>> tree2brackets(next(dadegan.chunked_trees()))
'[این میهمانی NP] [به PP] [منظور آشنایی همتیمیهای او NP] [با PP] [غذاهای ایرانی NP] [ترتیب داده_شد VP] .'
Yields:
| Type | Description |
|---|---|
| `type[Tree]` | The next sentence as a chunked tree structure. |
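The bracketed string produced by `tree2brackets` in the example above can be split back into (phrase, label) pairs with a regular expression. This is a rough sketch that assumes each chunk has the form `[words LABEL]`, as in that example; `parse_brackets` is a hypothetical helper, and the sample string is a shortened one in the same format.

```python
import re

# One chunk: "[", the words, a space, the label, "]".
CHUNK_PATTERN = re.compile(r"\[([^\[\]]+) (\w+)\]")


def parse_brackets(bracketed):
    """Return (phrase, label) pairs for every bracketed chunk in the string."""
    return [(m.group(1), m.group(2)) for m in CHUNK_PATTERN.finditer(bracketed)]


sample = "[این میهمانی NP] [به PP] [ترتیب داده_شد VP] ."
chunks = parse_brackets(sample)
for phrase, label in chunks:
    print(label, phrase)
```

Tokens outside any bracket, such as the final period, are ignored by this sketch.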
sents()
Yields sentences, each as a list of (token, tag) tuples.
Examples:
>>> from hazm import DadeganReader
>>> dadegan = DadeganReader(conll_file='dadegan.conll')
>>> next(dadegan.sents())
[('این', 'DET'), ('میهمانی', 'N'), ('به', 'P'), ('منظور', 'Ne'), ('آشنایی', 'Ne'), ('همتیمیهای', 'Ne'), ('او', 'PRO'), ('با', 'P'), ('غذاهای', 'Ne'), ('ایرانی', 'AJ'), ('ترتیب', 'N'), ('داده_شد', 'V'), ('.', 'PUNC')]
Yields:
| Type | Description |
|---|---|
| `list[tuple[str, str]]` | The next sentence as a list of (token, tag) tuples. |
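Because each sentence is a list of (token, tag) tuples, tag frequencies can be tallied with a `Counter`. The sketch below uses a small made-up sample in the same shape rather than the real corpus; `tag_counts` is a hypothetical helper.

```python
from collections import Counter


def tag_counts(sentences):
    """Count how often each tag occurs across an iterable of tagged sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tag for _, tag in sentence)
    return counts


# Made-up sample in the same (token, tag) shape as the documented output.
sample_sents = [
    [("این", "DET"), ("میهمانی", "N"), ("ترتیب", "N"), (".", "PUNC")],
]
counts = tag_counts(sample_sents)
print(counts["N"], counts["DET"])
```

Since the function accepts any iterable, the generator returned by `dadegan.sents()` could presumably be passed in directly.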
trees()
Yields the dependency tree of each sentence.
Yields:
| Type | Description |
|---|---|
| `type[Tree]` | The dependency tree of the next sentence. |
coarse_pos_e(tags, word)
Converts fine-grained tags to coarse-grained POS tags.
Examples:
>>> coarse_pos_e(['N', 'IANM'], 'امروز')
'N'
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tags` | `list[str]` | A list of fine-grained tags. | required |
| `word` | `str` | The word associated with the tags. | required |
Returns:
| Type | Description |
|---|---|
| `str` | The corresponding coarse-grained POS tag. |
coarse_pos_u(tags, word)
Converts fine-grained tags to coarse-grained universal POS tags.
Examples:
>>> coarse_pos_u(['N', 'IANM'], 'امروز')
'NOUN'
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tags` | `list[str]` | A list of fine-grained tags. | required |
| `word` | `str` | The word associated with the tags. | required |
Returns:
| Type | Description |
|---|---|
| `str` | The corresponding coarse-grained universal POS tag. |
node_deps(node)
Returns the values found in the 'deps' field of the input node.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `node` | `dict[str, Any]` | The node dictionary. | required |
Returns:
| Type | Description |
|---|---|
| `list[Any]` | A list of dependency addresses. |
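To illustrate what collecting the values of a `'deps'` field can look like, here is a minimal sketch. It assumes the node follows the layout used by NLTK's `DependencyGraph`, where `'deps'` maps a relation label to a list of dependent addresses. `flatten_deps` is a hypothetical stand-in, not the library's implementation, and the relation labels in the sample are illustrative.

```python
def flatten_deps(node):
    """Collect every address stored under node['deps'] into one flat list."""
    addresses = []
    for dependents in node["deps"].values():
        addresses.extend(dependents)
    return addresses


# Hand-built node; real nodes come from a parsed dependency tree.
node = {"address": 2, "deps": {"NPREMOD": [1], "NPP": [3, 5]}}
print(flatten_deps(node))
```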
word_nodes(tree)
Returns the nodes of the tree, sorted by address.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tree` | `type[Tree]` | The dependency tree object. | required |
Returns:
| Type | Description |
|---|---|
| `list[dict[str, Any]]` | A sorted list of node dictionaries. |
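Sorting node dictionaries by address can be sketched as follows. The example assumes the tree keeps its nodes in a mapping from address to node dictionary, as NLTK's `DependencyGraph` does. `nodes_by_address` is a hypothetical helper, and the hand-built nodes stand in for a real tree. Whether the library's `word_nodes` skips the artificial root node is not stated here; this sketch keeps it.

```python
def nodes_by_address(nodes):
    """Return the node dictionaries sorted by their 'address' value."""
    return sorted(nodes.values(), key=lambda node: node["address"])


# Hand-built mapping, deliberately out of order.
nodes = {
    2: {"address": 2, "word": "آمد"},
    0: {"address": 0, "word": None},  # artificial root, as in NLTK graphs
    1: {"address": 1, "word": "او"},
}
ordered = nodes_by_address(nodes)
print([node["address"] for node in ordered])
```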