Constituency parsing is the process of analyzing a sentence by breaking it down into sub-phrases, also known as constituents.
To parse a tokenized sentence into a constituency tree, first load a parser:
import hanlp
con = hanlp.load(hanlp.pretrained.constituency.CTB9_CON_FULL_TAG_ELECTRA_SMALL)
Then pass it a sequence of tokens, or multiple sequences, to parse:
tree = con(["2021年", "HanLPv2.1", "带来", "最", "先进", "的", "多", "语种", "NLP", "技术", "。"])
The constituency tree is a nested list of constituents, shown here first as a list and then in bracketed form:
['TOP', [['IP', [['NP-TMP', [['_', ['2021年']]]], ['NP-PN-SBJ', [['_', ['HanLPv2.1']]]], ['VP', [['_', ['带来']], ['NP-OBJ', [['CP', [['CP', [['IP', [['VP', [['ADVP', [['_', ['最']]]], ['VP', [['_', ['先进']]]]]]]], ['_', ['的']]]]]], ['NP', [['QP', [['_', ['多']]]], ['NP', [['_', ['语种']]]]]], ['NP', [['_', ['NLP']], ['_', ['技术']]]]]]]], ['_', ['。']]]]]]
(TOP (IP (NP-TMP (_ 2021年)) (NP-PN-SBJ (_ HanLPv2.1)) (VP (_ 带来) (NP-OBJ (CP (CP (IP (VP (ADVP (_ 最)) (VP (_ 先进)))) (_ 的))) (NP (QP (_ 多)) (NP (_ 语种))) (NP (_ NLP) (_ 技术)))) (_ 。)))
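To make the relationship between the two representations concrete, here is a minimal sketch that walks a nested list of the shape shown above and renders it in bracketed form. It assumes every node is a `[label, children]` pair and every leaf is a plain string, as in the example output. The helper names `to_bracketed` and `leaves` are hypothetical and not part of HanLP, and the object returned by the parser may offer its own methods for this.

```python
def to_bracketed(node):
    """Render a [label, children] node (or a leaf string) in bracketed form."""
    if isinstance(node, str):
        return node  # a leaf is just the token itself
    label, children = node
    return "(" + label + " " + " ".join(to_bracketed(c) for c in children) + ")"


def leaves(node):
    """Return the tokens covered by a node, in left-to-right order."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [token for child in children for token in leaves(child)]


# A sub-tree copied from the nested list above.
subtree = ['NP', [['QP', [['_', ['多']]]], ['NP', [['_', ['语种']]]]]]

print(to_bracketed(subtree))  # (NP (QP (_ 多)) (NP (_ 语种)))
print(leaves(subtree))        # ['多', '语种']
```

The same recursion applies to the full tree, since `TOP` is simply the outermost `[label, children]` pair.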
All the pre-trained parsers and their scores are listed below.
- hanlp.pretrained.constituency.CTB9_CON_ELECTRA_SMALL = 'https://file.hankcs.com/hanlp/constituency/ctb9_con_electra_small_20220215_230116.zip'
Electra (Clark et al. 2020) small tree CRF model (Zhang et al. 2020) trained on CTB9 with major categories. Its performance is UCM=39.06% LCM=34.99% UP=90.05% UR=90.01% UF=90.03% LP=87.02% LR=86.98% LF=87.00%.
- hanlp.pretrained.constituency.CTB9_CON_FULL_TAG_ELECTRA_SMALL = 'https://file.hankcs.com/hanlp/constituency/ctb9_full_tag_con_electra_small_20220118_103119.zip'
Electra (Clark et al. 2020) small tree CRF model (Zhang et al. 2020) trained on CTB9 with full subcategories. Its performance is UCM=38.29% LCM=28.95% UP=90.16% UR=90.13% UF=90.15% LP=83.46% LR=83.43% LF=83.45%.
- hanlp.pretrained.constituency.CTB9_CON_FULL_TAG_ERNIE_GRAM = 'http://download.hanlp.com/constituency/extra/ctb9_full_tag_con_ernie_20220331_121430.zip'
ERNIE-GRAM (Xiao et al. 2021) base tree CRF model (Zhang et al. 2020) trained on CTB9 with full subcategories. Its performance is UCM=42.04% LCM=31.72% UP=91.33% UR=91.53% UF=91.43% LP=85.31% LR=85.49% LF=85.40%.
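The scores above are given as abbreviations. Assuming they follow the usual span-based convention, UP/UR/UF are unlabeled precision, recall and F1 over constituent spans, LP/LR/LF are the labeled counterparts, and UCM/LCM are the fractions of sentences whose whole tree matches. The sketch below illustrates how such span scores could be computed from two nested-list trees. It is a simplification and not HanLP's evaluation code: the helper names `collect_spans` and `prf` are hypothetical, the predicted tree is made up for illustration, and skipping the `_` preterminal nodes is an assumption about what counts as a constituent.

```python
from collections import Counter


def collect_spans(node, start=0, out=None):
    """Collect (label, start, end) for each phrase node; return (spans, end)."""
    if out is None:
        out = []
    if isinstance(node, str):
        return out, start + 1  # a token occupies one position
    label, children = node
    end = start
    for child in children:
        _, end = collect_spans(child, end, out)
    if label != '_':  # assumption: preterminal nodes are not scored
        out.append((label, start, end))
    return out, end


def prf(gold, pred):
    """Precision, recall and F1 over two multisets of spans."""
    g, p = Counter(gold), Counter(pred)
    matched = sum((g & p).values())  # multiset intersection
    precision = matched / sum(p.values()) if p else 0.0
    recall = matched / sum(g.values()) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Gold sub-tree from the example above; the prediction mislabels QP as ADJP.
gold_tree = ['NP', [['QP', [['_', ['多']]]], ['NP', [['_', ['语种']]]]]]
pred_tree = ['NP', [['ADJP', [['_', ['多']]]], ['NP', [['_', ['语种']]]]]]

gold, _ = collect_spans(gold_tree)
pred, _ = collect_spans(pred_tree)

labeled = prf(gold, pred)
unlabeled = prf([s[1:] for s in gold], [s[1:] for s in pred])
print("LP, LR, LF:", labeled)
print("UP, UR, UF:", unlabeled)
```

In this toy case all three spans have the right boundaries, so the unlabeled scores are perfect, while only two of the three labels match, which lowers the labeled scores. This mirrors the gap between UF and LF in the figures reported for each model.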