constituency

Constituency parsing is the process of analyzing a sentence by breaking it down into sub-phrases, also known as constituents.

To parse a tokenized sentence into a constituency tree, first load a parser:

import hanlp

con = hanlp.load(hanlp.pretrained.constituency.CTB9_CON_FULL_TAG_ELECTRA_SMALL)

Then pass a sequence of tokens (or multiple sequences) to it:

tree = con(["2021年", "HanLPv2.1", "带来", "最", "先进", "的", "多", "语种", "NLP", "技术", "。"])

The constituency tree is a nested list of constituents:

tree
['TOP', [['IP', [['NP-TMP', [['_', ['2021年']]]], ['NP-PN-SBJ', [['_', ['HanLPv2.1']]]], ['VP', [['_', ['带来']], ['NP-OBJ', [['CP', [['CP', [['IP', [['VP', [['ADVP', [['_', ['最']]]], ['VP', [['_', ['先进']]]]]]]], ['_', ['的']]]]]], ['NP', [['QP', [['_', ['多']]]], ['NP', [['_', ['语种']]]]]], ['NP', [['_', ['NLP']], ['_', ['技术']]]]]]]], ['_', ['。']]]]]]
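Since the tree is just nested Python lists (each node a [label, children] pair, with leaf tokens as bare strings under a preterminal such as _), you can walk it with ordinary recursion. A minimal sketch, using a subtree taken from the output above, that recovers the leaf tokens in order; this is not a HanLP API, just plain Python over the returned structure:

```python
def leaves(node):
    """Yield the leaf tokens of a nested-list constituency tree, left to right."""
    label, children = node
    for child in children:
        if isinstance(child, str):   # a token under a preterminal such as '_'
            yield child
        else:                        # an internal [label, children] node
            yield from leaves(child)

# Subtree for "多 语种" taken from the parse above.
subtree = ['NP', [['QP', [['_', ['多']]]], ['NP', [['_', ['语种']]]]]]
print(list(leaves(subtree)))  # ['多', '语种']
```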

You can call str or print on it to get its bracketed form:

print(tree)
(TOP
  (IP
    (NP-TMP (_ 2021年))
    (NP-PN-SBJ (_ HanLPv2.1))
    (VP
      (_ 带来)
      (NP-OBJ
        (CP (CP (IP (VP (ADVP (_ 最)) (VP (_ 先进)))) (_ 的)))
        (NP (QP (_ 多)) (NP (_ 语种)))
        (NP (_ NLP) (_ 技术))))
    (_ 。)))
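The bracketed form can also be produced by hand, which makes the correspondence between the nested lists and the S-expression explicit. A minimal sketch of a one-line renderer (pure Python, not part of the HanLP API; print(tree) above additionally indents nested constituents):

```python
def bracket(node):
    """Render a nested-list constituency node as a one-line S-expression."""
    label, children = node
    parts = [child if isinstance(child, str) else bracket(child)
             for child in children]
    return '(%s %s)' % (label, ' '.join(parts))

# The NP-TMP constituent from the parse above.
print(bracket(['NP-TMP', [['_', ['2021年']]]]))  # (NP-TMP (_ 2021年))
```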

All the pre-trained parsers and their scores are listed below.

hanlp.pretrained.constituency.CTB9_CON_ELECTRA_SMALL = 'https://file.hankcs.com/hanlp/constituency/ctb9_con_electra_small_20220215_230116.zip'

Electra (Clark et al. 2020) small tree CRF model (Zhang et al. 2020) trained on CTB9 with major categories. Its performance is UCM=39.06% LCM=34.99% UP=90.05% UR=90.01% UF=90.03% LP=87.02% LR=86.98% LF=87.00%.

hanlp.pretrained.constituency.CTB9_CON_FULL_TAG_ELECTRA_SMALL = 'https://file.hankcs.com/hanlp/constituency/ctb9_full_tag_con_electra_small_20220118_103119.zip'

Electra (Clark et al. 2020) small tree CRF model (Zhang et al. 2020) trained on CTB9 with full subcategories. Its performance is UCM=38.29% LCM=28.95% UP=90.16% UR=90.13% UF=90.15% LP=83.46% LR=83.43% LF=83.45%.

hanlp.pretrained.constituency.CTB9_CON_FULL_TAG_ERNIE_GRAM = 'http://download.hanlp.com/constituency/extra/ctb9_full_tag_con_ernie_20220331_121430.zip'

ERNIE-GRAM (Xiao et al. 2021) base tree CRF model (Zhang et al. 2020) trained on CTB9 with full subcategories. Its performance is UCM=42.04% LCM=31.72% UP=91.33% UR=91.53% UF=91.43% LP=85.31% LR=85.49% LF=85.40%.