tok

Tokenization is the task of segmenting a sentence into smaller units called tokens. In lexical analysis, tokens usually correspond to words.
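
For space-delimited languages, a naive rule-based tokenizer already goes a long way; Chinese text, however, has no spaces between words, which is why the learned segmentation models below are needed. A minimal illustration in plain Python (independent of HanLP; the regex rule is only a sketch):

```python
import re

def naive_tokenize(sentence: str) -> list:
    # Split into runs of word characters or single punctuation marks.
    # Works tolerably for English, where words are space-delimited.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(naive_tokenize("Goods and services."))  # ['Goods', 'and', 'services', '.']
# Chinese has no spaces, so the whole run of Hanzi comes back as one "word":
print(naive_tokenize("商品和服务。"))  # ['商品和服务', '。']
```

The second call shows the failure mode: without a trained model there is no signal for where one Chinese word ends and the next begins.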

To tokenize raw sentences:

>>> import hanlp
>>> tok = hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH)
>>> tok(['商品和服务。', '阿婆主来到北京立方庭参观自然语义科技公司'])
[['商品', '和', '服务', '。'], ['阿婆主', '来到', '北京立方庭', '参观', '自然语义科技公司']]
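
The tokenizer returns plain strings. If character offsets are needed (e.g. for highlighting matches in the source text), they can be recovered by scanning the sentence, since the tokens concatenate back to it. A small helper sketch (not part of the HanLP API; it assumes tokens appear in order and cover the sentence, as in the example above):

```python
def token_offsets(sentence: str, tokens: list) -> list:
    """Map each token to its (begin, end) character span in the sentence."""
    offsets, start = [], 0
    for token in tokens:
        begin = sentence.index(token, start)  # next occurrence at or after `start`
        end = begin + len(token)
        offsets.append((begin, end))
        start = end
    return offsets

print(token_offsets('商品和服务。', ['商品', '和', '服务', '。']))
# [(0, 2), (2, 3), (3, 5), (5, 6)]
```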

All the pre-trained tokenizers and their details are listed below.

hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH = 'https://file.hankcs.com/hanlp/tok/coarse_electra_small_20220616_012050.zip'

Electra (Clark et al. 2020) small model trained on coarse-grained CWS corpora. Its performance is P: 98.34% R: 98.38% F1: 98.36%, which is much higher than that of the MTL model.
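
The P/R/F1 figures quoted for each model are the standard word-segmentation scores: gold and predicted segmentations are converted to character spans, and precision/recall are computed over the two span sets. A minimal sketch of that metric (helper names are illustrative, not HanLP's evaluation code):

```python
def to_spans(tokens: list) -> set:
    """Convert a token list to the set of (begin, end) character spans."""
    spans, start = set(), 0
    for token in tokens:
        spans.add((start, start + len(token)))
        start += len(token)
    return spans

def prf(gold_tokens: list, pred_tokens: list) -> tuple:
    """Span-based precision, recall, and F1 for one segmented sentence."""
    gold, pred = to_spans(gold_tokens), to_spans(pred_tokens)
    correct = len(gold & pred)
    p = correct / len(pred)
    r = correct / len(gold)
    return p, r, 2 * p * r / (p + r)

# Gold keeps 服务 as one word; the prediction oversegments it.
print(prf(['商品', '和', '服务', '。'], ['商品', '和', '服', '务', '。']))
# → P=0.6, R=0.75, F1≈0.667
```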

hanlp.pretrained.tok.CTB6_CONVSEG = 'https://file.hankcs.com/hanlp/tok/ctb6_convseg_nowe_nocrf_20200110_004046.zip'

Conv model (Wang & Xu 2017) trained on CTB6 dataset.

hanlp.pretrained.tok.CTB9_TOK_ELECTRA_BASE = 'http://download.hanlp.com/tok/extra/ctb9_tok_electra_base_20220426_111949.zip'

Electra (Clark et al. 2020) base model trained on CTB9. Its performance is P: 97.62% R: 97.67% F1: 97.65%, which is much higher than that of the MTL model.

hanlp.pretrained.tok.CTB9_TOK_ELECTRA_BASE_CRF = 'http://download.hanlp.com/tok/extra/ctb9_tok_electra_base_crf_20220426_161255.zip'

Electra (Clark et al. 2020) base model trained on CTB9. Its performance is P: 97.68% R: 97.71% F1: 97.69%, which is much higher than that of the MTL model.

hanlp.pretrained.tok.CTB9_TOK_ELECTRA_SMALL = 'https://file.hankcs.com/hanlp/tok/ctb9_electra_small_20220215_205427.zip'

Electra (Clark et al. 2020) small model trained on CTB9. Its performance is P: 97.15% R: 97.36% F1: 97.26%, which is much higher than that of the MTL model.

hanlp.pretrained.tok.FINE_ELECTRA_SMALL_ZH = 'https://file.hankcs.com/hanlp/tok/fine_electra_small_20220615_231803.zip'

Electra (Clark et al. 2020) small model trained on fine-grained CWS corpora. Its performance is P: 98.14% R: 98.07% F1: 98.11%, which is much higher than that of the MTL model.

hanlp.pretrained.tok.LARGE_ALBERT_BASE = 'https://file.hankcs.com/hanlp/tok/large_corpus_cws_albert_base_20211228_160926.zip'

ALBERT model (Lan et al. 2020) trained on the largest CWS dataset in the world.

hanlp.pretrained.tok.MSR_TOK_ELECTRA_BASE_CRF = 'http://download.hanlp.com/tok/extra/msra_crf_electra_base_20220507_113936.zip'

Electra (Clark et al. 2020) base model trained on the MSR CWS dataset. Its performance is P: 98.71% R: 98.64% F1: 98.68%, which is much higher than that of the MTL model.

hanlp.pretrained.tok.PKU_NAME_MERGED_SIX_MONTHS_CONVSEG = 'https://file.hankcs.com/hanlp/tok/pku98_6m_conv_ngram_20200110_134736.zip'

Conv model (Wang & Xu 2017) trained on the pku98 six-months dataset, with family name and given name merged into one unit.

hanlp.pretrained.tok.SIGHAN2005_MSR_CONVSEG = 'https://file.hankcs.com/hanlp/tok/convseg-msr-nocrf-noembed_20200110_153524.zip'

Conv model (Wang & Xu 2017) trained on the sighan2005 msr dataset.

hanlp.pretrained.tok.SIGHAN2005_PKU_BERT_BASE_ZH = 'https://file.hankcs.com/hanlp/tok/sighan2005_pku_bert_base_zh_20201231_141130.zip'

BERT model (Devlin et al. 2019) trained on the sighan2005 pku dataset.

hanlp.pretrained.tok.SIGHAN2005_PKU_CONVSEG = 'https://file.hankcs.com/hanlp/tok/sighan2005-pku-convseg_20200110_153722.zip'

Conv model (Wang & Xu 2017) trained on the sighan2005 pku dataset.

hanlp.pretrained.tok.UD_TOK_MMINILMV2L12 = 'https://file.hankcs.com/hanlp/tok/ud_tok_mMiniLMv2L12_no_space_mul_20220619_091159.zip'

mMiniLMv2 (Wang et al. 2021) L12xH384-based tokenizer trained on UD 2.10. The following 130 languages are supported: Afrikaans, Akkadian, Akuntsu, Albanian, Amharic, Ancient Greek (to 1453), Ancient Hebrew, Apurinã, Arabic, Armenian, Assyrian Neo-Aramaic, Bambara, Basque, Beja, Belarusian, Bengali, Bhojpuri, Breton, Bulgarian, Catalan, Cebuano, Central Siberian Yupik, Chinese, Chukot, Church Slavic, Coptic, Croatian, Czech, Danish, Dutch, Emerillon, English, Erzya, Estonian, Faroese, Finnish, French, Galician, German, Gothic, Guajajára, Guarani, Hebrew, Hindi, Hittite, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, K'iche', Kangri, Karelian, Karo (Brazil), Kazakh, Khunsari, Komi-Permyak, Komi-Zyrian, Korean, Latin, Latvian, Ligurian, Literary Chinese, Lithuanian, Livvi, Low German, Madi, Makuráp, Maltese, Manx, Marathi, Mbyá Guaraní, Modern Greek (1453-), Moksha, Mundurukú, Nayini, Neapolitan, Nigerian Pidgin, Northern Kurdish, Northern Sami, Norwegian, Old French (842-ca. 1400), Old Russian, Old Turkish, Persian, Polish, Portuguese, Romanian, Russia Buriat, Russian, Sanskrit, Scottish Gaelic, Serbian, Skolt Sami, Slovak, Slovenian, Soi, South Levantine Arabic, Spanish, Swedish, Swedish Sign Language, Swiss German, Tagalog, Tamil, Tatar, Telugu, Thai, Tupinambá, Turkish, Uighur, Ukrainian, Umbrian, Upper Sorbian, Urdu, Urubú-Kaapor, Vietnamese, Warlpiri, Welsh, Western Armenian, Western Frisian, Wolof, Xibe, Yakut, Yoruba, Yue Chinese. Performance: P: 95.41% R: 95.25% F1: 95.33%.

hanlp.pretrained.tok.UD_TOK_MMINILMV2L6 = 'https://file.hankcs.com/hanlp/tok/ud_tok_mMiniLMv2L6_no_space_mul_20220619_091824.zip'

mMiniLMv2 (Wang et al. 2021) L6xH384-based tokenizer trained on UD 2.10. The following 130 languages are supported: Afrikaans, Akkadian, Akuntsu, Albanian, Amharic, Ancient Greek (to 1453), Ancient Hebrew, Apurinã, Arabic, Armenian, Assyrian Neo-Aramaic, Bambara, Basque, Beja, Belarusian, Bengali, Bhojpuri, Breton, Bulgarian, Catalan, Cebuano, Central Siberian Yupik, Chinese, Chukot, Church Slavic, Coptic, Croatian, Czech, Danish, Dutch, Emerillon, English, Erzya, Estonian, Faroese, Finnish, French, Galician, German, Gothic, Guajajára, Guarani, Hebrew, Hindi, Hittite, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, K'iche', Kangri, Karelian, Karo (Brazil), Kazakh, Khunsari, Komi-Permyak, Komi-Zyrian, Korean, Latin, Latvian, Ligurian, Literary Chinese, Lithuanian, Livvi, Low German, Madi, Makuráp, Maltese, Manx, Marathi, Mbyá Guaraní, Modern Greek (1453-), Moksha, Mundurukú, Nayini, Neapolitan, Nigerian Pidgin, Northern Kurdish, Northern Sami, Norwegian, Old French (842-ca. 1400), Old Russian, Old Turkish, Persian, Polish, Portuguese, Romanian, Russia Buriat, Russian, Sanskrit, Scottish Gaelic, Serbian, Skolt Sami, Slovak, Slovenian, Soi, South Levantine Arabic, Spanish, Swedish, Swedish Sign Language, Swiss German, Tagalog, Tamil, Tatar, Telugu, Thai, Tupinambá, Turkish, Uighur, Ukrainian, Umbrian, Upper Sorbian, Urdu, Urubú-Kaapor, Vietnamese, Warlpiri, Welsh, Western Armenian, Western Frisian, Wolof, Xibe, Yakut, Yoruba, Yue Chinese. Performance: P: 94.99% R: 94.74% F1: 94.86%.