Tokenization



Introduction

HanLP supports tokenization for 130 languages through a multilingual model trained on Universal Dependencies (UD) 2.10. While this benefits low-resource languages, a monolingual model is recommended for high-resource languages such as Chinese and Japanese, where it achieves better performance.

How to Use

Apply for Auth

We host a non-commercial API service, and you are welcome to apply for an auth key. An auth key is a password that gives you access to our API and protects our server from abuse.

Create RESTful Client

Create a HanLPClient:

from hanlp_restful import HanLPClient
# Supported languages: en (English), zh (Chinese), ja (Japanese), mul (multilingual)
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='mul')
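
Once you have an auth key, you may prefer not to hard-code it in source files. The following is a minimal sketch of one way to do that, not an official recommendation: it reads the key from an environment variable and falls back to auth=None, as in the example above. The variable name HANLP_AUTH and the helper names client_settings and make_client are assumptions made for this illustration; they are not part of hanlp_restful.

```python
import os

API_URL = 'https://www.hanlp.com/api'


def client_settings(env=os.environ):
    """Collect the arguments for HanLPClient from the environment.

    HANLP_AUTH is a hypothetical variable name chosen for this sketch.
    """
    # A missing or empty variable falls back to auth=None.
    auth = env.get('HANLP_AUTH') or None
    return {'url': API_URL, 'auth': auth, 'language': 'mul'}


def make_client(env=os.environ):
    """Build a client from the settings above."""
    # Imported lazily so client_settings() works without the package installed.
    from hanlp_restful import HanLPClient
    settings = client_settings(env)
    return HanLPClient(settings['url'],
                       auth=settings['auth'],
                       language=settings['language'])
```

With the key exported in your shell, HanLP = make_client() would then replace the explicit constructor call.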

    

Tokenize

Set tasks='tok' to perform tokenization:

      
HanLP('''In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments.
2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。
2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。''', tasks='tok')
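
To work with the tokens afterwards, a small helper can flatten the result into one line per sentence. This is a sketch under a stated assumption: that the call returns a dict-like object whose tokenization output sits under a key named 'tok' (or one prefixed with 'tok/') and holds one list of token strings per sentence. Check the actual response before relying on this. The helper names and the sample data are hypothetical and hand-written; the sample is not real API output.

```python
def tokens_per_sentence(result):
    """Return the token lists stored under the first tokenization key."""
    for key, value in result.items():
        # Assumed key layout: 'tok', or a variant such as 'tok/...'.
        if key == 'tok' or key.startswith('tok/'):
            return value
    raise KeyError('no tokenization output found in result')


def as_lines(result):
    """Join each sentence's tokens with single spaces."""
    return [' '.join(sentence) for sentence in tokens_per_sentence(result)]


# Hand-written stand-in for a response, NOT real API output.
sample = {'tok': [['Hello', 'world', '.'], ['こんにちは', '。']]}

for line in as_lines(sample):
    print(line)
```

In real use you would pass the value returned by HanLP(..., tasks='tok') instead of sample.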

    

Native APIs

Please refer to the docs.

Last update: 12/24/2024, 4:01:42 PM
Contributors: hankcs