Data Format¶
Input Format¶
RESTful Input¶
Definition¶
To make a RESTful call, send a JSON HTTP POST request to the server containing at least a text field or a tokens field. The input to the RESTful API is flexible and can be any of the following 3 formats:

- A document as a raw str filled into text. The server will split it into sentences.
- A list of sentences, each sentence being a raw str, filled into text.
- A list of tokenized sentences, each sentence being a list of str tokens, filled into tokens.
Additionally, fine-grained control is available through the arguments defined in hanlp_restful.HanLPClient.parse().
Examples¶
curl -X 'POST' \
'https://hanlp.hankcs.com/api/parse' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"language": "zh",
"text": "HanLP为生产环境带来次世代最先进的多语种NLP技术。晓美焰来到北京参观自然语义科技公司。"
}'
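The same request can be built from Python. The sketch below constructs the three accepted payload shapes with the standard library only; the endpoint and field names follow the curl example above, and only the payloads are assembled here since actually sending them requires network access and (possibly) an auth key.

```python
import json

API_URL = 'https://hanlp.hankcs.com/api/parse'

# Format 1: a raw document in `text`; the server splits it into sentences.
doc_payload = {'language': 'zh',
               'text': 'HanLP为生产环境带来次世代最先进的多语种NLP技术。'}

# Format 2: a list of raw sentences in `text`.
sent_payload = {'language': 'zh',
                'text': ['HanLP为生产环境带来次世代最先进的多语种NLP技术。',
                         '晓美焰来到北京参观自然语义科技公司。']}

# Format 3: a list of tokenized sentences in `tokens`.
tok_payload = {'language': 'zh',
               'tokens': [['晓美焰', '来到', '北京', '参观', '自然语义科技公司', '。']]}

# The request body is the payload serialized as JSON.
body = json.dumps(doc_payload, ensure_ascii=False)
print(body)
```

In practice the hanlp_restful client builds these bodies for you; constructing them by hand is only needed when calling the HTTP API directly.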
Model Input¶
The input format to models is specified per model and per task. Generally speaking, if a model has no tokenizer built in, its input is a sentence in list[str] form (a list of tokens), or multiple such sentences nested in a list. If a model has a tokenizer built in, each sentence is in str form.

Additionally, you can pass skip_tasks='tok*' to ask the model to use your tokenized inputs instead of tokenizing them, in which case each of your sentences needs to be in list[str] form, as if there were no tokenizer.

For any model, the input is at the sentence level, which means you have to split a document into sentences beforehand. You may want to try NgramSentenceBoundaryDetector for sentence splitting.
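Since every model consumes sentence-level input, a document must be segmented first. NgramSentenceBoundaryDetector is the recommended tool; as a rough stand-in for illustration only, the sketch below splits on common Chinese terminal punctuation — a simplification that ignores quotes, ellipses, and other cases the real detector handles.

```python
import re

def naive_split_sentences(document: str) -> list:
    """Split a Chinese document after terminal punctuation (。！？),
    keeping the punctuation attached to its sentence."""
    parts = re.split(r'(?<=[。！？])', document)
    return [p for p in parts if p]

doc = '2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。晓美焰来到北京立方庭参观自然语义科技公司。'
print(naive_split_sentences(doc))
```

Each resulting str (or its tokenized list[str] form) can then be fed to a model as one sentence.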
Output Format¶
The outputs of both HanLPClient and MultiTaskLearning are unified into the same Document format. For example, the following RESTful code will output such an instance.
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None) # Fill in your auth
print(HanLP('2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。晓美焰来到北京立方庭参观自然语义科技公司。'))
{
"tok/fine": [
["2021年", "HanLPv2.1", "为", "生产", "环境", "带来", "次", "世代", "最", "先进", "的", "多", "语种", "NLP", "技术", "。"],
["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司", "。"]
],
"tok/coarse": [
["2021年", "HanLPv2.1", "为", "生产环境", "带来", "次世代", "最", "先进", "的", "多语种", "NLP", "技术", "。"],
["晓美焰", "来到", "北京立方庭", "参观", "自然语义科技公司", "。"]
],
"pos/ctb": [
["NT", "NR", "P", "NN", "NN", "VV", "JJ", "NN", "AD", "VA", "DEC", "CD", "NN", "NN", "NN", "PU"],
["NR", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN", "PU"]
],
"pos/pku": [
["t", "nx", "p", "vn", "n", "v", "t", "n", "d", "a", "u", "a", "n", "nx", "n", "w"],
["nr", "v", "ns", "nz", "v", "n", "n", "n", "n", "w"]
],
"pos/863": [
["nt", "w", "p", "v", "n", "v", "a", "nt", "d", "a", "u", "a", "n", "w", "n", "w"],
["nh", "v", "ns", "ni", "v", "n", "n", "n", "n", "w"]
],
"ner/msra": [
[["2021年", "DATE", 0, 1], ["HanLPv2.1", "ORGANIZATION", 1, 2]],
[["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
],
"ner/pku": [
[],
[["晓美焰", "nr", 0, 1], ["北京", "ns", 2, 3]]
],
"ner/ontonotes": [
[["2021年", "DATE", 0, 1]],
[["晓美焰", "PERSON", 0, 1], ["北京", "FAC", 2, 3], ["立方庭", "FAC", 3, 4], ["自然语义科技公司", "ORG", 5, 9]]
],
"srl": [
[[["2021年", "ARGM-TMP", 0, 1], ["HanLPv2.1", "ARG0", 1, 2], ["为生产环境", "ARG2", 2, 5], ["带来", "PRED", 5, 6], ["次世代最先进的多语种NLP技术", "ARG1", 6, 15]], [["次世代", "ARGM-TMP", 6, 8], ["最", "ARGM-ADV", 8, 9], ["先进", "PRED", 9, 10], ["NLP技术", "ARG0", 13, 15]]],
[[["晓美焰", "ARG0", 0, 1], ["来到", "PRED", 1, 2], ["北京立方庭", "ARG1", 2, 4]], [["晓美焰", "ARG0", 0, 1], ["参观", "PRED", 4, 5], ["自然语义科技公司", "ARG1", 5, 9]]]
],
"dep": [
[[6, "nmod:tmod"], [6, "nsubj"], [5, "case"], [5, "compound:nn"], [6, "nmod:prep"], [0, "root"], [8, "amod"], [10, "dep"], [10, "advmod"], [15, "amod"], [10, "mark"], [13, "dep"], [15, "compound:nn"], [15, "compound:nn"], [6, "dobj"], [6, "punct"]],
[[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"], [9, "compound:nn"], [9, "compound:nn"], [9, "compound:nn"], [5, "dobj"], [2, "punct"]]
],
"sdp": [
[[[6, "Time"]], [[6, "Exp"]], [[5, "mPrep"]], [[5, "Desc"]], [[6, "Datv"]], [[3, "eResu"]], [[8, "Desc"]], [[15, "Time"]], [[10, "mDegr"]], [[15, "Desc"]], [[10, "mAux"]], [[13, "Quan"]], [[15, "Desc"]], [[15, "Nmod"]], [[6, "Cont"]], [[0, "Root"], [6, "mPunc"]]],
[[[2, "Agt"], [5, "Agt"]], [[1, "eCoo"]], [[4, "Nmod"]], [[2, "Lfin"], [5, "Loc"]], [[2, "ePurp"]], [[7, "Desc"], [8, "Desc"]], [[8, "Desc"]], [[9, "Desc"]], [[5, "Datv"]], [[2, "mPunc"], [5, "mPunc"]]]
],
"con": [
["TOP", [["IP", [["NP", [["NT", ["2021年"]]]], ["NP", [["NR", ["HanLPv2.1"]]]], ["VP", [["PP", [["P", ["为"]], ["NP", [["NN", ["生产"]], ["NN", ["环境"]]]]]], ["VP", [["VV", ["带来"]], ["NP", [["CP", [["CP", [["IP", [["VP", [["NP", [["ADJP", [["JJ", ["次"]]]], ["NP", [["NN", ["世代"]]]]]], ["ADVP", [["AD", ["最"]]]], ["VP", [["VA", ["先进"]]]]]]]], ["DEC", ["的"]]]]]], ["NP", [["QP", [["CD", ["多"]]]], ["NP", [["NN", ["语种"]]]]]], ["NP", [["NN", ["NLP"]], ["NN", ["技术"]]]]]]]]]], ["PU", ["。"]]]]]],
["TOP", [["IP", [["NP", [["NR", ["晓美焰"]]]], ["VP", [["VV", ["来到"]], ["NP", [["NR", ["北京"]], ["NR", ["立方庭"]]]], ["VP", [["VV", ["参观"]], ["NP", [["NN", ["自然"]], ["NN", ["语义"]], ["NN", ["科技"]], ["NN", ["公司"]]]]]]]], ["PU", ["。"]]]]]]
]
}
The output above is represented as a json dictionary where each key is a task name and its value is the output of the corresponding task. For each output, if it is a nested list then it contains multiple sentences; otherwise it is a single sentence.
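Because the Document is dictionary-like, per-task outputs can be traversed generically. A minimal sketch, using a trimmed plain dict standing in for the Document above (the sentences and spans below are shortened for brevity):

```python
# A trimmed stand-in for the Document: a mapping of task name -> output.
doc = {
    'tok/fine': [['晓美焰', '来到', '北京', '立方庭', '。'],
                 ['HanLP', '。']],
    'ner/msra': [[['晓美焰', 'PERSON', 0, 1]],
                 []],
}

for task, output in doc.items():
    # A nested list holds one inner item per sentence.
    print(task, 'covers', len(output), 'sentences')

# Entity spans index into the token list of the same sentence.
tokens = doc['tok/fine'][0]
for entity, label, begin, end in doc['ner/msra'][0]:
    assert ''.join(tokens[begin:end]) == entity
    print(entity, label)
```

The begin/end offsets are token indices (half-open, as in Python slicing), so spans from ner, srl, and similar tasks line up directly with the tok output of the same sentence.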
We adopt the following naming convention for NLP tasks, each consisting of 3 letters.
Naming Convention¶
key | Task | Chinese
---|---|---
tok | Tokenization. Each element is a token. | 分词
pos | Part-of-Speech Tagging. Each element is a tag. | 词性标注
lem | Lemmatization. Each element is a lemma. | 词干提取
fea | Features of Universal Dependencies. Each element is a feature. | 词法语法特征
ner | Named Entity Recognition. Each element is a tuple of (entity, type, begin, end), as in the ner/msra output above. | 命名实体识别
dep | Dependency Parsing. Each element is a tuple of (head, relation), as in the dep output above. | 依存句法分析
con | Constituency Parsing. Each list is a bracketed constituent. | 短语成分分析
srl | Semantic Role Labeling. Similar to ner, each element is a predicate-argument tuple. | 语义角色标注
sdp | Semantic Dependency Parsing. Similar to dep, except that a token may have multiple heads. | 语义依存分析
amr | Abstract Meaning Representation. Each AMR graph is represented as a list of logical triples. See AMR guidelines. | 抽象意义表示
When multiple models perform the same task, their keys are appended with a secondary identifier. For example, tok/fine and tok/coarse denote a fine-grained tokenization model and a coarse-grained one, respectively.
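Such keys split cleanly on '/': the first part is the 3-letter task name, the optional second part the model identifier. A small helper, assuming only this key convention (the function name is illustrative, not part of HanLP's API):

```python
def parse_task_key(key: str):
    """Split a Document key like 'tok/fine' into (task, model).

    Returns None for the model when no secondary identifier is present.
    """
    task, _, model = key.partition('/')
    return task, model or None

print(parse_task_key('tok/fine'))  # ('tok', 'fine')
print(parse_task_key('srl'))       # ('srl', None)
```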