document
- class hanlp_common.document.Document(*args, **kwargs)
A dict structure holding parsed annotations. A document is a subclass of dict and supports every interface of dict. Additionally, it supports interfaces to deal with various linguistic structures. Its str and dict representations are made to be compatible with JSON serialization.
- Parameters
*args – An iterator of key-value pairs.
**kwargs – Arguments from the ** operator.
Examples:

    from hanlp_common.document import Document

    # Create a document
    doc = Document(
        tok=[["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]],
        pos=[["NR", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN"]],
        ner=[[["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4],
              ["自然语义科技公司", "ORGANIZATION", 5, 9]]],
        dep=[[[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"],
              [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]]]
    )

    # print(doc) or str(doc) to get its JSON representation
    print(doc)

    # Access an annotation by its task name
    print(doc['tok'])

    # Get number of sentences
    print(f'It has {doc.count_sentences()} sentence(s)')

    # Access the n-th sentence
    print(doc.squeeze(0)['tok'])

    # Pretty print it right in your console or notebook
    doc.pretty_print()

    # To save the pretty prints in a str
    pretty_text: str = '\n\n'.join(doc.to_pretty())
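The nested layout used in the example can be inspected with a plain dict, without HanLP installed. The sketch below is only an illustration and makes two assumptions inferred from the sample data: each NER entry is [text, label, start, end] with token offsets where end is exclusive, and dependency heads are 1-based with 0 marking the root.

```python
import json

# A plain dict standing in for Document; the fields are copied from the example.
doc = {
    "tok": [["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]],
    "ner": [[["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4],
             ["自然语义科技公司", "ORGANIZATION", 5, 9]]],
    "dep": [[[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"],
             [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]]],
}

tokens = doc["tok"][0]

# Assumption: an entity's text is the concatenation of tokens[start:end].
entity_texts = ["".join(tokens[start:end]) for _, _, start, end in doc["ner"][0]]

# Assumption: heads are 1-based token positions and 0 denotes the root.
heads = [tokens[h - 1] if h > 0 else "ROOT" for h, _ in doc["dep"][0]]

# Only lists, strings and ints are used, so the structure round-trips through JSON.
restored = json.loads(json.dumps(doc, ensure_ascii=False))
```

Under these assumptions every entity text matches its token span, which is a convenient sanity check when building a document by hand.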
- count_sentences() → int
Count the number of sentences in this document.
- Returns
Number of sentences.
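One plausible way to count sentences is to look at the token field. The following is a minimal sketch on a plain dict, not the library's implementation; the function body and the choice of the "tok" key are assumptions made for illustration.

```python
def count_sentences(doc: dict) -> int:
    """Hypothetical stand-in: count sentences from the 'tok' field."""
    tok = doc.get("tok")
    if not tok:
        return 0
    # A flat list of strings is treated as a single (squeezed) sentence.
    if isinstance(tok[0], str):
        return 1
    # Otherwise each inner list is one sentence.
    return len(tok)


nested = {"tok": [["Hello", "world"], ["Bye"]]}
flat = {"tok": ["Hello", "world"]}
```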
- get_by_prefix(prefix: str)
Get value by the prefix of a key.
- Parameters
prefix – The prefix of a key. If multiple keys are matched, only the first one will be used.
- Returns
The value assigned with the matched key.
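Prefix lookup is useful when task names carry a suffix. The sketch below shows the idea on a plain dict; it is an assumption about the behavior described above (first match wins), and the key names in the sample data are illustrative.

```python
from typing import Any, Optional


def get_by_prefix(doc: dict, prefix: str) -> Optional[Any]:
    """Hypothetical stand-in: return the value of the first key matching the prefix."""
    # An exact key wins outright.
    if prefix in doc:
        return doc[prefix]
    # Otherwise take the first key, in insertion order, that starts with the prefix.
    for key, value in doc.items():
        if key.startswith(prefix):
            return value
    return None


doc = {"tok/fine": [["a"]], "tok/coarse": [["b"]], "pos/ctb": [["NN"]]}
```

Because dicts preserve insertion order, "first" here means the key that was added earliest.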
- pretty_print(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl', con='con', show_header=True, html=False)
Print a pretty text representation which visualizes linguistic structures.
- Parameters
tok – Token key.
lem – Lemma key.
pos – Part-of-speech key.
dep – Dependency parse tree key.
sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.
ner – Named entity key.
srl – Semantic role labeling key.
con – Constituency parsing key.
show_header – True to print a header which indicates each field with its name.
html – True to output HTML format so that non-ASCII characters can align correctly.
- squeeze(i=0)
Squeeze the dimension of each field into one. It’s intended to convert a nested document like [[sent_i]] to [sent_i]. When there are multiple sentences, only the i-th one will be returned. Note that this is not an in-place operation.
- Parameters
i – Keep the element at index i in every list.
- Returns
A squeezed document with only one sentence.
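The effect of squeezing can be sketched on a plain dict. This is an illustration under the assumption that every list-valued field is indexed by sentence; it is not the library's code.

```python
def squeeze(doc: dict, i: int = 0) -> dict:
    """Hypothetical stand-in: keep only the i-th element of every list field."""
    # A new dict is built, so the input document is left untouched.
    return {k: (v[i] if isinstance(v, list) else v) for k, v in doc.items()}


doc = {
    "tok": [["Hello", "world"], ["Bye"]],
    "pos": [["UH", "NN"], ["UH"]],
}
second = squeeze(doc, 1)
```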
- to_conll(tok='tok', lem='lem', pos='pos', xpos='pos/xpos', fea='fea', dep='dep', sdp='sdp') → Union[hanlp_common.conll.CoNLLSentence, List[hanlp_common.conll.CoNLLSentence]]
Convert to CoNLLSentence.
- Parameters
- Returns
A CoNLLSentence representation.
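To give a feel for what a CoNLL-style view of one sentence looks like, here is a minimal sketch that formats tokens, tags and [head, relation] pairs as ten tab-separated CoNLL-U columns. The helper to_conll_lines is hypothetical and does not use or reproduce CoNLLSentence; placing the tag in the UPOS column is an assumption.

```python
from typing import List, Sequence


def to_conll_lines(tok: Sequence[str], pos: Sequence[str],
                   dep: Sequence[Sequence]) -> List[str]:
    """Hypothetical helper: one tab-separated CoNLL-U style line per token."""
    lines = []
    for idx, (form, tag, (head, rel)) in enumerate(zip(tok, pos, dep), start=1):
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(idx), form, "_", tag, "_", "_", str(head), rel, "_", "_"]
        lines.append("\t".join(cols))
    return lines


lines = to_conll_lines(
    tok=["She", "sleeps"],
    pos=["PRON", "VERB"],
    dep=[[2, "nsubj"], [0, "root"]],
)
```

Unknown columns are filled with an underscore, which is the CoNLL-U convention for an unspecified value.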
- to_json(ensure_ascii=False, indent=2) → str
Convert to a JSON string.
- Parameters
ensure_ascii – False to allow for non-ASCII text.
indent – Indent per nested structure.
- Returns
A text representation in str.
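The two parameters mirror those of the standard json.dumps. The sketch below uses only the standard library on a plain dict to show what ensure_ascii changes; whether to_json calls json.dumps internally is an assumption.

```python
import json

data = {"tok": [["北京", "公司"]]}

# ensure_ascii=False keeps non-ASCII characters readable in the output.
readable = json.dumps(data, ensure_ascii=False, indent=2)

# ensure_ascii=True escapes them as \uXXXX sequences instead.
escaped = json.dumps(data, ensure_ascii=True, indent=2)
```

Both strings decode to the same data; only the on-disk readability differs.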
- to_pretty(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl', con='con', show_header=True, html=False) → Union[str, List[str]]
Convert to a pretty text representation which can be printed to visualize linguistic structures.
- Parameters
tok – Token key.
lem – Lemma key.
pos – Part-of-speech key.
dep – Dependency parse tree key.
sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.
ner – Named entity key.
srl – Semantic role labeling key.
con – Constituency parsing key.
show_header – True to include a header which indicates each field with its name.
html – True to output HTML format so that non-ASCII characters can align correctly.
- Returns
A pretty string.
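Since the declared return type is Union[str, List[str]], calling code may want to handle both shapes. The helper below, join_pretty, is a hypothetical convenience and not part of the library; it simply normalizes either shape to one string.

```python
from typing import List, Union


def join_pretty(pretty: Union[str, List[str]]) -> str:
    """Hypothetical helper: normalize a to_pretty()-style result to one string."""
    # A single string is returned as-is.
    if isinstance(pretty, str):
        return pretty
    # A list is joined with a blank line between entries.
    return "\n\n".join(pretty)


single = join_pretty("tok  pos")
multiple = join_pretty(["sent one", "sent two"])
```

Joining a plain str with '\n\n'.join would insert blank lines between its characters, so the isinstance check guards against that case.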
- translate(lang, tok='tok', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl')
Translate tags for each annotation. This is an in-place operation.
Attention
Note that the translated document might not print well in a terminal due to non-ASCII characters.
- Parameters
lang – Target language to be translated to.
tok – Token key.
pos – Part-of-speech key.
dep – Dependency parse tree key.
sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.
ner – Named entity key.
srl – Semantic role labeling key.
- Returns
The translated document.
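The in-place semantics can be illustrated with a plain dict and a hand-written tag table. Everything here is hypothetical: translate_tags, the table contents and the choice of the "pos" field are assumptions for illustration and do not reflect how the library looks up translations.

```python
def translate_tags(doc: dict, table: dict, key: str = "pos") -> dict:
    """Hypothetical stand-in: replace tags in place using a lookup table."""
    for sent in doc.get(key, []):
        for j, tag in enumerate(sent):
            # Tags missing from the table are left unchanged.
            sent[j] = table.get(tag, tag)
    # The same object is returned, mirroring an in-place operation.
    return doc


# Illustrative table, not the library's translation data.
table = {"NR": "proper noun", "VV": "verb", "NN": "noun"}
doc = {"pos": [["NR", "VV", "NN", "XX"]]}
result = translate_tags(doc, table)
```

Because the lists are mutated directly, any other reference to the same document sees the translated tags as well; copy the document first if the original tags are still needed.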