document

class hanlp_common.document.Document(*args, **kwargs)

A dict structure holding parsed annotations. A document is a subclass of dict and it supports every interface of dict. Additionally, it supports interfaces to deal with various linguistic structures. Its str and dict representations are made to be compatible with JSON serialization.

Parameters

*args – An iterable of key-value pairs.
**kwargs – Keyword arguments, for example a mapping unpacked with the ** operator.

Examples:

from hanlp_common.document import Document

# Create a document
doc = Document(
    tok=[["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]],
    pos=[["NR", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN"]],
    ner=[[["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4],
          ["自然语义科技公司", "ORGANIZATION", 5, 9]]],
    dep=[[[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"],
          [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]]]
)

# print(doc) or str(doc) to get its JSON representation
print(doc)

# Access an annotation by its task name
print(doc['tok'])

# Get number of sentences
print(f'It has {doc.count_sentences()} sentence(s)')

# Access the n-th sentence (here n = 0, the first one)
print(doc.squeeze(0)['tok'])

# Pretty print it right in your console or notebook
doc.pretty_print()

# Save the pretty-printed text to a str instead of printing it
pretty_text: str = '\n\n'.join(doc.to_pretty())
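
The index conventions used by the sample data above can be checked with plain Python. This is a minimal sketch that reads the conventions off the example only: it assumes each ner entry is [text, label, start, end] with an exclusive end offset into tok, and that dep heads are 1-based with 0 marking the root. It uses ordinary lists rather than Document.

```python
# Plain-list copies of the single sentence from the example above.
tok = ["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]
ner = [["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4],
       ["自然语义科技公司", "ORGANIZATION", 5, 9]]
dep = [[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"],
       [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]]

# Assumption: start/end are token offsets and end is exclusive.
for text, label, start, end in ner:
    assert "".join(tok[start:end]) == text
    print(f"{label}: tok[{start}:{end}] -> {text}")

# Assumption: heads are 1-based token positions and 0 denotes the root.
for word, (head, rel) in zip(tok, dep):
    head_word = "ROOT" if head == 0 else tok[head - 1]
    print(f"{word} --{rel}--> {head_word}")
```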

count_sentences() → int

Count the number of sentences in this document.

Returns: Number of sentences.
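
One way to picture the counting, sketched on a plain dict: count_sentences_sketch is a hypothetical helper, not the library's code, and it assumes that a flat list of token strings stands for a single sentence while a list of lists holds one inner list per sentence.

```python
def count_sentences_sketch(doc: dict, tok_key: str = "tok") -> int:
    """Hypothetical stand-in that counts sentences from the token field."""
    tokens = doc[tok_key]
    # A flat list of strings is treated as one (already squeezed) sentence.
    if tokens and isinstance(tokens[0], str):
        return 1
    # Otherwise each inner list is one sentence.
    return len(tokens)


nested = {"tok": [["Hello", "world"], ["Bye"]]}
flat = {"tok": ["Hello", "world"]}
print(count_sentences_sketch(nested))
print(count_sentences_sketch(flat))
```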

get_by_prefix(prefix: str)

Get value by the prefix of a key.

Parameters: prefix – The prefix of a key. If multiple keys are matched, only the first one will be used.
Returns: The value associated with the matched key.
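
The "first match wins" rule can be illustrated on a plain dict. This is a rough sketch, not the library's matching logic: get_by_prefix_sketch is a hypothetical helper that assumes simple str.startswith matching in insertion order and returns None when nothing matches; the key names are illustrative.

```python
from typing import Any, Optional


def get_by_prefix_sketch(doc: dict, prefix: str) -> Optional[Any]:
    """Hypothetical stand-in: return the value of the first key with this prefix."""
    # dicts preserve insertion order, so "first" means first inserted.
    for key, value in doc.items():
        if key.startswith(prefix):
            return value
    return None  # Assumption: no match yields None.


doc = {"tok/fine": [["a"]], "pos/ctb": [["NN"]], "pos/pku": [["n"]]}
print(get_by_prefix_sketch(doc, "pos"))  # Two keys match; only the first is used.
print(get_by_prefix_sketch(doc, "ner"))
```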

pretty_print(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl', con='con', show_header=True, html=False)

Print a pretty text representation which visualizes linguistic structures.

Parameters

tok – Token key.
lem – Lemma key.
pos – Part-of-speech key.
dep – Dependency parse tree key.
sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.
ner – Named entity key.
srl – Semantic role labeling key.
con – Constituency parsing key.
show_header – True to print a header which indicates each field with its name.
html – True to output HTML format so that non-ASCII characters can align correctly.

squeeze(i=0)

Squeeze the dimension of each field into one. It is intended to convert a nested document like [[sent_i]] to [sent_i]. When there are multiple sentences, only the i-th one is returned. Note that this is not an in-place operation: the original document is left unchanged.

Parameters: i – Index of the element (sentence) to keep in every list.
Returns: A squeezed document with only one sentence.
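
The effect of squeezing can be sketched on a plain dict. squeeze_sketch is a hypothetical helper, not the library's implementation; it assumes every list-valued field is indexed by sentence, and the tags are illustrative.

```python
def squeeze_sketch(doc: dict, i: int = 0) -> dict:
    """Hypothetical stand-in: keep only the i-th element of every list field."""
    # Build a new dict so the input is left untouched (not in-place).
    return {key: value[i] if isinstance(value, list) else value
            for key, value in doc.items()}


doc = {"tok": [["Hello", "world"], ["Bye"]], "pos": [["UH", "NN"], ["UH"]]}
second = squeeze_sketch(doc, 1)
print(second)
print(doc["tok"])  # The original still holds both sentences.
```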

to_conll(tok='tok', lem='lem', pos='pos', xpos='pos/xpos', fea='fea', dep='dep', sdp='sdp') → Union[hanlp_common.conll.CoNLLSentence, List[hanlp_common.conll.CoNLLSentence]]

Convert to CoNLLSentence.

Parameters

tok (str) – Field name for tok.
lem (str) – Field name for lem.
pos (str) – Field name for upos.
xpos (str) – Field name for xpos.
fea (str) – Field name for feats.
dep (str) – Field name for dependency parsing.
sdp (str) – Field name for semantic dependency parsing.

Returns

A CoNLLSentence representation, or a list of CoNLLSentence objects, as the return annotation indicates.
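
For orientation, the CoNLL-U format has ten tab-separated columns per token: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS and MISC. The sketch below shows how the tok, pos and dep fields of one sentence could map onto those columns. to_conllu_lines is a hypothetical helper that does not use CoNLLSentence, and it assumes pos fills the UPOS column and that unavailable columns are written as "_".

```python
from typing import List


def to_conllu_lines(tok: List[str], pos: List[str], dep: List[list]) -> List[str]:
    """Hypothetical stand-in: render one sentence as CoNLL-U style rows."""
    lines = []
    for idx, (form, upos, (head, rel)) in enumerate(zip(tok, pos, dep), start=1):
        # ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
        columns = [str(idx), form, "_", upos, "_", "_", str(head), rel, "_", "_"]
        lines.append("\t".join(columns))
    return lines


# A two-token sentence with 1-based heads; 0 marks the root.
rows = to_conllu_lines(["晓美焰", "来到"], ["NR", "VV"], [[2, "nsubj"], [0, "root"]])
print("\n".join(rows))
```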

to_dict()

Convert to a JSON-compatible dict.

Returns: A dict representation.

to_json(ensure_ascii=False, indent=2) → str

Convert to a JSON string.

Parameters

ensure_ascii – False to allow non-ASCII text in the output.
indent – Indent per nested structure.

Returns

The JSON text as a str.
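
The two parameters share their names with arguments of the standard library's json.dumps, so their effect can be previewed there. This sketch assumes they carry the same meaning in to_json; the library's exact line layout may differ from what json.dumps prints.

```python
import json

doc = {"tok": [["北京", "立方庭"]]}

# ensure_ascii=True escapes every non-ASCII character as \uXXXX.
ascii_text = json.dumps(doc, ensure_ascii=True)
# ensure_ascii=False keeps the characters readable; indent=2 pretty-prints.
utf8_text = json.dumps(doc, ensure_ascii=False, indent=2)

print(ascii_text)
print(utf8_text)

# Both strings decode back to the same data.
assert json.loads(ascii_text) == json.loads(utf8_text) == doc
```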

to_pretty(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl', con='con', show_header=True, html=False) → Union[str, List[str]]

Convert to a pretty text representation which can be printed to visualize linguistic structures.

Parameters

tok – Token key.
lem – Lemma key.
pos – Part-of-speech key.
dep – Dependency parse tree key.
sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.
ner – Named entity key.
srl – Semantic role labeling key.
con – Constituency parsing key.
show_header – True to include a header which indicates each field with its name.
html – True to output HTML format so that non-ASCII characters can align correctly.

Returns

A pretty string, or a list of pretty strings, as the return annotation indicates.

translate(lang, tok='tok', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl')

Translate tags for each annotation. This is an in-place operation.

Attention

Note that the translated document might not print well in a terminal due to non-ASCII characters.

Parameters

lang – Target language to be translated to.
tok – Token key.
pos – Part-of-speech key.
dep – Dependency parse tree key.
sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.
ner – Named entity key.
srl – Semantic role labeling key.

Returns

The translated document.
