document

class hanlp_common.document.Document(*args, **kwargs)[source]

A dict structure holding parsed annotations.

Parameters
  • *args – An iterator of key-value pairs.

  • **kwargs – Arguments from ** operator.
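
As a rough illustration of the two constructor routes described above, the sketch below builds one document from an iterator of key-value pairs and another from keyword arguments. The sample tokens and tags are made up, and the plain `dict` fallback is only an assumption so that the snippet also runs where HanLP is not installed.

```python
try:
    from hanlp_common.document import Document
except ImportError:
    # Assumption: a plain dict stands in when hanlp_common is not installed.
    Document = dict

# Route 1: an iterator of key-value pairs (the *args form).
pairs = [("tok", ["HanLP", "parses", "text"]), ("pos", ["PROPN", "VERB", "NOUN"])]
doc_from_pairs = Document(iter(pairs))

# Route 2: keyword arguments (the **kwargs form).
doc_from_kwargs = Document(tok=["HanLP", "parses", "text"], pos=["PROPN", "VERB", "NOUN"])

# Both routes yield the same mapping from field name to annotation.
print(doc_from_pairs["tok"])
print(dict(doc_from_pairs) == dict(doc_from_kwargs))
```

Since the class is a dict, fields are read and written with ordinary subscript access.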

pretty_print(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl', con='con', show_header=True, html=False)[source]

Print a pretty text representation which visualizes linguistic structures.

Parameters
  • tok – Token key.

  • lem – Lemma key.

  • pos – Part-of-speech key.

  • dep – Dependency parse tree key.

  • sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.

  • ner – Named entity key.

  • srl – Semantic role labeling key.

  • con – Constituency parsing key.

  • show_header – True to print a header which indicates each field with its name.

  • html – True to output HTML format so that non-ASCII characters can align correctly.

squeeze()[source]

Squeeze the dimension of each field into one. It is intended to convert a nested document like [[sent1]] to [sent1]. When there are multiple sentences, only the first one is returned. Note that this is not an in-place operation.

Returns

A squeezed document with only one sentence.
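
The behaviour described above can be sketched on a plain dict. `squeeze_first` is a hypothetical helper written for this illustration, not the library's implementation, and the sample sentences are made up. It keeps the first element of every list-valued field and returns a new dict, so the input stays untouched.

```python
def squeeze_first(doc):
    """Return a new dict in which each non-empty list field keeps only its first element."""
    return {
        key: (value[0] if isinstance(value, list) and value else value)
        for key, value in doc.items()
    }

# A nested document holding two sentences per field.
nested = {
    "tok": [["Hello", "world"], ["Second", "sentence"]],
    "pos": [["INTJ", "NOUN"], ["ADJ", "NOUN"]],
}

flat = squeeze_first(nested)
print(flat["tok"])          # only the first sentence remains
print(len(nested["tok"]))   # the original still holds both sentences
```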

to_conll(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp') → Union[hanlp_common.conll.CoNLLSentence, List[hanlp_common.conll.CoNLLSentence]][source]

Convert to CoNLLSentence.

Parameters
  • tok (str) – Field name for tokens.

  • lem (str) – Field name for lemmas.

  • pos (str) – Field name for upos.

  • dep (str) – Field name for dependency parsing.

  • sdp (str) – Field name for semantic dependency parsing.

Returns

A CoNLLSentence representation.
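
To give a feel for how the named fields line up in a CoNLL-style table, here is a minimal sketch that lays one sentence out in the ten-column CoNLL-U order (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). `to_conllu_rows` is a hypothetical helper, and it assumes that the dependency field holds one (head, relation) pair per token. The actual CoNLLSentence class may format its output differently.

```python
def to_conllu_rows(doc, tok="tok", lem="lem", pos="pos", dep="dep"):
    """Build tab-separated CoNLL-U style rows for a single sentence held in a dict."""
    tokens = doc[tok]
    n = len(tokens)
    # Missing fields are filled with the CoNLL-U placeholder "_".
    lemmas = doc.get(lem) or ["_"] * n
    tags = doc.get(pos) or ["_"] * n
    arcs = doc.get(dep) or [("_", "_")] * n  # assumed shape: (head, relation) per token

    rows = []
    for i, (form, lemma, upos, (head, rel)) in enumerate(zip(tokens, lemmas, tags, arcs), start=1):
        columns = [str(i), form, lemma, upos, "_", "_", str(head), rel, "_", "_"]
        rows.append("\t".join(columns))
    return rows

# Made-up single sentence: heads are 1-based token indices, 0 marks the root.
sentence = {
    "tok": ["She", "sleeps"],
    "pos": ["PRON", "VERB"],
    "dep": [(2, "nsubj"), (0, "root")],
}

for row in to_conllu_rows(sentence):
    print(row)
```

The keyword arguments mirror the `tok`, `lem`, `pos` and `dep` parameters above, so a document that stores its annotations under other keys can pass those key names in.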

to_dict()[source]

Convert to a JSON-compatible dict.

Returns

A dict representation.

to_json(ensure_ascii=False, indent=2) → str[source]

Convert to a JSON string.

Parameters
  • ensure_ascii – False to allow for non-ASCII text.

  • indent – Indent per nested structure.

Returns

A text representation in str.
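
The effect of the two parameters can be seen with the standard library's `json.dumps`, which accepts arguments of the same names. This sketch assumes that `to_json` handles them in a comparable way; the sample tokens are made up.

```python
import json

doc = {"tok": ["商品", "和", "服务"]}

# ensure_ascii=True escapes every non-ASCII character as a \uXXXX sequence.
escaped = json.dumps(doc, ensure_ascii=True)

# ensure_ascii=False keeps the original characters; indent=2 pretty-prints
# each nesting level with two spaces.
readable = json.dumps(doc, ensure_ascii=False, indent=2)

print(escaped)
print(readable)
```

Both strings decode back to the same dict. The difference is only in how readable the serialized text is for non-ASCII content.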

to_pretty(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl', con='con', show_header=True, html=False) → Union[str, List[str]][source]

Convert to a pretty text representation which can be printed to visualize linguistic structures.

Parameters
  • tok – Token key.

  • lem – Lemma key.

  • pos – Part-of-speech key.

  • dep – Dependency parse tree key.

  • sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.

  • ner – Named entity key.

  • srl – Semantic role labeling key.

  • con – Constituency parsing key.

  • show_header – True to include a header which indicates each field with its name.

  • html – True to output HTML format so that non-ASCII characters can align correctly.

Returns

A pretty string.

translate(lang, tok='tok', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl')[source]

Translate tags for each annotation. This is an in-place operation.

Attention

Note that the translated document might not print well in a terminal due to non-ASCII characters.

Parameters
  • lang – Target language to be translated to.

  • tok – Token key.

  • pos – Part-of-speech key.

  • dep – Dependency parse tree key.

  • sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.

  • ner – Named entity key.

  • srl – Semantic role labeling key.

Returns

The translated document.