dictionary

class hanlp_trie.dictionary.DictInterface
split(text: Union[str, Sequence[str]]) → List[Tuple[int, int, Any]]

Like str.split(), this method splits a piece of text into chunks, using the keys in this dictionary as delimiters. It performs longest-prefix matching on the text and splits it wherever a longest key is matched. Unlike str.split(), it inserts each matched key into the results list right after the position where it was found, so the text can be restored by joining the chunks in the results list.

Parameters

text – A piece of text.

Returns

A list of chunks. Each chunk is a span of (begin_offset, end_offset, label), where label carries properties of the span for downstream tasks.
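
The splitting behaviour described above can be illustrated with a small, self-contained sketch. It does not import hanlp_trie: longest_matches and split_sketch are hypothetical helpers written for this example, the matcher is a naive scan and not a trie, and labelling unmatched chunks with None is an assumption about how non-key chunks might be represented.

```python
from typing import Any, Dict, List, Tuple

Span = Tuple[int, int, Any]


def longest_matches(text: str, dictionary: Dict[str, Any]) -> List[Span]:
    """Scan left to right, taking the longest key that starts at each position."""
    spans: List[Span] = []
    max_len = max((len(k) for k in dictionary), default=0)
    i = 0
    while i < len(text):
        # Try the longest candidate first so "New York City" beats "New York".
        for size in range(min(max_len, len(text) - i), 0, -1):
            key = text[i:i + size]
            if key in dictionary:
                spans.append((i, i + size, dictionary[key]))
                i += size
                break
        else:
            i += 1  # no key starts at this position
    return spans


def split_sketch(text: str, dictionary: Dict[str, Any]) -> List[Span]:
    """Hypothetical stand-in for split(): unmatched chunks get the label None."""
    chunks: List[Span] = []
    offset = 0
    for begin, end, label in longest_matches(text, dictionary):
        if begin > offset:
            chunks.append((offset, begin, None))  # text before the matched key
        chunks.append((begin, end, label))        # the matched key itself
        offset = end
    if offset < len(text):
        chunks.append((offset, len(text), None))  # trailing text after the last key
    return chunks


text = "I love New York City"
chunks = split_sketch(text, {"New York": "LOC", "New York City": "LOC"})
# Joining the chunks in order gives back the original text.
restored = "".join(text[b:e] for b, e, _ in chunks)
```

Because every character of the text falls into exactly one chunk, joining the chunks in order restores the input, which is the property the description relies on.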

abstract tokenize(text: Union[str, Sequence[str]]) → List[Tuple[int, int, Any]]

Implement this method to tokenize a piece of text into a list of non-intersecting spans. Each span is a tuple of (begin_offset, end_offset, label), where label carries properties of the span; downstream tasks are free to define what kind of labels they want.

Parameters

text – The text to be tokenized.

Returns

A list of tokens.
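
As a rough illustration of what an implementation of this contract might look like, the sketch below defines a local stand-in interface and one subclass. SpanTokenizer and NumberTokenizer are hypothetical names used only here, not classes from hanlp_trie, and the example handles plain strings only, not sequences of strings.

```python
import re
from abc import ABC, abstractmethod
from typing import Any, List, Sequence, Tuple, Union


class SpanTokenizer(ABC):
    """Local stand-in for the interface described above (not the hanlp_trie class)."""

    @abstractmethod
    def tokenize(self, text: Union[str, Sequence[str]]) -> List[Tuple[int, int, Any]]:
        ...


class NumberTokenizer(SpanTokenizer):
    """Hypothetical subclass that labels every run of digits as 'NUM'."""

    def tokenize(self, text):
        # re.finditer yields non-overlapping matches in left-to-right order,
        # so the resulting spans never intersect.
        return [(m.start(), m.end(), "NUM") for m in re.finditer(r"\d+", text)]


spans = NumberTokenizer().tokenize("call 911 or 112")
```

The label here is a plain string, but any object would satisfy the contract, since the choice of label is left to the downstream task.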

class hanlp_trie.dictionary.TrieDict(dictionary: Optional[Union[Dict[str, Any], Iterable[str]]] = None)

A dict-like structure for fast custom dictionary strategies in tokenization and tagging. It is built from a dict of key-value pairs or a set of strings. When a set is passed in, it is turned into a dict in which each key is assigned the boolean value True.

Parameters

dictionary – A custom dictionary of string-value pairs, or an iterable of strings.
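
To make the set-to-dict conversion and the idea of longest-prefix lookup concrete, here is a minimal nested-dict trie. It is an illustrative sketch and not the library's implementation: to_mapping, build_trie, longest_prefix and the _END sentinel are all hypothetical names, and treating None as an empty dictionary is an assumption.

```python
from typing import Any, Dict, Iterable, Optional, Tuple, Union

_END = object()  # sentinel key marking "a dictionary entry ends at this node"


def to_mapping(dictionary: Union[Dict[str, Any], Iterable[str], None]) -> Dict[str, Any]:
    """Normalize the constructor argument: bare strings get the value True."""
    if dictionary is None:
        return {}  # assumption: no dictionary means no keys
    if isinstance(dictionary, dict):
        return dict(dictionary)
    return {key: True for key in dictionary}


def build_trie(mapping: Dict[str, Any]) -> dict:
    """Build a nested-dict trie; each node maps a character to a child node."""
    root: dict = {}
    for key, value in mapping.items():
        node = root
        for char in key:
            node = node.setdefault(char, {})
        node[_END] = value
    return root


def longest_prefix(trie: dict, text: str, start: int = 0) -> Optional[Tuple[int, Any]]:
    """Return (end_offset, value) of the longest key starting at `start`, or None."""
    node, best = trie, None
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break  # no key continues with this character
        if _END in node:
            best = (i + 1, node[_END])  # remember the longest key seen so far
    return best


mapping = to_mapping({"machine", "machine learning"})
trie = build_trie(mapping)
match = longest_prefix(trie, "machine learning rocks")
```

The lookup walks the text once and stops as soon as no key can continue, so its cost depends on the length of the longest matching key and not on the number of keys in the dictionary.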

static merge_batch(data, new_outputs, new_data_belongs, parts)

A helper method that merges the outputs of a split batch back together by concatenating the output for each span with the key used to split it. It’s used in conjunction with split_batch().

Parameters
  • data – Split batch.

  • new_outputs – Outputs of the split batch.

  • new_data_belongs – Belonging information.

  • parts – The keys.

Returns

Merged outputs.
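
The merging step can be sketched as follows. This is a simplified, self-contained illustration, not the library's code: merge_sketch and its parameter names are hypothetical, each span output is assumed to be a list of tokens, each matched key is assumed to become a single token, and the inputs are written by hand.

```python
from typing import Any, List, Tuple


def merge_sketch(
    sentences: List[str],
    span_outputs: List[List[str]],
    belongs: List[int],
    parts: List[List[Tuple[int, int, Any]]],
) -> List[List[str]]:
    """Interleave per-span outputs with the dictionary matches of each sentence.

    Assumed shapes: span_outputs[i] is the token list produced for the i-th
    unmatched span, and belongs[i] is the index of the sentence it came from.
    """
    # Group the span outputs by sentence, preserving their original order.
    pending: List[List[List[str]]] = [[] for _ in sentences]
    for tokens, sent_idx in zip(span_outputs, belongs):
        pending[sent_idx].append(tokens)

    merged: List[List[str]] = []
    for sent, queue, matches in zip(sentences, pending, parts):
        tokens: List[str] = []
        offset = 0
        for begin, end, _label in matches:
            if begin > offset:              # an unmatched span precedes this key
                tokens += queue.pop(0)
            tokens.append(sent[begin:end])  # the matched key becomes one token
            offset = end
        if offset < len(sent):              # trailing unmatched span, if any
            tokens += queue.pop(0)
        merged.append(tokens)
    return merged


sentences = ["I love New York City", "hello"]
span_outputs = [["I", "love"], ["hello"]]  # pretend model outputs for the spans
belongs = [0, 1]
parts = [[(7, 20, "LOC")], []]
merged = merge_sketch(sentences, span_outputs, belongs, parts)
```

In this sketch the dictionary match always wins over the model output: the text covered by a key never reaches the model and is re-inserted as a single token.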

split_batch(data: List[str]) → Tuple[List[str], List[int], List[List[Tuple[int, int, Any]]]]

A handy method to perform longest-prefix matching on a batch of sentences. It tokenizes each sentence and records each chunk as either a key in the dict or a span outside of the dict. The spans are then packed into a new batch and returned along with the following information:

  • which sentence each span belongs to

  • the matched keys, along with their spans and values

This method bridges the gap between statistical models and rule-based gazetteers. It’s used in conjunction with merge_batch().

Parameters

data – A batch of sentences.

Returns

A tuple of the new batch, the belonging information and the keys.
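
The shape of the returned tuple may be easier to see in a small sketch. The code below is a self-contained approximation, not the library's implementation: split_batch_sketch takes the matcher as an explicit argument, and toy_tokenize is a hypothetical matcher that knows a single key.

```python
from typing import Any, Callable, List, Tuple

Span = Tuple[int, int, Any]


def split_batch_sketch(
    data: List[str],
    tokenize: Callable[[str], List[Span]],
) -> Tuple[List[str], List[int], List[List[Span]]]:
    """Cut each sentence at its matched spans and pack the leftovers into a new batch."""
    new_data: List[str] = []      # unmatched spans, across all sentences
    belongs: List[int] = []       # belongs[i] is the sentence index of new_data[i]
    parts: List[List[Span]] = []  # matched spans, one list per sentence
    for idx, sent in enumerate(data):
        matches = tokenize(sent)
        parts.append(matches)
        offset = 0
        for begin, end, _label in matches:
            if begin > offset:
                new_data.append(sent[offset:begin])
                belongs.append(idx)
            offset = end
        if offset < len(sent):
            new_data.append(sent[offset:])
            belongs.append(idx)
    return new_data, belongs, parts


def toy_tokenize(sent: str) -> List[Span]:
    """Hypothetical matcher that knows a single key, 'New York'."""
    begin = sent.find("New York")
    return [] if begin < 0 else [(begin, begin + len("New York"), "LOC")]


new_data, belongs, parts = split_batch_sketch(
    ["I love New York a lot", "hello"], toy_tokenize
)
```

A statistical model would then run on new_data only, and the belonging information and the keys are what a merging step needs to put each sentence back together.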

tokenize(text: Union[str, Sequence[str]]) → List[Tuple[int, int, Any]]

Tokenizes a piece of text into a list of non-intersecting spans. Each span is a tuple of (begin_offset, end_offset, label), where label carries properties of the span; downstream tasks are free to define what kind of labels they want.

Parameters

text – The text to be tokenized.

Returns

A list of tokens.