dictionary
- class hanlp_trie.dictionary.DictInterface[source]
- split(text: Union[str, Sequence[str]]) → List[Tuple[int, int, Any]] [source]
Like str.split(), this method splits a piece of text into chunks, taking the keys in this dictionary as delimiters. It performs longest-prefix matching on the text and splits it wherever a longest key is matched. Unlike str.split(), it inserts each matched key into the results list right after the position where it was found, so that the text can be restored by joining the chunks in the results list.
- Parameters
text – A piece of text.
- Returns
A list of chunks, each chunk being a span of (begin_offset, end_offset, label), where label carries properties related to this span and to downstream tasks.
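Below is a minimal sketch of split() in action, using TrieDict (a concrete DictInterface) with a toy dictionary; the comments describe the expected shape of the result rather than verified output.

from hanlp_trie.dictionary import TrieDict

# TrieDict implements DictInterface, so it provides split().
trie = TrieDict({'natural': 'ADJ', 'language': 'NOUN'})

chunks = trie.split('natural language processing')
# Matched keys are returned as (begin_offset, end_offset, label) spans,
# interleaved with the surrounding text so that joining the chunks
# restores the original string; the exact representation of unmatched
# text is best confirmed against the library source.
print(chunks)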
- abstract tokenize(text: Union[str, Sequence[str]]) → List[Tuple[int, int, Any]] [source]
Implement this method to tokenize a piece of text into a list of non-intersecting spans. Each span is a tuple of (begin_offset, end_offset, label), where label carries properties related to this span; downstream tasks are free to define what kind of labels they want.
- Parameters
text – The text to be tokenized.
- Returns
A list of tokens.
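As a hedged illustration of the contract above, here is a hypothetical DictInterface subclass that implements tokenize() with greedy longest matching over a plain Python dict; the class name and matching strategy are illustrative only, and the sketch assumes DictInterface imposes no constructor requirements.

from typing import Any, Dict, List, Sequence, Tuple, Union

from hanlp_trie.dictionary import DictInterface


class SimpleDict(DictInterface):
    """Hypothetical implementation backed by a plain dict of exact string keys."""

    def __init__(self, entries: Dict[str, Any]):
        self.entries = entries

    def tokenize(self, text: Union[str, Sequence[str]]) -> List[Tuple[int, int, Any]]:
        # Handles str input only, for brevity.
        spans = []
        i = 0
        while i < len(text):
            # Greedy longest match starting at i; quadratic, but fine for a sketch.
            for j in range(len(text), i, -1):
                key = text[i:j]
                if key in self.entries:
                    spans.append((i, j, self.entries[key]))
                    i = j
                    break
            else:
                i += 1
        return spans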
- class hanlp_trie.dictionary.TrieDict(dictionary: Optional[Union[Dict[Iterable[str], Any], Iterable[str]]] = None)[source]
A dict-like structure for fast custom-dictionary strategies in tokenization and tagging. It is built from a dict of key-value pairs or from a set of strings; when a set is passed in, it is turned into a dict in which every key maps to the boolean value True.
- Parameters
dictionary – A custom dictionary of string-value pairs.
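A short sketch of constructing a TrieDict in the two ways described above; the sample keys and values are arbitrary.

from hanlp_trie.dictionary import TrieDict

# From key-value pairs: values can be any labels downstream tasks care about.
tagged = TrieDict({'HanLP': 'TOOL', 'NLP': 'FIELD'})

# From a set of strings: every key is mapped to True internally.
plain = TrieDict({'HanLP', 'NLP'})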
- static merge_batch(data, new_outputs, new_data_belongs, parts)[source]
A helper method that merges the outputs of a split batch back together by concatenating the output for each span with the key used to split it. It is used in conjunction with split_batch(); see the combined sketch after split_batch() below.
- Parameters
data – The split batch.
new_outputs – Outputs of the split batch.
new_data_belongs – Belonging information, i.e. which sentence each span belongs to.
parts – The matched keys along with their spans and values, as returned by split_batch().
- Returns
Merged outputs.
- split_batch(data: List[str]) → Tuple[List[str], List[int], List[List[Tuple[int, int, Any]]]] [source]
A handy method to perform longest-prefix matching on a batch of sentences. It tokenizes each sentence, recording each chunk as either a key in the dict or a span outside of the dict. The spans are then packed into a new batch and returned along with the following information:
- which sentence each span belongs to;
- the matched keys along with their spans and values.
This method bridges the gap between statistical models and rule-based gazetteers. It is used in conjunction with merge_batch(); see the sketch below.
- Parameters
data – A batch of sentences.
- Returns
A tuple of the new batch, the belonging information and the keys.
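The sketch below shows one way the split_batch()/merge_batch() pair might be wired around a statistical model, following the signatures above; the model function is a placeholder, and the argument order for merge_batch() follows the parameter list documented above, so it should be double-checked against the library source.

from hanlp_trie.dictionary import TrieDict

trie = TrieDict({'HanLP': 'TOOL'})
batch = ['I use HanLP daily', 'HanLP supports many languages']

# Split each sentence around the dictionary keys.
new_batch, new_data_belongs, parts = trie.split_batch(batch)

def model(sentences):
    # Placeholder for a real statistical model; whitespace-splits each span.
    return [s.split() for s in sentences]

new_outputs = model(new_batch)

# Stitch the model outputs and the matched keys back into per-sentence results.
merged = TrieDict.merge_batch(new_batch, new_outputs, new_data_belongs, parts)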
- tokenize(text: Union[str, Sequence[str]]) → List[Tuple[int, int, Any]] [source]
Tokenize a piece of text into a list of non-intersecting spans. Each span is a tuple of (begin_offset, end_offset, label), where label carries properties related to this span; downstream tasks are free to define what kind of labels they want.
- Parameters
text – The text to be tokenized.
- Returns
A list of tokens.
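Finally, a brief sketch of calling tokenize() on a TrieDict; the span shown in the comment illustrates the (begin_offset, end_offset, label) format and is not verified output.

from hanlp_trie.dictionary import TrieDict

trie = TrieDict({'New York': 'CITY', 'York': 'PLACE'})

spans = trie.tokenize('I flew to New York yesterday')
# Expected shape: a list of (begin_offset, end_offset, label) tuples,
# e.g. [(10, 18, 'CITY')], with the longer key preferred over 'York'.
print(spans)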