dictionary

class hanlp_trie.dictionary.DictInterface
split(text: Union[str, Sequence[str]]) → List[Tuple[int, int, Any]]

Like str.split(), this method splits a piece of text into chunks, taking the keys in this dictionary as delimiters. It performs longest-prefix matching on the text and splits it wherever a longest key is matched. Unlike str.split(), it inserts each matched key into the results list right where it was found, so that the text can be restored by joining the chunks in the results list.

Parameters:
    text – A piece of text.

Returns:
    A list of chunks, where each chunk is a span of (begin_offset, end_offset, label) and label holds properties related to this span for use by downstream tasks.
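A short sketch of split() in action, using the concrete TrieDict documented below; the dictionary contents here are illustrative, and the exact chunk representation is whatever the implementation emits:

```python
from hanlp_trie.dictionary import TrieDict

d = TrieDict({'New York': 'CITY', 'York': 'TOWN'})

# Longest-prefix matching prefers 'New York' over 'York'; the matched
# span is inserted into the results where it occurs, so joining the
# chunks in order can restore the original text.
print(d.split('I flew to New York'))
```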
abstract tokenize(text: Union[str, Sequence[str]]) → List[Tuple[int, int, Any]]

Implement this method to tokenize a piece of text into a list of non-intersecting spans, where each span is a tuple of (begin_offset, end_offset, label). The label carries properties related to this span, and downstream tasks have the freedom to define what kind of labels they want. A minimal subclass sketch follows below.

Parameters:
    text – The text to be tokenized.

Returns:
    A list of tokens.
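Since tokenize() is abstract, a subclass supplies the matching logic. A minimal illustrative subclass, not part of the library and assuming DictInterface needs only tokenize() overridden (as documented above), might match a single fixed key in a str input:

```python
from typing import Any, List, Sequence, Tuple, Union

from hanlp_trie.dictionary import DictInterface


class SingleKeyDict(DictInterface):
    """Illustrative only: marks every non-overlapping occurrence of one key.

    Handles only str inputs; a real implementation would also cover
    Sequence[str].
    """

    def __init__(self, key: str, label: Any = True) -> None:
        self.key = key
        self.label = label

    def tokenize(self, text: Union[str, Sequence[str]]) -> List[Tuple[int, int, Any]]:
        spans = []
        begin = text.find(self.key)
        while begin != -1:
            end = begin + len(self.key)
            spans.append((begin, end, self.label))
            # Resume searching after the match so spans never intersect.
            begin = text.find(self.key, end)
        return spans


d = SingleKeyDict('York', 'PLACE')
print(d.tokenize('New York, York'))  # [(4, 8, 'PLACE'), (10, 14, 'PLACE')]
```

With tokenize() in place, the inherited split() described above works on such a subclass as well.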
class hanlp_trie.dictionary.TrieDict(dictionary: Optional[Union[Dict[str, Any], Iterable[str]]] = None)

A dict-like structure for fast custom-dictionary strategies in tokenization and tagging. It is built from a dict of key-value pairs or a set of strings. When a set is passed in, it is turned into a dict in which each key is assigned the boolean value True.

Parameters:
    dictionary – A custom dictionary of string-value pairs.
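A minimal construction sketch covering both input forms described above; the keys and values are illustrative:

```python
from hanlp_trie.dictionary import TrieDict

# From key-value pairs: values can be any properties downstream tasks need.
d1 = TrieDict({'naive': 'ADJ', 'Bayes': 'PROPN'})

# From a set of strings: each key is assigned the boolean value True.
d2 = TrieDict({'naive', 'Bayes'})
```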
static merge_batch(data, new_outputs, new_data_belongs, parts)

A helper method that merges the outputs of a split batch back together by concatenating the output for each span with the key used to split it. It is used in conjunction with split_batch(); a combined sketch appears after that method below.

Parameters:
    data – The batch that was split.
    new_outputs – Outputs of the split batch.
    new_data_belongs – The belonging information, i.e. which sentence each span in the split batch came from.
    parts – The matched keys.

Returns:
    Merged outputs.
split_batch(data: List[str]) → Tuple[List[str], List[int], List[List[Tuple[int, int, Any]]]]

A handy method to perform longest-prefix matching on a batch of sentences. It tokenizes each sentence, recording each chunk as either a key in the dictionary or a span outside of it. The out-of-dictionary spans are then packed into a new batch and returned along with the following information:

- which sentence each span belongs to;
- the matched keys along with their spans and values.

This method bridges the gap between statistical models and rule-based gazetteers. It is used in conjunction with merge_batch().

Parameters:
    data – A batch of sentences.

Returns:
    A tuple of the new batch, the belonging information, and the keys.
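A sketch of the intended round trip with merge_batch(). Two assumptions here are my reading of the docstrings rather than confirmed API facts: that merge_batch() takes the original batch as its data argument, and that the outputs are per-chunk token lists (as in a tokenization pipeline). The toy_model function is a hypothetical stand-in, not part of hanlp_trie:

```python
from hanlp_trie.dictionary import TrieDict

d = TrieDict({'New York': 'CITY'})
batch = ['I love New York', 'New York is big']

# Pack the out-of-dictionary spans into a new batch for a model.
new_batch, belongs, parts = d.split_batch(batch)


def toy_model(chunks):
    # Hypothetical stand-in for a statistical model: whitespace-tokenize
    # each chunk. A real pipeline would run a neural tokenizer/tagger here.
    return [chunk.split() for chunk in chunks]


# Stitch the per-chunk outputs back together, re-inserting the matched
# keys ('New York') at the positions recorded in parts.
merged = TrieDict.merge_batch(batch, toy_model(new_batch), belongs, parts)
```

This division of labor is the point of the pair: the gazetteer guarantees that dictionary entries survive intact, while the statistical model only ever sees the text between matches.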
tokenize(text: Union[str, Sequence[str]]) → List[Tuple[int, int, Any]]

Tokenizes a piece of text into a list of non-intersecting spans, where each span is a tuple of (begin_offset, end_offset, label). The label carries properties related to this span, and downstream tasks have the freedom to define what kind of labels they want.

Parameters:
    text – The text to be tokenized.

Returns:
    A list of tokens.
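A small sketch, assuming the value stored for a key serves as the span's label; the expected span in the comment is indicative rather than a verified output:

```python
from hanlp_trie.dictionary import TrieDict

d = TrieDict({'fine': 1, 'fine-tune': 2})
# Longest-prefix matching should yield a single non-intersecting span
# for 'fine-tune' (roughly (7, 16, 2)) rather than a shorter match on 'fine'.
print(d.tokenize('Let us fine-tune it'))
```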