tokenizer

class hanlp.transform.transformer_tokenizer.TransformerSequenceTokenizer(tokenizer: Union[transformers.tokenization_utils.PreTrainedTokenizer, str], input_key, output_key=None, max_seq_length=512, truncate_long_sequences=False, config: Optional[transformers.configuration_utils.PretrainedConfig] = None, cls_token_at_end=False, cls_token_segment_id=0, pad_token_segment_id=0, pad_on_left=False, do_padding=False, sep_token_extra=False, ret_mask_and_type=False, ret_prefix_mask=False, ret_token_span=True, ret_subtokens=False, ret_subtokens_group=False, cls_is_bos=False, sep_is_eos=False, do_basic_tokenize=True, use_fast=True, dict_force=None, strip_cls_sep=True, check_space_before=None)

A transformer tokenizer for token-level tasks. It honors token boundaries: each token is tokenized into several subtokens, which are then merged back. The information about which token each subtoken belongs to is kept and returned as a new field in the sample. It also provides an out-of-the-box sliding window trick for long sequences.

Parameters
  • tokenizer – The identifier of a pre-trained tokenizer or a PreTrainedTokenizer.

  • input_key – The token key in samples.

  • output_key – The output keys to store results.

  • max_seq_length – Sentences longer than max_seq_length will be split into shorter ones if possible.

  • truncate_long_sequences – True to truncate the parts of a long sequence that exceed max_seq_length; False to enable the sliding window.

  • config – The PretrainedConfig to determine the model structure of the transformer, so that special tokenization can be applied.

  • cls_token_at_end – True to put [CLS] at the end of input tokens.

  • cls_token_segment_id – The segment (token type) id of [CLS].

  • pad_token_segment_id – The segment (token type) id of [PAD].

  • pad_on_left – True to put [PAD] on the left side of input tokens.

  • do_padding – True to pad the sequence to max_seq_length.

  • sep_token_extra – True to have two [SEP]s.

  • ret_mask_and_type – True to return masks and type ids.

  • ret_prefix_mask – True to generate a mask where each non-zero element corresponds to a prefix of a token.

  • ret_token_span – True to return the span of each token, measured in subtoken offsets.

  • ret_subtokens – True to return the list of subtokens belonging to each token, for tokenization purposes. When enabled, the prefix mask of every subtoken is set to True, since each subtoken is a token unit in a tokenization task. Similarly, the token span of each token becomes a contiguous integer sequence.

  • ret_subtokens_group – True to return the list of offsets of the subtokens belonging to each token.

  • cls_is_bos – True means the first token of the input is treated as [CLS] no matter what its surface form is. False (default) means the first token is not [CLS]; it will have its own embedding rather than the embedding of [CLS].

  • sep_is_eos – True means the last token of the input is [SEP]. False means it is not, but [SEP] will be appended. None means it depends on whether input[-1] == [EOS].

  • do_basic_tokenize – Whether to do basic tokenization before wordpiece.

  • use_fast – Whether or not to try to load the fast version of the tokenizer.

  • dict_force – A dictionary doing longest-prefix-match on input text so that the head and tail of each keyword won’t be concatenated to other tokens by transformer tokenizers.

  • strip_cls_sep – True to strip [CLS] and [SEP] off the input tokens.

  • check_space_before – True to detect the space before each token, in order to handle the underline character (▁) used by SentencePiece tokenization.

Examples:

from hanlp.transform.transformer_tokenizer import TransformerSequenceTokenizer

# Tokenize a pre-tokenized sample; the results are stored back into the sample dict.
transform = TransformerSequenceTokenizer('bert-base-uncased', 'token')
sample = {'token': 'HanLP good'.split()}
print(transform(sample))
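
For tokenization-style use, ret_subtokens exposes the subtoken list behind each input unit. The following sketch is a minimal illustration that only assumes the constructor arguments documented above; the exact names of the fields added to the sample may vary across versions, so it simply prints the whole result:

# Return the subtokens belonging to each token (multi-subtoken words such as
# 'HanLP' are grouped back onto their originating token).
transform = TransformerSequenceTokenizer('bert-base-uncased', 'token',
                                         ret_subtokens=True)
sample = {'token': 'HanLP good'.split()}
print(transform(sample))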
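
For sequences longer than max_seq_length, the default behavior is the sliding window described above, while truncate_long_sequences=True simply cuts the sequence off. The sketch below only contrasts the two configurations on a dummy long input and makes no assumption about the exact shape of the returned fields:

long_sample = {'token': ['token'] * 600}  # longer than the default max_seq_length=512

sliding = TransformerSequenceTokenizer('bert-base-uncased', 'token')  # sliding window (default)
truncating = TransformerSequenceTokenizer('bert-base-uncased', 'token',
                                          truncate_long_sequences=True)  # hard truncation
print(sorted(sliding(dict(long_sample)).keys()))      # pass a copy, since the sample is enriched in place
print(sorted(truncating(dict(long_sample)).keys()))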