tokenizer

class hanlp.transform.transformer_tokenizer.TransformerSequenceTokenizer(tokenizer: Union[transformers.tokenization_utils.PreTrainedTokenizer, str], input_key, output_key=None, max_seq_length=512, truncate_long_sequences=False, config: Optional[transformers.configuration_utils.PretrainedConfig] = None, cls_token_at_end=False, cls_token_segment_id=0, pad_token_segment_id=0, pad_on_left=False, do_padding=False, sep_token_extra=False, ret_mask_and_type=False, ret_prefix_mask=False, ret_token_span=True, ret_subtokens=False, ret_subtokens_group=False, cls_is_bos=False, sep_is_eos=False, do_basic_tokenize=True, use_fast=True, dict_force=None, strip_cls_sep=True, check_space_before=None)

A transformer tokenizer for token-level tasks. It honors token boundaries: each token is tokenized into several subtokens, which are then merged back. The information about which token each subtoken belongs to is kept and returned as a new field in the sample. It also provides an out-of-the-box sliding window trick for long sequences.

Parameters
  • tokenizer – The identifier of a pre-trained tokenizer or a PreTrainedTokenizer.

  • input_key – The token key in samples.

  • output_key – The output keys to store results.

  • max_seq_length – Sentences longer than max_seq_length will be split into shorter ones if possible.

  • truncate_long_sequences – True to truncate the parts of a long sequence that exceed max_seq_length; False to enable the sliding window.

  • config – The PretrainedConfig to determine the model structure of the transformer, so that special tokenization can be applied.

  • cls_token_at_end – True to put [CLS] at the end of input tokens.

  • cls_token_segment_id – The segment (token type) id of [CLS].

  • pad_token_segment_id – The segment (token type) id of [PAD].

  • pad_on_left – True to put [PAD] on the left side of input tokens.

  • do_padding – True to pad the sequence to max_seq_length.

  • sep_token_extra – True to have two [SEP]s.

  • ret_mask_and_type – True to return masks and type ids.

  • ret_prefix_mask – True to generate a mask where each non-zero element corresponds to a prefix of a token.

  • ret_token_span – True to return the span of each token, measured in subtoken offsets.

  • ret_subtokens – True to return the list of subtokens belonging to each token, for tokenization purposes. When enabled, the prefix mask of every subtoken is set to True, since each subtoken is a token unit in a tokenization task. Similarly, the token span of each token becomes a contiguous integer sequence.

  • ret_subtokens_group – True to return the list of offsets of the subtokens belonging to each token.

  • cls_is_bos – True means the first token of the input is treated as [CLS] no matter what its surface form is. False (default) means the first token is not [CLS]; it will have its own embedding rather than the embedding of [CLS].

  • sep_is_eos – True means the last token of the input is [SEP]. False means it is not, but [SEP] will be appended. None means it depends on whether input[-1] == [EOS].

  • do_basic_tokenize – Whether to do basic tokenization before wordpiece.

  • use_fast – Whether or not to try to load the fast version of the tokenizer.

  • dict_force – A dictionary doing longest-prefix-match on input text so that the head and tail of each keyword won’t be concatenated to other tokens by transformer tokenizers.

  • strip_cls_sep – True to strip [CLS] and [SEP] off the input tokens.

  • check_space_before – True to detect the space before each token, in order to handle the underline character (▁) used by SentencePiece tokenization.

Examples:

from hanlp.transform.transformer_tokenizer import TransformerSequenceTokenizer

# Tokenize a pre-tokenized sample; the results are stored back into the sample dict.
transform = TransformerSequenceTokenizer('bert-base-uncased', 'token')
sample = {'token': 'HanLP good'.split()}
print(transform(sample))
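
For tokenization-style use, ret_subtokens exposes the subtoken list behind each input unit. The following sketch is a minimal illustration that only assumes the constructor arguments documented above; the exact names of the fields added to the sample may vary across versions, so it simply prints the whole result:

# Return the subtokens belonging to each token (multi-subtoken words such as
# 'HanLP' are grouped back onto their originating token).
transform = TransformerSequenceTokenizer('bert-base-uncased', 'token',
                                         ret_subtokens=True)
sample = {'token': 'HanLP good'.split()}
print(transform(sample))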
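
For sequences longer than max_seq_length, the default behavior is the sliding window described above, while truncate_long_sequences=True simply cuts the sequence off. The sketch below only contrasts the two configurations on a dummy long input and makes no assumption about the exact shape of the returned fields:

long_sample = {'token': ['token'] * 600}  # longer than the default max_seq_length=512

sliding = TransformerSequenceTokenizer('bert-base-uncased', 'token')  # sliding window (default)
truncating = TransformerSequenceTokenizer('bert-base-uncased', 'token',
                                          truncate_long_sequences=True)  # hard truncation
print(sorted(sliding(dict(long_sample)).keys()))      # pass a copy, since the sample is enriched in place
print(sorted(truncating(dict(long_sample)).keys()))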