tokenizer
- class hanlp.transform.transformer_tokenizer.TransformerSequenceTokenizer(tokenizer: Union[transformers.tokenization_utils.PreTrainedTokenizer, str], input_key, output_key=None, max_seq_length=512, truncate_long_sequences=False, config: Optional[transformers.configuration_utils.PretrainedConfig] = None, cls_token_at_end=False, cls_token_segment_id=0, pad_token_segment_id=0, pad_on_left=False, do_padding=False, sep_token_extra=False, ret_mask_and_type=False, ret_prefix_mask=False, ret_token_span=True, ret_subtokens=False, ret_subtokens_group=False, cls_is_bos=False, sep_is_eos=False, do_basic_tokenize=True, use_fast=True, dict_force=None, strip_cls_sep=True, check_space_before=None)
A transformer tokenizer for token-level tasks. It honors token boundaries: each token is tokenized into one or more subtokens, which are then merged back. The information about which subtokens belong to which token is kept and returned as a new field in the sample. It also provides an out-of-the-box sliding-window trick for long sequences.
- Parameters
- tokenizer – The identifier of a pre-trained tokenizer or a PreTrainedTokenizer.
- input_key – The token key in samples.
- output_key – The output keys to store results.
- max_seq_length – Sentences longer than max_seq_len will be split into shorter ones if possible.
- truncate_long_sequences – True to truncate the exceeded parts of long sequences; False to enable the sliding window.
- config – The PretrainedConfig used to determine the model structure of the transformer, so that special tokenization can be applied.
- cls_token_at_end – True to put [CLS] at the end of input tokens.
- cls_token_segment_id – The id of [CLS].
- pad_token_segment_id – The id of [SEP].
- pad_on_left – True to put [PAD] on the left side of input tokens.
- do_padding – True to pad the sequence to the left.
- sep_token_extra – True to have two [SEP] tokens.
- ret_mask_and_type – True to return masks and type ids.
- ret_prefix_mask – True to generate a mask where each non-zero element corresponds to a prefix of a token.
- ret_token_span – True to return the span of each token, measured in subtoken offsets.
- ret_subtokens – True to return the list of subtokens belonging to each token, for tokenization purposes. When enabled, the prefix mask of each subtoken is set to True, as each subtoken is a token unit in a tokenization task. Similarly, the token span of each token will be a continuous integer sequence.
- ret_subtokens_group – True to return the list of offsets of the subtokens belonging to each token.
- cls_is_bos – True means the first token of the input is treated as [CLS] no matter what its surface form is. False (default) means the first token is not [CLS]; it will have its own embedding other than the embedding of [CLS].
- sep_is_eos – True means the last token of the input is [SEP]. False means it is not, but [SEP] will be appended. None means it depends on input[-1] == [EOS].
- do_basic_tokenize – Whether to do basic tokenization before WordPiece.
- use_fast – Whether or not to try to load the fast version of the tokenizer.
- dict_force – A dictionary performing longest-prefix-match on the input text, so that the head and tail of each keyword won't be concatenated to other tokens by transformer tokenizers (see the sketch after this list).
- strip_cls_sep – True to strip [CLS] and [SEP] off the input tokens.
- check_space_before – True to detect the space before each token, to handle the underline of SentencePiece tokenization.
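As a rough illustration of the switches above, the following sketch keeps a custom keyword intact and uses the sliding window instead of truncation. The argument values are assumptions for the sketch, not library defaults, and the accepted container type for dict_force may differ:

    from hanlp.transform.transformer_tokenizer import TransformerSequenceTokenizer

    # Sketch only: 'HanLP' is a hypothetical keyword; dict_force performs
    # longest-prefix-match so its head and tail are not merged into
    # neighbouring subtokens. truncate_long_sequences=False switches long
    # inputs to the sliding window instead of cutting them off.
    transform = TransformerSequenceTokenizer(
        'bert-base-uncased',
        input_key='token',
        truncate_long_sequences=False,
        dict_force={'HanLP'},
    )
    sample = {'token': 'HanLP good'.split()}
    print(transform(sample))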
Examples:
    transform = TransformerSequenceTokenizer('bert-base-uncased', 'token')
    sample = {'token': 'HanLP good'.split()}
    print(transform(sample))
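The exact keys added to the sample are derived from input_key and the enabled ret_* switches, so the follow-up below only inspects them generically; ret_subtokens=True is an illustrative choice for a tokenization task, not a default:

    # Sketch: with ret_subtokens=True each token keeps its list of subtokens,
    # and every subtoken is marked in the prefix mask (see ret_subtokens above).
    transform = TransformerSequenceTokenizer('bert-base-uncased', 'token',
                                             ret_subtokens=True)
    sample = {'token': 'HanLP good'.split()}
    output = transform(sample)  # returns the sample with new fields added
    for key, value in output.items():
        print(key, value)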