tokenizer
- class hanlp.transform.transformer_tokenizer.TransformerSequenceTokenizer(tokenizer: Union[transformers.tokenization_utils.PreTrainedTokenizer, str], input_key, output_key=None, max_seq_length=512, truncate_long_sequences=False, config: Optional[transformers.configuration_utils.PretrainedConfig] = None, cls_token_at_end=False, cls_token_segment_id=0, pad_token_segment_id=0, pad_on_left=False, do_padding=False, sep_token_extra=False, ret_mask_and_type=False, ret_prefix_mask=False, ret_token_span=True, ret_subtokens=False, ret_subtokens_group=False, cls_is_bos=False, sep_is_eos=False, do_basic_tokenize=True, use_fast=True, dict_force=None, strip_cls_sep=True, check_space_before=None)
A transformer tokenizer for token-level tasks. It honors token boundaries: each token is tokenized into one or more subtokens, which are then merged back. The information about which token each subtoken belongs to is kept and returned as a new field in the sample. It also provides an out-of-the-box sliding window trick for long sequences.
- Parameters
  - tokenizer – The identifier of a pre-trained tokenizer or a PreTrainedTokenizer.
  - input_key – The token key in samples.
  - output_key – The output keys to store results.
  - max_seq_length – Sentences longer than max_seq_length will be split into shorter ones if possible.
  - truncate_long_sequences – True to truncate the exceeding parts of long sequences. False to enable the sliding window (see the second sketch after the examples below).
  - config – The PretrainedConfig used to determine the model structure of the transformer, so that special tokenization can be applied.
  - cls_token_at_end – True to put [CLS] at the end of input tokens.
  - cls_token_segment_id – The segment id of [CLS].
  - pad_token_segment_id – The segment id of [PAD].
  - pad_on_left – True to put [PAD] at the left side of input tokens.
  - do_padding – True to pad the sequence (the padding side is controlled by pad_on_left).
  - sep_token_extra – True to have two [SEP].
  - ret_mask_and_type – True to return masks and type ids.
  - ret_prefix_mask – True to generate a mask where each non-zero element corresponds to a prefix of a token.
  - ret_token_span – True to return the span of each token measured by subtoken offsets.
  - ret_subtokens – True to return the list of subtokens belonging to each token, for tokenization purposes. When enabled, the prefix mask for each subtoken is set to True, as each subtoken is a token unit in a tokenization task. Similarly, the token span for each token will be a continuous integer sequence.
  - ret_subtokens_group – True to return the list of offsets of subtokens belonging to each token.
  - cls_is_bos – True means the first token of the input is treated as [CLS] no matter what its surface form is. False (default) means the first token is not [CLS]; it will have its own embedding other than the embedding of [CLS].
  - sep_is_eos – True means the last token of the input is [SEP]. False means it is not, but [SEP] will be appended. None means it depends on input[-1] == [EOS].
  - do_basic_tokenize – Whether to do basic tokenization before wordpiece.
  - use_fast – Whether or not to try to load the fast version of the tokenizer.
  - dict_force – A dictionary doing longest-prefix-match on the input text so that the head and tail of each keyword won't be concatenated to other tokens by transformer tokenizers.
  - strip_cls_sep – True to strip [CLS] and [SEP] off the input tokens.
  - check_space_before – True to detect the space before each token, to handle the underscore (▁) in SentencePiece tokenization.
Examples:
    transform = TransformerSequenceTokenizer('bert-base-uncased', 'token')
    sample = {'token': 'HanLP good'.split()}
    print(transform(sample))
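
The transform adds its results as new fields to the sample, keyed off the input key ('token' here). A minimal sketch of inspecting those fields; the exact field names depend on the HanLP version, so they are printed rather than hard-coded:

    from hanlp.transform.transformer_tokenizer import TransformerSequenceTokenizer

    # Minimal sketch: run the transform and list every field it adds to the sample.
    # Besides the original 'token' list, the result carries the subtoken ids and,
    # since ret_token_span=True by default, the span of subtokens per input token.
    transform = TransformerSequenceTokenizer('bert-base-uncased', 'token')
    sample = {'token': 'HanLP tokenizes long words'.split()}
    result = transform(sample)
    for key, value in result.items():
        print(key, value)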
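
The handling of long sequences is driven by max_seq_length and truncate_long_sequences. A hedged sketch, assuming a deliberately small max_seq_length so that an over-long sentence triggers either truncation or the sliding window; the tiny limit is only for illustration:

    from hanlp.transform.transformer_tokenizer import TransformerSequenceTokenizer

    # Hedged sketch: compare truncation against the sliding window on a sentence
    # that is much longer than max_seq_length.
    truncating = TransformerSequenceTokenizer('bert-base-uncased', 'token',
                                              max_seq_length=16,
                                              truncate_long_sequences=True)
    sliding = TransformerSequenceTokenizer('bert-base-uncased', 'token',
                                           max_seq_length=16,
                                           truncate_long_sequences=False)
    long_sample = {'token': ('HanLP ' * 50).split()}
    # dict(...) copies the sample so the two transforms don't share added fields.
    print(truncating(dict(long_sample)))  # exceeding subtokens are cut off
    print(sliding(dict(long_sample)))     # the sequence is split into shorter windows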