Transformer-based tokenizer.

class hanlp.components.tokenizers.transformer.TransformerTaggingTokenizer(**kwargs)[source]

A tokenizer that uses a transformer tagger for span prediction. It features two high-performance dictionaries to handle edge cases in real applications.

  • dict_force: High-priority dictionary that performs longest-prefix-matching on the input text and takes priority over model predictions.

  • dict_combine: Low-priority dictionary that performs longest-prefix-matching on model predictions and then combines the matched tokens.


For algorithm beginners, understanding longest-prefix-matching is a prerequisite to understanding what a dictionary can and cannot do. The tutorial in this book can be very helpful.
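To make longest-prefix-matching concrete, here is a minimal sketch in plain Python. It is not HanLP's implementation (a real one would typically use a trie), and the function name longest_prefix_match is hypothetical. It scans left to right and, at each offset, takes the longest dictionary entry that starts there.

```python
def longest_prefix_match(text, dictionary):
    """Return (start, end, keyword) spans found by a greedy left-to-right scan.

    Illustrative only: a brute-force version of longest-prefix-matching.
    """
    max_len = max(map(len, dictionary), default=0)
    spans = []
    i = 0
    while i < len(text):
        match_end = None
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match_end = j
                break
        if match_end is None:
            i += 1  # no keyword starts here, move on one character
        else:
            spans.append((i, match_end, text[i:match_end]))
            i = match_end  # resume after the match, so overlaps are skipped
    return spans


spans = longest_prefix_match("商品和服务行业", {"和服", "服务行业"})
print(spans)
```

Because the scan resumes after each match, '服务行业' is never found in this input: '和服' is matched first and consumes the character '服'. This is why a dictionary entry is not guaranteed to appear in the output.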


**kwargs – Predefined config.


Implement this to build metric(s).


**kwargs – The subclass decides the method signature.

property dict_combine

The low-priority dictionary, which performs longest-prefix-matching on model predictions and combines the matched tokens.


>>> tok.dict_combine = {'和服', '服务行业'}
>>> tok("商品和服务行业") # '和服' is not in the original results ['商品', '和', '服务']. '服务', '行业' are combined to '服务行业'
    ['商品', '和', '服务行业']
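The combining behavior in the example above can be sketched as follows. This is an assumption about the general technique, not HanLP's code, and combine_tokens is a hypothetical name. Matching runs over the predicted tokens, so a dictionary entry is only applied when it lines up with token boundaries.

```python
def combine_tokens(tokens, dictionary):
    """Merge runs of adjacent tokens whose concatenation is in the dictionary.

    Illustrative only: at each position the longest matching run wins.
    """
    max_len = max(map(len, dictionary), default=0)
    out = []
    i = 0
    while i < len(tokens):
        best = None
        merged = ""
        for j in range(i, len(tokens)):
            merged += tokens[j]
            if len(merged) > max_len:
                break  # no dictionary entry can be this long
            if merged in dictionary:
                best = j + 1  # remember the longest match so far
        if best is None:
            out.append(tokens[i])
            i += 1
        else:
            out.append("".join(tokens[i:best]))
            i = best
    return out


print(combine_tokens(["商品", "和", "服务", "行业"], {"和服", "服务行业"}))
```

In this sketch '和服' is never produced because it would cut through the predicted token '服务', while '服务' and '行业' join cleanly into '服务行业'.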
property dict_force

The high-priority dictionary, which performs longest-prefix-matching on the input to split it into two subsets:

  1. spans containing no keywords, which are then fed into the tokenizer for further tokenization.

  2. keywords, which are output without further tokenization.


Longest-prefix-matching NEVER guarantees the presence of any keyword. Abusing dict_force can lead to low-quality results. For more details, refer to this book.


>>> tok.dict_force = {'和服', '服务行业'} # Force '和服' and '服务行业' by longest-prefix-matching
>>> tok("商品和服务行业")
    ['商品', '和服', '务行业']
>>> tok.dict_force = {'和服务': ['和', '服务']} # Force '和服务' to be tokenized as ['和', '服务']
>>> tok("商品和服务行业")
    ['商品', '和', '服务', '行业']
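The two-subset split described above can be sketched as follows. This is a hedged illustration rather than HanLP's implementation: force_tokenize is a hypothetical name, and model_tokenize is a stand-in callable for the transformer tagger. A set forces each keyword to be output whole, and a dict maps each keyword to its custom tokenization.

```python
def force_tokenize(text, dict_force, model_tokenize):
    """Split text into forced keywords and free spans (illustrative sketch).

    dict_force: a set of keywords, or a dict mapping keyword -> list of tokens.
    model_tokenize: hypothetical stand-in for the model, str -> list of str.
    """
    max_len = max(map(len, dict_force), default=0)
    tokens = []
    start = 0  # beginning of the current keyword-free span
    i = 0
    while i < len(text):
        # Longest dictionary entry starting at offset i, if any.
        end = next(
            (j for j in range(min(len(text), i + max_len), i, -1)
             if text[i:j] in dict_force),
            None,
        )
        if end is None:
            i += 1
            continue
        if start < i:
            tokens.extend(model_tokenize(text[start:i]))  # subset 1
        keyword = text[i:end]
        if isinstance(dict_force, dict):
            tokens.extend(dict_force[keyword])  # custom tokenization
        else:
            tokens.append(keyword)  # subset 2: output as is
        start = i = end
    if start < len(text):
        tokens.extend(model_tokenize(text[start:]))
    return tokens


# Stand-in "model" that returns each free span as a single token.
whole_span = lambda s: [s]
print(force_tokenize("商品和服务行业", {"和服", "服务行业"}, whole_span))
print(force_tokenize("商品和服务行业", {"和服务": ["和", "服务"]}, whole_span))
```

With the stand-in model, the first call shows the failure mode from the example above: forcing '和服' leaves the fragment '务行业' for the model to handle.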
fit(trn_data, dev_data, save_dir, transformer, average_subwords=False, word_dropout: float = 0.2, hidden_dropout=None, layer_dropout=0, scalar_mix=None, grad_norm=5.0, transformer_grad_norm=None, lr=5e-05, transformer_lr=None, transformer_layers=None, gradient_accumulation=1, adam_epsilon=1e-08, weight_decay=0, warmup_steps=0.1, crf=False, reduction='sum', batch_size=32, sampler_builder: hanlp.common.dataset.SamplerBuilder = None, epochs=30, patience=5, token_key=None, tagging_scheme='BMES', delimiter=None, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False, transform=None, logger=None, devices: Union[float, int, List[int]] = None, **kwargs)[source]
  • trn_data – Training set.

  • dev_data – Development set.

  • save_dir – The directory to save trained component.

  • transformer – An identifier of a pre-trained transformer.

  • average_subwords – True to average subword representations.

  • word_dropout – Dropout rate to randomly replace a subword with MASK.

  • hidden_dropout – Dropout rate applied to hidden states.

  • layer_dropout – Randomly zero out hidden states of a transformer layer.

  • scalar_mix – Layer attention.

  • grad_norm – Gradient norm for clipping.

  • transformer_grad_norm – Gradient norm for clipping transformer gradient.

  • lr – Learning rate for decoder.

  • transformer_lr – Learning rate for encoder.

  • transformer_layers – The number of bottom layers to use.

  • gradient_accumulation – Number of batches per update.

  • adam_epsilon – The epsilon to use in Adam.

  • weight_decay – The weight decay to use.

  • warmup_steps – The number of warmup steps.

  • crf – True to enable CRF (Lafferty et al. 2001).

  • reduction – The loss reduction used in aggregating losses.

  • batch_size – The number of samples in a batch.

  • sampler_builder – The builder to build sampler, which will override batch_size.

  • epochs – The number of epochs to train.

  • patience – The number of patience epochs before early stopping.

  • token_key – The key to tokens in dataset.

  • tagging_scheme – Either BMES or BI.

  • delimiter – Delimiter between tokens used to split a line in the corpus.

  • max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.

  • sent_delimiter – Delimiter between sentences, like period or comma, which indicates a long sentence can be split here.

  • char_level – Whether the sequence length is measured at char level.

  • hard_constraint – Whether to enforce hard length constraint on sentences. If there is no sent_delimiter in a sentence, it will be split at a token anyway.

  • transform – An optional transform to be applied to samples. Usually a character normalization transform is passed in.

  • devices – Devices this component will live on.

  • logger – Any logging.Logger instance.

  • seed – Random seed to reproduce this training.

  • **kwargs – Not used.


Best metrics on dev set.
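For readers unfamiliar with the tagging_scheme parameter, the following sketch shows the common BMES convention, in which each character is tagged as the Begin, Middle, or End of a multi-character token, or as a Single-character token. It assumes the standard meaning of BMES and is not taken from HanLP's code. The helper names to_bmes and from_bmes are hypothetical.

```python
def to_bmes(tokens):
    """Convert a list of tokens into one BMES tag per character."""
    tags = []
    for tok in tokens:
        if not tok:
            continue  # skip empty tokens
        if len(tok) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(tok) - 2) + ["E"])
    return tags


def from_bmes(text, tags):
    """Recover tokens from text and its per-character BMES tags."""
    tokens = []
    start = 0
    for i, tag in enumerate(tags):
        if tag in ("E", "S"):  # a token ends at this character
            tokens.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        tokens.append(text[start:])  # tolerate an unterminated final token
    return tokens


tokens = ["商品", "和", "服务", "行业"]
tags = to_bmes(tokens)
print(tags)
print(from_bmes("".join(tokens), tags))
```

Under this encoding, tokenization becomes a per-character tagging problem, and a predicted tag sequence maps back to token spans.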