Transformer based tokenizer.

class hanlp.components.tokenizers.transformer.TransformerTaggingTokenizer(**kwargs)[source]

A tokenizer that uses a transformer tagger for span prediction. It features two high-performance dictionaries to handle edge cases in real applications.

  • dict_force: A high-priority dictionary that performs longest-prefix-matching on the input text; its matches take priority over model predictions.

  • dict_combine: A low-priority dictionary that performs longest-prefix-matching on model predictions and combines the matched tokens.


For beginners, understanding longest-prefix-matching is a prerequisite to knowing what a dictionary can and cannot do. The tutorial in this book can be very helpful.
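For intuition, here is a minimal pure-Python sketch of longest-prefix-matching; the function name and the (begin, end, word) return format are illustrative, not part of the HanLP API:

```python
def longest_prefix_match(text, dictionary):
    """Scan text left to right; at each position, the longest dictionary
    entry starting there wins, and the scan jumps past it."""
    spans = []
    i = 0
    while i < len(text):
        best = 0
        for word in dictionary:
            if text.startswith(word, i) and len(word) > best:
                best = len(word)
        if best:
            spans.append((i, i + best, text[i:i + best]))
            i += best  # skip the matched span entirely
        else:
            i += 1
    return spans

print(longest_prefix_match('商品和服务行业', {'和服', '服务行业'}))
# → [(2, 4, '和服')]
```

Note that the match at position 2 consumes the character at position 3, so '服务行业' (which starts at position 3) never fires. This is exactly the pitfall the dict_force warning below describes.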

It also supports outputting the span of each token by setting config.output_spans = True.
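To illustrate what span output carries, the hypothetical helper below recovers (token, begin, end) character offsets from a token list; the exact span format HanLP returns may differ between versions:

```python
def tokens_to_spans(text, tokens):
    """Map each token back to its character offsets in the original text
    (illustrative sketch of the kind of information output_spans exposes)."""
    spans, offset = [], 0
    for token in tokens:
        begin = text.index(token, offset)  # first occurrence at/after offset
        end = begin + len(token)
        spans.append((token, begin, end))
        offset = end  # subsequent tokens start where this one ended
    return spans

print(tokens_to_spans('商品和服务行业', ['商品', '和', '服务行业']))
# → [('商品', 0, 2), ('和', 2, 3), ('服务行业', 3, 7)]
```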


**kwargs – Predefined config.


build_metric(**kwargs)[source]

Implement this to build metric(s).


**kwargs – The subclass decides the method signature.

property dict_combine: hanlp_trie.dictionary.DictInterface

The low-priority dictionary, which performs longest-prefix-matching on model predictions and combines the matched tokens.


>>> tok.dict_combine = {'和服', '服务行业'}
>>> tok("商品和服务行业") # '和服' is not in the original results ['商品', '和', '服务']. '服务', '行业' are combined to '服务行业'
    ['商品', '和', '服务行业']
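The combining step can be approximated with a short sketch (illustrative only, not HanLP's actual implementation): consecutive predicted tokens are merged when their concatenation is a dictionary entry, preferring the longest merge:

```python
def combine_tokens(tokens, dictionary):
    """Merge runs of consecutive tokens whose concatenation appears in the
    dictionary, longest run first (hypothetical sketch of dict_combine)."""
    out, i = [], 0
    while i < len(tokens):
        merged_end = None
        # try the longest candidate run first; single tokens are excluded
        for j in range(len(tokens), i + 1, -1):
            if ''.join(tokens[i:j]) in dictionary:
                merged_end = j
                break
        if merged_end:
            out.append(''.join(tokens[i:merged_end]))
            i = merged_end
        else:
            out.append(tokens[i])
            i += 1
    return out

print(combine_tokens(['商品', '和', '服务', '行业'], {'和服', '服务行业'}))
# → ['商品', '和', '服务行业']
```

Because only whole predicted tokens are merged, '和服' (which cuts across the token boundary inside '和' and '服务') can never be produced, matching the behavior shown above.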
property dict_force: hanlp_trie.dictionary.DictInterface

The high-priority dictionary, which performs longest-prefix-matching on inputs to split them into two subsets:

  1. spans containing no keywords, which are then fed into the tokenizer for further tokenization.

  2. keywords, which will be output without further tokenization.


Longest-prefix-matching NEVER guarantees the presence of any given keyword, because an earlier match can consume characters a later keyword needs. Abuse of dict_force can lead to low-quality results. For more details, refer to this book.


>>> tok.dict_force = {'和服', '服务行业'} # Force '和服' and '服务行业' by longest-prefix-matching
>>> tok("商品和服务行业")
    ['商品', '和服', '务行业']
>>> tok.dict_force = {'和服务': ['和', '服务']} # Force '和服务' to be tokenized as ['和', '服务']
>>> tok("商品和服务行业")
    ['商品', '和', '服务', '行业']
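The two-subset splitting described above can be sketched as follows; `force_split` and the `tokenize` callback are hypothetical stand-ins for the model, not HanLP internals:

```python
def force_split(text, forced, tokenize):
    """Split text by forced keywords via longest-prefix-matching. Keywords are
    emitted as-is (or as their custom split when `forced` is a dict mapping a
    keyword to a token list); the remaining spans go to `tokenize`."""
    out, i, start = [], 0, 0
    while i < len(text):
        best = max((w for w in forced if text.startswith(w, i)),
                   key=len, default=None)
        if best:
            if start < i:
                out.extend(tokenize(text[start:i]))  # span with no keywords
            if isinstance(forced, dict) and forced[best]:
                out.extend(forced[best])  # custom tokenization of the keyword
            else:
                out.append(best)          # keyword emitted verbatim
            i += len(best)
            start = i
        else:
            i += 1
    if start < len(text):
        out.extend(tokenize(text[start:]))
    return out

# A trivial stand-in tokenizer that returns each span as one token:
print(force_split('商品和服务行业', {'和服', '服务行业'}, lambda s: [s]))
# → ['商品', '和服', '务行业']
print(force_split('商品和服务行业', {'和服务': ['和', '服务']}, lambda s: [s]))
# → ['商品', '和', '服务', '行业']
```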
fit(trn_data, dev_data, save_dir, transformer, average_subwords=False, word_dropout: float = 0.2, hidden_dropout=None, layer_dropout=0, scalar_mix=None, grad_norm=5.0, transformer_grad_norm=None, lr=5e-05, eval_trn=True, transformer_lr=None, transformer_layers=None, gradient_accumulation=1, adam_epsilon=1e-08, weight_decay=0, warmup_steps=0.1, crf=False, reduction='sum', batch_size=32, sampler_builder: Optional[hanlp.common.dataset.SamplerBuilder] = None, epochs=30, patience=5, token_key=None, tagging_scheme='BMES', delimiter=None, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False, transform=None, logger=None, devices: Optional[Union[float, int, List[int]]] = None, **kwargs)[source]
  • trn_data – Training set.

  • dev_data – Development set.

  • save_dir – The directory to save trained component.

  • transformer – An identifier of a pre-trained transformer.

  • average_subwords – True to average subword representations.

  • word_dropout – Dropout rate to randomly replace a subword with MASK.

  • hidden_dropout – Dropout rate applied to hidden states.

  • layer_dropout – Randomly zero out hidden states of a transformer layer.

  • scalar_mix – Layer attention.

  • grad_norm – Gradient norm for clipping.

  • transformer_grad_norm – Gradient norm for clipping transformer gradient.

  • lr – Learning rate for decoder.

  • transformer_lr – Learning rate for encoder.

  • transformer_layers – The number of bottom layers to use.

  • gradient_accumulation – Number of batches per update.

  • adam_epsilon – The epsilon to use in Adam.

  • weight_decay – The weight decay to use.

  • warmup_steps – The number of warmup steps.

  • crf – True to enable CRF (Lafferty et al. 2001).

  • reduction – The loss reduction used in aggregating losses.

  • batch_size – The number of samples in a batch.

  • sampler_builder – The builder to build sampler, which will override batch_size.

  • epochs – The number of epochs to train.

  • patience – The number of patience epochs before early stopping.

  • token_key – The key to tokens in dataset.

  • tagging_scheme – Either BMES or BI.

  • delimiter – Delimiter between tokens used to split a line in the corpus.

  • max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.

  • sent_delimiter – Delimiter between sentences, like period or comma, which indicates a long sentence can be split here.

  • char_level – Whether the sequence length is measured at char level.

  • hard_constraint – Whether to enforce hard length constraint on sentences. If there is no sent_delimiter in a sentence, it will be split at a token anyway.

  • transform – An optional transform to be applied to samples. Usually a character normalization transform is passed in.

  • devices – Devices this component will live on.

  • logger – Any logging.Logger instance.

  • seed – Random seed to reproduce this training.

  • **kwargs – Not used.


Best metrics on dev set.
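How max_seq_len, sent_delimiter, and hard_constraint interact can be sketched as follows (an illustrative re-implementation at the token level, not the code fit() actually runs):

```python
def split_long(tokens, max_seq_len, sent_delimiters, hard_constraint=False):
    """Split a long token sequence into chunks of at most max_seq_len,
    preferring to cut right after a delimiter token; with hard_constraint,
    cut at the limit anyway when no delimiter was seen."""
    chunks, buf = [], []
    last_delim = -1  # position just after the last delimiter in buf
    for tok in tokens:
        buf.append(tok)
        if tok in sent_delimiters:
            last_delim = len(buf)
        if len(buf) >= max_seq_len:
            if last_delim > 0:
                chunks.append(buf[:last_delim])
                buf = buf[last_delim:]
            elif hard_constraint:
                chunks.append(buf)
                buf = []
            # recompute the last delimiter inside the carried-over buffer
            last_delim = -1
            for k, t in enumerate(buf):
                if t in sent_delimiters:
                    last_delim = k + 1
    if buf:
        chunks.append(buf)
    return chunks

print(split_long(list('abc,def,gh'), 5, {','}))
# → [['a', 'b', 'c', ','], ['d', 'e', 'f', ','], ['g', 'h']]
```

Without hard_constraint, a sentence with no delimiter is left longer than max_seq_len, matching the "split into shorter ones if possible" wording above.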