Transformer-based multi-criteria word tokenizer.

class hanlp.components.tokenizers.multi_criteria_cws_transformer.MultiCriteriaTransformerTaggingTokenizer(**kwargs)[source]

Transformer-based implementation of “Effective Neural Solution for Multi-Criteria Word Segmentation” (He et al. 2019). It uses an artificial token [unused_i] in place of [SEP] in the input_ids to mark the i-th segmentation criterion.
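The criterion-marking scheme can be sketched in plain Python. The helper below is purely illustrative (it is not HanLP's implementation, and the token spelling `[unused{i}]` is an assumption about the vocabulary slot naming); it shows how a character sequence for the i-th criterion might be wrapped, with [unused_i] standing in for [SEP]:

```python
# Illustrative sketch only: marking the i-th segmentation criterion
# with an artificial [unused_i] token in place of [SEP].
def mark_criterion(chars, criterion_index):
    """Wrap a character sequence the way a BERT-style input would be
    built, with [unused_i] as the trailing special token so the model
    knows which segmentation criterion (corpus) the sample follows."""
    return ['[CLS]'] + list(chars) + [f'[unused{criterion_index}]']

mark_criterion('商品和服务', 1)
# ['[CLS]', '商', '品', '和', '服', '务', '[unused1]']
```

The design lets one shared encoder train on several corpora with conflicting segmentation conventions: the criterion token conditions the tagger, so switching the token switches the segmentation style at inference time.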


Parameters: **kwargs – Not used.


build_metric(**kwargs)[source]

Implement this to build metric(s).


Parameters: **kwargs – The subclass decides the method signature.

build_vocabs(trn, logger, **kwargs)[source]

Override this method to build vocabs.

Parameters:

  • trn – Training set.

  • logger – Logger for reporting progress.

fit(trn_data, dev_data, save_dir, transformer, average_subwords=False, word_dropout: float = 0.2, hidden_dropout=None, layer_dropout=0, scalar_mix=None, mix_embedding: int = 0, grad_norm=5.0, transformer_grad_norm=None, lr=5e-05, transformer_lr=None, transformer_layers=None, gradient_accumulation=1, adam_epsilon=1e-08, weight_decay=0, warmup_steps=0.1, crf=False, reduction='sum', batch_size=32, sampler_builder: Optional[hanlp.common.dataset.SamplerBuilder] = None, epochs=30, patience=5, token_key=None, tagging_scheme='BMES', delimiter=None, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False, transform=None, logger=None, devices: Optional[Union[float, int, List[int]]] = None, **kwargs)[source]
Parameters:

  • trn_data – Training set.

  • dev_data – Development set.

  • save_dir – The directory to save trained component.

  • transformer – An identifier of a pre-trained transformer.

  • average_subwords – True to average subword representations.

  • word_dropout – The dropout rate at which a subword is randomly replaced with [MASK].

  • hidden_dropout – Dropout rate applied to hidden states.

  • layer_dropout – Randomly zero out hidden states of a transformer layer.

  • scalar_mix – Layer attention.

  • grad_norm – Gradient norm for clipping.

  • transformer_grad_norm – Gradient norm for clipping transformer gradient.

  • lr – Learning rate for the decoder.

  • transformer_lr – Learning rate for the encoder.

  • transformer_layers – The number of bottom layers to use.

  • gradient_accumulation – Number of batches per update.

  • adam_epsilon – The epsilon to use in Adam.

  • weight_decay – The weight decay to use.

  • warmup_steps – The number of warmup steps.

  • crf – True to enable CRF (Lafferty et al. 2001).

  • reduction – The loss reduction used in aggregating losses.

  • batch_size – The number of samples in a batch.

  • sampler_builder – A builder for the sampler, which overrides batch_size when provided.

  • epochs – The number of epochs to train.

  • patience – The number of patience epochs before early stopping.

  • token_key – The key to tokens in dataset.

  • tagging_scheme – Either BMES or BI.

  • delimiter – Delimiter between tokens used to split a line in the corpus.

  • max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.

  • sent_delimiter – Delimiter between sentences, like period or comma, which indicates a long sentence can be split here.

  • char_level – Whether the sequence length is measured at char level.

  • hard_constraint – Whether to enforce hard length constraint on sentences. If there is no sent_delimiter in a sentence, it will be split at a token anyway.

  • transform – An optional transform to be applied to samples. Usually a character normalization transform is passed in.

  • devices – Devices this component will live on.

  • logger – Any logging.Logger instance.

  • seed – Random seed to reproduce this training.

  • **kwargs – Not used.
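To make the BMES tagging_scheme above concrete, word segmentation reduces to per-character tagging: S for a single-character word, otherwise B (begin), M (middle), E (end). The helper below is an illustrative sketch, not part of HanLP's API:

```python
def words_to_bmes(words):
    """Convert a segmented sentence into per-character BMES tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append('S')  # single-character word
        else:
            # first char B, last char E, everything between M
            tags.extend(['B'] + ['M'] * (len(word) - 2) + ['E'])
    return tags

words_to_bmes(['商品', '和', '服务'])  # ['B', 'E', 'S', 'B', 'E']
```

The BI scheme works the same way with a coarser alphabet (B for a word-initial character, I for everything else), which trades some decoding precision for fewer labels.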


Returns: Best metrics on dev set.
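The interaction of max_seq_len, sent_delimiter, and hard_constraint described in the parameter list can be sketched as follows. This is a simplified illustration, not HanLP's actual splitting code: an over-long sequence is preferentially cut right after a delimiter character, and under a hard constraint a span with no delimiter is cut at the length limit anyway.

```python
def split_long(chars, max_seq_len, sent_delimiters=',，。', hard_constraint=True):
    """Greedily split a character sequence into chunks no longer than
    max_seq_len, preferring to cut just after a delimiter character."""
    chunks, start = [], 0
    while len(chars) - start > max_seq_len:
        window = chars[start:start + max_seq_len]
        # Last delimiter inside the window, if any, marks the cut point.
        cut = max((i for i, c in enumerate(window) if c in sent_delimiters),
                  default=-1)
        if cut == -1:
            if not hard_constraint:
                break  # no delimiter found: leave the remainder unsplit
            cut = max_seq_len - 1  # hard cut at the length limit
        chunks.append(chars[start:start + cut + 1])
        start += cut + 1
    chunks.append(chars[start:])
    return chunks

split_long('abc,defg,hi', 6)  # ['abc,', 'defg,', 'hi']
```

Splitting before encoding keeps each piece within the transformer's positional limit while cutting at punctuation, where a segmentation boundary is least likely to be harmed.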


on_config_ready(**kwargs)[source]

Called when config is ready, either during fit or load. Subclasses can perform extra initialization tasks in this callback.


Parameters: **kwargs – Not used.