multi_criteria

Transformer-based multi-criteria word tokenizer.

class hanlp.components.tokenizers.multi_criteria_cws_transformer.MultiCriteriaTransformerTaggingTokenizer(**kwargs)

Transformer-based implementation of “Effective Neural Solution for Multi-Criteria Word Segmentation” (He et al. 2019). It uses an artificial token [unused_i] instead of [SEP] in the input_ids to mark the i-th segmentation criterion.
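The criterion marker described above can be sketched in plain Python. This is an illustration only, not HanLP's code: the helper name build_input_tokens is hypothetical, and the spelling [unused1], [unused2], ... is an assumption based on BERT-style vocabularies. A real pipeline would then map these tokens to input_ids with the transformer's own tokenizer.

```python
from typing import List


def build_input_tokens(chars: List[str], criterion_index: int) -> List[str]:
    """Wrap characters BERT-style, closing with [unused_i] where [SEP] would go.

    Hypothetical helper for illustration; the token spelling is an assumption.
    """
    if criterion_index < 0:
        raise ValueError("criterion_index must be non-negative")
    criterion_token = f"[unused{criterion_index}]"
    return ["[CLS]"] + list(chars) + [criterion_token]


# The same sentence, marked for two different segmentation criteria.
sentence = list("商品和服务")
tokens_for_criterion_1 = build_input_tokens(sentence, 1)
tokens_for_criterion_2 = build_input_tokens(sentence, 2)

print(tokens_for_criterion_1)
print(tokens_for_criterion_2)
```

Only the final token differs between the two sequences, so a single model can be told which criterion to follow without any change to its architecture.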

Parameters: **kwargs – Not used.

build_metric(**kwargs)

Implement this to build metric(s).

Parameters: **kwargs – The subclass decides the method signature.

build_vocabs(trn, logger, **kwargs)

Override this method to build vocabs.

Parameters

trn – Training set.
logger – Logger for reporting progress.

fit(trn_data, dev_data, save_dir, transformer, average_subwords=False, word_dropout: float = 0.2, hidden_dropout=None, layer_dropout=0, scalar_mix=None, mix_embedding: int = 0, grad_norm=5.0, transformer_grad_norm=None, lr=5e-05, transformer_lr=None, transformer_layers=None, gradient_accumulation=1, adam_epsilon=1e-08, weight_decay=0, warmup_steps=0.1, crf=False, reduction='sum', batch_size=32, sampler_builder: Optional[hanlp.common.dataset.SamplerBuilder] = None, epochs=30, patience=5, token_key=None, tagging_scheme='BMES', delimiter=None, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False, transform=None, logger=None, devices: Optional[Union[float, int, List[int]]] = None, **kwargs)

Parameters

trn_data – Training set.
dev_data – Development set.
save_dir – The directory to save trained component.
transformer – An identifier of a pre-trained transformer.
average_subwords – True to average subword representations.
word_dropout – Dropout rate to randomly replace a subword with MASK.
hidden_dropout – Dropout rate applied to hidden states.
layer_dropout – Randomly zero out hidden states of a transformer layer.
scalar_mix – Layer attention.
grad_norm – Gradient norm for clipping.
transformer_grad_norm – Gradient norm for clipping transformer gradient.
lr – Learning rate for decoder.
transformer_lr – Learning rate for encoder.
transformer_layers – The number of bottom layers to use.
gradient_accumulation – Number of batches per update.
adam_epsilon – The epsilon to use in Adam.
weight_decay – The weight decay to use.
warmup_steps – The number of warmup steps.
crf – True to enable CRF (Lafferty et al. 2001).
reduction – The loss reduction used in aggregating losses.
batch_size – The number of samples in a batch.
sampler_builder – The builder to build sampler, which will override batch_size.
epochs – The number of epochs to train.
patience – The number of epochs without improvement on the dev set to tolerate before early stopping.
token_key – The key to tokens in dataset.
tagging_scheme – Either BMES or BI.
delimiter – Delimiter between tokens used to split a line in the corpus.
max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.
sent_delimiter – Delimiter between sentences, like period or comma, which indicates a long sentence can be split here.
char_level – Whether the sequence length is measured at char level.
hard_constraint – Whether to enforce hard length constraint on sentences. If there is no sent_delimiter in a sentence, it will be split at a token anyway.
transform – An optional transform to be applied to samples. Usually a character normalization transform is passed in.
devices – Devices this component will live on.
logger – Any logging.Logger instance.
seed – Random seed to reproduce this training.
**kwargs – Not used.
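The interplay of max_seq_len, sent_delimiter and hard_constraint described above can be sketched as follows. This is a simplified illustration under stated assumptions, not HanLP's actual implementation: the function name split_long_sentence is hypothetical, length is counted in tokens, and a sentence without a usable delimiter is left over-long unless hard_constraint is set.

```python
from typing import List, Optional, Set


def split_long_sentence(
    tokens: List[str],
    max_seq_len: int,
    sent_delimiter: Optional[Set[str]] = None,
    hard_constraint: bool = False,
) -> List[List[str]]:
    """Split tokens into chunks of at most max_seq_len where possible."""
    if max_seq_len < 1:
        raise ValueError("max_seq_len must be positive")
    delimiters = sent_delimiter or set()
    chunks: List[List[str]] = []
    start = 0
    while len(tokens) - start > max_seq_len:
        window_end = start + max_seq_len  # exclusive end of the allowed window
        cut = None
        # Prefer the last delimiter that still fits inside the window.
        for i in range(window_end - 1, start - 1, -1):
            if tokens[i] in delimiters:
                cut = i + 1
                break
        if cut is None:
            if hard_constraint:
                # No delimiter available: cut at a token boundary anyway.
                cut = window_end
            else:
                # Soft mode: run on to the next delimiter, or to the end.
                cut = len(tokens)
                for i in range(window_end, len(tokens)):
                    if tokens[i] in delimiters:
                        cut = i + 1
                        break
        chunks.append(tokens[start:cut])
        start = cut
    if start < len(tokens):
        chunks.append(tokens[start:])
    return chunks


chars = list("abc,defgh,ij")
soft = split_long_sentence(chars, max_seq_len=5, sent_delimiter={","})
hard = split_long_sentence(chars, max_seq_len=5, sent_delimiter={","},
                           hard_constraint=True)
print(["".join(chunk) for chunk in soft])
print(["".join(chunk) for chunk in hard])
```

In the soft run the middle chunk exceeds five tokens because no delimiter falls inside its window; with hard_constraint=True every chunk respects the limit at the cost of a cut that ignores sentence structure.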

Returns

Best metrics on dev set.
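A call to fit might look like the sketch below. The class path and keyword names come from the signature above; everything else is an assumption: the transformer identifier "bert-base-chinese" is only an example, the wrapper train_tokenizer is hypothetical, and the form that trn_data and dev_data must take is not specified on this page, so they are passed through unchanged.

```python
# Keyword arguments taken from the fit signature documented above.
# The values are illustrative; most simply repeat the documented defaults.
FIT_OPTIONS = dict(
    transformer="bert-base-chinese",  # assumption: any pre-trained identifier
    tagging_scheme="BMES",
    epochs=30,
    patience=5,
    batch_size=32,
    lr=5e-5,
    max_seq_len=None,
    crf=False,
)


def train_tokenizer(trn_data, dev_data, save_dir):
    """Hypothetical wrapper: build the component and train it."""
    # Imported lazily so this sketch loads even where HanLP is not installed.
    from hanlp.components.tokenizers.multi_criteria_cws_transformer import (
        MultiCriteriaTransformerTaggingTokenizer,
    )

    tokenizer = MultiCriteriaTransformerTaggingTokenizer()
    # fit is documented to return the best metrics on the dev set.
    return tokenizer.fit(trn_data, dev_data, save_dir, **FIT_OPTIONS)
```

Keeping the options in one dictionary makes it easy to log or version the training configuration alongside save_dir.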

on_config_ready(**kwargs)

Called when config is ready, either during fit or load. Subclasses can perform extra initialization tasks in this callback.

Parameters: **kwargs – Not used.
