multi_criteria
Transformer-based multi-criteria word tokenizer.
- class hanlp.components.tokenizers.multi_criteria_cws_transformer.MultiCriteriaTransformerTaggingTokenizer(**kwargs)
Transformer-based implementation of “Effective Neural Solution for Multi-Criteria Word Segmentation” (He et al. 2019). It uses an artificial token [unused_i] instead of [SEP] in the input_ids to mark the i-th segmentation criterion.
- Parameters
**kwargs – Not used.
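The criterion-marking trick is easy to illustrate outside of HanLP. The following sketch (an illustration of the idea, not HanLP's internal code) encodes a sentence with the Hugging Face BertTokenizer and swaps the [SEP] id for [unused3] to mark a hypothetical third criterion; the [unusedN] token names follow the bert-base-chinese vocabulary:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
    ids = tokenizer.encode('商品和服务')  # [CLS] 商 品 和 服 务 [SEP]
    # Mark the 3rd segmentation criterion by replacing [SEP] with [unused3]
    unused_id = tokenizer.convert_tokens_to_ids('[unused3]')
    ids = [unused_id if i == tokenizer.sep_token_id else i for i in ids]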
- build_metric(**kwargs)
Implement this to build metric(s).
- Parameters
**kwargs – The subclass decides the method signature.
- build_vocabs(trn, logger, **kwargs)
Override this method to build vocabs.
- Parameters
trn – Training set.
logger – Logger for reporting progress.
- fit(trn_data, dev_data, save_dir, transformer, average_subwords=False, word_dropout: float = 0.2, hidden_dropout=None, layer_dropout=0, scalar_mix=None, mix_embedding: int = 0, grad_norm=5.0, transformer_grad_norm=None, lr=5e-05, transformer_lr=None, transformer_layers=None, gradient_accumulation=1, adam_epsilon=1e-08, weight_decay=0, warmup_steps=0.1, crf=False, reduction='sum', batch_size=32, sampler_builder: Optional[hanlp.common.dataset.SamplerBuilder] = None, epochs=30, patience=5, token_key=None, tagging_scheme='BMES', delimiter=None, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False, transform=None, logger=None, devices: Optional[Union[float, int, List[int]]] = None, **kwargs)
- Parameters
trn_data – Training set.
dev_data – Development set.
save_dir – The directory to save trained component.
transformer – An identifier of a pre-trained transformer.
average_subwords – True to average subword representations.
word_dropout – Dropout rate at which a subword is randomly replaced with MASK.
hidden_dropout – Dropout rate applied to hidden states.
layer_dropout – The probability of randomly zeroing out the hidden states of an entire transformer layer.
scalar_mix – Layer attention.
grad_norm – Gradient norm for clipping.
transformer_grad_norm – Gradient norm for clipping transformer gradient.
lr – Learning rate for decoder.
transformer_lr – Learning rate for the encoder.
transformer_layers – The number of bottom layers to use.
gradient_accumulation – Number of batches per update.
adam_epsilon – The epsilon to use in Adam.
weight_decay – The weight decay to use.
warmup_steps – The number of warmup steps.
crf – True to enable CRF (Lafferty et al. 2001).
reduction – The loss reduction used in aggregating losses.
batch_size – The number of samples in a batch.
sampler_builder – The builder to build the sampler, which will override batch_size.
epochs – The number of epochs to train.
patience – The number of patience epochs before early stopping.
token_key – The key to tokens in dataset.
tagging_scheme – Either BMES or BI.
delimiter – Delimiter between tokens used to split a line in the corpus.
max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.
sent_delimiter – Delimiter between sentences, such as a period or comma, which indicates that a long sentence can be split there.
char_level – Whether the sequence length is measured at char level.
hard_constraint – Whether to enforce a hard length constraint on sentences. If a sentence contains no sent_delimiter, it will be split at a token anyway.
transform – An optional transform to be applied to samples. Usually a character normalization transform is passed in.
devices – Devices this component will live on.
logger – Any logging.Logger instance.
seed – Random seed to reproduce this training.
**kwargs – Not used.
- Returns
Best metrics on dev set.
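A minimal training sketch against the fit signature above; the corpus paths and the bert-base-chinese identifier are hypothetical placeholders, and the remaining arguments keep their documented defaults:

    from hanlp.components.tokenizers.multi_criteria_cws_transformer import \
        MultiCriteriaTransformerTaggingTokenizer

    tokenizer = MultiCriteriaTransformerTaggingTokenizer()
    tokenizer.fit(
        trn_data='data/cws/multi/train.tsv',   # hypothetical path
        dev_data='data/cws/multi/dev.tsv',     # hypothetical path
        save_dir='data/model/multi_criteria_cws',
        transformer='bert-base-chinese',       # any pre-trained transformer identifier
        tagging_scheme='BMES',
        epochs=30,
        batch_size=32,
    )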