transformer
Transformer-based tokenizer.
- class hanlp.components.tokenizers.transformer.TransformerTaggingTokenizer(**kwargs)[source]
A tokenizer that uses a transformer tagger for span prediction. It features two high-performance dictionaries to handle edge cases in real applications.
dict_force
: A high-priority dictionary that performs longest-prefix-matching on the input text; its matches take priority over model predictions.
dict_combine
: A low-priority dictionary that performs longest-prefix-matching on model predictions and then combines the matched tokens.
Note
For algorithm beginners, longest-prefix-matching is the prerequisite for understanding what a dictionary can and cannot do. The tutorial in this book can be very helpful.
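To make the note concrete, here is a minimal, self-contained sketch of greedy longest-prefix-matching over a plain set of keywords (an illustration only, not the library's hanlp_trie implementation):

def longest_prefix_match(text, keywords):
    """Greedily match the longest keyword starting at each position.

    Returns (begin, end, keyword) spans; text not covered by any span is
    what a model-based tokenizer would still have to segment.
    """
    max_len = max((len(k) for k in keywords), default=0)
    spans, i = [], 0
    while i < len(text):
        hit = None
        # Try the longest candidate starting at i first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in keywords:
                hit = (i, j, text[i:j])
                break
        if hit:
            spans.append(hit)
            i = hit[1]
        else:
            i += 1
    return spans

>>> longest_prefix_match('商品和服务行业', {'和服', '服务行业'})
[(2, 4, '和服')]

Note how '和服' wins at position 2, so '服务行业' can never match afterwards; this is exactly the caveat described for dict_force below.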
It also supports outputting the span of each token by setting config.output_spans = True (see the sketch after the parameter list below).
- Parameters
**kwargs – Predefined config.
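A minimal sketch of span output. The pretrained identifier and the exact output format below are assumptions based on typical HanLP usage, not values from this page; check the installed version:

>>> import hanlp
>>> tok = hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH)  # assumed identifier; any tokenizer model works
>>> tok.config.output_spans = True
>>> tok('商品和服务')  # each item is assumed to carry [token, begin, end] character offsets,
...                     # e.g. [['商品', 0, 2], ['和', 2, 3], ['服务', 3, 5]]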
- build_metric(**kwargs)[source]
Implement this to build metric(s).
- Parameters
**kwargs – The subclass decides the method signature.
- property dict_combine: hanlp_trie.dictionary.DictInterface
The low-priority dictionary, which performs longest-prefix-matching on model predictions and then combines the matched tokens.
Examples
>>> tok.dict_combine = {'和服', '服务行业'}
>>> tok("商品和服务行业")  # '和服' is not in the original results ['商品', '和', '服务', '行业']; '服务' and '行业' are combined into '服务行业'
['商品', '和', '服务行业']
- property dict_force: hanlp_trie.dictionary.DictInterface
The high-priority dictionary, which performs longest-prefix-matching on inputs to split them into two subsets:
spans containing no keywords, which are then fed into the tokenizer for further tokenization;
keywords, which will be output without further tokenization.
Caution
Longest-prefix-matching NEVER guarantees the presence of any keyword. Abuse of dict_force can lead to low-quality results. For more details, refer to this book.
Examples
>>> tok.dict_force = {'和服', '服务行业'}  # Force '和服' and '服务行业' by longest-prefix-matching
>>> tok("商品和服务行业")
['商品', '和服', '务行业']
>>> tok.dict_force = {'和服务': ['和', '服务']}  # Force '和服务' to be tokenized as ['和', '服务']
>>> tok("商品和服务行业")
['商品', '和', '服务', '行业']
- fit(trn_data, dev_data, save_dir, transformer, average_subwords=False, word_dropout: float = 0.2, hidden_dropout=None, layer_dropout=0, scalar_mix=None, grad_norm=5.0, transformer_grad_norm=None, lr=5e-05, eval_trn=True, transformer_lr=None, transformer_layers=None, gradient_accumulation=1, adam_epsilon=1e-08, weight_decay=0, warmup_steps=0.1, crf=False, reduction='sum', batch_size=32, sampler_builder: Optional[hanlp.common.dataset.SamplerBuilder] = None, epochs=30, patience=5, token_key=None, tagging_scheme='BMES', delimiter=None, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False, transform=None, logger=None, devices: Optional[Union[float, int, List[int]]] = None, **kwargs)[source]
- Parameters
trn_data – Training set.
dev_data – Development set.
save_dir – The directory to save trained component.
transformer – An identifier of a pre-trained transformer.
average_subwords – True to average subword representations.
word_dropout – Dropout rate to randomly replace a subword with MASK.
hidden_dropout – Dropout rate applied to hidden states.
layer_dropout – Randomly zero out hidden states of a transformer layer.
scalar_mix – Layer attention.
grad_norm – Gradient norm for clipping.
transformer_grad_norm – Gradient norm for clipping transformer gradient.
lr – Learning rate for decoder.
transformer_lr – Learning rate for encoder.
transformer_layers – The number of bottom layers to use.
gradient_accumulation – Number of batches per update.
adam_epsilon – The epsilon to use in Adam.
weight_decay – The weight decay to use.
warmup_steps – The number of warmup steps.
crf – True to enable CRF (Lafferty et al. 2001).
reduction – The loss reduction used in aggregating losses.
batch_size – The number of samples in a batch.
sampler_builder – The builder used to build the sampler, which will override batch_size.
epochs – The number of epochs to train.
patience – The number of patience epochs before early stopping.
token_key – The key to tokens in dataset.
tagging_scheme – Either BMES or BI.
delimiter – Delimiter between tokens used to split a line in the corpus.
max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.
sent_delimiter – Delimiter between sentences, like a period or comma, which indicates that a long sentence can be split there.
char_level – Whether the sequence length is measured at char level.
hard_constraint – Whether to enforce a hard length constraint on sentences. If there is no sent_delimiter in a sentence, it will be split at a token anyway.
transform – An optional transform to be applied to samples. Usually a character normalization transform is passed in.
devices – Devices this component will live on.
logger – Any logging.Logger instance.
seed – Random seed to reproduce this training.
**kwargs – Not used.
- Returns
Best metrics on dev set.
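A rough training sketch showing how a few of these arguments fit together. The file paths and the transformer identifier below are placeholders, not values from this page:

>>> from hanlp.components.tokenizers.transformer import TransformerTaggingTokenizer
>>> tokenizer = TransformerTaggingTokenizer()
>>> tokenizer.fit(
...     trn_data='data/tok/train.txt',    # placeholder: path to a tokenized training corpus
...     dev_data='data/tok/dev.txt',      # placeholder: path to the development set
...     save_dir='data/model/tok',        # directory where the trained component is saved
...     transformer='bert-base-chinese',  # any pre-trained transformer identifier
...     lr=5e-5,
...     transformer_lr=2e-5,
...     epochs=3,
... )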