transformer_ner

Tagging based Named Entity Recognition.

class hanlp.components.ner.transformer_ner.TransformerNamedEntityRecognizer(**kwargs)

A simple tagger using a transformer encoder and a linear layer, with an optional CRF (Lafferty et al. 2001) layer, for the NER task. It can use whitelist gazetteers, which are dicts mapping entity names to entity types. During decoding, it performs longest-prefix matching against these entries to override the predictions of the underlying statistical model. It also uses a blacklist to mask out mis-predicted entities.

Note

For readers new to the algorithm, understanding longest-prefix matching is a prerequisite for understanding what a dictionary can and cannot do. The tutorial in this book can be very helpful.
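As a rough illustration of the longest-prefix matching mentioned above, the sketch below scans tokens from left to right and, at each position, takes the longest whitelist entry that starts there. The function name `longest_match_entities`, the `delimiter` argument, and the quadratic scan are assumptions made for illustration; this is not HanLP's implementation, which may use a trie or another data structure.

```python
from typing import Dict, List, Tuple


def longest_match_entities(
    tokens: List[str], whitelist: Dict[str, str], delimiter: str = " "
) -> List[Tuple[str, str, int, int]]:
    """Return (entity, type, start, end) spans found by longest-prefix matching.

    Hypothetical helper for illustration only; `end` is exclusive.
    """
    entities = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so longer entries beat their prefixes.
        for j in range(len(tokens), i, -1):
            candidate = delimiter.join(tokens[i:j])
            if candidate in whitelist:
                match = (candidate, whitelist[candidate], i, j)
                break
        if match is not None:
            entities.append(match)
            i = match[3]  # resume scanning right after the matched span
        else:
            i += 1
    return entities


whitelist = {
    "New York": "LOCATION",
    "New York City": "LOCATION",
    "York": "PERSON",
}
tokens = ["New", "York", "City", "is", "big"]
print(longest_match_entities(tokens, whitelist))
```

In this example the entry "New York City" wins over its prefix "New York", and "York" alone is never considered because it lies inside an already matched span. This also shows a limit of dictionaries: the match depends only on the surface string, not on the context in which it appears.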

Parameters

**kwargs – Not used.

build_metric(**kwargs)

Implement this to build metric(s).

Parameters

**kwargs – The subclass decides the method signature.

build_vocabs(trn, logger, **kwargs)

Override this method to build vocabs.

Parameters
  • trn – Training set.

  • logger – Logger for reporting progress.

fit(trn_data, dev_data, save_dir, transformer, delimiter_in_entity=None, merge_types: Optional[List[str]] = None, average_subwords=False, word_dropout: float = 0.2, hidden_dropout=None, layer_dropout=0, scalar_mix=None, grad_norm=5.0, lr=5e-05, transformer_lr=None, adam_epsilon=1e-08, weight_decay=0, warmup_steps=0.1, crf=False, secondary_encoder=None, reduction='sum', batch_size=32, sampler_builder: Optional[hanlp.common.dataset.SamplerBuilder] = None, epochs=3, tagset=None, token_key='token', max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False, transform=None, logger=None, seed=None, devices: Optional[Union[float, int, List[int]]] = None, **kwargs)

Fit the component to the training set.

Parameters
  • trn_data – Training set.

  • dev_data – Development set.

  • save_dir – The directory in which to save the trained component.

  • transformer – An identifier of a pre-trained transformer.

  • delimiter_in_entity – The delimiter between tokens in an entity, which is used to rebuild the entity by joining its tokens during decoding.

  • merge_types – The types of consecutive entities to be merged.

  • average_subwords – True to average subword representations.

  • word_dropout – Dropout rate to randomly replace a subword with MASK.

  • hidden_dropout – Dropout rate applied to hidden states.

  • layer_dropout – Randomly zero out hidden states of a transformer layer.

  • scalar_mix – Layer attention.

  • grad_norm – Gradient norm for clipping.

  • lr – Learning rate for decoder.

  • transformer_lr – Learning rate for encoder.

  • adam_epsilon – The epsilon to use in Adam.

  • weight_decay – The weight decay to use.

  • warmup_steps – The number of warmup steps. The default of 0.1 suggests that a fraction of the total training steps may also be given.

  • crf – True to enable CRF (Lafferty et al. 2001).

  • secondary_encoder – An optional secondary encoder to provide enhanced representation by taking the hidden states from the main encoder as input.

  • reduction – The loss reduction used in aggregating losses.

  • batch_size – The number of samples in a batch.

  • sampler_builder – The builder to build sampler, which will override batch_size.

  • epochs – The number of epochs to train.

  • tagset – Optional tagset to prune entities outside of this tagset from datasets.

  • token_key – The key to tokens in dataset.

  • max_seq_len – The maximum sequence length. Sequences longer than this will be handled by a sliding window.

  • sent_delimiter – Delimiter between sentences, like period or comma, which indicates a long sentence can be split here.

  • char_level – Whether the sequence length is measured at char level, which is never the case for lemmatization.

  • hard_constraint – Whether to enforce a hard length constraint on sentences. If a sentence contains no sent_delimiter, it will be split at a token anyway.

  • transform – An optional transform to be applied to samples. Usually a character normalization transform is passed in.

  • devices – Devices this component will live on.

  • logger – Any logging.Logger instance.

  • seed – Random seed to reproduce this training.

  • **kwargs – Not used.

Returns

The best metrics on training set.
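To make the interplay of max_seq_len, sent_delimiter and hard_constraint more concrete, here is a minimal sketch of one way a long sentence could be split. The helper `split_long_sentence` is hypothetical and only mirrors the parameter descriptions above; it is not HanLP's code and it does not implement the sliding window itself.

```python
from typing import List, Optional


def split_long_sentence(
    tokens: List[str],
    max_seq_len: int,
    sent_delimiter: Optional[str] = None,
    hard_constraint: bool = False,
) -> List[List[str]]:
    """Split an over-long token list, preferring cuts right after a delimiter.

    Hypothetical helper for illustration; assumes max_seq_len >= 1.
    """
    chunks = []
    start = 0
    while len(tokens) - start > max_seq_len:
        window = tokens[start:start + max_seq_len]
        cut = None
        if sent_delimiter is not None:
            # Look for the last delimiter that still fits into the window.
            for k in range(len(window) - 1, -1, -1):
                if window[k] == sent_delimiter:
                    cut = start + k + 1
                    break
        if cut is None:
            if not hard_constraint:
                # No delimiter and no hard constraint: keep the rest intact.
                break
            # Hard constraint: cut at the length limit regardless of content.
            cut = start + max_seq_len
        chunks.append(tokens[start:cut])
        start = cut
    if start < len(tokens):
        chunks.append(tokens[start:])
    return chunks


tokens = list("ab,cdefg")
print(split_long_sentence(tokens, 4, sent_delimiter=",", hard_constraint=True))
print(split_long_sentence(tokens, 4, sent_delimiter=",", hard_constraint=False))
```

With the hard constraint, every chunk respects the limit even where no delimiter is available. Without it, the remainder after the last usable delimiter stays in one piece and may still exceed max_seq_len, which in this sketch is left for a later step such as a sliding window to handle.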