tag_ner

Tagging based Named Entity Recognition.

class hanlp.components.mtl.tasks.ner.tag_ner.TaggingNamedEntityRecognition(trn: str = None, dev: str = None, tst: str = None, sampler_builder: hanlp.common.dataset.SamplerBuilder = None, dependencies: str = None, scalar_mix: hanlp.layers.scalar_mix.ScalarMixWithDropoutBuilder = None, use_raw_hidden_states=False, lr=0.001, separate_optimizer=False, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False, tagging_scheme=None, crf=False, delimiter_in_entity=None, merge_types: List[str] = None, secondary_encoder=None, token_key='token', dict_whitelist: Union[hanlp_trie.dictionary.DictInterface, Dict[str, Any], Set[str]] = None, dict_blacklist: Union[hanlp_trie.dictionary.DictInterface, Dict[str, Any], Set[str]] = None, **kwargs)[source]

A simple tagger using a linear layer with an optional CRF (Lafferty et al. 2001) layer for the NER task. It can utilize whitelist gazetteers, which are dicts mapping entity names to entity types. During decoding, it performs longest-prefix-matching of these entries to override the predictions of the underlying statistical model. It also uses a blacklist to mask out mis-predicted entities.

Note

For those new to the algorithm, understanding longest-prefix-matching is a prerequisite for knowing what a dictionary can and cannot do. The tutorial in this book can be very helpful.
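The dictionary behaviour described above can be illustrated with a small, library-independent sketch. It is not HanLP's implementation (which works on trie-backed dictionaries from hanlp_trie); the helper names `longest_match` and `apply_dictionaries`, the `(text, type, start, end)` entity tuples, and the overlap rule are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical entity representation: (text, type, start, end), end exclusive.
Entity = Tuple[str, str, int, int]


def longest_match(tokens: List[str], whitelist: Dict[str, str],
                  delimiter: str = "") -> List[Entity]:
    """Scan left to right; at each position keep the longest whitelisted span."""
    found: List[Entity] = []
    i = 0
    while i < len(tokens):
        match = None
        for j in range(len(tokens), i, -1):  # try the longest span first
            text = delimiter.join(tokens[i:j])
            if text in whitelist:
                match = (text, whitelist[text], i, j)
                break
        if match is not None:
            found.append(match)
            i = match[3]  # continue after the matched span
        else:
            i += 1
    return found


def apply_dictionaries(tokens: List[str], predicted: List[Entity],
                       whitelist: Dict[str, str], blacklist: Set[str],
                       delimiter: str = "") -> List[Entity]:
    """Override model predictions with whitelist matches, then drop blacklisted ones."""
    forced = longest_match(tokens, whitelist, delimiter)
    kept = [
        e for e in predicted
        if e[0] not in blacklist  # mask out known bad cases
        # assumed rule: a prediction overlapping a whitelist match is discarded
        and not any(e[2] < f[3] and f[2] < e[3] for f in forced)
    ]
    return sorted(kept + forced, key=lambda e: e[2])


tokens = ["New", "York", "City", "is", "big"]
whitelist = {"New York": "LOC", "New York City": "LOC"}
blacklist = {"big"}
predicted = [("York City", "ORG", 1, 3), ("big", "PER", 4, 5)]
result = apply_dictionaries(tokens, predicted, whitelist, blacklist, delimiter=" ")
```

In this sketch the longer entry "New York City" wins over "New York", which is the main point of longest-prefix-matching: a dictionary can only force spans it literally contains, and it cannot resolve ambiguity by context.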

Parameters
  • trn – Path to training set.

  • dev – Path to dev set.

  • tst – Path to test set.

  • sampler_builder – A builder which builds a sampler.

  • dependencies – Its dependencies on other tasks.

  • scalar_mix – A builder which builds a ScalarMixWithDropout object.

  • use_raw_hidden_states – Whether to use raw hidden states from transformer without any pooling.

  • lr – Learning rate for this task.

  • separate_optimizer – Whether to use a separate, customized optimizer for this task.

  • max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.

  • sent_delimiter – Delimiter between sentences, like period or comma, which indicates a long sentence can be split here.

  • char_level – Whether the sequence length is measured at char level, which is never the case for lemmatization.

  • hard_constraint – Whether to enforce a hard length constraint on sentences. If there is no sent_delimiter in a sentence, it will be split at a token anyway.

  • token_key – The key to tokens in dataset. This should always be set to token in MTL.

  • crf – True to enable CRF (Lafferty et al. 2001).

  • delimiter_in_entity – The delimiter between tokens in an entity, which is used to rebuild the entity by joining its tokens during decoding.

  • merge_types – The types of consecutive entities to be merged.

  • secondary_encoder – An optional secondary encoder to provide enhanced representation by taking the hidden states from the main encoder as input.

  • dict_whitelist – A dict or a DictInterface of gazetteers to be included into the final results.

  • dict_blacklist – A set or a DictInterface of bad cases to be excluded from the final results.

  • **kwargs
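How max_seq_len, sent_delimiter and hard_constraint interact can be sketched as follows. This is a simplified illustration under stated assumptions, not HanLP's actual splitting routine: the function name `split_long_sentence` is hypothetical, length is measured in tokens (char_level is ignored), and a delimiter is assumed to be a token equal to sent_delimiter.

```python
from typing import List, Optional


def split_long_sentence(tokens: List[str], max_seq_len: Optional[int],
                        sent_delimiter: Optional[str] = None,
                        hard_constraint: bool = False) -> List[List[str]]:
    """Split a sentence longer than max_seq_len into shorter pieces if possible."""
    if max_seq_len is None or len(tokens) <= max_seq_len:
        return [tokens]  # short enough, nothing to do

    # Soft split: cut after every token equal to the delimiter (assumption).
    pieces: List[List[str]] = []
    current: List[str] = []
    for token in tokens:
        current.append(token)
        if sent_delimiter is not None and token == sent_delimiter:
            pieces.append(current)
            current = []
    if current:
        pieces.append(current)

    if not hard_constraint:
        return pieces  # pieces may still exceed max_seq_len

    # Hard split: cut any remaining long piece at a token boundary anyway.
    result: List[List[str]] = []
    for piece in pieces:
        for start in range(0, len(piece), max_seq_len):
            result.append(piece[start:start + max_seq_len])
    return result


tokens = list("abc,defgh")
soft = split_long_sentence(tokens, max_seq_len=4, sent_delimiter=",")
hard = split_long_sentence(tokens, max_seq_len=4, sent_delimiter=",",
                           hard_constraint=True)
```

Without hard_constraint the second piece still has five tokens, because it contains no delimiter; with hard_constraint it is cut at a token boundary regardless.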

build_dataloader(data, transform: Callable = None, training=False, device=None, logger: logging.Logger = None, cache=False, gradient_accumulation=1, **kwargs) → torch.utils.data.dataloader.DataLoader[source]

Build a dataloader for training or evaluation.

Parameters
  • data – Either a path or a list of samples.

  • transform – The transform from MTL, which is usually [TransformerSequenceTokenizer, FieldLength('token')].

  • training – Whether this method is called on training set.

  • device – The device dataloader is intended to work with.

  • logger – Logger for printing messages that indicate progress.

  • cache – Whether the dataloader should be cached.

  • gradient_accumulation – Gradient accumulation to be passed to sampler builder.

  • **kwargs – Additional experimental arguments.

build_metric(**kwargs)[source]

Implement this to build metric(s).

Parameters

**kwargs – The subclass decides the method signature.

build_model(encoder_size, training=True, **kwargs) → torch.nn.modules.module.Module[source]

Build model.

Parameters
  • training – True if called during training.

  • **kwargs – **self.config.

input_is_flat(data) → bool[source]

Check whether the data is flat (meaning that it’s only a single sample, not even batched).

Returns

True to indicate the input data is flat.

Return type

bool
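A flatness check of this kind can be sketched as below. It assumes that a single NER sample is a list of token strings and that a batch is a list of such lists; that input layout is an assumption for illustration and is not stated in this reference.

```python
from typing import List, Union

Sample = List[str]
Batch = List[Sample]


def input_is_flat(data: Union[Sample, Batch]) -> bool:
    """Return True if data is one tokenized sentence rather than a batch."""
    # Assumption: a flat sample starts with a str token, a batch with a list.
    return bool(data) and isinstance(data[0], str)


flat = input_is_flat(["Barack", "Obama", "visited", "Paris"])
batched = input_is_flat([["Barack", "Obama"], ["Paris"]])
```

A caller would typically wrap a flat input into a one-element batch before prediction and unwrap the result afterwards.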