tag_ner
Tagging-based Named Entity Recognition.
- class hanlp.components.mtl.tasks.ner.tag_ner.TaggingNamedEntityRecognition(trn: Optional[str] = None, dev: Optional[str] = None, tst: Optional[str] = None, sampler_builder: Optional[hanlp.common.dataset.SamplerBuilder] = None, dependencies: Optional[str] = None, scalar_mix: Optional[hanlp.layers.scalar_mix.ScalarMixWithDropoutBuilder] = None, use_raw_hidden_states=False, lr=0.001, separate_optimizer=False, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False, tagging_scheme=None, crf=False, delimiter_in_entity=None, merge_types: Optional[List[str]] = None, secondary_encoder=None, token_key='token', dict_whitelist: Optional[Union[hanlp_trie.dictionary.DictInterface, Dict[str, Any], Set[str]]] = None, dict_blacklist: Optional[Union[hanlp_trie.dictionary.DictInterface, Dict[str, Any], Set[str]]] = None, dict_tags: Optional[Union[hanlp_trie.dictionary.DictInterface, Dict[Union[str, Sequence[str]], Union[str, Sequence[str]]]]] = None, **kwargs)
A simple tagger using a linear layer with an optional CRF (Lafferty et al. 2001) layer for the NER task. It can utilize whitelist gazetteers, given as a dict mapping from entity name to entity type. During decoding, it performs longest-prefix-matching of these entries to override the predictions of the underlying statistical model. It also uses a blacklist to mask out mis-predicted entities.
Note
For algorithm beginners, longest-prefix-matching is the prerequisite for understanding what a dictionary can and cannot do. The tutorial in this book can be very helpful.
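The dictionary behaviour described above can be illustrated with a minimal pure-Python sketch. It is not HanLP's implementation (which relies on the trie structures in hanlp_trie); the function names `longest_match` and `apply_dictionaries`, the span tuple layout, and the `delimiter` argument are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical span layout: (entity text, entity type, start token, end token exclusive)
Span = Tuple[str, str, int, int]


def longest_match(tokens: List[str], whitelist: Dict[str, str],
                  delimiter: str = ' ') -> List[Span]:
    """At each position, take the longest run of tokens found in the whitelist."""
    spans: List[Span] = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so longer entries win over their prefixes.
        for j in range(len(tokens), i, -1):
            text = delimiter.join(tokens[i:j])
            if text in whitelist:
                match = (text, whitelist[text], i, j)
                break
        if match is not None:
            spans.append(match)
            i = match[3]  # continue after the matched entity
        else:
            i += 1
    return spans


def apply_dictionaries(tokens: List[str], predicted: List[Span],
                       whitelist: Dict[str, str], blacklist: Set[str],
                       delimiter: str = ' ') -> List[Span]:
    """Let whitelist matches override model predictions, then drop blacklisted ones."""
    dict_spans = longest_match(tokens, whitelist, delimiter)
    covered = {k for _, _, begin, end in dict_spans for k in range(begin, end)}
    kept = [
        span for span in predicted
        if span[0] not in blacklist
        and not any(k in covered for k in range(span[2], span[3]))
    ]
    return sorted(kept + dict_spans, key=lambda span: span[2])


tokens = ["He", "joined", "New", "York", "Times", "in", "May"]
predicted = [("New York", "LOCATION", 2, 4), ("May", "PERSON", 6, 7)]
whitelist = {"New York Times": "ORGANIZATION", "New York": "LOCATION"}
blacklist = {"May"}
print(apply_dictionaries(tokens, predicted, whitelist, blacklist))
```

In this toy input the whitelist entry "New York Times" wins over both the shorter entry "New York" and the overlapping model prediction, while the blacklisted "May" is removed. Note what the dictionary cannot do: it only matches exact token runs, so an entity whose token boundaries differ from the dictionary entry is not found.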
- Parameters
trn – Path to training set.
dev – Path to dev set.
tst – Path to test set.
sampler_builder – A builder which builds a sampler.
dependencies – Its dependencies on other tasks.
scalar_mix – A builder which builds a ScalarMixWithDropout object.
use_raw_hidden_states – Whether to use raw hidden states from transformer without any pooling.
lr – Learning rate for this task.
separate_optimizer – Use customized separate optimizer for this task.
max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.
sent_delimiter – Delimiter between sentences, like period or comma, which indicates a long sentence can be split here.
char_level – Whether the sequence length is measured at char level, which is never the case for lemmatization.
hard_constraint – Whether to enforce hard length constraint on sentences. If there is no sent_delimiter in a sentence, it will be split at a token anyway.
token_key – The key to tokens in dataset. This should always be set to token in MTL.
crf – True to enable CRF (Lafferty et al. 2001).
delimiter_in_entity – The delimiter between tokens in entity, which is used to rebuild entity by joining tokens during decoding.
merge_types – The types of consecutive entities to be merged.
secondary_encoder – An optional secondary encoder to provide enhanced representation by taking the hidden states from the main encoder as input.
dict_whitelist – A dict or a DictInterface of gazetteers to be included into the final results.
dict_blacklist – A set or a DictInterface of badcases to be excluded from the final results.
**kwargs –
- build_dataloader(data, transform: Optional[Callable] = None, training=False, device=None, logger: Optional[logging.Logger] = None, cache=False, gradient_accumulation=1, **kwargs) torch.utils.data.dataloader.DataLoader
Build a dataloader for training or evaluation.
- Parameters
data – Either a path or a list of samples.
transform – The transform from MTL, which is usually [TransformerSequenceTokenizer, FieldLength('token')].
training – Whether this method is called on training set.
device – The device dataloader is intended to work with.
logger – Logger for printing message indicating progress.
cache – Whether the dataloader should be cached.
gradient_accumulation – Gradient accumulation to be passed to sampler builder.
**kwargs – Additional experimental arguments.
- build_metric(**kwargs)
Implement this to build metric(s).
- Parameters
**kwargs – The subclass decides the method signature.
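As an illustration of what a subclass might return from build_metric, here is a minimal sketch of a span-level F1 metric, a common choice for NER evaluation. The `SpanF1` class and its span layout are hypothetical and are not taken from HanLP; the metric class HanLP actually uses for this task may differ.

```python
from typing import Iterable, Tuple

# Hypothetical span layout: (start token, end token exclusive, entity type)
Span = Tuple[int, int, str]


class SpanF1:
    """Accumulates exact-match span counts across batches."""

    def __init__(self) -> None:
        self.n_pred = 0
        self.n_gold = 0
        self.n_correct = 0

    def update(self, pred: Iterable[Span], gold: Iterable[Span]) -> None:
        pred_set, gold_set = set(pred), set(gold)
        self.n_pred += len(pred_set)
        self.n_gold += len(gold_set)
        # A span counts as correct only if boundaries and type both match.
        self.n_correct += len(pred_set & gold_set)

    @property
    def precision(self) -> float:
        return self.n_correct / self.n_pred if self.n_pred else 0.0

    @property
    def recall(self) -> float:
        return self.n_correct / self.n_gold if self.n_gold else 0.0

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if p + r else 0.0


def build_metric(**kwargs) -> SpanF1:
    # Sketch of an override: ignore extra arguments and return a fresh metric.
    return SpanF1()


metric = build_metric()
metric.update(pred=[(0, 2, "PER"), (3, 4, "LOC")], gold=[(0, 2, "PER"), (5, 6, "ORG")])
print(metric.precision, metric.recall, metric.f1)
```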