pos
Part-of-speech tagging.
- class hanlp.components.mtl.tasks.pos.TransformerTagging(trn: Optional[str] = None, dev: Optional[str] = None, tst: Optional[str] = None, sampler_builder: Optional[hanlp.common.dataset.SamplerBuilder] = None, dependencies: Optional[str] = None, scalar_mix: Optional[hanlp.layers.scalar_mix.ScalarMixWithDropoutBuilder] = None, use_raw_hidden_states=False, lr=0.001, separate_optimizer=False, cls_is_bos=False, sep_is_eos=False, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False, crf=False, token_key='token', dict_tags: Optional[Union[hanlp_trie.dictionary.DictInterface, Dict[Union[str, Sequence[str]], Union[str, Sequence[str]]]]] = None, **kwargs)
A simple tagger using a linear layer with an optional CRF (Lafferty et al. 2001) layer for any tagging task, including PoS tagging and many others. It also supports a custom dictionary, dict_tags, which performs longest-prefix-matching and overrides the predicted tags of matched tokens with the given tags.
Note
For algorithm beginners, understanding longest-prefix-matching is a prerequisite to understanding what a dictionary can and cannot do. The tutorial in this book can be very helpful.
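To make the idea of longest-prefix-matching concrete, here is a minimal pure-Python sketch of how a dictionary could override predicted tags. It is an illustration only and not HanLP's implementation: the function name override_tags is hypothetical, a plain dict lookup stands in for a trie, and the requirement that a multi-token key carries one tag per token is an assumption of this sketch.

```python
from typing import Dict, List, Sequence, Tuple, Union

Key = Union[str, Tuple[str, ...]]
Tags = Union[str, Sequence[str]]


def override_tags(tokens: List[str], predicted: List[str],
                  dict_tags: Dict[Key, Tags]) -> List[str]:
    """Hypothetical helper: override predicted tags by longest-prefix-matching."""
    # Normalize keys to tuples of tokens and values to tuples of tags.
    entries = {}
    for key, tags in dict_tags.items():
        key = (key,) if isinstance(key, str) else tuple(key)
        tags = (tags,) if isinstance(tags, str) else tuple(tags)
        if len(key) != len(tags):
            # Assumption of this sketch: one tag per token in the key.
            raise ValueError(f'{key} needs {len(key)} tags, got {len(tags)}')
        entries[key] = tags

    max_len = max((len(k) for k in entries), default=0)
    result = list(predicted)
    i = 0
    while i < len(tokens):
        # Try the longest candidate starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            tags = entries.get(tuple(tokens[i:i + n]))
            if tags is not None:
                result[i:i + n] = tags
                i += n  # Skip past the matched span.
                break
        else:
            i += 1  # No entry starts here; keep the predicted tag.
    return result


custom = {('New', 'York'): ('NNP', 'NNP'), 'New': 'ADJ'}
print(override_tags(['New', 'York', 'is', 'big'],
                    ['JJ', 'NNP', 'VBZ', 'JJ'], custom))
# ['NNP', 'NNP', 'VBZ', 'JJ']
```

Because the longest key wins, the two-token entry takes precedence over the single-token entry for New whenever York follows it. Tokens that match no entry keep the tag predicted by the model, which is why a dictionary can correct known phrases but cannot generalize to unseen ones.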
- Parameters
trn – Path to training set.
dev – Path to dev set.
tst – Path to test set.
sampler_builder – A builder which builds a sampler.
dependencies – Its dependencies on other tasks.
scalar_mix – A builder which builds a ScalarMixWithDropout object.
use_raw_hidden_states – Whether to use raw hidden states from transformer without any pooling.
lr – Learning rate for this task.
separate_optimizer – Use customized separate optimizer for this task.
cls_is_bos – True to treat the first token as BOS.
sep_is_eos – True to treat the last token as EOS.
max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.
sent_delimiter – Delimiter between sentences, like period or comma, which indicates a long sentence can be split here.
char_level – Whether the sequence length is measured at char level, which is never the case for lemmatization.
hard_constraint – Whether to enforce hard length constraint on sentences. If there is no sent_delimiter in a sentence, it will be split at a token anyway.
crf – True to enable CRF (Lafferty et al. 2001).
token_key – The key to tokens in dataset. This should always be set to token in MTL.
dict_tags – A custom dictionary to override predicted tags by performing longest-prefix-matching.
**kwargs – Not used.
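The interaction between max_seq_len, sent_delimiter, hard_constraint and char_level can be illustrated with a small pure-Python sketch. It only mirrors the behavior described in the parameter list above and is not HanLP's actual splitting routine: the function name split_long_sentence is hypothetical, the delimiter is assumed to be a single token, and pieces are not merged back together after splitting.

```python
from typing import List, Optional


def split_long_sentence(tokens: List[str],
                        max_seq_len: Optional[int] = None,
                        sent_delimiter: Optional[str] = None,
                        hard_constraint: bool = False,
                        char_level: bool = False) -> List[List[str]]:
    """Hypothetical helper illustrating the splitting parameters."""

    def length(chunk: List[str]) -> int:
        # char_level measures characters; otherwise count tokens.
        return sum(len(t) for t in chunk) if char_level else len(chunk)

    # Short sentences, or no limit at all, are left untouched.
    if max_seq_len is None or length(tokens) <= max_seq_len:
        return [tokens]

    # Step 1: cut after every delimiter token, if one is configured.
    pieces, current = [], []
    for token in tokens:
        current.append(token)
        if sent_delimiter is not None and token == sent_delimiter:
            pieces.append(current)
            current = []
    if current:
        pieces.append(current)

    if not hard_constraint:
        # Soft mode: pieces may still exceed max_seq_len.
        return pieces

    # Step 2: hard mode cuts at a token boundary wherever needed.
    result = []
    for piece in pieces:
        chunk: List[str] = []
        for token in piece:
            if chunk and length(chunk + [token]) > max_seq_len:
                result.append(chunk)
                chunk = []
            chunk.append(token)
        if chunk:
            result.append(chunk)
    return result


sentence = ['a', 'b', ',', 'c', 'd', 'e', 'f', '.']
print(split_long_sentence(sentence, max_seq_len=3, sent_delimiter=','))
# [['a', 'b', ','], ['c', 'd', 'e', 'f', '.']]
print(split_long_sentence(sentence, max_seq_len=3, sent_delimiter=',',
                          hard_constraint=True))
# [['a', 'b', ','], ['c', 'd', 'e'], ['f', '.']]
```

In the first call the second piece still exceeds the limit because it contains no delimiter. Only with hard_constraint enabled is it cut at an arbitrary token boundary.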
- build_dataloader(data, transform: Optional[Callable] = None, training=False, device=None, logger: Optional[logging.Logger] = None, cache=False, gradient_accumulation=1, **kwargs) → torch.utils.data.dataloader.DataLoader
Build a dataloader for training or evaluation.
- Parameters
data – Either a path or a list of samples.
transform – The transform from MTL, which is usually [TransformerSequenceTokenizer, FieldLength('token')].
training – Whether this method is called on training set.
device – The device dataloader is intended to work with.
logger – Logger for printing progress messages.
cache – Whether the dataloader should be cached.
gradient_accumulation – Gradient accumulation to be passed to sampler builder.
**kwargs – Additional experimental arguments.
- build_metric(**kwargs)
Implement this to build metric(s).
- Parameters
**kwargs – The subclass decides the method signature.