lem
Lemmatization.
- class hanlp.components.mtl.tasks.lem.TransformerLemmatization(trn: Optional[str] = None, dev: Optional[str] = None, tst: Optional[str] = None, sampler_builder: Optional[hanlp.common.dataset.SamplerBuilder] = None, dependencies: Optional[str] = None, scalar_mix: Optional[hanlp.layers.scalar_mix.ScalarMixWithDropoutBuilder] = None, use_raw_hidden_states=False, lr=0.001, separate_optimizer=False, cls_is_bos=False, sep_is_eos=False, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False, token_key='token', **kwargs)
- Transition-based lemmatization (Kondratyuk & Straka 2019).
- Parameters
- trn – Path to training set. 
- dev – Path to dev set. 
- tst – Path to test set. 
- sampler_builder – A builder which builds a sampler. 
- dependencies – Its dependencies on other tasks. 
- scalar_mix – A builder which builds a ScalarMixWithDropout object. 
- use_raw_hidden_states – Whether to use the raw hidden states from the transformer without any pooling. 
- lr – Learning rate for this task. 
- separate_optimizer – Whether to use a separate, customized optimizer for this task. 
- cls_is_bos – True to treat the first token as BOS.
- sep_is_eos – True to treat the last token as EOS.
- max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.
- sent_delimiter – Delimiter between sentences, like period or comma, which indicates a long sentence can be split here. 
- char_level – Whether the sequence length is measured at char level, which is never the case for lemmatization. 
- hard_constraint – Whether to enforce a hard length constraint on sentences. If there is no sent_delimiter in a sentence, it will be split at a token anyway.
- token_key – The key to tokens in the dataset. This should always be set to token in MTL.
- **kwargs – Not used. 
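The interaction of max_seq_len, sent_delimiter and hard_constraint described above can be sketched in plain Python. This is only an illustration of the documented behaviour, not HanLP's implementation: the helper name split_long_sentence and its exact splitting policy (cut after the last delimiter that fits in the window) are assumptions.

```python
from typing import List, Optional


def split_long_sentence(tokens: List[str],
                        max_seq_len: Optional[int],
                        sent_delimiter: Optional[str] = None,
                        hard_constraint: bool = False) -> List[List[str]]:
    """Hypothetical helper: split a tokenized sentence into shorter pieces."""
    if max_seq_len is None or len(tokens) <= max_seq_len:
        return [tokens]
    if max_seq_len < 1:
        raise ValueError('max_seq_len must be a positive integer')
    pieces = []
    start = 0
    while len(tokens) - start > max_seq_len:
        window_end = start + max_seq_len
        cut = None
        if sent_delimiter is not None:
            # Prefer the last delimiter that still fits inside the window.
            for i in range(window_end - 1, start - 1, -1):
                if tokens[i] == sent_delimiter:
                    cut = i + 1
                    break
        if cut is None:
            if not hard_constraint:
                # No delimiter to split at: leave the remainder over-long.
                break
            # Hard constraint: split at a token boundary anyway.
            cut = window_end
        pieces.append(tokens[start:cut])
        start = cut
    pieces.append(tokens[start:])
    return pieces


tokens = ['a', 'b', ',', 'c', 'd', 'e', '.', 'f']
soft = split_long_sentence(tokens, max_seq_len=4, sent_delimiter=',')
hard = split_long_sentence(tokens, max_seq_len=4, sent_delimiter=',',
                           hard_constraint=True)
```

In this sketch, the soft variant leaves the second piece longer than max_seq_len because it contains no delimiter, while the hard variant cuts it at a token boundary.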
 
 - build_dataloader(data: List[List[str]], transform: Optional[Callable] = None, training=False, device=None, logger: Optional[logging.Logger] = None, cache=False, gradient_accumulation=1, **kwargs) -> torch.utils.data.dataloader.DataLoader
- Build a dataloader for training or evaluation.
- Parameters
- data – Either a path or a list of samples. 
- transform – The transform from MTL, which is usually [TransformerSequenceTokenizer, FieldLength('token')]. 
- training – Whether this method is called on training set. 
- device – The device the dataloader is intended to work with. 
- logger – Logger for printing messages indicating progress. 
- cache – Whether the dataloader should be cached. 
- gradient_accumulation – Gradient accumulation to be passed to sampler builder. 
- **kwargs – Additional experimental arguments. 
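To make the shape of data concrete, the sketch below batches a List[List[str]] of tokenized sentences by length, which is one thing a sampler might plausibly do. It uses only the standard library; the function batches_by_length is hypothetical and is not part of HanLP or of any SamplerBuilder API.

```python
from typing import Iterator, List


def batches_by_length(data: List[List[str]],
                      batch_size: int) -> Iterator[List[List[str]]]:
    """Hypothetical sampler: group sentences of similar length together."""
    if batch_size < 1:
        raise ValueError('batch_size must be a positive integer')
    # Sort sample indices by sentence length so each batch needs little padding.
    order = sorted(range(len(data)), key=lambda i: len(data[i]))
    for start in range(0, len(order), batch_size):
        yield [data[i] for i in order[start:start + batch_size]]


# Each sample is one sentence, given as a list of tokens.
data = [['a', 'b', 'c'], ['d'], ['e', 'f'], ['g', 'h', 'i', 'j']]
batches = list(batches_by_length(data, batch_size=2))
```

A real sampler would typically also shuffle batches during training; that step is omitted here to keep the sketch deterministic.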
 
 
 - build_metric(**kwargs)
- Implement this to build metric(s).
- Parameters
- **kwargs – The subclass decides the method signature. 
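As an illustration of what a build_metric implementation could return, here is a minimal sketch of a token-level accuracy metric for predicted lemmas. The TokenAccuracy class, its update method and its score property are assumptions made for this example; the page does not say which metric this task builds.

```python
from typing import List


class TokenAccuracy:
    """Hypothetical metric: fraction of tokens whose lemma is predicted exactly."""

    def __init__(self) -> None:
        self.correct = 0
        self.total = 0

    def update(self, predicted: List[str], gold: List[str]) -> None:
        if len(predicted) != len(gold):
            raise ValueError('predicted and gold must have the same length')
        self.correct += sum(p == g for p, g in zip(predicted, gold))
        self.total += len(gold)

    @property
    def score(self) -> float:
        # Report 0.0 before any tokens have been seen.
        return self.correct / self.total if self.total else 0.0


def build_metric(**kwargs) -> TokenAccuracy:
    # Standalone stand-in for the method above; kwargs are ignored here.
    return TokenAccuracy()


metric = build_metric()
metric.update(predicted=['be', 'cat', 'run'], gold=['be', 'cat', 'running'])
metric.update(predicted=['go'], gold=['go'])
```

Accumulating counts across calls to update lets the score be read at any point during evaluation.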
 
 
