ud_parser¶
Universal Dependencies Parsing (lemmatization, features, PoS tagging and dependency parsing).
class hanlp.components.parsers.ud.ud_parser.UniversalDependenciesParser(**kwargs)[source]¶
Universal Dependencies Parsing (lemmatization, features, PoS tagging and dependency parsing) implementation of “75 Languages, 1 Model: Parsing Universal Dependencies Universally” (Kondratyuk & Straka 2019).
- Parameters
**kwargs – Predefined config.
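A minimal usage sketch (the model path below is a placeholder; a trained component is usually restored via hanlp.load):

```python
import hanlp

# 'path/to/ud_parser' is a placeholder for a save_dir or a pretrained
# model identifier; hanlp.load restores the component from it.
parser = hanlp.load('path/to/ud_parser')

# __call__ runs prediction under torch.no_grad (see predict below).
doc = parser([['The', 'dog', 'barks', '.']])
```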
build_criterion(**kwargs)[source]¶
Implement this method to build the criterion (loss function).
- Parameters
**kwargs – The subclass decides the method signature.
build_dataloader(data, batch_size, shuffle=False, device=None, logger: logging.Logger = None, sampler_builder=None, gradient_accumulation=1, transformer: hanlp.layers.embeddings.contextual_word_embedding.ContextualWordEmbedding = None, **kwargs) → torch.utils.data.dataloader.DataLoader[source]¶
Build a dataloader for the training, dev and test sets. It’s suggested to build vocabs in this method if they are not built yet.
- Parameters
data – Data representing samples, which can be a path or a list of samples.
batch_size – Number of samples per batch.
shuffle – Whether to shuffle this dataloader.
device – Device tensors should be loaded onto.
logger – Logger for reporting messages if the dataloader takes a long time or if vocabs have to be built.
**kwargs – Arguments from **self.config.
build_metric(**kwargs)[source]¶
Implement this to build metric(s).
- Parameters
**kwargs – The subclass decides the method signature.
build_model(transformer: hanlp.layers.embeddings.contextual_word_embedding.ContextualWordEmbedding, n_mlp_arc, n_mlp_rel, mlp_dropout, mix_embedding, layer_dropout, training=True, **kwargs) → torch.nn.modules.module.Module[source]¶
Build model.
- Parameters
training – True if called during training.
**kwargs – **self.config.
build_optimizer(trn, **kwargs)[source]¶
Implement this method to build an optimizer.
- Parameters
**kwargs – The subclass decides the method signature.
build_vocabs(trn, logger, **kwargs)[source]¶
Override this method to build vocabs.
- Parameters
trn – Training set.
logger – Logger for reporting progress.
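For reference, an override might follow the pattern below. This is an illustrative sketch, not the parser’s actual implementation: the 'deprel' field and 'rel' vocab name are hypothetical.

```python
from hanlp.common.vocab import Vocab

def build_vocabs(self, trn, logger, **kwargs):
    # Hypothetical sketch: collect a dependency-relation vocabulary
    # from the training set; 'deprel' is an illustrative field name.
    rel = Vocab(pad_token=None, unk_token=None)
    for sample in trn:
        for r in sample['deprel']:
            rel.add(r)
    rel.lock()  # freeze the vocab before training starts
    logger.info(f'rel vocab size: {len(rel)}')
    self.vocabs['rel'] = rel
```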
evaluate_dataloader(data: torch.utils.data.dataloader.DataLoader, criterion: Callable, metric: hanlp.metrics.mtl.MetricDict = None, output=False, logger=None, ratio_width=None, **kwargs)[source]¶
Evaluate on a dataloader.
- Parameters
data – Dataloader which can be built from any data source.
criterion – Loss function.
metric – Metric(s).
output – Whether to save outputs to a file.
**kwargs – Not used.
execute_training_loop(trn: torch.utils.data.dataloader.DataLoader, dev: torch.utils.data.dataloader.DataLoader, epochs, criterion, optimizer, metric, save_dir, logger: logging.Logger, devices, ratio_width=None, patience=0.5, eval_trn=True, **kwargs)[source]¶
Implement this to run the training loop.
- Parameters
trn – Training set.
dev – Development set.
epochs – Number of epochs.
criterion – Loss function.
optimizer – Optimizer(s).
metric – Metric(s).
save_dir – The directory to save this component to.
logger – Logger for reporting progress.
devices – Devices this component and dataloader will live on.
ratio_width – The width of the dataset size measured in number of characters; used by the logger to align messages.
**kwargs – Other hyper-parameters passed from the subclass.
fit(trn_data, dev_data, save_dir, transformer: hanlp.layers.embeddings.contextual_word_embedding.ContextualWordEmbedding, sampler_builder=None, mix_embedding: int = 13, layer_dropout: int = 0.1, n_mlp_arc=768, n_mlp_rel=256, mlp_dropout=0.33, lr=0.001, transformer_lr=2.5e-05, patience=0.1, batch_size=32, epochs=30, gradient_accumulation=1, adam_epsilon=1e-08, weight_decay=0, warmup_steps=0.1, grad_norm=1.0, tree=False, proj=False, punct=False, logger=None, verbose=True, devices: Union[float, int, List[int]] = None, **kwargs)[source]¶
Fit to data; triggers the training procedure. The training set and dev set shall be local or remote files.
- Parameters
trn_data – Training set.
dev_data – Development set.
save_dir – The directory to save the trained component to.
batch_size – The number of samples in a batch.
epochs – Number of epochs.
devices – Devices this component will live on.
logger – Any logging.Logger instance.
seed – Random seed to reproduce this training.
finetune – True to load from save_dir instead of creating a randomly initialized component, or a str to specify a different save_dir to load from.
eval_trn – Evaluate the training set after each update. This can slow down training but provides a quick diagnostic for debugging.
_device_placeholder – True to create a placeholder tensor which triggers PyTorch to occupy devices so other components won’t take these devices as first choices.
**kwargs – Hyperparameters used by sub-classes.
- Returns
Any results sub-classes would like to return. Usually the best metrics on the training set.
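A hedged training sketch, assuming CoNLL-U files on disk; the file paths and the transformer name are placeholders rather than prescribed values:

```python
from hanlp.components.parsers.ud.ud_parser import UniversalDependenciesParser
from hanlp.layers.embeddings.contextual_word_embedding import ContextualWordEmbedding
from hanlp.common.dataset import SortingSamplerBuilder

parser = UniversalDependenciesParser()
parser.fit(
    trn_data='data/ud/en_ewt-ud-train.conllu',  # placeholder paths
    dev_data='data/ud/en_ewt-ud-dev.conllu',
    save_dir='data/model/ud_parser',
    transformer=ContextualWordEmbedding(
        'token', 'xlm-roberta-base',  # any Hugging Face transformer
        average_subwords=True,
        max_sequence_length=512,
    ),
    sampler_builder=SortingSamplerBuilder(batch_size=32),
    epochs=30,
)
```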
fit_dataloader(trn: torch.utils.data.dataloader.DataLoader, criterion, optimizer, metric: hanlp.metrics.mtl.MetricDict, logger: logging.Logger, history: hanlp.common.structure.History, gradient_accumulation=1, grad_norm=None, ratio_width=None, eval_trn=True, **kwargs)[source]¶
Fit on a dataloader.
- Parameters
trn – Training set.
criterion – Loss function.
optimizer – Optimizer.
metric – Metric(s).
logger – Logger for reporting progress.
**kwargs – Other hyper-parameters passed from the subclass.
predict(data: Union[List[str], List[List[str]]], batch_size: int = None, **kwargs)[source]¶
Predict on data fed by the user. Users should avoid calling this method directly, since it is not guarded with torch.no_grad and will introduce unnecessary gradient computation. Use __call__ instead.
- Parameters
data – Sentences or tokens.
batch_size – Decoding batch size.
**kwargs – Used in sub-classes.
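In practice, prefer calling the component itself. A short sketch, reusing the parser loaded in the class-level example above (the sample tokens are illustrative):

```python
# Preferred: calling the component wraps predict in torch.no_grad.
tokens = [['From', 'the', 'AP', 'comes', 'this', 'story', ':']]
doc = parser(tokens)

# parser.predict(tokens) computes the same result but tracks
# gradients needlessly, so reserve it for subclass internals.
```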