ud

Universal Dependencies Parsing (lemmatization, features, PoS tagging and dependency parsing).

class hanlp.components.mtl.tasks.ud.UniversalDependenciesParsing(trn: Optional[str] = None, dev: Optional[str] = None, tst: Optional[str] = None, sampler_builder: Optional[hanlp.common.dataset.SamplerBuilder] = None, dependencies: Optional[str] = None, scalar_mix: Optional[hanlp.layers.scalar_mix.ScalarMixWithDropoutBuilder] = None, use_raw_hidden_states=False, lr=None, separate_optimizer=False, cls_is_bos=True, sep_is_eos=False, n_mlp_arc=768, n_mlp_rel=256, mlp_dropout=0.33, tree=False, proj=False, punct=False, max_seq_len=None, **kwargs)

Universal Dependencies Parsing (lemmatization, features, PoS tagging and dependency parsing) implementation of “75 Languages, 1 Model: Parsing Universal Dependencies Universally” (Kondratyuk & Straka 2019).

Parameters

trn – Path to training set.
dev – Path to dev set.
tst – Path to test set.
sampler_builder – A builder which builds a sampler.
dependencies – The other tasks that this task depends on.
scalar_mix – A builder which builds a ScalarMixWithDropout object.
use_raw_hidden_states – Whether to use raw hidden states from transformer without any pooling.
lr – Learning rate for this task.
separate_optimizer – Use customized separate optimizer for this task.
cls_is_bos – True to treat the first token as BOS.
sep_is_eos – True to treat the last token as EOS.
n_mlp_arc – Number of features for arc representation.
n_mlp_rel – Number of features for rel representation.
mlp_dropout – Dropout applied to MLPs.
tree – True to enforce tree constraint.
proj – True for projective parsing.
punct – True to include punctuation in evaluation.
max_seq_len – Prune samples longer than this length. Useful for reducing GPU consumption.
**kwargs – Not used.
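
The effect of punct on evaluation can be illustrated with a minimal sketch. This is not HanLP's implementation: the attachment_scores helper is hypothetical, and the rule that a token counts as punctuation when its UPOS tag is PUNCT is an assumption made for illustration.

```python
from typing import List, Tuple


def attachment_scores(
    gold: List[Tuple[int, str]],
    pred: List[Tuple[int, str]],
    upos: List[str],
    punct: bool = False,
) -> Tuple[float, float]:
    """Return (UAS, LAS) for one sentence.

    Hypothetical helper: each token is a (head, relation) pair, and a token
    counts as punctuation when its UPOS tag is "PUNCT" (an assumption).
    """
    total = correct_heads = correct_labeled = 0
    for (g_head, g_rel), (p_head, p_rel), tag in zip(gold, pred, upos):
        if not punct and tag == "PUNCT":
            continue  # punctuation is skipped unless punct=True
        total += 1
        if g_head == p_head:
            correct_heads += 1
            # A labeled match requires both the head and the relation.
            if g_rel == p_rel:
                correct_labeled += 1
    if total == 0:
        return 0.0, 0.0
    return correct_heads / total, correct_labeled / total


# "Dogs bark ." where only the head of the final period is predicted wrongly.
gold = [(2, "nsubj"), (0, "root"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (1, "punct")]
upos = ["NOUN", "VERB", "PUNCT"]

without_punct = attachment_scores(gold, pred, upos, punct=False)
with_punct = attachment_scores(gold, pred, upos, punct=True)
print(without_punct, with_punct)
```

In this toy sentence the only error falls on the punctuation token, so the scores are perfect when punctuation is excluded and drop to two correct tokens out of three when it is included.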

build_dataloader(data, transform: Optional[Callable] = None, training=False, device=None, logger: Optional[logging.Logger] = None, cache=False, gradient_accumulation=1, **kwargs) → torch.utils.data.dataloader.DataLoader

Build a dataloader for training or evaluation.

Parameters

data – Either a path or a list of samples.
transform – The transform from MTL, which is usually [TransformerSequenceTokenizer, FieldLength('token')].
training – Whether this method is called on training set.
device – The device dataloader is intended to work with.
logger – Logger for printing progress messages.
cache – Whether the dataloader should be cached.
gradient_accumulation – Gradient accumulation to be passed to sampler builder.
**kwargs – Additional experimental arguments.

build_metric(**kwargs)

Implement this to build metric(s).

Parameters: **kwargs – The subclass decides the method signature.

build_model(encoder_size, n_mlp_arc, n_mlp_rel, mlp_dropout, training=True, **kwargs) → torch.nn.modules.module.Module

Build model.

Parameters

training – True if called during training.
**kwargs – The task configuration, unpacked as **self.config.

input_is_flat(data) → bool

Check whether the data is flat (meaning that it’s only a single sample, not even batched).

Returns: True to indicate the input data is flat.
Return type: bool
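
The idea of a flat input can be sketched as follows. This is an illustration only, not the library's code: the helper names looks_flat and to_batch are hypothetical, and the sketch assumes a sample is a list of token strings while a batch is a list of such lists.

```python
from typing import List, Union

Sample = List[str]


def looks_flat(data: Union[Sample, List[Sample]]) -> bool:
    """Hypothetical check: True when data is a single tokenized sentence."""
    # A flat input is a list whose elements are tokens (strings),
    # whereas a batch is a list whose elements are themselves lists.
    return isinstance(data, list) and all(isinstance(tok, str) for tok in data)


def to_batch(data: Union[Sample, List[Sample]]) -> List[Sample]:
    """Wrap a flat sample into a batch of one so later code sees one shape."""
    return [data] if looks_flat(data) else data


single = ["Dogs", "bark", "."]
batch = [["Dogs", "bark", "."], ["Cats", "sleep", "."]]

print(looks_flat(single), looks_flat(batch))
print(to_batch(single))
```

A check of this kind lets a caller pass either one sentence or many and still receive results in the matching shape.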
