CRFConstituencyParsing(trn: str = None, dev: str = None, tst: str = None, sampler_builder: hanlp.common.dataset.SamplerBuilder = None, dependencies: str = None, scalar_mix: hanlp.layers.scalar_mix.ScalarMixWithDropoutBuilder = None, use_raw_hidden_states=False, lr=None, separate_optimizer=False, cls_is_bos=True, sep_is_eos=True, delete=('', ':', '``', "''", '.', '?', '!', '-NONE-', 'TOP', ',', 'S1'), equal=(('ADVP', 'PRT')), mbr=True, n_mlp_span=500, n_mlp_label=100, mlp_dropout=0.33, no_subcategory=True, **kwargs)
Two-stage CRF Parsing (Zhang et al. 2020).
trn – Path to training set.
dev – Path to dev set.
tst – Path to test set.
sampler_builder – A builder which builds a sampler.
dependencies – Its dependencies on other tasks.
scalar_mix – A builder which builds a ScalarMixWithDropout object.
use_raw_hidden_states – Whether to use raw hidden states from transformer without any pooling.
lr – Learning rate for this task.
separate_optimizer – Use customized separate optimizer for this task.
cls_is_bos – True to treat the first token as BOS.
sep_is_eos – True to treat the last token as EOS.
delete – Constituencies to be deleted from training and evaluation.
equal – Constituencies that are regarded as equal during evaluation.
mbr – True to enable Minimum Bayes Risk (MBR) decoding (Smith & Smith 2007).
n_mlp_span – Number of features for span decoder.
n_mlp_label – Number of features for label decoder.
mlp_dropout – Dropout applied to MLPs.
no_subcategory – Strip out subcategories.
**kwargs – Not used.
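The effect of delete, equal and no_subcategory on evaluation can be illustrated with a small, self-contained sketch. This is not HanLP's implementation: the helpers normalize_label, prepare_spans and labeled_f1 are hypothetical names, spans are assumed to be (start, end, label) triples, and mapping each equal group to its first member is an assumption made only for illustration.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label); an assumed representation

# Values taken from the constructor defaults documented above.
DELETE = ('', ':', '``', "''", '.', '?', '!', '-NONE-', 'TOP', ',', 'S1')
EQUAL = (('ADVP', 'PRT'),)


def normalize_label(label: str, no_subcategory: bool = True) -> str:
    """Strip a subcategory suffix, e.g. 'NP-SBJ' -> 'NP' (hypothetical helper)."""
    # Labels such as '-NONE-' start with a dash and are left untouched.
    if no_subcategory and not label.startswith('-'):
        label = label.split('-')[0]
    return label


def prepare_spans(spans: Iterable[Span], delete=DELETE, equal=EQUAL,
                  no_subcategory: bool = True) -> List[Span]:
    """Drop deleted labels and collapse labels that are regarded as equal."""
    # Assumption: every label in a group is mapped to the group's first member.
    canonical = {other: group[0] for group in equal for other in group[1:]}
    prepared = []
    for start, end, label in spans:
        label = normalize_label(label, no_subcategory)
        if label in delete:
            continue
        prepared.append((start, end, canonical.get(label, label)))
    return prepared


def labeled_f1(pred: Iterable[Span], gold: Iterable[Span]) -> float:
    """Labeled span F1 computed after both sides have been prepared."""
    pred_set, gold_set = set(prepare_spans(pred)), set(prepare_spans(gold))
    if not pred_set or not gold_set:
        return 0.0
    matched = len(pred_set & gold_set)
    precision, recall = matched / len(pred_set), matched / len(gold_set)
    return 0.0 if matched == 0 else 2 * precision * recall / (precision + recall)


gold = [(0, 5, 'TOP'), (0, 5, 'S'), (0, 2, 'NP-SBJ'), (2, 3, 'PRT'), (4, 5, '.')]
pred = [(0, 5, 'S'), (0, 2, 'NP'), (2, 3, 'ADVP')]
score = labeled_f1(pred, gold)
```

In this sketch the TOP and punctuation spans are ignored, NP-SBJ is scored as NP, and PRT counts as a match for ADVP, so the prediction is not penalized for differences the evaluation is configured to disregard.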
build_dataloader(data, transform: Callable = None, training=False, device=None, logger: logging.Logger = None, cache=False, gradient_accumulation=1, **kwargs) → torch.utils.data.dataloader.DataLoader
Build a dataloader for training or evaluation.
data – Either a path or a list of samples.
transform – The transform from MTL, which is usually [TransformerSequenceTokenizer, FieldLength('token')].
training – Whether this method is called on training set.
device – The device dataloader is intended to work with.
logger – Logger for printing message indicating progress.
cache – Whether the dataloader should be cached.
gradient_accumulation – Gradient accumulation to be passed to sampler builder.
**kwargs – Additional experimental arguments.
Implement this to build metric(s).
**kwargs – The subclass decides the method signature.
build_model(encoder_size, training=True, **kwargs) → torch.nn.modules.module.Module
training – True if called during training.
build_samples(inputs, cls_is_bos=False, sep_is_eos=False)
Build samples for this task. Called when this task is the first task. Default behaviour is to take inputs as list of tokens and put these tokens into a dict per sample.
inputs – Inputs from users, usually a list of lists of tokens.
cls_is_bos – Insert BOS to the head of each sentence.
sep_is_eos – Append EOS to the tail of each sentence.
Returns a list of samples.
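The default behaviour described for build_samples can be sketched as follows. This is an illustration under stated assumptions, not the library's code: build_samples_sketch is a hypothetical name, the dict key 'token' is assumed from the FieldLength('token') transform mentioned above, and the BOS and EOS strings are placeholders.

```python
from typing import Dict, List

# Placeholder marker strings; the actual constants used by the library may differ.
BOS = '<bos>'
EOS = '<eos>'


def build_samples_sketch(inputs: List[List[str]],
                         cls_is_bos: bool = False,
                         sep_is_eos: bool = False) -> List[Dict[str, List[str]]]:
    """Wrap each tokenized sentence in a dict, optionally adding BOS and EOS."""
    samples = []
    for tokens in inputs:
        tokens = list(tokens)  # copy so the caller's list is not modified
        if cls_is_bos:
            tokens = [BOS] + tokens  # insert BOS at the head of the sentence
        if sep_is_eos:
            tokens = tokens + [EOS]  # append EOS at the tail of the sentence
        samples.append({'token': tokens})  # 'token' is an assumed field name
    return samples


sentences = [['HanLP', 'parses', 'trees'], ['Hello']]
plain = build_samples_sketch(sentences)
wrapped = build_samples_sketch(sentences, cls_is_bos=True, sep_is_eos=True)
```

With both flags left at False each sample holds the tokens unchanged; with both set to True every sentence grows by two tokens, one at each end.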