crf_constituency_parser

CRF constituency parser.

class hanlp.components.parsers.constituency.crf_constituency_parser.CRFConstituencyParser(**kwargs)[source]

Two-stage CRF Parsing (Zhang et al. 2020).

Parameters

**kwargs – Predefined config.
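Zhang et al. (2020) describe the two stages as bracketing followed by labeling. The following is a toy sketch of that idea, not HanLP's implementation: CKY picks the highest-scoring unlabeled binary tree from span scores, then each chosen span is labeled independently. The function names (cky_decode, label_spans) and all scores are hypothetical and hand-written for illustration.

```python
def cky_decode(span_scores, n):
    """Stage 1: find the best unlabeled binary tree over n tokens.

    span_scores maps (i, j) fencepost pairs to a score; missing spans score 0.
    Returns the total score of the best tree and its sorted list of spans.
    """
    best, split = {}, {}
    for width in range(1, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            score = span_scores.get((i, j), 0.0)
            if width == 1:
                best[i, j] = score
                continue
            # Choose the split point that maximizes the two sub-tree scores.
            k = max(range(i + 1, j), key=lambda m: best[i, m] + best[m, j])
            split[i, j] = k
            best[i, j] = score + best[i, k] + best[k, j]
    # Backtrack from the root span to collect every span in the tree.
    spans, stack = [], [(0, n)]
    while stack:
        i, j = stack.pop()
        spans.append((i, j))
        if j - i > 1:
            k = split[i, j]
            stack.extend([(i, k), (k, j)])
    return best[0, n], sorted(spans)


def label_spans(spans, label_scores):
    """Stage 2: give each span its highest-scoring label, if it has any."""
    return [
        (i, j, max(label_scores[i, j], key=label_scores[i, j].get))
        for i, j in spans
        if (i, j) in label_scores
    ]


score, spans = cky_decode({(0, 2): 2.0, (1, 3): 1.0}, n=3)
tree = label_spans(spans, {(0, 3): {"S": 2.0, "NP": 0.1},
                           (0, 2): {"NP": 1.5, "VP": 0.2}})
```

Spans (0, 2) and (1, 3) cross, so only one of them can appear in a tree; the decoder keeps the higher-scoring one.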

build_criterion(decoder=None, **kwargs)[source]

Implement this method to build criterion (loss function).

Parameters

**kwargs – The subclass decides the method signature.

build_dataloader(data, batch_size, sampler_builder: hanlp.common.dataset.SamplerBuilder = None, gradient_accumulation=1, shuffle=False, device=None, logger: logging.Logger = None, **kwargs) → torch.utils.data.dataloader.DataLoader[source]

Build dataloader for training, dev and test sets. It’s suggested to build vocabs in this method if they are not built yet.

Parameters
  • data – Data representing samples, which can be a path or a list of samples.

  • batch_size – Number of samples per batch.

  • shuffle – Whether to shuffle this dataloader.

  • device – Device tensors should be loaded onto.

  • logger – Logger for reporting progress if building the dataloader takes a long time or if vocabs have to be built.

  • **kwargs – Arguments from **self.config.
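The sampler_builder argument is not described here. One common job for such a sampler, suggested by the batch_max_tokens argument of fit, is grouping samples under a token budget instead of a fixed sample count. The sketch below shows that idea under stated assumptions; batch_by_tokens is a hypothetical helper, not part of HanLP.

```python
def batch_by_tokens(lengths, batch_max_tokens):
    """Group sample indices so each batch holds at most batch_max_tokens tokens.

    lengths[i] is the token count of sample i. Samples are sorted by length
    first so that similar-length sentences share a batch. A single sample
    longer than the budget still gets a batch of its own.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current, current_tokens = [], [], 0
    for idx in order:
        n = lengths[idx]
        # Close the current batch if adding this sample would exceed the budget.
        if current and current_tokens + n > batch_max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


batches = batch_by_tokens([5, 3, 8, 2], batch_max_tokens=10)
```

This sketch counts raw tokens only; a real sampler may also account for padding.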

build_metric(**kwargs)[source]

Implement this to build metric(s).

Parameters

**kwargs – The subclass decides the method signature.

build_model(encoder, training=True, **kwargs) → torch.nn.modules.module.Module[source]

Build model.

Parameters
  • training – True if called during training.

  • **kwargs – **self.config.

build_optimizer(trn, **kwargs)[source]

Implement this method to build an optimizer.

Parameters

**kwargs – The subclass decides the method signature.

build_vocabs(trn, logger, **kwargs)[source]

Override this method to build vocabs.

Parameters
  • trn – Training set.

  • logger – Logger for reporting progress.

evaluate_dataloader(data, criterion, logger=None, ratio_width=None, metric=None, output=None, **kwargs)[source]

Evaluate on a dataloader.

Parameters
  • data – A dataloader, which can be built from any data source.

  • criterion – Loss function.

  • metric – Metric(s).

  • output – Whether to save outputs into some file.

  • **kwargs – Not used.
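Constituency parsers are typically evaluated with labeled span precision, recall and F1. The sketch below shows that computation on spans given as (start, end, label) triples; span_prf is a hypothetical helper and is not the SpanMetric class referenced elsewhere on this page.

```python
def span_prf(gold, pred):
    """Return labeled precision, recall and F1 over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(0, 3, "S"), (0, 1, "NP"), (1, 3, "VP")]
pred = [(0, 3, "S"), (0, 1, "NP"), (1, 2, "VP")]
precision, recall, f1 = span_prf(gold, pred)
```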

execute_training_loop(trn: torch.utils.data.dataloader.DataLoader, dev: torch.utils.data.dataloader.DataLoader, epochs, criterion, optimizer, metric, save_dir, logger: logging.Logger, devices, ratio_width=None, patience=0.5, eval_trn=True, **kwargs)[source]

Implement this to run training loop.

Parameters
  • trn – Training set.

  • dev – Development set.

  • epochs – Number of epochs.

  • criterion – Loss function.

  • optimizer – Optimizer(s).

  • metric – Metric(s).

  • save_dir – The directory to save this component.

  • logger – Logger for reporting progress.

  • devices – Devices this component and dataloader will live on.

  • ratio_width – The width of dataset size measured in number of characters. Used for logger to align messages.

  • **kwargs – Other hyper-parameters passed from sub-class.
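The patience argument is not documented above. The sketch below assumes a common convention: a float patience is read as a fraction of epochs, and training stops once the dev score has failed to improve for that many consecutive epochs. Both helpers are hypothetical and operate on a plain list of dev scores instead of a real model.

```python
def resolve_patience(patience, epochs):
    """Assumption: a float patience is a fraction of the total epoch count."""
    if isinstance(patience, float):
        return max(1, int(patience * epochs))
    return patience


def run_epochs(dev_scores, patience):
    """Walk through per-epoch dev scores and stop early when progress stalls.

    Returns (best_epoch, best_score, last_epoch_run), with epochs counted from 1.
    """
    best_score, best_epoch, stalled, epoch = float("-inf"), 0, 0, 0
    for epoch, score in enumerate(dev_scores, start=1):
        if score > best_score:
            best_score, best_epoch, stalled = score, epoch, 0
        else:
            stalled += 1
            if stalled >= patience:
                break  # No improvement for `patience` epochs in a row.
    return best_epoch, best_score, epoch


limit = resolve_patience(0.5, epochs=30)
result = run_epochs([0.80, 0.85, 0.84, 0.83, 0.90], patience=2)
```

With a patience of 2, the run above stops after epoch 4 and never sees the better score at epoch 5, which is the usual trade-off of early stopping.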

fit(trn_data, dev_data, save_dir, encoder, lr=5e-05, transformer_lr=None, adam_epsilon=1e-08, weight_decay=0, warmup_steps=0.1, grad_norm=1.0, n_mlp_span=500, n_mlp_label=100, mlp_dropout=0.33, batch_size=None, batch_max_tokens=5000, gradient_accumulation=1, epochs=30, patience=0.5, mbr=True, sampler_builder=None, delete=('', ':', '``', "''", '.', '?', '!', '-NONE-', 'TOP', ',', 'S1'), equal=(('ADVP', 'PRT')), no_subcategory=True, eval_trn=True, transform=None, devices=None, logger=None, seed=None, **kwargs)[source]

Fit to data, which triggers the training procedure. The training set and dev set shall be local or remote files.

Parameters
  • trn_data – Training set.

  • dev_data – Development set.

  • save_dir – The directory to save trained component.

  • batch_size – The number of samples in a batch.

  • epochs – Number of epochs.

  • devices – Devices this component will live on.

  • logger – Any logging.Logger instance.

  • seed – Random seed to reproduce this training.

  • finetune – True to load from save_dir instead of creating a randomly initialized component. str to specify a different save_dir to load from.

  • eval_trn – Evaluate training set after each update. This can slow down the training but provides a quick diagnostic for debugging.

  • _device_placeholder – True to create a placeholder tensor which triggers PyTorch to occupy devices so other components won’t take these devices as first choices.

  • **kwargs – Hyperparameters used by sub-classes.

Returns

Any results sub-classes would like to return. Usually the best metrics on training set.
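A call to fit might be assembled as in the sketch below. The keyword values simply repeat the defaults from the signature above; the file paths, the save directory and the commented-out call are hypothetical placeholders, and the encoder argument is left out because this page does not describe it.

```python
# Hyperparameters copied from the defaults in the fit signature.
fit_kwargs = dict(
    lr=5e-05,
    adam_epsilon=1e-08,
    weight_decay=0,
    warmup_steps=0.1,
    grad_norm=1.0,
    n_mlp_span=500,
    n_mlp_label=100,
    mlp_dropout=0.33,
    batch_max_tokens=5000,
    gradient_accumulation=1,
    epochs=30,
    patience=0.5,
    mbr=True,
    no_subcategory=True,
    eval_trn=True,
)

# Hypothetical locations; replace with real local or remote files.
trn_data = "data/train.txt"
dev_data = "data/dev.txt"
save_dir = "data/model/crf_constituency"

# With HanLP installed and an encoder constructed, the call would look like:
# parser = CRFConstituencyParser()
# parser.fit(trn_data, dev_data, save_dir, encoder, **fit_kwargs)
```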

fit_dataloader(trn: torch.utils.data.dataloader.DataLoader, criterion, optimizer, metric: hanlp.metrics.parsing.span.SpanMetric, logger: logging.Logger, history: hanlp.common.structure.History, gradient_accumulation=1, grad_norm=None, ratio_width=None, eval_trn=True, **kwargs)[source]

Fit onto a dataloader.

Parameters
  • trn – Training set.

  • criterion – Loss function.

  • optimizer – Optimizer.

  • metric – Metric(s).

  • logger – Logger for reporting progress.

  • **kwargs – Other hyper-parameters passed from sub-class.
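The gradient_accumulation and grad_norm arguments are not described above. The sketch below assumes their usual meaning: gradients from several mini-batches are averaged before one optimizer update, and the accumulated gradient is rescaled when its L2 norm exceeds grad_norm. It uses plain Python lists in place of tensors, and both helpers are hypothetical.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm (None disables clipping)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if max_norm is None or norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def train_steps(batch_grads, gradient_accumulation, grad_norm, lr):
    """Apply SGD updates, one per `gradient_accumulation` mini-batches.

    batch_grads holds one gradient vector per mini-batch. Trailing mini-batches
    that do not complete an accumulation window are ignored in this sketch.
    """
    weights = [0.0] * len(batch_grads[0])
    accumulated = [0.0] * len(weights)
    updates = 0
    for step, grads in enumerate(batch_grads, start=1):
        # Average over the window by dividing each contribution up front.
        accumulated = [a + g / gradient_accumulation
                       for a, g in zip(accumulated, grads)]
        if step % gradient_accumulation == 0:
            clipped = clip_by_norm(accumulated, grad_norm)
            weights = [w - lr * g for w, g in zip(weights, clipped)]
            accumulated = [0.0] * len(weights)
            updates += 1
    return weights, updates


weights, updates = train_steps(
    [[3.0, 4.0], [3.0, 4.0]], gradient_accumulation=2, grad_norm=1.0, lr=0.1
)
```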

predict(data: Union[str, List[str]], batch_size: int = None, **kwargs)[source]

Predict on data fed by the user. Users should avoid calling this method directly, since it is not guarded with torch.no_grad and will introduce unnecessary gradient computation. Use __call__ instead.

Parameters
  • data – Sentences or tokens.

  • batch_size – Decoding batch size.

  • **kwargs – Used in sub-classes.