eos

class hanlp.components.eos.ngram.NgramSentenceBoundaryDetector(**kwargs)[source]

A sentence boundary detector that uses n-grams as features and an LSTM as encoder (Schweter & Ahmed 2019). It predicts whether a punctuation mark denotes the end of a sentence (EOS).
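To make the idea of n-gram features concrete, the following is a minimal sketch of extracting a fixed-size character window around a candidate punctuation mark. The helper `char_window` and the `<pad>` symbol are hypothetical and not part of HanLP; the actual feature extraction in this component may differ.

```python
def char_window(text, index, window_size=5):
    """Return the characters to the left and right of text[index].

    Hypothetical helper for illustration only. Positions that fall outside
    the text are filled with a pad symbol so every window has the same length.
    """
    pad = "<pad>"
    left = [text[i] if i >= 0 else pad
            for i in range(index - window_size, index)]
    right = [text[i] if i < len(text) else pad
             for i in range(index + 1, index + 1 + window_size)]
    return left, right


text = "Hi. Bye"
left, right = char_window(text, text.index("."), window_size=3)
print(left)   # ['<pad>', 'H', 'i']
print(right)  # [' ', 'B', 'y']
```

A classifier would then decide from these two context windows whether the punctuation mark at `index` ends a sentence.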

Note

This component won’t work on text that lacks the punctuation marks defined in its config, so it is recommended to understand how it works before using it. The predefined punctuation marks can be listed with the following code (here eos refers to an instance of this component).

>>> print(eos.config.eos_chars)

Parameters

**kwargs – Passed to config.
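The note above implies that only the configured punctuation marks are ever considered as sentence boundaries. A rough sketch of that constraint, assuming a hypothetical helper `candidate_positions` and an illustrative set of punctuation marks (the real set comes from `eos.config.eos_chars`):

```python
def candidate_positions(text, eos_chars):
    """Return the indices of characters that could end a sentence.

    Hypothetical helper for illustration only, not part of HanLP.
    """
    return [i for i, ch in enumerate(text) if ch in eos_chars]


# Illustrative set; a trained component defines its own in its config.
eos_chars = {".", "!", "?"}

print(candidate_positions("Dr. Smith left. Why?", eos_chars))  # [2, 14, 19]
print(candidate_positions("no punctuation here", eos_chars))   # []
```

Text with no candidate positions gives the detector nothing to classify, which is why such input cannot be split.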

build_criterion(**kwargs)[source]

Implement this method to build criterion (loss function).

Parameters

**kwargs – The subclass decides the method signature.

build_dataloader(data, batch_size, shuffle, device, logger: logging.Logger, **kwargs) → torch.utils.data.dataloader.DataLoader[source]

Build dataloader for training, dev and test sets. It’s suggested to build vocabs in this method if they are not built yet.

Parameters
  • data – Data representing samples, which can be a path or a list of samples.

  • batch_size – Number of samples per batch.

  • shuffle – Whether to shuffle this dataloader.

  • device – Device tensors should be loaded onto.

  • logger – Logger for reporting a message if building the dataloader takes a long time or if vocabs have to be built.

  • **kwargs – Arguments from **self.config.

build_metric(**kwargs)[source]

Implement this to build metric(s).

Parameters

**kwargs – The subclass decides the method signature.

build_model(training=True, **kwargs) → torch.nn.modules.module.Module[source]

Build model.

Parameters
  • training – True if called during training.

  • **kwargs – **self.config.

build_optimizer(**kwargs)[source]

Implement this method to build an optimizer.

Parameters

**kwargs – The subclass decides the method signature.

build_vocabs(dataset: hanlp.datasets.eos.eos.SentenceBoundaryDetectionDataset, logger, **kwargs)[source]

Override this method to build vocabs.

Parameters
  • dataset – Training set.

  • logger – Logger for reporting progress.
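As an illustration of what building a vocab can involve, here is a minimal sketch of a character vocabulary with a minimum-frequency cutoff. The function `build_char_vocab` and the special tokens are hypothetical, and how this component actually applies options such as `char_min_freq` is an assumption.

```python
from collections import Counter


def build_char_vocab(samples, min_freq=1, specials=("<pad>", "<unk>")):
    """Map characters to integer ids, dropping rare characters.

    Hypothetical helper for illustration only, not part of HanLP.
    """
    counts = Counter(ch for text in samples for ch in text)
    # Special tokens take the first ids.
    vocab = {tok: idx for idx, tok in enumerate(specials)}
    # Sort characters so that id assignment is deterministic.
    for ch, freq in sorted(counts.items()):
        if freq >= min_freq:
            vocab[ch] = len(vocab)
    return vocab


vocab = build_char_vocab(["ab.", "ab!"], min_freq=2)
print(vocab)  # {'<pad>': 0, '<unk>': 1, 'a': 2, 'b': 3}
```

Characters below the cutoff would be mapped to `<unk>` at lookup time in this sketch.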

evaluate_dataloader(data: torch.utils.data.dataloader.DataLoader, criterion: Callable, metric, logger, ratio_width=None, output=False, **kwargs)[source]

Evaluate on a dataloader.

Parameters
  • data – A dataloader, which can be built from any data source.

  • criterion – Loss function.

  • metric – Metric(s).

  • output – Whether to save outputs into some file.

  • **kwargs – Not used.

execute_training_loop(trn: torch.utils.data.dataloader.DataLoader, dev: torch.utils.data.dataloader.DataLoader, epochs, criterion, optimizer, metric, save_dir, logger: logging.Logger, devices, **kwargs)[source]

Implement this to run training loop.

Parameters
  • trn – Training set.

  • dev – Development set.

  • epochs – Number of epochs.

  • criterion – Loss function.

  • optimizer – Optimizer(s).

  • metric – Metric(s).

  • save_dir – The directory to save this component.

  • logger – Logger for reporting progress.

  • devices – Devices this component and dataloader will live on.

  • ratio_width – The width of dataset size measured in number of characters. Used for logger to align messages.

  • **kwargs – Other hyper-parameters passed from sub-class.

fit(trn_data, dev_data, save_dir, epochs=5, append_after_sentence=None, eos_chars=None, eos_char_min_freq=200, eos_char_is_punct=True, char_min_freq=None, window_size=5, batch_size=32, lr=0.001, grad_norm=None, loss_reduction='sum', embedding_size=128, rnn_type: str = 'LSTM', rnn_size=256, rnn_layers=1, rnn_bidirectional=False, dropout=0.2, devices=None, logger=None, seed=None, **kwargs)[source]

Fit to data and trigger the training procedure. The training set and dev set shall be local or remote files.

Parameters
  • trn_data – Training set.

  • dev_data – Development set.

  • save_dir – The directory to save trained component.

  • batch_size – The number of samples in a batch.

  • epochs – Number of epochs.

  • devices – Devices this component will live on.

  • logger – Any logging.Logger instance.

  • seed – Random seed to reproduce this training.

  • finetune – True to load from save_dir instead of creating a randomly initialized component. str to specify a different save_dir to load from.

  • eval_trn – Evaluate training set after each update. This can slow down the training but provides a quick diagnostic for debugging.

  • _device_placeholder – True to create a placeholder tensor which triggers PyTorch to occupy devices so other components won’t take these devices as first choices.

  • **kwargs – Hyperparameters used by sub-classes.

Returns

Any results that sub-classes would like to return, usually the best metrics on the training set.

fit_dataloader(trn: torch.utils.data.dataloader.DataLoader, criterion, optimizer, metric, logger: logging.Logger, **kwargs)[source]

Fit onto a dataloader.

Parameters
  • trn – Training set.

  • criterion – Loss function.

  • optimizer – Optimizer.

  • metric – Metric(s).

  • logger – Logger for reporting progress.

  • **kwargs – Other hyper-parameters passed from sub-class.

predict(data: Union[str, List[str]], batch_size: int = None, strip=True, **kwargs)[source]

Split text into sentences.

Parameters
  • data – A paragraph or a list of paragraphs.

  • batch_size – Number of samples per batch.

  • strip – Strip out blank characters at the head and tail of each sentence.

Returns

A list of sentences or a list of lists of sentences.
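The input and output shapes described above can be sketched with a simple rule-based stand-in. The function `split_sentences` below is hypothetical: it splits at every punctuation mark rather than using the trained model, and only mirrors how a single paragraph yields a list of sentences, how a list of paragraphs yields a list of lists, and what `strip` does.

```python
from typing import List, Union


def split_sentences(data: Union[str, List[str]], strip: bool = True,
                    eos_chars=(".", "!", "?")):
    """Rule-based stand-in that treats every eos char as a boundary.

    Hypothetical helper for illustration only, not the HanLP predictor.
    """
    if isinstance(data, str):
        sentences, start = [], 0
        for i, ch in enumerate(data):
            if ch in eos_chars:
                sentences.append(data[start:i + 1])
                start = i + 1
        # Keep trailing text that has no closing punctuation.
        if data[start:].strip():
            sentences.append(data[start:])
        return [s.strip() for s in sentences] if strip else sentences
    # A list of paragraphs yields one list of sentences per paragraph.
    return [split_sentences(p, strip, eos_chars) for p in data]


print(split_sentences("One. Two!"))               # ['One.', 'Two!']
print(split_sentences("One. Two!", strip=False))  # ['One.', ' Two!']
print(split_sentences(["A. B.", "C?"]))           # [['A.', 'B.'], ['C?']]
```

Unlike this stand-in, the trained detector can decide that a punctuation mark, such as the period in an abbreviation, does not end a sentence.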