eos
- class hanlp.components.eos.ngram.NgramSentenceBoundaryDetector(**kwargs)[source]
A sentence boundary detector using ngrams as features and an LSTM as the encoder (Schweter & Ahmed 2019). It predicts whether a punctuation mark denotes an EOS.
Note
This component won't work on text without the punctuations defined in its config. It's always recommended to understand how it works before using it. The predefined punctuations can be listed with the following code.
>>> print(eos.config.eos_chars)
- Parameters
**kwargs – Passed to config.
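A minimal usage sketch, assuming the pretrained identifier UD_CTB_EOS_MUL is available under hanlp.pretrained.eos in your HanLP version; substitute any EOS model you have.
>>> import hanlp
>>> eos = hanlp.load(hanlp.pretrained.eos.UD_CTB_EOS_MUL)  # assumed pretrained identifier
>>> print(eos.config.eos_chars)  # the punctuations the detector relies on (see the note above)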
- build_criterion(**kwargs)[source]
Implement this method to build criterion (loss function).
- Parameters
**kwargs – The subclass decides the method signature.
- build_dataloader(data, batch_size, shuffle, device, logger: logging.Logger, **kwargs) → torch.utils.data.dataloader.DataLoader[source]
Build dataloader for training, dev and test sets. It’s suggested to build vocabs in this method if they are not built yet.
- Parameters
data – Data representing samples, which can be a path or a list of samples.
batch_size – Number of samples per batch.
shuffle – Whether to shuffle this dataloader.
device – Device tensors should be loaded onto.
logger – Logger for reporting messages if the dataloader takes a long time to build or if vocabs have to be built.
**kwargs – Arguments from **self.config.
- build_metric(**kwargs)[source]
Implement this to build metric(s).
- Parameters
**kwargs – The subclass decides the method signature.
- build_model(training=True, **kwargs) → torch.nn.modules.module.Module[source]
Build model.
- Parameters
training – True if called during training.
**kwargs – Arguments from **self.config.
- build_optimizer(**kwargs)[source]
Implement this method to build an optimizer.
- Parameters
**kwargs – The subclass decides the method signature.
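As a hedged illustration of the override pattern only (not HanLP's actual implementation), a subclass could build a standard PyTorch optimizer over the model parameters, reading the learning rate passed down from the config:
>>> import torch
>>> def build_optimizer(self, lr=0.001, **kwargs):
...     # assumed sketch: a plain Adam optimizer; self.model is built by build_model()
...     return torch.optim.Adam(self.model.parameters(), lr=lr)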
- build_vocabs(dataset: hanlp.datasets.eos.eos.SentenceBoundaryDetectionDataset, logger, **kwargs)[source]
Override this method to build vocabs.
- Parameters
dataset – Training set.
logger – Logger for reporting progress.
- evaluate_dataloader(data: torch.utils.data.dataloader.DataLoader, criterion: Callable, metric, logger, ratio_width=None, output=False, **kwargs)[source]
Evaluate on a dataloader.
- Parameters
data – Dataloader which can be built from any data source.
criterion – Loss function.
metric – Metric(s).
output – Whether to save outputs to a file.
**kwargs – Not used.
- execute_training_loop(trn: torch.utils.data.dataloader.DataLoader, dev: torch.utils.data.dataloader.DataLoader, epochs, criterion, optimizer, metric, save_dir, logger: logging.Logger, devices, **kwargs)[source]
Implement this to run the training loop.
- Parameters
trn – Training set.
dev – Development set.
epochs – Number of epochs.
criterion – Loss function.
optimizer – Optimizer(s).
metric – Metric(s).
save_dir – The directory to save this component.
logger – Logger for reporting progress.
devices – Devices this component and dataloader will live on.
ratio_width – The width of the dataset size measured in characters, used by the logger to align messages.
**kwargs – Other hyper-parameters passed from sub-class.
- fit(trn_data, dev_data, save_dir, epochs=5, append_after_sentence=None, eos_chars=None, eos_char_min_freq=200, eos_char_is_punct=True, char_min_freq=None, window_size=5, batch_size=32, lr=0.001, grad_norm=None, loss_reduction='sum', embedding_size=128, rnn_type: str = 'LSTM', rnn_size=256, rnn_layers=1, rnn_bidirectional=False, dropout=0.2, devices=None, logger=None, seed=None, **kwargs)[source]
Fit to data and trigger the training procedure. The training and dev sets shall be local or remote files.
- Parameters
trn_data – Training set.
dev_data – Development set.
save_dir – The directory to save trained component.
batch_size – The number of samples in a batch.
epochs – Number of epochs.
devices – Devices this component will live on.
logger – Any logging.Logger instance.
seed – Random seed to reproduce this training.
finetune – True to load from save_dir instead of creating a randomly initialized component, or a str to specify a different save_dir to load from.
eval_trn – Evaluate the training set after each update. This can slow down training but provides a quick diagnostic for debugging.
_device_placeholder – True to create a placeholder tensor which triggers PyTorch to occupy devices so other components won't take these devices as first choices.
**kwargs – Hyperparameters used by sub-classes.
- Returns
Any results sub-classes would like to return. Usually the best metrics on the training set.
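A hedged training sketch; the corpus paths below are placeholders for your own sentence-boundary data:
>>> from hanlp.components.eos.ngram import NgramSentenceBoundaryDetector
>>> detector = NgramSentenceBoundaryDetector()
>>> detector.fit(trn_data='data/eos/train.txt',  # placeholder path
...              dev_data='data/eos/dev.txt',    # placeholder path
...              save_dir='data/model/eos',
...              epochs=5,
...              window_size=5,
...              batch_size=32)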
- fit_dataloader(trn: torch.utils.data.dataloader.DataLoader, criterion, optimizer, metric, logger: logging.Logger, **kwargs)[source]
Fit onto a dataloader.
- Parameters
trn – Training set.
criterion – Loss function.
optimizer – Optimizer.
metric – Metric(s).
logger – Logger for reporting progress.
**kwargs – Other hyper-parameters passed from sub-class.
- predict(data: Union[str, List[str]], batch_size: Optional[int] = None, strip=True, **kwargs)[source]
Split sentences.
- Parameters
data – A paragraph or a list of paragraphs.
batch_size – Number of samples per batch.
strip – Whether to strip whitespace at the beginning and end of each sentence.
- Returns
A list of sentences or a list of lists of sentences.
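A hedged usage sketch, reusing the eos component loaded above; the example paragraphs are for illustration only:
>>> eos.predict('Hello! This is a paragraph. It has three sentences.')  # returns a list of sentences
>>> eos.predict(['First paragraph. Two sentences.',
...              'Second paragraph? Also two sentences.'])  # returns a list of lists of sentences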