SentenceBoundaryDetectionDataset(data: Union[str, List], transform: Union[Callable, List] = None, cache=None, append_after_sentence=None, eos_chars=None, eos_char_min_freq=200, eos_char_is_punct=True, window_size=5, **kwargs)
Dataset for sentence boundary detection (eos).
data – The local or remote path to a dataset, or a list of samples where each sample is a dict.
transform – Predefined transform(s).
cache – True to enable caching, so that transforms won’t be called twice.
append_after_sentence – A str to insert at the tail of each sentence. For example, English always has a space between sentences.
eos_chars – Punctuation characters at the tail of sentences. If None, they will be built from the training samples.
eos_char_min_freq – Minimum frequency required to keep an eos char.
eos_char_is_punct – Limit eos chars to punctuation.
window_size – Window size to extract ngram features.
kwargs – Not used.
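The interplay of eos_chars, eos_char_min_freq, eos_char_is_punct and window_size can be illustrated with a minimal standalone sketch. This is not the library's implementation: the helper names build_eos_chars and candidate_windows are hypothetical, the ">=" frequency comparison is an assumption, and punctuation is approximated by Unicode categories starting with "P".

```python
import unicodedata
from collections import Counter


def build_eos_chars(sentences, min_freq=200, is_punct=True):
    """Collect characters that end at least `min_freq` sentences.

    Hypothetical helper; the threshold comparison (>=) is an assumption.
    """
    counts = Counter(s[-1] for s in sentences if s)
    return {
        ch
        for ch, freq in counts.items()
        if freq >= min_freq
        # Assumption: "punctuation" means a Unicode category starting with "P".
        and (not is_punct or unicodedata.category(ch).startswith("P"))
    }


def candidate_windows(text, eos_chars, window_size=5):
    """Yield (index, left context, right context) for each candidate boundary."""
    for i, ch in enumerate(text):
        if ch in eos_chars:
            left = text[max(0, i - window_size):i]
            right = text[i + 1:i + 1 + window_size]
            yield i, left, right


# Toy training sentences; a tiny min_freq is used only for demonstration.
sentences = ["Hello there.", "How are you?", "Fine.", "ok1"]
eos = build_eos_chars(sentences, min_freq=1)
windows = list(candidate_windows("Hi. Bye?", eos, window_size=2))
```

In this sketch the trailing "1" of "ok1" is dropped because it is not punctuation, and each candidate position is paired with up to window_size characters on either side, which is the kind of context from which ngram features could be extracted.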