eos

class hanlp.datasets.eos.eos.SentenceBoundaryDetectionDataset(data: Union[str, List], transform: Optional[Union[Callable, List]] = None, cache=None, append_after_sentence=None, eos_chars=None, eos_char_min_freq=200, eos_char_is_punct=True, window_size=5, **kwargs)

Dataset for sentence boundary detection (eos).

Parameters

data – The local or remote path to a dataset, or a list of samples where each sample is a dict.
transform – Predefined transform(s).
cache – True to enable caching, so that transforms won’t be called twice.
append_after_sentence – A str to insert at the tail of each sentence. For example, English always has a space between sentences.
eos_chars – Punctuation characters at the tail of sentences. If None, they will be built from training samples.
eos_char_min_freq – Minimal frequency to keep an eos char.
eos_char_is_punct – Limit eos chars to punctuation.
window_size – Window size to extract ngram features.
kwargs – Not used.
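
To make the interplay of eos_chars, eos_char_min_freq, eos_char_is_punct and window_size more concrete, here is a minimal standard-library sketch. It is not HanLP's implementation: the helper names build_eos_chars and candidate_windows are hypothetical, and the rules they apply (count the last character of each training sentence, keep frequent ones, optionally require a Unicode punctuation category, then take up to window_size characters on each side of a candidate) are assumptions inferred from the parameter descriptions above.

```python
import unicodedata
from collections import Counter


def build_eos_chars(sentences, min_freq=200, is_punct=True):
    """Hypothetical helper: collect sentence-final characters.

    Assumption: a character qualifies if it ends at least `min_freq`
    sentences and, when `is_punct` is True, its Unicode category
    starts with "P" (punctuation).
    """
    counts = Counter(s[-1] for s in sentences if s)
    return {
        ch
        for ch, freq in counts.items()
        if freq >= min_freq
        and (not is_punct or unicodedata.category(ch).startswith("P"))
    }


def candidate_windows(text, eos_chars, window_size=5):
    """Hypothetical helper: yield (index, left, right) for every eos char.

    `left` and `right` hold up to `window_size` characters on each side
    of the candidate boundary; they are shorter near the ends of `text`.
    """
    for i, ch in enumerate(text):
        if ch in eos_chars:
            left = text[max(0, i - window_size):i]
            right = text[i + 1:i + 1 + window_size]
            yield i, left, right


sentences = ["你好。", "今天天气不错。", "真的吗?", "标题"]
eos_chars = build_eos_chars(sentences, min_freq=1)
windows = list(candidate_windows("你好。今天天气不错。", eos_chars, window_size=2))
```

In this sketch the title-like sample "标题" contributes no eos char while is_punct is True, because its last character is not punctuation. Raising min_freq drops rare sentence-final characters such as "?" here.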

load_file(filepath: str)

Load eos corpus.

Parameters: filepath – Path to the corpus.

$ head -n 2 ctb8.txt
中国经济简讯
新华社北京十月二十九日电中国经济简讯
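
Judging from the sample above, the corpus looks like plain UTF-8 text with one sentence per line. The sketch below reads such a file using only the standard library. The function name read_sentences is hypothetical, and the behaviour shown (stripping whitespace, skipping blank lines) is an assumption rather than a description of what load_file actually does.

```python
from typing import Iterator


def read_sentences(filepath: str) -> Iterator[str]:
    """Hypothetical reader: yield one stripped, non-empty sentence per line."""
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            sentence = line.strip()
            if sentence:  # assumption: blank lines carry no sentence
                yield sentence
```

A list produced this way could be passed as training sentences to a routine that derives eos chars, or simply inspected to sanity-check a corpus before handing its path to the dataset.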
