eos

class hanlp.datasets.eos.eos.SentenceBoundaryDetectionDataset(data: Union[str, List], transform: Union[Callable, List] = None, cache=None, append_after_sentence=None, eos_chars=None, eos_char_min_freq=200, eos_char_is_punct=True, window_size=5, **kwargs)

Dataset for sentence boundary detection (eos).

Parameters
  • data – The local or remote path to a dataset, or a list of samples where each sample is a dict.

  • transform – Predefined transform(s).

  • cache – True to enable caching, so that transforms won’t be called twice.

  • append_after_sentence – A str to append to the end of each sentence. For example, English text always has a space between sentences.

  • eos_chars – Punctuation characters that can end a sentence. If None, they will be built from the training samples.

  • eos_char_min_freq – Minimum frequency required to keep an eos char.

  • eos_char_is_punct – Limit eos chars to punctuation characters.

  • window_size – Window size to extract ngram features.

  • kwargs – Not used.
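The interplay of eos_chars, eos_char_min_freq, eos_char_is_punct and window_size can be illustrated with a minimal, self-contained sketch. This is not HanLP's implementation: the helper names build_eos_chars and candidate_windows are hypothetical, and both the punctuation test (Unicode category starting with "P") and the shape of the context window are assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def build_eos_chars(sentences, min_freq=200, is_punct=True):
    """Collect characters that end at least `min_freq` sentences (hypothetical helper)."""
    counts = Counter(s[-1] for s in sentences if s)
    return {
        ch
        for ch, freq in counts.items()
        if freq >= min_freq
        # Assumption: "punctuation" means a Unicode category starting with "P".
        and (not is_punct or unicodedata.category(ch).startswith("P"))
    }


def candidate_windows(text, eos_chars, window_size=5):
    """Yield (index, left_context, right_context) for every candidate eos char."""
    for i, ch in enumerate(text):
        if ch in eos_chars:
            left = text[max(0, i - window_size):i]
            right = text[i + 1:i + 1 + window_size]
            yield i, left, right


sentences = ["Hello world.", "How are you?", "Fine.", "No punct"]
eos_chars = build_eos_chars(sentences, min_freq=2)
windows = list(candidate_windows("Hi. Bye.", eos_chars, window_size=2))
```

Here only "." survives the frequency threshold, and each occurrence of it in the input text becomes a candidate boundary together with up to window_size characters of context on either side.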

load_file(filepath: str)

Load eos corpus.

Parameters

filepath – Path to the corpus.

$ head -n 2 ctb8.txt
中国经济简讯
新华社北京十月二十九日电中国经济简讯
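The sample above suggests a plain-text corpus with one sentence per line. The following sketch reads such a file and joins the sentences into one string while recording where each sentence ends. It is an illustration under that assumption only: read_sentences and join_with_boundaries are hypothetical helpers, not part of HanLP, and load_file itself may behave differently.

```python
import os
import tempfile


def read_sentences(filepath):
    """Return the non-empty lines of a one-sentence-per-line corpus."""
    with open(filepath, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def join_with_boundaries(sentences, append_after_sentence=None):
    """Concatenate sentences and record the index of each sentence's last char."""
    sep = append_after_sentence or ""
    text, boundaries = "", []
    for sent in sentences:
        text += sent
        boundaries.append(len(text) - 1)
        text += sep
    return text, boundaries


# Write the two sample lines to a temporary file so the sketch is runnable.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("中国经济简讯\n新华社北京十月二十九日电中国经济简讯\n")
    path = tmp.name

sentences = read_sentences(path)
os.remove(path)
text, boundaries = join_with_boundaries(sentences)
```

For a language such as English, passing append_after_sentence=" " to the hypothetical join_with_boundaries would place a space after each sentence, mirroring the parameter described above.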