txt

class hanlp.datasets.tokenization.loaders.txt.TextTokenizingDataset(data: Union[str, List], transform: Optional[Union[Callable, List]] = None, cache=None, generate_idx=None, delimiter=None, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False)

A dataset for tagging tokenization tasks.

Parameters

data – The local or remote path to a dataset, or a list of samples where each sample is a dict.
transform – Predefined transform(s).
cache – True to enable caching, so that transforms won’t be called twice.
generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.
delimiter – Delimiter between tokens used to split a line in the corpus.
max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.
sent_delimiter – Delimiter between sentences, such as a period or comma, which indicates that a long sentence can be split at that point.
char_level – Whether the sequence length is measured at char level.
hard_constraint – Whether to enforce a hard length constraint on sentences. If a sentence contains no sent_delimiter, it will be split at a token boundary anyway.
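
The interaction between max_seq_len, sent_delimiter, char_level and hard_constraint can be illustrated with a small standalone sketch. The function split_long_sentence below is a hypothetical helper written for this illustration, not HanLP's implementation, and its greedy merging strategy is an assumption; the library's actual splitting behaviour may differ in detail.

```python
from typing import List, Optional, Set


def split_long_sentence(tokens: List[str],
                        max_seq_len: int,
                        sent_delimiter: Optional[Set[str]] = None,
                        char_level: bool = False,
                        hard_constraint: bool = False) -> List[List[str]]:
    """Hypothetical helper (not part of HanLP): split one tokenized sentence."""
    sent_delimiter = sent_delimiter or set()

    def length(toks: List[str]) -> int:
        # Measure in characters when char_level is set, otherwise in tokens.
        return sum(len(t) for t in toks) if char_level else len(toks)

    # Pass 1: cut the sentence after every delimiter token.
    pieces, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in sent_delimiter:
            pieces.append(current)
            current = []
    if current:
        pieces.append(current)

    # Pass 2: greedily merge adjacent pieces while they still fit.
    merged, current = [], []
    for piece in pieces:
        if current and length(current) + length(piece) > max_seq_len:
            merged.append(current)
            current = []
        current = current + piece
    if current:
        merged.append(current)

    if not hard_constraint:
        # Soft mode: a chunk without any delimiter may stay over-long.
        return merged

    # Pass 3 (hard mode): force-split over-long chunks at token boundaries.
    result = []
    for chunk in merged:
        current = []
        for tok in chunk:
            if current and length(current) + length([tok]) > max_seq_len:
                result.append(current)
                current = []
            current.append(tok)
        if current:
            result.append(current)
    return result
```

In this sketch, a sentence with no delimiter token is returned unchanged unless hard_constraint is set, in which case it is cut at token boundaries so that no chunk exceeds max_seq_len (a single token longer than the limit is still kept whole).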

load_file(filepath: str)

Load a tokenized corpus. The format is one sentence per line, where each line consists of tokens separated by a delimiter (usually a space).

$ head train.txt
上海 浦东 开发 与 法制 建设 同步
新华社 上海 二月 十日 电 （ 记者 谢金虎 、 张持坚 ）

Parameters: filepath – The path to the corpus.
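
As a rough illustration of the corpus format described above, the following standalone sketch parses such lines into token lists and derives per-character tags. Both read_tokenized_lines and to_bmes are hypothetical names introduced here; the BMES scheme is a common choice for tagging-based tokenization and is an assumption, not necessarily the scheme this dataset produces.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_tokenized_lines(lines: Iterable[str],
                         delimiter: Optional[str] = None) -> Iterator[List[str]]:
    """Hypothetical reader: yield one token list per non-empty line.

    With delimiter=None, str.split splits on any run of whitespace.
    """
    for line in lines:
        tokens = [t for t in line.strip().split(delimiter) if t]
        if tokens:
            yield tokens


def to_bmes(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Hypothetical converter: map tokens to characters and BMES tags."""
    chars, tags = [], []
    for tok in tokens:
        chars.extend(tok)
        if len(tok) == 1:
            tags.append("S")  # single-character token
        else:
            # B(egin), M(iddle) for inner characters, E(nd)
            tags.extend(["B"] + ["M"] * (len(tok) - 2) + ["E"])
    return chars, tags


if __name__ == "__main__":
    sample = ["上海 浦东 开发 与 法制 建设 同步\n"]
    for sent in read_tokenized_lines(sample):
        print(sent, to_bmes(sent)[1])
```

To read an actual file, the same function could be fed an open file handle, for example open(filepath, encoding="utf-8"), since iterating over a text file yields its lines.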
