mcws_dataset
- class hanlp.datasets.tokenization.loaders.multi_criteria_cws.mcws_dataset.MultiCriteriaTextTokenizingDataset(data: Union[str, List], transform: Optional[Union[Callable, List]] = None, cache=None, generate_idx=None, delimiter=None, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False)
A dataset for multi-criteria tokenization modeled as a tagging task.
- Parameters
data – The local or remote path to a dataset, or a list of samples where each sample is a dict.
transform – Predefined transform(s).
cache – True to enable caching, so that transforms won't be called twice.
generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.
delimiter – Delimiter between tokens used to split a line in the corpus.
max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.
sent_delimiter – Delimiter between sentences, such as a period or comma, which indicates that a long sentence can be split there.
char_level – Whether the sequence length is measured at char level.
hard_constraint – Whether to enforce a hard length constraint on sentences. If there is no sent_delimiter in a sentence, it will be split at a token anyway.
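As a minimal sketch of putting these parameters together (the corpus paths are hypothetical placeholders, and data is assumed here to accept the same file specification that load_file() below does):

```python
# A minimal sketch; the corpus paths are hypothetical placeholders.
from hanlp.datasets.tokenization.loaders.multi_criteria_cws.mcws_dataset import (
    MultiCriteriaTextTokenizingDataset,
)

dataset = MultiCriteriaTextTokenizingDataset(
    data={'ctb/train.txt': 'ctb', 'sxu/train.txt': 'sxu'},  # filename-criterion pairs
    delimiter=' ',          # tokens in each corpus line are space-separated
    max_seq_len=126,        # split sentences longer than this when possible
    sent_delimiter='。',    # prefer splitting at the Chinese full stop
    char_level=True,        # measure sentence length in characters
    hard_constraint=True,   # split at a token even when no sent_delimiter occurs
)
```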
- load_file(filepath: Union[Iterable[str], Dict[str, str]])
Load multi-criteria corpora specified in filepath.
- Parameters
filepath – A list of files, where each file's name serves as its criterion, or a dict of filename-criterion pairs.
```
$ tree -L 2 .
.
├── cnc
│   ├── dev.txt
│   ├── test.txt
│   ├── train-all.txt
│   └── train.txt
├── ctb
│   ├── dev.txt
│   ├── test.txt
│   ├── train-all.txt
│   └── train.txt
├── sxu
│   ├── dev.txt
│   ├── test.txt
│   ├── train-all.txt
│   └── train.txt
├── udc
│   ├── dev.txt
│   ├── test.txt
│   ├── train-all.txt
│   └── train.txt
├── wtb
│   ├── dev.txt
│   ├── test.txt
│   ├── train-all.txt
│   └── train.txt
└── zx
    ├── dev.txt
    ├── test.txt
    ├── train-all.txt
    └── train.txt
$ head -n 2 ctb/dev.txt
上海 浦东 开发 与 法制 建设 同步
新华社 上海 二月 十日 电 （ 记者 谢金虎 、 张持坚 ）
```
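Given that layout, a sketch of the two accepted forms of filepath (the relative paths are placeholders pointing into the directories above):

```python
# List form: per the docstring, each file's name supplies its criterion.
dataset.load_file(['ctb/train.txt', 'sxu/train.txt', 'zx/train.txt'])

# Dict form: map each file to its criterion name explicitly.
dataset.load_file({
    'ctb/train.txt': 'ctb',
    'sxu/train.txt': 'sxu',
    'zx/train.txt': 'zx',
})
```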