mcws_dataset

class hanlp.datasets.tokenization.loaders.multi_criteria_cws.mcws_dataset.MultiCriteriaTextTokenizingDataset(data: Union[str, List], transform: Optional[Union[Callable, List]] = None, cache=None, generate_idx=None, delimiter=None, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False)

A dataset for tokenization framed as a tagging task, with corpora annotated under multiple segmentation criteria.

Parameters
  • data – The local or remote path to a dataset, or a list of samples where each sample is a dict.

  • transform – Predefined transform(s).

  • cache – True to enable caching, so that transforms won’t be called twice.

  • generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.

  • delimiter – Delimiter between tokens used to split a line in the corpus.

  • max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.

  • sent_delimiter – Delimiter between sentences, such as a period or comma, which marks a point where a long sentence can be split.

  • char_level – Whether the sequence length is measured at char level.

  • hard_constraint – Whether to enforce a hard length constraint on sentences. If a sentence contains no sent_delimiter, it will be split at a token boundary anyway.
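
The interaction between max_seq_len, sent_delimiter, char_level and hard_constraint can be easier to follow with a small example. The following is a minimal sketch in plain Python of how such splitting might work; split_long_sentence is a hypothetical helper written for illustration only, it is not part of HanLP, and HanLP's actual splitting logic may differ in its details.

```python
from typing import List, Set


def split_long_sentence(tokens: List[str], max_seq_len: int, sent_delimiter: Set[str],
                        char_level: bool = False,
                        hard_constraint: bool = False) -> List[List[str]]:
    """Hypothetical helper (not HanLP code): split tokens into shorter chunks."""

    def length(chunk: List[str]) -> int:
        # char_level=True measures characters, otherwise tokens.
        return sum(len(t) for t in chunk) if char_level else len(chunk)

    chunks: List[List[str]] = []
    current: List[str] = []
    last_delim = -1  # index in `current` of the latest delimiter, -1 if none
    for token in tokens:
        current.append(token)
        if token in sent_delimiter:
            last_delim = len(current) - 1
        if length(current) <= max_seq_len:
            continue
        if 0 <= last_delim < len(current) - 1:
            cut = last_delim + 1      # split right after the latest delimiter
        elif hard_constraint and len(current) > 1:
            cut = len(current) - 1    # no usable delimiter: split before the newest token
        else:
            continue                  # soft constraint: keep growing until a delimiter shows up
        chunks.append(current[:cut])
        current = current[cut:]
        # Recompute the delimiter position for the tokens carried over.
        last_delim = max((i for i, t in enumerate(current) if t in sent_delimiter), default=-1)
    if current:
        chunks.append(current)
    return chunks
```

In this sketch, a sentence without any delimiter stays in one piece unless hard_constraint=True, which mirrors the "if possible" wording of max_seq_len above.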

load_file(filepath: Union[Iterable[str], Dict[str, str]])

Load multi-criteria corpora specified in filepath.

Parameters

filepath – A list of files, each of whose filename indicates its criterion, or a dict of filename-criterion pairs.

$ tree -L 2 .
.
├── cnc
│   ├── dev.txt
│   ├── test.txt
│   ├── train-all.txt
│   └── train.txt
├── ctb
│   ├── dev.txt
│   ├── test.txt
│   ├── train-all.txt
│   └── train.txt
├── sxu
│   ├── dev.txt
│   ├── test.txt
│   ├── train-all.txt
│   └── train.txt
├── udc
│   ├── dev.txt
│   ├── test.txt
│   ├── train-all.txt
│   └── train.txt
├── wtb
│   ├── dev.txt
│   ├── test.txt
│   ├── train-all.txt
│   └── train.txt
└── zx
    ├── dev.txt
    ├── test.txt
    ├── train-all.txt
    └── train.txt

$ head -n 2 ctb/dev.txt
上海 浦东 开发 与 法制 建设 同步
新华社 上海 二月 十日 电 ( 记者 谢金虎 、 张持坚 )
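
For a directory layout like the one above, the filename-criterion pairs could be collected with a few lines of standard-library Python. This is only a sketch: criterion_map and tokens_of are hypothetical helpers, not HanLP functions, and treating the parent directory name (cnc, ctb, ...) as the criterion is an assumption based on the tree listing. The key/value order follows the wording of the filepath parameter.

```python
from pathlib import Path
from typing import Dict, List


def criterion_map(root: str, split: str = "dev.txt") -> Dict[str, str]:
    """Hypothetical helper: map each <root>/<criterion>/<split> path to its criterion."""
    # Assumption: the directory that holds the file names the segmentation criterion.
    return {p.as_posix(): p.parent.name for p in sorted(Path(root).glob(f"*/{split}"))}


def tokens_of(line: str) -> List[str]:
    """Hypothetical helper: split one corpus line into tokens on whitespace."""
    return line.split()
```

Calling criterion_map(".") in the corpus root would yield one entry per criterion directory, and tokens_of applied to the first line of ctb/dev.txt would return its seven space-separated tokens.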

should_load_file(data) → bool

Determines whether data is a filepath.

Parameters

data – Data to check.

Returns: True to indicate it’s a filepath.