tsv
- class hanlp.datasets.ner.loaders.tsv.TSVTaggingDataset(data: Union[str, List], transform: Optional[Union[Callable, List]] = None, cache=None, generate_idx=None, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False, **kwargs)
- Parameters
data – The local or remote path to a dataset, or a list of samples where each sample is a dict.
transform – Predefined transform(s).
cache – True to enable caching, so that transforms won’t be called twice.
generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.
max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.
sent_delimiter – Delimiter between sentences, like a period or comma, which indicates where a long sentence can be split.
char_level – Whether the sequence length is measured at char level, which is never the case for lemmatization.
hard_constraint – Whether to enforce a hard length constraint on sentences. If there is no sent_delimiter in a sentence, it will be split at a token anyway.
kwargs – Not used.
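The interaction between max_seq_len, sent_delimiter, char_level and hard_constraint can be easier to see in code. The following is a minimal sketch of one possible splitting strategy and is not HanLP's actual implementation; the function name split_long_sentence and its greedy merging behaviour are assumptions made for illustration.

```python
from typing import Iterable, List, Optional


def split_long_sentence(tokens: List[str], max_seq_len: int,
                        sent_delimiter: Optional[Iterable[str]] = None,
                        char_level: bool = False,
                        hard_constraint: bool = False) -> List[List[str]]:
    """Hypothetical helper: split tokens into chunks of at most max_seq_len where possible."""

    def length(chunk: List[str]) -> int:
        # char_level measures characters, otherwise tokens are counted.
        return sum(len(t) for t in chunk) if char_level else len(chunk)

    if length(tokens) <= max_seq_len:
        return [tokens]

    # 1. Cut into segments, each ending right after a delimiter token.
    delimiters = set(sent_delimiter or ())
    segments, current = [], []
    for token in tokens:
        current.append(token)
        if token in delimiters:
            segments.append(current)
            current = []
    if current:
        segments.append(current)

    # 2. Under a hard constraint, over-long segments are cut at a token anyway.
    if hard_constraint:
        pieces = []
        for segment in segments:
            piece = []
            for token in segment:
                if piece and length(piece + [token]) > max_seq_len:
                    pieces.append(piece)
                    piece = []
                piece.append(token)
            if piece:
                pieces.append(piece)
        segments = pieces

    # 3. Greedily merge neighbouring segments while the result still fits.
    chunks, current = [], []
    for segment in segments:
        if current and length(current + segment) > max_seq_len:
            chunks.append(current)
            current = []
        current = current + segment
    if current:
        chunks.append(current)
    return chunks
```

In this sketch, a segment without any delimiter stays over-length unless hard_constraint is set, which mirrors the "if possible" wording of max_seq_len.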
- load_file(filepath)
Load a .tsv file. A .tsv file for tagging is defined as a tab-separated text file in which each non-empty line has two columns, the token and its tag, and empty lines mark the end of sentences.
- Parameters
filepath – Path to a .tsv tagging file.
$ head eng.train.tsv
-DOCSTART-  O

EU          S-ORG
rejects     O
German      S-MISC
call        O
to          O
boycott     O
British     S-MISC
lamb        O
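To make the file format concrete, here is a minimal sketch of a reader for the two-column layout described above. It is not the code behind load_file; the function name iter_tagged_sentences and the choice to yield (tokens, tags) pairs are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_tagged_sentences(lines: Iterable[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Hypothetical helper: group tab-separated token/tag lines into sentences."""
    tokens: List[str] = []
    tags: List[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # An empty line marks the end of the current sentence.
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
            continue
        # Non-empty lines carry exactly two columns: token and tag.
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(tag)
    # Emit the last sentence if the input does not end with an empty line.
    if tokens:
        yield tokens, tags
```

Because the helper accepts any iterable of lines, an open file handle can be passed directly, for example `with open("eng.train.tsv", encoding="utf-8") as src: sentences = list(iter_tagged_sentences(src))`. A line with more or fewer than two columns raises a ValueError in this sketch.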