tsv

class hanlp.datasets.ner.loaders.tsv.TSVTaggingDataset(data: Union[str, List], transform=None, cache=None, generate_idx=None, max_seq_len=None, sent_delimiter=None, char_level=False, hard_constraint=False, **kwargs)
Parameters
• data – The local or remote path to a dataset, or a list of samples where each sample is a dict.

• transform – Predefined transform(s).

• cache – True to enable caching, so that transforms won’t be called twice.

• generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.

• max_seq_len – Sentences longer than max_seq_len will be split into shorter ones if possible.

• sent_delimiter – Delimiter between sentences, like period or comma, which indicates a long sentence can be split here.

• char_level – Whether the sequence length is measured at char level, which is never the case for lemmatization.

• hard_constraint – Whether to enforce a hard length constraint on sentences. If a sentence contains no sent_delimiter, it will be split at a token anyway.

• kwargs – Not used.
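The interaction of max_seq_len, sent_delimiter and hard_constraint described above can be illustrated with a minimal sketch. This is one possible reading of the parameter descriptions, not HanLP's actual implementation: the helper name split_long_sentence is hypothetical, length is counted in tokens only (char_level is not modeled), and splitting at every delimiter is an assumption.

```python
from typing import List, Optional


def split_long_sentence(
    tokens: List[str],
    max_seq_len: int,
    sent_delimiter: Optional[str] = None,
    hard_constraint: bool = False,
) -> List[List[str]]:
    """Hypothetical helper: split an over-long token list into shorter chunks."""
    # Sentences within the limit are returned unchanged.
    if len(tokens) <= max_seq_len:
        return [tokens]
    chunks: List[List[str]] = []
    current: List[str] = []
    for token in tokens:
        current.append(token)
        # Assumption: a delimiter token always closes the current chunk.
        at_delimiter = sent_delimiter is not None and token == sent_delimiter
        # With hard_constraint, also cut once the chunk reaches the limit,
        # even if no delimiter was seen.
        too_long = hard_constraint and len(current) >= max_seq_len
        if at_delimiter or too_long:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks
```

In this sketch, a chunk can still exceed max_seq_len when hard_constraint is False and no delimiter occurs, which mirrors the "if possible" wording of max_seq_len.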

Load a .tsv file. A .tsv file for tagging is a tab-separated text file in which each non-empty line has two columns, the token and its tag, and empty lines mark the end of a sentence.

Parameters
• filepath – Path to a .tsv tagging file.

$ head eng.train.tsv
-DOCSTART-      O

EU      S-ORG
rejects O
German  S-MISC
call    O
to      O
boycott O
British S-MISC
lamb    O
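
The file format shown above can be read with a few lines of plain Python. The following is a minimal sketch, not the code HanLP uses: the function name read_tsv_tagging is hypothetical, and it assumes exactly one tab between the token and the tag on every non-empty line.

```python
from typing import Iterable, List, Tuple

Sentence = Tuple[List[str], List[str]]


def read_tsv_tagging(lines: Iterable[str]) -> List[Sentence]:
    """Hypothetical reader: group token/tag lines into sentences."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    tags: List[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # An empty line marks the end of the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        # Raises ValueError if the line does not have exactly two columns.
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(tag)
    # Flush the last sentence if the input does not end with an empty line.
    if tokens:
        sentences.append((tokens, tags))
    return sentences


sample = ["-DOCSTART-\tO\n", "\n", "EU\tS-ORG\n", "rejects\tO\n", "German\tS-MISC\n"]
parsed = read_tsv_tagging(sample)
```

To read from disk, an open file object can be passed directly, for example read_tsv_tagging(open("eng.train.tsv", encoding="utf-8")).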