json

class hanlp.datasets.ner.json_ner.JsonNERDataset(data: Union[str, List], transform: Union[Callable, List] = None, cache=None, generate_idx=None, doc_level_offset=True, tagset=None)[source]

A dataset for .jsonlines format NER corpora.

Parameters
  • data – The local or remote path to a dataset, or a list of samples where each sample is a dict.

  • transform – Predefined transform(s).

  • cache – True to enable caching, so that transforms won’t be called twice.

  • generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.

  • doc_level_offset – True to indicate the offsets in jsonlines are of document level.

  • tagset – Optional tagset to prune entities outside of this tagset from datasets.
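
The role of generate_idx can be sketched in plain Python without HanLP. This is a minimal illustration under stated assumptions: the key name "idx", the "token" field, and the length-based "model" are all hypothetical stand-ins, not the library's actual field names or behavior.

```python
# Illustrative only: "idx" and "token" are hypothetical key names.
samples = [
    {"token": ["Paris", "is", "lovely", "in", "spring"]},
    {"token": ["Hello"]},
    {"token": ["HanLP", "tags", "entities"]},
]

# Record each sample's original position, as generate_idx is described to do.
for i, sample in enumerate(samples):
    sample["idx"] = i

# A sampler may re-order samples, e.g. sorting by length to batch similar sizes.
reordered = sorted(samples, key=lambda s: len(s["token"]))

# Stand-in for a model: the "prediction" is just the token count.
predictions = [(s["idx"], len(s["token"])) for s in reordered]

# Restore the original order using the stored index.
restored = [pred for _, pred in sorted(predictions)]
print(restored)
```

Because every prediction carries the index of the sample it came from, the output can be realigned with the input regardless of how the sampler shuffled or bucketed the data.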

load_file(filepath: str)[source]

Load a .jsonlines NER corpus. A sample of such a corpus can be printed using the following script.

import json
from hanlp_common.document import Document
from hanlp.datasets.srl.ontonotes5.chinese import ONTONOTES5_CONLL12_CHINESE_DEV
from hanlp.utils.io_util import get_resource

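# get_resource resolves the corpus identifier to a local file path.
with open(get_resource(ONTONOTES5_CONLL12_CHINESE_DEV)) as src:
    for line in src:
        # Each line of a .jsonlines file is one JSON document.
        doc = json.loads(line)
        print(Document(doc))
        break  # Print only the first document.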

Parameters
  • filepath – .jsonlines NER corpus.
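
To make doc_level_offset and tagset more concrete, here is a rough sketch in plain Python of what converting document-level spans to sentence-level spans and pruning by tagset could look like. It is not HanLP's implementation. The "sentences" and "ner" keys, the inclusive [begin, end, label] span layout, and the helper to_sentence_level are assumptions made for illustration.

```python
import json

# Hypothetical one-document corpus line; the keys and span layout are assumed.
line = json.dumps({
    "sentences": [["Alice", "visited", "Paris", "."], ["Bob", "stayed", "home", "."]],
    "ner": [[[0, 0, "PERSON"], [2, 2, "GPE"]], [[4, 4, "PERSON"]]],
})


def to_sentence_level(doc, doc_level_offset=True, tagset=None):
    """Yield (tokens, spans) per sentence with sentence-relative offsets."""
    offset = 0  # Number of tokens in all preceding sentences.
    for tokens, spans in zip(doc["sentences"], doc["ner"]):
        # Only shift when the stored offsets count from the document start.
        shift = offset if doc_level_offset else 0
        kept = [
            (begin - shift, end - shift, label)
            for begin, end, label in spans
            if tagset is None or label in tagset  # Prune labels outside tagset.
        ]
        yield tokens, kept
        offset += len(tokens)


doc = json.loads(line)
converted = list(to_sentence_level(doc, tagset={"PERSON"}))
for tokens, spans in converted:
    print(tokens, spans)
```

In this toy document the second sentence starts at document position 4, so the span for "Bob" moves from 4 to 0, and the GPE span is dropped because it is not in the given tagset.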