dataset

This module provides base definitions for datasets, dataloaders and samplers.

datasets

class hanlp.common.dataset.Transformable(transform: Union[Callable, List] = None)[source]

An object which can be transformed with a list of functions. It can be treated as an object being passed through a pipeline of functions, where the functions are kept in a list.

Parameters

transform – A transform function or a list of functions.

append_transform(transform: Callable)[source]

Append a transform to its list of transforms.

Parameters

transform – A new transform to be appended.

Returns

Itself.

insert_transform(index: int, transform: Callable)[source]

Insert a transform at a given position.

Parameters
  • index – The position at which to insert the transform.

  • transform – A new transform.

Returns

Itself.

transform_sample(sample: dict, inplace=False) → dict[source]

Apply transforms to a sample.

Parameters
  • sample – A sample, which is a dict holding features.

  • inplace – True to apply transforms inplace.

Attention

If any transform modifies existing features, it will modify them again on every call when inplace=True. For example, if a transform inserts a BOS token into a list inplace and is called twice, then 2 BOS tokens will be inserted, which might not be the intended result.

Returns

Transformed sample.
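The inplace pitfall above can be illustrated with a minimal sketch. The add_bos transform and the standalone transform_sample function below are hypothetical stand-ins, not HanLP's implementation, and the deep copy used for the non-inplace path is an assumption.

```python
import copy


def add_bos(sample: dict) -> dict:
    # Hypothetical transform: mutates the token list it is given.
    sample["token"].insert(0, "<bos>")
    return sample


def transform_sample(sample: dict, transforms, inplace: bool = False) -> dict:
    # Illustrative stand-in: work on a copy unless inplace is requested.
    if not inplace:
        sample = copy.deepcopy(sample)
    for transform in transforms:
        sample = transform(sample)
    return sample


# With a copy, the original stays untouched however often it is transformed.
original = {"token": ["hello", "world"]}
once = transform_sample(original, [add_bos])
again = transform_sample(original, [add_bos])

# In place, the second call sees the output of the first call.
shared = {"token": ["hello", "world"]}
transform_sample(shared, [add_bos], inplace=True)
transform_sample(shared, [add_bos], inplace=True)
```

After running this, shared holds two BOS tokens while original holds none.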

class hanlp.common.dataset.TransformableDataset(data: Union[str, List], transform: Union[Callable, List] = None, cache=None, generate_idx=None)[source]

A Dataset which can be applied with a list of transform functions.

Parameters
  • data – The local or remote path to a dataset, or a list of samples where each sample is a dict.

  • transform – Predefined transform(s).

  • cache – True to enable caching, so that transforms won’t be called twice.

  • generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.

__getitem__(index: Union[int, slice]) → Union[dict, List[dict]][source]

Get the index-th sample in this dataset.

Parameters

index – Either an integer index or a slice of indices.

Returns: Either a sample or a list of samples, depending on how many indices are passed in.

k_fold(k, i)[source]

Perform k-fold sampling.

Parameters
  • k (int) – Number of folds.

  • i (int) – The i-th fold.

Returns

The i-th fold subset of this dataset.

Return type

TransformableDataset

load_data(data, generate_idx=False)[source]

An intermediate step between the constructor and the actual file loading method.

Parameters
  • data – If data is a file, this method calls load_file() to load it.

  • generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.

Returns: Loaded samples.
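A minimal sketch of why an index field helps at prediction time, using plain dicts. The literal key "IDX" and the length-based "prediction" are assumptions made for illustration only.

```python
# Three toy samples; each gets its position stored in an index field.
samples = [{"token": t} for t in (["a", "b", "c"], ["d"], ["e", "f"])]
for idx, sample in enumerate(samples):
    sample["IDX"] = idx  # assumed key name

# A sampler may re-order samples, e.g. longest first.
reordered = sorted(samples, key=lambda s: len(s["token"]), reverse=True)

# Fake "predictions" (here: the token count), produced in sampler order.
predictions = [(s["IDX"], len(s["token"])) for s in reordered]

# Sorting by the stored index restores the original dataset order.
restored = [pred for _, pred in sorted(predictions)]
```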

abstract load_file(filepath: str)[source]

The actual file loading logic.

Parameters

filepath – The path to a dataset.

prune(criterion: Callable, logger: logging.Logger = None)[source]

Prune (to discard) samples according to a criterion.

Parameters
  • criterion – A function which takes a sample as input and outputs True if the sample needs to be pruned.

  • logger – If any, log statistical messages using it.

Returns

Size before pruning.

Return type

int
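The contract described above (discard samples for which the criterion returns True, and return the size before pruning) can be sketched on a plain list. The standalone prune function here is a hypothetical illustration, not the method itself.

```python
from typing import Callable, List


def prune(samples: List[dict], criterion: Callable[[dict], bool]) -> int:
    """Drop samples for which criterion returns True; return the size before pruning."""
    size_before = len(samples)
    # Slice assignment keeps the same list object, mimicking in-place pruning.
    samples[:] = [s for s in samples if not criterion(s)]
    return size_before


# Discard samples longer than 100 tokens.
data = [{"token": ["a"] * n} for n in (3, 120, 7, 256)]
size_before = prune(data, lambda s: len(s["token"]) > 100)
```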

purge_cache()[source]

Purges all cache. If cache is not enabled, this method enables it.

should_load_file(data) → bool[source]

Determines whether data is a filepath.

Parameters

data – Data to check.

Returns: True to indicate it’s a filepath.

shuffle()[source]

Shuffle this dataset inplace.

split(*ratios)[source]

Split dataset into subsets.

Parameters

*ratios – The ratios for each subset. They can be any type of numbers which will be normalized. For example, 8, 1, 1 are equivalent to 0.8, 0.1, 0.1.

Returns

A list of subsets.

Return type

list[TransformableDataset]
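The ratio normalization can be sketched as follows. The helper split_indices is hypothetical, and the choice of contiguous, unshuffled chunks is an assumption rather than documented behaviour.

```python
def split_indices(n: int, *ratios) -> list:
    """Split range(n) into contiguous chunks proportional to the given ratios."""
    total = sum(ratios)
    bounds, acc = [0], 0.0
    for ratio in ratios:
        acc += ratio / total  # normalized cumulative fraction
        bounds.append(round(acc * n))
    bounds[-1] = n  # guard against rounding drift on the final boundary
    return [list(range(bounds[i], bounds[i + 1])) for i in range(len(ratios))]


# 8, 1, 1 and 0.8, 0.1, 0.1 describe the same split.
sizes_int = [len(part) for part in split_indices(100, 8, 1, 1)]
sizes_float = [len(part) for part in split_indices(100, 0.8, 0.1, 0.1)]
```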

subset(indices)[source]

Create a subset given indices of samples.

Parameters

indices – Indices of samples.

Returns

A subset of this dataset.

Return type

TransformableDataset

dataloaders

class hanlp.common.dataset.PadSequenceDataLoader(dataset, batch_size=32, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, pad: dict = None, vocabs: hanlp.common.transform.VocabDict = None, device=None, **kwargs)[source]

A dataloader commonly used for NLP tasks. It offers the following conveniences.

  • Batchify each field of samples into a Tensor if the field name satisfies one of the following criteria.
    • Name ends with _id, _ids, _count, _offset, _span, mask

    • Name is in pad dict.

  • Pad each field according to field name, the vocabs and pad dict.

  • Move Tensor onto device.

Parameters
  • dataset – A Dataset to be batchified.

  • batch_size – Max size of each batch.

  • shuffle – True to shuffle batches.

  • sampler – A Sampler to sample samples from data.

  • batch_sampler – A Sampler to sample batches from all batches.

  • num_workers – Number of workers for multi-process loading. Note that multi-process loading isn’t always faster.

  • collate_fn – A function to perform batchifying. It must be set to None in order to make use of the features this class offers.

  • pin_memory – If samples are loaded in the Dataset on CPU and need to be pushed to the GPU, enabling pin_memory can speed up the transfer. It is rarely useful here since most data fields are not of Tensor type.

  • drop_last – Drop the last batch since it could be half-empty.

  • timeout – For multi-worker loading, set a timeout to wait for a worker.

  • worker_init_fn – Init function for multi-worker.

  • multiprocessing_context – Context for multiprocessing.

  • pad – A dict holding field names and their padding values.

  • vocabs – A dict of vocabs so padding value can be fetched from it.

  • device – The device tensors will be moved onto.

  • **kwargs – Other arguments will be passed to torch.utils.data.DataLoader.__init__()

static pad_data(data: Union[torch.Tensor, Iterable], pad, dtype=None, device=None)[source]

Perform the actual padding for a given data.

Parameters
  • data – Data to be padded.

  • pad – Padding value.

  • dtype – Data type.

  • device – Device to be moved onto.

Returns

A torch.Tensor.

Return type

torch.Tensor
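The field selection and padding described for this dataloader can be sketched without torch. Nested lists stand in for tensors here, and the helpers should_batchify and pad_sequences are hypothetical names, not part of the library.

```python
from typing import Dict, List

# Suffixes listed in the class description above.
TENSOR_SUFFIXES = ("_id", "_ids", "_count", "_offset", "_span", "mask")


def should_batchify(field: str, pad: Dict[str, int]) -> bool:
    # A field qualifies by its name suffix or by appearing in the pad dict.
    return field.endswith(TENSOR_SUFFIXES) or field in pad


def pad_sequences(sequences: List[List[int]], pad_value: int = 0) -> List[List[int]]:
    # Right-pad every sequence to the length of the longest one.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


samples = [
    {"token": ["a", "b", "c"], "token_id": [5, 6, 7]},
    {"token": ["d"], "token_id": [8]},
]
pad: Dict[str, int] = {}  # no per-field padding values in this sketch
batch = {}
for field in samples[0]:
    values = [s[field] for s in samples]
    batch[field] = pad_sequences(values) if should_batchify(field, pad) else values
```

Only token_id is padded; the raw token field is collected as a list of lists and left ragged.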

class hanlp.common.dataset.PrefetchDataLoader(dataloader: torch.utils.data.dataloader.DataLoader, prefetch: int = 10, batchify: Callable = None)[source]

A dataloader wrapper which speeds up batchifying using multi-processing. It works best for dataloaders whose batchify step takes a long time. However, it introduces extra GPU memory consumption since prefetched batches are stored in a Queue on GPU.

Caution

PrefetchDataLoader only works in spawn mode with the following initialization code:

Examples:

if __name__ == '__main__':
    import torch

    torch.multiprocessing.set_start_method('spawn')

These 2 lines MUST be placed inside the if __name__ == '__main__': block.

Parameters
  • dataloader – A DataLoader to be prefetched.

  • prefetch – Number of batches to prefetch.

  • batchify – A batchify function called on each batch of samples. If given, the inner dataloader should return raw samples without batchifying them.

close()[source]

Close this dataloader and terminate its internal processes and queue. It’s recommended to call this method to ensure a program can shut down gracefully.
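The general idea of prefetching into a bounded queue can be sketched with the standard library. Note that this sketch uses a background thread rather than spawned processes, and PrefetchIterator is a hypothetical class, so it only illustrates the producer/consumer pattern, not this wrapper's implementation.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


class PrefetchIterator:
    """Fill a bounded queue from a background thread while the consumer iterates."""

    def __init__(self, iterable, prefetch: int = 10):
        # maxsize bounds how many batches are held in memory ahead of the consumer.
        self._queue = queue.Queue(maxsize=prefetch)
        self._thread = threading.Thread(target=self._fill, args=(iterable,), daemon=True)
        self._thread.start()

    def _fill(self, iterable):
        for item in iterable:
            self._queue.put(item)  # blocks while the queue is full
        self._queue.put(_DONE)

    def __iter__(self):
        while True:
            item = self._queue.get()
            if item is _DONE:
                return
            yield item


# Batches are produced in the background, at most 2 ahead of the consumer.
batches = list(PrefetchIterator(([i, i + 1] for i in range(0, 6, 2)), prefetch=2))
```

The bounded queue is what trades memory for throughput: a larger prefetch value hides more producer latency but keeps more batches alive at once.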

samplers

class hanlp.common.dataset.BucketSampler(buckets: Dict[float, List[int]], batch_max_tokens, batch_size=None, shuffle=False)[source]

A bucketing based sampler which groups samples into buckets then creates batches from each bucket.

Parameters
  • buckets – A dict of which keys are some statistical numbers of each bucket, and values are the indices of samples in each bucket.

  • batch_max_tokens – Maximum tokens per batch.

  • batch_size – Maximum samples per batch.

  • shuffle – True to shuffle batches and samples in a batch.

class hanlp.common.dataset.KMeansSampler(lengths, batch_max_tokens, batch_size=None, shuffle=False, n_buckets=1)[source]

A bucket sampler which groups samples using KMeans on their lengths.

Parameters
  • lengths – Lengths of each sample, usually measured by number of tokens.

  • batch_max_tokens – Maximum tokens per batch.

  • batch_size – Maximum samples per batch.

  • shuffle – True to shuffle batches. Samples in the same batch won’t be shuffled since the ordered sequence is helpful to speed up RNNs.

  • n_buckets – Number of buckets. Clusters in terms of KMeans.
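Grouping samples by length with one-dimensional KMeans can be sketched as below. The function kmeans_buckets, its quantile-based initialization and its iteration cap are assumptions for illustration; the returned dict merely mirrors the shape of the buckets argument accepted by BucketSampler (centroid mapped to sample indices).

```python
from typing import Dict, List


def kmeans_buckets(lengths: List[int], n_buckets: int, n_iter: int = 20) -> Dict[float, List[int]]:
    """Cluster sample indices by length with a simple 1-D k-means."""
    unique = sorted(set(lengths))
    k = min(n_buckets, len(unique))
    # Deterministic initialization: centroids spread over the sorted unique lengths.
    centroids = [unique[(2 * j + 1) * len(unique) // (2 * k)] for j in range(k)]
    clusters: List[List[int]] = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for i, length in enumerate(lengths):
            nearest = min(range(k), key=lambda j: abs(length - centroids[j]))
            clusters[nearest].append(i)
        updated = [
            sum(lengths[i] for i in c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:  # converged
            break
        centroids = updated
    return {centroids[j]: c for j, c in enumerate(clusters) if c}


# Short and long samples end up in separate buckets.
buckets = kmeans_buckets([3, 4, 5, 30, 31, 32], n_buckets=2)
```

Batches drawn from a single bucket contain samples of similar length, which keeps padding low.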

class hanlp.common.dataset.SortingSampler(lengths: List[int], batch_size=None, batch_max_tokens=None, use_effective_tokens=False, shuffle=False)[source]

A sampler which sorts samples according to their lengths. It takes a contiguous chunk of sorted samples to make a batch.

Parameters
  • lengths – Lengths of each sample, usually measured by number of tokens.

  • batch_max_tokens – Maximum tokens per batch.

  • use_effective_tokens – Whether to calculate effective number of tokens when applying the batch_max_tokens.

  • batch_size – Maximum samples per batch.

  • shuffle – True to shuffle batches and samples in a batch.
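Sorting by length and cutting contiguous chunks can be sketched as follows. The function sorted_batches is hypothetical; descending order and counting raw (unpadded) token sums against batch_max_tokens are assumptions, and use_effective_tokens and shuffle are not modelled.

```python
from typing import List, Optional


def sorted_batches(
    lengths: List[int],
    batch_size: Optional[int] = None,
    batch_max_tokens: Optional[int] = None,
) -> List[List[int]]:
    """Sort sample indices by length, then cut contiguous chunks into batches."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    batches, current, tokens = [], [], 0
    for i in order:
        too_many_samples = batch_size is not None and len(current) >= batch_size
        too_many_tokens = (
            batch_max_tokens is not None
            and bool(current)
            and tokens + lengths[i] > batch_max_tokens
        )
        if too_many_samples or too_many_tokens:
            # Close the current batch before adding the next sample.
            batches.append(current)
            current, tokens = [], 0
        current.append(i)
        tokens += lengths[i]
    if current:
        batches.append(current)
    return batches


lengths = [5, 2, 9, 3, 7]
by_tokens = sorted_batches(lengths, batch_max_tokens=12)
by_size = sorted_batches(lengths, batch_size=2)
```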

sampler builders

class hanlp.common.dataset.SamplerBuilder[source]

abstract build(lengths: List[int], shuffle=False, gradient_accumulation=1, **kwargs) → torch.utils.data.sampler.Sampler[source]

Build a Sampler given statistics of samples and other arguments.

Parameters
  • lengths – The lengths of samples.

  • shuffle – True to shuffle batches. Note samples in each mini-batch are not necessarily shuffled.

  • gradient_accumulation – Number of mini-batches per update step.

  • **kwargs – Other arguments to be passed to the constructor of the sampler.

scale(gradient_accumulation)[source]

Scale batch_size and batch_max_tokens down to \(\frac{1}{\text{gradient\_accumulation}}\) of their original values.

Parameters

gradient_accumulation – Number of mini-batches per update step.

Returns

batch_size, batch_max_tokens

Return type

tuple(int,int)
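The scaling can be sketched as a standalone function. Floor division and passing None through unchanged are assumptions of this sketch; the method's actual rounding behaviour is not stated above.

```python
from typing import Optional, Tuple


def scale(
    batch_size: Optional[int],
    batch_max_tokens: Optional[int],
    gradient_accumulation: int,
) -> Tuple[Optional[int], Optional[int]]:
    """Divide both limits by the number of mini-batches per update step."""
    if batch_size is not None:
        batch_size //= gradient_accumulation  # assumed: floor division
    if batch_max_tokens is not None:
        batch_max_tokens //= gradient_accumulation
    return batch_size, batch_max_tokens


# With 4 accumulation steps, each mini-batch carries a quarter of the budget.
scaled = scale(32, 4000, 4)
```

The effective batch per update step stays roughly the same, because gradient_accumulation mini-batches are accumulated before each update.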

class hanlp.common.dataset.SortingSamplerBuilder(batch_size=None, batch_max_tokens=None, use_effective_tokens=False)[source]

Builds a SortingSampler.

Parameters
  • batch_max_tokens – Maximum tokens per batch.

  • use_effective_tokens – Whether to calculate effective number of tokens when applying the batch_max_tokens.

  • batch_size – Maximum samples per batch.

build(lengths: List[int], shuffle=False, gradient_accumulation=1, **kwargs) → torch.utils.data.sampler.Sampler[source]

Build a Sampler given statistics of samples and other arguments.

Parameters
  • lengths – The lengths of samples.

  • shuffle – True to shuffle batches. Note samples in each mini-batch are not necessarily shuffled.

  • gradient_accumulation – Number of mini-batches per update step.

  • **kwargs – Other arguments to be passed to the constructor of the sampler.

class hanlp.common.dataset.KMeansSamplerBuilder(batch_max_tokens, batch_size=None, n_buckets=1)[source]

Builds a KMeansSampler.

Parameters
  • batch_max_tokens – Maximum tokens per batch.

  • batch_size – Maximum samples per batch.

  • n_buckets – Number of buckets. Clusters in terms of KMeans.

build(lengths: List[int], shuffle=False, gradient_accumulation=1, **kwargs) → torch.utils.data.sampler.Sampler[source]

Build a Sampler given statistics of samples and other arguments.

Parameters
  • lengths – The lengths of samples.

  • shuffle – True to shuffle batches. Note samples in each mini-batch are not necessarily shuffled.

  • gradient_accumulation – Number of mini-batches per update step.

  • **kwargs – Other arguments to be passed to the constructor of the sampler.