# dataset

This module provides base definitions for datasets, dataloaders and samplers.

## datasets

class hanlp.common.dataset.Transformable(transform=None)[source]

An object that can be transformed by a list of functions. The transformed result is what comes out after the object has been passed through each function in order; the functions themselves are kept in a list.

Parameters

transform – A transform function or a list of functions.

append_transform(transform: Callable)[source]

Append a transform to its list of transforms.

Parameters

transform – A new transform to be appended.

Returns

Itself.

insert_transform(index: int, transform: Callable)[source]

Insert a transform at a given position.

Parameters
• index – The position at which to insert the transform.

• transform – A new transform.

Returns

Itself.
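
Because both methods return the object itself, calls can be chained. The following is a minimal sketch of that contract, not HanLP's implementation: `TransformList` is a hypothetical stand-in class, and the `lower` and `tokenize` transforms are invented for illustration.

```python
from typing import Callable, List


class TransformList:
    """Hypothetical stand-in that keeps transforms in a list (not HanLP's class)."""

    def __init__(self) -> None:
        self.transforms: List[Callable[[dict], dict]] = []

    def append_transform(self, transform: Callable[[dict], dict]) -> "TransformList":
        self.transforms.append(transform)
        return self  # returning self is what enables chaining

    def insert_transform(self, index: int, transform: Callable[[dict], dict]) -> "TransformList":
        self.transforms.insert(index, transform)
        return self


def lower(sample: dict) -> dict:
    return {**sample, "text": sample["text"].lower()}


def tokenize(sample: dict) -> dict:
    return {**sample, "token": sample["text"].split()}


# tokenize is appended first, then lower is inserted in front of it.
pipeline = TransformList().append_transform(tokenize).insert_transform(0, lower)
order = [f.__name__ for f in pipeline.transforms]
```

Here `order` lists `lower` before `tokenize`, since the insert at index 0 places `lower` ahead of the previously appended transform.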

transform_sample(sample: dict, inplace=False) → dict[source]

Apply transforms to a sample.

Parameters
• sample – A sample, which is a dict holding features.

• inplace – True to apply transforms in place.

Attention

If any transform modifies existing features, it will modify them again on every call when inplace=True. For example, if a transform inserts a BOS token into a list in place and is called twice, two BOS tokens will be inserted, which is probably not the intended result.

Returns

Transformed sample.
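
The pitfall described under Attention can be reproduced with a small sketch. Everything here is an assumption made for illustration: `transform_sample` is a hypothetical free function, `add_bos` is an invented transform, and the non-in-place path uses a deep copy, which may differ from how the library copies samples.

```python
import copy


def add_bos(sample: dict) -> dict:
    # Mutates the existing list instead of building a new one.
    sample["token"].insert(0, "<bos>")
    return sample


def transform_sample(sample: dict, transforms, inplace: bool = False) -> dict:
    """Hypothetical helper: apply transforms to a copy unless inplace=True."""
    if not inplace:
        sample = copy.deepcopy(sample)
    for transform in transforms:
        sample = transform(sample)
    return sample


raw = {"token": ["hello", "world"]}

# Not in place: the original sample is untouched, so repeating is harmless.
first = transform_sample(raw, [add_bos])
second = transform_sample(raw, [add_bos])

# In place: every call mutates the same list, so BOS tokens accumulate.
shared = {"token": ["hello", "world"]}
transform_sample(shared, [add_bos], inplace=True)
transform_sample(shared, [add_bos], inplace=True)
```

After the two in-place calls, `shared["token"]` starts with two `<bos>` entries, whereas `first` and `second` each contain exactly one.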

class hanlp.common.dataset.TransformableDataset(data: Union[str, List], transform=None, cache=None, generate_idx=None)[source]

A Dataset to which a list of transform functions can be applied.

Parameters
• data – The local or remote path to a dataset, or a list of samples where each sample is a dict.

• transform – Predefined transform(s).

• cache – True to enable caching, so that transforms won’t be called twice.

• generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.
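
To illustrate why a stored index helps at prediction time, here is a minimal sketch in plain Python. The dictionary key `"IDX"`, the sample contents and the fake "prediction" (the token count) are all assumptions chosen for the example; the library's actual key name is not stated here.

```python
# Hypothetical samples; "IDX" stands in for the field created by generate_idx.
samples = [{"token": ["a", "b", "c"]}, {"token": ["d"]}, {"token": ["e", "f"]}]
for idx, sample in enumerate(samples):
    sample["IDX"] = idx

# A length-based sampler may visit samples in a different order.
reordered = sorted(samples, key=lambda s: len(s["token"]))

# Pretend predictions are produced in the sampler's order.
predictions = [(s["IDX"], len(s["token"])) for s in reordered]

# Sorting by the stored index restores the dataset order.
restored = [pred for _, pred in sorted(predictions)]
```

The sampler visits the samples in a different order, but sorting on the stored index lines the predictions up with the original dataset again.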

__getitem__(index) → Union[dict, List[dict]][source]

Get the index-th sample in this dataset.

Parameters

index – Either an integer index or a list of indices.

Returns: Either a sample or a list of samples, depending on how many indices are passed in.

k_fold(k, i)[source]

Perform k-fold sampling.

Parameters
• k (int) – Number of folds.

• i (int) – The i-th fold.

Returns

The i-th fold subset of this dataset.

Return type

TransformableDataset
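
As a generic illustration of k-fold partitioning, the sketch below splits sample indices into k contiguous folds and separates the i-th fold from the rest. `k_fold_indices` is a hypothetical helper; whether the library uses contiguous folds, and which part it returns, are assumptions not confirmed by this page.

```python
from typing import List, Tuple


def k_fold_indices(k: int, i: int, total: int) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: split range(total) into k contiguous folds.

    Returns (indices outside fold i, indices inside fold i).
    """
    if not 0 <= i < k:
        raise ValueError(f"fold index {i} out of range for k={k}")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(total, k)
    start = i * base + min(i, extra)
    end = start + base + (1 if i < extra else 0)
    held_out = list(range(start, end))
    rest = [j for j in range(total) if j < start or j >= end]
    return rest, held_out


rest, held_out = k_fold_indices(k=3, i=1, total=10)
```

With 10 samples and 3 folds, the folds have sizes 4, 3 and 3, so fold 1 covers indices 4 to 6.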

An intermediate step between the constructor and the call to the actual file loading method.

Parameters
• data – If data is a file, this method calls load_file() to load it.

• generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.

Parameters

filepath – The path to a dataset.

prune(criterion: Callable, logger=None)[source]

Prune (discard) samples according to a criterion.

Parameters
• criterion – A function that takes a sample as input and returns True if the sample needs to be pruned.

• logger – If any, log statistical messages using it.

Returns

Size before pruning.

Return type

int
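
The contract described above (the criterion returns True for samples to discard, and the size before pruning is returned) can be sketched on a plain list. `prune` below is a hypothetical free function operating on a list of dicts, not the library method, and the 100-token threshold is an arbitrary example.

```python
from typing import Callable, List


def prune(samples: List[dict], criterion: Callable[[dict], bool]) -> int:
    """Hypothetical helper: drop samples for which criterion(sample) is True.

    Modifies `samples` in place and returns its size before pruning.
    """
    size_before = len(samples)
    samples[:] = [s for s in samples if not criterion(s)]
    return size_before


data = [{"token": ["a"] * n} for n in (3, 120, 7, 300)]
# Discard samples longer than 100 tokens.
size_before = prune(data, lambda s: len(s["token"]) > 100)
```

Comparing `size_before` with `len(data)` afterwards tells how many samples were removed.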

purge_cache()[source]

Purges all cache. If cache is not enabled, this method enables it.

Determines whether data is a filepath.

Parameters

data – Data to check.

Returns: True to indicate it’s a filepath.

shuffle()[source]

Shuffle this dataset in place.

split(*ratios)[source]

Split dataset into subsets.

Parameters

*ratios – The ratios for each subset. They can be any type of numbers which will be normalized. For example, 8, 1, 1 are equivalent to 0.8, 0.1, 0.1.

Returns

A list of subsets.
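
The normalization of ratios can be sketched as follows. `split_sizes` is a hypothetical helper that only computes subset sizes; giving the rounding remainder to the last subset is an assumption of this sketch, not documented behavior.

```python
from typing import List


def split_sizes(total: int, *ratios: float) -> List[int]:
    """Hypothetical helper: turn unnormalized ratios into subset sizes."""
    norm = sum(ratios)
    sizes = [int(total * r / norm) for r in ratios]
    # Give samples lost to rounding down to the last subset.
    sizes[-1] += total - sum(sizes)
    return sizes


sizes_int = split_sizes(100, 8, 1, 1)
sizes_float = split_sizes(100, 0.8, 0.1, 0.1)
```

Both calls yield the same sizes, which is the equivalence between 8, 1, 1 and 0.8, 0.1, 0.1 mentioned above.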

subset(indices)[source]

Create a subset given indices of samples.

Parameters

indices – Indices of samples.

Returns

A subset of this dataset.

Return type

TransformableDataset

class hanlp.common.dataset.PadSequenceDataLoader(dataset, batch_size=32, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, pad=None, vocabs=None, device=None, **kwargs)[source]

A dataloader commonly used for NLP tasks. It offers the following conveniences.

• Batchify each field of samples into a Tensor if the field name satisfies one of the following criteria.
  • Name ends with _id, _ids, _count, _offset, _span, mask
  • Name is in pad dict.

• Pad each field according to field name, the vocabs and pad dict.

• Move Tensor onto device.

Parameters
• dataset – A Dataset to be batchified.

• batch_size – Max size of each batch.

• shuffle – True to shuffle batches.

• sampler – A Sampler to sample samples from data.

• batch_sampler – A Sampler to sample batches from all batches.

• collate_fn – A function to perform batchifying. It must be set to None in order to make use of the features this class offers.

• pin_memory – If samples are loaded in the Dataset on CPU and need to be pushed to the GPU, enabling pin_memory can speed up the transfer. It is of little use here, since most data fields are not of Tensor type.

• drop_last – Drop the last batch since it could be half-empty.

• timeout – For multi-worker loading, set a timeout to wait for a worker.

• worker_init_fn – Init function for multi-worker.

• multiprocessing_context – Context for multiprocessing.

• pad – A dict holding field names and their padding values.

• vocabs – A dict of vocabs so padding value can be fetched from it.

• device – The device tensors will be moved onto.

• **kwargs – Other arguments will be passed to torch.utils.data.DataLoader.__init__()
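
The field selection and padding behavior listed above can be approximated in plain Python. This is a sketch under stated assumptions: it uses nested lists instead of tensors, treats the two name criteria as alternatives, pads with 0 by default, and the helpers `should_batchify` and `pad_field` are hypothetical names, not part of the library.

```python
from typing import Dict, List

SUFFIXES = ("_id", "_ids", "_count", "_offset", "_span", "mask")


def should_batchify(field: str, pad: Dict[str, int]) -> bool:
    """Field qualifies if its name has a listed suffix or appears in `pad`."""
    return field.endswith(SUFFIXES) or field in pad


def pad_field(sequences: List[List[int]], pad_value: int = 0) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (width - len(seq)) for seq in sequences]


samples = [
    {"token": ["a", "b", "c"], "token_id": [5, 6, 7]},
    {"token": ["d"], "token_id": [8]},
]
pad = {}  # no per-field overrides in this example

batch = {}
for field in samples[0]:
    column = [sample[field] for sample in samples]
    # Fields that do not qualify are kept as plain Python lists.
    batch[field] = pad_field(column) if should_batchify(field, pad) else column
```

In this example only `token_id` is padded into a rectangular structure; `token` is left as a list of variable-length lists.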

Perform the actual padding for the given data.

Parameters
• data – Data to be padded.

• dtype – Data type.

• device – Device to be moved onto.

Returns

A torch.Tensor.

Return type

torch.Tensor

A dataloader wrapper which speeds up batchifying using multi-processing. It works best for dataloaders whose batchify step takes a very long time. However, it introduces extra GPU memory consumption, since prefetched batches are stored in a Queue on GPU.

Caution

PrefetchDataLoader only works in spawn mode with the following initialization code:

Examples:

    if __name__ == '__main__':
        import torch

        torch.multiprocessing.set_start_method('spawn')

These two lines MUST be placed inside the if __name__ == '__main__': block.

Parameters
• dataloader – A DataLoader to be prefetched.

• prefetch – Number of batches to prefetch.

• batchify – A batchify function called on each batch of samples. If given, the inner dataloader should return samples without actually batchifying them.
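
The general idea of prefetching into a bounded queue can be sketched with the standard library. Note the substitutions: this sketch uses a background thread rather than separate processes, involves no GPU, and needs no spawn start method; `prefetch` and `slow_batches` are hypothetical names invented for the example.

```python
import queue
import threading
from typing import Iterable, Iterator


def prefetch(batches: Iterable, size: int = 2) -> Iterator:
    """Yield items from `batches` while a background thread fills a bounded queue."""
    buffer: queue.Queue = queue.Queue(maxsize=size)
    done = object()  # sentinel marking the end of the stream

    def worker() -> None:
        for batch in batches:
            buffer.put(batch)  # blocks when `size` batches are already waiting
        buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def slow_batches():
    # Stand-in for an expensive batchify step.
    for start in range(0, 6, 2):
        yield [start, start + 1]


collected = list(prefetch(slow_batches(), size=2))
```

The bounded queue is what limits memory use: the producer stalls once `size` batches are waiting, which mirrors the role of the prefetch parameter.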

close()[source]

Close this dataloader and terminate its internal processes and queue. Calling this method is recommended to ensure that a program can shut down gracefully.

## samplers

class hanlp.common.dataset.BucketSampler(buckets: Dict[float, List[int]], batch_max_tokens, batch_size=None, shuffle=False)[source]

A bucketing-based sampler which groups samples into buckets and then creates batches from each bucket.

Parameters
• buckets – A dict whose keys are some statistical numbers of each bucket, and whose values are the indices of samples in each bucket.

• batch_max_tokens – Maximum tokens per batch.

• batch_size – Maximum samples per batch.

• shuffle – True to shuffle batches and samples in a batch.
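
One simple way to turn buckets into batches under a token budget is sketched below. It is not the library's algorithm: `batches_from_buckets` is a hypothetical helper that treats each bucket key as the approximate length of its samples and ignores batch_size and shuffle.

```python
from typing import Dict, List


def batches_from_buckets(
    buckets: Dict[float, List[int]], batch_max_tokens: int
) -> List[List[int]]:
    """Hypothetical helper: cut each bucket into batches under a token budget.

    The bucket key is treated as the (approximate) length of its samples.
    """
    batches = []
    for size, indices in buckets.items():
        # How many samples of roughly `size` tokens fit into one batch.
        per_batch = max(int(batch_max_tokens // size), 1)
        for start in range(0, len(indices), per_batch):
            batches.append(indices[start:start + per_batch])
    return batches


buckets = {5.0: [0, 3, 4, 7, 8], 20.0: [1, 2, 5, 6]}
batches = batches_from_buckets(buckets, batch_max_tokens=40)
```

Short samples end up in larger batches and long samples in smaller ones, so each batch carries a similar number of tokens.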

class hanlp.common.dataset.KMeansSampler(lengths, batch_max_tokens, batch_size=None, shuffle=False, n_buckets=1)[source]

A bucket sampler which groups samples using KMeans on their lengths.

Parameters
• lengths – Lengths of each sample, usually measured by number of tokens.

• batch_max_tokens – Maximum tokens per batch.

• batch_size – Maximum samples per batch.

• shuffle – True to shuffle batches. Samples in the same batch won’t be shuffled since the ordered sequence is helpful to speed up RNNs.

• n_buckets – Number of buckets, i.e. the number of clusters in terms of KMeans.
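
Grouping samples by length with KMeans can be illustrated with a tiny one-dimensional implementation. This is a sketch only: `kmeans_buckets` is a hypothetical helper with a deterministic initialization chosen for reproducibility, and it makes no claim about how the library initializes or iterates.

```python
from typing import Dict, List


def kmeans_buckets(lengths: List[int], n_buckets: int, n_iter: int = 10) -> Dict[float, List[int]]:
    """Hypothetical 1-D KMeans: map each centroid to the sample indices near it."""
    distinct = sorted(set(lengths))
    if not 1 <= n_buckets <= len(distinct):
        raise ValueError("n_buckets must be between 1 and the number of distinct lengths")
    # Deterministic start: spread initial centroids evenly over the distinct lengths.
    step = len(distinct) / n_buckets
    centroids = [float(distinct[int(step * (j + 0.5))]) for j in range(n_buckets)]
    clusters: List[List[int]] = []
    for _ in range(n_iter):
        # Assignment step: each sample joins its nearest centroid.
        clusters = [[] for _ in range(n_buckets)]
        for idx, length in enumerate(lengths):
            nearest = min(range(n_buckets), key=lambda j: abs(length - centroids[j]))
            clusters[nearest].append(idx)
        # Update step: move each centroid to the mean length of its cluster.
        centroids = [
            sum(lengths[i] for i in members) / len(members) if members else centroid
            for members, centroid in zip(clusters, centroids)
        ]
    return {c: m for c, m in zip(centroids, clusters) if m}


lengths = [3, 4, 5, 30, 32, 4, 31]
buckets = kmeans_buckets(lengths, n_buckets=2)
```

The result has the same shape as the buckets argument of BucketSampler: centroid lengths as keys and sample indices as values.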

class hanlp.common.dataset.SortingSampler(lengths: List[int], batch_size=None, batch_max_tokens=None, use_effective_tokens=False, shuffle=False)[source]

A sampler which sorts samples according to their lengths. It takes a contiguous chunk of sorted samples to make a batch. The effective batch size is determined by batch_size, batch_max_tokens and use_effective_tokens.

Parameters
• lengths – Lengths of each sample, usually measured by number of tokens.

• batch_max_tokens – Maximum tokens per batch.

• use_effective_tokens – Whether to calculate the effective number of tokens after padding when applying the batch_max_tokens.

• batch_size – Maximum samples per batch.

• shuffle – True to shuffle batches and samples in a batch.
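
How batch_size, batch_max_tokens and use_effective_tokens might interact can be sketched as below. `sorted_batches` is a hypothetical helper; sorting from longest to shortest and the greedy cut rule are assumptions of this sketch rather than documented behavior.

```python
from typing import List, Optional


def sorted_batches(
    lengths: List[int],
    batch_size: Optional[int] = None,
    batch_max_tokens: Optional[int] = None,
    use_effective_tokens: bool = False,
) -> List[List[int]]:
    """Hypothetical helper: sort by length, then cut contiguous chunks into batches."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        candidate = current + [idx]
        if use_effective_tokens:
            # Tokens after padding: every sample is as long as the longest one.
            tokens = max(lengths[i] for i in candidate) * len(candidate)
        else:
            tokens = sum(lengths[i] for i in candidate)
        too_many = batch_size is not None and len(candidate) > batch_size
        too_long = batch_max_tokens is not None and tokens > batch_max_tokens
        if current and (too_many or too_long):
            batches.append(current)
            current = [idx]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


lengths = [5, 9, 2, 8, 3]
plain = sorted_batches(lengths, batch_max_tokens=17)
padded = sorted_batches(lengths, batch_max_tokens=17, use_effective_tokens=True)
```

Counting padded tokens is stricter than counting raw tokens, so `padded` contains more, smaller batches than `plain` for the same budget.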

## sampler builders

class hanlp.common.dataset.SamplerBuilder[source]

abstract build(lengths: List[int], shuffle=False, gradient_accumulation=1, **kwargs) → torch.utils.data.sampler.Sampler[source]

Build a Sampler given statistics of samples and other arguments.

Parameters
• lengths – The lengths of samples.

• shuffle – True to shuffle batches. Note samples in each mini-batch are not necessarily shuffled.

• gradient_accumulation – Number of mini-batches per update step.

• **kwargs – Other arguments to be passed to the constructor of the sampler.

Scale batch_size and batch_max_tokens down to $$\frac{1}{\text{gradient_accumulation}}$$ of their original values.

Parameters

gradient_accumulation – Number of mini-batches per update step.

Returns

batch_size, batch_max_tokens

Return type

tuple(int, int)
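
The scaling can be sketched as follows. `scale` here is a hypothetical free function; the use of integer division and the handling of unset (None) limits are assumptions of this sketch.

```python
from typing import Optional, Tuple


def scale(
    batch_size: Optional[int],
    batch_max_tokens: Optional[int],
    gradient_accumulation: int,
) -> Tuple[Optional[int], Optional[int]]:
    """Hypothetical helper: divide both limits by the number of accumulation steps."""
    if gradient_accumulation > 1:
        if batch_size is not None:
            batch_size //= gradient_accumulation
        if batch_max_tokens is not None:
            batch_max_tokens //= gradient_accumulation
    return batch_size, batch_max_tokens


scaled = scale(batch_size=32, batch_max_tokens=4096, gradient_accumulation=4)
unset = scale(batch_size=None, batch_max_tokens=1000, gradient_accumulation=3)
```

With 4 accumulation steps, four mini-batches of 8 samples add up to the original 32 samples per update step.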

class hanlp.common.dataset.SortingSamplerBuilder(batch_size=None, batch_max_tokens=None, use_effective_tokens=False)[source]

Builds a SortingSampler.

Parameters
• batch_max_tokens – Maximum tokens per batch.

• use_effective_tokens – Whether to calculate effective number of tokens when applying the batch_max_tokens.

• batch_size – Maximum samples per batch.

build(lengths: List[int], shuffle=False, gradient_accumulation=1, **kwargs) → torch.utils.data.sampler.Sampler[source]

Build a Sampler given statistics of samples and other arguments.

Parameters
• lengths – The lengths of samples.

• shuffle – True to shuffle batches. Note samples in each mini-batch are not necessarily shuffled.

• gradient_accumulation – Number of mini-batches per update step.

• **kwargs – Other arguments to be passed to the constructor of the sampler.

class hanlp.common.dataset.KMeansSamplerBuilder(batch_max_tokens, batch_size=None, n_buckets=1)[source]

Builds a KMeansSampler.

Parameters
• batch_max_tokens – Maximum tokens per batch.

• batch_size – Maximum samples per batch.

• n_buckets – Number of buckets, i.e. the number of clusters in terms of KMeans.

build(lengths: List[int], shuffle=False, gradient_accumulation=1, **kwargs) → torch.utils.data.sampler.Sampler[source]

Build a Sampler given statistics of samples and other arguments.

Parameters
• lengths – The lengths of samples.

• shuffle – True to shuffle batches. Note samples in each mini-batch are not necessarily shuffled.

• gradient_accumulation – Number of mini-batches per update step.

• **kwargs – Other arguments to be passed to the constructor of the sampler.