dataset¶

This module provides base definitions for datasets, dataloaders and samplers.

datasets¶

class hanlp.common.dataset.Transformable(transform: Optional[Union[Callable, List]] = None)[source]¶

An object which can be transformed by a list of functions. Its value is the final result of passing an object through these functions in order, while the functions themselves are kept in a list.

Parameters: transform – A transform function or a list of functions.

append_transform(transform: Callable)[source]¶

Append a transform to its list of transforms.

Parameters: transform – A new transform to be appended.
Returns: Itself.

insert_transform(index: int, transform: Callable)[source]¶

Insert a transform at a given position.

Parameters

index – The position at which to insert the transform.
transform – A new transform.

Returns

Itself.

transform_sample(sample: dict, inplace=False) → dict[source]¶

Apply transforms to a sample.

Parameters

sample – A sample, which is a dict holding features.
inplace – True to apply transforms inplace.

Attention

If any transform modifies existing features, it will modify them again on every call when inplace=True. For example, if a transform inserts a BOS token into a list in place and is called twice, then two BOS tokens will be inserted, which might not be the intended result.

Returns: Transformed sample.
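
The caveat above can be reproduced in plain Python. This is a minimal sketch rather than HanLP's implementation: add_bos and apply_transforms are hypothetical names, and copying the sample with deepcopy when inplace=False is an assumption made for illustration.

```python
import copy
from typing import Callable, List


def add_bos(sample: dict) -> dict:
    # Hypothetical transform: mutates the existing 'tokens' list in place.
    sample['tokens'].insert(0, '<bos>')
    return sample


def apply_transforms(sample: dict, transforms: List[Callable], inplace: bool = False) -> dict:
    # Assumed behaviour: work on a deep copy unless inplace=True.
    if not inplace:
        sample = copy.deepcopy(sample)
    for transform in transforms:
        sample = transform(sample)
    return sample


sample = {'tokens': ['hello', 'world']}
safe = apply_transforms(sample, [add_bos])          # the original sample is untouched
apply_transforms(sample, [add_bos], inplace=True)   # the original now has one BOS
apply_transforms(sample, [add_bos], inplace=True)   # the original now has two BOS
```

Transforms that only add new fields, instead of modifying existing ones, do not suffer from this repetition.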

class hanlp.common.dataset.TransformableDataset(data: Union[str, List], transform: Optional[Union[Callable, List]] = None, cache=None, generate_idx=None)[source]¶

A Dataset which can be applied with a list of transform functions.

Parameters

data – The local or remote path to a dataset, or a list of samples where each sample is a dict.
transform – Predefined transform(s).
cache – True to enable caching, so that transforms won’t be called twice.
generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.
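
The purpose of an index field can be sketched without HanLP. The field name 'idx' and the toy "prediction" (a token count) are assumptions for illustration only.

```python
samples = [{'text': 'a b c'}, {'text': 'a'}, {'text': 'a b'}]

# Attach the original position to every sample ('idx' is a hypothetical field name).
for position, sample in enumerate(samples):
    sample['idx'] = position

# A sampler may reorder samples, e.g. by length.
reordered = sorted(samples, key=lambda s: len(s['text'].split()))

# Toy predictions produced in the reordered order, paired with the stored index.
predictions = [(s['idx'], len(s['text'].split())) for s in reordered]

# Restore the dataset order using the stored index.
restored = [prediction for _, prediction in sorted(predictions)]
```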

__getitem__(index: Union[int, slice]) → Union[dict, List[dict]][source]¶

Get the index-th sample in this dataset.

Parameters: index – Either an integer index or a list of indices.

Returns: Either a sample or a list of samples, depending on how many indices are passed in.

k_fold(k, i)[source]¶

Perform k-fold sampling.

Parameters

k (int) – Number of folds.
i (int) – The i-th fold.

Returns

The i-th fold subset of this dataset.

Return type

TransformableDataset
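
How samples are assigned to folds is not stated here. One common scheme, shown as a sketch under that assumption, assigns sample x to fold x % k; fold_indices is a hypothetical helper, not part of HanLP.

```python
from typing import List


def fold_indices(n: int, k: int, i: int) -> List[int]:
    """Indices of the i-th of k folds over n samples (modulo assignment, an assumed scheme)."""
    if not 0 <= i < k:
        raise ValueError('i must satisfy 0 <= i < k')
    return [x for x in range(n) if x % k == i]


# Every sample lands in exactly one of the three folds.
folds = [fold_indices(10, 3, i) for i in range(3)]
```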

load_data(data, generate_idx=False)[source]¶

An intermediate step between the constructor and the actual file loading method.

Parameters

data – If data is a file, this method calls load_file() to load it.
generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.

Returns: Loaded samples.

abstract load_file(filepath: str)[source]¶

The actual file loading logic.

Parameters: filepath – The path to a dataset.

prune(criterion: Callable, logger: Optional[logging.Logger] = None)[source]¶

Prune (to discard) samples according to a criterion.

Parameters

criterion – A function that takes a sample as input and returns True if the sample needs to be pruned.
logger – If any, log statistical messages using it.

Returns

Size before pruning.

Return type

int
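
The contract of a criterion and the return value can be sketched on a plain list. This stand-alone prune function is a hypothetical illustration, not the method itself.

```python
from typing import Callable, List


def prune(samples: List[dict], criterion: Callable[[dict], bool]) -> int:
    """Drop samples for which criterion(sample) is True; return the size before pruning."""
    size_before = len(samples)
    samples[:] = [s for s in samples if not criterion(s)]
    return size_before


data = [{'tokens': ['a']}, {'tokens': []}, {'tokens': ['b', 'c']}]
# Discard samples that have no tokens.
size_before = prune(data, lambda s: len(s['tokens']) == 0)
```

Comparing the returned size with the size after pruning tells how many samples were discarded.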

purge_cache()[source]¶: Purges all cache. If cache is not enabled, this method enables it.

should_load_file(data) → bool[source]¶

Determines whether data is a filepath.

Parameters: data – Data to check.

Returns: True to indicate it’s a filepath.

shuffle()[source]¶: Shuffle this dataset inplace.

split(*ratios)[source]¶

Split dataset into subsets.

Parameters: *ratios – The ratios for each subset. They can be any type of numbers which will be normalized. For example, 8, 1, 1 are equivalent to 0.8, 0.1, 0.1.
Returns: A list of subsets.
Return type: list[TransformableDataset]
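
The normalization of ratios can be sketched as follows. split_indices is a hypothetical helper, and rounding the cumulative boundaries is an assumption about how fractional sizes are resolved.

```python
from typing import List


def split_indices(n: int, *ratios: float) -> List[List[int]]:
    """Split range(n) into contiguous subsets whose sizes follow the normalized ratios."""
    total = sum(ratios)
    subsets, start, cumulative = [], 0, 0.0
    for ratio in ratios:
        cumulative += ratio
        # Boundary of this subset after normalizing by the sum of all ratios.
        end = round(n * cumulative / total)
        subsets.append(list(range(start, end)))
        start = end
    return subsets


by_integers = split_indices(10, 8, 1, 1)
by_fractions = split_indices(10, 0.8, 0.1, 0.1)
```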

subset(indices)[source]¶

Create a subset given indices of samples.

Parameters: indices – Indices of samples.
Returns: A subset of this dataset.
Return type: TransformableDataset

dataloaders¶

class hanlp.common.dataset.PadSequenceDataLoader(dataset, batch_size=32, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, pad: Optional[dict] = None, vocabs: Optional[hanlp.common.transform.VocabDict] = None, device=None, **kwargs)[source]¶

A dataloader commonly used for NLP tasks. It offers the following conveniences.

Batchify each field of samples into a Tensor if the field name satisfies either of the following criteria.
- Name ends with _id, _ids, _count, _offset, _span or mask.
- Name is in the pad dict.
Pad each field according to its field name, the vocabs and the pad dict.
Move each Tensor onto device.
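
The field-name criteria above can be expressed as a small predicate. should_batchify and TENSOR_SUFFIXES are hypothetical names used only for this sketch.

```python
from typing import Optional

# Suffixes listed in the description above.
TENSOR_SUFFIXES = ('_id', '_ids', '_count', '_offset', '_span', 'mask')


def should_batchify(field: str, pad: Optional[dict] = None) -> bool:
    """True if the field name ends with a listed suffix or appears in the pad dict."""
    return field.endswith(TENSOR_SUFFIXES) or (pad is not None and field in pad)
```

For instance, a field named token_id qualifies by its suffix, while a field named token qualifies only if it is listed in pad.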

Parameters

dataset – A Dataset to be batchified.
batch_size – Max size of each batch.
shuffle – True to shuffle batches.
sampler – A Sampler to sample samples from data.
batch_sampler – A Sampler to sample batches from all batches.
num_workers – Number of workers for multi-process loading. Note that multi-process loading isn’t always faster.
collate_fn – A function to perform batchifying. It must be set to None in order to make use of the features this class offers.
pin_memory – If samples are loaded on the CPU and then pushed to the GPU, enabling pin_memory can speed up the transfer. It is usually not useful here since most data fields are not Tensors.
drop_last – Drop the last batch since it could be half-empty.
timeout – For multi-worker loading, set a timeout to wait for a worker.
worker_init_fn – Init function for multi-worker.
multiprocessing_context – Context for multiprocessing.
pad – A dict holding field names and their padding values.
vocabs – A dict of vocabs so padding value can be fetched from it.
device – The device tensors will be moved onto.
**kwargs – Other arguments to be passed to torch.utils.data.DataLoader.__init__().

static pad_data(data: Union[torch.Tensor, Iterable], pad, dtype=None, device=None)[source]¶

Perform the actual padding for a given data.

Parameters

data – Data to be padded.
pad – Padding value.
dtype – Data type.
device – Device to be moved onto.

Returns

A torch.Tensor.

Return type

torch.Tensor
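
The idea of padding can be sketched without torch. pad_sequences is a hypothetical helper that right-pads nested lists; the actual method returns a torch.Tensor and also handles dtype and device, which this sketch omits.

```python
from typing import List, Sequence


def pad_sequences(sequences: Sequence[Sequence[int]], pad: int = 0) -> List[List[int]]:
    """Right-pad every sequence with `pad` up to the length of the longest one."""
    max_len = max((len(s) for s in sequences), default=0)
    return [list(s) + [pad] * (max_len - len(s)) for s in sequences]


padded = pad_sequences([[1, 2, 3], [4], [5, 6]], pad=0)
```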

class hanlp.common.dataset.PrefetchDataLoader(dataloader: torch.utils.data.dataloader.DataLoader, prefetch: int = 10, batchify: Optional[Callable] = None)[source]¶

A dataloader wrapper which speeds up batchifying using multi-processing. It works best for dataloaders whose batchify step takes a very long time. However, it introduces extra GPU memory consumption since prefetched batches are stored in a Queue on the GPU.

Caution

PrefetchDataLoader only works in spawn mode with the following initialization code:

Examples:

if __name__ == '__main__':
    import torch

    torch.multiprocessing.set_start_method('spawn')

These two lines MUST be placed inside the if __name__ == '__main__': block.

Parameters

dataloader – A DataLoader to be prefetched.
prefetch – Number of batches to prefetch.
batchify – A batchify function called on each batch of samples. If it is given, the inner dataloader shall return samples without actually batchifying them.

close()[source]¶: Close this dataloader and terminate its internal processes and queue. It’s recommended to call this method to ensure a program can shut down gracefully.

samplers¶

class hanlp.common.dataset.BucketSampler(buckets: Dict[float, List[int]], batch_max_tokens, batch_size=None, shuffle=False)[source]¶

A bucketing based sampler which groups samples into buckets then creates batches from each bucket.

Parameters

buckets – A dict whose keys are some statistical numbers of each bucket, and whose values are the indices of samples in each bucket.
batch_max_tokens – Maximum tokens per batch.
batch_size – Maximum samples per batch.
shuffle – True to shuffle batches and samples in a batch.
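
A simplified sketch of turning buckets into batches follows. It assumes each key is a representative length of its bucket and packs samples greedily under the token budget; bucket_batches is a hypothetical helper and the sampler's actual chunking rule may differ.

```python
from typing import Dict, List, Optional


def bucket_batches(buckets: Dict[float, List[int]], batch_max_tokens: int,
                   batch_size: Optional[int] = None) -> List[List[int]]:
    """Create batches bucket by bucket, limited by tokens and optionally by sample count."""
    batches = []
    for length, indices in buckets.items():
        # How many samples of roughly this length fit into the token budget (at least one).
        per_batch = max(int(batch_max_tokens // max(length, 1)), 1)
        if batch_size:
            per_batch = min(per_batch, batch_size)
        for start in range(0, len(indices), per_batch):
            batches.append(indices[start:start + per_batch])
    return batches


buckets = {5.0: [0, 3, 4, 7], 20.0: [1, 2, 5, 6]}
batches = bucket_batches(buckets, batch_max_tokens=40)
capped = bucket_batches(buckets, batch_max_tokens=40, batch_size=3)
```

Because all samples in a batch come from the same bucket, their lengths are similar and little padding is wasted.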

class hanlp.common.dataset.KMeansSampler(lengths, batch_max_tokens, batch_size=None, shuffle=False, n_buckets=1)[source]¶

A bucket sampler which groups samples using KMeans on their lengths.

Parameters

lengths – Lengths of each sample, usually measured by number of tokens.
batch_max_tokens – Maximum tokens per batch.
batch_size – Maximum samples per batch.
shuffle – True to shuffle batches. Samples in the same batch won’t be shuffled since the ordered sequence is helpful to speed up RNNs.
n_buckets – Number of buckets. Clusters in terms of KMeans.
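
Grouping lengths with KMeans can be sketched with a small one-dimensional Lloyd iteration. kmeans_1d is a hypothetical helper; its initialization and iteration count are assumptions, and the result has the same shape as the buckets argument of BucketSampler (mean length mapped to sample indices).

```python
from typing import Dict, List


def kmeans_1d(lengths: List[int], n_buckets: int, iterations: int = 20) -> Dict[float, List[int]]:
    """Cluster sample lengths; returns {mean length of cluster: indices of its samples}."""
    if n_buckets < 1:
        raise ValueError('n_buckets must be at least 1')
    distinct = sorted(set(lengths))
    n_buckets = min(n_buckets, len(distinct))
    # Initialise centroids spread evenly over the sorted distinct lengths.
    centroids = [float(distinct[(2 * j + 1) * len(distinct) // (2 * n_buckets)])
                 for j in range(n_buckets)]
    clusters: List[List[int]] = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for idx, length in enumerate(lengths):
            # Assign each sample to the nearest centroid.
            nearest = min(range(len(centroids)), key=lambda c: abs(length - centroids[c]))
            clusters[nearest].append(idx)
        # Move each centroid to the mean length of its members.
        new_centroids = [
            sum(lengths[i] for i in members) / len(members) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return {centroids[c]: members for c, members in enumerate(clusters) if members}


buckets = kmeans_1d([3, 4, 5, 30, 31, 32, 4, 29], n_buckets=2)
```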

class hanlp.common.dataset.SortingSampler(lengths: List[int], batch_size=None, batch_max_tokens=None, use_effective_tokens=False, shuffle=False)[source]¶

A sampler which sorts samples according to their lengths. It takes a continuous chunk of sorted samples to make a batch. The effective batch size is determined by batch_size, batch_max_tokens and use_effective_tokens.

Parameters

lengths – Lengths of each sample, usually measured by number of tokens.
batch_max_tokens – Maximum tokens per batch.
use_effective_tokens – Whether to calculate the effective number of tokens after padding when applying the batch_max_tokens.
batch_size – Maximum samples per batch.
shuffle – True to shuffle batches and samples in a batch.
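
The interplay of batch_size, batch_max_tokens and use_effective_tokens can be sketched as below. sorted_batches is a hypothetical helper; sorting in descending order and treating effective tokens as the longest length times the number of samples are assumptions.

```python
from typing import List, Optional


def sorted_batches(lengths: List[int], batch_size: Optional[int] = None,
                   batch_max_tokens: Optional[int] = None,
                   use_effective_tokens: bool = False) -> List[List[int]]:
    """Sort samples by length and cut the sorted sequence into contiguous batches."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    batches, batch, longest, total = [], [], 0, 0
    for i in order:
        new_longest = max(longest, lengths[i])
        new_total = total + lengths[i]
        # Effective tokens include padding: every sample costs as much as the longest one.
        tokens = new_longest * (len(batch) + 1) if use_effective_tokens else new_total
        too_many_samples = batch_size is not None and len(batch) + 1 > batch_size
        too_many_tokens = batch_max_tokens is not None and tokens > batch_max_tokens
        if batch and (too_many_samples or too_many_tokens):
            # Close the current batch and start a new one with this sample.
            batches.append(batch)
            batch, new_longest, new_total = [], lengths[i], lengths[i]
        batch.append(i)
        longest, total = new_longest, new_total
    if batch:
        batches.append(batch)
    return batches


lengths = [5, 2, 9, 3, 7]
plain = sorted_batches(lengths, batch_max_tokens=12)
effective = sorted_batches(lengths, batch_max_tokens=12, use_effective_tokens=True)
```

Counting effective tokens yields smaller batches, since padding is charged against the same budget.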

sampler builders¶

class hanlp.common.dataset.SamplerBuilder[source]¶

abstract build(lengths: List[int], shuffle=False, gradient_accumulation=1, **kwargs) → torch.utils.data.sampler.Sampler[source]¶

Build a Sampler given statistics of samples and other arguments.

Parameters

lengths – The lengths of samples.
shuffle – True to shuffle batches. Note samples in each mini-batch are not necessarily shuffled.
gradient_accumulation – Number of mini-batches per update step.
**kwargs – Other arguments to be passed to the constructor of the sampler.

scale(gradient_accumulation)[source]¶

Scale down the batch_size and batch_max_tokens to \(\frac{1}{\text{gradient_accumulation}}\) of them respectively.

Parameters: gradient_accumulation – Number of mini-batches per update step.
Returns: batch_size, batch_max_tokens
Return type: tuple(int,int)
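
The arithmetic can be sketched as a stand-alone function. Using integer floor division and leaving unset (None) limits untouched are assumptions; this scale function is a hypothetical illustration, not the method itself.

```python
from typing import Optional, Tuple


def scale(batch_size: Optional[int], batch_max_tokens: Optional[int],
          gradient_accumulation: int) -> Tuple[Optional[int], Optional[int]]:
    """Divide both limits by the number of mini-batches per update step."""
    if gradient_accumulation and gradient_accumulation > 1:
        if batch_size:
            batch_size //= gradient_accumulation
        if batch_max_tokens:
            batch_max_tokens //= gradient_accumulation
    return batch_size, batch_max_tokens


# 4 mini-batches of 8 samples accumulate to the requested 32 samples per update.
scaled = scale(32, 4000, gradient_accumulation=4)
```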

class hanlp.common.dataset.SortingSamplerBuilder(batch_size=None, batch_max_tokens=None, use_effective_tokens=False)[source]¶

Builds a SortingSampler.

Parameters

batch_max_tokens – Maximum tokens per batch.
use_effective_tokens – Whether to calculate effective number of tokens when applying the batch_max_tokens.
batch_size – Maximum samples per batch.

build(lengths: List[int], shuffle=False, gradient_accumulation=1, **kwargs) → torch.utils.data.sampler.Sampler[source]¶

Build a Sampler given statistics of samples and other arguments.

Parameters

lengths – The lengths of samples.
shuffle – True to shuffle batches. Note samples in each mini-batch are not necessarily shuffled.
gradient_accumulation – Number of mini-batches per update step.
**kwargs – Other arguments to be passed to the constructor of the sampler.

class hanlp.common.dataset.KMeansSamplerBuilder(batch_max_tokens, batch_size=None, n_buckets=1)[source]¶

Builds a KMeansSampler.

Parameters

batch_max_tokens – Maximum tokens per batch.
batch_size – Maximum samples per batch.
n_buckets – Number of buckets. Clusters in terms of KMeans.

build(lengths: List[int], shuffle=False, gradient_accumulation=1, **kwargs) → torch.utils.data.sampler.Sampler[source]¶

Build a Sampler given statistics of samples and other arguments.

Parameters

lengths – The lengths of samples.
shuffle – True to shuffle batches. Note samples in each mini-batch are not necessarily shuffled.
gradient_accumulation – Number of mini-batches per update step.
**kwargs – Other arguments to be passed to the constructor of the sampler.
