dataset¶
This module provides base definitions for datasets, dataloaders and samplers.
datasets¶
- class hanlp.common.dataset.Transformable(transform: Optional[Union[Callable, List]] = None)[source]¶
An object which can be transformed with a list of functions. Conceptually, it represents an object being passed through a list of functions, while these functions are kept in a list so that more can be appended or inserted.
- Parameters
transform – A transform function or a list of functions.
- append_transform(transform: Callable)[source]¶
Append a transform to its list of transforms.
- Parameters
transform – A new transform to be appended.
- Returns
Itself.
- insert_transform(index: int, transform: Callable)[source]¶
Insert a transform at a certain position.
- Parameters
index – The position at which to insert the transform.
transform – A new transform.
- Returns
Itself.
- transform_sample(sample: dict, inplace=False) dict [source]¶
Apply transforms to a sample.
- Parameters
sample – A sample, which is a dict holding features.
inplace – True to apply transforms inplace.
Attention
If any transform modifies existing features, it will modify them again and again when inplace=True. For example, if a transform inserts a BOS token into a list inplace and it is called twice, then 2 BOS tokens will be inserted, which is probably not the intended result.
- Returns
Transformed sample.
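A minimal usage sketch (the sample data and transform functions below are hypothetical, and it is assumed Transformable can be instantiated directly):
from hanlp.common.dataset import Transformable

def lowercase(sample: dict) -> dict:
    # Each transform takes a sample dict and returns it, possibly modified.
    sample['token'] = [t.lower() for t in sample['token']]
    return sample

def add_length(sample: dict) -> dict:
    sample['length'] = len(sample['token'])
    return sample

obj = Transformable(transform=lowercase)
obj.append_transform(add_length)  # returns itself, so calls can be chained
print(obj.transform_sample({'token': ['Hello', 'World']}))
# Expected: {'token': ['hello', 'world'], 'length': 2}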
- class hanlp.common.dataset.TransformableDataset(data: Union[str, List], transform: Optional[Union[Callable, List]] = None, cache=None, generate_idx=None)[source]¶
A Dataset to which a list of transform functions can be applied.
- Parameters
data – The local or remote path to a dataset, or a list of samples where each sample is a dict.
transform – Predefined transform(s).
cache – True to enable caching, so that transforms won’t be called twice.
generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.
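Since load_file() is abstract, a concrete subclass must implement it. Below is a minimal sketch (the file name, format and field names are hypothetical; it assumes, as HanLP's built-in datasets do, that load_file() may be written as a generator yielding sample dicts):
from hanlp.common.dataset import TransformableDataset

class TSVDataset(TransformableDataset):
    def load_file(self, filepath: str):
        # Yield one dict per sample; the base class collects them.
        with open(filepath, encoding='utf-8') as src:
            for line in src:
                cells = line.strip().split('\t')
                if cells:
                    yield {'token': cells}

def to_lower(sample: dict) -> dict:
    sample['token'] = [t.lower() for t in sample['token']]
    return sample

dataset = TSVDataset('train.tsv', transform=to_lower, generate_idx=True)
sample = dataset[0]  # a transformed sample, with an IDX field added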
- __getitem__(index: Union[int, slice]) Union[dict, List[dict]] [source]¶
Get the index-th sample in this dataset.
- Parameters
index – Either an integer index or a slice of indices.
Returns: Either a single sample or a list of samples, depending on how many indices are passed in.
- k_fold(k, i)[source]¶
Perform k-fold sampling.
- Parameters
k – Number of folds.
i – The i-th fold.
- Returns
The i-th fold subset of this dataset.
- Return type
TransformableDataset
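For example, assuming the dataset from the sketch above:
fold = dataset.k_fold(5, 0)  # the 0-th of 5 folds of the dataset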
- load_data(data, generate_idx=False)[source]¶
An intermediate step between the constructor and the actual file loading method.
- Parameters
data – If data is a file path, this method calls load_file() to load it.
generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.
Returns: Loaded samples.
- abstract load_file(filepath: str)[source]¶
The actual file loading logic.
- Parameters
filepath – The path to a dataset.
- prune(criterion: Callable, logger: Optional[logging.Logger] = None)[source]¶
Prune (i.e., discard) samples according to a criterion.
- Parameters
criterion – A function that takes a sample as input and outputs True if the sample needs to be pruned.
logger – If given, log statistics of the pruning using it.
- Returns
Size before pruning.
- Return type
int
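For example, to discard overly long samples (the field name here is hypothetical):
import logging
size_before = dataset.prune(lambda sample: len(sample['token']) > 100,
                            logger=logging.getLogger(__name__))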
- should_load_file(data) bool [source]¶
Determines whether data is a filepath.
- Parameters
data – Data to check.
Returns: True to indicate it’s a filepath.
- split(*ratios)[source]¶
Split dataset into subsets.
- Parameters
*ratios – The ratios for each subset. They can be any kind of numbers, which will be normalized. For example, 8, 1, 1 are equivalent to 0.8, 0.1, 0.1.
- Returns
A list of subsets.
- Return type
list
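For example:
trn, dev, tst = dataset.split(8, 1, 1)  # same as dataset.split(0.8, 0.1, 0.1)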
dataloaders¶
- class hanlp.common.dataset.PadSequenceDataLoader(dataset, batch_size=32, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, pad: Optional[dict] = None, vocabs: Optional[hanlp.common.transform.VocabDict] = None, device=None, **kwargs)[source]¶
A dataloader commonly used for NLP tasks. It offers the following conveniences:
- Batchify each field of the samples into a Tensor if the field name satisfies either of the following criteria:
name ends with _id, _ids, _count, _offset, _span or mask;
name is in the pad dict.
- Pad each field according to the field name, the vocabs and the pad dict.
- Move each Tensor onto device.
- Parameters
dataset – A Dataset to be batchified.
batch_size – Max size of each batch.
shuffle – True to shuffle batches.
sampler – A Sampler to sample samples from the data.
batch_sampler – A Sampler to sample batches from all batches.
num_workers – Number of workers for multi-process loading. Note that multi-process loading isn’t always faster.
collate_fn – A function to perform batchifying. It must be set to None in order to make use of the features this class offers.
pin_memory – If samples are loaded in the Dataset on CPU and are to be pushed to the GPU, enabling pin_memory can speed up the transfer. It’s not useful here since most data fields are not of Tensor type.
drop_last – Drop the last batch, since it could be half-empty.
timeout – For multi-worker loading, set a timeout to wait for a worker.
worker_init_fn – Init function for each worker.
multiprocessing_context – Context for multiprocessing.
pad – A dict holding field names and their padding values.
vocabs – A dict of vocabs so that padding values can be fetched from them.
device – The device tensors will be moved onto.
**kwargs – Other arguments will be passed to torch.utils.data.DataLoader.__init__()
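A minimal sketch (the sample data are hypothetical, and a default padding value of 0 is assumed when neither pad nor vocabs provides one):
from hanlp.common.dataset import PadSequenceDataLoader

samples = [{'token_ids': [1, 2, 3], 'text': 'a b c'},
           {'token_ids': [4, 5], 'text': 'd e'}]
loader = PadSequenceDataLoader(samples, batch_size=2)
for batch in loader:
    print(batch['token_ids'])  # expected: tensor([[1, 2, 3], [4, 5, 0]])
    print(batch['text'])       # expected: ['a b c', 'd e'] (non-tensor fields stay lists)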
- class hanlp.common.dataset.PrefetchDataLoader(dataloader: torch.utils.data.dataloader.DataLoader, prefetch: int = 10, batchify: Optional[Callable] = None)[source]¶
A dataloader wrapper which speeds up batchifying using multi-processing. It works best for dataloaders whose batchify step takes a very long time. But it introduces extra GPU memory consumption, since prefetched batches are stored in a Queue on the GPU.
Caution
PrefetchDataLoader only works in spawn mode with the following initialization code:
Examples:
if __name__ == '__main__':
    import torch
    torch.multiprocessing.set_start_method('spawn')
And these 2 lines MUST be put into the if __name__ == '__main__': block.
- Parameters
dataloader – A DataLoader to be prefetched.
prefetch – Number of batches to prefetch.
batchify – A batchify function called on each batch of samples. When provided, the inner dataloader shall return samples without actually batchifying them.
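A usage sketch (the wrapped loader reuses the hypothetical data from the PadSequenceDataLoader example above):
import torch
from hanlp.common.dataset import PadSequenceDataLoader, PrefetchDataLoader

if __name__ == '__main__':
    torch.multiprocessing.set_start_method('spawn')
    samples = [{'token_ids': [1, 2, 3]}, {'token_ids': [4, 5]}]
    loader = PadSequenceDataLoader(samples, batch_size=2)
    prefetcher = PrefetchDataLoader(loader, prefetch=10)
    for batch in prefetcher:
        print(batch['token_ids'])  # consume batches as usual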
samplers¶
- class hanlp.common.dataset.BucketSampler(buckets: Dict[float, List[int]], batch_max_tokens, batch_size=None, shuffle=False)[source]¶
A bucketing based sampler which groups samples into buckets then creates batches from each bucket.
- Parameters
buckets – A dict whose keys are some statistical numbers of each bucket (e.g., the average length), and whose values are the indices of the samples in each bucket.
batch_max_tokens – Maximum tokens per batch.
batch_size – Maximum samples per batch.
shuffle – True to shuffle batches and the samples within each batch.
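A sketch of the expected buckets format (the numbers are hypothetical):
from hanlp.common.dataset import BucketSampler

buckets = {10.0: [0, 2, 5], 20.0: [1, 3, 4]}  # bucket statistic -> sample indices
sampler = BucketSampler(buckets, batch_max_tokens=40)
for batch_indices in sampler:
    print(batch_indices)  # each batch is a list of sample indices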
- class hanlp.common.dataset.KMeansSampler(lengths, batch_max_tokens, batch_size=None, shuffle=False, n_buckets=1)[source]¶
A bucket sampler which groups samples using KMeans on their lengths.
- Parameters
lengths – Lengths of each sample, usually measured by number of tokens.
batch_max_tokens – Maximum tokens per batch.
batch_size – Maximum samples per batch.
shuffle – True to shuffle batches. Samples in the same batch won’t be shuffled, since the ordered sequence is helpful to speed up RNNs.
n_buckets – Number of buckets, i.e., clusters in terms of KMeans.
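A usage sketch (the token_ids field is hypothetical), plugging the sampler into a dataloader via batch_sampler:
from hanlp.common.dataset import KMeansSampler, PadSequenceDataLoader

lengths = [len(dataset[i]['token_ids']) for i in range(len(dataset))]
sampler = KMeansSampler(lengths, batch_max_tokens=5000, n_buckets=3)
loader = PadSequenceDataLoader(dataset, batch_sampler=sampler)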
- class hanlp.common.dataset.SortingSampler(lengths: List[int], batch_size=None, batch_max_tokens=None, use_effective_tokens=False, shuffle=False)[source]¶
A sampler which sorts samples according to their lengths. It takes a continuous chunk of sorted samples to make a batch. The effective batch size is determined by batch_size, batch_max_tokens and use_effective_tokens.
- Parameters
lengths – Lengths of each sample, usually measured by number of tokens.
batch_max_tokens – Maximum tokens per batch.
use_effective_tokens – Whether to calculate the effective number of tokens after padding when applying batch_max_tokens.
batch_size – Maximum samples per batch.
shuffle – True to shuffle batches and the samples within each batch.
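A small sketch of the batching behavior (the lengths are hypothetical; the exact grouping depends on the arguments):
from hanlp.common.dataset import SortingSampler

lengths = [7, 3, 9, 2, 5]
sampler = SortingSampler(lengths, batch_max_tokens=10)
print(list(sampler))  # batches of indices taken from length-sorted samples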
sampler builders¶
- class hanlp.common.dataset.SamplerBuilder[source]¶
- abstract build(lengths: List[int], shuffle=False, gradient_accumulation=1, **kwargs) torch.utils.data.sampler.Sampler [source]¶
Build a Sampler given statistics of samples and other arguments.
- Parameters
lengths – The lengths of samples.
shuffle – True to shuffle batches. Note samples in each mini-batch are not necessarily shuffled.
gradient_accumulation – Number of mini-batches per update step.
**kwargs – Other arguments to be passed to the constructor of the sampler.
- class hanlp.common.dataset.SortingSamplerBuilder(batch_size=None, batch_max_tokens=None, use_effective_tokens=False)[source]¶
Builds a SortingSampler.
- Parameters
batch_max_tokens – Maximum tokens per batch.
use_effective_tokens – Whether to calculate the effective number of tokens after padding when applying batch_max_tokens.
batch_size – Maximum samples per batch.
- build(lengths: List[int], shuffle=False, gradient_accumulation=1, **kwargs) torch.utils.data.sampler.Sampler [source]¶
Build a Sampler given statistics of samples and other arguments.
- Parameters
lengths – The lengths of samples.
shuffle – True to shuffle batches. Note samples in each mini-batch are not necessarily shuffled.
gradient_accumulation – Number of mini-batches per update step.
**kwargs – Other arguments to be passed to the constructor of the sampler.
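A usage sketch: construct the builder once with batching hyperparameters, then build a concrete sampler from per-sample lengths (the lengths here are hypothetical):
from hanlp.common.dataset import SortingSamplerBuilder

lengths = [7, 3, 9, 2, 5]  # per-sample token counts
builder = SortingSamplerBuilder(batch_max_tokens=10)
sampler = builder.build(lengths, shuffle=True, gradient_accumulation=2)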
- class hanlp.common.dataset.KMeansSamplerBuilder(batch_max_tokens, batch_size=None, n_buckets=1)[source]¶
Builds a KMeansSampler.
- Parameters
batch_max_tokens – Maximum tokens per batch.
batch_size – Maximum samples per batch.
n_buckets – Number of buckets. Clusters in terms of KMeans.
- build(lengths: List[int], shuffle=False, gradient_accumulation=1, **kwargs) torch.utils.data.sampler.Sampler [source]¶
Build a Sampler given statistics of samples and other arguments.
- Parameters
lengths – The lengths of samples.
shuffle – True to shuffle batches. Note samples in each mini-batch are not necessarily shuffled.
gradient_accumulation – Number of mini-batches per update step.
**kwargs – Other arguments to be passed to the constructor of the sampler.