dataset¶
This module provides the base definitions for datasets, dataloaders and samplers.
datasets¶
class hanlp.common.dataset.Transformable(transform: Union[Callable, List] = None)[source]¶
An object which can be transformed with a list of functions. It can be treated as an object being passed through a list of functions, while these functions are kept in a list.
- Parameters
transform – A transform function or a list of functions.
append_transform(transform: Callable)[source]¶
Append a transform to its list of transforms.
- Parameters
transform – A new transform to be appended.
- Returns
Itself.
insert_transform(index: int, transform: Callable)[source]¶
Insert a transform at a certain position.
- Parameters
index – A certain position.
transform – A new transform.
- Returns
Itself.
transform_sample(sample: dict, inplace=False) → dict[source]¶
Apply transforms to a sample.
- Parameters
sample – A sample, which is a dict holding features.
inplace – True to apply transforms inplace.
Attention
If any transform modifies existing features, it will modify them again and again when inplace=True. For example, if a transform inserts a BOS token into a list inplace and it is called twice, then 2 BOS tokens will be inserted, which might not be the intended result.
- Returns
Transformed sample.
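A minimal usage sketch of the transform pipeline, assuming plain dict samples; the lowercase and add_bos transforms and the token field are hypothetical:

from hanlp.common.dataset import Transformable

def lowercase(sample: dict) -> dict:
    # Hypothetical transform: lowercase the 'token' field.
    sample['token'] = [t.lower() for t in sample['token']]
    return sample

def add_bos(sample: dict) -> dict:
    # Hypothetical transform: prepend a BOS marker.
    sample['token'] = ['<bos>'] + sample['token']
    return sample

t = Transformable(lowercase).append_transform(add_bos)
sample = {'token': ['Hello', 'world']}
print(t.transform_sample(sample))  # {'token': ['<bos>', 'hello', 'world']}
# With the default inplace=False, transforms work on a copy,
# so the original sample keeps its fields unchanged.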
class hanlp.common.dataset.TransformableDataset(data: Union[str, List], transform: Union[Callable, List] = None, cache=None, generate_idx=None)[source]¶
A Dataset which can be applied with a list of transform functions.
- Parameters
data – The local or remote path to a dataset, or a list of samples where each sample is a dict.
transform – Predefined transform(s).
cache – True to enable caching, so that transforms won't be called twice.
generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.
__getitem__(index: Union[int, slice]) → Union[dict, List[dict]][source]¶
Get the index-th sample in this dataset.
- Parameters
index – Either an integer index or a list of indices.
- Returns
Either a sample or a list of samples, depending on how many indices are passed in.
k_fold(k, i)[source]¶
Perform k-fold sampling.
- Parameters
k – Number of folds.
i – The i-th fold.
- Returns
The i-th fold subset of this dataset.
- Return type
TransformableDataset
load_data(data, generate_idx=False)[source]¶
An intermediate step between the constructor and the actual file loading method.
- Parameters
data – If data is a file path, this method calls load_file() to load it.
generate_idx – Create an IDX field for each sample to store its order in the dataset. Useful for prediction when samples are re-ordered by a sampler.
- Returns
Loaded samples.
abstract load_file(filepath: str)[source]¶
The actual file loading logic.
- Parameters
filepath – The path to a dataset.
prune(criterion: Callable, logger: logging.Logger = None)[source]¶
Prune (discard) samples according to a criterion.
- Parameters
criterion – A function that takes a sample as input and outputs True if the sample needs to be pruned.
logger – If given, log statistical messages using it.
- Returns
Size before pruning.
- Return type
int
should_load_file(data) → bool[source]¶
Determines whether data is a filepath.
- Parameters
data – Data to check.
- Returns
True to indicate it's a filepath.
split(*ratios)[source]¶
Split the dataset into subsets.
- Parameters
*ratios – The ratios for each subset. They can be any kind of numbers, which will be normalized. For example, 8, 1, 1 is equivalent to 0.8, 0.1, 0.1.
- Returns
A list of subsets.
- Return type
list
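A minimal sketch of subclassing TransformableDataset, assuming a whitespace-tokenized, one-sentence-per-line corpus; the MyDataset name, the file path and the token field are hypothetical:

from hanlp.common.dataset import TransformableDataset

class MyDataset(TransformableDataset):
    # Hypothetical subclass: each line of the file becomes one sample.
    def load_file(self, filepath: str):
        with open(filepath, encoding='utf-8') as src:
            for line in src:
                yield {'token': line.split()}

dataset = MyDataset('corpus.txt')       # load_data() calls load_file() internally
trn, dev, tst = dataset.split(8, 1, 1)  # normalized to 0.8, 0.1, 0.1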
dataloaders¶
class hanlp.common.dataset.PadSequenceDataLoader(dataset, batch_size=32, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, pad: dict = None, vocabs: hanlp.common.transform.VocabDict = None, device=None, **kwargs)[source]¶
A dataloader commonly used for NLP tasks. It offers the following conveniences:
- Batchify each field of samples into a Tensor if the field name ends with _id, _ids, _count, _offset, _span or mask, or if the name is in the pad dict.
- Pad each field according to its name, the vocabs and the pad dict.
- Move each Tensor onto device.
- Parameters
dataset – A Dataset to be batchified.
batch_size – Max size of each batch.
shuffle – True to shuffle batches.
sampler – A Sampler to sample samples from data.
batch_sampler – A Sampler to sample batches from all batches.
num_workers – Number of workers for multi-threaded loading. Note that multi-threaded loading isn't always faster.
collate_fn – A function to perform batchifying. It must be set to None in order to make use of the features this class offers.
pin_memory – If samples are loaded in the Dataset on CPU and are to be pushed to the GPU, enabling pin_memory can speed up the transfer. It's usually not useful since most data fields are not of Tensor type.
drop_last – Drop the last batch since it could be half-empty.
timeout – For multi-worker loading, set a timeout to wait for a worker.
worker_init_fn – Init function for multi-worker loading.
multiprocessing_context – Context for multiprocessing.
pad – A dict holding field names and their padding values.
vocabs – A dict of vocabs so the padding value can be fetched from them.
device – The device tensors will be moved onto.
**kwargs – Other arguments to be passed to torch.utils.data.DataLoader.__init__()
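A minimal sketch, assuming samples whose token_id field is a variable-length list of ints and that the padding value defaults to 0; the field name and values are hypothetical:

from hanlp.common.dataset import PadSequenceDataLoader

samples = [{'token_id': [2, 5, 9]},
           {'token_id': [4, 7]}]
# 'token_id' ends with _id, so it is batchified into a padded Tensor.
for batch in PadSequenceDataLoader(samples, batch_size=2):
    print(batch['token_id'])  # tensor([[2, 5, 9], [4, 7, 0]])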
class hanlp.common.dataset.PrefetchDataLoader(dataloader: torch.utils.data.dataloader.DataLoader, prefetch: int = 10, batchify: Callable = None)[source]¶
A dataloader wrapper which speeds up batchifying using multi-processing. It works best for dataloaders whose batchify takes a very long time, but it introduces extra GPU memory consumption since prefetched batches are stored in a Queue on the GPU.
Caution
PrefetchDataLoader only works in spawn mode, with the following initialization code:
Examples:
if __name__ == '__main__':
    import torch
    torch.multiprocessing.set_start_method('spawn')
These 2 lines MUST be put into the if __name__ == '__main__': block.
- Parameters
dataloader – A DataLoader to be prefetched.
prefetch – Number of batches to prefetch.
batchify – A batchify function called on each batch of samples, in which case the inner dataloader shall return samples without really batchifying them.
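A minimal usage sketch under the spawn-mode requirement above; the sample data and prefetch size are hypothetical:

import torch
from hanlp.common.dataset import PadSequenceDataLoader, PrefetchDataLoader

if __name__ == '__main__':
    torch.multiprocessing.set_start_method('spawn')
    samples = [{'token_id': list(range(1, n))} for n in range(2, 100)]
    inner = PadSequenceDataLoader(samples, batch_size=8)
    loader = PrefetchDataLoader(inner, prefetch=4)  # prefetch 4 batches ahead
    for batch in loader:
        ...  # consume batches as usual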
samplers¶
class hanlp.common.dataset.BucketSampler(buckets: Dict[float, List[int]], batch_max_tokens, batch_size=None, shuffle=False)[source]¶
A bucketing-based sampler which groups samples into buckets, then creates batches from each bucket.
- Parameters
buckets – A dict whose keys are some statistical numbers of each bucket, and whose values are the indices of the samples in each bucket.
batch_max_tokens – Maximum tokens per batch.
batch_size – Maximum samples per batch.
shuffle – True to shuffle batches and samples within a batch.
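A minimal sketch with two hand-made buckets keyed by their average token length; the indices and lengths are hypothetical:

from hanlp.common.dataset import BucketSampler

# Samples 0-2 are short (~5 tokens); samples 3-4 are long (~20 tokens).
buckets = {5.0: [0, 1, 2], 20.0: [3, 4]}
sampler = BucketSampler(buckets, batch_max_tokens=40)
for batch_indices in sampler:
    print(batch_indices)  # each batch draws its indices from a single bucket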
class hanlp.common.dataset.KMeansSampler(lengths, batch_max_tokens, batch_size=None, shuffle=False, n_buckets=1)[source]¶
A bucket sampler which groups samples using KMeans on their lengths.
- Parameters
lengths – Lengths of each sample, usually measured by the number of tokens.
batch_max_tokens – Maximum tokens per batch.
batch_size – Maximum samples per batch.
shuffle – True to shuffle batches. Samples in the same batch won't be shuffled, since an ordered sequence helps speed up RNNs.
n_buckets – Number of buckets, i.e., clusters in terms of KMeans.
class hanlp.common.dataset.SortingSampler(lengths: List[int], batch_size=None, batch_max_tokens=None, shuffle=False)[source]¶
A sampler which sorts samples by their lengths and takes a continuous chunk of sorted samples to make a batch.
- Parameters
lengths – Lengths of each sample, usually measured by the number of tokens.
batch_max_tokens – Maximum tokens per batch.
batch_size – Maximum samples per batch.
shuffle – True to shuffle batches and samples within a batch.
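A minimal sketch wiring a sampler into PadSequenceDataLoader via batch_sampler; the samples and token budget are hypothetical. KMeansSampler can be dropped in the same way, with an extra n_buckets argument:

from hanlp.common.dataset import PadSequenceDataLoader, SortingSampler

samples = [{'token_id': list(range(n))} for n in [3, 17, 5, 12, 4]]
lengths = [len(s['token_id']) for s in samples]
sampler = SortingSampler(lengths, batch_max_tokens=20)
loader = PadSequenceDataLoader(samples, batch_sampler=sampler)
for batch in loader:
    # Similar-length samples land in the same batch, minimizing padding.
    print(batch['token_id'].shape)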
sampler builders¶
class hanlp.common.dataset.SamplerBuilder[source]¶
abstract build(lengths: List[int], shuffle=False, gradient_accumulation=1, **kwargs) → torch.utils.data.sampler.Sampler[source]¶
Build a Sampler given statistics of samples and other arguments.
- Parameters
lengths – The lengths of the samples.
shuffle – True to shuffle batches. Note that samples in each mini-batch are not necessarily shuffled.
gradient_accumulation – Number of mini-batches per update step.
**kwargs – Other arguments to be passed to the constructor of the sampler.
class hanlp.common.dataset.SortingSamplerBuilder(batch_size=None, batch_max_tokens=None)[source]¶
Builds a SortingSampler.
- Parameters
batch_max_tokens – Maximum tokens per batch.
batch_size – Maximum samples per batch.
build(lengths: List[int], shuffle=False, gradient_accumulation=1, **kwargs) → torch.utils.data.sampler.Sampler[source]¶
Build a Sampler given statistics of samples and other arguments.
- Parameters
lengths – The lengths of the samples.
shuffle – True to shuffle batches. Note that samples in each mini-batch are not necessarily shuffled.
gradient_accumulation – Number of mini-batches per update step.
**kwargs – Other arguments to be passed to the constructor of the sampler.
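A minimal sketch of the builder pattern; the lengths are hypothetical. Builders are typically stored in a config and turned into concrete samplers once the sample lengths are known:

from hanlp.common.dataset import SortingSamplerBuilder

builder = SortingSamplerBuilder(batch_max_tokens=50)
sampler = builder.build(lengths=[3, 17, 5, 12, 4], shuffle=True)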
class hanlp.common.dataset.KMeansSamplerBuilder(batch_max_tokens, batch_size=None, n_buckets=1)[source]¶
Builds a KMeansSampler.
- Parameters
batch_max_tokens – Maximum tokens per batch.
batch_size – Maximum samples per batch.
n_buckets – Number of buckets, i.e., clusters in terms of KMeans.
build(lengths: List[int], shuffle=False, gradient_accumulation=1, **kwargs) → torch.utils.data.sampler.Sampler[source]¶
Build a Sampler given statistics of samples and other arguments.
- Parameters
lengths – The lengths of the samples.
shuffle – True to shuffle batches. Note that samples in each mini-batch are not necessarily shuffled.
gradient_accumulation – Number of mini-batches per update step.
**kwargs – Other arguments to be passed to the constructor of the sampler.