word2vec

class hanlp.layers.embeddings.word2vec.Word2VecEmbedding(field, embed: Union[int, str], extend_vocab=True, pad=None, unk=None, lowercase=False, trainable=False, second_channel=False, word_dropout: float = 0, normalize=False, cpu=False, init='zeros')

A word2vec-style embedding builder which maps a token to its embedding by looking it up in a pre-defined table.

Parameters

field – The field to work on. Usually a token field.
embed – A path to a pre-trained embedding file, or an integer defining the size of a randomly initialized embedding.
extend_vocab – Unlock the training-set vocabulary so that tokens found in the pre-trained embedding file can be added to it.
pad – The padding token.
unk – The unknown token.
lowercase – Convert words in pretrained embeddings into lowercase.
trainable – Whether the embeddings are updated during training. False keeps them static.
second_channel – A trainable second channel for each token, which will be added to pretrained embeddings.
word_dropout – The probability of randomly replacing a token with UNK.
normalize – Set to l2 or std to normalize the embedding matrix.
cpu – Keep the embedding on CPU instead of GPU.
init – The initialization to use for out-of-vocabulary (OOV) tokens.
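
To make the word_dropout parameter concrete, here is a minimal sketch in plain Python. It is not HanLP's implementation: the helper name word_dropout_ids, the pad_idx argument, and the choice to leave padding untouched are assumptions made for illustration. The default unk_idx=1 mirrors the default shown for Word2VecEmbeddingModule.

```python
import random


def word_dropout_ids(token_ids, p, unk_idx=1, pad_idx=0, rng=None):
    """Replace each non-padding token id with unk_idx with probability p.

    Hypothetical helper for illustration only; skipping pad_idx is an
    assumption, not documented HanLP behaviour.
    """
    rng = rng or random.Random()
    dropped = []
    for token_id in token_ids:
        # rng.random() is in [0, 1), so p=0 never drops and p=1 always drops.
        if token_id != pad_idx and rng.random() < p:
            dropped.append(unk_idx)
        else:
            dropped.append(token_id)
    return dropped


if __name__ == "__main__":
    ids = [5, 7, 9, 0, 0]  # two trailing padding positions
    print(word_dropout_ids(ids, p=0.3, rng=random.Random(0)))
```

During training, dropping tokens this way exposes the model to the UNK embedding, which may help it cope with unseen words at inference time. At evaluation time the replacement would normally be disabled.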

module(vocabs: hanlp.common.transform.VocabDict, **kwargs) → Optional[torch.nn.modules.module.Module]

Build a module for this embedding.

Parameters: **kwargs – Additional keyword arguments such as vocabs and training. Not finalized yet.
Returns: A module.

transform(vocabs: Optional[hanlp.common.transform.VocabDict] = None, **kwargs) → Optional[Callable]

Build a transform function for this embedding.

Parameters: **kwargs – Additional keyword arguments such as vocabs and training. Not finalized yet.
Returns: A transform function.

class hanlp.layers.embeddings.word2vec.Word2VecEmbeddingModule(field: str, embed: torch.nn.modules.sparse.Embedding, word_dropout: Optional[hanlp.layers.dropout.WordDropout] = None, cpu=False, second_channel=False, num_tokens_in_trn=None, unk_idx=1)

A word2vec-style embedding module which maps a token to its embedding by looking it up in a pre-defined table.

Parameters

field – The field to work on. Usually a token field.
embed – An Embedding layer.
word_dropout – An optional WordDropout layer which randomly replaces tokens with UNK.
cpu – Keep the embedding on CPU instead of GPU.
second_channel – A trainable second channel for each token, which will be added to pretrained embeddings.
num_tokens_in_trn – The number of tokens in training set.
unk_idx – The index of UNK.
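
One way to read second_channel, num_tokens_in_trn and unk_idx together is sketched below in plain Python. This is an illustration under stated assumptions, not HanLP's code: it assumes the second channel holds one row per training-set token (num_tokens_in_trn rows), that ids outside that range fall back to unk_idx in the second channel, and that the two vectors are summed element-wise. The function name lookup_with_second_channel and the toy tables are hypothetical.

```python
def lookup_with_second_channel(token_ids, pretrained, second_channel, unk_idx=1):
    """Sum a pretrained vector and a second-channel vector for each token id.

    pretrained:     list of vectors, one per token in the extended vocabulary.
    second_channel: list of vectors, assumed to cover only the first
                    num_tokens_in_trn ids (the training-set tokens).
    """
    num_tokens_in_trn = len(second_channel)
    output = []
    for token_id in token_ids:
        # Ids added from the pretrained file have no row of their own in the
        # second channel, so this sketch maps them to the UNK row.
        second_id = token_id if token_id < num_tokens_in_trn else unk_idx
        summed = [a + b for a, b in zip(pretrained[token_id], second_channel[second_id])]
        output.append(summed)
    return output


if __name__ == "__main__":
    pretrained = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    second_channel = [[0.0, 0.0], [0.5, 0.5], [0.25, 0.25]]  # 3 training tokens
    print(lookup_with_second_channel([2, 3], pretrained, second_channel))
```

Under these assumptions, id 2 uses its own second-channel row, while id 3, which exists only in the pretrained table, borrows the UNK row.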

forward(batch: dict, **kwargs)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance rather than forward() directly, since calling the instance runs the registered hooks while calling forward() silently ignores them.
