# word2vec

Word2Vec is a family of model architectures and optimizations for learning word embeddings from large unlabeled corpora. In this document, it refers narrowly to a component that maps discrete words to distributed representations, i.e. dense vectors.

To perform such a mapping:

```python
import hanlp

# A pre-trained embedding must be loaded first; this identifier ships with
# HanLP and produces the 50-dimensional vectors shown below.
word2vec = hanlp.load(hanlp.pretrained.word2vec.CONVSEG_W2V_NEWS_TENSITE_WORD_PKU)
word2vec('先进')
```

```
tensor([ 0.0057,  0.0493,  0.1801,  0.1991, -0.3161,  0.3856, -0.0404,  0.0280,
        -0.3597,  0.0012,  0.1900,  0.1331, -0.1469,  0.0129,  0.1645, -0.0124,
         0.1712,  0.0218, -0.0839,  0.0254, -0.0602, -0.0007,  0.0479, -0.0805,
        -0.0745, -0.1974, -0.0467, -0.0673, -0.1308,  0.1095, -0.1394,  0.0537,
        -0.1251,  0.1249, -0.1312, -0.1592,  0.0898, -0.0517, -0.0923,  0.0251,
        -0.1835,  0.0974,  0.0843, -0.0724,  0.0816, -0.1958, -0.0093, -0.0757,
        -0.2989,  0.0200])
```


These vectors have already been normalized to facilitate similarity computation:

```python
import torch
print(torch.nn.functional.cosine_similarity(word2vec('先进'), word2vec('优秀'), dim=0))
print(torch.nn.functional.cosine_similarity(word2vec('先进'), word2vec('水果'), dim=0))
```

```
tensor(0.7314)
tensor(-0.0448)
```
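Because the vectors are unit-normalized, cosine similarity reduces to a plain dot product. A minimal sketch with made-up toy vectors (not actual HanLP embeddings):

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for two word vectors (values are hypothetical).
a = F.normalize(torch.tensor([0.3, -0.2, 0.5, 0.1]), dim=0)
b = F.normalize(torch.tensor([0.1, 0.4, -0.3, 0.2]), dim=0)

# For unit-length vectors, cosine similarity equals the dot product.
cos = F.cosine_similarity(a, b, dim=0)
dot = torch.dot(a, b)
assert torch.allclose(cos, dot)
```

This is why pre-normalizing the embeddings pays off: every similarity query afterwards is a single cheap dot product.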


Using these similarity scores, the most similar words can be found:

```python
word2vec.most_similar('上海')
```

```
{'广州': 0.8630875945091248,
 '北京': 0.8542779088020325,
 '天津': 0.8537402153015137,
 '深圳': 0.8526542782783508,
 '成都': 0.8255313634872437,
 '西安': 0.8215534687042236,
 '杭州': 0.8207105398178101,
 '厦门': 0.8186136484146118,
 '昆明': 0.8184518814086914,
 '武汉': 0.8055384755134583}
```
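A nearest-neighbour query like this can be sketched as one matrix-vector product over the normalized embedding matrix. The toy vocabulary and random vectors below are hypothetical stand-ins, not HanLP's actual implementation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical toy vocabulary with random unit-length embeddings (one row per word).
vocab = ['上海', '广州', '北京', '水果']
emb = F.normalize(torch.randn(len(vocab), 50), dim=1)

def most_similar(word, topk=2):
    # Dot products against every row == cosine similarities, since rows are unit vectors.
    scores = emb @ emb[vocab.index(word)]
    order = scores.argsort(descending=True).tolist()
    ranked = [(vocab[i], scores[i].item()) for i in order if vocab[i] != word]
    return dict(ranked[:topk])

print(most_similar('上海'))
```

With a real vocabulary this single matrix product scores every word at once, which is what makes exhaustive ranking over millions of entries feasible.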


Word2Vec usually cannot handle out-of-vocabulary (OOV) words or phrases:

```python
word2vec.most_similar('非常寒冷')  # phrases are usually OOV
```

```
{}
```


Doc2Vec, in contrast to Word2Vec, can create a vector representation for a group of words by averaging their word vectors. To enable Doc2Vec for OOV words and phrases, pass `doc2vec=True`:

```python
word2vec.most_similar('非常寒冷', doc2vec=True)
```

```
{'寒冷': 0.7510591745376587,
 '非常': 0.7510591745376587,
 '很': 0.7312490344047546,
 '比较': 0.6991080045700073,
 '无比': 0.685967743396759,
 '极其': 0.6834490895271301,
 '十分': 0.6786675453186035,
 '潮湿': 0.67008376121521,
 '焦躁不安': 0.6699174642562866,
 '阴冷': 0.6695235967636108}
```
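The averaging also explains the tied top-2 scores above: the mean of two unit vectors is, by symmetry, equally similar to each of them. A sketch with hypothetical random vectors standing in for the two tokens:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical unit vectors standing in for the tokens '非常' and '寒冷'.
v1 = F.normalize(torch.randn(50), dim=0)
v2 = F.normalize(torch.randn(50), dim=0)

# Doc2Vec-style phrase vector: average the word vectors, then re-normalize.
phrase = F.normalize((v1 + v2) / 2, dim=0)

# The phrase vector is equally similar to both of its component words.
s1 = F.cosine_similarity(phrase, v1, dim=0)
s2 = F.cosine_similarity(phrase, v2, dim=0)
assert torch.allclose(s1, s2)
```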


All the pre-trained word2vec models and their details are listed below.

Chinese word embeddings trained with context features (word, ngram, character, and more) using Skip-Gram with Negative Sampling (SGNS) (Li et al. 2018).

Chinese character embedding enhanced with rich radical information (He et al. 2018a).

hanlp.pretrained.word2vec.TENCENT_AILAB_EMBEDDING_LARGE_100 = 'https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d100-v0.2.0.tar.gz#tencent-ailab-embedding-zh-d100-v0.2.0.txt'

Chinese word embeddings (Mikolov et al. 2013) with a large vocabulary and 100 dimensions, provided by Tencent AI Lab.

hanlp.pretrained.word2vec.TENCENT_AILAB_EMBEDDING_LARGE_200 = 'https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d200-v0.2.0.tar.gz#tencent-ailab-embedding-zh-d200-v0.2.0.txt'

Chinese word embeddings (Mikolov et al. 2013) with a large vocabulary and 200 dimensions, provided by Tencent AI Lab.

hanlp.pretrained.word2vec.TENCENT_AILAB_EMBEDDING_SMALL_100 = 'https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d100-v0.2.0-s.tar.gz#tencent-ailab-embedding-zh-d100-v0.2.0-s.txt'

Chinese word embeddings (Mikolov et al. 2013) with a small vocabulary and 100 dimensions, provided by Tencent AI Lab.

hanlp.pretrained.word2vec.TENCENT_AILAB_EMBEDDING_SMALL_200 = 'https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d200-v0.2.0-s.tar.gz#tencent-ailab-embedding-zh-d200-v0.2.0-s.txt'

Chinese word embeddings (Mikolov et al. 2013) with a small vocabulary and 200 dimensions, provided by Tencent AI Lab.