word2vec
Word2Vec is a family of model architectures and optimizations that can be used to learn word embeddings from large unlabeled datasets. In this document, it refers narrowly to a component that maps discrete words to distributed representations, i.e., dense vectors.
To perform such mapping:
import hanlp
word2vec = hanlp.load(hanlp.pretrained.word2vec.CONVSEG_W2V_NEWS_TENSITE_WORD_PKU)
word2vec('先进')
tensor([ 0.0057, 0.0493, 0.1801, 0.1991, -0.3161, 0.3856, -0.0404, 0.0280,
-0.3597, 0.0012, 0.1900, 0.1331, -0.1469, 0.0129, 0.1645, -0.0124,
0.1712, 0.0218, -0.0839, 0.0254, -0.0602, -0.0007, 0.0479, -0.0805,
-0.0745, -0.1974, -0.0467, -0.0673, -0.1308, 0.1095, -0.1394, 0.0537,
-0.1251, 0.1249, -0.1312, -0.1592, 0.0898, -0.0517, -0.0923, 0.0251,
-0.1835, 0.0974, 0.0843, -0.0724, 0.0816, -0.1958, -0.0093, -0.0757,
-0.2989, 0.0200])
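The dimensionality of the returned vector is determined by the loaded model; here it is 50, which can be confirmed from the tensor's shape (a minimal check, assuming the standard torch tensor interface shown above):
word2vec('先进').shape  # torch.Size([50]), matching the 50 values printed above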
These vectors have already been normalized to facilitate similarity computation:
import torch
print(torch.nn.functional.cosine_similarity(word2vec('先进'), word2vec('优秀'), dim=0))
print(torch.nn.functional.cosine_similarity(word2vec('先进'), word2vec('水果'), dim=0))
tensor(0.7314)
tensor(-0.0448)
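Since the vectors are normalized (to unit L2 norm, judging from the similarity use case), cosine similarity reduces to a plain dot product. A minimal sanity check of that equivalence:
import torch
v = word2vec('先进')
print(torch.norm(v))                  # expected to be ≈ 1 for a unit-normalized vector
print(torch.dot(v, word2vec('优秀')))  # should match the cosine similarity printed above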
Using these similarity scores, the most similar words can be found:
word2vec.most_similar('上海')
{'广州': 0.8630875945091248,
'北京': 0.8542779088020325,
'天津': 0.8537402153015137,
'深圳': 0.8526542782783508,
'成都': 0.8255313634872437,
'西安': 0.8215534687042236,
'杭州': 0.8207105398178101,
'厦门': 0.8186136484146118,
'昆明': 0.8184518814086914,
'武汉': 0.8055384755134583}
Word2Vec usually cannot handle out-of-vocabulary (OOV) words or phrases:
word2vec.most_similar('非常寒冷') # phrases are usually OOV
{}
Doc2Vec, in contrast to the Word2Vec model, can create a vectorized representation of a phrase by averaging the vectors of its words. To enable Doc2Vec for OOV words and phrases, pass doc2vec=True:
word2vec.most_similar('非常寒冷', doc2vec=True)
{'寒冷': 0.7510591745376587,
'非常': 0.7510591745376587,
'很': 0.7312490344047546,
'比较': 0.6991080045700073,
'无比': 0.685967743396759,
'极其': 0.6834490895271301,
'十分': 0.6786675453186035,
'潮湿': 0.67008376121521,
'焦躁不安': 0.6699174642562866,
'阴冷': 0.6695235967636108}
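Under the hood, the phrase vector is simply an average of word vectors. A minimal sketch of that averaging, assuming the phrase is split into the in-vocabulary words '非常' and '寒冷' (consistent with their identical similarity scores above):
import torch
# Hypothetical reproduction: average the word vectors, then re-normalize for cosine similarity
phrase_vec = torch.stack([word2vec('非常'), word2vec('寒冷')]).mean(dim=0)
phrase_vec = torch.nn.functional.normalize(phrase_vec, dim=0)
print(torch.nn.functional.cosine_similarity(phrase_vec, word2vec('寒冷'), dim=0))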
All the pre-trained word2vec models and their details are listed below; a loading example follows the list.
- hanlp.pretrained.word2vec.MERGE_SGNS_BIGRAM_CHAR_300_ZH = 'http://download.hanlp.com/embeddings/extra/merge_sgns_bigram_char300_20220130_214613.txt.zip'
Chinese word embeddings trained with context features (word, ngram, character, and more) using Skip-Gram with Negative Sampling (SGNS) (Li et al. 2018).
- hanlp.pretrained.word2vec.RADICAL_CHAR_EMBEDDING_100 = 'https://file.hankcs.com/hanlp/embeddings/radical_char_vec_20191229_013849.zip#character.vec.txt'
Chinese character embeddings enhanced with rich radical information (He et al. 2018a).
- hanlp.pretrained.word2vec.TENCENT_AILAB_EMBEDDING_LARGE_100 = 'https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d100-v0.2.0.tar.gz#tencent-ailab-embedding-zh-d100-v0.2.0.txt'
Chinese word embeddings (Mikolov et al. 2013) with a large vocabulary and 100 dimensions, provided by Tencent AI Lab.
- hanlp.pretrained.word2vec.TENCENT_AILAB_EMBEDDING_LARGE_200 = 'https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d200-v0.2.0.tar.gz#tencent-ailab-embedding-zh-d200-v0.2.0.txt'
Chinese word embeddings (Mikolov et al. 2013) with a large vocabulary and 200 dimensions, provided by Tencent AI Lab.
- hanlp.pretrained.word2vec.TENCENT_AILAB_EMBEDDING_SMALL_100 = 'https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d100-v0.2.0-s.tar.gz#tencent-ailab-embedding-zh-d100-v0.2.0-s.txt'
Chinese word embeddings (Mikolov et al. 2013) with a small vocabulary and 100 dimensions, provided by Tencent AI Lab.
- hanlp.pretrained.word2vec.TENCENT_AILAB_EMBEDDING_SMALL_200 = 'https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d200-v0.2.0-s.tar.gz#tencent-ailab-embedding-zh-d200-v0.2.0-s.txt'
Chinese word embeddings (Mikolov et al. 2013) with a small vocabulary and 200 dimensions, provided by Tencent AI Lab.