word2vec

Word2Vec is a family of model architectures and optimizations used to learn word embeddings from large unlabeled datasets. In this document, it refers narrowly to a component that maps discrete words to distributed representations, i.e., dense vectors.

To perform such a mapping:

import hanlp
word2vec = hanlp.load(hanlp.pretrained.word2vec.CONVSEG_W2V_NEWS_TENSITE_WORD_PKU)
word2vec('先进')
tensor([ 0.0057,  0.0493,  0.1801,  0.1991, -0.3161,  0.3856, -0.0404,  0.0280,
        -0.3597,  0.0012,  0.1900,  0.1331, -0.1469,  0.0129,  0.1645, -0.0124,
         0.1712,  0.0218, -0.0839,  0.0254, -0.0602, -0.0007,  0.0479, -0.0805,
        -0.0745, -0.1974, -0.0467, -0.0673, -0.1308,  0.1095, -0.1394,  0.0537,
        -0.1251,  0.1249, -0.1312, -0.1592,  0.0898, -0.0517, -0.0923,  0.0251,
        -0.1835,  0.0974,  0.0843, -0.0724,  0.0816, -0.1958, -0.0093, -0.0757,
        -0.2989,  0.0200])

These vectors have already been normalized to facilitate similarity computation:

import torch
print(torch.nn.functional.cosine_similarity(word2vec('先进'), word2vec('优秀'), dim=0))
print(torch.nn.functional.cosine_similarity(word2vec('先进'), word2vec('水果'), dim=0))
tensor(0.7314)
tensor(-0.0448)
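
Because the vectors are unit-normalized, the cosine similarities above can equally be computed as plain dot products. A minimal sketch to check this, reusing the word2vec component loaded above (the approximate values in the comments are assumptions based on the outputs shown earlier):

import torch
v1 = word2vec('先进')
v2 = word2vec('优秀')
# The L2 norm of a normalized vector should be close to 1.0
print(torch.norm(v1))
# For unit vectors, the dot product equals the cosine similarity (~0.73 above)
print(torch.dot(v1, v2))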

Using these similarity scores, the most similar words can be found:

word2vec.most_similar('上海')
{'广州': 0.8630875945091248,
 '北京': 0.8542779088020325,
 '天津': 0.8537402153015137,
 '深圳': 0.8526542782783508,
 '成都': 0.8255313634872437,
 '西安': 0.8215534687042236,
 '杭州': 0.8207105398178101,
 '厦门': 0.8186136484146118,
 '昆明': 0.8184518814086914,
 '武汉': 0.8055384755134583}

Word2Vec usually cannot handle OOV words or phrases:

word2vec.most_similar('非常寒冷') # phrases are usually OOV
{}

Doc2Vec, as opposed to the Word2Vec model, creates a vectorized representation of a group of words by averaging their word vectors. To enable Doc2Vec for OOV words and phrases, pass doc2vec=True:

word2vec.most_similar('非常寒冷', doc2vec=True)
{'寒冷': 0.7510591745376587,
 '非常': 0.7510591745376587,
 '很': 0.7312490344047546,
 '比较': 0.6991080045700073,
 '无比': 0.685967743396759,
 '极其': 0.6834490895271301,
 '十分': 0.6786675453186035,
 '潮湿': 0.67008376121521,
 '焦躁不安': 0.6699174642562866,
 '阴冷': 0.6695235967636108}

All the pre-trained word2vec models and their details are listed below.

hanlp.pretrained.word2vec.MERGE_SGNS_BIGRAM_CHAR_300_ZH = 'http://download.hanlp.com/embeddings/extra/merge_sgns_bigram_char300_20220130_214613.txt.zip'

Chinese word embeddings trained with context features (word, ngram, character, and more) using Skip-Gram with Negative Sampling (SGNS) (Li et al. 2018).

hanlp.pretrained.word2vec.RADICAL_CHAR_EMBEDDING_100 = 'https://file.hankcs.com/hanlp/embeddings/radical_char_vec_20191229_013849.zip#character.vec.txt'

Chinese character embedding enhanced with rich radical information (He et al. 2018a).

hanlp.pretrained.word2vec.TENCENT_AILAB_EMBEDDING_LARGE_100 = 'https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d100-v0.2.0.tar.gz#tencent-ailab-embedding-zh-d100-v0.2.0.txt'

Chinese word embeddings (Mikolov et al. 2013) with a large vocabulary and 100 dimensions, provided by Tencent AI Lab.

hanlp.pretrained.word2vec.TENCENT_AILAB_EMBEDDING_LARGE_200 = 'https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d200-v0.2.0.tar.gz#tencent-ailab-embedding-zh-d200-v0.2.0.txt'

Chinese word embeddings (Mikolov et al. 2013) with a large vocabulary and 200 dimensions, provided by Tencent AI Lab.

hanlp.pretrained.word2vec.TENCENT_AILAB_EMBEDDING_SMALL_100 = 'https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d100-v0.2.0-s.tar.gz#tencent-ailab-embedding-zh-d100-v0.2.0-s.txt'

Chinese word embeddings (Mikolov et al. 2013) with a small vocabulary and 100 dimensions, provided by Tencent AI Lab.

hanlp.pretrained.word2vec.TENCENT_AILAB_EMBEDDING_SMALL_200 = 'https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d200-v0.2.0-s.tar.gz#tencent-ailab-embedding-zh-d200-v0.2.0-s.txt'

Chinese word embeddings (Mikolov et al. 2013) with a small vocabulary and 200 dimensions, provided by Tencent AI Lab.
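
Any of the identifiers above can be loaded in the same way as the news embeddings used at the beginning of this document. A minimal sketch, assuming the small 100-dimensional Tencent model (note the archive is a fairly large download):

import hanlp
# Load one of the pretrained embeddings listed above
word2vec = hanlp.load(hanlp.pretrained.word2vec.TENCENT_AILAB_EMBEDDING_SMALL_100)
# Query nearest neighbours exactly as with the news embeddings
word2vec.most_similar('上海')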