【Question title】: How to plot t-SNE on word2vec (created with gensim) for the 20 most_similar words?
【Posted】: 2022-01-13 00:09:45
【Question】:

I am using t-SNE to plot a trained word2vec model (created with gensim):

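from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

labels = []
tokens = []

# model is a trained gensim Word2Vec model (gensim < 4.0 API)
for word in model.wv.vocab:
    tokens.append(model.wv[word])
    labels.append(word)

tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
new_values = tsne_model.fit_transform(tokens)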

x = []
y = []
for value in new_values:
    x.append(value[0])
    y.append(value[1])
    
plt.figure(figsize=(50, 50)) 
for i in range(len(x)):
    plt.scatter(x[i],y[i])
    plt.annotate(labels[i],
                 xy=(x[i], y[i]),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')
plt.show()

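gensim has a built-in method, most_similar. For example,

w2v_model.wv.most_similar(positive=['word'], topn=20)

returns the 20 words most similar to "word". I would like to plot only the words most similar to a given word (n=20). Any suggestions on how to modify the plot to do that?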

【Comments】:

    Tags: python gensim word2vec tsne


    【Solution 1】:

    Using the example data that ships with the package:

    from gensim.test.utils import common_texts
    from gensim.models import Word2Vec
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt
    
    model = Word2Vec(sentences=common_texts, window=5, min_count=1)
    
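    # gensim < 4.0 API; on gensim 4.x use list(model.wv.index_to_key) instead
    labels = list(model.wv.vocab.keys())
    # index through model.wv; indexing the model object directly is deprecated
    tokens = model.wv[labels]
    
    # perplexity must be smaller than the number of samples (12 words here)
    tsne_model = TSNE(init='pca', learning_rate='auto', perplexity=5)
    new_values = tsne_model.fit_transform(tokens)
    
    # split the 2-D embedding into the x and y coordinates used by the plots
    x, y = new_values[:, 0], new_values[:, 1]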
    

    The t-SNE plot of the whole vocabulary looks like this:

    plt.figure(figsize=(7, 5)) 
    for i in range(new_values.shape[0]):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    

    Extract the words most similar to 'trees' (5 in my case):

    most_sim_words = [i[0] for i in model.wv.most_similar(positive='trees', topn=5)]
    most_sim_words
    ['human', 'graph', 'time', 'interface', 'system']
    

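    You can keep the code you have: just iterate over the most similar words and use index() to look up each word's position in labels, which is also its row in tokens and in the t-SNE output: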

    for word in most_sim_words:
        i = labels.index(word)
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
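
The lookup step can be tried in isolation, without gensim or matplotlib. The following is only a minimal sketch under stated assumptions: `select_points`, `coords`, and all the numbers are hypothetical stand-ins (not real t-SNE output or real similarity scores), and `neighbours` merely mimics the list of (word, similarity) pairs that most_similar returns. It also adds the query word itself to the selection, since most_similar does not include it, and it swaps the repeated labels.index() calls for a dictionary lookup.

```python
# Toy stand-ins for the objects built above (hypothetical values, not real t-SNE output).
labels = ["human", "graph", "time", "trees", "system"]
coords = [(0.1, 1.2), (-0.4, 0.3), (2.0, -1.1), (0.7, 0.7), (-1.5, 0.2)]

# Same shape as the result of most_similar: (word, similarity) pairs (made-up scores).
neighbours = [("human", 0.17), ("graph", 0.15), ("time", 0.05)]


def select_points(query, neighbours, labels, coords):
    """Return {word: (x, y)} for the query word followed by its neighbours."""
    # Build the word -> row mapping once instead of scanning labels with index().
    index_of = {word: i for i, word in enumerate(labels)}
    words = [query] + [word for word, _score in neighbours]
    return {word: coords[index_of[word]] for word in words}


selected = select_points("trees", neighbours, labels, coords)
for word, (px, py) in selected.items():
    print(f"{word}: ({px}, {py})")
```

Each selected (x, y) pair can then be passed to plt.scatter and plt.annotate in the same way as in the loop above.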
    

    【Comments】:
