Python - Calculate hierarchical clustering of word2vec vectors and plot the results as a dendrogram (answers)

【Question title】: Python - Calculate Hierarchical clustering of word2vec vectors and plot the results as a dendrogram
【Posted】: 2017-05-18 16:17:14
【Question】:

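I generated a 100-dimensional word2vec model from my domain text corpus, merging common phrases into single tokens, for example (good bye => good_bye). Then I extracted the vectors for the 1000 words I am interested in.

So I have a numpy.array of 1000 vectors like this: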

[[-0.050378,0.855622,1.107467,0.456601,...[100 dimensions],
 [-0.040378,0.755622,1.107467,0.456601,...[100 dimensions],
 ...
 ...[1000 Vectors]
]

And an array of the corresponding words like this:

["hello","hi","bye","good_bye"...1000]

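I ran K-Means on my data, and the results make sense:

import numpy as np
from sklearn.cluster import KMeans

X = np.array(words_vectors)
kmeans = KMeans(n_clusters=20, random_state=0).fit(X)
# print the cluster label next to each word
for idx, l in enumerate(kmeans.labels_):
    print(l, words[idx])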

--- Output ---
0 hello
0 hi
1 bye
1 good_bye

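0 = greetings, 1 = farewells

However, some of the words make me think that hierarchical clustering is a better fit for this task. I tried using AgglomerativeClustering; unfortunately, for this Python newbie, things got complicated and I got lost.

How can I cluster my vectors so that the output is more or less a dendrogram, like the one found on this wiki page?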

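【Comments】:

What went wrong with AgglomerativeClustering? It would be best if you could provide a reproducible example.
I just feel like I am guessing, and I lack the knowledge to get the task done.

Tags: python numpy machine-learning hierarchical-clustering word2vec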

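【Solution 1】:

I had the same problem until now! I kept landing on your post whenever I searched online (keywords = hierarchical clustering on word2vec), so I had to give you a solution that may work.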

import gensim
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

sentences = ['hi', 'hello', 'hi hello', 'goodbye', 'bye', 'goodbye bye']
sentences_split = [s.lower().split(' ') for s in sentences]

# min_count=2 keeps only words that occur at least twice in the corpus
model = gensim.models.Word2Vec(sentences_split, min_count=2)

# gensim >= 4 exposes the vector matrix as wv.vectors and the vocabulary as
# wv.index_to_key; older releases used wv.syn0 and wv.index2word instead
l = linkage(model.wv.vectors, method='complete', metric='seuclidean')

# draw the full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.ylabel('word')
plt.xlabel('distance')

dendrogram(
    l,
    leaf_rotation=90.,  # rotates the leaf (word) labels
    leaf_font_size=16.,  # font size for the leaf (word) labels
    orientation='left',  # words on the y axis, distance on the x axis
    leaf_label_func=lambda v: str(model.wv.index_to_key[v])
)
plt.show()
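
If you already have the vectors in a plain NumPy array plus a parallel list of words, as in the question, you probably do not need the gensim model at all. Here is a minimal sketch of that case. The four 2-D vectors are made-up stand-in data, not real word2vec output, and the choices of average linkage, cosine distance, and the helper name plot_word_dendrogram are my assumptions rather than anything from the original post.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Stand-in data (assumption): four tiny 2-D "word vectors" in place of the
# 1000 x 100 array described in the question.
words = ["hello", "hi", "bye", "good_bye"]
X = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [-1.0, 0.1],
    [-0.9, 0.2],
])

# Build the hierarchy: average linkage on cosine distance (one possible choice).
Z = linkage(X, method="average", metric="cosine")

# Cut the tree into at most 2 flat clusters, comparable to K-Means labels.
flat_labels = fcluster(Z, t=2, criterion="maxclust")

# Group the words by their flat cluster label.
clusters = {}
for word, label in zip(words, flat_labels):
    clusters.setdefault(int(label), []).append(word)

# no_plot=True only computes the layout; "ivl" holds the leaf labels in
# the order they would appear along the leaf axis.
tree = dendrogram(Z, labels=words, orientation="left", no_plot=True)


def plot_word_dendrogram(Z, labels):
    """Draw the dendrogram with the words as leaf labels (needs matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 6))
    dendrogram(Z, labels=labels, orientation="left", ax=ax)
    ax.set_xlabel("distance")
    ax.set_ylabel("word")
    plt.show()
```

Calling plot_word_dendrogram(Z, words) draws the tree. With the real data you would pass your own X and words instead; for 1000 leaves a much taller figure, or dendrogram's truncate_mode option, may be needed to keep the labels readable.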

【Comments】: