Kmeans: Terms occurring in more than one cluster? (Answers)

【Question title】: Kmeans: Terms occurring in more than one cluster?
【Posted】: 2017-03-24 07:40:14
【Question】:

When using Kmeans with a TF-IDF vectorizer, is it possible to get terms that occur in more than one cluster?

Here is the sample dataset:

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

I use a TF-IDF vectorizer for feature extraction:

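from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
feature = vectorizer.fit_transform(documents)
# Vocabulary indexed by feature column; the original snippet used `terms`
# without defining it (use get_feature_names() on scikit-learn < 1.0)
terms = vectorizer.get_feature_names_out()
true_k = 3
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(feature)
# For each centroid, term indices sorted by descending weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
print("Top terms per cluster:")
for i in range(true_k):
    print("Cluster %d:" % i, end=' ')
    for ind in order_centroids[i, :10]:
        print(' %s,' % terms[ind], end=' ')
    print()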

When I cluster the documents with KMeans from scikit-learn, the result is as follows:

Top terms per cluster:
Cluster 0:  user,  eps,  interface,  human,  response,  time,  computer,  management,  engineering,  testing,
Cluster 1:  trees,  intersection,  paths,  random,  generation,  unordered,  binary,  graph,  interface,  human,
Cluster 2:  minors,  graph,  survey,  widths,  ordering,  quasi,  iv,  trees,  engineering,  eps,

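We can see that some terms occur in more than one cluster (for example, graph in clusters 1 and 2, and eps in clusters 0 and 2).

Is the clustering result wrong? Or is it acceptable because the tf-idf score of these terms differs from document to document?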

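【Comments】:

OK, you are using KMeans, but how do you vectorize your data?
With TfidfVectorizer, using the stop words from my language's stop-word list.
Can you show the code that performs the clustering?
I edited the question above :)
@ArdiTan Can you show how you obtained the top terms for each cluster?

Tags: python scikit-learn cluster-analysis k-means tf-idf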

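【Solution 1】:

I think you are a bit confused about what you are trying to do. The code you are using gives you a clustering of the documents, not of the terms. The terms are the dimensions over which you are clustering. Each centroid has a weight for every term, so the same term can show up in the top-term list of more than one cluster; that by itself does not mean the result is wrong.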

If you want to find out which cluster each document belongs to, just use the predict or fit_predict method, like this:

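from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
feature = vectorizer.fit_transform(documents)
true_k = 3
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(feature)
# predict() returns an array with one cluster label per document (row)
labels = km.predict(feature)
for n, label in enumerate(labels):
    print("Doc %d belongs to cluster %d. " % (n, label))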

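You will get something like this (the cluster numbering can differ between runs, because no random_state is fixed):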

Doc 0 belongs to cluster 2. 
Doc 1 belongs to cluster 1. 
Doc 2 belongs to cluster 2. 
Doc 3 belongs to cluster 2. 
Doc 4 belongs to cluster 1. 
Doc 5 belongs to cluster 0. 
Doc 6 belongs to cluster 0. 
Doc 7 belongs to cluster 0. 
Doc 8 belongs to cluster 1.
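
The fit_predict method mentioned above fits the model and returns the labels in a single call. A minimal sketch, assuming the same documents list as in the question; the docs_by_cluster grouping and random_state=0 are illustrative additions, and the cluster numbers it prints need not match the listing above.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words="english")
feature = vectorizer.fit_transform(documents)

# random_state=0 is an assumption added here so repeated runs agree
km = KMeans(n_clusters=3, init="k-means++", max_iter=100, n_init=1,
            random_state=0)

# Fit the model and get one cluster label per document in one step
labels = km.fit_predict(feature)

# Group the documents by the cluster they were assigned to
docs_by_cluster = defaultdict(list)
for doc, label in zip(documents, labels):
    docs_by_cluster[int(label)].append(doc)

for label in sorted(docs_by_cluster):
    print("Cluster %d:" % label)
    for doc in docs_by_cluster[label]:
        print("  " + doc)
```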

Take a look at the scikit-learn User Guide.

【Discussion】: