Clustering text documents using scikit-learn kmeans in Python

[Question title]: Clustering text documents using scikit-learn kmeans in Python
[Posted]: 2015-03-09 12:06:00
[Question]:

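I need to use scikit-learn's kMeans to cluster text documents. The example code works fine as it is, but it takes some 20newsgroups data as input. I want to use the same code to cluster a list of documents, as shown below: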

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

What changes do I need to make in the kMeans example code to use this list as input? (Simply setting "dataset = documents" doesn't work.)

[Comments]:

The link you provided is not working.

Tags: python python-2.7 scikit-learn cluster-analysis k-means

[Solution 1]:

This is a simpler example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

Vectorize the text, i.e. convert the strings to numeric features:

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

Cluster the documents:

true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

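Print the top terms per cluster:

print("Top terms per cluster:")
# For each centroid, sort the feature indices by weight, largest first
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# scikit-learn >= 1.0; on older releases use vectorizer.get_feature_names()
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    # Join the ten heaviest terms so that one print call works on Python 2 and 3
    top_terms = " ".join(terms[ind] for ind in order_centroids[i, :10])
    print("Cluster %d: %s" % (i, top_terms))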

If you want a more visual idea of how this looks, see this answer.
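The fitted model can also tell you which cluster each document landed in, and assign new text to a cluster. The following is a minimal sketch rather than part of the original answer: `random_state=0` and the two strings in `new_docs` are assumptions added for illustration, and the cluster numbering is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# random_state is an added assumption so that repeated runs give the same result
model = KMeans(n_clusters=2, init="k-means++", max_iter=100, n_init=1,
               random_state=0)
model.fit(X)

# labels_[i] is the cluster index assigned to documents[i]
for label, doc in zip(model.labels_, documents):
    print("%d  %s" % (label, doc))

# New text must go through the already fitted vectorizer:
# transform(), not fit_transform(), so the vocabulary stays the same
new_docs = ["user interface response time", "graph of trees"]  # made-up inputs
predicted = model.predict(vectorizer.transform(new_docs))
print(predicted)
```

With only nine short documents the split can change between runs if `random_state` is left unset, so treat the labels as exploratory.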

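[Discussion]:

Thank you, but it gives me a syntax error at end='' and print() in the print commands... how do I get it working? :s
Oh, that's because I'm on Python 3, so I edited my answer.
@elyase: How can I change this code to get the center sentence of each cluster?
@Crista23, that is not possible. The sentences are first converted to numeric vectors (a bag-of-words representation) and then clustered, but this conversion doesn't preserve word order (among other problems), so you can't go back from a centroid vector to a sentence. You would have to get creative to recover "something" from the centroids.
It's not clear how to cluster sentences rather than words in this case. In this example the word clustering works well, but the sentence clustering does not.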
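One way to recover "something" from a centroid, in the spirit of the reply above, is to report the existing document that lies closest to it instead of trying to rebuild a sentence. This is a sketch under stated assumptions, not the answerer's method: it uses `KMeans.transform` for the distance from each document to each centroid, and `random_state=0` and the `representatives` dictionary are names introduced only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)
model = KMeans(n_clusters=2, init="k-means++", max_iter=100, n_init=1,
               random_state=0).fit(X)

# Distance from every document to every centroid, shape (n_documents, n_clusters)
distances = model.transform(X)

representatives = {}  # hypothetical helper: cluster index -> closest member document
for k in range(model.n_clusters):
    members = np.where(model.labels_ == k)[0]
    if len(members) == 0:
        continue  # skip a cluster that ended up with no documents
    # Among the documents assigned to cluster k, pick the one nearest its centroid
    closest = members[np.argmin(distances[members, k])]
    representatives[k] = documents[closest]
    print("Cluster %d: %s" % (k, documents[closest]))
```

The result is an actual input sentence, so it stays readable, but it is only the nearest member and not a summary of the cluster.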