【Posted】: 2017-03-24 07:40:14
【Question】:
When using KMeans with a TF-IDF vectorizer, is it possible for a term to appear in more than one cluster?
Here is the sample dataset:
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
I use a TF-IDF vectorizer for feature extraction:
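from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Build the TF-IDF matrix (one row per document, one column per term)
vectorizer = TfidfVectorizer(stop_words='english')
feature = vectorizer.fit_transform(documents)

true_k = 3
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(feature)

# Column indices of each centroid, sorted by descending weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# Vocabulary terms; on scikit-learn < 1.0 use get_feature_names() instead
terms = vectorizer.get_feature_names_out()

print("Top terms per cluster:")
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s,' % terms[ind], end='')
    print()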
When I cluster the documents with KMeans from scikit-learn, the result is as follows:
Top terms per cluster:
Cluster 0: user, eps, interface, human, response, time, computer, management, engineering, testing,
Cluster 1: trees, intersection, paths, random, generation, unordered, binary, graph, interface, human,
Cluster 2: minors, graph, survey, widths, ordering, quasi, iv, trees, engineering, eps,
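We can see that some terms appear in more than one cluster (for example, graph in clusters 1 and 2, and eps in clusters 0 and 2).
Is the clustering result wrong? Or is it acceptable because the tf-idf scores of these terms differ from document to document?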
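【Comments】:
-
OK, you are using KMeans, but how do you vectorize your data?
-
Using TfidfVectorizer with stop words from my language's stop word list.
-
Can you show the code you use to perform the clustering?
-
I edited the question above :)
-
@ArdiTan Can you show how you obtained the top terms for each cluster?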
Tags: python scikit-learn cluster-analysis k-means tf-idf