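【Posted】: 2018-11-22 10:40:40
【Question】:
I found the following document-clustering code at https://pythonprogramminglanguage.com/kmeans-text-clustering/. I understand the k-means algorithm as a whole, but I am having trouble understanding what the top terms per cluster represent and how they are computed. Are they the most frequent words in the cluster? A blog post I read said that the words printed at the end are the "top n words closest to the cluster centroid", but what does it mean for an actual word to be "closest" to a centroid? I would really like to understand the details and nuances of what is going on. Thanks!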
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

# Turn each document into a TF-IDF vector (one column per vocabulary term).
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

print("Top terms per cluster:")
# For each centroid, column indices sorted by descending centroid weight.
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i, end="")
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end="")
    print()
【Discussion】:
Tags: python scikit-learn k-means topic-modeling centroid