【Question title】: Figuring out the percentage/probability a string belongs in a cluster?
【Posted】: 2019-07-25 05:56:35
【Question】:

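I have a KMeans clustering script that organizes some documents based on their text content. Each document falls into 1 of 3 clusters, but the assignment seems very "yes" or "no"; I'd like to be able to see how relevant each document is to its cluster.

E.g. Document A is in cluster 1 with a 90% match, and Document B is in cluster 1 but with only a 45% match.

That way I can set some kind of threshold and say that I only want documents at 80% or higher.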

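import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

dict_of_docs = {
    'Document A': 'some text content',
    # ...
    'Document Z': 'some more text content',
}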

# Vectorizing the data, my data is held in a Dict, so I just want the values.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values())
X = X.toarray()


# 3 clusters, as I know that there are 3; otherwise use the elbow method
# Then fit k-means on the vectorized data
NUMBER_OF_CLUSTERS = 3
km = KMeans(
    n_clusters=NUMBER_OF_CLUSTERS,
    init='k-means++',
    max_iter=500)
km.fit(X)


# First: for every document we get its corresponding cluster
clusters = km.predict(X)

# We train the PCA on the dense version of the tf-idf.
pca = PCA(n_components=2)
two_dim = pca.fit_transform(X)

scatter_x = two_dim[:, 0] # first principal component
scatter_y = two_dim[:, 1] # second principal component

plt.style.use('ggplot')

fig, ax = plt.subplots()
fig.set_size_inches(20,10)

# color map for NUMBER_OF_CLUSTERS we have
cmap = {0: 'green', 1: 'blue', 2: 'red'}


# group by clusters and scatter plot every cluster
# with a colour and a label
for group in np.unique(clusters):
    ix = np.where(clusters == group)
    ax.scatter(scatter_x[ix], scatter_y[ix], c=cmap[group], label=group)

ax.legend()
plt.xlabel("PCA 0")
plt.ylabel("PCA 1")
plt.show()

order_centroids = km.cluster_centers_.argsort()[:, ::-1]

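# Print out top terms for each cluster
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
terms = vectorizer.get_feature_names_out()
for i in range(NUMBER_OF_CLUSTERS):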
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

for doc in dict_of_docs:
    text = dict_of_docs[doc]
    Y = vectorizer.transform([text])
    prediction = km.predict(Y)
    print(prediction, doc)

【Comments on the question】:

Tags: python cluster-analysis k-means

【Solution 1】:

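I don't think it's possible to do exactly what you want, because k-means isn't really a probabilistic model, and its scikit-learn implementation (which I assume you're using) simply doesn't provide the right interface.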

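One option I'd suggest is the KMeans.score method, which doesn't give a probabilistic output but returns a score that is greater the closer a point is to its nearest cluster center. You could threshold on this, e.g. "Document A is in cluster 1 with a score of -0.01, so I keep it" or "Document B is in cluster 2 with a score of -1000, so I ignore it".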
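A minimal sketch of that thresholding idea follows. KMeans.score returns a single number for whatever array it is given (the negative sum of squared distances to the nearest centers), so to get one value per document it has to be called one row at a time. The toy 2-D array `X` stands in for the TF-IDF matrix, and `THRESHOLD` is a hypothetical cutoff; both are assumptions for illustration, not values from the question.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data standing in for the TF-IDF matrix (assumption for illustration).
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # tight group near the origin
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # tight group near (5, 5)
    [1.5, 1.5],                            # a point that fits neither group well
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.predict(X)

# Score one row at a time: each value is minus the squared distance
# from that row to its nearest cluster center (0 is a perfect fit).
scores = np.array([km.score(row.reshape(1, -1)) for row in X])

# Same numbers without the Python loop: transform() gives the distance
# from every row to every center.
vectorized = -km.transform(X).min(axis=1) ** 2

THRESHOLD = -0.5  # hypothetical cutoff; it depends entirely on the data's scale
keep = scores >= THRESHOLD

for label, score, kept in zip(labels, scores, keep):
    print(f"cluster={label} score={score:.4f} keep={kept}")
```

These scores are unbounded squared distances, not percentages, so a cutoff such as "80%" has no direct equivalent; the threshold has to be chosen by looking at the score distribution of the actual documents.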

Another option is to use a GaussianMixture model instead. A Gaussian mixture is a model very similar to k-means, and it provides the probabilities you want via GaussianMixture.predict_proba.
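As a rough sketch of the Gaussian mixture route, the snippet below fits a two-component model on synthetic 2-D points and keeps only the rows whose highest membership probability reaches 0.8. The synthetic data, the 0.8 cutoff, and the variable names are assumptions for illustration; with real TF-IDF vectors the dimensionality is much higher, and a setting such as `covariance_type="diag"` may be worth trying.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the document vectors (assumption for illustration):
# two well-separated blobs plus one point halfway between them.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=4.0, scale=0.3, size=(50, 2)),
    [[2.0, 2.0]],
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# One row per sample, one column per component; each row sums to 1.
proba = gmm.predict_proba(X)

labels = proba.argmax(axis=1)      # most likely component per sample
confidence = proba.max(axis=1)     # probability of that component

MIN_PROBA = 0.8  # hypothetical cutoff, mirroring the "80% or higher" in the question
keep = confidence >= MIN_PROBA

print(f"kept {keep.sum()} of {len(X)} samples")
```

Here `confidence` plays the role of the "percent match" asked about in the question, with the caveat that it is a probability under the fitted mixture rather than a property of k-means.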

【Comments】: