Plot the sklearn clusters in Python: answers

【Question title】: Plot the sklearn clusters in python
【Posted】: 2018-02-24 05:05:09
【Question】:

I have the following sklearn clusters, obtained using affinity propagation.

import sklearn.cluster
import numpy as np

sims =  np.array([[0, 17, 10, 32, 32], [18, 0, 6, 20, 15], [10, 8, 0, 20, 21], [30, 16, 20, 0, 17], [30, 15, 21, 17, 0]])

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(sims)

cluster_centers_indices = affprop.cluster_centers_indices_
labels = affprop.labels_
# number of clusters
n_clusters_ = len(cluster_centers_indices)

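Now I want to plot the output of the clustering. I am new to sklearn. Please suggest a suitable approach for plotting the clusters in Python. Is it possible to do this using pandas dataframes?

EDIT:

I used the code in sklearn directly, as pointed out by @MohammedKashif, as follows.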

import sklearn.cluster

import numpy as np

sims =  np.array([[0, 17, 10, 32, 32], [18, 0, 6, 20, 15], [10, 8, 0, 20, 21], [30, 16, 20, 0, 17], [30, 15, 21, 17, 0]])

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(sims)

cluster_centers_indices = affprop.cluster_centers_indices_
print(cluster_centers_indices)
labels = affprop.labels_
n_clusters_ = len(cluster_centers_indices)
print(n_clusters_)

import matplotlib.pyplot as plt
from itertools import cycle

plt.close('all')
plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = sims[cluster_centers_indices[k]]
    plt.plot(sims[class_members, 0], sims[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in sims[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

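However, the plot I get looks a bit odd: the second cluster point (green) lies on the blue line, so I think it should not be a separate cluster and should belong to the blue cluster instead. Please let me know if I have made any mistakes in the code.

EDIT 2

As pointed out by σηγ, I added: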

from sklearn.manifold import SpectralEmbedding

se = SpectralEmbedding(n_components=2, affinity='precomputed')
X = se.fit_transform(sims)
print(X)

However, for the array np.array([[0, 17, 10, 32, 32], [0, 17, 10, 32, 32], [0, 17, 10, 32, 33], [0, 17, 10, 32, 32], [0, 17, 10, 32, 32]]), it gave me 3 points in the plot. This confuses me, because all 5 arrays represent one point.

Please help me.

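【Question comments】:

You can check the example here for further reference: scikit-learn.org/stable/auto_examples/cluster/…
Yes, you have to change the variable names accordingly.
I'd say this looks as expected: you only have 5 data points, 2 of which are cluster centers, with the other 3 assigned to the top-left/blue cluster. So this plot is probably what I would expect. What were you expecting to see?
@Volka It looks to me as if the line just passes under that point rather than through it. You have clustered on 5 "features" but only plotted the first 2, so you are not seeing the full picture of why the points were clustered that way. Try plotting other combinations to see different views of the clusters, or look into something like PCA or TSNE to map your 5 features down to 2 for plotting.
@Volka sims looks like a similarity matrix, not an array of features or coordinates. If you want to visualize the data based on similarities, you should choose a method that works directly on the similarity matrix (e.g. SpectralEmbedding in sklearn).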
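
For the PCA route mentioned in the comments above, a minimal sketch might look like the following. It assumes that each row of the array is a feature vector, which the last comment disputes for sims, so treat it as an illustration of the projection only. The names features and X_2d are hypothetical and do not appear in the question.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumption: each row is a 5-dimensional feature vector, not a row of a
# similarity matrix. The values are copied from the question.
features = np.array([[0, 17, 10, 32, 32],
                     [18, 0, 6, 20, 15],
                     [10, 8, 0, 20, 21],
                     [30, 16, 20, 0, 17],
                     [30, 15, 21, 17, 0]], dtype=float)

# Project the 5 features onto the 2 directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(features)

print(X_2d.shape)                     # one 2-D coordinate per row
print(pca.explained_variance_ratio_)  # share of variance kept per axis
```

The two columns of X_2d could then be used in place of sims[:, 0] and sims[:, 1] in the plotting loop.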

Tags: python matplotlib machine-learning scikit-learn cluster-analysis

【Solution 1】:

Following the previous example, I would try something like this:

import sklearn.cluster
from sklearn.manifold import SpectralEmbedding
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle

sims =  np.array([[0, 17, 10, 32, 32], [18, 0, 6, 20, 15], [10, 8, 0, 20, 21], [30, 16, 20, 0, 17], [30, 15, 21, 17, 0]])

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(sims)

cluster_centers_indices = affprop.cluster_centers_indices_
print(cluster_centers_indices)
labels = affprop.labels_
n_clusters_ = len(cluster_centers_indices)
print(n_clusters_)

se = SpectralEmbedding(n_components=2, affinity='precomputed')
X = se.fit_transform(sims)

plt.close('all')
plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
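
Since the question also asks about pandas, here is a rough sketch of how the embedding and the labels could be collected in a DataFrame. The coordinates and labels below are placeholders standing in for X and labels from the code above, not its real output, and the column names x, y and cluster are hypothetical.

```python
import numpy as np
import pandas as pd

# Placeholder values standing in for the embedding X and the labels from the
# code above; they are NOT the real output of that code.
X = np.array([[0.10, 0.40],
              [0.20, 0.35],
              [0.15, 0.50],
              [-0.30, -0.20],
              [-0.40, -0.10]])
labels = np.array([0, 0, 0, 1, 1])

# One row per sample: 2-D coordinates plus the assigned cluster label.
df = pd.DataFrame(X, columns=["x", "y"])
df["cluster"] = labels

# Number of samples in each cluster.
cluster_sizes = df.groupby("cluster").size()
print(df)
print(cluster_sizes)
```

With matplotlib installed, df.plot.scatter(x="x", y="y", c="cluster", colormap="viridis") should then draw the points colored by cluster.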

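【Comments】:

Interesting! What exactly is SpectralEmbedding doing?
Spectral embedding (a.k.a. Laplacian eigenmaps) tries to find a low-dimensional representation of a high-dimensional dataset such that the local distances between points in the low-dimensional representation are close to their distances (or similarities) in the high-dimensional space (see Wikipedia).
In fact, many of the manifold learning methods in sklearn.manifold aim to do the same thing, but with different algorithms. However, most of them require a set of feature vectors or a distance matrix to work with.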
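
To make the idea above a bit more concrete, here is a small NumPy-only sketch of a Laplacian eigenmap. It is a simplified variant that uses the unnormalized graph Laplacian; as far as I know, scikit-learn's SpectralEmbedding works with a normalized Laplacian by default, so its coordinates will differ. The affinity matrix W is made up for illustration.

```python
import numpy as np

# Hypothetical symmetric affinity matrix: points 0-2 are strongly connected,
# points 3-4 are strongly connected, and the two groups are weakly linked.
W = np.array([[0.0, 10.0, 10.0, 0.1, 0.1],
              [10.0, 0.0, 10.0, 0.1, 0.1],
              [10.0, 10.0, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.1, 0.0, 10.0],
              [0.1, 0.1, 0.1, 10.0, 0.0]])

# Unnormalized graph Laplacian L = D - W, where D holds the row sums of W.
D = np.diag(W.sum(axis=1))
L = D - W

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(L)

# Skip the first (constant) eigenvector; the next two give 2-D coordinates.
embedding = eigenvectors[:, 1:3]
print(np.round(eigenvalues, 3))
print(embedding)
```

In the first embedding coordinate, points 0-2 share one sign and points 3-4 the other, which is how strongly connected points end up near each other.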
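@σηγ Thank you very much for the great answer. I tried your code with np.array([[0, 17, 10, 32, 32], [0, 17, 10, 32, 32], [0, 17, 10, 32, 33], [0, 17, 10, 32, 32], [0, 17, 10, 32, 32]]). Although these five arrays represent the same point, it shows 3 different points. Do you know why this happens?
I don't think SpectralEmbedding handles overlapping points well. In any case, this new array does not look like a similarity matrix (if the arrays describe the same point, why are the diagonal elements, which should correspond to self-similarity, not equal?). If these are actually feature vectors, you could replace the SpectralEmbedding part with another projection, e.g. X = sklearn.manifold.MDS(n_components=2).fit_transform(new_array). The result should be a plot with only one data point.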
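
The MDS suggestion above, written out as a runnable sketch. It assumes the rows are feature vectors, as the comment says; random_state=0 is my addition to make repeated runs reproducible. Note that the third row differs from the others in its last entry (33 instead of 32), so I would expect four points to land almost on top of each other with the fifth about one unit away, rather than exactly one point.

```python
import numpy as np
from sklearn.manifold import MDS

# Array from the comment above, treated here as five feature vectors
# (an assumption; it is not a valid similarity matrix).
new_array = np.array([[0, 17, 10, 32, 32],
                      [0, 17, 10, 32, 32],
                      [0, 17, 10, 32, 33],
                      [0, 17, 10, 32, 32],
                      [0, 17, 10, 32, 32]], dtype=float)

# Metric MDS looks for 2-D coordinates whose pairwise distances approximate
# the Euclidean distances between the rows. random_state fixes the random
# initialization so repeated runs give the same layout.
mds = MDS(n_components=2, random_state=0)
X = mds.fit_transform(new_array)
print(X)
```

X can then be passed to the same plotting loop as in the answer.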