围绕特定节点的图集群节点答案

【问题标题】：cluster nodes of graph around specific nodes围绕特定节点的图集群节点
【发布时间】：2019-03-29 10:55:51
【问题描述】：

考虑到来自networkx的节点图，我如何应用所有节点的kmean集群，其中特定节点被视为集群的质心。换句话说，假设我们有这个图表：

import networkx as nx

s = [0,3,2,3,4,5,1]
t = [1,2,7,4,6,6,5]
dist = [3,2,5,1,5,4,2]

G = nx.Graph()
for i in range(len(s)):
    G.add_edge(s[i],t[i],weight=dist[i])

我想在网络上应用 kmean 聚类，例如，我选择质心为 3 和 6，图将被相应地聚类以生成两个子图（或与我输入一样多的质心）

我一直在查看https://www.learndatasci.com/tutorials/k-means-clustering-algorithms-python-intro/ 此处的 kmean 聚类，它没有涵盖输入的质心，而是只考虑没有质心节点的聚类数。

【问题讨论】：

通常你需要一个可以应用的度量来测量质心和任何数据点之间的距离。在这种情况下，您的指标是什么？
因为我选择的是图中节点的中心，所以距离只是 nx.shortest_path_length(G, 3, G.nodes())
你的代码能更明确一点吗？
抱歉，这不是很可读：P 我们的想法是将每个节点与它最接近的质心相关联，对吧？！
是的，这正是我要找的！

标签： python python-3.x networkx k-means

【解决方案1】：

请注意，您不能直接将 k-means 聚类应用于网络，因为不一定存在衡量节点和质心之间距离的指标。但是...

.. 假设你假设：

加权最短路径的路径长度是一对节点之间的距离度量。
质心是节点。注意：在传统的 k-means 聚类中，centroids 本身不一定是数据点。

在这些假设下，如果将具有最短加权最短路径的质心与每个节点相关联，则到质心的距离总和是最小的。

所以程序可能是：

将每个节点与一个质心相关联，以使每个节点到其质心的距离总和最小（即距离的 withing 聚类总和）
更新质心
重复前两个步骤，直到质心稳定。

这个过程大致对应k-mean clustering的过程，即最小化簇内平方和（WCSS）。

虽然此过程类似于度量空间中数据点的 k-means 聚类，但我不会将其称为 k-means 聚类。特别是因为质心的位置仅限于网络中的节点。

以下是你可以如何使用 python 来解决这个问题：

1. 定义初始质心：

centroids = [3, 6]

2. 对于每个节点，获取到所有质心的所有最短路径。

例如：

shortest_paths = [[(cent, nx.shortest_path(
                 G, source=n ,target=cent, weight='weight'
             )) for cent in centroids] for n in G.nodes
             ]

这给出（在这里它们与质心的 id 一起报告）：

In [26]: shortest_paths                                                         
Out[26]: 
[[(3, [0, 1, 5, 6, 4, 3]), (6, [0, 1, 5, 6])],
[(3, [1, 5, 6, 4, 3]), (6, [1, 5, 6])],
[(3, [3]), (6, [3, 4, 6])],
[(3, [2, 3]), (6, [2, 3, 4, 6])],
[(3, [7, 2, 3]), (6, [7, 2, 3, 4, 6])],
[(3, [4, 3]), (6, [4, 6])],
[(3, [6, 4, 3]), (6, [6])],
[(3, [5, 6, 4, 3]), (6, [5, 6])]]

3. 计算实际距离，即对所有节点的所有最短路径求和路径上的权重：

例如：

distances = [
    [
        (
            sp[0],  # this is the id of the centroid
            sum([
                G[sp[1][i]][sp[1][i+1]]['weight'] 
                for i in range(len(sp[1]) - 1)
            ]) if len(sp[1]) > 1 else 0
        ) for sp in sps
    ] for sps in shortest_paths
    ]

所以距离是：

In [28]: distances                                                              
Out[28]: 
[[(3, 15), (6, 9)],
[(3, 12), (6, 6)],
[(3, 0), (6, 6)],
[(3, 2), (6, 8)],
[(3, 7), (6, 13)],
[(3, 1), (6, 5)],
[(3, 6), (6, 0)],
[(3, 10), (6, 4)]]

4. 获取所有节点距离最小的质心：

例如：

closest_centroid = [
    min(dist, key=lambda d: d[1])[0] for dist in distances
]

根据质心进行分组：

In [30]: closest_centroid                                                       
Out[30]: [6, 6, 3, 3, 3, 3, 6, 6]

5. 更新质心，因为当前的质心可能不再是组的实际质心：

方法：

# for each group
    # for each member of the group
        # get the distance of shortest paths to all the other members of the group
        # sum this distances
    # find the node with the minimal summed distance > this is the new centroid of the group

迭代：如果新质心与旧质心不同，则使用新质心并重复步骤 2.- 5。

最后一步：如果在步骤 5. 中找到的新质心与旧质心相同，或者您已达到迭代限制，将离每个节点最近的质心：

例如：

nodes = [n for n in G]  # the actual id of the nodes
cent_dict = {nodes[i]: closest_centroid[i] for i in range(len(nodes))}
nx.set_node_attributes(G, cent_dict, 'centroid')

或者nx.set_node_attributes(G, 'centroid', cent_dict)，如果你还在v1.x。

这将是一种对网络进行某种 k-means 聚类的方法。

希望对您有所帮助并祝您编码愉快！

【讨论】：