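Create unique ids for a group: answers

【Question title】: Create unique ids for a group
【Posted】: 2017-02-10 18:52:53
【Question description】:

I am working on a problem where I have to group related items and assign each group a unique id. I have written the code in Python, but it does not return the expected output. I need help refining my logic. The code is below: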

data = {}
child_list = []


for index, row in df.iterrows():
    parent = row['source']
    child = row['target']
    #print 'Parent: ', parent
    #print 'Child:', child
    child_list.append(child)
    #print child_list
    if parent not in data.keys():
        data[parent] = []
    if parent != child:
        data[parent].append(child)
    #print data

op = {}
gid = 0


def recursive(op,x,gid):
    if x in data.keys() and data[x] != []:
        for x_child in data[x]:
            if x_child in data.keys():
                op[x_child] = gid
                recursive(op,x_child,gid)
            else:
                op[x] = gid
    else:
        op[x] = gid


for key in data.keys():
    #print "Key: ", key
    if key not in child_list:
        gid = gid + 1
        op[key] = gid
        for x in data[key]:
            op[x] = gid
            recursive(op,x,gid)

related = pd.DataFrame({'items':op.keys(),
                  'uniq_group_id': op.values()})
related.sort_values('items')

Example:

Input:
source  target
a        b
b        c
c        c
c        d
d        d
e        f
a        d
h        a
i        f  

Desired Output:
item     uniq_group_id
a         1 
b         1
c         1
d         1
h         1
e         2
f         2
i         2

My code gives the wrong output, shown below.

item    uniq_group_id
a       3
b       3
c       3
d       3
e       1
f       2
h       3
i       2

Another example:

Input:
df = pd.DataFrame({'source': ['a','b','c','c','d','e','a','h','i','a'],
                'target':['b','c','c','d','d','f','d','a','f','a']})
Desired Output:
item    uniq_group_id
a       1
b       1
c       1
d       1
e       2
f       2

My code Output:
item    uniq_group_id
e   1
f   1

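The order of the rows and the particular group id values do not matter. What matters is that related items are assigned the same unique identifier. The whole problem is to find the groups of related items and assign each group a unique group id.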

【Question discussion】:

Tags: python

【Solution 1】:

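I haven't analyzed your code closely, but it looks like the error comes from the way you populate the data dictionary. It stores each child as a neighbor of its parent, but it also needs to store the parent as a neighbor of the child.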
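To illustrate that point, here is a small sketch using two edges from the question's input. The names `directed` and `undirected` are made up for this illustration; `directed` mimics how the question's `data` dictionary is filled.

```python
from collections import defaultdict

edges = [('e', 'f'), ('i', 'f')]

# One direction only: parent -> child, as in the question's `data` dict.
directed = defaultdict(list)
for parent, child in edges:
    directed[parent].append(child)

# Both directions: each endpoint lists the other as a neighbor.
undirected = defaultdict(set)
for parent, child in edges:
    undirected[parent].add(child)
    undirected[child].add(parent)

# 'f' has no entry of its own here, so nothing leads from 'e' to 'i'.
print(dict(directed))
# With both directions stored, 'f' links 'e' and 'i' together.
print(sorted(undirected['f']))
```

With only the one-directional mapping, 'e' and 'i' each look like the root of a separate group, which would explain why related items end up with different ids.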

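Rather than trying to fix your code, I decided to adapt this pseudocode written by Aseem Goyal. The code below takes its input data from plain Python lists, but it should be easy to adapt to a Pandas DataFrame.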

''' Find all the connected components of an undirected graph '''

from collections import defaultdict

src = ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a']
tgt = ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']

nodes = sorted(set(src + tgt))
print('Nodes', nodes)

neighbors = defaultdict(set)
for u, v in zip(src, tgt):
    neighbors[u].add(v)
    neighbors[v].add(u)

print('Neighbors')
for n in nodes:
    print(n, neighbors[n])

visited = {}
def depth_first_traverse(node, group_id):
    for n in neighbors[node]:
        if n not in visited:
            visited[n] = group_id
            depth_first_traverse(n, group_id)

print('Groups')
group_id = 1
for n in nodes:
    if n not in visited:
        visited[n] = group_id
        depth_first_traverse(n, group_id)
        group_id += 1
    print(n, visited[n])

Output

Nodes ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'i']
Neighbors
a {'a', 'd', 'b', 'h'}
b {'a', 'c'}
c {'d', 'b', 'c'}
d {'d', 'a', 'c'}
e {'f'}
f {'i', 'e'}
h {'a'}
i {'f'}
Groups
a 1
b 1
c 1
d 1
e 2
f 2
h 1
i 2

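This code was written for Python 3, but it also runs on Python 2. If you do run it on Python 2, you should add from __future__ import print_function at the top of the import statements; it isn't strictly necessary, but it makes the output look nicer.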
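If the graph contains very long chains, the recursive traversal could run into Python's recursion limit. Below is a minimal sketch of the same connected-components idea with an explicit stack, wrapped in a function. The name `assign_group_ids` and its parameters are assumptions made for this example, not part of the answer above.

```python
from collections import defaultdict


def assign_group_ids(sources, targets):
    """Map every item to the id of its connected component (ids start at 1)."""
    # Undirected adjacency: each edge is recorded in both directions.
    neighbors = defaultdict(set)
    for u, v in zip(sources, targets):
        neighbors[u].add(v)
        neighbors[v].add(u)

    group_of = {}
    next_id = 1
    for start in sorted(neighbors):
        if start in group_of:
            continue
        # Explicit stack instead of recursion.
        group_of[start] = next_id
        stack = [start]
        while stack:
            node = stack.pop()
            for n in neighbors[node]:
                if n not in group_of:
                    group_of[n] = next_id
                    stack.append(n)
        next_id += 1
    return group_of


# With a DataFrame this could be called as
# assign_group_ids(df['source'], df['target']).
src = ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a']
tgt = ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']
groups = assign_group_ids(src, tgt)
for item in sorted(groups):
    print(item, groups[item])
```

The returned dictionary maps each item to its group id, so it can be turned into the two-column table from the question if needed.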

【Discussion】:

Thanks. This logic works for my use case.

【Solution 2】:

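You can use the Union-Find, or Disjoint-Sets algorithm for this. See this related answer for a more complete explanation. Basically, you need two functions, union and find, to build a tree (i.e. a nested dictionary) of leaders, or predecessors: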

import collections

leaders = collections.defaultdict(lambda: None)

def find(x):
    l = leaders[x]
    if l is not None:
        l = find(l)
        leaders[x] = l
        return l
    return x

def union(x, y):
    lx, ly = find(x), find(y)
    if lx != ly:
        leaders[lx] = ly

You can apply this to your problem as follows:

import pandas as pd

df = pd.DataFrame({'source': ['a','b','c','c','d','e','a','h','i','a'],
                   'target': ['b','c','c','d','d','f','d','a','f','a']})

# build the tree
for _, row in df.iterrows():
    union(row["source"], row["target"])

# build groups based on leaders
groups = collections.defaultdict(set)
for x in leaders:
    groups[find(x)].add(x)
for num, group in enumerate(groups.values(), start=1):
    print(num, group)

Result:

1 {'e', 'f', 'i'}
2 {'h', 'a', 'c', 'd', 'b'}
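The question asks for one group id per item rather than one set per group. Here is a minimal sketch of the same union-find idea that returns such a mapping. It uses an iterative find with path compression instead of the recursive one above; the function name `group_ids_union_find` and the numbering of groups in sorted item order are assumptions made for this example.

```python
def group_ids_union_find(pairs):
    """Return {item: group_id}; ids are numbered 1, 2, ... in sorted item order."""
    leader = {}

    def find(x):
        leader.setdefault(x, x)  # a new item starts as its own leader
        root = x
        while leader[root] != root:
            root = leader[root]
        # Path compression: repoint every item on the path at the root.
        while leader[x] != root:
            parent = leader[x]
            leader[x] = root
            x = parent
        return root

    # Merge the groups of the two endpoints of every pair.
    for x, y in pairs:
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            leader[root_x] = root_y

    # Give each distinct root the next free id.
    group_ids = {}
    result = {}
    for item in sorted(leader):
        root = find(item)
        if root not in group_ids:
            group_ids[root] = len(group_ids) + 1
        result[item] = group_ids[root]
    return result


src = ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a']
tgt = ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']
result = group_ids_union_find(zip(src, tgt))
print(result)
```

With a DataFrame, the pairs could come from zip(df['source'], df['target']).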
