Find all descendants for points in Python: answers

【Question title】: Find all descendants for points in Python
【Posted】: 2018-04-17 20:40:13
【Question】:

I need to get all descendant points of the links given as side_a - side_b (in one dataframe), up to the end point for each side_a (in another dataframe). So:

df1:
side_a   side_b
  a        b
  b        c
  c        d
  k        l
  l        m
  l        n
  p        q
  q        r
  r        s

df2:
side_a    end_point
  a          c
  b          c
  c          c
  k          m
  k          n
  l          m
  l          n
  p          s
  q          s
  r          s

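The point is to get, for each side_a value, all points up to the end_point listed for that value in df2. If a value has two end_point values (like "k"), there should be two lists.

I have some code, but it isn't written with this approach: it drops every row from df1 where df1['side_a'] == df2['end_points'], and that causes problems. If anyone wants me to post the code, I will, of course.

The desired output would look like this: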

side_a    end_point
  a          [b, c]
  b          [c]
  c          [c]
  k          [l, m]
  k          [l, n]
  l          [m]
  l          [n]
  p          [q, r, s]
  q          [r, s]
  r          [s]

One more thing: if both sides are the same, that point doesn't need to be listed at all. I can append it later, whichever is easier.

import pandas as pd
import numpy as np
import itertools

def get_child_list(df, parent_id):
    list_of_children = []
    list_of_children.append(df[df['side_a'] == parent_id]['side_b'].values)
    for c_, r_ in df[df['side_a'] == parent_id].iterrows():
        if r_['side_b'] != parent_id:
            list_of_children.append(get_child_list(df, r_['side_b']))

    # to flatten the list 
    list_of_children =  [item for sublist in list_of_children for item in sublist]
    return list_of_children

new_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
for index, row in df1.iterrows():
    temp_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
    temp_df['list_of_children'] = pd.Series(get_child_list(df1, row['side_a']))
    temp_df['side_a'] = row['side_a']

    new_df = new_df.append(temp_df)

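The problem with this code is that it only works if I drop the rows from df2 where side_a equals end_point. I don't know how to implement the condition that the traversal should stop, and go no further, once it reaches the df2 end_point in the side_b column.

Any help or hint is really welcome. Thanks in advance.

【Comments】:

You do remember this isn't a "please write the code for me" site, right? Can you show us your work? What exactly is the problem with your code?
@rsm As I said, I can post my code, but it would make the post huge and I don't think it would be of use to anyone helping. You could just write a comment saying I need to add my code and I would; just don't be arrogant.
Can you show us your work and add whatever relevant (!) code you have? And explain the problem you're running into? If you want us to come up with the algorithm and the implementation for your problem, that's not how this site works.
@jovicbg There is a typo in df2: the end_point for q should be r, not s. I may have a simple solution for your problem, but it performs poorly on large dataframes. Roughly how big is your dataframe?
@QusaiAlothman Thanks, I've edited it. The end_point was right, but I had skipped a step (q - r). It's not very big, about 4000 rows.

Tags: python pandas recursion tree descendant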

【Solution 1】:

You can use the networkx library and a graph:

import networkx as nx
G = nx.from_pandas_edgelist(df1, source='side_a', target='side_b')
df2.apply(lambda x: [nx.shortest_path(G, x.side_a,x.end_point)[0],
                     nx.shortest_path(G, x.side_a,x.end_point)[1:]], axis=1)

Output:

  side_a  end_point
0      a     [b, c]
1      b        [c]
2      c         []
3      k     [l, m]
4      k     [l, n]
5      l        [m]
6      l        [n]
7      p  [q, r, s]
8      q     [r, s]
9      r        [s]
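
For readers who cannot add networkx, the same lookup can be sketched with the standard library alone. This is a minimal sketch rather than the answer's method: it assumes the links are directed from side_a to side_b, holds the rows as plain tuples instead of DataFrames, and uses a hypothetical helper name, path_to_end_point. It returns None when either node is unknown or the end point cannot be reached, so such pairs can be skipped instead of raising.

```python
from collections import deque

def path_to_end_point(edges, start, end):
    """Hypothetical helper: nodes after `start` on a shortest directed
    path to `end`, or None if a node is unknown or `end` is unreachable."""
    children = {}
    for a, b in edges:
        children.setdefault(a, []).append(b)
    nodes = set(children) | {b for _, b in edges}
    if start not in nodes or end not in nodes:
        return None
    queue = deque([[start]])   # breadth-first search over partial paths
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path[1:]    # drop the starting point itself
        for child in children.get(path[-1], []):
            if child not in seen:
                seen.add(child)
                queue.append(path + [child])
    return None

# Rows of df1 and df2 from the question, written as plain tuples.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("k", "l"), ("l", "m"),
         ("l", "n"), ("p", "q"), ("q", "r"), ("r", "s")]
pairs = [("a", "c"), ("b", "c"), ("c", "c"), ("k", "m"), ("k", "n"),
         ("l", "m"), ("l", "n"), ("p", "s"), ("q", "s"), ("r", "s")]

result = [(a, path_to_end_point(edges, a, e)) for a, e in pairs]
for side_a, path in result:
    print(side_a, path)
```

Because the search follows edges only from side_a to side_b, it cannot walk "backwards" through a link, which an undirected graph would allow.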

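【Comments】:

My data is numeric, but I converted it to strings and now I get an error: ('Source 7163 or target 1019 is not in G', u'occurred at index 5'). Is there any way to skip a row when an error like this is caught? Maybe there really is no target defined in df2 for some side_a.
Hrm... you may need to filter your df2 dataframe first, keeping only the rows where both side_a and end_point appear in the edge dataframe.
@ScottBoston I did, but it didn't help. Your code works well, but I don't know where the problem is.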

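【Solution 2】:

Your rules are inconsistent and your definitions are unclear, so you may need to add some constraints here and there, since it isn't clear exactly what you are asking. If you organize the data structure to fit the problem and build a more robust traversal function (shown below), it becomes easier to add or edit constraints as needed and to solve the problem completely.

Convert the df to a dict to better represent the tree structure

The problem is much simpler if you convert the data into a structure that fits the problem intuitively, rather than trying to solve it within the current structure.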

import pandas as pd

## Example dataframe
df = pd.DataFrame({'side_a':['a','b','c','k','l','l','p','q','r'],'side_b':['b','c','d','l','m','n','q','r','s']})

## Instantiate blank tree with every item
all_items = set(list(df['side_a']) + list(df['side_b']))
tree = {ii : set() for ii in all_items}

## Populate the tree with each row
for idx, row in df.iterrows():
    # Add the child as one element. list(row['side_b']) would split a
    # multi-character label into characters and fail on non-string labels.
    tree[row['side_a']].add(row['side_b'])
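
Node labels do not have to be single characters. Below is a minimal sketch of the same adjacency structure for integer or multi-character labels, assuming the links are available as (side_a, side_b) tuples; build_tree and the sample ids are hypothetical. Each child is added to its parent's set as one whole element, since wrapping a string in list() would split it into characters.

```python
def build_tree(edges):
    """Hypothetical helper: map every node to the set of its direct children."""
    tree = {}
    for parent, child in edges:
        tree.setdefault(parent, set()).add(child)
        tree.setdefault(child, set())   # leaves get an empty set
    return tree

# Hypothetical integer ids, not taken from the question's data.
numeric_tree = build_tree([(10, 11), (11, 12), (11, 13)])
print(numeric_tree)

# Multi-character string labels work the same way.
string_tree = build_tree([("node10", "node11")])
print(string_tree)
```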

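Traverse the tree

Now that the data structure is intuitive, this is much simpler. Any standard depth-first search (DFS) algorithm with path saving will do the job. I modified the one in the link to work with this example.

Edit: Reading it again, it looks like you have a condition that terminates the search at the end point (you need to be clearer in your question about what is input and what is output). You can adapt this to dfs_paths(tree, target, root) and change the termination condition so that only the correct paths are returned.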

## Standard DFS pathfinder
def dfs_paths(tree, root):
    stack = [(root, [root])]
    while stack:
        (node, path) = stack.pop()
        for nextNode in tree[node] - set(path):
            # Termination condition. 
            ### I set it to terminate search at the end of each path.
            ### You can edit the termination condition to fit the 
            ### constraints of your goal
            if not tree[nextNode]:
                yield set(path + [nextNode]) - {root}
            else:
                stack.append((nextNode, path + [nextNode]))
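
The end-point termination mentioned in the edit above can be sketched as follows. This is one possible reading, not the answer's own code: the name dfs_paths_to and its argument order are assumptions, and it yields ordered lists instead of sets so that the sequence of points is preserved. The search stops extending a path as soon as it reaches the target.

```python
def dfs_paths_to(tree, root, target):
    """Hypothetical variant: yield each simple path from root to target,
    as a list of nodes that excludes root itself."""
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        for next_node in tree.get(node, set()) - set(path):
            if next_node == target:
                # Termination condition: stop at the requested end point.
                yield path[1:] + [next_node]
            else:
                stack.append((next_node, path + [next_node]))

# Adjacency dict for part of the question's data.
tree = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set(),
        "k": {"l"}, "l": {"m", "n"}, "m": set(), "n": set()}

print(list(dfs_paths_to(tree, "a", "c")))
print(list(dfs_paths_to(tree, "k", "m")))
print(list(dfs_paths_to(tree, "k", "n")))
```

When root and target are the same point, nothing is yielded, which matches the question's remark that such a point does not need to be listed.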

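Build a dataframe from the generators we produced

If you are not comfortable with generators, you can write the DFS traversal so that it returns a list instead of a generator.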

set_a = []
end_points = []
gen_dict = [{ii:dfs_paths(tree,ii)} for ii in all_items]
for gen in gen_dict:
    for row in list(gen.values()).pop():
        set_a.append(list(gen.keys()).pop())
        end_points.append(row)
                      
## To dataframe
df_2 = pd.DataFrame({'set_a':set_a,'end_points':end_points}).sort_values('set_a')

Output

df_2[['set_a','end_points']]


set_a   end_points
a       {b, c, d}
b       {c, d}
c       {d}
k       {n, l}
k       {m, l}
l       {n}
l       {m}
p       {s, r, q}
q       {s, r}
r       {s}

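【Comments】:

I'll try it later, as soon as I can. The end_point column in df2 marks where the iteration for each side_a should stop. So the result for A should not contain D; in df2, A's end_point is C. It may be confusing because I also named the output column end_point. Sorry.
The comment I added should address that. Just add a parameter for the end point and make the termination condition nextNode == endPoint.
Does this work for numeric data types rather than strings?
I get an error: KeyError Traceback (most recent call last) in () 3 gen_dict = [{ii:dfs_paths(tree,ii)} for ii in all_items] 4 for gen in gen_dict: ----> 5 for row in list(gen.values()).pop(): 6 set_a.append(list(gen.keys()).pop()) 7 end_points.append(row) KeyError: '1'
I just ran the exact code as posted and it output the correct values. Did you name a variable list at some point?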

【Solution 3】:

If you can accept an extra import, this can be treated as a path problem on a graph and solved in a few lines with NetworkX:

import networkx

g = networkx.DiGraph(zip(df1.side_a, df1.side_b))

outdf = df2.apply(lambda row: [row.side_a, 
                               set().union(*networkx.all_simple_paths(g, row.side_a, row.end_point)) - {row.side_a}], 
                  axis=1)

outdf looks like this. Note that it contains sets rather than the lists in your desired output; this allows all paths to be combined in a simple way.

  side_a  end_point
0      a     {c, b}
1      b        {c}
2      c         {}
3      k     {l, m}
4      k     {l, n}
5      l        {m}
6      l        {n}
7      p  {r, q, s}
8      q     {r, s}
9      r        {s}
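
To illustrate why sets make the merge simple: set().union(*paths) flattens any number of paths into a single collection of nodes, and it also copes with the case where no path exists. The paths below are hypothetical, hard-coded stand-ins for what a path search might yield on a diamond-shaped graph; they are not part of the question's data.

```python
# Hypothetical example: two paths from "a" to "d" (a -> b -> d and a -> c -> d).
paths = [["a", "b", "d"], ["a", "c", "d"]]
source = "a"

# Union of every node on any path, minus the starting point.
nodes_on_paths = set().union(*paths) - {source}
print(sorted(nodes_on_paths))

# With no paths at all, the same expression yields an empty set.
no_paths = set().union(*[]) - {source}
print(no_paths)
```

The trade-off is that a set discards the order of the points along each path, so this form fits best when only membership matters.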

【Comments】: