Python pandas: summarize round trips in a dataframe (answers)

【Question title】: Python pandas summarize round trip in dataframe
【Posted】: 2021-04-01 12:47:35
【Question description】:

I have a dataframe (about 30,000 rows) of trip counts by station code.

|station from|station to|count|
|:-----------|:---------|:----|
|20001       |20040     |55   |
|20040       |20001     |67   |
|20007       |20080     |100  |
|20080       |20007     |50   |

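How can I get a df in which, for routes that have return trips, the return count is moved into its own column and the extra return rows are removed, for example: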

|station from|station to|count|count_back|
|:-----------|:---------|:----|:---------|
|20001       |20040     |55   |67        |
|20007       |20080     |100  |50        |

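My solution is:

Copy the dataframe
Build a composite key, swapping the departure and destination stations in the copied dataframe
Merge
Drop the unnecessary columns and rows.

But this seems very inefficient.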
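
For reference, the four steps described above could look roughly like the sketch below. It is only an illustration, not the asker's actual code: the names `back`, `merged`, `is_extra` and `result` are made up here, and it assumes each (station from, station to) pair occurs at most once.

```python
import pandas as pd

df = pd.DataFrame({
    "station from": [20001, 20040, 20007, 20080],
    "station to": [20040, 20001, 20080, 20007],
    "count": [55, 67, 100, 50],
})

# Copy the frame with departure and destination swapped, so each row of
# `back` describes the return leg of a row in `df`.
back = df.rename(columns={
    "station from": "station to",
    "station to": "station from",
    "count": "count_back",
})

# Attach the return count to every trip; trips without a return leg get NaN.
merged = df.merge(back, on=["station from", "station to"], how="left")

# Keep one row per pair: drop the row with from > to when its mirror exists.
is_extra = (merged["station from"] > merged["station to"]) & merged["count_back"].notna()
result = merged[~is_extra].reset_index(drop=True)
print(result)
```

A left merge is used so that one-way trips survive with a missing `count_back`. If a pair can occur more than once, the merge would multiply rows, so the counts would have to be aggregated first.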

【Comments】:

Tags: python pandas dataframe

【Solution 1】:

Here is a simple solution that also handles trips with no return trip.

import pandas as pd
import numpy as np
df = pd.DataFrame({"station from":[20001,20040,20007,20080, 2, 3],
                   "station to":[20040,20001,20080,20007, 1, 4],
                   "count":[55,67,100,50, 20, 40]})
df

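# index by the (from, to) pair so a trip can be looked up by its reversed pair
df = df.set_index(["station from", "station to"])
# count of the reverse trip; Series.get returns None instead of raising when there is no return trip
df["count_back"] = df.apply(lambda row: df["count"].get((row.name[::-1])), axis=1)
# mark the second row of each pair: from > to and the reversed pair is present
mask_rows_to_delete = df.apply(lambda row: row.name[0] > row.name[1] and row.name[::-1] in df.index, axis=1)
df = df[~mask_rows_to_delete].reset_index()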
df

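【Comments】:

Thanks for the answer. In my dataframe some trips have no return trip, so I get this error: Passing list-likes to .loc or [] with any missing labels is no longer supported
I have posted a new (simpler) solution that handles this case.
OK, so maybe you could accept my answer, or at least upvote it :-)

【Solution 2】:

This works even with duplicate entries, and it is very fast: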

def roundtrip(df):
    a, b, c, d = 'station from', 'station to', 'count', 'count_back'
    idx = df[a] > df[b]
    df = df.assign(**{d: 0})
    df.loc[idx, [a, b, c, d]] = df.loc[idx, [b, a, d, c]].values
    return df.groupby([a, b]).sum()

On your sample data (and yes, you can .reset_index() if you like):

>>> roundtrip(df)
                         count  count_back
station from station to                   
20001        20040          55          67
20007        20080         100          50

Timing test:

n = 1_000_000
df = pd.DataFrame({
    'station from': np.random.randint(1000, 2000, n),
    'station to': np.random.randint(1000, 2000, n),
    'count': np.random.randint(0, 200, n),
})

%timeit roundtrip(df)
217 ms ± 2.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

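(On 100K rows: 32.4 ms ± 333 µs per loop)

【Comments】:

Looks like magic. Thanks! Can I handle an additional string column (lgot id) in a similar way, like this |station from|station to|lgot tipe|count|, or is it easier to convert lgot type to lgot id, concatenate it with the station, and then use the suggested function?
Possibly, but it is hard to guess without seeing an example... Also, please remember to select an accepted answer and upvote all the helpful answers.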
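
Since no example of the extra column was posted, the following is only a guess at what was meant: a minimal sketch that adds further grouping keys to `roundtrip` from Solution 2. The function name `roundtrip_by`, the parameter `extra_keys` and the sample values in `lgot tipe` are all hypothetical.

```python
import pandas as pd

def roundtrip_by(df, extra_keys):
    """Same swap-and-sum idea as roundtrip(), grouped by extra key columns too."""
    a, b, c, d = 'station from', 'station to', 'count', 'count_back'
    idx = df[a] > df[b]                     # rows that are "return" legs
    df = df.assign(**{d: 0})                # new frame with an empty count_back
    # swap the stations and move count into count_back on the return legs
    df.loc[idx, [a, b, c, d]] = df.loc[idx, [b, a, d, c]].values
    return df.groupby([a, b, *extra_keys]).sum()

# Hypothetical sample: the same station pair under two different 'lgot tipe' values
df = pd.DataFrame({
    'station from': [20001, 20040, 20001, 20040],
    'station to':   [20040, 20001, 20040, 20001],
    'lgot tipe':    ['A', 'A', 'B', 'B'],
    'count':        [55, 67, 10, 3],
})
out = roundtrip_by(df, ['lgot tipe']).reset_index()
print(out)
```

Because the string column is used as a grouping key rather than summed, there should be no need to convert it to an id or concatenate it with the station code first.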

【Solution 3】:

Let's try sorting the stations and then pivoting:

# the two stations
cols = ['station from', 'station to']

# back and forth
df['col'] = np.where(df['station from'] < df['station to'], 'count', 'count_back')

# rearrange the stations
df[cols] = np.sort(df[cols], axis=1)

# pivot
print(df.pivot(index=cols, columns='col', values='count')
   .reset_index()
)

Output:

col  station from  station to  count  count_back
0           20001       20040     55          67
1           20007       20080    100          50

【Comments】:

This worked for me; the result is the same as with Pierre D's solution
But it is a bit slower and less robust ;-)
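
One likely reason for the "less robust" remark is that `DataFrame.pivot` raises a `ValueError` when the same (station pair, direction) combination occurs more than once. A possible workaround, sketched below, is to swap in `pivot_table` with `aggfunc="sum"`. This is a different call from the one in the answer, and the duplicated row in the sample data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the 20001 -> 20040 trip appears twice,
# and 20007 -> 20080 has no return trip.
df = pd.DataFrame({
    "station from": [20001, 20001, 20040, 20007],
    "station to": [20040, 20040, 20001, 20080],
    "count": [55, 5, 67, 100],
})

cols = ["station from", "station to"]

# label each row by direction, then put the smaller station code first
df["col"] = np.where(df["station from"] < df["station to"], "count", "count_back")
df[cols] = np.sort(df[cols], axis=1)

# df.pivot(index=cols, columns="col", values="count") would raise ValueError
# here because of the duplicated trip; pivot_table sums the duplicates instead.
out = (
    df.pivot_table(index=cols, columns="col", values="count",
                   aggfunc="sum", fill_value=0)
      .reset_index()
)
print(out)
```

`fill_value=0` reports a missing return leg as 0, which matches what `roundtrip` in Solution 2 produces; leave it out to keep NaN for one-way trips.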