Speed up logical merging of rows in pandas (based on conditions): answers

【Question title】: Speed up logical merging of rows in pandas (based on conditions)
【Posted】: 2019-04-12 17:14:24
【Question description】:

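I have a DataFrame containing millions of sales orders. Each row represents one item in a shopping cart. I need to merge orders that were placed on the same day but ended up split into separate orders. More precisely, all orders placed by the same customer on the same day that are also shipped on the same day should be assigned the same order ID (it does not matter which one).

Columns: 'customer_id', 'order_id', ..., 'order_date', 'ship_date'

My naive solution works, but it is very slow: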

for _, customer_groups in df.groupby(by='customer_id'):
    for _, same_day_orders in customer_groups.groupby(by=['order_date', 'ship_date']):
        # Only merge if multiple orders per day.
        if same_day_orders.shape[0] > 1:
            # Now step through the line items two at a time.
            row_iterator = same_day_orders.iterrows()
            _, last_row = next(row_iterator)
            for idx, current_row in row_iterator:
                # Check if the next line item has the same 'ship_date' and a different 'order_id'...
                same_shipping_date = (last_row.ship_date == current_row.ship_date)
                # Compare by value ('!='); 'is not' compares object identity and is unreliable for strings.
                different_order_id = (last_row.order_id != current_row.order_id)
                # ... if so, merge the rows by assigning the second line item the same 'order_id' as its predecessor.
                if same_shipping_date and different_order_id:
                    df.loc[idx, 'order_id'] = last_row.order_id
                    # iterrows() yields copies, so carry the merged ID forward for the next comparison.
                    current_row = current_row.copy()
                    current_row['order_id'] = last_row.order_id
                last_row = current_row

Example (input first, then the desired output):

index   customer_id  order_id   order_date  ship_date
1234    C0176        S0159      2018-03-24  2018-04-23
1235    C0176        S0163      2018-03-24  2018-04-23
1236    C0176        S0163      2018-03-24  2018-04-23
1237    C0176        S0171      2018-03-24  2018-05-01

index   customer_id  order_id   order_date  ship_date   
1234    C0176        S0159      2018-03-24  2018-04-23
1235    C0176        S0159      2018-03-24  2018-04-23
1236    C0176        S0159      2018-03-24  2018-04-23
1237    C0176        S0171      2018-03-24  2018-05-01

How can I solve this in a smarter, i.e. faster, way (keeping it readable would be nice too)?

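【Question discussion】:

Can you groupby(['customer_id', 'order_date', 'ship_date'])?
Could you share a representative example of the input and output data (3 rows are enough: 2 orders to be grouped, 1 independent order)?
@sudonym I added a snippet, I hope it helps. If the conditions are met, the only change is in the 'order_id' column.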

Tags: pandas performance pandas-groupby

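【Solution 1】:

This is a great job for transform, which performs a transformation on each grouped series but guarantees that the index of the result matches the index of the input (rather than collapsing each group into a single result, as an aggregation would). You can use it like this: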

# Get groups of equal customer_id, order_date, and ship_date:
groups = df.groupby(['customer_id', 'order_date', 'ship_date'])

# Get the last order_id value, but ensure its index matches df:
collapsed_orders = groups['order_id'].transform(lambda x: x.iloc[-1])

# Overwrite the original order_id with this new value:
df['order_id'] = collapsed_orders

Or, as a one-liner:

df['order_id'] = df.groupby(['customer_id', 'order_date', 'ship_date'])['order_id'].transform(lambda x: x.iloc[-1])
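A minimal, self-contained sketch of the same idea, run on the four sample rows from the question. It assumes the dates are stored as plain strings (the question does not state the dtype), and it swaps the lambda for the built-in "first" alias, which picks the first order ID in each group instead of the last; the question says either choice is acceptable. The aggregated variable is only there to contrast an aggregation with transform.

```python
import pandas as pd

# Sample rows copied from the question's example (dates kept as strings,
# which is an assumption about the real data).
df = pd.DataFrame(
    {
        "customer_id": ["C0176"] * 4,
        "order_id": ["S0159", "S0163", "S0163", "S0171"],
        "order_date": ["2018-03-24"] * 4,
        "ship_date": ["2018-04-23", "2018-04-23", "2018-04-23", "2018-05-01"],
    },
    index=[1234, 1235, 1236, 1237],
)

keys = ["customer_id", "order_date", "ship_date"]
grouped = df.groupby(keys)["order_id"]

# An aggregation collapses each group to one row, indexed by the group keys.
aggregated = grouped.first()

# transform broadcasts the per-group value back onto the original index.
# The string alias "first" uses pandas' built-in implementation instead of
# calling a Python lambda once per group. Note that "first" skips missing
# values within a group.
merged_ids = grouped.transform("first")

# Overwrite the original order_id with the per-group representative.
df["order_id"] = merged_ids
print(df)
```

On this sample, rows 1234 to 1236 all end up with S0159 and row 1237 keeps S0171, which matches the desired output in the question. A string alias generally avoids the per-group Python call that a lambda incurs, which may matter with millions of rows, though only timing it on the real data will tell how much.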

【Discussion】:

Honestly, respect!