Merging DataFrames in Pandas and Numpy: answers

【Question title】: Merging DataFrames in Pandas and Numpy
【Posted】: 2019-12-02 13:20:41
【Question】:

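I have two different DataFrames related to a sales analysis. I would like to merge them into a new DataFrame with the columns customer_id, name and total_spend. The two DataFrames are as follows: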

import pandas as pd
import numpy as np

customers = pd.DataFrame([[100, 'Prometheus Barwis', 'prometheus.barwis@me.com',
        '(533) 072-2779'],[101, 'Alain Hennesey', 'alain.hennesey@facebook.com',
        '(942) 208-8460'],[102, 'Chao Peachy', 'chao.peachy@me.com',
        '(510) 121-0098'],[103, 'Somtochukwu Mouritsen',
        'somtochukwu.mouritsen@me.com','(669) 504-8080'],[104,
        'Elisabeth Berry', 'elisabeth.berry@facebook.com','(802) 973-8267']],
        columns = ['customer_id', 'name', 'email', 'phone'])

orders = pd.DataFrame([[1000, 100, 144.82], [1001, 100, 140.93],
       [1002, 102, 104.26], [1003, 100, 194.6 ], [1004, 100, 307.72],
       [1005, 101,  36.69], [1006, 104,  39.59], [1007, 104, 430.94],
       [1008, 103,  31.4 ], [1009, 104, 180.69], [1010, 102, 383.35],
       [1011, 101, 256.2 ], [1012, 103, 930.56], [1013, 100, 423.77],
       [1014, 101, 309.53], [1015, 102, 299.19]],
       columns = ['order_id', 'customer_id', 'order_total'])

When I group by customer_id and order_id, I get the following table:

customer_id  order_id  order_total

100           1000       144.82
              1001       140.93
              1003       194.60
              1004       307.72
              1013       423.77
101           1005       36.69
              1011       256.20
              1014       309.53
102           1002       104.26
              1010       383.35
              1015       299.19
103           1008       31.40
              1012       930.56
104           1006       39.59
              1007       430.94
              1009       180.69

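This is where I get stuck. I don't know how to sum up all the orders for each customer_id in order to create a total_spend column. If anyone knows a way to do this, it would be much appreciated!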

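【Question comments】:

Your grouping seems to go one level deeper than necessary. How did you produce it, and what are you ultimately after? Is it something like: customers['total_spend'] = customers['customer_id'].map(orders.groupby('customer_id')['order_total'].sum())?
I got the table above with customer_spend = pd.merge(customers, orders) followed by customer_spend.groupby(["customer_id", 'order_id']).sum(). Ultimately I want a final table that gives me the customer_id, the name, and how much that person spent in total (hence the new total_spend column).
Doesn't the above do that?
Your question and its answer can be found here
What is your expected output?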
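
The one-liner suggested in the first comment can be sketched as follows. This is only a minimal illustration, assuming the data from the question (trimmed here to the columns that matter); the intermediate name totals is introduced for readability and is not part of the original comment.

```python
import pandas as pd

# Data from the question, trimmed to the columns used here.
customers = pd.DataFrame(
    [[100, 'Prometheus Barwis'], [101, 'Alain Hennesey'], [102, 'Chao Peachy'],
     [103, 'Somtochukwu Mouritsen'], [104, 'Elisabeth Berry']],
    columns=['customer_id', 'name'])

orders = pd.DataFrame(
    [[1000, 100, 144.82], [1001, 100, 140.93], [1002, 102, 104.26],
     [1003, 100, 194.60], [1004, 100, 307.72], [1005, 101, 36.69],
     [1006, 104, 39.59], [1007, 104, 430.94], [1008, 103, 31.40],
     [1009, 104, 180.69], [1010, 102, 383.35], [1011, 101, 256.20],
     [1012, 103, 930.56], [1013, 100, 423.77], [1014, 101, 309.53],
     [1015, 102, 299.19]],
    columns=['order_id', 'customer_id', 'order_total'])

# Series indexed by customer_id that holds each customer's summed order_total.
totals = orders.groupby('customer_id')['order_total'].sum()

# Look up every customer's total by id; an id without any orders would get NaN.
customers['total_spend'] = customers['customer_id'].map(totals)

print(customers[['customer_id', 'name', 'total_spend']])
```

Because map works on the existing customers frame, no merge is needed and the row order of customers is preserved.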

Tags: python-3.x pandas numpy dataframe

【Solution 1】:

IIUC, you can do the following:

orders.groupby('customer_id')['order_total'].sum().reset_index(name='Customer_Total')

Output:

   customer_id  Customer_Total
0          100         1211.84
1          101          602.42
2          102          786.80
3          103          961.96
4          104          651.22
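
The per-customer totals do not yet contain the name column the question asks for. One possible way to attach it, sketched here under the assumption that the data is the same as in the question, is to merge the totals back onto customers. The names totals and result are illustrative, and the total column is called total_spend here to match the question rather than Customer_Total.

```python
import pandas as pd

# Data from the question, trimmed to the columns used here.
customers = pd.DataFrame(
    [[100, 'Prometheus Barwis'], [101, 'Alain Hennesey'], [102, 'Chao Peachy'],
     [103, 'Somtochukwu Mouritsen'], [104, 'Elisabeth Berry']],
    columns=['customer_id', 'name'])

orders = pd.DataFrame(
    [[1000, 100, 144.82], [1001, 100, 140.93], [1002, 102, 104.26],
     [1003, 100, 194.60], [1004, 100, 307.72], [1005, 101, 36.69],
     [1006, 104, 39.59], [1007, 104, 430.94], [1008, 103, 31.40],
     [1009, 104, 180.69], [1010, 102, 383.35], [1011, 101, 256.20],
     [1012, 103, 930.56], [1013, 100, 423.77], [1014, 101, 309.53],
     [1015, 102, 299.19]],
    columns=['order_id', 'customer_id', 'order_total'])

# One row per customer_id with the summed order_total.
totals = (orders.groupby('customer_id')['order_total']
                .sum()
                .reset_index(name='total_spend'))

# A left join keeps every customer, even one that has no orders (NaN total).
result = customers[['customer_id', 'name']].merge(totals, on='customer_id', how='left')

print(result)
```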

【Comments】:

【Solution 2】:

You can create an additional table and then merge it back into your current output.

# group by customer id and order id to match your current output,
# then reset_index() so customer_id and order_id become regular columns for the merge
df = orders.groupby(['customer_id', 'order_id']).sum().reset_index()

# create a new lookup table with the total by customer
totalbycust = orders.groupby('customer_id').sum()
totalbycust = totalbycust.reset_index()

# only keep the columns you want
totalbycust = totalbycust[['customer_id', 'order_total']]

# merge back to your current table; the two order_total columns get _x / _y suffixes
df = df.merge(totalbycust, on='customer_id')
df = df.rename(columns={"order_total_x": "order_total", "order_total_y": "order_amount_by_cust"})

# expected output
df
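
If the goal is a per-order table that also carries each customer's total, the lookup table and merge can be replaced by groupby(...).transform('sum'). This is a different technique from the one in this answer, shown only as a sketch on the question's orders data; the name per_order is illustrative.

```python
import pandas as pd

# Orders data from the question.
orders = pd.DataFrame(
    [[1000, 100, 144.82], [1001, 100, 140.93], [1002, 102, 104.26],
     [1003, 100, 194.60], [1004, 100, 307.72], [1005, 101, 36.69],
     [1006, 104, 39.59], [1007, 104, 430.94], [1008, 103, 31.40],
     [1009, 104, 180.69], [1010, 102, 383.35], [1011, 101, 256.20],
     [1012, 103, 930.56], [1013, 100, 423.77], [1014, 101, 309.53],
     [1015, 102, 299.19]],
    columns=['order_id', 'customer_id', 'order_total'])

# Sort so the rows appear in the same order as the grouped table in the question.
per_order = orders.sort_values(['customer_id', 'order_id']).reset_index(drop=True)

# transform('sum') returns one value per original row, so the customer's
# total is repeated on each of that customer's orders.
per_order['order_amount_by_cust'] = (
    per_order.groupby('customer_id')['order_total'].transform('sum'))

print(per_order)
```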

【Comments】:

【Solution 3】:

df_merge = customers.merge(orders, how='left', left_on='customer_id', right_on='customer_id').filter(['customer_id','name','order_total'])
df_merge = df_merge.groupby(['customer_id','name']).sum()
df_merge = df_merge.rename(columns={'order_total':'total_spend'})
df_merge.sort_values(['total_spend'], ascending=False)

Result:

                                    total_spend
customer_id name    
100         Prometheus Barwis       1211.84
103         Somtochukwu Mouritsen   961.96
102         Chao Peachy             786.80
104         Elisabeth Berry         651.22
101         Alain Hennesey          602.42

Step-by-step explanation:

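First, merge your orders table into your customers table using a left join. For this you need pandas' .merge() method. Be sure to set the how argument to left, because the default merge type is inner (which would drop customers that have no orders).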

This step requires some basic understanding of SQL-style merge methods. You can find a good visual overview of the various merge types in this thread.
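
To see the difference between the two join types, here is a small sketch. It uses only a few of the orders from the question and adds a hypothetical customer 105 ("Hypothetical Customer") who has no orders; that customer is an assumption made purely for illustration and does not appear in the original data.

```python
import pandas as pd

# Customers from the question plus one HYPOTHETICAL customer (105) without orders.
customers = pd.DataFrame(
    [[100, 'Prometheus Barwis'], [101, 'Alain Hennesey'], [102, 'Chao Peachy'],
     [103, 'Somtochukwu Mouritsen'], [104, 'Elisabeth Berry'],
     [105, 'Hypothetical Customer']],
    columns=['customer_id', 'name'])

# A subset of the orders from the question, one per real customer.
orders = pd.DataFrame(
    [[1000, 100, 144.82], [1005, 101, 36.69], [1002, 102, 104.26],
     [1008, 103, 31.40], [1006, 104, 39.59]],
    columns=['order_id', 'customer_id', 'order_total'])

# Default how='inner': only customer_ids present in both tables survive.
inner = customers.merge(orders, on='customer_id')

# how='left': every customer is kept; indicator=True adds a _merge column
# that says whether a row found a match in orders.
left = customers.merge(orders, on='customer_id', how='left', indicator=True)

# Customers that exist only on the left side, i.e. have no orders.
left_only = left[left['_merge'] == 'left_only']

print(inner['customer_id'].tolist())
print(left_only[['customer_id', 'name']])
```
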
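You can chain the .filter() method onto the merge to keep only the columns you are interested in (in your case: customer_id, name and order_total).
Now that you have the merged table, you still need to sum up all the order_total values per customer. To do that, use .groupby() to group by the columns that identify a customer (here customer_id and name) and then apply an aggregation method (in this case .sum()) to the remaining numeric column.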

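The .groupby() documentation offers more examples of this. It is also worth mentioning that this is a pattern the pandas documentation calls "split-apply-combine".
Next, use the .rename() method and set its columns argument to rename the numeric column from order_total to total_spend.
Last but not least, use .sort_values() to sort your customers by their total_spend column.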
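
As a variation on the steps above, the grouping, renaming and sorting can be folded into one chain with named aggregation, which should be available in pandas 0.25 and later. This is a sketch on the question's data (trimmed to the columns used), not the code from this answer; as_index=False is used so that customer_id and name come back as ordinary columns instead of an index.

```python
import pandas as pd

# Data from the question, trimmed to the columns used here.
customers = pd.DataFrame(
    [[100, 'Prometheus Barwis'], [101, 'Alain Hennesey'], [102, 'Chao Peachy'],
     [103, 'Somtochukwu Mouritsen'], [104, 'Elisabeth Berry']],
    columns=['customer_id', 'name'])

orders = pd.DataFrame(
    [[1000, 100, 144.82], [1001, 100, 140.93], [1002, 102, 104.26],
     [1003, 100, 194.60], [1004, 100, 307.72], [1005, 101, 36.69],
     [1006, 104, 39.59], [1007, 104, 430.94], [1008, 103, 31.40],
     [1009, 104, 180.69], [1010, 102, 383.35], [1011, 101, 256.20],
     [1012, 103, 930.56], [1013, 100, 423.77], [1014, 101, 309.53],
     [1015, 102, 299.19]],
    columns=['order_id', 'customer_id', 'order_total'])

# Split: one group per (customer_id, name). Apply: sum order_total.
# Combine: one row per customer, with the result already named total_spend.
total_spend = (
    customers.merge(orders, how='left', on='customer_id')
             .groupby(['customer_id', 'name'], as_index=False)
             .agg(total_spend=('order_total', 'sum'))
             .sort_values('total_spend', ascending=False)
)

print(total_spend)
```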

Hope this helps.

【Comments】: