Create column based on percentage of recurring customers

【Question title】: Create column based on percentage of recurring customers
【Posted】: 2021-12-15 21:59:22
【Question】:

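I have a DataFrame of order data with one order per row. It has these columns:

date_created
customer_id
total
recurring_customer

A customer counts as a recurring customer from their third order. I want to know what percentage of the total value comes from recurring customers.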
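A minimal sketch of how such a flag could be derived, in case it helps readers reproduce the setup. The `orders` frame and its values are hypothetical, and the rule "recurring from the third order onward" is an assumption about what the question means.

```python
import pandas as pd

# Hypothetical order data; customer "577" is the only one to reach a third order.
orders = pd.DataFrame(
    {
        "date_created": pd.to_datetime(
            ["2019-11-01", "2019-11-05", "2019-11-20", "2019-11-25", "2019-12-02"]
        ),
        "customer_id": ["577", "577", "6457", "577", "577"],
        "total": [100.0, 50.0, 80.0, 200.0, 30.0],
    }
)

# Sort chronologically so cumcount() numbers each customer's orders in time order.
orders = orders.sort_values("date_created", kind="stable")

# cumcount() starts at 0, so a customer's third order has position 2.
orders["recurring_customer"] = orders.groupby("customer_id").cumcount() >= 2
```

With this rule, only the last two orders of customer "577" are flagged.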

The DataFrame looks like this:

import pandas as pd

df = pd.DataFrame(
    {
        "date_created": ["2019-11-16", "2019-11-16", "2019-11-16", "2019-11-16", "2019-11-16", "2019-11-16"],
        "customer_id": ["1733", "6356", "6457", "6599", "6637", "6638"],
        "total": ["746.02", "1236.60", "1002.32", "1187.21", "1745.03", "2313.14"],
        "recurring_customer": ["False", "False", "False", "False", "False", "False"],
    }
)
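Note that every value in the sample above is a string. If the real data is stored the same way (an assumption, since the question later says `date_created` is a datetime index), it would need converting before resampling. A sketch using the first three sample rows:

```python
import pandas as pd

# First three sample rows from the question, with all values stored as strings.
df = pd.DataFrame(
    {
        "date_created": ["2019-11-16", "2019-11-16", "2019-11-16"],
        "customer_id": ["1733", "6356", "6457"],
        "total": ["746.02", "1236.60", "1002.32"],
        "recurring_customer": ["False", "False", "False"],
    }
)

df["date_created"] = pd.to_datetime(df["date_created"])
df["total"] = df["total"].astype(float)
# Compare against the literal text: astype(bool) would turn the string "False" into True.
df["recurring_customer"] = df["recurring_customer"] == "True"

# resample() needs a datetime-like index.
df = df.set_index("date_created")
```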

Resampling the data to monthly frequency:

df_monthly = df.resample('1M').mean()

I get the following output:

df_monthly = pd.DataFrame(
    {
        "date_created": ["2019-11-30", "2019-12-31", "2020-01-31", "2020-02-29", "2020-03-31", "2020-04-30"],
        "customer_id": ["4987.02", "5291.56", "5702.13", "6439.27", "7263.11", "8080.91"],
        "total": ["2915.25", "2550.85", "2486.72", "2515.81", "2633.77", "2558.19"],
        "recurring_customer": ["0.009050", "0.016667", "0.075630", "0.138122", "0.130045", "0.175503"],
    }
)

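So the real question is: what percentage of each month's total value comes from recurring customers?

The desired output should look like this: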

| date_created | customer_id | total   | recurring_customer | recurring_customer_total | recurring_customer_total_percentage |
| ------------ | ----------- | ------- | ------------------ | ------------------------ | ----------------------------------- |
| 2019-11-30   | 4987.02     | 2915.25 | 0.009050           | ??????                   | ??????                              |
| 2019-12-31   | 5291.56     | 2550.85 | 0.016667           | ??????                   | ??????                              |
| 2020-01-31   | 5702.13     | 2486.72 | 0.075630           | ??????                   | ??????                              |
| 2020-02-29   | 6439.27     | 2515.81 | 0.138122           | ??????                   | ??????                              |
| 2020-03-31   | 7263.11     | 2633.77 | 0.130045           | ??????                   | ??????                              |
| 2020-04-30   | 8080.91     | 2558.19 | 0.175503           | ??????                   | ??????                              |

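Note that I can't simply multiply the recurring_customer percentage by the total value, because I assume recurring customers contribute far more to the total value per order than non-recurring customers do.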
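A small made-up example of why the two shares can differ: the mean of the boolean flag is the share of orders, not the share of value. The `month` frame and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical month: 1 of 4 orders comes from a recurring customer,
# but that single order is much larger than the others.
month = pd.DataFrame(
    {
        "total": [100.0, 100.0, 100.0, 700.0],
        "recurring_customer": [False, False, False, True],
    }
)

# Share of orders placed by recurring customers (what .mean() on the flag gives).
order_share = month["recurring_customer"].mean() * 100

# Share of value contributed by recurring customers.
value_share = (
    month.loc[month["recurring_customer"], "total"].sum() / month["total"].sum() * 100
)

print(order_share, value_share)
```

Here recurring customers place a quarter of the orders but account for roughly 70% of the value, so the value share has to be computed from the order totals themselves.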

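I tried the np.where() function on the daily DataFrame, with this plan:

First, I would create a 'recurring_customer_total' column in the daily DataFrame that copies the value of the 'total' column when 'recurring_customer' is True and is 0 otherwise. I found a similar question here: get values from first column when other columns are true using a lookup list. Another similar question was asked here: Getting indices of True values in a boolean list. That answer returns all the True values and their positions, whereas I want the value of 'total' copied to 'recurring_customer_total' when 'recurring_customer' is True.
Then I would resample the daily DataFrame to a monthly DataFrame, which would give the average amount that recurring customers contribute to the total value. These values would appear in 'recurring_customer_total'.
The last step would be to calculate 'recurring_customer_total' as a percentage of the 'total' column. These values should be stored in 'recurring_customer_total_percentage'.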
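The three steps above can be sketched roughly as follows. The sample rows are hypothetical, and the month grouping uses `to_period("M")` in place of `resample('1M')` because the month-end alias was renamed in newer pandas releases; treat this as an illustration rather than a definitive implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical daily orders, indexed by date_created.
df = pd.DataFrame(
    {
        "total": [100.0, 300.0, 200.0, 200.0],
        "recurring_customer": [False, True, False, True],
    },
    index=pd.to_datetime(["2019-11-16", "2019-11-25", "2019-12-02", "2019-12-09"]),
)
df.index.name = "date_created"

# Step 1: copy "total" where the flag is True, otherwise 0.
df["recurring_customer_total"] = np.where(df["recurring_customer"], df["total"], 0.0)

# Step 2: aggregate both columns per calendar month.
month = df.index.to_period("M")
monthly = df.groupby(month)[["total", "recurring_customer_total"]].sum()

# Step 3: recurring customers' share of each month's total value.
monthly["recurring_customer_total_percentage"] = (
    monthly["recurring_customer_total"] / monthly["total"] * 100
)
```

Sums are used in step 2, but monthly means would give the same percentage, because both columns are averaged over the same number of rows in each month.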

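I think these are the steps I need to follow; the only problem is that I don't really know how to get there.

Thanks in advance!

【Question comments】: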

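Hi, I can't see any datetime column or index in df, so I can't reproduce your steps. What is the actual output of df.head(6).to_dict()?
Hi @laurent, I've added the date_created column (it is the index, as datetime). Is it reproducible now? If you like, I can show you the output of df.head(6).to_dict().

Tags: python pandas dataframe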

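【Solution 1】:

I'm still fairly new to Python, but I've managed to answer my own question. I can't say this is the best, simplest, or fastest way, but it worked for me.

First, I created a new DataFrame that is a copy of the original but contains only the rows where 'recurring_customer' is True. I did that with the following code: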

df_recurring_customers = df.loc[df['recurring_customer'] == True]

That gave me the following DataFrame:

df_recurring_customers.head()
    {
        "date_created": ["2019-11-25", "2019-11-28", "2019-12-02", "2019-12-09", "2019-12-11"],
        "customer_id": ["577", "6457", "577", "6647", "840"],
        "total": ["33891.12", "81.98", "9937.68", "1166.28", "2969.60"],
        "recurring_customer": ["True", "True", "True", "True", "True"],
    }

Then I resampled the values using:

df_recurring_customers_monthly_sum = df_recurring_customers.resample('1M').sum()

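Then I dropped the 'number' and 'customer_id' columns, which carried no useful information. The next step was to join the two DataFrames 'df_monthly' and 'df_recurring_customers_monthly_sum' using: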

df_total = df_recurring_customers_monthly_sum.join(df_monthly)

This gave me:

| date_created | total      | recurring_customer_total |
| ------------ | ---------- | ------------------------ |
|  2019-11-30  | 644272.02  |         33973.10         |
|  2019-12-31  | 612205.99  |         15775.29         |
|  2020-01-31  | 887761.60  |         61612.27         |
|  2020-02-29  | 910724.75  |         125315.31        |
|  2020-03-31  | 1174662.59 |         125315.31        |
|  2020-04-30  | 1399332.26 |         248277.97        |

Then I wanted the percentage, so:

df_total['recurring_customer_total_percentage'] = (df_total['recurring_customer_total'] / df_total['total']) * 100

This gave me:

| date_created | total      | recurring_customer_total | recurring_customer_total_percentage |
| ------------ | ---------- | ------------------------ | ----------------------------------- |
| 2019-11-30   | 644272.02  | 33973.10                 | 5.273099                            |
| 2019-12-31   | 612205.99  | 15775.29                 | 2.576794                            |
| 2020-01-31   | 887761.60  | 61612.27                 | 6.940182                            |
| 2020-02-29   | 910724.75  | 125315.31                | 13.759954                           |
| 2020-03-31   | 1174662.59 | 125315.31                | 13.967221                           |
| 2020-04-30   | 1399332.26 | 248277.97                | 17.742603                           |
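For readers who want to try the whole approach end to end, here is a compact sketch of the same idea (filter, monthly sums, join, percentage). The sample rows are hypothetical, and the explicit rename and the month grouping via `to_period("M")` are assumptions for illustration, not the exact code used in this answer.

```python
import pandas as pd

# Hypothetical daily orders, indexed by date_created.
df = pd.DataFrame(
    {
        "total": [100.0, 300.0, 200.0, 200.0, 250.0, 750.0],
        "recurring_customer": [False, True, False, False, True, False],
    },
    index=pd.to_datetime(
        ["2019-11-16", "2019-11-25", "2019-12-02", "2019-12-09", "2020-01-10", "2020-01-20"]
    ),
)
df.index.name = "date_created"

# Monthly sum over all orders.
total = df.groupby(df.index.to_period("M"))["total"].sum()

# Monthly sum over recurring customers' orders only, renamed to avoid a column clash.
recurring = df.loc[df["recurring_customer"]]
recurring_total = (
    recurring.groupby(recurring.index.to_period("M"))["total"]
    .sum()
    .rename("recurring_customer_total")
)

# Left join from the all-orders side; months with no recurring orders get 0.
df_total = total.to_frame().join(recurring_total).fillna({"recurring_customer_total": 0.0})

df_total["recurring_customer_total_percentage"] = (
    df_total["recurring_customer_total"] / df_total["total"] * 100
)
```

Joining from the all-orders side means a month with no recurring orders (December in this sample) still appears, with a 0% share.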

【Comments】: