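【Posted】:2021-03-07 05:51:02
【Question】:
I have two data frames in R. One is a list of product SKUs (product IDs); the other is a purchase log containing the order number, customer email, purchase date, product ID (product_sku), and quantity purchased.
purchase_dataframe: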
order_number | email                | product_sku | quantity | purchase_date
1000         | customer1@sample.com | RT-100      | 2        | 2020-01-01
1000         | customer1@sample.com | CT-300      | 1        | 2020-01-01
1000         | customer1@sample.com | Phone-100   | 1        | 2020-01-01
2000         | customer2@sample.com | Phone-200   | 1        | 2020-04-20
2000         | customer2@sample.com | OM-200      | 1        | 2020-04-20
3000         | customer3@sample.com | CT-300      | 3        | 2020-03-15
4000         | customer1@sample.com | OM-200      | 5        | 2020-07-07
5000         | customer4@sample.com | Phone-200   | 3        | 2020-08-19
6000         | customer3@sample.com | Phone-100   | 1        | 2020-09-22
6000         | customer3@sample.com | RT-100      | 1        | 2020-09-22
tv_list:
SKU
RT-100
CT-300
OM-200
LL-400
...
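I want to compute the total number of TVs each customer has purchased over their lifetime, ignoring all other products (e.g. phones). The tv_list data frame should let me identify which SKUs are TVs and which are not; I have many different TV SKUs, and the list above is only a small sample. Ideally, the resulting data frame would look like this: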
email | number_purchased_tv
customer1@sample.com | 8
customer2@sample.com | 1
customer3@sample.com | 4
customer4@sample.com | 0
For reproducibility, and to make my example easier to follow, here is the code for the two sample tables above:
purchase_dataframe <- data.frame(
  order_number = c(1000, 1000, 1000, 2000, 2000, 3000, 4000, 5000, 6000, 6000),
  email = c("customer1@sample.com", "customer1@sample.com", "customer1@sample.com",
            "customer2@sample.com", "customer2@sample.com", "customer3@sample.com",
            "customer1@sample.com", "customer4@sample.com", "customer3@sample.com",
            "customer3@sample.com"),
  product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200",
                  "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
  quantity = c(2, 1, 1, 1, 1, 3, 5, 3, 1, 1),
  purchase_date = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-04-20", "2020-04-20",
                    "2020-03-15", "2020-07-07", "2020-08-19", "2020-09-22", "2020-09-22")
)
tv_list <- data.frame(SKU = c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700"))
Many thanks!
【Discussion】:
Tags: r dplyr aggregate tidyr summarize
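One possible approach, offered as a minimal base R sketch rather than a definitive answer: instead of filtering out the non-TV rows (which would drop customers such as customer4 who never bought a TV), set the quantity to 0 for any row whose SKU is not in tv_list, then sum per email. The names tv_quantity and tv_totals are made up for this illustration, and the sample data is copied from the question without the purchase_date column, which is not needed here.

```r
# Sample data copied from the question (purchase_date omitted; it is not used).
purchase_dataframe <- data.frame(
  order_number = c(1000, 1000, 1000, 2000, 2000, 3000, 4000, 5000, 6000, 6000),
  email = c("customer1@sample.com", "customer1@sample.com", "customer1@sample.com",
            "customer2@sample.com", "customer2@sample.com", "customer3@sample.com",
            "customer1@sample.com", "customer4@sample.com", "customer3@sample.com",
            "customer3@sample.com"),
  product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200",
                  "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
  quantity = c(2, 1, 1, 1, 1, 3, 5, 3, 1, 1)
)
tv_list <- data.frame(SKU = c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700"))

# Hypothetical helper vector: a row keeps its quantity only if its SKU is a TV;
# every other product contributes 0, so no customer disappears from the result.
tv_quantity <- ifelse(purchase_dataframe$product_sku %in% tv_list$SKU,
                      purchase_dataframe$quantity, 0)

# Sum the TV quantities per customer email.
tv_totals <- aggregate(tv_quantity,
                       by = list(email = purchase_dataframe$email),
                       FUN = sum)
names(tv_totals)[2] <- "number_purchased_tv"

print(tv_totals)
#                  email number_purchased_tv
# 1 customer1@sample.com                   8
# 2 customer2@sample.com                   1
# 3 customer3@sample.com                   4
# 4 customer4@sample.com                   0
```

If dplyr is preferred, the same idea should carry over to something like group_by(email) followed by summarise(number_purchased_tv = sum(quantity[product_sku %in% tv_list$SKU])), since summing an empty selection gives 0.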