【Question title】: How to summarize the number of products purchased by a customer based on a product list in R
【Posted】: 2021-03-07 05:51:02
【Question】:

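I have two data frames in R: one is a list of product SKUs (product IDs), and the other is a purchase log containing the order number, customer email, purchase date, product ID (product_sku), and quantity purchased.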

purchase_dataframe:

order_number | email                | product_sku | quantity | purchase_date
1000         | customer1@sample.com | RT-100      | 2        | 2020-01-01
1000         | customer1@sample.com | CT-300      | 1        | 2020-01-01
1000         | customer1@sample.com | Phone-100   | 1        | 2020-01-01
2000         | customer2@sample.com | Phone-200   | 1        | 2020-04-20
2000         | customer2@sample.com | OM-200      | 1        | 2020-04-20
3000         | customer3@sample.com | CT-300      | 3        | 2020-03-15
4000         | customer1@sample.com | OM-200      | 5        | 2020-07-07
5000         | customer4@sample.com | Phone-200   | 3        | 2020-08-19
6000         | customer3@sample.com | Phone-100   | 1        | 2020-09-22
6000         | customer3@sample.com | RT-100      | 1        | 2020-09-22

tv_list:

 SKU
    RT-100
    CT-300
    OM-200
    LL-400
    ...

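I want to count the total number of TVs each customer has bought over their lifetime, ignoring all other products (e.g. phones). The data frame tv_list should help me identify which SKUs are TVs and which are not, since I have many different TV SKUs; the above is only a small sample. Ideally, the resulting data frame would look like this: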

email                | number_purchased_tv
customer1@sample.com | 8
customer2@sample.com | 1
customer3@sample.com | 4
customer4@sample.com | 0

For reproducibility, and to make my example easier to follow, here is the code for the two sample tables above:

purchase_dataframe <- data.frame(order_number = c(1000,1000,1000, 2000,2000, 3000, 4000, 5000, 6000, 6000),
                      email = c("customer1@sample.com","customer1@sample.com", "customer1@sample.com","customer2@sample.com",
                                "customer2@sample.com","customer3@sample.com","customer1@sample.com","customer4@sample.com",
                                "customer3@sample.com","customer3@sample.com"),
                      product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200", "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
                      quantity = c(2,1,1,1,1,3,5,3,1,1),
                      purchase_date = c("2020-01-01","2020-01-01","2020-01-01","2020-04-20","2020-04-20","2020-03-15","2020-07-07","2020-08-19","2020-09-22","2020-09-22"))

tv_list <- data.frame(SKU = c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700"))

Thanks a lot!

【Comments】:

    Tags: r dplyr aggregate tidyr summarize


    【Solution 1】:

    The following does what you ask using dplyr:

    library(dplyr)
    library(data.table)
    purchase_dataframe %>% dplyr::group_by(email) %>% dplyr::summarise(sumtv = sum(quantity[product_sku %in% unique(tv_list$SKU)]))
    # A tibble: 4 x 2
    email                sumtv
    <chr>                <dbl>
    1 customer1@sample.com     8
    2 customer2@sample.com     1
    3 customer3@sample.com     4
    4 customer4@sample.com     0
    

    Edit: the sumtv figures above have been corrected, and a data.table solution follows below.

    library(dplyr)
    library(data.table)
    # as.data.table() returns a properly allocated copy and leaves purchase_dataframe unchanged
    purchase_datatable <- as.data.table(purchase_dataframe)
    purchase_datatable[, sumtv := sum(quantity[product_sku %in% unique(tv_list$SKU)]), by = "email"][
      , .(email, sumtv)] %>% unique
                      email sumtv
    1: customer1@sample.com     8
    2: customer2@sample.com     1
    3: customer3@sample.com     4
    4: customer4@sample.com     0
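If the goal is simply one row per customer, a possible variant is to aggregate directly in `j` rather than adding a column with `:=` and deduplicating afterwards. The following is a minimal sketch, not the answer author's code; the names `purchases`, `tv_skus` and `tv_totals` are made up for illustration, and the data is the question's sample reduced to the three columns that matter.

```r
library(data.table)

# Sample data from the question (only the columns needed here);
# `purchases` and `tv_skus` are illustrative names, not from the thread
purchases <- data.table(
  email = c("customer1@sample.com", "customer1@sample.com", "customer1@sample.com",
            "customer2@sample.com", "customer2@sample.com", "customer3@sample.com",
            "customer1@sample.com", "customer4@sample.com", "customer3@sample.com",
            "customer3@sample.com"),
  product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200",
                  "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
  quantity = c(2, 1, 1, 1, 1, 3, 5, 3, 1, 1)
)
tv_skus <- c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700")

# Aggregate in j: returns one row per email and adds no column to `purchases`.
# Customers with no TV rows get sum(numeric(0)), which is 0.
tv_totals <- purchases[, .(sumtv = sum(quantity[product_sku %in% tv_skus])), by = email]
print(tv_totals)
```

Because nothing is assigned by reference, the source table stays untouched and no `unique` step is needed.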
    

    Microbenchmarking gives the data.table solution an advantage of nearly 50%. IMO it is an excellent package that is well worth learning through these vignettes.

    library(microbenchmark)
    microbenchmark(
      purchase_datatable[, sumtv := sum(quantity[product_sku %in% unique(tv_list$SKU)]), by = "email"][
        , .(email, sumtv)] %>% unique,
      purchase_dataframe %>%
        dplyr::group_by(email) %>%
        dplyr::summarise(sumtv = sum(quantity[product_sku %in% unique(tv_list$SKU)]))
    )
    # first row: data.table expression; second row: dplyr expression
    # (the expr column and the time unit were not preserved in this output)
    min      lq     mean  median     uq    max neval
    1.268 1.42700 1.823445 1.80300 2.0887 2.8332   100
    2.715 2.98025 3.250287 3.20355 3.3509 8.8255   100
    

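    【Comments】:

    • Hey! Thanks for the quick answer! Unfortunately, this does not give the correct sum of quantities; it seems to only count the orders. For customer 1 the total should be 8, if I'm not mistaken.
    • Thanks for the quick fix! For some reason, the data.table solution gives me the following warning: In `[.data.table`(purchase_datatable, product_sku %in% unique(tv_list$SKU), : Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names
    • Sorry, I ran out of characters because of the warning message. For some reason it also does not return a "unique" table, but that may be related to the warning above.
    • Did you load both dplyr and data.table? If not, you should; I'll edit the answer.
    • Just checked again. If I'm not mistaken, the dplyr solution works. Thanks! However, the data.table solution does not aggregate by customer (as you can see in the output, customer 1 and customer 3 appear twice, with 2 different values).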
    【Solution 2】:

    An option using base R:

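    # Match each purchased SKU against the TV list (NA when it is not a TV)
    purchase_dataframe$ProductIndex <- tv_list[match(purchase_dataframe$product_sku, tv_list$SKU), 'SKU']
    # Count the quantity only for TV rows; all other rows contribute 0
    purchase_dataframe$Counter <- ifelse(is.na(purchase_dataframe$ProductIndex), 0, purchase_dataframe$quantity)
    # Aggregate per customer
    Res <- aggregate(Counter ~ email, data = purchase_dataframe, sum, na.rm = TRUE)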
    

    Output:

                     email Counter
    1 customer1@sample.com       8
    2 customer2@sample.com       1
    3 customer3@sample.com       4
    4 customer4@sample.com       0
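The approach above adds two helper columns to the purchase log. If that side effect is unwanted, the same idea can be sketched in base R without modifying the data frame. This is only an illustration under assumed names (`purchases`, `tv_skus`, `tv_qty`, `res` do not appear in the thread), using the question's sample data.

```r
# Sample data from the question (only the columns needed here);
# all object names below are illustrative
purchases <- data.frame(
  email = c("customer1@sample.com", "customer1@sample.com", "customer1@sample.com",
            "customer2@sample.com", "customer2@sample.com", "customer3@sample.com",
            "customer1@sample.com", "customer4@sample.com", "customer3@sample.com",
            "customer3@sample.com"),
  product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200",
                  "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
  quantity = c(2, 1, 1, 1, 1, 3, 5, 3, 1, 1)
)
tv_skus <- c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700")

# Quantity counts only when the SKU is a TV; every other row contributes 0
tv_qty <- ifelse(purchases$product_sku %in% tv_skus, purchases$quantity, 0)

# Sum per customer; the value column of aggregate() on a vector is named "x"
res <- aggregate(tv_qty, by = list(email = purchases$email), FUN = sum)
names(res)[2] <- "number_purchased_tv"
print(res)
```

Keeping the zero rows in the vector, rather than filtering them out first, is what lets customers with no TV purchases show up with 0.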
    

    【Comments】:

      【Solution 3】:

      Here is the data you provided:

      library('dplyr')
      purchase_dataframe <- data.frame(order_number = c(1000,1000,1000, 2000,2000, 3000, 4000, 5000, 6000, 6000),
                            email = c("customer1@sample.com","customer1@sample.com", "customer1@sample.com","customer2@sample.com",
                                      "customer2@sample.com","customer3@sample.com","customer1@sample.com","customer4@sample.com",
                                      "customer3@sample.com","customer3@sample.com"),
                            product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200", "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
                            quantity = c(2,1,1,1,1,3,5,3,1,1),
                            purchase_date = c("2020-01-01","2020-01-01","2020-01-01","2020-04-20","2020-04-20","2020-03-15","2020-07-07","2020-08-19","2020-09-22","2020-09-22"))
      
      tv_list <- data.frame(SKU = c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700"))
      

      This gives you the summary but omits any emails belonging to customers who have not bought a TV:

      total_tvs_by_cusomter <- purchase_dataframe %>%
        filter(product_sku %in% tv_list$SKU) %>%
        group_by(email) %>%
        mutate(quantity = as.numeric(quantity)) %>%
        summarise(number_purchased_tv = sum(quantity))
      

      Result:

      # A tibble: 3 x 2
        email                number_purchased_tv
        <chr>                              <dbl>
      1 customer1@sample.com                   8
      2 customer2@sample.com                   1
      3 customer3@sample.com                   4
      

      If you want to keep the emails/customers who have not bought a TV and show them with 0:

      # by = "email" makes the join key explicit and avoids the "Joining, by" message
      total_tvs_by_cusomter <- left_join(unique(purchase_dataframe %>%
                  select(email)), total_tvs_by_cusomter, by = "email")
      
      total_tvs_by_cusomter[is.na(total_tvs_by_cusomter)] <- 0
      

      Result:

                       email number_purchased_tv
      1 customer1@sample.com                   8
      2 customer2@sample.com                   1
      3 customer3@sample.com                   4
      4 customer4@sample.com                   0
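The filter, join and NA replacement can also be collapsed into a single weighted count, so customers without TVs never drop out in the first place. The sketch below assumes that `dplyr::count()` with a `wt` expression is acceptable here; `purchases`, `tv_skus` and `tv_counts` are illustrative names that do not come from the thread.

```r
library(dplyr)

# Sample data from the question (only the columns needed here);
# all object names below are illustrative
purchases <- data.frame(
  email = c("customer1@sample.com", "customer1@sample.com", "customer1@sample.com",
            "customer2@sample.com", "customer2@sample.com", "customer3@sample.com",
            "customer1@sample.com", "customer4@sample.com", "customer3@sample.com",
            "customer3@sample.com"),
  product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200",
                  "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
  quantity = c(2, 1, 1, 1, 1, 3, 5, 3, 1, 1)
)
tv_skus <- c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700")

# Weight each row by its quantity if it is a TV and by 0 otherwise,
# then let count() sum the weights per email
tv_counts <- purchases %>%
  count(email,
        wt = quantity * (product_sku %in% tv_skus),
        name = "number_purchased_tv")
print(tv_counts)
```

The logical `product_sku %in% tv_skus` is coerced to 0/1 in the multiplication, which is what zeroes out the non-TV rows.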
      

      【Comments】:

        【Solution 4】:
        tv_purchases <-
        purchase_dataframe %>% 
          group_by(email) %>% 
          filter(product_sku %in% tv_list$SKU) %>%
          summarise(number_purchased_tv = sum(as.numeric(quantity)))
        
        ## join tv_purchases on distinct emails, to also have the 'customer4@sample.com     0' row
        
        purchase_dataframe %>%
          distinct(email) %>%
          left_join(tv_purchases) %>% ## emails which are not in tv_purchases will have NAs
          mutate(number_purchased_tv = case_when(is.na(number_purchased_tv) ~ 0, ## NAs become zeros
                                                 TRUE ~ number_purchased_tv) ## non-NAs stay as they are
                 )
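For the NA-to-zero step, `dplyr::coalesce()` is a shorter alternative to `case_when()`. Below is a self-contained sketch of the same pipeline under that substitution; it is an illustration rather than the answer author's code, and `purchases`, `tv_skus` and `result` are assumed names.

```r
library(dplyr)

# Sample data from the question (only the columns needed here);
# all object names below are illustrative
purchases <- data.frame(
  email = c("customer1@sample.com", "customer1@sample.com", "customer1@sample.com",
            "customer2@sample.com", "customer2@sample.com", "customer3@sample.com",
            "customer1@sample.com", "customer4@sample.com", "customer3@sample.com",
            "customer3@sample.com"),
  product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200",
                  "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
  quantity = c(2, 1, 1, 1, 1, 3, 5, 3, 1, 1)
)
tv_skus <- c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700")

# TV totals for customers who bought at least one TV
tv_purchases <- purchases %>%
  filter(product_sku %in% tv_skus) %>%
  group_by(email) %>%
  summarise(number_purchased_tv = sum(quantity))

# Join back onto all distinct emails; coalesce() replaces the NAs with 0
result <- purchases %>%
  distinct(email) %>%
  left_join(tv_purchases, by = "email") %>%
  mutate(number_purchased_tv = coalesce(number_purchased_tv, 0))
print(result)
```

Passing `by = "email"` to `left_join()` also makes the join key explicit instead of relying on the automatically detected common column.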
        

        【Comments】:
