【Question title】: How to Transform Data to Find Index with Same Value
【Posted】: 2017-10-04 14:26:51
【Question】:

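I want to find customers who bought exactly the same set of products.

The data I have describes customer behavior: what each customer bought.

The example I provide is a simplified version of my data. A customer typically buys 10 to 20 products, out of roughly 50 products available.

I am really stuck on finding a simple way to transform my data into the output I want. Could you give me some advice? Thanks.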

Input:

structure(list(Customer_ID = 1:6, Products = c("Apple, Beer, Diaper", 
"Beer, Apple", "Beer, Apple, Diaper, Diaper", "Apple, Diaper", 
"Diaper, Apple", "Apple, Diaper, Beer, Beer")), .Names = c("Customer_ID", 
"Products"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L), spec = structure(list(cols = structure(list(Customer_ID = structure(list(), class = c("collector_integer", 
"collector")), Products = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("Customer_ID", "Products")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

Output:

structure(list(`Products Bought` = c("Apple, Beer, Diaper", "Apple, Diaper"
), Customer_ID = c("1, 3, 6", "4, 5")), .Names = c("Products Bought", 
"Customer_ID"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-2L), spec = structure(list(cols = structure(list(`Products Bought` = structure(list(), class = c("collector_character", 
"collector")), Customer_ID = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("Products Bought", "Customer_ID")), 
    default = structure(list(), class = c("collector_guess", 
    "collector"))), .Names = c("cols", "default"), class = "col_spec"))

【Comments】:

  • Please use dput to show a small reproducible example, and show the expected output based on it rather than as an image.

Tags: r data-manipulation data-cleaning data-processing


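【Solution 1】:

I suspect you may want to structure your data in a more useful way. In any case, the tidyverse can be a useful way to think about the task.

As mentioned in the comments, posting your data as code saves others time and gets you an answer faster.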

library(dplyr)
library(stringr)
library(tidyr)

## tibble() replaces the deprecated data_frame()
d <- tibble(id = c(1, 2, 3, 4, 5, 6)
     , bought = c('Apple, Beer, Diaper', 'Apple, Beer', 'Apple, Beer, Diaper, Diaper'
               , 'Apple, Diaper', 'Diaper, Apple', 'Apple, Diaper, Beer, Beer'))

d %>%
  ## Unnest the values and trim whitespace
  ## - This long format is the better data structure to have anyway
  mutate(buy = str_split(bought, ',')) %>%
  unnest(buy) %>% mutate(buy = str_trim(buy)) %>% select(-bought) %>%

  ## Keep distinct id/product pairs, sorted so that equal sets give equal strings
  distinct(id, buy) %>% arrange(id, buy) %>%

  ## Aggregate by id: one product-set string per customer
  group_by(id) %>% summarize(bought = paste(buy, collapse = ', ')) %>% ungroup() %>%

  ## Collect the ids that share each product set
  group_by(bought) %>% summarize(ids = paste(id, collapse = ',')) %>% ungroup()

Edit: Referenced this SO post for a faster/cleaner way to get distinct combinations in dplyr.

【Comments】:

    【Solution 2】:

    Using the given input data and data.table, this can be written as a (rather convoluted) "one-liner":

    library(data.table)

    dcast(unique(setDT(input)[, strsplit(Products, ", "), Customer_ID])[
      order(Customer_ID, V1)], 
      Customer_ID ~ ., paste, collapse = ", ")[
        , .(Customers = paste(Customer_ID, collapse = ", ")), .(Products = .)]
    #              Products Customers
    #1: Apple, Beer, Diaper   1, 3, 6
    #2:         Apple, Beer         2
    #3:       Apple, Diaper      4, 5
    

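    Note that the OP dropped the second row, which has only one customer, from the expected output, but the question does not state any criterion for filtering the output.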
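If the intent is to keep only product sets shared by more than one customer, the result could be filtered afterwards. The following is a minimal base R sketch under that assumption; the `result` data frame is a hand-copied stand-in for the output shown above, and the threshold of two customers is a guess at the OP's criterion, not something stated in the question.

```r
# Stand-in for the output shown above (copied by hand, not computed here).
result <- data.frame(
  Products  = c("Apple, Beer, Diaper", "Apple, Beer", "Apple, Diaper"),
  Customers = c("1, 3, 6", "2", "4, 5"),
  stringsAsFactors = FALSE
)

# Count the customers in each row by splitting the comma-separated ID string.
n_customers <- lengths(strsplit(result$Customers, ", ", fixed = TRUE))

# Assumed criterion: keep product sets bought by at least two customers.
shared <- result[n_customers >= 2, ]
print(shared)
```

With the data above, this keeps the rows for "Apple, Beer, Diaper" and "Apple, Diaper" and drops the single-customer row.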

    Input data

    (as given by the OP):

    input <- structure(list(Customer_ID = 1:6, Products = c("Apple, Beer, Diaper", 
    "Beer, Apple", "Beer, Apple, Diaper, Diaper", "Apple, Diaper", 
    "Diaper, Apple", "Apple, Diaper, Beer, Beer")), .Names = c("Customer_ID", 
    "Products"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
    -6L), spec = structure(list(cols = structure(list(Customer_ID = structure(list(), class = c("collector_integer", 
    "collector")), Products = structure(list(), class = c("collector_character", 
    "collector"))), .Names = c("Customer_ID", "Products")), default = structure(list(), class = c("collector_guess", 
    "collector"))), .Names = c("cols", "default"), class = "col_spec"))
    

    【Comments】:
