【Question title】: How to calculate unique count using dcast in R
【Posted】: 2020-11-15 10:22:25
【Question description】:

I am using dcast to transpose the following table:

date               event          user_id
25-07-2020         Create          3455
25-07-2020         Visit           3567
25-07-2020         Visit           3567
25-07-2020         Add             3567
25-07-2020         Add             3678
25-07-2020         Add             3678
25-07-2020         Create          3567
24-07-2020         Edit            3871

I am using dcast to transpose it so that my events become columns and the user_id values are counted:

dae_summ <- dcast(ahoy_events, date ~ event, value.var="user_id")

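But I am not getting a count of unique user IDs. The same user_id is counted multiple times. What can I do so that a user_id is counted only once for the same date and event?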

【Comments】:

    Tags: r transpose dcast


    【Solution 1】:

    A base R option using reshape:

    out <- replace(
      u <- reshape(
        unique(transform(ahoy_events, user_id = ave(user_id, event, date, FUN = function(x) length(unique(x))))),
        direction = "wide",
        idvar = "date",
        timevar = "event"
      ),
      is.na(u),
      0
    )
    

    which gives

    > out
            date user_id.Create user_id.Visit user_id.Add user_id.Edit
    1 25-07-2020              2             1           2            0
    8 24-07-2020              0             0           0            1
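
The nested one-liner above can be hard to read, so here is a minimal step-by-step sketch of the same idea in base R. The intermediate names `counted` and `wide` are illustrative only and do not appear in the original answer; the sample data is repeated so the snippet runs on its own.

```r
# Sample data from the question, repeated so this snippet is self-contained
ahoy_events <- data.frame(
  date = c(rep("25-07-2020", 7), "24-07-2020"),
  event = c("Create", "Visit", "Visit", "Add", "Add", "Add", "Create", "Edit"),
  user_id = c(3455L, 3567L, 3567L, 3567L, 3678L, 3678L, 3567L, 3871L),
  stringsAsFactors = FALSE
)

# Step 1: overwrite user_id with the number of distinct users
# in its (event, date) group
counted <- transform(
  ahoy_events,
  user_id = ave(user_id, event, date, FUN = function(x) length(unique(x)))
)

# Step 2: rows within a group are now identical, so keep one per group
counted <- unique(counted)

# Step 3: spread the events into columns, one row per date
wide <- reshape(counted, direction = "wide", idvar = "date", timevar = "event")

# Step 4: date/event combinations that never occurred come out as NA;
# treat them as a count of zero
wide[is.na(wide)] <- 0

print(wide)
```

Splitting the steps makes it easier to inspect `counted` before reshaping, which is where a wrong grouping would show up first.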
    

    Data

    ahoy_events <- structure(list(date = c(
      "25-07-2020", "25-07-2020", "25-07-2020",
      "25-07-2020", "25-07-2020", "25-07-2020", "25-07-2020", "24-07-2020"
    ), event = c(
      "Create", "Visit", "Visit", "Add", "Add", "Add",
      "Create", "Edit"
    ), user_id = c(
      3455L, 3567L, 3567L, 3567L, 3678L,
      3678L, 3567L, 3871L
    )), class = "data.frame", row.names = c(
      NA,
      -8L
    ))
    

    【Comments】:

      【Solution 2】:

      With the reshape2 package, you can use the following:

      library(reshape2)
      library(magrittr)  # provides the %>% pipe used in the code below
      

      Data:

      zz <- "date               event          user_id
             25-07-2020         Create          3455
             25-07-2020         Visit           3567
             25-07-2020         Visit           3567
             25-07-2020         Add             3567
             25-07-2020         Add             3678
             25-07-2020         Add             3678
             25-07-2020         Create          3567
             24-07-2020         Edit            3871"
      data <- read.table(text=zz, header = TRUE)
      

      Code:

      data %>% 
        dcast(date ~ event, value.var = "user_id", fun.aggregate = function(x) length(unique(x)))
      

      Output:

      date         Add     Create      Edit      Visit
      <fctr>       <int>   <int>       <int>     <int>
      24-07-2020   0       0           1         0
      25-07-2020   2       2           0         1
      

      Created on 2020-07-25 by the reprex package (v0.3.0)

      【Comments】:

        【Solution 3】:

        We can use uniqueN from data.table:

        library(data.table)
        dcast(setDT(ahoy_events), date ~ event, value.var = "user_id", fun.aggregate = uniqueN)
        #         date Add Create Edit Visit
        #1: 24-07-2020   0      0    1     0
        #2: 25-07-2020   2      2    0     1
        

        Or use pivot_wider from tidyr and specify n_distinct as values_fn:

        library(tidyr)
        library(dplyr)
        ahoy_events %>%
           pivot_wider(names_from = event, values_from = user_id, 
              values_fn = list(user_id = n_distinct), values_fill = list(user_id = 0))
        # A tibble: 2 x 5
        #   date       Create Visit   Add  Edit
        #  <chr>       <int> <int> <int> <int>
        #1 25-07-2020      2     1     2     0
        #2 24-07-2020      0     0     0     1
        

        Data

        ahoy_events <- structure(list(date = c("25-07-2020", "25-07-2020", "25-07-2020", 
        "25-07-2020", "25-07-2020", "25-07-2020", "25-07-2020", "24-07-2020"
        ), event = c("Create", "Visit", "Visit", "Add", "Add", "Add", 
        "Create", "Edit"), user_id = c(3455L, 3567L, 3567L, 3567L, 3678L, 
        3678L, 3567L, 3871L)), class = "data.frame", row.names = c(NA, 
        -8L))
        

        【Comments】:

          【Solution 4】:

          You can try:

          library(reshape2)
          
          #Data
          df <- structure(list(date = c("25-07-2020", "25-07-2020", "25-07-2020", 
          "25-07-2020", "25-07-2020", "25-07-2020", "25-07-2020", "24-07-2020"
          ), event = c("Create", "Visit", "Visit", "Add", "Add", "Add", 
          "Create", "Edit"), user_id = c(3455L, 3567L, 3567L, 3567L, 3678L, 
          3678L, 3567L, 3871L)), class = "data.frame", row.names = c(NA, 
          -8L))
          
          #New code
          dae_summ <- dcast(df, date ~ event,  value.var="user_id",fun.aggregate = function(x) length(unique(x)))
          
                  date Add Create Edit Visit
          1 24-07-2020   0      0    1     0
          2 25-07-2020   2      2    0     1
          

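          Your original code, which passes no fun.aggregate and therefore falls back to the default length, produces this result: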

                  date Add Create Edit Visit
          1 24-07-2020   0      0    1     0
          2 25-07-2020   3      2    0     2
          

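          So there is a difference: the default counts every row, whereas length(unique(x)) counts each user_id only once per date and event.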
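
As a cross-check that needs no packages, the same distinct counts can be sketched in base R by dropping duplicate (date, event, user_id) rows and tabulating what is left. The names `dedup` and `counts` are illustrative only, and the data frame repeats the sample data so the snippet runs on its own.

```r
# Same sample data as in the answers above
df <- data.frame(
  date = c(rep("25-07-2020", 7), "24-07-2020"),
  event = c("Create", "Visit", "Visit", "Add", "Add", "Add", "Create", "Edit"),
  user_id = c(3455L, 3567L, 3567L, 3567L, 3678L, 3678L, 3567L, 3871L),
  stringsAsFactors = FALSE
)

# Keep one row per (date, event, user_id) combination
dedup <- unique(df[, c("date", "event", "user_id")])

# Cross-tabulate date against event; each cell now counts distinct users
counts <- table(dedup$date, dedup$event)
print(counts)
```

If this table and the dcast result disagree, the aggregation function passed to dcast is the first place to look.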

          【Comments】:
