【Question title】: Retrieve all rows for specific text after grouping by application and user id
【Posted】: 2019-04-24 07:11:19
【Question】:

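When a user completes a step digitally, is_digitally_signed becomes YES. What I am trying to do: if any step was completed digitally, I want to retrieve all rows for the same application_id and user_id. Please see my desired output below.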

R code to reproduce my dataset

library(data.table)
library(dplyr)

# rep() makes the recycling of the three statuses and dates explicit
df <- data.table(application_id = c(1,1,1,2,2,2,3,3,3), 
                 user_id = c(123,123,123,456,456,456,789,789,789), 
                 application_status = rep(c("incomplete", "details_verified", "complete"), times = 3),
                 date = rep(c("01/01/2018", "02/01/2018", "03/01/2018"), times = 3),
                 is_digitally_signed = c("NULL", "NULL", "YES", "NULL", "NULL", "NULL", "NULL", "YES", "NULL")) %>%
  mutate(date = as.Date(date, "%d/%m/%Y"))

which gives

df
  application_id user_id application_status       date is_digitally_signed
              1     123         incomplete  2018-01-01                NULL
              1     123   details_verified  2018-01-02                NULL
              1     123           complete  2018-01-03                 YES
              2     456         incomplete  2018-01-01                NULL
              2     456   details_verified  2018-01-02                NULL
              2     456           complete  2018-01-03                NULL
              3     789         incomplete  2018-01-01                NULL
              3     789   details_verified  2018-01-02                 YES
              3     789           complete  2018-01-03                NULL

My (unsuccessful) attempt

df %>% group_by(application_id,user_id) %>% filter_all(all.vars(. == "YES"))

Desired result

application_id user_id application_status       date is_digitally_signed
              1     123         incomplete 2018-01-01                NULL
              1     123   details_verified 2018-01-02                NULL
              1     123           complete 2018-01-03                 YES
              3     789         incomplete 2018-01-01                NULL
              3     789   details_verified 2018-01-02                 YES
              3     789           complete 2018-01-03                NULL

【Comments】:

    Tags: r dplyr data-manipulation


    【Solution 1】:

    dplyr

    We can use filter together with any, which checks whether a given group contains at least one record with is_digitally_signed == 'YES':

    library(dplyr)
    
    df %>% 
      group_by(application_id, user_id) %>%
      filter(any(is_digitally_signed == "YES"))
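
To see why this keeps whole groups, it can help to look at the value any() produces for each row. The following is only a sketch on a cut-down copy of the question's data; the column name has_signed is a hypothetical name introduced for illustration.

```r
library(dplyr)

# Cut-down stand-in for the question's data (only the columns needed here)
df <- data.frame(
  application_id = rep(1:3, each = 3),
  user_id = rep(c(123, 456, 789), each = 3),
  is_digitally_signed = c("NULL", "NULL", "YES",
                          "NULL", "NULL", "NULL",
                          "NULL", "YES", "NULL"),
  stringsAsFactors = FALSE
)

# any() returns a single TRUE/FALSE per group; mutate() recycles it
# onto every row of that group (has_signed is a hypothetical name)
flagged <- df %>%
  group_by(application_id, user_id) %>%
  mutate(has_signed = any(is_digitally_signed == "YES")) %>%
  ungroup()

# filter() applies the same recycled value, so a group is kept or dropped whole
kept <- df %>%
  group_by(application_id, user_id) %>%
  filter(any(is_digitally_signed == "YES")) %>%
  ungroup()

print(flagged)
print(kept)
```

Printing flagged should show the same has_signed value on all three rows of each application, which is the condition filter() evaluates.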
    

    Alternatively, use the all function to keep the groups in which not every is_digitally_signed value equals "NULL":

    df %>% 
      group_by(application_id, user_id) %>%
      filter(!all(is_digitally_signed == "NULL"))
    

    data.table

    We can also use data.table, since you have already loaded the data as a DT:

    library(data.table)
    dt = setDT(df)
    dt[dt[,.I[any(is_digitally_signed == "YES")], by=.(application_id, user_id)]$V1,]
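
The inner call in the line above can be inspected on its own. This is a sketch on a cut-down copy of the data; the variable name idx is introduced here for illustration only.

```r
library(data.table)

# Cut-down stand-in for the question's data
dt <- data.table(
  application_id = rep(1:3, each = 3),
  user_id = rep(c(123, 456, 789), each = 3),
  is_digitally_signed = c("NULL", "NULL", "YES",
                          "NULL", "NULL", "NULL",
                          "NULL", "YES", "NULL")
)

# .I holds the row numbers (within the whole table) of the current group.
# Subsetting it with a single TRUE keeps all of them, with FALSE none,
# so groups without a "YES" contribute no row numbers at all.
idx <- dt[, .I[any(is_digitally_signed == "YES")],
          by = .(application_id, user_id)]$V1

print(idx)       # row numbers of the groups that contain a "YES"
print(dt[idx])   # the matching rows, in their original order
```

Collecting row numbers first and subsetting once at the end avoids building a sub-table per group.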
    

    Or, using .SD:

    dt[,.SD[any(is_digitally_signed == "YES")], by=.(application_id, user_id)]
    

    Output:

    # A tibble: 6 x 5
    # Groups:   application_id, user_id [2]
      application_id user_id application_status date       is_digitally_signed
               <dbl>   <dbl> <fct>              <date>     <fct>              
    1              1     123 incomplete         2018-01-01 NULL               
    2              1     123 details_verified   2018-01-02 NULL               
    3              1     123 complete           2018-01-03 YES                
    4              3     789 incomplete         2018-01-01 NULL               
    5              3     789 details_verified   2018-01-02 YES                
    6              3     789 complete           2018-01-03 NULL
    

    【Comments】:

      【Solution 2】:

      Since there is only one column to test, we can simply use filter with any

      library(dplyr)
      df %>% 
         group_by(application_id,user_id) %>% 
          filter(any(is_digitally_signed  == "YES"))
      # A tibble: 6 x 5
      # Groups:   application_id, user_id [2]
      #  application_id user_id application_status date       is_digitally_signed
      #           <dbl>   <dbl> <chr>              <date>     <chr>              
      #1              1     123 incomplete         2018-01-01 NULL               
      #2              1     123 details_verified   2018-01-02 NULL               
      #3              1     123 complete           2018-01-03 YES                
      #4              3     789 incomplete         2018-01-01 NULL               
      #5              3     789 details_verified   2018-01-02 YES                
      #6              3     789 complete           2018-01-03 NULL               
      

      Another option is to use %in%, which returns a single TRUE/FALSE that is recycled across the group

      df %>% 
         group_by(application_id,user_id) %>% 
         filter("YES" %in% is_digitally_signed)
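
A minimal base R sketch of what `%in%` returns for one group. The vectors below are hypothetical stand-ins for a single group's is_digitally_signed values. The last part assumes a column with real NA values, which is not the case in the question (it stores the string "NULL").

```r
# Hypothetical values of is_digitally_signed inside one group
signed   <- c("NULL", "NULL", "YES")
unsigned <- c("NULL", "NULL", "NULL")

# The left-hand side has length one, so the result has length one
has_signed   <- "YES" %in% signed
has_unsigned <- "YES" %in% unsigned

# Assumption for illustration: missing values stored as real NA.
# %in% still returns TRUE or FALSE, whereas any(x == "YES") can return NA.
missing    <- c(NA_character_, NA_character_)
in_missing <- "YES" %in% missing
eq_missing <- any(missing == "YES")

print(c(has_signed, has_unsigned, in_missing, eq_missing))
```

Inside filter() a group whose condition is NA is dropped just like one whose condition is FALSE, so for this task both forms select the same rows.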
      

      Or we can use base R

      df[with(df, ave(is_digitally_signed == "YES", application_id,user_id, FUN = any)),]
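
The base R line can be split into two steps to show the logical vector that ave builds. This is a sketch on a cut-down copy of the data; the name keep is introduced here for illustration.

```r
# Cut-down stand-in for the question's data
df <- data.frame(
  application_id = rep(1:3, each = 3),
  user_id = rep(c(123, 456, 789), each = 3),
  is_digitally_signed = c("NULL", "NULL", "YES",
                          "NULL", "NULL", "NULL",
                          "NULL", "YES", "NULL"),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each application_id/user_id combination and
# writes the result back to every row of that combination
keep <- with(df, ave(is_digitally_signed == "YES",
                     application_id, user_id, FUN = any))

print(keep)         # one TRUE/FALSE per row of df
result <- df[keep, ]
print(result)
```

Because keep has one entry per row, it can be used directly as a row index without any grouping package.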
      

      【Comments】: