【Question title】: Select first row per run by group [duplicate]
【Posted】: 2021-05-22 13:16:28
【Question】:

I have data with a grouping variable (ID) and some values (type):

ID <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
type <- c("1", "3", "3", "2", "3", "3", "1", "1", "1", "2", "2", "1")

dat <- data.frame(ID,type)

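Within each ID, I want to remove repeated values: not every non-unique value, only those that are the same as the value directly before them. I have annotated some examples: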

#     ID type
#  1   1    1
#  2   1    3 # first value in a run of 3s within ID 1: keep 
#  3   1    3 # 2nd value: remove  
#  4   1    2
#  5   2    3
#  6   2    3
#  7   2    1
#  8   2    1
#  9   3    1
# 10   3    2 # first value in a run of 2s within ID 3: keep
# 11   3    2 # 2nd value: remove
# 12   3    1

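For example, the sequence of values for ID 3 is 1, 2, 2, 1. The third value is the same as the second, so it should be removed, giving 1, 2, 1.

The desired output is therefore: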

data.frame(ID = c("1", "1", "1", "2", "2", "3", "3", "3"),
           type = c("1", "3", "2", "3", "1", "1", "2", "1"))

  ID type
1  1    1
2  1    3
3  1    2
4  2    3
5  2    1
6  3    1
7  3    2
8  3    1

I tried

 dat[!duplicated(dat), ]

However, what I get is:

ID <- c("1", "1", "1", "2", "2", "3", "3")
type<- c("1", "3", "2", "3", "1", "1", "2")

I understand that duplicated() keeps only the unique rows. How can I get the values I want?

Thanks in advance for your help!
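
A minimal sketch of why the attempt above loses a row, using only base R and the data from the question. `duplicated()` compares each row with every earlier row, not just the one directly before it, so the last row (ID 3, type 1) is flagged because row 9 is identical. The name `dup_anywhere` is only illustrative.

```r
ID   <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
type <- c("1", "3", "3", "2", "3", "3", "1", "1", "1", "2", "2", "1")
dat  <- data.frame(ID, type)

# TRUE for any row that matches some earlier row, adjacent or not
dup_anywhere <- duplicated(dat)

# Rows 3, 6, 8 and 11 repeat the row directly above them;
# row 12 repeats row 9, which is NOT adjacent, yet it is flagged too
which(dup_anywhere)
```

Rows 3, 6, 8 and 11 are the ones that should go; row 12 is the extra casualty that explains the 7-row result.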

【Question comments】:

    Tags: r duplicates sequence run-length-encoding


    【Solution 1】:

    Does this work:

    library(dplyr)
    dat %>% group_by(ID) %>% 
       mutate(flag = case_when(type == lag(type) ~ TRUE, TRUE ~ FALSE)) %>% 
         filter(!flag) %>% select(-flag)
    # A tibble: 8 x 2
    # Groups:   ID [3]
      ID    type 
      <chr> <chr>
    1 1     1    
    2 1     3    
    3 1     2    
    4 2     3    
    5 2     1    
    6 3     1    
    7 3     2    
    8 3     1   
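
To make the idea behind the pipeline easier to follow, here is a base R sketch of the same logic, not the answerer's code. Within each ID, `lag(type)` is the previous value (and `NA` for the first row of the group); `case_when` treats the resulting `NA` comparison as no match, so the first row falls through to `FALSE` and is always kept. The helper `repeats_previous` is a hypothetical name introduced for illustration.

```r
ID   <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
type <- c("1", "3", "3", "2", "3", "3", "1", "1", "1", "2", "2", "1")
dat  <- data.frame(ID, type)

# TRUE when a value equals the one just before it.
# The first value has no predecessor, so it is never a repeat.
repeats_previous <- function(x) c(FALSE, x[-1] == x[-length(x)])

# Run the check separately inside each ID (like group_by + lag),
# then put the flags back in the original row order
flag <- unsplit(lapply(split(dat$type, dat$ID), repeats_previous), dat$ID)

# Keep only the rows that do not repeat their predecessor
result <- dat[!flag, ]
result
```

Because the comparison is done per group, the first row of an ID is kept even when its type equals the last type of the previous ID.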
    

    【Comments】:

    • Yes, it works :) I am trying to understand your idea.
    【Solution 2】:

    Using data.table's rleid together with duplicated:

    library(data.table)
    setDT(dat)[!duplicated(rleid(ID, type))]
    
    #   ID type
    #1:  1    1
    #2:  1    3
    #3:  1    2
    #4:  2    3
    #5:  2    1
    #6:  3    1
    #7:  3    2
    #8:  3    1
    

    Answer improved with a suggestion from @Henrik.
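
For readers without data.table, the following base R sketch imitates what `rleid(ID, type)` computes: a run id that increases by one whenever ID or type differs from the previous row. It is an approximation written for illustration, not data.table's implementation, and `changed` and `run_id` are assumed names.

```r
ID   <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
type <- c("1", "3", "3", "2", "3", "3", "1", "1", "1", "2", "2", "1")
dat  <- data.frame(ID, type)

n <- nrow(dat)

# A new run starts on the first row and wherever ID or type changes
changed <- c(TRUE, dat$ID[-1] != dat$ID[-n] | dat$type[-1] != dat$type[-n])

# Counting the run starts gives every row the id of the run it belongs to
run_id <- cumsum(changed)
run_id

# The first row of each run is the first occurrence of its run id
result <- dat[!duplicated(run_id), ]
result
```

Like the data.table one-liner, this treats a change of ID as the start of a new run, so it relies on the rows being in the order in which runs should be detected.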

    【Comments】:

      【Solution 3】:

      Base R approach, if you only want to remove consecutive duplicate rows (8 rows of output):

      ID <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
      type<- c("1", "3", "3", "2", "3", "3", "1", "1", "1", "2", "2", "1")
      
      dat <- data.frame(ID,type)
      
      subset(dat, !duplicated(with(rle(paste(dat$ID, dat$type)), rep(seq_len(length(lengths)), lengths))))
      #>    ID type
      #> 1   1    1
      #> 2   1    3
      #> 4   1    2
      #> 5   2    3
      #> 7   2    1
      #> 9   3    1
      #> 10  3    2
      #> 12  3    1
      

      Created on 2021-05-22 by the reprex package (v2.0.0)
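
The one-liner above packs several steps together. As a sketch, it can be unpacked as follows; the intermediate names (`key`, `runs`, `run_id`, `first_rows`) are illustrative and not part of the original answer. One caveat to be aware of: `paste()` joins ID and type with a space, so values that themselves contain spaces could in principle produce the same key for different ID/type pairs.

```r
ID   <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
type <- c("1", "3", "3", "2", "3", "3", "1", "1", "1", "2", "2", "1")
dat  <- data.frame(ID, type)

# 1. One string per row that identifies the ID/type combination
key <- paste(dat$ID, dat$type)

# 2. Run-length encoding: consecutive identical keys collapse into one run
runs <- rle(key)
runs$lengths

# 3. Expand back to one run number per row
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# 4. Keep the first row of every run
result <- subset(dat, !duplicated(run_id))

# Alternative to steps 3 and 4: compute the first row of each run directly
first_rows <- cumsum(runs$lengths) - runs$lengths + 1L
identical(rownames(result), as.character(first_rows))
```

The last line checks that both routes select the same rows.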

      【Comments】:
