【Question title】: How to Filter Out Rows per Group after a Condition Occurs
【Posted】: 2018-10-11 18:02:49
【Question】:

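I'm new to R programming and am trying to remove certain rows from a group of rows once a filter condition has been met.

Scenario: for each GROUP, if two TYPE "B" rows occur consecutively, remove all subsequent rows of that GROUP. The "Include in DataSet" column shows what the output should be.

Here is my sample input: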

GROUP   TYPE    Include in DataSet?
--------------------------------------------
1       A       yes
1       A       yes
1       B       yes
1       B       yes
1       B       no
2       A       yes
2       B       yes
2       B       yes
2       A       no
2       B       no
2       B       no

DF = structure(list(GROUP = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L), TYPE = c("A", "A", "B", "B", "B", "A", "B", "B", "A", 
"B", "B"), inc = c("yes", "yes", "yes", "yes", "no", "yes", "yes", 
"yes", "no", "no", "no")), .Names = c("GROUP", "TYPE", "inc"), row.names = c(NA, 
-11L), class = "data.frame")

Expected output:

GROUP   TYPE    Include in DataSet?
--------------------------------------------
1       A       yes
1       A       yes
1       B       yes
1       B       yes
2       A       yes
2       B       yes
2       B       yes

I tried writing some code, but it did not work because of the grouping problem.

i=1
j=2
x <- allrows
for (i in x){
  for(j in x){
    if(i==j){
      a$REMOVE=1
    }
    else{
      a$REMOVE=2
    }
  }
}

【Comments】:

    Tags: r dplyr


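    【Solution 1】:

    You can do this by creating a new variable that flags "double B" rows, then filtering out the rows that come after the first "double B" row in each group: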

    library(dplyr)
    df %>%
        group_by(GROUP) %>%
        # Create a new variable that tests whether each row and the one above it both have TYPE == 'B'
        mutate(double_B = (TYPE == 'B' & lag(TYPE) == 'B')) %>%
        # Find the first row with `double_B` in each group, filter out rows after it
        filter(row_number() <= min(which(double_B == TRUE))) %>%
        # Optionally, remove `double_B` column when done with it
        select(-double_B)
    
    # A tibble: 7 x 3
    # Groups:   GROUP [2]
      GROUP TYPE  IncludeinDataSet
      <int> <chr> <chr>           
    1     1 A     yes             
    2     1 A     yes             
    3     1 B     yes             
    4     1 B     yes             
    5     2 A     yes             
    6     2 B     yes             
    7     2 B     yes       
    

    As @Frank points out in the comments, you don't need to create the double_B variable: you can simply test the "double B" condition inside the which call:

    df %>%
        group_by(GROUP) %>%
        # Find the first row with `double_B` in each group, filter out rows after it
        filter(row_number() <= min(which(TYPE == 'B' & lag(TYPE) == 'B')))
    

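    Also, if the "double B" condition is never met within a group, this raises a warning (min() of an empty vector returns Inf), but the group is still filtered correctly, since every row is kept.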
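
The warning described above can be avoided by counting the "double B" rows seen before each row instead of calling min(which(...)). The following is only a sketch under assumptions: the data frame `df` is rebuilt here from the question's data, and a hypothetical third group without any "double B" is added to show the edge case.

```r
library(dplyr)

# Question's data plus a hypothetical GROUP 3 that never has two B's in a row
df <- data.frame(
  GROUP = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L),
  TYPE  = c("A", "A", "B", "B", "B", "A", "B", "B", "A", "B", "B", "A", "B"),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(GROUP) %>%
  # TRUE where this row and the previous row in the group are both "B"
  mutate(double_B = TYPE == "B" & lag(TYPE, default = "") == "B") %>%
  # Keep a row only if no double-B row occurred strictly before it
  filter(cumsum(double_B) - double_B == 0) %>%
  select(-double_B) %>%
  ungroup()
```

Because no min() over a possibly empty index vector is involved, groups without a "double B" pass through unchanged and no warning is raised.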

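    【Comments】:

    • Re "optionally": another way is to use the condition without naming it: df %>% group_by(GROUP) %>% filter(row_number() <= min(which(TYPE == 'B' & lag(TYPE) == 'B'))). By the way, if there is never a double B, you will get a warning (e.g., try min(which(FALSE))), but I'm not sure whether there is a way around that.
    【Solution 2】: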

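    This can be done by finding the numeric index where the current value of 'TYPE' and the next value of 'TYPE' are both "B", then using seq_len to build the sequence from 1 to that index plus one and subsetting the rows with it (in slice).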

    library(dplyr)
    df1 %>% 
      group_by(GROUP) %>% 
      slice(seq_len(which((TYPE == "B") & lead(TYPE) == "B")[1] + 1))
    # A tibble: 7 x 3
    # Groups:   GROUP [2]
    #  GROUP TYPE  IncludeInDataSet
    #  <int> <chr> <chr>           
    #1     1 A     yes             
    #2     1 A     yes             
    #3     1 B     yes             
    #4     1 B     yes             
    #5     2 A     yes             
    #6     2 B     yes             
    #7     2 B     yes          
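
To make the index logic easier to follow, here is a small base R sketch for a single group. It is an illustration only: `next_type` is a hand-rolled stand-in for dplyr's lead(), and the vector is GROUP 2 from the question.

```r
# TYPE values of GROUP 2 from the question
TYPE <- c("A", "B", "B", "A", "B", "B")

# Stand-in for dplyr::lead(TYPE): shift values up by one, pad with NA
next_type <- c(TYPE[-1], NA)

# Position of the first "B" that is immediately followed by another "B"
first_pair <- which(TYPE == "B" & next_type == "B")[1]

# Keep everything up to and including the second "B" of that pair
keep <- seq_len(first_pair + 1)
kept_types <- TYPE[keep]
```

Note that if a group contains no such pair, `first_pair` is NA and seq_len() fails, so this approach assumes every group has at least one "double B".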
    

    Data

    df1 <- structure(list(GROUP = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
     2L, 2L), TYPE = c("A", "A", "B", "B", "B", "A", "B", "B", "A", 
     "B", "B"), IncludeInDataSet = c("yes", "yes", "yes", "yes", "no", 
      "yes", "yes", "yes", "no", "no", "no")), class = "data.frame", 
     row.names = c(NA, -11L))
    


      【Solution 3】:

      Another approach could be:

      library(dplyr)
      library(data.table)
      
      df %>%
        group_by(GROUP, rleid(TYPE)) %>%
        mutate(temp = seq_along(TYPE)) %>%
        ungroup() %>%
        group_by(GROUP) %>%
        filter(row_number() <= min(which(TYPE == "B" & temp == 2))) %>%
        select(GROUP, TYPE, IncludeInDataSet)
      


        【Solution 4】:

        Here is a base R solution:

        subset(DF, as.logical(ave(DF$TYPE,DF$GROUP, FUN= function(x) 
          seq_along(x) <= which((sequence(rle(x=="B")$lengths) * (x=="B")) %in% 2)[1])))
        #   GROUP TYPE inc
        # 1     1    A yes
        # 2     1    A yes
        # 3     1    B yes
        # 4     1    B yes
        # 6     2    A yes
        # 7     2    B yes
        # 8     2    B yes
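
The one-liner above is dense, so here is a step-by-step sketch of what the function passed to ave() computes for one group. The intermediate variable names are made up for illustration, and the vector is GROUP 2 from the question.

```r
# TYPE values of GROUP 2 from the question
x <- c("A", "B", "B", "A", "B", "B")

is_b <- x == "B"

# Position of each element within its run of identical values: 1 1 2 1 1 2
run_pos <- sequence(rle(is_b)$lengths)

# Zero out the non-B runs, leaving positions within B runs only: 0 1 2 0 1 2
b_run_pos <- run_pos * is_b

# First row that is the second "B" of a run
cutoff <- which(b_run_pos %in% 2)[1]

# Logical mask of the rows to keep in this group
keep <- seq_along(x) <= cutoff
```

ave() applies this per GROUP. Since TYPE is a character vector, the logical result comes back as character, which is why the original wraps it in as.logical().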
        
