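【Question title】:Flag non-consecutive values by group in R
【Posted】:2021-09-08 00:30:32
【Question】:

I have a dataset made up of multiple groups, each with consecutively numbered bins (the number of bins is not necessarily the same in every group) and a boolean present/absent value. I would like to produce some output indicating which groups contain non-consecutive "present" values.

A minimal reprex would look like this: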

x <- NULL
x$group <- c(rep("A",4),rep("B", 5), rep("C",4))
x$bin <- c(1,2,3,4,1,2,3,4,5,1,2,3,4)
x$status <- c("absent", "present", "absent", "present", "absent", "present", "present", "absent", "absent", "absent", "absent", "present", "present")

as.data.frame(x)

   group bin  status
1      A   1  absent
2      A   2 present
3      A   3  absent
4      A   4 present
5      B   1  absent
6      B   2 present
7      B   3 present
8      B   4  absent
9      B   5  absent
10     C   1  absent
11     C   2  absent
12     C   3 present
13     C   4 present

The output could be another column with a flag in the same data frame,

   group bin  status flag
1      A   1  absent    1
2      A   2 present    1
3      A   3  absent    1
4      A   4 present    1
5      B   1  absent    0
6      B   2 present    0
7      B   3 present    0
8      B   4  absent    0
9      B   5  absent    0
10     C   1  absent    0
11     C   2  absent    0
12     C   3 present    0
13     C   4 present    0

a separate data frame or matrix, such as:

  group  flag
1     A  TRUE
2     B FALSE
3     C FALSE

or a list:

> flagged_groups
[1] "A"

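Writing this up has already helped me sort out some of what I need to do, but I would love to hear your ideas for a concise (and tidy) way to refine my data.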

【Question comments】:

    Tags: r dataframe dplyr


    【Solution 1】:

    You can do it like this:

    library(dplyr)
    
    df %>%
      group_by(group) %>%
      mutate(flag = +any(diff(row_number()[status == "present"]) != 1))
    
    # A tibble: 13 x 4
    # Groups:   group [3]
       group   bin status   flag
       <chr> <dbl> <chr>   <int>
     1 A         1 absent      1
     2 A         2 present     1
     3 A         3 absent      1
     4 A         4 present     1
     5 B         1 absent      0
     6 B         2 present     0
     7 B         3 present     0
     8 B         4 absent      0
     9 B         5 absent      0
    10 C         1 absent      0
    11 C         2 absent      0
    12 C         3 present     0
    13 C         4 present     0
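
To see why the `diff()` expression works, it can help to run it on a single group outside of dplyr. The sketch below uses base R only; `status_a`, `pos`, `flag_a` and `flag_single` are illustrative names that do not appear in the answer.

```r
# Status values of group A from the question
status_a <- c("absent", "present", "absent", "present")

# Row positions of the "present" values
# (what row_number()[status == "present"] yields within a group)
pos <- seq_along(status_a)[status_a == "present"]
pos
# [1] 2 4

# Gaps between successive "present" positions; any gap other than 1 is a break
diff(pos)
# [1] 2

# any() collapses the comparison to one logical, unary + coerces it to an integer
flag_a <- +any(diff(pos) != 1)
flag_a
# [1] 1

# With zero or one "present" value diff() is empty, so the flag is 0
flag_single <- +any(diff(which(c("absent", "present") == "present")) != 1)
flag_single
# [1] 0
```

Note that this test flags any gap between "present" rows, so a group such as present, present, absent, present would also be flagged.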
    

    【Comments】:

    • Thanks. This is very simple, and it works on the few edge cases I tested. I hadn't seen +any before, but I like it.
    【Solution 2】:

    Data

    df <-
    structure(list(group = c("A", "A", "A", "A", "B", "B", "B", "B", 
    "B", "C", "C", "C", "C"), bin = c(1, 2, 3, 4, 1, 2, 3, 4, 5, 
    1, 2, 3, 4), status = c("absent", "present", "absent", "present", 
    "absent", "present", "present", "absent", "absent", "absent", 
    "absent", "present", "present")), class = "data.frame", row.names = c(NA, 
    -13L))
    

    Summary data frame

    Code

    library(dplyr)

    flagged_df <-
      df %>% 
      # Group by the variable group
      group_by(group) %>% 
      # Count rows where a "present" directly follows another "present";
      # default = "absent" keeps the first row of each group from yielding NA
      summarise(flag = sum(status == "present" & lag(status, default = "absent") == "present")) %>% 
      # TRUE when no two "present" values are adjacent
      # (this also holds for groups with fewer than two "present" values)
      mutate(flag = flag == 0)
    

    Output

    # A tibble: 3 x 2
      group flag 
      <chr> <lgl>
    1 A     TRUE 
    2 B     FALSE
    3 C     FALSE 
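
The summary relies on comparing each row with the one before it. As a minimal sketch of that step on a single group (assuming dplyr is installed; `status_c`, `prev` and `adjacent` are illustrative names), `lag()` shifts the vector by one position, and the first element has no predecessor unless a `default` is supplied:

```r
library(dplyr)

# Status values of group C from the question
status_c <- c("absent", "absent", "present", "present")

# lag() shifts the vector down by one; the first element becomes NA
is.na(lag(status_c)[1])
# [1] TRUE

# Supplying a default avoids the NA in the first position
prev <- lag(status_c, default = "absent")

# TRUE where a "present" row directly follows another "present" row
adjacent <- status_c == "present" & prev == "present"
adjacent
# [1] FALSE FALSE FALSE  TRUE

# One adjacent pair, so group C is not flagged
sum(adjacent)
# [1] 1
```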
    

    Adding the flag column to the original data.frame

    Code

    df %>% 
      left_join(
        flagged_df
      ) %>% 
      mutate(flag = as.numeric(flag))
    

    Output

    Joining, by = "group"
       group bin  status flag
    1      A   1  absent    1
    2      A   2 present    1
    3      A   3  absent    1
    4      A   4 present    1
    5      B   1  absent    0
    6      B   2 present    0
    7      B   3 present    0
    8      B   4  absent    0
    9      B   5  absent    0
    10     C   1  absent    0
    11     C   2  absent    0
    12     C   3 present    0
    13     C   4 present    0
    

    【Comments】:

      【Solution 3】:

      Using rle, you can check that the data has at least 2 "present" values and that no two of them are consecutive.

      library(dplyr)
      
      check_flag <- function(status) {
        with(rle(status == 'present'), sum(values) > 1 && all(lengths[values] < 2))  
      }
      
      # group_by() needs a data frame; the question builds x as a plain list
      x <- as.data.frame(x)

      x %>%
        group_by(group) %>%
        summarise(flag = check_flag(status))
      
      #  group flag 
      #  <chr> <lgl>
      #1 A     TRUE 
      #2 B     FALSE
      #3 C     FALSE
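
For readers who have not used `rle()` before, the following base R sketch walks through what `check_flag()` computes for one group. The names `status_b`, `runs`, `n_present_runs`, `present_run_lengths` and `flag_b` are illustrative only.

```r
# Status values of group B from the question
status_b <- c("absent", "present", "present", "absent", "absent")

# Run-length encoding of the logical vector: one entry per run of equal values
runs <- rle(status_b == "present")
runs$values
# [1] FALSE  TRUE FALSE
runs$lengths
# [1] 1 2 2

# Number of separate runs of "present"
n_present_runs <- sum(runs$values)
# Lengths of the "present" runs only
present_run_lengths <- runs$lengths[runs$values]

# Group B has a single "present" run of length 2, so it is not flagged
flag_b <- n_present_runs > 1 && all(present_run_lengths < 2)
flag_b
# [1] FALSE
```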
      

      To get a new column with 1/0 values, you can use

      x %>%  group_by(group) %>% mutate(flag = +check_flag(status))
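
The question also asks for a plain list of flagged groups. One possible way to get it, sketched here under the assumption that dplyr is installed, is to filter the summary and pull out the group names. The data frame is rebuilt inline so the snippet runs on its own; `flagged_groups` is the name used in the question.

```r
library(dplyr)

# Question data as a data frame
df <- data.frame(
  group  = c(rep("A", 4), rep("B", 5), rep("C", 4)),
  bin    = c(1:4, 1:5, 1:4),
  status = c("absent", "present", "absent", "present",
             "absent", "present", "present", "absent", "absent",
             "absent", "absent", "present", "present"),
  stringsAsFactors = FALSE
)

# Same helper as in the answer above
check_flag <- function(status) {
  with(rle(status == "present"), sum(values) > 1 && all(lengths[values] < 2))
}

flagged_groups <- df %>%
  group_by(group) %>%
  summarise(flag = check_flag(status)) %>%
  filter(flag) %>%   # keep only the flagged groups
  pull(group)        # extract the group names as a character vector

flagged_groups
# [1] "A"
```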
      

      【Comments】:
