【Question title】: Filter/merge based on closest value
【Posted】: 2019-12-11 16:59:19
【Question】:

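In my data frame df, I want to find the time at which the turbidity measurement is closest to 0.7. I then want to use that time to filter the other two parameters, keeping for each one the row whose time is closest to it within each sample group.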


Example

Original data

print(df)

   sample   time parameter measurement
1   apple  0.000 turbidity       0.153
2   apple 13.805 turbidity       0.654
3   apple 16.586 turbidity       0.724 * Closest to 0.7
4   apple 25.354 turbidity       0.821
5   apple  0.000   glucose      34.100
6   apple 13.548   glucose      29.500
7   apple 17.254   glucose      17.300 ** Closest time when turbidity measurement is closest to 0.7
8   apple 24.893   glucose       4.100
9   apple  0.000  muconate       0.000
10  apple 13.412  muconate       3.500
11  apple 17.647  muconate       9.600 ** Closest time when turbidity measurement is closest to 0.7
12  apple 25.841  muconate      13.400
13 orange  0.000 turbidity       0.116
14 orange 12.655 turbidity       0.689 * Closest to 0.7
15 orange 14.214 turbidity       0.715
16 orange 32.687 turbidity       0.899
17 orange  0.000   glucose      35.600
18 orange 12.021   glucose      28.700 ** Closest time when turbidity measurement is closest to 0.7
19 orange 15.687   glucose      16.400
20 orange 33.641   glucose       3.700
21 orange  0.000  muconate       0.000
22 orange 13.365  muconate       3.200 ** Closest time when turbidity measurement is closest to 0.7
23 orange 18.259  muconate       8.500
24 orange 35.697  muconate      14.100

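Desired output

Keep the row where the turbidity value is closest to 0.7 and, within each sample, the row of every other parameter whose time is closest to that turbidity row's time.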

  sample    time parameter measurement
1 apple     16.6 turbidity       0.724
2 apple   17.254   glucose      17.300
3 apple   17.647  muconate       9.600
4 orange    12.7 turbidity       0.689
5 orange  12.021   glucose      28.700
6 orange  13.365  muconate       3.200

Failed attempt (returns only the turbidity rows)

df %>% group_by(sample) %>%
    filter(parameter == "turbidity") %>%
    slice(which.min(abs(measurement - 0.7))) 

  sample  time parameter measurement
1 apple   16.6 turbidity       0.724
2 orange  12.7 turbidity       0.689

【Question comments】:

    Tags: r filter merge dplyr


    【Solution 1】:

    Alternatively, using base R:

    # one data frame per sample
    df_list <- split(df, df$sample)
    turbidity_ref_pt <- 0.7
    do.call(rbind, lapply(df_list, function(x){
      # turbidity row whose measurement is closest to the reference point
      turb_row<- x[x$parameter=='turbidity', ][which.min(abs(x$measurement[x$parameter=='turbidity'] - turbidity_ref_pt)), ]
      # glucose and muconate rows whose time is closest to that turbidity row's time
      gluc_row <- x[x$parameter=='glucose', ][which.min(abs(x$time[x$parameter=='glucose']-turb_row$time)), ]
      muco_row <- x[x$parameter=='muconate', ][which.min(abs(x$time[x$parameter=='muconate']-turb_row$time)), ]
      rbind(turb_row, gluc_row, muco_row)
    }))
    
    #          sample   time parameter measurement
    #apple.3    apple 16.586 turbidity       0.724
    #apple.7    apple 17.254   glucose      17.300
    #apple.11   apple 17.647  muconate       9.600
    #orange.14 orange 12.655 turbidity       0.689
    #orange.18 orange 12.021   glucose      28.700
    #orange.22 orange 13.365  muconate       3.200
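
The base R answer above names glucose and muconate explicitly. As a rough sketch of how the same idea might be generalized to any set of parameters, the hypothetical helper below (`closest_rows` and its arguments are illustrative and not part of the original answer) splits by sample, finds the reference time, then splits by parameter. It assumes the reference parameter has no duplicated times within a sample, so the reference row is its own closest match.

```r
# Data rebuilt from the question so the example runs on its own.
df <- data.frame(
  sample = rep(c("apple", "orange"), each = 12),
  time = c(0, 13.805, 16.586, 25.354, 0, 13.548, 17.254, 24.893,
           0, 13.412, 17.647, 25.841, 0, 12.655, 14.214, 32.687,
           0, 12.021, 15.687, 33.641, 0, 13.365, 18.259, 35.697),
  parameter = rep(rep(c("turbidity", "glucose", "muconate"), each = 4), times = 2),
  measurement = c(0.153, 0.654, 0.724, 0.821, 34.1, 29.5, 17.3, 4.1,
                  0, 3.5, 9.6, 13.4, 0.116, 0.689, 0.715, 0.899,
                  35.6, 28.7, 16.4, 3.7, 0, 3.2, 8.5, 14.1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each sample, find the time at which `ref_param`
# is closest to `target`, then keep the row of every parameter whose time
# is closest to that reference time.
closest_rows <- function(df, ref_param = "turbidity", target = 0.7) {
  per_sample <- lapply(split(df, df$sample), function(x) {
    ref <- x[x$parameter == ref_param, ]
    ref_time <- ref$time[which.min(abs(ref$measurement - target))]
    per_param <- lapply(split(x, x$parameter), function(p) {
      p[which.min(abs(p$time - ref_time)), ]
    })
    do.call(rbind, per_param)
  })
  out <- do.call(rbind, per_sample)
  rownames(out) <- NULL  # drop the "apple.glucose"-style row names
  out
}

res <- closest_rows(df)
print(res)
```

Because `split()` orders groups alphabetically, the rows come back as glucose, muconate, turbidity within each sample rather than in the order shown in the question.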
    

    【Comments】:

      【Solution 2】:
      library(data.table)
      setDT(df)
      
      # get index of turbidity rows with measurement closest to 0.7
      turb_Is <- 
        df[parameter == 'turbidity', .I[which.min(abs(measurement - 0.7))], sample]$V1
      # join df with subset identified by turb_Is to identify turbidity time
      df[df[turb_Is], on = .(sample), turbtime := i.time]
      
      # select rows with lowest difference from turbtime in each (sample, parameter) group
      df[df[, .I[which.min(abs(time - turbtime))], .(sample, parameter)]$V1, -'turbtime']
      #    sample   time parameter measurement
      # 1:  apple 16.586 turbidity       0.724
      # 2:  apple 17.254   glucose      17.300
      # 3:  apple 17.647  muconate       9.600
      # 4: orange 12.655 turbidity       0.689
      # 5: orange 12.021   glucose      28.700
      # 6: orange 13.365  muconate       3.200
      

      The same idea with dplyr

      library(dplyr)

      df %>% 
        group_by(sample) %>%
        filter(parameter == "turbidity") %>%
        slice(which.min(abs(measurement - 0.7))) %>% 
        select(sample, time) %>% 
        # time.x is the turbidity reference time, time.y is each row's own time
        right_join(df, by = 'sample') %>% 
        group_by(sample, parameter) %>% 
        slice(which.min(abs(time.x - time.y))) %>% 
        select(-time.x) %>% 
        rename(time = time.y)
      
      # # A tibble: 6 x 4
      # # Groups:   sample, parameter [6]
      #   sample  time parameter measurement
      #   <chr>  <dbl> <chr>           <dbl>
      # 1 apple   17.3 glucose        17.3  
      # 2 apple   17.6 muconate        9.6  
      # 3 apple   16.6 turbidity       0.724
      # 4 orange  12.0 glucose        28.7  
      # 5 orange  13.4 muconate        3.2  
      # 6 orange  12.7 turbidity       0.689
      

      A simpler dplyr approach (same output)

      df %>% 
        group_by(sample) %>%
        group_modify(~{
          turb <- 
            filter(., parameter == 'turbidity') %>% 
              slice(which.min(abs(measurement - 0.7)))
          group_by(., parameter) %>% 
            slice(which.min(abs(time - turb$time)))
        })
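
A possible variant of the same two-step idea, sketched under the assumption that dplyr 1.0.0 or later is installed, uses `slice_min()` in place of `slice(which.min(...))`. The names `turb_ref` and `turb_time` are illustrative only. `with_ties = FALSE` is set so that exactly one row per group is kept, which mirrors the behavior of `which.min()`.

```r
library(dplyr)

# Data rebuilt from the question so the example runs on its own.
df <- data.frame(
  sample = rep(c("apple", "orange"), each = 12),
  time = c(0, 13.805, 16.586, 25.354, 0, 13.548, 17.254, 24.893,
           0, 13.412, 17.647, 25.841, 0, 12.655, 14.214, 32.687,
           0, 12.021, 15.687, 33.641, 0, 13.365, 18.259, 35.697),
  parameter = rep(rep(c("turbidity", "glucose", "muconate"), each = 4), times = 2),
  measurement = c(0.153, 0.654, 0.724, 0.821, 34.1, 29.5, 17.3, 4.1,
                  0, 3.5, 9.6, 13.4, 0.116, 0.689, 0.715, 0.899,
                  35.6, 28.7, 16.4, 3.7, 0, 3.2, 8.5, 14.1),
  stringsAsFactors = FALSE
)

# Step 1: per sample, the time of the turbidity reading closest to 0.7.
turb_ref <- df %>%
  filter(parameter == "turbidity") %>%
  group_by(sample) %>%
  slice_min(abs(measurement - 0.7), n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(sample, turb_time = time)

# Step 2: attach that reference time to every row, then keep the row
# closest to it within each (sample, parameter) group.
res <- df %>%
  inner_join(turb_ref, by = "sample") %>%
  group_by(sample, parameter) %>%
  slice_min(abs(time - turb_time), n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-turb_time)

print(res)
```

Keeping the reference times in a separate small table makes the intermediate result easy to inspect, at the cost of one extra join compared with the `group_modify()` version.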
      

      Data used

      structure(list(sample = c("apple", "apple", "apple", "apple", 
      "apple", "apple", "apple", "apple", "apple", "apple", "apple", 
      "apple", "orange", "orange", "orange", "orange", "orange", "orange", 
      "orange", "orange", "orange", "orange", "orange", "orange"), 
          time = c(0, 13.805, 16.586, 25.354, 0, 13.548, 17.254, 24.893, 
          0, 13.412, 17.647, 25.841, 0, 12.655, 14.214, 32.687, 0, 
          12.021, 15.687, 33.641, 0, 13.365, 18.259, 35.697), parameter = c("turbidity", 
          "turbidity", "turbidity", "turbidity", "glucose", "glucose", 
          "glucose", "glucose", "muconate", "muconate", "muconate", 
          "muconate", "turbidity", "turbidity", "turbidity", "turbidity", 
          "glucose", "glucose", "glucose", "glucose", "muconate", "muconate", 
          "muconate", "muconate"), measurement = c(0.153, 0.654, 0.724, 
          0.821, 34.1, 29.5, 17.3, 4.1, 0, 3.5, 9.6, 13.4, 0.116, 0.689, 
          0.715, 0.899, 35.6, 28.7, 16.4, 3.7, 0, 3.2, 8.5, 14.1)), row.names = c(NA, 
      -24L), class = "data.frame", index = structure(integer(0), "`__parameter`" = c(5L, 
      6L, 7L, 8L, 17L, 18L, 19L, 20L, 9L, 10L, 11L, 12L, 21L, 22L, 
      23L, 24L, 1L, 2L, 3L, 4L, 13L, 14L, 15L, 16L)))
      

      【Comments】:

      • Works great and keeps the dplyr pipeline! Thanks!