【Question title】: Filter dataframe based on a separate reference table
【Posted】: 2021-03-11 07:54:53
【Question】:

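I have data that needs to be cleaned programmatically using a reference table. In the reference table, each row relates to a different column in the data and specifies the values by which that data variable should be filtered.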

Example

Data

library(tidyverse)

my_mtcars <-
  mtcars %>% 
  rownames_to_column("cars")

Reference table

filter_ref_table <-
  structure(
  list(
    var_name = c(
      "disp",
      "wt",             
      "gear",          
      "carb",
      "mpg",
      "cars",           
      "drat"
    ),
    filtering_values = list(
      NULL,
      structure(
        list(
          min = 3.4,
          max = 3.9,
          values = list(NULL)
        ),
        class = c("tbl_df",
                  "tbl", "data.frame"),
        row.names = c(NA,-1L)
      ),
      structure(
        list(
          min = NA_integer_,
          max = NA_integer_,
          values = list(c(3))
        ),
        class = c("tbl_df",
                  "tbl", "data.frame"),
        row.names = c(NA,-1L)
      ),
      NULL,
      NULL,
      structure(
        list(
          min = NA_integer_,
          max = NA_integer_,
          values = list(c("Maserati Bora", "Chrysler Imperial", "Toyota Corona", "Merc 450SE", 
                          "Lincoln Continental", "Mazda RX4", "Valiant", "Hornet 4 Drive", 
                          "Fiat X1-9", "Camaro Z28", "Fiat 128", "Mazda RX4 Wag", "Datsun 710", 
                          "Merc 240D", "Duster 360"))
        ),
        class = c("tbl_df",
                  "tbl", "data.frame"),
        row.names = c(NA,-1L)
      ),
      NULL
    )
  ),
  row.names = c(NA,-7L),
  class = c("tbl_df",
            "tbl", "data.frame")
)

filter_ref_table

## # A tibble: 7 x 2
##   var_name filtering_values
##   <chr>    <list>          
## 1 disp     <NULL>          
## 2 wt       <tibble [1 x 3]>
## 3 gear     <tibble [1 x 3]>
## 4 carb     <NULL>          
## 5 mpg      <NULL>          
## 6 cars     <tibble [1 x 3]>
## 7 drat     <NULL>    

Taking a closer look at filter_ref_table, we can unnest the list-column filtering_values and see how it is structured internally: a nested tibble with 3 columns: min, max, and values.

filter_ref_table %>% 
  filter(var_name == "wt") %>% 
  unnest(filtering_values)

## # A tibble: 1 x 4
##   var_name   min   max values
##   <chr>    <dbl> <dbl> <list>
## 1 wt         3.4   3.9 <NULL> ## when there are min/max values we know we should filter by this range

##############################################################################

filter_ref_table %>% 
  filter(var_name == "cars") %>% 
  unnest(filtering_values)        
                                  

## # A tibble: 1 x 4
##   var_name   min   max values    
##   <chr>    <int> <int> <list>    
## 1 cars        NA    NA <chr [15]>   ## when there are values inside "value" we know that we should 
#                              ↑         ## filter to keep any data rows that have either of these values
#                              ↑ 
#   [1] "Maserati Bora"       "Chrysler Imperial"   "Toyota Corona"      
#   [4] "Merc 450SE"          "Lincoln Continental" "Mazda RX4"          
#   [7] "Valiant"             "Hornet 4 Drive"      "Fiat X1-9"          
#   [10] "Camaro Z28"          "Fiat 128"            "Mazda RX4 Wag"      
#   [13] "Datsun 710"          "Merc 240D"           "Duster 360"                  


#############################################################################################
filter_ref_table %>% 
  filter(var_name == "gear") %>% 
  unnest(filtering_values) %>%
  unnest(values)

## # A tibble: 1 x 4
##   var_name   min   max values
##   <chr>    <int> <int>  <dbl>
## 1 gear        NA    NA      3 

So based on filter_ref_table, we know that we need to filter the rows of my_mtcars like this:

expected_output <- 
  my_mtcars %>%
  filter(cars %in% c("Maserati Bora", "Chrysler Imperial", "Toyota Corona", "Merc 450SE", 
                     "Lincoln Continental", "Mazda RX4", "Valiant", "Hornet 4 Drive", 
                     "Fiat X1-9", "Camaro Z28", "Fiat 128", "Mazda RX4 Wag", "Datsun 710", 
                     "Merc 240D", "Duster 360")) %>%
  filter(gear == 3) %>%
  filter(between(wt, 3.4, 3.9))

> expected_output

##         cars  mpg cyl disp  hp drat   wt  qsec vs am gear carb
## 1    Valiant 18.1   6  225 105 2.76 3.46 20.22  1  0    3    1
## 2 Duster 360 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4
## 3 Camaro Z28 13.3   8  350 245 3.73 3.84 15.41  0  0    3    4

Bottom line, my question is: how can we programmatically filter my_mtcars, using only filter_ref_table, to end up with expected_output?

【Comments】:

    Tags: r filter dplyr


    【Solution 1】:

    Here is one possible solution:

    doFilter <- function(data, criteria) {
      retVal <- data
      for (var in criteria %>% pull(var_name)) {
        # Name the list-column explicitly; unnest() without cols is deprecated.
        # A NULL entry unnests to zero rows, leaving nothing to filter on.
        crit <- criteria %>% filter(var_name == var) %>% unnest(cols = filtering_values)
        minVal <- crit$min
        maxVal <- crit$max
        values <- crit$values
        # Lower bound, if one is given
        if (!is.null(minVal)) {
          if (!is.na(minVal)) retVal <- retVal %>% filter(get(var) >= minVal)
        }
        # Upper bound, if one is given
        if (!is.null(maxVal)) {
          if (!is.na(maxVal)) retVal <- retVal %>% filter(get(var) <= maxVal)
        }
        # Set of allowed values, if one is given
        if (!is.null(values[[1]])) {
          if (length(values[[1]]) > 0) retVal <- retVal %>% filter(get(var) %in% values[[1]])
        }
      }
      return(retVal)
    }
    
    my_mtcars %>% doFilter(filter_ref_table)
    

    This gives

            cars  mpg cyl disp  hp drat   wt  qsec vs am gear carb
    1    Valiant 18.1   6  225 105 2.76 3.46 20.22  1  0    3    1
    2 Duster 360 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4
    3 Camaro Z28 13.3   8  350 245 3.73 3.84 15.41  0  0    3    4
    

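    The key is to use get() to turn the character column name into an object, so that it works with the tidyverse's non-standard evaluation (NSE).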
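As a hedged aside that is not part of the original answer: dplyr also provides the `.data` pronoun, which looks a column up by its string name without searching the calling environment. A minimal sketch, assuming a reasonably recent dplyr and using the built-in mtcars data (the names `by_get` and `by_pronoun` are illustrative only):

```r
library(dplyr)

var <- "gear"  # column name held in a character variable

# get() resolves the name through R's usual lookup, starting in the data
by_get <- mtcars %>% filter(get(var) == 3)

# .data[[var]] only ever looks inside the data frame
by_pronoun <- mtcars %>% filter(.data[[var]] == 3)

# Both approaches select the same rows for this column
identical(by_get, by_pronoun)
```

The difference shows up when `var` names something that is not a column: `.data[[var]]` raises an error, whereas `get()` may pick up an object of that name from an enclosing environment.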

    As an aside, your use of NA, NULL, and zero-length lists to mean "do nothing" is a bit awkward.
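One way to tame the three different "do nothing" markers is to funnel them through a single check. The helper below is a hypothetical sketch (the name `has_criterion` does not appear in the original thread) that treats NULL, zero-length input, and all-NA input alike:

```r
# Hypothetical helper: TRUE only when x carries a usable criterion.
# NULL, zero-length vectors/lists and all-NA vectors count as "do nothing".
has_criterion <- function(x) {
  !is.null(x) && length(x) > 0 && !all(is.na(x))
}

has_criterion(NULL)              # no criterion
has_criterion(NA_integer_)       # no criterion
has_criterion(list())            # no criterion
has_criterion(3.4)               # a usable bound
has_criterion(c("Valiant", "Duster 360"))  # a usable set of values
```

With a check like this, each nested `if (!is.null(...)) { if (!is.na(...)) ... }` pair in a filtering function could collapse into a single condition.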

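    Correction and edit

    My original code above failed to filter on values. The fix was obvious and easy. My apologies.

    To answer the OP's question in the comments and expand on my last sentence...

    If your filter dataset looked like this: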

    carList <- c("Maserati Bora", "Chrysler Imperial", "Toyota Corona", "Merc 450SE", 
      "Lincoln Continental", "Mazda RX4", "Valiant", "Hornet 4 Drive", 
      "Fiat X1-9", "Camaro Z28", "Fiat 128", "Mazda RX4 Wag", "Datsun 710", 
      "Merc 240D", "Duster 360")
    anotherFilterTable <- tibble(
      var_name = c("disp", "wt", "gear", "carb", "mpg",        "cars", "drat"),
      value=     c(    NA,   NA,      3,     NA,    NA,            NA,     NA),
      min=       c(    NA,  3.4,     NA,     NA,    NA,            NA,     NA),
      max=       c(    NA,  3.9,     NA,     NA,    NA,            NA,     NA),
      choices=   c(    NA,   NA,     NA,     NA,    NA, list(carList),     NA)
    ) 
    
    anotherFilterTable
    # A tibble: 7 x 5
      var_name value   min   max choices   
      <chr>    <dbl> <dbl> <dbl> <list>    
    1 disp        NA  NA    NA   <lgl [1]> 
    2 wt          NA   3.4   3.9 <lgl [1]> 
    3 gear         3  NA    NA   <lgl [1]> 
    4 carb        NA  NA    NA   <lgl [1]> 
    5 mpg         NA  NA    NA   <lgl [1]> 
    6 cars        NA  NA    NA   <chr [15]>
    7 drat        NA  NA    NA   <lgl [1]> 
    

    then we have removed one level of nesting, and the doFilter function can become (this time filtering on value as well as the other criteria)...

    doFilter <- function(data, criteria) {
      retVal <- data
      for (var in criteria %>% pull(var_name)) {
        # One row of criteria per variable; NA means "no constraint"
        crit <- criteria %>% filter(var_name == var)
        if (!is.na(crit$value)) retVal <- retVal %>% filter(get(var) == crit$value)
        if (!is.na(crit$min)) retVal <- retVal %>% filter(get(var) >= crit$min)
        if (!is.na(crit$max)) retVal <- retVal %>% filter(get(var) <= crit$max)
        # choices is a list-column: is.na() is TRUE only for a lone NA element
        if (!is.na(crit$choices)) {
          retVal <- retVal %>% filter(get(var) %in% crit$choices[[1]])
        }
      }
      return(retVal)
    }
    

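    This is a little shorter and, in my opinion, easier to read.

    Both this solution and the OP's original problem statement implicitly assume a fixed set of possible filter criteria. (The OP's problem statement also assumes fixed column names.) To allow more flexibility, perhaps letting different criteria be applied to the same column in different datasets, something like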

    anotherFilterTable %>% 
      mutate(across(c(value, min, max), as.list)) %>% 
      pivot_longer(
        cols=c(value, min, max, choices),
        names_to="criterion",
        values_to="value"
      ) %>% 
      add_column(source="my_mtcars")
    # A tibble: 28 x 4
       var_name criterion value     source   
       <chr>    <chr>     <list>    <chr>    
     1 disp     value     <dbl [1]> my_mtcars
     2 disp     min       <dbl [1]> my_mtcars
     3 disp     max       <dbl [1]> my_mtcars
     4 disp     choices   <lgl [1]> my_mtcars
     5 wt       value     <dbl [1]> my_mtcars
     6 wt       min       <dbl [1]> my_mtcars
     7 wt       max       <dbl [1]> my_mtcars
     8 wt       choices   <lgl [1]> my_mtcars
     9 gear     value     <dbl [1]> my_mtcars
    10 gear     min       <dbl [1]> my_mtcars
    # … with 18 more rows
    

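    might work. doFilter() would need to be modified accordingly, of course. I think this format would also allow arbitrary filter criteria to be specified (for example, "only those rows whose mpg lies in the first quartile of mpg values") without having to modify the doFilter() function each time a new potential criterion is defined.

    As always, this is a trade-off between flexibility and complexity. The OP will need to decide where the sweet spot lies.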
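To make the "arbitrary criteria" idea concrete, here is a minimal sketch that stores one predicate function per row in a list-column. It is only one possible reading of the suggestion above: the table layout, the `predicate` column, and the `apply_criteria()` name are all assumptions made for illustration, not part of the original answer.

```r
library(tidyverse)

my_mtcars <- mtcars %>% rownames_to_column("cars")

# Hypothetical criteria table: each row pairs a column name with a
# predicate that maps the whole column to a logical vector.
criteria <- tibble(
  var_name  = c("gear", "wt", "mpg"),
  predicate = list(
    function(x) x == 3,
    function(x) between(x, 3.4, 3.9),
    function(x) x <= quantile(x, 0.25)   # first quartile of mpg
  )
)

apply_criteria <- function(data, criteria) {
  # Evaluate every predicate on the unfiltered column ...
  masks <- map2(criteria$var_name, criteria$predicate,
                function(var, pred) pred(data[[var]]))
  # ... then keep only the rows that satisfy all of them
  keep <- reduce(masks, `&`, .init = rep(TRUE, nrow(data)))
  data[keep, ]
}

result <- apply_criteria(my_mtcars, criteria)
result$cars
```

Note one design choice: every predicate sees the full, unfiltered column, so the quartile is computed over all 32 cars rather than over the rows that survive earlier criteria. A sequential loop like doFilter() would behave differently for data-dependent criteria such as this one.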

    【Comments】:
