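【Posted】: 2021-03-11 07:54:53
【Question】:
I have data that I need to clean programmatically using a reference table. In the reference table, each row corresponds to a different column in the data and specifies the values by which that data variable should be filtered.
Example
Data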
library(tidyverse)
my_mtcars <-
  mtcars %>%
  rownames_to_column("cars")
Reference table
filter_ref_table <-
structure(
list(
var_name = c(
"disp",
"wt",
"gear",
"carb",
"mpg",
"cars",
"drat"
),
filtering_values = list(
NULL,
structure(
list(
min = 3.4,
max = 3.9,
values = list(NULL)
),
class = c("tbl_df",
"tbl", "data.frame"),
row.names = c(NA,-1L)
),
structure(
list(
min = NA_integer_,
max = NA_integer_,
values = list(c(3))
),
class = c("tbl_df",
"tbl", "data.frame"),
row.names = c(NA,-1L)
),
NULL,
NULL,
structure(
list(
min = NA_integer_,
max = NA_integer_,
values = list(c("Maserati Bora", "Chrysler Imperial", "Toyota Corona", "Merc 450SE",
"Lincoln Continental", "Mazda RX4", "Valiant", "Hornet 4 Drive",
"Fiat X1-9", "Camaro Z28", "Fiat 128", "Mazda RX4 Wag", "Datsun 710",
"Merc 240D", "Duster 360"))
),
class = c("tbl_df",
"tbl", "data.frame"),
row.names = c(NA,-1L)
),
NULL
)
),
row.names = c(NA,-7L),
class = c("tbl_df",
"tbl", "data.frame")
)
filter_ref_table
## # A tibble: 7 x 2
## var_name filtering_values
## <chr> <list>
## 1 disp <NULL>
## 2 wt <tibble [1 x 3]>
## 3 gear <tibble [1 x 3]>
## 4 carb <NULL>
## 5 mpg <NULL>
## 6 cars <tibble [1 x 3]>
## 7 drat <NULL>
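Looking more closely at filter_ref_table, we can unnest the list-column filtering_values and see how it is built internally: each non-NULL entry is a nested tibble with 3 columns: min, max, and values.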
filter_ref_table %>%
  filter(var_name == "wt") %>%
  unnest(filtering_values)
## # A tibble: 1 x 4
## var_name min max values
## <chr> <dbl> <dbl> <list>
## 1 wt 3.4 3.9 <NULL> ## when there are min/max values we know we should filter by this range
##############################################################################
filter_ref_table %>%
  filter(var_name == "cars") %>%
  unnest(filtering_values)
## # A tibble: 1 x 4
## var_name min max values
## <chr> <int> <int> <list>
## 1 cars NA NA <chr [15]> ## when "values" is populated, we know that we should
# ↑ ## filter to keep only the data rows that match any of these values
# ↑
# [1] "Maserati Bora" "Chrysler Imperial" "Toyota Corona"
# [4] "Merc 450SE" "Lincoln Continental" "Mazda RX4"
# [7] "Valiant" "Hornet 4 Drive" "Fiat X1-9"
# [10] "Camaro Z28" "Fiat 128" "Mazda RX4 Wag"
# [13] "Datsun 710" "Merc 240D" "Duster 360"
#############################################################################################
filter_ref_table %>%
  filter(var_name == "gear") %>%
  unnest(filtering_values) %>%
  unnest(values)
## # A tibble: 1 x 4
## var_name min max values
## <chr> <int> <int> <dbl>
## 1 gear NA NA 3
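To make the three cases above explicit (no rule, a min/max range, or a set of allowed values), here is a minimal sketch that labels the rule type of each row. The object ref_small is a compact stand-in I built for illustration (same shape as filter_ref_table, fewer rows, shortened car list), and rule_type is a hypothetical helper name, not part of any package.

```r
library(tidyverse)

# Compact stand-in for filter_ref_table: same shape, fewer rows, shortened car list
# (an illustrative assumption, not the original object).
ref_small <- tibble(
  var_name = c("disp", "wt", "gear", "cars"),
  filtering_values = list(
    NULL,
    tibble(min = 3.4, max = 3.9, values = list(NULL)),
    tibble(min = NA_integer_, max = NA_integer_, values = list(c(3))),
    tibble(min = NA_integer_, max = NA_integer_, values = list(c("Valiant", "Duster 360")))
  )
)

# Hypothetical helper: label the kind of rule stored in one filtering_values entry.
rule_type <- function(fv) {
  if (is.null(fv)) return("none")                        # no rule for this variable
  if (!is.null(fv$values[[1]])) return("set")            # keep rows matching the listed values
  if (!is.na(fv$min) && !is.na(fv$max)) return("range")  # keep rows inside [min, max]
  "none"
}

rules <- ref_small %>%
  mutate(rule = map_chr(filtering_values, rule_type))

rules$rule
## [1] "none"  "range" "set"   "set"
```

Checking values before min/max is a design choice in this sketch; it assumes a row never carries both kinds of rule at once, which holds for the table shown here.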
So, based on filter_ref_table, we know that we need to filter the rows of my_mtcars like this:
expected_output <-
  my_mtcars %>%
  filter(cars %in% c("Maserati Bora", "Chrysler Imperial", "Toyota Corona", "Merc 450SE",
                     "Lincoln Continental", "Mazda RX4", "Valiant", "Hornet 4 Drive",
                     "Fiat X1-9", "Camaro Z28", "Fiat 128", "Mazda RX4 Wag", "Datsun 710",
                     "Merc 240D", "Duster 360")) %>%
  filter(gear == 3) %>%
  filter(between(wt, 3.4, 3.9))
> expected_output
## cars mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Valiant 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
## 2 Duster 360 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
## 3 Camaro Z28 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
Bottom line, my question is: how can we filter my_mtcars programmatically, using only filter_ref_table, so that we end up with expected_output?
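For reference, one possible direction is to turn each reference row into a logical mask over the data and combine the masks with a logical AND. This is only a sketch under assumptions: ref_table is a compact rebuild of the non-NULL rules from filter_ref_table plus one NULL row, and keep_rows and apply_ref_table are hypothetical helper names. It also assumes the data contains no missing values in the filtered columns.

```r
library(tidyverse)

my_mtcars <- mtcars %>% rownames_to_column("cars")

# Compact rebuild of the reference table (assumption: same rules as filter_ref_table).
ref_table <- tibble(
  var_name = c("disp", "wt", "gear", "cars"),
  filtering_values = list(
    NULL,
    tibble(min = 3.4, max = 3.9, values = list(NULL)),
    tibble(min = NA_integer_, max = NA_integer_, values = list(c(3))),
    tibble(min = NA_integer_, max = NA_integer_, values = list(c(
      "Maserati Bora", "Chrysler Imperial", "Toyota Corona", "Merc 450SE",
      "Lincoln Continental", "Mazda RX4", "Valiant", "Hornet 4 Drive",
      "Fiat X1-9", "Camaro Z28", "Fiat 128", "Mazda RX4 Wag", "Datsun 710",
      "Merc 240D", "Duster 360")))
  )
)

# Hypothetical helper: logical vector marking the elements of x that pass one rule.
keep_rows <- function(x, fv) {
  if (is.null(fv)) return(rep(TRUE, length(x)))   # no rule: keep everything
  vals <- fv$values[[1]]
  if (!is.null(vals)) return(x %in% vals)         # set rule: keep listed values
  if (!is.na(fv$min) && !is.na(fv$max)) {
    return(x >= fv$min & x <= fv$max)             # range rule: keep [min, max]
  }
  rep(TRUE, length(x))
}

# Hypothetical helper: build one mask per reference row, then AND them together.
apply_ref_table <- function(data, ref) {
  masks <- map2(ref$var_name, ref$filtering_values, ~ keep_rows(data[[.x]], .y))
  filter(data, reduce(masks, `&`, .init = rep(TRUE, nrow(data))))
}

result <- apply_ref_table(my_mtcars, ref_table)
result$cars
## [1] "Valiant"    "Duster 360" "Camaro Z28"
```

The .init argument keeps the sketch from failing when the reference table has no rows, in which case the data comes back unfiltered.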
【Comments】: