A better way to generate groups based on many rules in R (answers)

[Question title]: Better way for generating groups based on many rules in R
[Posted]: 2023-02-07 17:58:12
[Question]:

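I have a dataset with many columns, and each combination of row values determines, through a set of rules, the new value in another column. The combinations vary widely, and not every rule involves every column. In addition, some columns hold organism names, which tend to be very long. As a result, the approach I currently use (case_when) has become very messy, and reviewing the rules is tedious.

Is there a better way to do this that is cleaner and easier to review? The dataset I run this on has more than 70,000 observations, so below is a dummy dataset to work with.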

col1   col2   col3   col4     col5  col6
1      A      43     string1  AA    verylongnamehere
2      B      22     string2  BB    anotherlongname
3      C      15     string3  CC    yetanotherlongname
4      D      100    string4  DD    hereisanotherlongname
5      E      60     string5  EE    thisisthelastlongname


test <- data.frame(
  col1 = c(1,2,3,4,5),
  col2 = c("A","B","C","D","E"),
  col3 = c(43,22,15,100,60),
  col4 = c("string1","string2","string3","string4","string5"),
  col5 = c("AA","BB","CC","DD","EE"),
  col6 = c("verylongnamehere", "anotherlongname","yetanotherlongname","hereisanotherlongname","thisisthelastlongname")
)

The following code is an example of the rules and the code I use:

library(dplyr)

test2 <- test %>%
  mutate(new_col = case_when(
    col1 == 1 & col2 == "A" & col6 == "verylongnamehere" ~ "result1",
    col3 >= 60 & col5 == "DD" ~ "result2",
    col1 %in% c(2,3,4) & 
     col2 %in% c("B","D") & 
     col5 %in% c("BB","CC","DD") & 
     col6 %in% c("anotherlongname","yetanotherlongname") ~ "result3",
    TRUE ~ "result4"
  ))

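[Comments]:

This is usually tricky, and the solution depends on the situation. With only a few conditions, I would try to give each one a meaningful name, create a TRUE/NA column from it, and then build the new column with coalesce(). With many conditions, I would probably prefer to put the data in long format. A few questions: how big is the real data? How many conditions are there? Do you expect to change the rules or the number of conditions regularly? Can a row satisfy two conditions (they do not appear to be mutually exclusive), and if so, what is the intended result? Do you need to use dplyr rather than data.table?
case_when is probably your best option; with many conditions like those in your example, it is usually about as irreducible as what you have shown.

Tags: r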
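The first comment's idea can be sketched roughly as follows. This is only one interpretation: each named rule column holds its result label (rather than TRUE) where the rule matches and NA elsewhere, so that coalesce() can pick the first match. The column names rule_exact_match, rule_high_dd and rule_mid_group are made up for illustration.

```r
library(dplyr)

test <- data.frame(
  col1 = c(1, 2, 3, 4, 5),
  col2 = c("A", "B", "C", "D", "E"),
  col3 = c(43, 22, 15, 100, 60),
  col4 = c("string1", "string2", "string3", "string4", "string5"),
  col5 = c("AA", "BB", "CC", "DD", "EE"),
  col6 = c("verylongnamehere", "anotherlongname", "yetanotherlongname",
           "hereisanotherlongname", "thisisthelastlongname")
)

test2 <- test %>%
  mutate(
    # one descriptively named column per rule (names are hypothetical):
    # the result label where the rule matches, NA otherwise
    rule_exact_match = if_else(
      col1 == 1 & col2 == "A" & col6 == "verylongnamehere",
      "result1", NA_character_
    ),
    rule_high_dd = if_else(col3 >= 60 & col5 == "DD", "result2", NA_character_),
    rule_mid_group = if_else(
      col1 %in% c(2, 3, 4) & col2 %in% c("B", "D") &
        col5 %in% c("BB", "CC", "DD") &
        col6 %in% c("anotherlongname", "yetanotherlongname"),
      "result3", NA_character_
    ),
    # the first non-NA label wins, in the same order as the case_when() rules
    new_col = coalesce(rule_exact_match, rule_high_dd, rule_mid_group, "result4")
  )
```

Keeping the intermediate rule columns makes it possible to inspect which rule fired for a given row; they can be dropped afterwards with select().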

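[Solution 1]:

The conditions may be easier to review if they are kept in a spreadsheet. Here is how you could read them from one and build your case_when.

Spreadsheet representation (conditions.xlsx): note that == and %in% are treated as defaults and are not written out explicitly here.

Load the conditions: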

library(readxl)
cond <- read_excel('conditions.xlsx')

dput(cond):

structure(list(Result = c("result1", "result2", "result3", "result4"
), col1 = c("1", NA, "c(2, 3, 4)", NA), col2 = c("\"A\"", NA, 
"c(\"B\",\"D\")", NA), col3 = c(NA, ">= 60", NA, NA), col4 = c(NA, 
NA, NA, NA), col5 = c(NA, "\"DD\"", "c(\"BB\",\"CC\",\"DD\")", 
NA), col6 = c("\"verylongnamehere\"", NA, "c(\"anotherlongname\",\"yetanotherlongname\")", 
NA)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-4L))
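If no spreadsheet is at hand, the same conditions table can be typed in directly. The sketch below uses tibble::tribble() as a stand-in for the Excel file (an assumption for reproducibility, not part of the original answer); single-quoted R strings keep the inner double quotes readable.

```r
library(tibble)

# Row-wise layout that mirrors the spreadsheet: one row per result,
# one column per data column, NA where a rule does not use that column.
cond <- tribble(
  ~Result,   ~col1,        ~col2,        ~col3,   ~col4, ~col5,                ~col6,
  "result1", "1",          '"A"',        NA,      NA,    NA,                   '"verylongnamehere"',
  "result2", NA,           NA,           ">= 60", NA,    '"DD"',               NA,
  "result3", "c(2, 3, 4)", 'c("B","D")', NA,      NA,    'c("BB","CC","DD")',  'c("anotherlongname","yetanotherlongname")',
  "result4", NA,           NA,           NA,      NA,    NA,                   NA
)
```

The resulting cond can be passed through the processing steps in place of the read_excel() result.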

Process the conditions into a case_when command:

# separate conditions and results
results <- cond$Result
cond <- trimws(as.matrix(cond[, -1]))

# add default %in% operator for vectors
add.in <- grepl('^c\\(', cond)
cond[add.in] <- paste('%in%', cond[add.in])
# add default ==
add.equals <- grepl('^[^<>%!]', cond)
cond[add.equals] <- paste('==', cond[add.equals])

# add column names to conditions and join them together with ' & '
col.cond <- apply(cond, 1, function(x) {
  col.cond <- paste(colnames(cond), x)[!is.na(x)]
  paste(col.cond, collapse=' & ')
})
# put TRUE where no condition was given (default value)
col.cond[col.cond==''] <- 'TRUE'

# add results and join all together
case.when <- paste0(col.cond, ' ~ "', results, '"', collapse=',\n ')
# complete the case_when()
case.when <- paste('case_when(\n', case.when, '\n)')

case.when is your case_when command in string form:

cat(case.when)
# case_when(
#  col1 == 1 & col2 == "A" & col6 == "verylongnamehere" ~ "result1",
#  col3 >= 60 & col5 == "DD" ~ "result2",
#  col1 %in% c(2, 3, 4) & col2 %in% c("B","D") & col5 %in% c("BB","CC","DD") & col6 %in% c("anotherlongname","yetanotherlongname") ~ "result3",
#  TRUE ~ "result4" 
# )

Now we just parse it, evaluate it, and use it inside mutate:

test2 <- test %>% 
  mutate(new_col = eval(parse(text=case.when)))

#   col1 col2 col3    col4 col5                  col6 new_col
# 1    1    A   43 string1   AA      verylongnamehere result1
# 2    2    B   22 string2   BB       anotherlongname result3
# 3    3    C   15 string3   CC    yetanotherlongname result4
# 4    4    D  100 string4   DD hereisanotherlongname result2
# 5    5    E   60 string5   EE thisisthelastlongname result4

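Based on your example, I only considered conditions that use & as the logical operator. If | is used as well, you would have to add another spreadsheet column for each data column, specifying which logical operator (& or |) applies to that condition. For more complex conditions involving parentheses, this approach may not be feasible.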
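For rules that mix & and | or need parentheses, one possible workaround (not part of the answer above) is to store each complete condition as a single free-form string and splice the parsed rules into case_when(). The sketch below assumes the rlang package is available; the rules table, its column names condition and result, and the second rule itself are invented for illustration.

```r
library(dplyr)
library(rlang)

test <- data.frame(
  col1 = c(1, 2, 3, 4, 5),
  col2 = c("A", "B", "C", "D", "E"),
  col3 = c(43, 22, 15, 100, 60),
  col5 = c("AA", "BB", "CC", "DD", "EE")
)

# Hypothetical rules table: one full condition per row, in priority order
rules <- data.frame(
  condition = c(
    'col1 == 1 & col2 == "A"',
    '(col3 >= 60 & col5 == "DD") | col2 == "E"',
    'TRUE'
  ),
  result = c("result1", "result2", "result4")
)

# Turn each row into an unevaluated expression of the form: condition ~ "result"
rule_exprs <- lapply(
  paste0(rules$condition, ' ~ "', rules$result, '"'),
  parse_expr
)

# !!! splices the list of expressions into case_when() as separate arguments
test3 <- test %>%
  mutate(new_col = case_when(!!!rule_exprs))
```

As with eval(parse()), the condition strings are executed as code, so the rules table should come from a trusted source.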
