ifelse statement to assign values to a new column, working with lists of numeric values: answers

【Question title】: ifelse statement to assign values to a new column, working with lists of numeric values
【Posted】: 2022-01-05 16:09:57
【Question description】:

I have a dataframe that looks like this:

# Minimal example dataframe

identifier <- c(
  "A",
  "B",
  "C",
  "D",
  "E",
  "F"
)

value_1 <- c(
  "1231811, 1231877",
  "1231911, 1233069, 1232767",
  "1231919",
  NA,
  "1232135, 1233145",
  NA
)

value_2 <- c(
  1231811,
  190477,
  922661,
  950711,
  992647,
  NA
  
)

value_3 <- c(
  1231877,
  1233069,
  9774041,
  9774041,
  1314063,
  1231379
  
)

test_df <- data.frame(identifier, value_1, value_2, value_3)

  identifier                   value_1 value_2 value_3
1          A          1231811, 1231877 1231811 1231877
2          B 1231911, 1233069, 1232767  190477 1233069
3          C                   1231919  922661 9774041
4          D                      <NA>  950711 9774041
5          E          1232135, 1233145  992647 1314063
6          F                      <NA>    <NA> 1231379

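I would like to create a new column, "final_value", and fill it with a single value from value_1, value_2, or value_3. The hierarchy gives priority to a value_1 value that matches the value in value_2, then to one that matches value_3. If value_1 is not NA and none of its values match value_2 or value_3, I would like to fill final_value with the first value in the comma-separated value_1 string. If value_1 is NA, fill final_value with value_2 or, if that is also NA, with value_3. The final dataframe would look like this: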

  identifier                   value_1 value_2 value_3 final_value
1          A          1231811, 1231877 1231811 1231877 1231811 # 1231811 from value_1 matches value_2 (preferred match)
2          B 1231911, 1233069, 1232767  190477 1233069 1233069 # no values from value_1 match value_2; however, 1233069 from value_1 matches value_3
3          C                   1231919  922661 9774041 1231919 # no values from value_1 match other columns; just fill with value_1
4          D                      <NA>  950711 9774041 950711  # value_1 is NA, so fill in with value_2
5          E          1232135, 1233145  992647 1314063 1232135 # no values from value_1 match other columns, fill with first item from value_1 list
6          F                      <NA>    <NA> 1231379 1231379 # value_1 and value_2 are NA, so fill in with value_3

Here is my approach so far:

library(purrr)
library(dplyr)
library(stringr)  # provides str_split(), which is used below

# change value_1 column into a list of numeric values 
test_df <- test_df%>% mutate(value_1 = map(value_1,function(x) (as.numeric(unlist(str_split(x,","))))))

# create a new column to hold the final selected value
test_df$final_value <- NA

# ifelse statement
test_df$final_value <- 
  
  # if any of the elements in value_1 match the value_2 value, fill the new column with value_2
  ifelse(!is.na(test_df$value_1) & test_df$value_1 %in% test_df$value_2, test_df$value_2,
         
         # otherwise, if a value in value_1 matches value_3, fill in with value_3
         ifelse(!is.na(test_df$value_1) & test_df$value_1 %in% test_df$value_3, test_df$value_3,
                
                # if none of the values in value_1 match the other columns, fill in with the first value_1 list value
                ifelse(!is.na(test_df$value_1) & !(test_df$value_1 %in% test_df$value_2) & !(test_df$value_1 %in% test_df$value_3), test_df$value_1, #NOTE: have tried test_df$value_1[1] and test_df$value_1[[1]] without success to get the first list item returned
                       
                       # if value_1 is NA, fill in with value_2
                       ifelse(is.na(test_df$value_1) & !is.na(test_df$value_2), test_df$value_2,
                              
                              # if value_1 is NA and value_2 is NA, fill in with value_3
                              ifelse(is.na(test_df$value_1) & is.na(test_df$value_2) & !is.na(test_df$value_3), test_df$value_3, NA
         
         
  )))))

The result has several problems:

  identifier                   value_1 value_2 value_3               final_value
1          A          1231811, 1231877 1231811 1314063          1231811, 1231877
2          B 1231911, 1233069, 1232767  190477 1233069 1231911, 1233069, 1232767
3          C                   1231919  922661 9774041                   1231919
4          D                        NA  950711 9774041                    950711
5          E          1232135, 1233145  992647 1314063          1232135, 1233145
6          F                        NA      NA 1231379                   1231379

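The first three branches of the ifelse do not work as intended. It fails to return the matching value_2 or value_3 value in final_value, and I also cannot get it to return the first list item from value_1 when there is no matching value_2 or value_3 value. For the latter, I tried specifying test_df$value_1[[1]][1] (and similar), but this only returns the first item of the value_1 list for identifier A: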

  identifier                   value_1 value_2 value_3 final_value
1          A          1231811, 1231877 1231811 1314063     1231811
2          B 1231911, 1233069, 1232767  190477 1233069     1231811
3          C                   1231919  922661 9774041     1231811
4          D                        NA  950711 9774041      950711
5          E          1232135, 1233145  992647 1314063     1231811
6          F                        NA      NA 1231379     1231379

Any help would be greatly appreciated.

【Question discussion】:

Tags: r string-matching

【Solution 1】:

First, nesting ifelse more than 2 levels deep would normally lead me to suggest case_when. In this case, however, I think there is a better solution that needs neither:

func <- function(A, ...) {
  if (length(A) == 1L && is.na(A)) {
    if (length(list(...))) na.omit(unlist(list(...)))[1] else NA
  } else {
    L <- lapply(list(...), intersect, x = A)
    L <- c(L[lengths(L) > 0], A)
    L[[1]][1]
  }
}

library(dplyr)
test_df %>%
  mutate(
    final_value = mapply(func, strsplit(value_1, "[, ]+"), value_2, value_3)
  )
#   identifier                   value_1 value_2 value_3 final_value
# 1          A          1231811, 1231877 1231811 1231877     1231811
# 2          B 1231911, 1233069, 1232767  190477 1233069     1233069
# 3          C                   1231919  922661 9774041     1231919
# 4          D                      <NA>  950711 9774041      950711
# 5          E          1232135, 1233145  992647 1314063     1232135
# 6          F                      <NA>      NA 1231379     1231379
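
One detail worth checking in the call above: strsplit returns character vectors while value_2 and value_3 are numeric, so mapply simplifies the mixed results to a character column. The sketch below repeats func so that it runs on its own, uses base R only, and converts the split pieces to numeric before matching. The names pieces, as_is and as_numeric are introduced here purely for illustration.

```r
# Same helper as in the answer, repeated so this snippet is self-contained.
func <- function(A, ...) {
  if (length(A) == 1L && is.na(A)) {
    if (length(list(...))) na.omit(unlist(list(...)))[1] else NA
  } else {
    L <- lapply(list(...), intersect, x = A)
    L <- c(L[lengths(L) > 0], A)
    L[[1]][1]
  }
}

value_1 <- c("1231811, 1231877", "1231911, 1233069, 1232767", "1231919",
             NA, "1232135, 1233145", NA)
value_2 <- c(1231811, 190477, 922661, 950711, 992647, NA)
value_3 <- c(1231877, 1233069, 9774041, 9774041, 1314063, 1231379)

# As in the answer: character pieces matched against numeric columns.
as_is <- mapply(func, strsplit(value_1, "[, ]+"), value_2, value_3)

# Illustrative variant: convert every piece to numeric first.
pieces     <- lapply(strsplit(value_1, "[, ]+"), as.numeric)
as_numeric <- mapply(func, pieces, value_2, value_3)

is.character(as_is)      # the mixed results collapse to character
is.numeric(as_numeric)   # all-numeric inputs keep the column numeric
```

Converting first also avoids relying on how numbers are turned into text during matching, which can matter for values that R prints in scientific notation.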

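Because I use ... in func, it handles "zero or more" other value_* variables as needed; whether you have 3 or 30 of them, it applies the same logic. Also, the order within ... matters: those listed earlier take priority when matching.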
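
As a rough illustration of both points, the snippet below repeats func so that it runs on its own and calls it with zero, one, and two extra arguments. The vector a is row B's value_1 written out as numbers; the candidate values in the last two calls are chosen only to show the effect of argument order.

```r
# Same helper as in the answer, repeated so this snippet is self-contained.
func <- function(A, ...) {
  if (length(A) == 1L && is.na(A)) {
    if (length(list(...))) na.omit(unlist(list(...)))[1] else NA
  } else {
    L <- lapply(list(...), intersect, x = A)
    L <- c(L[lengths(L) > 0], A)
    L[[1]][1]
  }
}

a <- c(1231911, 1233069, 1232767)  # row B's value_1, already split and numeric

no_extras  <- func(a)                    # nothing to match: first element of a
one_extra  <- func(a, 1233069)           # the single candidate matches
two_extras <- func(a, 1232767, 1233069)  # both match; the earlier argument wins
swapped    <- func(a, 1233069, 1232767)  # same candidates, opposite order

# When A is NA, the first non-NA candidate is used instead.
na_case <- func(NA, NA, 1231379)
```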

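c(L[lengths(L) > 0], A) ensures that (1) we only consider the value_* with non-empty intersections (the first part), and (2) if all of those intersections are empty, we fall back to what is found in A. (In case A is NA and nothing usable is found in the value_* arguments, then ... you get NA.)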
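
The same steps can be traced by hand outside func. This is a minimal sketch using rows B and E of the example data, with value_1 assumed to be already split and converted to numeric; the names others, hits, n_hits, picked_b and picked_e are made up for the walk-through.

```r
# Row B from the example, with value_1 already split and converted to numeric.
A      <- c(1231911, 1233069, 1232767)
others <- list(190477, 1233069)           # value_2, value_3 in priority order

hits   <- lapply(others, intersect, x = A)  # one intersection per candidate
n_hits <- lengths(hits)                     # 0 for value_2, 1 for value_3

# Keep only the non-empty intersections, then append A as the fallback.
# c() on a list and an atomic vector adds each element of A as its own item.
L <- c(hits[n_hits > 0], A)
picked_b <- L[[1]][1]                       # the value_3 match

# Row E: nothing matches, so the list holds only the elements of A.
A_e    <- c(1232135, 1233145)
hits_e <- lapply(list(992647, 1314063), intersect, x = A_e)
picked_e <- c(hits_e[lengths(hits_e) > 0], A_e)[[1]][1]  # first element of A_e
```

Because the elements of A are appended after the surviving intersections, L[[1]] is a match whenever one exists and the first element of A otherwise.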

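FYI, one of the internal steps uses strsplit to split the comma-separated string of numbers into a list column. If you are going to do more operations like this that need to work on the individual components, you may prefer to keep it that way with mutate(value_1 = strsplit(value_1, "[ ,]+")) (or similar).

【Discussion】:

Perhaps a word on L <- c(L[lengths(L) > 0], A) would help explain the flow in this excellent solution.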
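
A minimal sketch of that idea, assuming dplyr is installed and using a three-row subset of the example data. The columns n_values and first_value are hypothetical additions that only show the list column being reused.

```r
library(dplyr)

small_df <- data.frame(
  identifier = c("A", "B", "D"),
  value_1 = c("1231811, 1231877", "1231911, 1233069, 1232767", NA),
  stringsAsFactors = FALSE
)

split_df <- small_df %>%
  mutate(
    value_1     = strsplit(value_1, "[ ,]+"),   # list column of character vectors
    value_1     = lapply(value_1, as.numeric),  # optional: make each piece numeric
    n_values    = lengths(value_1),             # pieces per row (an NA row counts as 1)
    first_value = vapply(value_1, function(v) v[1], numeric(1))
  )
```

With value_1 stored this way, later steps can work on the individual numbers without splitting the string again.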