Create a yes/no column based on values in two other columns: answers

【Question title】: Create yes/no column based on values in two other columns
【Posted】: 2021-10-05 07:23:59
【Question】:

I have a dataset that looks like this:

df <- structure(list(ID = 1:10, Region1 = c("Europe", "NA", 
"Asia", "NA", "Europe", "NA", "Africa", "NA", "Europe", "North America"), Region2 = c("NA", "Europe", 
"NA", "NA", "NA", "Europe", 
"NA", "NA", "NA", "NA"
)), 
class = "data.frame", row.names = c(NA, -10L))

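I want to create a new column named EuropeYN whose value depends on whether either region column (Region1 or Region2) contains "Europe". The final data should look like this: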

df <- structure(list(ID = 1:10, Region1 = c("Europe", "NA", 
"Asia", "NA", "Europe", "NA", "Africa", "NA", "Europe", "North America"), Region2 = c("NA", "Europe", 
"NA", "NA", "NA", "Europe", 
"NA", "NA", "NA", "NA"
), EuropeYN = c("yes", "yes", "no", "no", "yes", "yes", "no", "no", "yes", "no")), 
class = "data.frame", row.names = c(NA, -10L))

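I know how to do this when checking whether "Europe" appears in a single column, but not when several columns have to be checked. With only one column, I would do this:

df$EuropeYN <- ifelse(grepl("Europe", df$Region1), "yes", "no")

Any ideas on the best way to approach this?...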

【Comments】:

ifelse(df$Region1 == "Europe" | df$Region2 == "Europe", "yes", "no")

Tags: r string if-statement stringr grepl

【Solution 1】:

We can use if_any here as a vectorized option in the tidyverse:

library(dplyr)
library(stringr)
df %>%
     mutate(YN = if_any(starts_with("Region"), str_detect, 'Europe'))
   ID       Region1 Region2    YN
1   1        Europe      NA  TRUE
2   2            NA  Europe  TRUE
3   3          Asia      NA FALSE
4   4            NA      NA FALSE
5   5        Europe      NA  TRUE
6   6            NA  Europe  TRUE
7   7        Africa      NA FALSE
8   8            NA      NA FALSE
9   9        Europe      NA  TRUE
10 10 North America      NA FALSE

Or in base R:

df$YN <-  Reduce(`|`, lapply(df[startsWith(names(df), 'Region')], 
        `%in%`, 'Europe'))

Note: a logical flag is easier to subset on than "yes"/"no" labels.
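
To make that note concrete, here is a minimal base R sketch, assuming `df` is the data frame from the question and reusing the column name `YN` from this answer. The names `region_cols`, `europe_rows` and `europe_rows_labels` are only illustrative. It shows the logical column used directly as a row filter, and how it could still be turned into the "yes"/"no" labels the question asked for.

```r
# Data from the question (the "NA" entries are literal strings, not missing values)
df <- data.frame(
  ID = 1:10,
  Region1 = c("Europe", "NA", "Asia", "NA", "Europe", "NA", "Africa", "NA",
              "Europe", "North America"),
  Region2 = c("NA", "Europe", "NA", "NA", "NA", "Europe", "NA", "NA", "NA", "NA"),
  stringsAsFactors = FALSE
)

# Logical flag, built the same way as the base R line above
region_cols <- startsWith(names(df), "Region")
df$YN <- Reduce(`|`, lapply(df[region_cols], `%in%`, "Europe"))

# A logical column can be used directly as a row index
europe_rows <- df[df$YN, ]

# With labels, the same filter needs an explicit comparison
df$EuropeYN <- ifelse(df$YN, "yes", "no")
europe_rows_labels <- df[df$EuropeYN == "yes", ]
```

Keeping the logical column and deriving the labels only for display is one possible design; storing just the labels works too, at the cost of the extra comparison.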

【Discussion】:

【Solution 2】:

A bit late, but maybe still worth a look:

library(dplyr)
library(stringr)
df %>%
  rowwise() %>%
  mutate(YN = +any(str_detect(c_across(Region1:Region2), 'Europe')))
# A tibble: 10 x 4
# Rowwise: 
      ID Region1       Region2    YN
   <int> <chr>         <chr>   <int>
 1     1 Europe        NA          1
 2     2 NA            Europe      1
 3     3 Asia          NA          0
 4     4 NA            NA          0
 5     5 Europe        NA          1
 6     6 NA            Europe      1
 7     7 Africa        NA          0
 8     8 NA            NA          0
 9     9 Europe        NA          1
10    10 North America NA          0

Or, without the +, which keeps YN logical instead of converting it to integer:

df %>%
   rowwise() %>%
   mutate(YN = any(str_detect(c_across(Region1:Region2), 'Europe')))
# A tibble: 10 x 4
# Rowwise: 
      ID Region1       Region2 YN   
   <int> <chr>         <chr>   <lgl>
 1     1 Europe        NA      TRUE 
 2     2 NA            Europe  TRUE 
 3     3 Asia          NA      FALSE
 4     4 NA            NA      FALSE
 5     5 Europe        NA      TRUE 
 6     6 NA            Europe  TRUE 
 7     7 Africa        NA      FALSE
 8     8 NA            NA      FALSE
 9     9 Europe        NA      TRUE 
10    10 North America NA      FALSE

If you want the mutate to cover more columns, you can select them with starts_with (or contains, or ends_with):

df %>%
  rowwise() %>%
  mutate(YN = any(str_detect(c_across(starts_with('R')), 'Europe')))

【Discussion】:

【Solution 3】:

Here is a vectorized base R way.

i <- rowSums(df[grep("Region", names(df))] == "Europe") > 0
df$EuropeYN <- c("no", "yes")[i + 1L]
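
The second line relies on `i + 1L` turning FALSE into 1 and TRUE into 2, which then indexes into `c("no", "yes")`. One caveat, stated as an assumption about real data rather than about the example: in the question the "NA" entries are strings, so the comparison never yields a missing value. If the columns held true `NA`s, `rowSums` would return `NA` for those rows unless `na.rm = TRUE` is passed. A small sketch with a hypothetical data frame `df_na`:

```r
# Hypothetical variant of the question's data with real missing values
df_na <- data.frame(
  ID = 1:4,
  Region1 = c("Europe", NA, "Asia", NA),
  Region2 = c(NA, "Europe", NA, NA),
  stringsAsFactors = FALSE
)

region_cols <- grep("Region", names(df_na))

# Comparing NA with "Europe" gives NA, so drop NAs when counting matches
i <- rowSums(df_na[region_cols] == "Europe", na.rm = TRUE) > 0

# FALSE + 1L is 1 ("no"), TRUE + 1L is 2 ("yes")
df_na$EuropeYN <- c("no", "yes")[i + 1L]
```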

【Discussion】:

【Solution 4】:

Two ways:

Check each of the two columns literally:

ifelse(df$Region1 == "Europe" | df$Region2 == "Europe", "yes", "no")
#  [1] "yes" "yes" "no"  "no"  "yes" "yes" "no"  "no"  "yes" "no"

The advantage is that it is (subjectively) easier to read and very explicit.

Select a range of columns and test for equality:

subset(df, select = Region1:Region2) == "Europe"
#    Region1 Region2
# 1     TRUE   FALSE
# 2    FALSE    TRUE
# 3    FALSE   FALSE
# 4    FALSE   FALSE
# 5     TRUE   FALSE
# 6    FALSE    TRUE
# 7    FALSE   FALSE
# 8    FALSE   FALSE
# 9     TRUE   FALSE
# 10   FALSE   FALSE

apply(subset(df, select = Region1:Region2) == "Europe", 1, any)
#     1     2     3     4     5     6     7     8     9    10 
#  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE

This works with one column or with many.

Either result can be assigned back into the frame with df$EuropeYN <- ...
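
As a sketch of that last step, assuming `df` is the data frame from the question: the logical result of the `apply` variant is converted to labels and stored as the new column. The intermediate name `has_europe` is only illustrative.

```r
# Data from the question (the "NA" entries are literal strings)
df <- data.frame(
  ID = 1:10,
  Region1 = c("Europe", "NA", "Asia", "NA", "Europe", "NA", "Africa", "NA",
              "Europe", "North America"),
  Region2 = c("NA", "Europe", "NA", "NA", "NA", "Europe", "NA", "NA", "NA", "NA"),
  stringsAsFactors = FALSE
)

# Row-wise check across the Region1:Region2 range; unname() drops the row names
has_europe <- unname(
  apply(subset(df, select = Region1:Region2) == "Europe", 1, any)
)

# Store as "yes"/"no", as requested in the question
df$EuropeYN <- ifelse(has_europe, "yes", "no")
```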

【Discussion】:

【Solution 5】:

My approach is very similar to yours:

dplyr::mutate(df, EuropeYN = ifelse((Region1 == "Europe" | Region2 == "Europe"), "yes", "no"))
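
One difference from the `grepl` attempt in the question may be worth keeping in mind: `==` tests for an exact match, while `grepl` matches substrings. Both agree on the example data, but they could differ on a value such as "Eastern Europe", which is a hypothetical value and not part of the original data. A small sketch:

```r
# Hypothetical values, not taken from the question's data
region <- c("Europe", "Eastern Europe", "Asia")

# Exact comparison: only the first element matches
exact_match <- region == "Europe"

# Substring search: the first two elements match
partial_match <- grepl("Europe", region, fixed = TRUE)
```

Which behaviour is wanted depends on how the region columns are filled in.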

【Discussion】:

Thanks! This is similar to what others had suggested before you posted :)
People reply fast :(