How to create a new column with dplyr based on whether any of a subset of columns is NA: answers

【Question title】: How to create a new column based on if any of a subset of columns are NA with dplyr
【Posted】: 2021-08-29 17:43:27
【Question description】:

It feels like mutate_at or mutate(across(...)) should be able to do this, but I'm missing something...

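Suppose we have the following. I have included the expected output, desired, which is an indicator column based on whether any column whose name contains the word "test" has an NA value: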

library(tidyverse)

df <- tibble::tribble(
  ~id,    ~name, ~test_col, ~is_test, ~another_test, ~desired,
   1L, "mickey",        NA,      13L,           12L,       1L,
   2L, "donald",       19L,       NA,            NA,       1L,
   3L,  "daisy",       15L,      20L,           20L,       0L,
   4L,  "goofy",       18L,      14L,           10L,       0L,
   5L,  "pluto",       16L,      10L,            NA,       1L,
   6L, "minnie",       19L,      15L,           16L,       0L
  )

df
#> # A tibble: 6 x 6
#>      id name   test_col is_test another_test desired
#>   <int> <chr>     <int>   <int>        <int>   <int>
#> 1     1 mickey       NA      13           12       1
#> 2     2 donald       19      NA           NA       1
#> 3     3 daisy        15      20           20       0
#> 4     4 goofy        18      14           10       0
#> 5     5 pluto        16      10           NA       1
#> 6     6 minnie       19      15           16       0

In reality, though, we start without the desired column: df_start <- df %>% select(-desired).

I can successfully use filter_at to keep only the observations where one or more of the columns containing "test" is NA:

df_start %>% 
  filter_at(vars(contains("test")), any_vars(is.na(.)))
#> # A tibble: 3 x 5
#>      id name   test_col is_test another_test
#>   <int> <chr>     <int>   <int>        <int>
#> 1     1 mickey       NA      13           12
#> 2     2 donald       19      NA           NA
#> 3     5 pluto        16      10           NA
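
For readers on a current dplyr release: filter_at() and any_vars() are superseded. A minimal sketch of the same filter written with if_any(), assuming dplyr 1.0.4 or later (the data frame below simply rebuilds df_start from the example so the snippet runs on its own):

```r
library(dplyr)

# Rebuild the starting data from the question (without the desired column)
df_start <- tibble::tibble(
  id           = 1:6,
  name         = c("mickey", "donald", "daisy", "goofy", "pluto", "minnie"),
  test_col     = c(NA, 19L, 15L, 18L, 16L, 19L),
  is_test      = c(13L, NA, 20L, 14L, 10L, 15L),
  another_test = c(12L, NA, 20L, 10L, NA, 16L)
)

# Keep rows where at least one column whose name contains "test" is NA
na_rows <- df_start %>%
  filter(if_any(contains("test"), is.na))

na_rows$id
```

This keeps the same three rows (mickey, donald, pluto) as the filter_at() call.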

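I could save this subset and then use bind_rows, but I would like to create the desired column in a single pipeline. Again, it feels like this should be possible with mutate_at or mutate(across(...)), but I have not managed it yet.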

Question: how can I create the indicator column desired in a single pipeline with dplyr?

Created on 2021-08-29 by the reprex package (v2.0.0)

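【Question comments】:

To all: I would like to post a second answer here, because I have just solved a big problem of my own using across and I think it is worth sharing. Is that OK, or should I update my first answer instead? I read meta.stackexchange.com/questions/25209/… but I am not sure. Many thanks!

Tags: r dplyr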

【Solution 1】:

You could use

library(dplyr)

df %>% 
  mutate(desired = +if_any(contains("test"), is.na))

which returns

# A tibble: 6 x 6
     id name   test_col is_test another_test desired
  <int> <chr>     <int>   <int>        <int>   <int>
1     1 mickey       NA      13           12       1
2     2 donald       19      NA           NA       1
3     3 daisy        15      20           20       0
4     4 goofy        18      14           10       0
5     5 pluto        16      10           NA       1
6     6 minnie       19      15           16       0
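
The leading + is base R's unary plus: applied to a logical vector it returns an integer vector, so TRUE becomes 1 and FALSE becomes 0. A small sketch of that coercion, with as.integer() shown as a more explicit spelling. The toy data frame and the flag_* column names are made up for illustration, and dplyr 1.0.4 or later is assumed for if_any():

```r
library(dplyr)

# Toy data, assumed for illustration only
toy <- tibble::tibble(
  a_test = c(NA, 2L, 3L),
  b_test = c(1L, NA, 3L),
  other  = c(NA, NA, NA)   # name does not contain "test", so it is ignored
)

res <- toy %>%
  mutate(
    flag_lgl  = if_any(contains("test"), is.na),             # logical
    flag_plus = +if_any(contains("test"), is.na),            # integer via unary +
    flag_int  = as.integer(if_any(contains("test"), is.na))  # same, spelled out
  )

res
```

Both flag_plus and flag_int are integer columns with the same values; which spelling to use is a matter of taste.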

【Comments】:

【Solution 2】:

Here is one way we could do this with across:

Update: with the help of the brilliant Martin Gal, a short version:

df %>% 
    select(-desired) %>% 
    mutate(desired = +(rowSums(is.na(across(test_col:another_test))) > 0))

Long version :-)

library(dplyr)
df %>% select(-desired) %>% 
    mutate(across(test_col:another_test, ~ case_when(is.na(.) ~ 1, 
                                                     TRUE ~ 0), .names ="New_{.col}")) %>% 
    mutate(desired = ifelse(New_test_col == 1 | 
                            New_is_test == 1 |
                            New_another_test == 1, 1, 0), .keep="unused")

Output:

     id name   test_col is_test another_test desired
  <int> <chr>     <int>   <int>        <int>   <dbl>
1     1 mickey       NA      13           12       1
2     2 donald       19      NA           NA       1
3     3 daisy        15      20           20       0
4     4 goofy        18      14           10       0
5     5 pluto        16      10           NA       1
6     6 minnie       19      15           16       0
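
To see why the short version works, the expression can be unpacked step by step in base R. In this sketch, tests is an ordinary data frame standing in for what across(test_col:another_test) returns inside mutate(); the intermediate names na_matrix and na_count are made up for illustration:

```r
# Stand-in for the selected "test" columns (first three rows of the example)
tests <- data.frame(
  test_col     = c(NA, 19L, 15L),
  is_test      = c(13L, NA, 20L),
  another_test = c(12L, NA, 20L)
)

na_matrix <- is.na(tests)        # logical matrix, TRUE where a value is missing
na_count  <- rowSums(na_matrix)  # number of NAs per row: 1, 2, 0
desired   <- +(na_count > 0)     # 1 if the row has at least one NA, else 0

desired
```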

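【Comments】:

Uhrm... didn't you mean to type `mutate(desired = +(rowSums(is.na(across(test_col:another_test))) > 0))`? ;-)
I am still working on this case. Your answer is very nice. The + changes TRUE/FALSE to 1/0. What do you call this, putting + in front of if_any? Thanks for your reply, I will check the answer.
Thanks Martin, I updated my answer with your help!!!
+is.na(rowSums(across(test_col:another_test))) is nicer
@JasonAizkalns Inside across you can replace test_col:another_test with contains("test"). Other options include matches() (regular-expression matching), starts_with() and ends_with().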
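
The two suggestions in these comments can be tried side by side. A sketch, assuming the "test" columns are all numeric: the is.na(rowSums(...)) shortcut relies on NA propagating through addition, so it would not work for character columns. The column names by_count and by_sum are made up for illustration. In dplyr 1.1.0 and later, across() without a function is deprecated in favour of pick(); across() is kept here to match the answer.

```r
library(dplyr)

# Rebuild the starting data from the question so the snippet runs on its own
df_start <- tibble::tibble(
  id           = 1:6,
  name         = c("mickey", "donald", "daisy", "goofy", "pluto", "minnie"),
  test_col     = c(NA, 19L, 15L, 18L, 16L, 19L),
  is_test      = c(13L, NA, 20L, 14L, 10L, 15L),
  another_test = c(12L, NA, 20L, 10L, NA, 16L)
)

res <- df_start %>%
  mutate(
    # count the NAs per row, then flag rows with at least one
    by_count = +(rowSums(is.na(across(contains("test")))) > 0),
    # sum the values per row; any NA makes the whole sum NA
    by_sum   = +is.na(rowSums(across(contains("test"))))
  )

res
```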

【Solution 3】:

An option with map/reduce

library(dplyr)
library(purrr)

df %>%
  mutate(desired = map(select(., contains("test")), is.na) %>%
           reduce(`|`) %>%
           as.integer())

-output

# A tibble: 6 x 6
     id name   test_col is_test another_test desired
  <int> <chr>     <int>   <int>        <int>   <int>
1     1 mickey       NA      13           12       1
2     2 donald       19      NA           NA       1
3     3 daisy        15      20           20       0
4     4 goofy        18      14           10       0
5     5 pluto        16      10           NA       1
6     6 minnie       19      15           16       0
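
The idea here is to build one logical vector per column and then fold them together with an element-wise OR. A minimal sketch of that fold on a plain list (na_flags is made up for illustration and mimics is.na() applied to the first three rows), with base R's Reduce() shown for comparison:

```r
library(purrr)

# One logical vector per "test" column, TRUE where the value is missing
na_flags <- list(
  test_col     = c(TRUE, FALSE, FALSE),
  is_test      = c(FALSE, TRUE, FALSE),
  another_test = c(FALSE, TRUE, FALSE)
)

any_na      <- reduce(na_flags, `|`)  # purrr: ((col1 | col2) | col3)
any_na_base <- Reduce(`|`, na_flags)  # base R equivalent

as.integer(any_na)
```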

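【Comments】:

Always working magic, but I love this one! It teaches you to think about the problem in a different way.

【Solution 4】:

Interestingly, the same answers tend to come up for this kind of row-wise logical operation: Martin Gal's if_any(), TarJae's rowSums(across()), and akrun's map %>% reduce.

For completeness, another common favourite is the pmap approach: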

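library(dplyr)
library(purrr)

# pmap_int() returns an integer vector; plain pmap() would return a list column
output <- df %>%
  mutate(desired = pmap_int(across(contains("test")), ~ {
    vals <- c(...)        # all "test" values of the current row
    +any(is.na(vals))     # 1L if any of them is NA, otherwise 0L
  }))

# OR simply

output <- df %>%
  mutate(desired = pmap_int(across(contains("test")), ~ +any(is.na(c(...)))))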
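
One way to check that these approaches really do agree is to compute all four on the example data and compare the results. A sketch, assuming dplyr 1.0.4 or later and purrr; the v_* column names are made up for illustration, and pmap_int() is used so that the result is an integer vector rather than a list:

```r
library(dplyr)
library(purrr)

# Rebuild the starting data from the question so the snippet runs on its own
df_start <- tibble::tibble(
  id           = 1:6,
  name         = c("mickey", "donald", "daisy", "goofy", "pluto", "minnie"),
  test_col     = c(NA, 19L, 15L, 18L, 16L, 19L),
  is_test      = c(13L, NA, 20L, 14L, 10L, 15L),
  another_test = c(12L, NA, 20L, 10L, NA, 16L)
)

res <- df_start %>%
  mutate(
    v_if_any  = +if_any(contains("test"), is.na),
    v_rowsums = +(rowSums(is.na(across(contains("test")))) > 0),
    v_reduce  = as.integer(reduce(map(across(contains("test")), is.na), `|`)),
    v_pmap    = pmap_int(across(contains("test")), ~ +any(is.na(c(...))))
  )

# TRUE when every approach flags exactly the same rows
all(res$v_if_any == res$v_rowsums,
    res$v_if_any == res$v_reduce,
    res$v_if_any == res$v_pmap)
```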

【Comments】:

But it is actually Martin ;-)
Oh yes, my bad, a typo :-)