Find the value of a column in a data frame in R: answers

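【Question title】: Find the value of a column over a set of columns in a data frame in R
【Posted】: 2021-06-07 15:45:18
【Question】:

I am struggling to find a way to look up the value of one column across other columns of a data.frame. I would be grateful if someone could help me. This is a simplified form of my data: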

library(data.table)

df <- data.table(personid = c(101, 102, 103, 104, 105, 201, 202, 203, 301, 302, 401),
       hh_id = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4),
       fatherid = c(NA, NA, 101, 101, 101, NA, NA, 201, NA, NA, NA),
       fatherid_1 = c(NA, 101, 101, 101, NA, NA, 201, NA, NA, NA, NA),
       fatherid_2 = c(101, 101, 101, NA, NA, 201, NA, NA, NA, NA, NA),
       fatherid_3 = c(101, 101, NA, NA, NA, NA, NA, NA, NA, NA, NA),
       fatherid_4 = c(101, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
       fatherid_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA))

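(The real data have 185,000 rows and up to 17 such variables: fatherid_1, fatherid_2, ..., fatherid_17.)

What I want to do is create a variable that checks whether the value of personid in a given row equals any of the values of fatherid_1 to fatherid_5 in the same row. For the data above, the result should be: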

df$result <- c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)

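But I need something that does this automatically, over 17 columns like fatherid_1 and over a lot of rows.

In case you want to know the purpose of my calculation: I am trying to build household grids without relying only on the information in the same row.

Thank you very much!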

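【Comments on the question】:

Thank you very much! This is exactly what I needed. Thanks, everyone! However, now I wonder whether there is any way to do this without the repeated "fatherid_1, 2, 3" variables, i.e. by "looking for the value of personid" among the different "fatherid" values of the members of the same household (hh_id). Is this possible? Thanks a lot.
Maybe you should ask a new question for this new topic.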
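
The follow-up raised in the comments above (a single fatherid column, matched against the other members of the same household) might be approached roughly as follows. This is only a sketch in base R under stated assumptions: the data frame `people`, the helper vectors `person_key` and `father_key`, and the output column `is_father` are names invented here for illustration, and the data simply reuse the personid, hh_id and fatherid columns from the question.

```r
# Hypothetical long-format data: one fatherid per person (values reused from the question).
people <- data.frame(
  personid = c(101, 102, 103, 104, 105, 201, 202, 203, 301, 302, 401),
  hh_id    = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4),
  fatherid = c(NA, NA, 101, 101, 101, NA, NA, 201, NA, NA, NA)
)

# Build "household/person" keys so that a match only counts within the same hh_id.
person_key <- paste(people$hh_id, people$personid, sep = "/")

# Keys of everyone who is named as a father by some household member (NAs dropped).
known      <- !is.na(people$fatherid)
father_key <- paste(people$hh_id[known], people$fatherid[known], sep = "/")

# 1 if this person appears as somebody's fatherid in their own household, else 0.
people$is_father <- as.integer(person_key %in% father_key)
print(people$is_father)
# 1 0 0 0 0 1 0 0 0 0 0
```

Pasting hh_id into the key is one simple way to restrict the match to a household; a grouped operation by hh_id would be an alternative.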

Tags: r

【Solution 1】:

Two tidyverse solutions:

1-) You can use dplyr's new if_any() with an == comparison, together with tidyr's replace_na(). if_any() needs neither rowwise() nor reduce()/Reduce():

library(dplyr)
library(tidyr)

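# if_any() gives NA for rows without a match that contain missing values; map NA to FALSE, then to 0/1
df %>% mutate(result = as.integer(replace_na(if_any(matches('fatherid'), ~ . == personid), FALSE)))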

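2-) Inside a rowwise() operation, you can apply a function that checks the condition for all selected columns with map(), c_across() and %in%, which yields one logical value per column. These can then be collapsed with reduce() in the same call.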

library(purrr)
library(dplyr)

df%>%rowwise()%>%mutate(result=as.integer(reduce(map(c_across(fatherid_1:fatherid_5), ~. %in% personid), `|`)))

Or, for clarity, with pipes:

#option 1
df%>%rowwise()%>%
        mutate(result=map(c_across(fatherid_1:fatherid_5), ~. %in% personid)%>%
                       reduce(`|`)%>%
                       as.integer())
#option 2
df%>%rowwise()%>%
        mutate(result=map_int(c_across(fatherid_1:fatherid_5), ~. %in% personid)%>%
                       reduce(max))

    personid hh_id fatherid fatherid_1 fatherid_2 fatherid_3 fatherid_4 fatherid_5 result
 1:      101     1       NA         NA        101        101        101         NA      1
 2:      102     1       NA        101        101        101         NA         NA      0
 3:      103     1      101        101        101         NA         NA         NA      0
 4:      104     1      101        101         NA         NA         NA         NA      0
 5:      105     1      101         NA         NA         NA         NA         NA      0
 6:      201     2       NA         NA        201         NA         NA         NA      1
 7:      202     2       NA        201         NA         NA         NA         NA      0
 8:      203     2      201         NA         NA         NA         NA         NA      0
 9:      301     3       NA         NA         NA         NA         NA         NA      0
10:      302     3       NA         NA         NA         NA         NA         NA      0
11:      401     4       NA         NA         NA         NA         NA         NA      0
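
The replace_na() step in the first option is there because comparing against NA does not give FALSE. The base R sketch below illustrates that behaviour on a tiny made-up example (the vectors `f1`, `f2` and the names `hits` and `any_hit` are hypothetical, not part of the answer); it does not call dplyr, it only shows the NA logic that such a step has to deal with.

```r
# Toy data: four people and two hypothetical father-id columns.
personid <- c(101, 102, 103, 301)
f1 <- c(NA, 101, 101, NA)
f2 <- c(101, NA, 101, NA)

# Element-wise comparison: any comparison involving NA yields NA, not FALSE.
hits <- cbind(f1 == personid, f2 == personid)

# Row-wise "any": TRUE wins over NA, but FALSE combined with NA stays NA.
any_hit <- apply(hits, 1, any)
print(any_hit)
# TRUE NA FALSE NA

# Treat "unknown" as "no match" and convert to 0/1.
any_hit[is.na(any_hit)] <- FALSE
result <- as.integer(any_hit)
print(result)
# 1 0 0 0
```

Without the replacement step, rows 2 and 4 would end up as NA rather than 0.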

【Comments】:

A brilliant solution, my friend.

【Solution 2】:

We can also use the following solution with pmap from the purrr package:

library(dplyr)
library(purrr)

df %>%
  mutate(result = pmap_dbl(., ~ {x <- c(...)[-c(1, 2)]; 
  if_else(all(x[!is.na(x)] != c(...)[1]) | all(is.na(x)), 0, 1)}))


    personid hh_id fatherid fatherid_1 fatherid_2 fatherid_3 fatherid_4 fatherid_5 result
 1:      101     1       NA         NA        101        101        101         NA      1
 2:      102     1       NA        101        101        101         NA         NA      0
 3:      103     1      101        101        101         NA         NA         NA      0
 4:      104     1      101        101         NA         NA         NA         NA      0
 5:      105     1      101         NA         NA         NA         NA         NA      0
 6:      201     2       NA         NA        201         NA         NA         NA      1
 7:      202     2       NA        201         NA         NA         NA         NA      0
 8:      203     2      201         NA         NA         NA         NA         NA      0
 9:      301     3       NA         NA         NA         NA         NA         NA      0
10:      302     3       NA         NA         NA         NA         NA         NA      0
11:      401     4       NA         NA         NA         NA         NA         NA      0

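【Comments】:

:) The master of pmap
Hahaha, thank you, master of accumulate and reduce.
Hahaha, a pmap master indeed.
@GuedesBF, since I am usually late to every question, pmap is usually the only option left, so I got this reputation, lol. But I can't deny that I really like purrr. I noticed that you are very good with purrr.

【Solution 3】:

If you do not want to use rowwise, this also works as an alternative: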

library(dplyr)

df %>% group_by(personid) %>%
  mutate(res = sum(cur_group() %in% cur_data()))

# A tibble: 11 x 9
# Groups:   personid [11]
   personid hh_id fatherid fatherid_1 fatherid_2 fatherid_3 fatherid_4 fatherid_5   res
      <dbl> <dbl>    <dbl>      <dbl>      <dbl>      <dbl>      <dbl> <lgl>      <int>
 1      101     1       NA         NA        101        101        101 NA             1
 2      102     1       NA        101        101        101         NA NA             0
 3      103     1      101        101        101         NA         NA NA             0
 4      104     1      101        101         NA         NA         NA NA             0
 5      105     1      101         NA         NA         NA         NA NA             0
 6      201     2       NA         NA        201         NA         NA NA             1
 7      202     2       NA        201         NA         NA         NA NA             0
 8      203     2      201         NA         NA         NA         NA NA             0
 9      301     3       NA         NA         NA         NA         NA NA             0
10      302     3       NA         NA         NA         NA         NA NA             0
11      401     4       NA         NA         NA         NA         NA NA             0

Created on 2021-06-09 by the reprex package (v2.0.0)

If you want to be safe and exclude hh_id, you can use

df %>% group_by(personid) %>%
  mutate(res = sum(cur_group() %in% cur_data()[-1]))

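【Comments】:

Nice rowwise-free solution. We must admit, however, that in this case grouping by row (rowwise()) is not much different from grouping by personid, since each row presumably has a unique personid.

【Solution 4】:

In base R, a workaround is to compare with == and test whether rowSums is > 0: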

+(rowSums(df[[1]] == df[,3:8], na.rm=TRUE) > 0)
# [1] 1 0 0 0 0 1 0 0 0 0 0

Or use any with apply:

+apply(df[[1]] == df[,3:8], 1, any, na.rm = TRUE)
# [1] 1 0 0 0 0 1 0 0 0 0 0

Or the same with pipes:

(df[[1]] == df[,3:8]) |> rowSums(na.rm=TRUE) |> (`>`)(0) |> as.integer()

(df[[1]] == df[,3:8]) |> apply(1, any, na.rm=TRUE) |> as.integer()
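
Since the question asks for something that scales to 17 fatherid_* columns, the positional index 3:8 could be replaced by a name-based selection. The following is a minimal sketch in base R, assuming a plain data.frame rather than a data.table and assuming the columns of interest are exactly those named fatherid_ followed by digits; the variable names `father_cols` and `hits` are made up here.

```r
# The question's example data as a plain data.frame.
df <- data.frame(
  personid   = c(101, 102, 103, 104, 105, 201, 202, 203, 301, 302, 401),
  hh_id      = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4),
  fatherid   = c(NA, NA, 101, 101, 101, NA, NA, 201, NA, NA, NA),
  fatherid_1 = c(NA, 101, 101, 101, NA, NA, 201, NA, NA, NA, NA),
  fatherid_2 = c(101, 101, 101, NA, NA, 201, NA, NA, NA, NA, NA),
  fatherid_3 = c(101, 101, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  fatherid_4 = c(101, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  fatherid_5 = rep(NA, 11)
)

# Pick fatherid_1, fatherid_2, ... by name, however many there are.
father_cols <- grep("^fatherid_[0-9]+$", names(df), value = TRUE)

# Compare every selected column with personid; the result is a logical matrix.
hits <- df[father_cols] == df$personid

# A row matches if at least one comparison is TRUE (NAs are ignored).
df$result <- as.integer(rowSums(hits, na.rm = TRUE) > 0)
print(df$result)
# 1 0 0 0 0 1 0 0 0 0 0
```

Note that this pattern leaves out the unsuffixed fatherid column, whereas 3:8 includes it; for the example data both give the same result.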

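【Comments】:

【Solution 5】:

The OP's dataset is a data.table object, so we can use data.table methods: loop over the "fatherid" columns, check whether "personid" equals the column value, and Reduce the results to a single vector.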

library(data.table)
df[, result  := +(Reduce(`|`, lapply(.SD, function(x) 
      x == personid & !is.na(x)))), .SDcols = patterns('fatherid')]

-output

df
    personid hh_id fatherid fatherid_1 fatherid_2 fatherid_3 fatherid_4 fatherid_5 result
 1:      101     1       NA         NA        101        101        101         NA      1
 2:      102     1       NA        101        101        101         NA         NA      0
 3:      103     1      101        101        101         NA         NA         NA      0
 4:      104     1      101        101         NA         NA         NA         NA      0
 5:      105     1      101         NA         NA         NA         NA         NA      0
 6:      201     2       NA         NA        201         NA         NA         NA      1
 7:      202     2       NA        201         NA         NA         NA         NA      0
 8:      203     2      201         NA         NA         NA         NA         NA      0
 9:      301     3       NA         NA         NA         NA         NA         NA      0
10:      302     3       NA         NA         NA         NA         NA         NA      0
11:      401     4       NA         NA         NA         NA         NA         NA      0
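
The same fold can be tried without data.table to see what Reduce does with the per-column results. This is a small base R sketch on made-up toy data (the three-row data frame and the name `per_column` are assumptions for illustration only), not the answer's own code.

```r
# Toy data with two hypothetical father-id columns.
df <- data.frame(
  personid   = c(101, 102, 201),
  fatherid_1 = c(NA, 101, NA),
  fatherid_2 = c(101, 101, 201)
)

# One logical vector per column; the is.na() guard turns NA comparisons into FALSE.
per_column <- lapply(
  df[c("fatherid_1", "fatherid_2")],
  function(x) !is.na(x) & x == df$personid
)

# Reduce() combines the vectors pairwise with `|`, giving one value per row.
result <- as.integer(Reduce(`|`, per_column))
print(result)
# 1 0 1
```

Because `|` is vectorised, the fold works on whole columns at a time, so there is no explicit loop over rows.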

【Comments】: