Find duplicates with grouped variables: answers

【Question title】: Find duplicates with grouped variables
【Posted】: 2019-09-05 11:01:11
【Question description】:

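I have a df that looks like this:

from  to  group
1     2   metro
2     4   metro
3     4   metro
4     5   train
6     1   train
8     7   train

I want to find the ids that exist in more than one group.

The expected result for the sample df is 1 and 4, because they occur in both the metro and the train group.

I guess this can be done with dplyr and duplicated, but I don't know how to handle multiple columns while still distinguishing the grouping variable.

Thanks in advance!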


Tags: r duplicates identify

【Solution 1】:

Using base R, we can split the values of the first two columns by group and use intersect to find the values the groups have in common.

Reduce(intersect, split(unlist(df[1:2]), df$group))
#[1] 1 4
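
To make the one-liner easier to follow, here is a minimal step-by-step sketch on the question's sample data. The intermediate names (ids, groups, by_group, n_groups) are illustrative and not part of the original answer. Note that Reduce(intersect, ...) keeps only values present in every group; with exactly two groups, as here, that is the same as "more than one group". With three or more groups the two differ, so the last lines show one possible base R variant that counts distinct groups per id instead.

```r
# Sample data from the question (named df, as in the answer above)
df <- data.frame(
  from  = c(1L, 2L, 3L, 4L, 6L, 8L),
  to    = c(2L, 4L, 4L, 5L, 1L, 7L),
  group = c("metro", "metro", "metro", "train", "train", "train")
)

# Stack 'from' and 'to' into a single vector of ids (all 'from', then all 'to')
ids <- unlist(df[1:2], use.names = FALSE)

# Repeat the group labels so every id keeps its label.
# split() would recycle df$group on its own; rep() just makes that explicit.
groups <- rep(df$group, 2)

# One vector of ids per group
by_group <- split(ids, groups)

# Values common to all groups
in_all <- Reduce(intersect, by_group)

# Variant for "more than one group" when there are 3+ groups:
# count the distinct groups each id appears in
n_groups <- tapply(groups, ids, function(g) length(unique(g)))
in_several <- as.integer(names(n_groups)[n_groups > 1])
```

On the sample data both in_all and in_several contain 1 and 4.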

【Comments】:

I like the use of Reduce. Nice, +1!

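【Solution 2】:

We gather the 'from' and 'to' columns into 'long' format, group by 'val', filter the groups that have more than one unique 'group' value, and then pull the unique 'val' elements.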

library(dplyr)
library(tidyr)
df1 %>% 
   gather(key, val, from:to) %>% 
   group_by(val) %>% 
   filter(n_distinct(group) > 1) %>%
   distinct(val) %>%
   pull(val)
#[1] 1 4
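
In current tidyr releases gather() is superseded by pivot_longer(). The following is a sketch of the same pipeline using pivot_longer(), assuming tidyr 1.0.0 or later. The column names key and val are kept from the answer above, and the result name res is only illustrative.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df1 <- data.frame(
  from  = c(1L, 2L, 3L, 4L, 6L, 8L),
  to    = c(2L, 4L, 4L, 5L, 1L, 7L),
  group = c("metro", "metro", "metro", "train", "train", "train")
)

res <- df1 %>%
  # reshape 'from' and 'to' into one 'val' column (long format)
  pivot_longer(cols = c(from, to), names_to = "key", values_to = "val") %>%
  group_by(val) %>%
  # keep ids that occur in more than one distinct group
  filter(n_distinct(group) > 1) %>%
  distinct(val) %>%
  pull(val)

sort(res)
```

On the sample data this should again give the ids 1 and 4.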

Or, using base R, we can get the frequencies with table and extract the ids from that.

out <-  with(df1, colSums(table(rep(group, 2), unlist(df1[1:2])) > 0)) > 1
names(which(out))
#[1] "1" "4"
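
To show what the table-based one-liner computes, here is a short sketch that builds the intermediate objects one at a time. The names tab, present and n_groups are illustrative only.

```r
# Sample data from the question
df1 <- data.frame(
  from  = c(1L, 2L, 3L, 4L, 6L, 8L),
  to    = c(2L, 4L, 4L, 5L, 1L, 7L),
  group = c("metro", "metro", "metro", "train", "train", "train")
)

# Contingency table: rows = group, columns = id, cells = number of occurrences
tab <- table(rep(df1$group, 2), unlist(df1[1:2]))

# TRUE where an id occurs at least once in a group
present <- tab > 0

# Number of groups each id occurs in
n_groups <- colSums(present)

# Ids found in more than one group (as character, taken from the column names)
dup_ids <- names(which(n_groups > 1))
```

Because the ids come back as column names they are character strings; wrap the result in as.integer() if numeric ids are needed.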

Data

df1 <- structure(list(from = c(1L, 2L, 3L, 4L, 6L, 8L), to = c(2L, 4L, 
 4L, 5L, 1L, 7L), group = c("metro", "metro", "metro", "train", 
 "train", "train")), class = "data.frame", row.names = c(NA, -6L
 ))


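【Solution 3】:

Use data.table to convert the data to long format and count the unique values. melt does the conversion to long format, and data.table's df1[i, j, by] syntax lets us filter in the i part, group in the by part, and pull the values in the j part.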

library(data.table)
library(magrittr)
setDT(df1)

melt(df1, 'group') %>% 
  .[, .(n = uniqueN(group)), value] %>% 
  .[n > 1, unique(value)]

# [1] 1 4
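
The same steps can also be chained with data.table's own [ ][ ] syntax, without magrittr. This is a sketch under two assumptions that differ from the answer above: it works on a copy made with as.data.table() instead of calling setDT(), so df1 stays a plain data.frame (on a data.table, df1[1:2] selects rows rather than columns, which would break the base R snippets in this thread), and it names the measure columns explicitly. The names dt, long and res are illustrative.

```r
library(data.table)

# Sample data from the question
df1 <- data.frame(
  from  = c(1L, 2L, 3L, 4L, 6L, 8L),
  to    = c(2L, 4L, 4L, 5L, 1L, 7L),
  group = c("metro", "metro", "metro", "train", "train", "train")
)

# Work on a copy so df1 itself is not converted by reference
dt <- as.data.table(df1)

# Long format: one row per (group, id) occurrence, ids in column 'value'
long <- melt(dt, id.vars = "group", measure.vars = c("from", "to"))

# by: group by id; j: count distinct groups; then i: keep ids with n > 1
res <- long[, .(n = uniqueN(group)), by = value][n > 1, value]

sort(res)
```

On the sample data the result should be the ids 1 and 4.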
