Finding common elements in a list: answers

【Question title】: Finding common elements in a list
【Posted】: 2018-04-07 07:30:39
【Question description】:

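Suppose I have 3 character vectors. I want to run some evaluations on them, such as checking whether the elements of one vector are also found in the others. I don't know which vector is the shortest, so I want to work that out programmatically.

For example: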

a <- c('Name','Type')
b <- c('Name','Age','Meta')
c <- c('ID','Gender','Color')

l1 <- list(a,b,c)
#print(l1)
l2 <- sapply(l1,length)
#print(l2)

pos <- which(l2==min(l2))
shortest <- l1[pos]
#print(shortest)

a1 <- l1[!seq(1,3) %in% pos][1]
a2 <- l1[!seq(1,3) %in% pos][2]
#print(a1)
#print(a2)

shortest[[1]][sapply(shortest,function(x) !x %in% unlist(c(a1,a2)))[,1]]

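I want to find the elements of the shortest vector that are not found in the other two. In this example, I want to get 'Type' as the result. I also have trouble when two vectors share the minimum length (in this example the lengths are 2, 3, 3, but I also want to handle 2, 2, 3). I'd appreciate some help. I need to run this on 11,000 lists like l1, and my vectors have a minimum length of 20.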

【Comments on the question】:

Tags: r

【Solution 1】:

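One way is to form a data frame from the list elements, record the number of elements in each vector and the frequency of each word, and then filter for the smallest vector length and a word frequency of 1. This also captures cases where the same vector contains several unique words.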

library(tidyverse)
l1 %>% enframe() %>% unnest(value) %>%
  group_by(name) %>%
  mutate(list_n = n()) %>%
  ungroup() %>%
  group_by(value) %>%
  mutate(not_in = n()) %>%
  ungroup() %>%
  filter(list_n == min(list_n) & not_in == 1) %>%
  select(-list_n, -not_in)

# # A tibble: 1 x 2
#    name value
#   <int> <chr>
# 1     1 Type

【Comments】:

【Solution 2】:

Tidy your data into a vector of observations and a grouping variable, held together in a data.frame

df = data.frame(
    word = unlist(l1),
    group = rep(seq_along(l1), lengths(l1)),
    stringsAsFactors = FALSE
)

(lengths() is a more efficient way of doing sapply(x, length)).
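As a quick check of that equivalence, the sketch below rebuilds the l1 from the question and compares the two calls. It uses only base R; nothing here is specific to this answer's data frame.

```r
# Rebuild the example list from the question
l1 <- list(
    c('Name', 'Type'),
    c('Name', 'Age', 'Meta'),
    c('ID', 'Gender', 'Color')
)

# lengths() returns an integer vector with one length per list element
lengths(l1)

# sapply(l1, length) gives the same values, via one R-level call per element
identical(lengths(l1), sapply(l1, length))
```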

Manipulate the data to add the information you need: the length of each group and the count of each word

df = cbind(df,
    word_count = as.vector(table(df$word)[df$word]),
    group_length = tabulate(df$group)[df$group]
)

Order the rows first by word count, then by group length

df[order(df$word_count, df$group_length),]

The answer is the first row

> df[order(df$word_count, df$group_length),]
    word group word_count group_length
2   Type     1          1            2
4    Age     2          1            3
5   Meta     2          1            3
6     ID     3          1            3
7 Gender     3          1            3
8  Color     3          1            3
1   Name     1          2            2
3   Name     2          2            3
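Taking only the first row assumes there is a single answer. For the tie case raised in the question, one option is to filter the rows instead. The following is a minimal sketch, assuming the same column names as above; the fourth vector c('Name', 'Reason') is added here only to create a tie, and `res` is a name introduced for illustration. Note that `word_count == 1` means the word appears nowhere else at all, including in the other shortest vectors.

```r
# Example list with a tie for shortest (the fourth vector is an assumption
# added for illustration)
l1 <- list(
    c('Name', 'Type'),
    c('Name', 'Age', 'Meta'),
    c('ID', 'Gender', 'Color'),
    c('Name', 'Reason')
)

# Same tidy and manipulate steps as above
df <- data.frame(
    word = unlist(l1),
    group = rep(seq_along(l1), lengths(l1)),
    stringsAsFactors = FALSE
)
df$word_count <- as.vector(table(df$word)[df$word])
df$group_length <- tabulate(df$group)[df$group]

# Keep words that occur once overall and belong to a shortest group
res <- df[df$word_count == 1 & df$group_length == min(df$group_length), ]
res[, c("word", "group")]
```

With this input, the rows kept are 'Type' from group 1 and 'Reason' from group 4.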

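Model the data using whichever metrics you like; how to implement that depends on the model you want to use.

This is essentially the same as the answer by @hpesoj626, split into a "tidy" step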

tidy <- l1 %>% enframe() %>% unnest(value)

a "manipulate" step

manip <- tidy %>%
  group_by(name) %>% mutate(list_n = n()) %>% ungroup() %>%
  group_by(value) %>% mutate(not_in = n()) %>% ungroup()

and a "model" step

manip %>% filter(list_n == min(list_n) & not_in == 1) %>%
  select(-list_n, -not_in)

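【Comments】:

【Solution 3】:

Please see some modifications to the original post, including the addition of a vector 'd' that also has two elements and therefore ties with the original vector 'a' for shortest. If I understand your need correctly, then for each of the shortest vectors you want to return the elements that do not match any element of the other, non-shortest vectors (that is, in this example you don't want to compare 'a' and 'd' with each other, because both are shortest; you want to compare each of them against 'b' and 'c').

The solution below uses the setdiff() function to identify and return the differences. It also pools all the non-shortest vectors into a single vector of unique elements so the comparison is done once, rather than iterating over each non-shortest vector separately.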

a <- c('Name','Type')
b <- c('Name','Age','Meta')
c <- c('ID','Gender','Color')
d <- c('Name','Reason')

l1 <- list(a,b,c,d)
l2 <- sapply(l1,length)

pos <- which(l2==min(l2))
shortest <- l1[pos]

#All the lists that are not the shortest ones
not_shortest <- l1[-pos]

#Collapse all the lists we want to search through into a single vector of unique elements
all_not_shortest <- unique(unlist(not_shortest))

#Compare each shortest vector (here 'a' and 'd' tie for shortest) against the pooled elements of all the non-shortest vectors
lapply(shortest,setdiff,all_not_shortest)
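Since the question mentions running this over roughly 11,000 lists, the same steps can be wrapped in a function and applied with lapply(). This is only a sketch: `unique_in_shortest`, `many_lists`, and `results` are hypothetical names introduced here, and the two small lists stand in for the real data. If every vector in a list has the same length, the pool of longer vectors is empty and each vector's unique elements are returned unchanged.

```r
# Hypothetical helper: for one list of character vectors, return the elements
# of each shortest vector that appear in none of the longer vectors
unique_in_shortest <- function(vecs) {
    n <- lengths(vecs)
    is_shortest <- n == min(n)
    # Pool of unique elements from all vectors that are not the shortest
    pool <- unique(unlist(vecs[!is_shortest]))
    lapply(vecs[is_shortest], setdiff, pool)
}

# Two small example lists standing in for the many lists in the question
many_lists <- list(
    list(c('Name', 'Type'), c('Name', 'Age', 'Meta'), c('ID', 'Gender', 'Color')),
    list(c('Name', 'Type'), c('Name', 'Age', 'Meta'), c('ID', 'Gender', 'Color'),
         c('Name', 'Reason'))
)

# One result per input list; each result has one entry per shortest vector
results <- lapply(many_lists, unique_in_shortest)
results[[2]]
```

For the second list, where 'a' and 'd' tie, the result holds 'Type' and 'Reason'.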

【Comments】: