Find most common combination between pairs: answers

【Question title】: Find most common combination between pairs
【Posted】: 2018-02-12 20:46:01
【Question description】:

I have a list of events and the guests who attended them. It looks like this, but the real file is much larger:

event       guests
birthday    John Doe
birthday    Jane Doe
birthday    Mark White
wedding     John Doe
wedding     Jane Doe
wedding     Matthew Green
bar mitzvah Janet Black
bar mitzvah John Doe
bar mitzvah Jane Doe
bar mitzvah William Hill
retirement  Janet Black
retirement  Matthew Green

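I want to find the most common pair of guests, that is, the two guests who attended the most events together. In this example the answer should be John Doe and Jane Doe, because they both attended the same three events. The output should be a list of these pairs.

Where do I even start?

【Question comments】: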

stackoverflow.com/help/someone-answers

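【Solution 1】:

A slightly different approach, from a social network / matrix algebra perspective:

Your data describe links between individuals through shared membership. This is an affiliation matrix, and we can compute the matrix of connections between individuals $i$ and $j$ as follows: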

# Load as a data frame
df <- data.frame(event = c(rep("birthday", 3), 
                           rep("wedding", 3), 
                           rep("bar mitzvah", 4), 
                           rep("retirement", 2)), 
                  guests = c("John Doe", "Jane Doe", "Mark White", 
                             "John Doe", "Jane Doe", "Matthew Green",   
                              "Janet Black", "John Doe", "Jane Doe",
                              "William Hill", "Janet Black", "Matthew Green"))

# You can represent who attended which event as a matrix
M <- table(df$guests, df$event)
# Now we can compute how many times each individual appeared at an
# event with another with a simple matrix product
admat <- M %*% t(M)
admat


#               Jane Doe Janet Black John Doe Mark White Matthew Green William Hill
# Jane Doe             3           1        3          1             1            1
# Janet Black          1           2        1          0             1            1
# John Doe             3           1        3          1             1            1
# Mark White           1           0        1          1             0            0
# Matthew Green        1           1        1          0             2            0
# William Hill         1           1        1          0             0            1
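
To see why this product counts shared events, write $M_{ik} = 1$ if guest $i$ attended event $k$ and $0$ otherwise. This assumes each guest is listed at most once per event; the symbol $A$ for the product is my own notation, not something from the code.

```latex
$$
A_{ij} = (M M^{\top})_{ij} = \sum_{k} M_{ik} M_{jk}
$$
```

Each term $M_{ik} M_{jk}$ equals 1 only when both $i$ and $j$ attended event $k$, so $A_{ij}$ is the number of events the two guests share. For example, Jane Doe and John Doe share the birthday, the wedding and the bar mitzvah, giving $A_{ij} = 1 + 1 + 1 + 0 = 3$.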

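Now we want to get rid of the diagonal of the matrix (which tells us how many events each person attended) and of one of the two triangles of the matrix, which holds redundant information.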

diag(admat) <- 0
admat[upper.tri(admat)] <- 0

Now we just want to convert this to a format you might prefer. I will use the melt function from the reshape2 library.

library(reshape2)
dfmatches <- unique(melt(admat))
# Drop all the zero matches
dfmatches <- dfmatches[dfmatches$value !=0,]
# order it descending
dfmatches <- dfmatches[order(-dfmatches$value),]
dfmatches

#            Var1        Var2 value
#3       John Doe    Jane Doe     3
#2    Janet Black    Jane Doe     1
#4     Mark White    Jane Doe     1
#5  Matthew Green    Jane Doe     1
#6   William Hill    Jane Doe     1
#9       John Doe Janet Black     1
#11 Matthew Green Janet Black     1
#12  William Hill Janet Black     1
#16    Mark White    John Doe     1
#17 Matthew Green    John Doe     1
#18  William Hill    John Doe     1

Obviously you can tidy up the output by renaming the variables of interest, and so on.
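
If reshape2 is not available, the same table can probably be built in base R. The following is only a sketch under that assumption: the column names guest1, guest2 and shared_events are my own choice, and the last line keeps every pair tied for the maximum rather than only the first one.

```r
df <- data.frame(
  event = rep(c("birthday", "wedding", "bar mitzvah", "retirement"),
              times = c(3, 3, 4, 2)),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe",
             "William Hill", "Janet Black", "Matthew Green"),
  stringsAsFactors = FALSE
)

# Guests x events attendance table and guest x guest co-attendance counts
M <- table(df$guests, df$event)
admat <- M %*% t(M)

# Positions in the lower triangle with at least one shared event
idx <- which(lower.tri(admat) & admat > 0, arr.ind = TRUE)

pairs <- data.frame(
  guest1 = rownames(admat)[idx[, 1]],
  guest2 = colnames(admat)[idx[, 2]],
  shared_events = admat[idx],
  stringsAsFactors = FALSE
)
pairs <- pairs[order(-pairs$shared_events), ]

# Keep all pairs that are tied for the largest number of shared events
top_pairs <- pairs[pairs$shared_events == max(pairs$shared_events), ]
top_pairs
```

Filtering on the maximum instead of taking the first row means the result still makes sense when several pairs are tied.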

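This general approach (by which I mean recognizing that your data describe a social network) may interest you for further analysis. For example, people may be meaningfully connected if they attend parties with the same person, even if not with each other. If your dataset is really large, you can speed up the matrix algebra by using sparse matrices, or by loading the igraph package and using its functions to declare a social network.

【Comments】:

Very nice approach!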
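
As a rough illustration of the sparse-matrix idea, here is a minimal sketch using the Matrix package. The variable names are mine, and the conversion back to a dense matrix at the end is only there to display the small example; on a large dataset you would keep the result sparse.

```r
library(Matrix)

df <- data.frame(
  event = rep(c("birthday", "wedding", "bar mitzvah", "retirement"),
              times = c(3, 3, 4, 2)),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe",
             "William Hill", "Janet Black", "Matthew Green"),
  stringsAsFactors = FALSE
)

guests <- factor(df$guests)
events <- factor(df$event)

# Sparse guests x events matrix: only the attended (guest, event) cells are stored.
# Assumes each guest appears at most once per event (repeated cells would be summed).
M <- sparseMatrix(
  i = as.integer(guests),
  j = as.integer(events),
  x = 1,
  dims = c(nlevels(guests), nlevels(events)),
  dimnames = list(levels(guests), levels(events))
)

# Same product as M %*% t(M), computed on the sparse representation
co <- tcrossprod(M)

# Dense copy for display only; avoid this step on large data
shared <- as.matrix(co)
shared["John Doe", "Jane Doe"]
```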

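【Solution 2】:

Based on your statement "attended the most events together", I assume that by similarity you mean intersect.

You can find the intersection of the events attended by each pair of names with the following code: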

# All names that we have
nameAll <- unique(df$guests)
# Length of names vector
N <- length(nameAll)

# Function to find intersect between names
getSimilarity <- function(nameA, nameB, type = "intersect") {
    # Subset events for name A
    eventA <- subset(df, guests == nameA)$event
    # Subset events for name B
    eventB <- subset(df, guests == nameB)$event
    # Find intersect length between events
    if (type == "intersect") {
        res <- length(intersect(eventA, eventB))
    }
    # Find Jaccard index between events
    if (type == "JC") {
        res <- length(intersect(eventA, eventB)) / length(union(eventA, eventB))
    }
    # Return result
    return(data.frame(type, value = res, nameA, nameB))
}

# Iterate over all possible combinations
# Using double loop for simpler representation    
result <- list()
for(i in 1:(N-1)) {
    for(j in (i+1):N) {
        result[[length(result) + 1]] <- getSimilarity(nameAll[i], nameAll[j])
    }
}
# Transform result to data.frame and order by similarity 
result <- do.call(rbind, result)
# Showing top 5 pairs (head() returns 6 rows unless told otherwise)
head(result[with(result, order(-value)), ], 5)

       type value    nameA         nameB
1 intersect     3 John Doe      Jane Doe
2 intersect     1 John Doe    Mark White
3 intersect     1 John Doe Matthew Green
4 intersect     1 John Doe   Janet Black
5 intersect     1 John Doe  William Hill

The Jaccard index (type = "JC" in getSimilarity) gives the same top pair:

   type     value       nameA        nameB
1    JC 1.0000000    John Doe     Jane Doe
15   JC 0.5000000 Janet Black William Hill
2    JC 0.3333333    John Doe   Mark White
5    JC 0.3333333    John Doe William Hill
6    JC 0.3333333    Jane Doe   Mark White
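
The double loop above compares every pair of names, including pairs that never share an event. As an alternative sketch (not the method used in this answer), you could list the pairs within each event and tabulate them. The " & " separator and the variable names are my own choices, and the code assumes guest names identify people uniquely.

```r
df <- data.frame(
  event = rep(c("birthday", "wedding", "bar mitzvah", "retirement"),
              times = c(3, 3, 4, 2)),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe",
             "William Hill", "Janet Black", "Matthew Green"),
  stringsAsFactors = FALSE
)

# For each event, build every unordered pair of guests who attended it
pairs_per_event <- lapply(split(df$guests, df$event), function(g) {
  g <- sort(unique(g))              # sorting makes "A & B" and "B & A" identical
  if (length(g) < 2) return(NULL)   # an event with one guest has no pairs
  combn(g, 2, FUN = paste, collapse = " & ")
})

# Count how often each pair occurs across all events
pair_counts <- sort(table(unlist(pairs_per_event)), decreasing = TRUE)
head(pair_counts, 3)
```

This only touches pairs that actually co-occur, which may matter when there are many guests but each event is small.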

Data (df):

structure(list(event = c("birthday", "birthday", "birthday", 
"wedding", "wedding", "wedding", "bar mitzvah", "bar mitzvah", 
"bar mitzvah", "bar mitzvah", "retirement", "retirement"), guests = c("John Doe", 
"Jane Doe", "Mark White", "John Doe", "Jane Doe", "Matthew Green", 
"Janet Black", "John Doe", "Jane Doe", "William Hill", "Janet Black", 
"Matthew Green")), .Names = c("event", "guests"), row.names = c(NA, 
-12L), class = "data.frame")

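【Comments】:

【Solution 3】:

I think the answers here are great. I just want to share some thoughts. If you are working with a large dataset with many guests or many events, many situations are possible. For example, more than two guests may all have attended the most events together, or two groups of guests may have attended two different sets of events but with the same total count. If that is the case, finding the top two guests may not be enough.

Here I would like to demonstrate using hierarchical clustering to find similar guests or groups.

We can first construct a matrix of 1s and 0s, where 1 means attended and 0 means did not attend.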

library(tidyverse)
library(vegan)

# dat is the same event/guests data frame that the other answers call df
dat <- df

dat_m <- dat %>%
  mutate(value = 1) %>%
  spread(event, value, fill = 0) %>%
  column_to_rownames(var = "guests") %>%
  as.matrix()

dat_m
#               bar mitzvah birthday retirement wedding
# Jane Doe                1        1          0       1
# Janet Black             1        0          1       0
# John Doe                1        1          0       1
# Mark White              0        1          0       0
# Matthew Green           0        0          1       1
# William Hill            1        0          0       0

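Then we can compute the distance between each pair of guests. Note that I used the vegdist function from the vegan package and set binary = TRUE, because we are dealing with binary data.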

dat_dist <- vegdist(dat_m, binary = TRUE)

dat_dist
#                Jane Doe Janet Black  John Doe Mark White Matthew Green
# Janet Black   0.6000000                                               
# John Doe      0.0000000   0.6000000                                   
# Mark White    0.5000000   1.0000000 0.5000000                         
# Matthew Green 0.6000000   0.5000000 0.6000000  1.0000000              
# William Hill  0.5000000   0.3333333 0.5000000  1.0000000     1.0000000
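
For readers without vegan, the numbers above can be reproduced in base R. With its default Bray-Curtis method and binary = TRUE, vegdist works on presence/absence data, which amounts to the Sørensen dissimilarity 1 - 2 * shared / (events of A + events of B). The sketch below rests on that reading, and the variable names are mine.

```r
df <- data.frame(
  event = rep(c("birthday", "wedding", "bar mitzvah", "retirement"),
              times = c(3, 3, 4, 2)),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe",
             "William Hill", "Janet Black", "Matthew Green"),
  stringsAsFactors = FALSE
)

# 1/0 attendance matrix, guests in rows and events in columns
dat_m <- 1 * (unclass(table(df$guests, df$event)) > 0)

shared   <- dat_m %*% t(dat_m)   # events shared by each pair of guests
n_events <- rowSums(dat_m)       # events attended by each guest

# Sorensen dissimilarity: 1 - 2 * shared / (n_A + n_B)
sorensen <- 1 - 2 * shared / outer(n_events, n_events, "+")

# Lower triangle only, laid out like the vegdist output
round(as.dist(sorensen), 7)
```

For instance, Jane Doe (3 events) and Janet Black (2 events) share one event, so their distance is 1 - 2 * 1 / 5 = 0.6, matching the table above.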

Then we can run hierarchical clustering and look at the result.

hc <- hclust(dat_dist)
plot(hc)

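According to the dendrogram, Jane Doe and John Doe are the most similar, and as a group they are the most different from the other groups.

We can also check that Jane Doe and John Doe attended the largest number of events, so we know we can pick these two.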
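
If you would rather read the groups programmatically than off a plot, cutree can cut the tree. This sketch swaps vegdist for base R's dist(method = "binary"), which is a Jaccard-type distance and not the measure used above, so the heights differ even though the closest pair is the same. Choosing k = 5 is an assumption that suits this example: with six guests it keeps only the first merge.

```r
df <- data.frame(
  event = rep(c("birthday", "wedding", "bar mitzvah", "retirement"),
              times = c(3, 3, 4, 2)),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe",
             "William Hill", "Janet Black", "Matthew Green"),
  stringsAsFactors = FALSE
)

# 1/0 attendance matrix, guests in rows and events in columns
dat_m <- 1 * (unclass(table(df$guests, df$event)) > 0)

# Base R substitute for vegdist: binary (Jaccard-type) distance between guests
d <- dist(dat_m, method = "binary")
hc <- hclust(d)

# Cut into 5 groups: with 6 guests only the closest pair ends up together
groups <- cutree(hc, k = 5)
split(names(groups), groups)
```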

rowSums(dat_m)
#      Jane Doe   Janet Black      John Doe    Mark White Matthew Green  William Hill 
#             3             2             3             1             2             1

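Again, I think the other answers are more direct and give you the output for this sample dataset, but if you are working with a larger dataset, hierarchical clustering may be an option.

【Comments】: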