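Based on your phrase "attended the most events together", I assume that by similarity you mean the intersect, i.e. the number of events two guests have in common.
You can compute that intersect for every pair of names with the following code: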
# All names that we have
nameAll <- unique(df$guests)
# Length of names vector
N <- length(nameAll)
# Function to find intersect between names
getSimilarity <- function(nameA, nameB, type = "intersect") {
    # Subset events for name A
    eventA <- subset(df, guests == nameA)$event
    # Subset events for name B
    eventB <- subset(df, guests == nameB)$event
    # Find intersect length between events
    if (type == "intersect") {
        res <- length(intersect(eventA, eventB))
    }
    # Find Jaccard index between events
    if (type == "JC") {
        res <- length(intersect(eventA, eventB)) / length(union(eventA, eventB))
    }
    # Return result
    return(data.frame(type, value = res, nameA, nameB))
}
# Iterate over all possible combinations
# Using double loop for simpler representation
result <- list()
for (i in 1:(N - 1)) {
    for (j in (i + 1):N) {
        result[[length(result) + 1]] <- getSimilarity(nameAll[i], nameAll[j])
    }
}
# Transform result to data.frame and order by similarity
result <- do.call(rbind, result)
# Show the top 5 pairs (head() returns 6 rows unless told otherwise)
head(result[with(result, order(-value)), ], 5)
       type value    nameA         nameB
1 intersect     3 John Doe      Jane Doe
2 intersect     1 John Doe    Mark White
3 intersect     1 John Doe Matthew Green
4 intersect     1 John Doe   Janet Black
5 intersect     1 John Doe  William Hill
The Jaccard index (type = "JC") puts the same pair in first place:
   type     value       nameA        nameB
1    JC 1.0000000    John Doe     Jane Doe
15   JC 0.5000000 Janet Black William Hill
2    JC 0.3333333    John Doe   Mark White
5    JC 0.3333333    John Doe William Hill
6    JC 0.3333333    Jane Doe   Mark White
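To see why Janet Black and William Hill move up to second place under Jaccard although they share only one event, it can help to compute the index directly. This is a minimal sketch, assuming the same df as in this answer; the helper name jaccard and the use of combn() to enumerate pairs are my own choices for illustration and are not part of the code above:

```r
# Same data as df in this answer, rebuilt here so the block runs on its own
df <- data.frame(
  event = c("birthday", "birthday", "birthday",
            "wedding", "wedding", "wedding",
            "bar mitzvah", "bar mitzvah", "bar mitzvah", "bar mitzvah",
            "retirement", "retirement"),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe", "William Hill",
             "Janet Black", "Matthew Green"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: Jaccard index of the event sets of two guests,
# i.e. shared events divided by all events either of them attended
jaccard <- function(nameA, nameB) {
  eventA <- df$event[df$guests == nameA]
  eventB <- df$event[df$guests == nameB]
  length(intersect(eventA, eventB)) / length(union(eventA, eventB))
}

# combn() lists every unordered pair once, one pair per column
pairs <- combn(unique(df$guests), 2)

resultJC <- data.frame(
  type  = "JC",
  value = mapply(jaccard, pairs[1, ], pairs[2, ], USE.NAMES = FALSE),
  nameA = pairs[1, ],
  nameB = pairs[2, ],
  stringsAsFactors = FALSE
)

# Top 5 pairs by Jaccard index
head(resultJC[order(-resultJC$value), ], 5)
```

Janet Black attended two events and William Hill one, and that one is shared, so the index is 1 / 2. John Doe and Mark White also share one event, but their union holds three events, which gives 1 / 3. The raw intersect count cannot tell these two pairs apart.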
Data (df):
df <- structure(list(event = c("birthday", "birthday", "birthday",
"wedding", "wedding", "wedding", "bar mitzvah", "bar mitzvah",
"bar mitzvah", "bar mitzvah", "retirement", "retirement"), guests = c("John Doe",
"Jane Doe", "Mark White", "John Doe", "Jane Doe", "Matthew Green",
"Janet Black", "John Doe", "Jane Doe", "William Hill", "Janet Black",
"Matthew Green")), .Names = c("event", "guests"), row.names = c(NA,
-12L), class = "data.frame")
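With many guests the double loop may become slow, since it calls subset() twice for every pair. One possible alternative, which is a different technique from the loop above and is shown only as a sketch, is to build a guest-by-event incidence matrix with table() and multiply it by its transpose with tcrossprod(). It assumes the same df; the names m, shared, jac and pairs are illustrative:

```r
# Same data as df above, rebuilt here so the block runs on its own
df <- data.frame(
  event = c("birthday", "birthday", "birthday",
            "wedding", "wedding", "wedding",
            "bar mitzvah", "bar mitzvah", "bar mitzvah", "bar mitzvah",
            "retirement", "retirement"),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe", "William Hill",
             "Janet Black", "Matthew Green"),
  stringsAsFactors = FALSE
)

# Incidence matrix: one row per guest, one column per event,
# 1 if the guest attended the event and 0 otherwise
m <- (table(df$guests, df$event) > 0) + 0

# shared[a, b] = number of events attended by both a and b;
# the diagonal holds each guest's own event count
shared <- tcrossprod(m)

# Jaccard index: |A and B| / (|A| + |B| - |A and B|)
n <- diag(shared)
jac <- shared / (outer(n, n, "+") - shared)

# Flatten the upper triangle into one row per unordered pair
idx <- which(upper.tri(shared), arr.ind = TRUE)
pairs <- data.frame(
  nameA     = rownames(shared)[idx[, 1]],
  nameB     = colnames(shared)[idx[, 2]],
  intersect = shared[idx],
  JC        = jac[idx],
  stringsAsFactors = FALSE
)

# Top 5 pairs by number of shared events
head(pairs[order(-pairs$intersect), ], 5)
```

Note that table() sorts the names alphabetically, so the pairs come out in a different order than in the loop version, although the values per pair are the same.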