【Posted】: 2019-03-18 20:58:56
【Question】:
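I want to find the closest match (smallest difference) for a variable between two groups, but if a closest match has already been used, move on to the next-closest match, until n matches have been made.
I used the code from this answer (below) to find the closest match of value between Samples for every pairwise grouping of all groups (i.e. Location by VAR).
However, there are many duplicates: the top match for Sample.x 1, 2 and 3 may all be Sample.y 1.
What I want instead is to find the next-closest match for Sample.x 2, then 3, and so on, until I have the specified number of distinct (Sample.x-Sample.y) matches. The order of Sample.x does not matter, though; I am only looking for the top n matches between Sample.x and Sample.y for a given grouping.
I tried to do this with dplyr::distinct, as shown below. But I am not sure how to filter the data frame on distinct entries of Sample.y and then filter again by the smallest DIFF. Even then, this would not necessarily produce unique Sample pairings.
Is there a clever way to do this in R with dplyr? Does this kind of operation have a name?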
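library(data.table)  # rbindlist()
library(dplyr)       # %>%, distinct(), full_join(), top_n(), ...
library(tidyr)       # gather()
library(purrr)       # map()

# Locations A and C: random Var1 values, Var2 = 1..10
df01 <- data.frame(Location = rep(c("A", "C"), each = 10),
                   Sample = rep(1:10, times = 2),
                   Var1 = signif(runif(20, 55, 58), digits = 4),
                   Var2 = rep(1:10, times = 2))
# Location B: fixed values
df001 <- data.frame(Location = rep("B", each = 10),
                    Sample = 1:10,
                    Var1 = c(1.2, 1.3, 1.4, 1.6, 56, 110.1, 111.6, 111.7, 111.8, 120.5),
                    Var2 = c(1.5, 10.1, 10.2, 11.7, 12.5, 13.6, 14.4, 18.1, 20.9, 21.3))
df <- rbind(df01, df001)
# Reshape to long format: one row per Location, Sample and VAR
dfl <- df %>% gather(VAR, value, 3:4)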
df.result <- df %>%
  # get the unique elements of Location
  distinct(Location) %>%
  # pull the column as a vector
  pull %>%
  # it may be a factor, so convert it to character
  as.character %>%
  # get the pairwise combinations in a list
  combn(m = 2, simplify = FALSE) %>%
  # loop through the list with map and do the full_join
  # with the long format data dfl
  map(~ full_join(dfl %>%
                    filter(Location == first(.x)),
                  dfl %>%
                    filter(Location == last(.x)), by = "VAR") %>%
        # create a column of absolute difference
        mutate(DIFF = abs(value.x - value.y)) %>%
        # grouped by VAR, Sample.x
        group_by(VAR, Sample.x) %>%
        # apply the top_n with wt as DIFF
        # here I choose 5,
        # and then hope that this is enough to get a smaller n of final matches
        top_n(-5, DIFF) %>%
        mutate(GG = paste(Location.x, Location.y, sep="-")))
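res1 <- rbindlist(df.result)
# .keep_all = TRUE retains DIFF, which top_n() below needs;
# plain distinct(Sample.y) would drop that column
res2 <- res1 %>% group_by(GG, VAR) %>% distinct(Sample.y, .keep_all = TRUE)
# keep the two smallest DIFF values per GG and VAR
res3 <- res2 %>% group_by(GG, VAR) %>% top_n(-2, DIFF)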
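The behaviour described above (take the closest pair, retire both samples, repeat until n pairs exist) is commonly called greedy nearest-neighbour matching without replacement. Below is a minimal base R sketch of that idea for a single grouping. The function name `greedy_match` and the toy vectors are hypothetical and not part of the original code, and the sketch assumes finite numeric inputs. Being greedy, it does not guarantee the smallest total difference; that variant is the assignment problem.

```r
# Hypothetical helper (an illustration, not the original poster's code):
# greedy nearest matching without replacement between two numeric vectors.
greedy_match <- function(x, y, n) {
  # All pairwise absolute differences; rows index x, columns index y
  d <- abs(outer(x, y, "-"))
  # Cannot make more pairs than the shorter vector allows
  n <- min(n, length(x), length(y))
  sx <- integer(n)
  sy <- integer(n)
  dd <- numeric(n)
  for (k in seq_len(n)) {
    # Row and column of the smallest remaining difference
    idx <- arrayInd(which.min(d), dim(d))
    i <- idx[1, 1]
    j <- idx[1, 2]
    sx[k] <- i
    sy[k] <- j
    dd[k] <- d[i, j]
    # Retire both samples so neither can be matched again
    d[i, ] <- Inf
    d[, j] <- Inf
  }
  data.frame(Sample.x = sx, Sample.y = sy, DIFF = dd)
}

# Toy values (assumed for illustration): x = 11 is equally close to
# y = 10 and y = 12, but y = 10 is already taken by x = 10.
matches <- greedy_match(x = c(10, 11, 30), y = c(10, 12, 31, 50), n = 3)
print(matches)
```

To use this per grouping, one option would be to split the long data by VAR and Location pair and call the helper on the two `value` vectors of each piece; the returned indices are positions within those vectors, so they would still need mapping back to the actual Sample labels.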
【Discussion】: