【Question Title】:Find closest match, then next closest, between groups until a specified number of matches has been made
【Posted】:2019-03-18 20:58:56
【Question】:

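I want to find the closest match (smallest difference) of a variable between two groups, but if the closest match has already been used, move on to the next closest match, until n matches have been made.

I used the code from this answer (below) to find the closest match of value between Samples for each pairwise grouping of all groups (i.e., Location by VAR).

However, there is a lot of duplication: the top match for Sample.x 1, 2 and 3 may all be Sample.y 1.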
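
To make the duplication concrete, here is a minimal sketch with made-up numbers (these values are illustrative only and are not part of the data set below). Each `x` stands in for a Sample.x value and each `y` for a Sample.y value; picking the nearest `y` independently for every `x` sends all three to the same `y`.

```r
# Made-up values for illustration only
x <- c(1.0, 1.2, 1.4)   # three hypothetical Sample.x values
y <- c(1.1, 50, 90)     # three hypothetical Sample.y values

# For each x, the index of the closest y (chosen independently)
closest <- sapply(x, function(v) which.min(abs(v - y)))
closest
```

Every element of `closest` points at the first `y`, which is exactly the kind of repeated match I want to avoid.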

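What I want is to find the next closest match for Sample.x 2, then for 3, and so on, until I have a specified number of distinct (Sample.x-Sample.y) matches. The order of Sample.x does not matter, though; I am just looking for the top n matches between Sample.x and Sample.y for a given grouping.

I tried to do this with dplyr::distinct, as shown below. But I am not sure how to filter the data frame by distinct entries of Sample.y and then filter again by the smallest DIFF. Also, this does not necessarily produce unique Sample pairings.

Is there a clever way to do this in R with dplyr? Is there a name for this kind of operation?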

library(data.table) # rbindlist(); loaded first so dplyr's first()/last() take precedence
library(dplyr)
library(tidyr)      # gather()
library(purrr)      # map()

df01 <- data.frame(Location = rep(c("A", "C"), each = 10),
                   Sample = rep(1:10, times = 2),
                   Var1 = signif(runif(20, 55, 58), digits = 4),
                   Var2 = rep(1:10, times = 2))
df001 <- data.frame(Location = rep("B", each = 10),
                    Sample = rep(1:10, times = 1),
                    Var1 = c(1.2, 1.3, 1.4, 1.6, 56, 110.1, 111.6, 111.7, 111.8, 120.5),
                    Var2 = c(1.5, 10.1, 10.2, 11.7, 12.5, 13.6, 14.4, 18.1, 20.9, 21.3))
df <- rbind(df01, df001)
# long format: one row per Location / Sample / VAR
dfl <- df %>% gather(VAR, value, 3:4)

df.result <- df %>% 
  # get the unique elements of Location
  distinct(Location) %>% 
  # pull the column as a vector
  pull %>% 
  # it is factor, so convert it to character
  as.character %>% 
  # get the pairwise combinations in a list
  combn(m = 2, simplify = FALSE) %>%
  # loop through the list with map and do the full_join
  # with the long format data dfl
  map(~ full_join(dfl %>% 
                    filter(Location == first(.x)), 
                  dfl %>% 
                    filter(Location == last(.x)), by = "VAR") %>% 
        # create a column of absolute difference
        mutate(DIFF = abs(value.x - value.y)) %>%
        # grouped by VAR, Sample.x
        group_by(VAR, Sample.x) %>%
        # apply the top_n with wt as DIFF
        # here I choose 5, 
        # and then hope that this is enough to get a smaller n of final matches
        top_n(-5, DIFF) %>%
        mutate(GG = paste(Location.x, Location.y, sep="-")))

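res1 <- rbindlist(df.result)
# Attempt: keep one row per Sample.y, then take the 2 smallest DIFF per group.
# .keep_all = TRUE is needed; without it distinct() drops DIFF and top_n() fails.
res2 <- res1 %>% arrange(DIFF) %>% group_by(GG, VAR) %>% distinct(Sample.y, .keep_all = TRUE)
res3 <- res2 %>% group_by(GG, VAR) %>% top_n(-2, DIFF)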

【Comments】:

    Tags: r dplyr


    【Solution 1】:

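    I edited the code above that produces df.result by removing the top_n(-5, DIFF) %>% line. res1 now contains all matches between Sample.x and Sample.y.

    I then used res1 in the code below. It may not be perfect, but what it does is find the closest Sample.y match for the first entry of Sample.x. Both Samples are then filtered out of the data frame. The matching is repeated until a match has been found for every unique value of Sample.y. Results may vary depending on which match is made first.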

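    # Note: the grouping column is GG, the pair label created in the question's
    # pipeline (this answer originally referred to it as grp).
    fun <- function(df) {
      # at most one iteration per unique Sample.y in this group
      HowMany <- length(unique(df$Sample.y))
      i <- 1
      MyList_FF <- list()
      df_f <- df
      # stop early if no candidate pairs are left
      while (i <= HowMany && nrow(df_f) > 0) {
        # closest Sample.y for each remaining Sample.x
        res1 <- df_f %>%
          group_by(GG, VAR, Sample.x) %>%
          filter(DIFF == min(DIFF)) %>%
          ungroup() %>%
          mutate(Rank1 = dense_rank(DIFF))

        # keep the single closest pair (ties broken by row order)
        res2 <- res1 %>% group_by(GG, VAR) %>% filter(rank(Rank1, ties.method = "first") == 1)

        # remove both matched samples before the next round
        SY <- as.numeric(res2$Sample.y)
        SX <- as.numeric(res2$Sample.x)
        df_f <- df_f %>% filter(Sample.y != SY, Sample.x != SX)

        MyList_FF[[i]] <- res2

        i <- i + 1
      }
      do.call("rbind", MyList_FF) # https://stackoverflow.com/a/55542822/1670053
    }

    # res1 is the full set of pairs built above (without the top_n() step)
    MyResult <- res1 %>%
      dplyr::group_split(GG, VAR) %>%
      map_df(fun)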
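
The loop above amounts to what is usually described as greedy one-to-one matching (nearest-neighbour matching without replacement): take the closest remaining pair, remove both samples, and repeat. Below is a minimal base R sketch of that idea for a single group, stopping after `n` matches. The helper name `greedy_match` and the toy values are hypothetical and only for illustration; it assumes a data frame with columns Sample.x, Sample.y and DIFF, as in res1. Greedy matching is not guaranteed to minimise the total difference; that stricter goal is the assignment problem.

```r
# Hypothetical helper: greedy one-to-one matching on the smallest DIFF.
# `pairs` is assumed to hold columns Sample.x, Sample.y and DIFF for ONE group.
greedy_match <- function(pairs, n = Inf) {
  pairs <- pairs[order(pairs$DIFF), , drop = FALSE]  # closest pairs first
  picked <- list()
  while (nrow(pairs) > 0 && length(picked) < n) {
    best <- pairs[1, , drop = FALSE]                 # closest remaining pair
    picked[[length(picked) + 1]] <- best
    # drop every remaining pair that reuses either matched sample
    keep <- pairs$Sample.x != best$Sample.x & pairs$Sample.y != best$Sample.y
    pairs <- pairs[keep, , drop = FALSE]
  }
  do.call(rbind, picked)
}

# Toy data (made up): every x paired with every y
x <- c(1.0, 1.2, 5.0)
y <- c(1.05, 4.0, 9.0)
toy <- expand.grid(Sample.x = seq_along(x), Sample.y = seq_along(y))
toy$DIFF <- abs(x[toy$Sample.x] - y[toy$Sample.y])

matches <- greedy_match(toy, n = 2)
matches
```

With real data this could be applied per group in the same way as `fun`, for example via `group_split(GG, VAR)` followed by `map_df(greedy_match, n = 2)`.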
    

    【Comments】:
