Find nearest entry between dataframes in R

【Question title】: Find nearest entry between dataframes in R
【Posted】: 2020-11-30 16:25:20
【Question】:

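I have two dataframes that I want to compare. I already know that the values in dataframe one (df1$BP) do not fall within the ranges in dataframe two (df2$START to df2$STOP), but I want to return the row of dataframe two whose df2$START or df2$STOP value is closest to df1$BP, where the "Chr" column matches between the data sets (df1$Chr, df2$Chr).

I have managed to do this (see the bottom of the question), but it seems very unwieldy, and I was wondering whether there is a more concise way to achieve the same thing.

So for dataframe one (df1), we have: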

 df1=data.frame(SNP=c("rs79247094","rs13325007"),
           Chr=c(2,3),
           BP=c(48554955,107916058))

df1

SNP         Chr    BP
rs79247094    2     48554955
rs13325007    3    107916058

For dataframe two (df2), we have:

 df2=data.frame(clump=c(1,2,3,4,5,6,7,8),
           Chr=c(2,2,2,2,3,3,3,3),
           START=c(28033538,37576136,58143438,60389362,80814042,107379837,136288405,161777035),
           STOP=c(27451538,36998607,57845065,60242162,79814042,107118837,135530405,161092491))

df2

 clump    Chr      START       STOP
     1      2   28033538   27451538
     2      2   37576136   36998607
     3      2   58143438   57845065
     4      2   60389362   60242162
     5      3   80814042   79814042
     6      3  107379837  107118837
     7      3  136288405  135530405
     8      3  161777035  161092491

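I am interested in returning whichever START/STOP value is closest to BP. Ideally I would return the row, along with the difference between BP and that START or STOP value (df3$Dist), for example: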

df3

 clump   Chr      START       STOP         SNP        BP       Dist
     3     2   58143438   57845065  rs79247094  48554955    9290110
     6     3  107379837  107118837  rs13325007  107916058    536221

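I found similar questions, for example: Return rows establishing a "closest value to" in R

But those find the closest value to a single fixed value, rather than to a value that changes from row to row while also matching on the Chr column.

My long-winded approach is: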

library(dplyr)
df3<-right_join(df1,df2,by="Chr")

This gives me every combination of df1 and df2 rows that share the same Chr.

df3$start_dist<-abs(df3$START-df3$BP)

This creates a column with the absolute difference between START and BP.

df3$stop_dist<-abs(df3$STOP-df3$BP)

This creates a column with the absolute difference between STOP and BP.

df3$dist.compare<-ifelse(df3$start_dist<df3$stop_dist,df3$start_dist,df3$stop_dist)
# Sort by SNP, then by distance; dist.compare must be unquoted so order() sees the column
df3<-df3[with(df3,order(SNP,dist.compare)),]

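This creates a column (dist.compare) holding the smaller of the two differences between BP and START or STOP, and then reorders the rows by SNP and by that column.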

df3<- df3 %>%   group_by(SNP) %>%   mutate(Dist = first(dist.compare))

This creates a column (Dist) holding the minimum value of df3$dist.compare for each SNP.

df3<-df3[which(df3$dist.compare==df3$Dist),c("clump","Chr","START","STOP","SNP","BP","Dist")]
df3<-df3[order(df3$clump),]

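This keeps only the rows where dist.compare matches Dist (that is, the minimum), drops the intermediate columns, and tidies up by reordering by clump. That gets me to where I want to be: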

df3

 clump   Chr      START       STOP         SNP        BP       Dist
     3     2   58143438   57845065  rs79247094  48554955    9290110
     6     3  107379837  107118837  rs13325007  107916058    536221

But I feel it is very convoluted, and I wondered whether anyone has any tips on how to improve the process?

Thanks in advance

【Comments】:

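I am not at my desktop, so I can't propose the full syntax, but I would do it this way: pivot_longer the START and STOP columns of df2, then left_join the result with df1, then compute the distance, and finally group_by and filter for the minimum distance.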
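The comment above gives no code, so the following is only a sketch of how that idea might look, assuming the tidyr and dplyr packages are available. The column names boundary and pos and the result name nearest are illustrative choices, not part of the original post.

```r
library(dplyr)
library(tidyr)

# Data from the question
df1 <- data.frame(SNP = c("rs79247094", "rs13325007"),
                  Chr = c(2, 3),
                  BP  = c(48554955, 107916058))

df2 <- data.frame(clump = c(1, 2, 3, 4, 5, 6, 7, 8),
                  Chr   = c(2, 2, 2, 2, 3, 3, 3, 3),
                  START = c(28033538, 37576136, 58143438, 60389362,
                            80814042, 107379837, 136288405, 161777035),
                  STOP  = c(27451538, 36998607, 57845065, 60242162,
                            79814042, 107118837, 135530405, 161092491))

nearest <- df2 %>%
  # Stack START and STOP into a single position column
  # ("boundary" and "pos" are hypothetical names chosen for this sketch)
  pivot_longer(c(START, STOP), names_to = "boundary", values_to = "pos") %>%
  # Attach the SNP that sits on the same chromosome
  left_join(df1, by = "Chr") %>%
  # Distance from the SNP to each boundary
  mutate(Dist = abs(pos - BP)) %>%
  # Keep the closest boundary per SNP
  group_by(SNP) %>%
  filter(Dist == min(Dist)) %>%
  ungroup()
```

One trade-off of this shape is that the separate START and STOP columns are gone after pivoting; the boundary column records which of the two was closest, and the original columns could be recovered by joining nearest back to df2 on clump.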

Tags: r dataframe dplyr match tidyverse

【Solution 1】:

Following the logic you laid out in your syntax, here is a more concise dplyr solution:

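right_join your data frames
Create the variable dist.compare from the absolute differences
Group by SNP
Filter to keep the minimum distance
Select the variables in the order you want for the final data frame. Note that you can rename a variable inside the dplyr::select statement (Dist = dist.compare)
Sort the values by clump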

 library(dplyr)

 df3 <- right_join(df1, df2, by = "Chr") %>% 
   mutate(dist.compare = ifelse(abs(START - BP) < abs(STOP - BP), abs(START - BP), abs(STOP - BP))) %>% 
   group_by(SNP) %>% 
   filter(dist.compare == min(dist.compare)) %>% 
   select(clump, Chr, START, STOP, SNP, BP, Dist = dist.compare) %>% 
   arrange(clump)

This gives us:

  clump   Chr     START      STOP SNP               BP    Dist
  <dbl> <dbl>     <dbl>     <dbl> <chr>          <dbl>   <dbl>
1     3     2  58143438  57845065 rs79247094  48554955 9290110
2     6     3 107379837 107118837 rs13325007 107916058  536221
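As a possible variation on the steps above (not part of the original answer), the ifelse() comparison can be written with base R's pmin(), and the group-wise filter with dplyr's slice_min(), which assumes dplyr 1.0.0 or later. This is a sketch under those assumptions and uses the same data as the question.

```r
library(dplyr)

# Data from the question
df1 <- data.frame(SNP = c("rs79247094", "rs13325007"),
                  Chr = c(2, 3),
                  BP  = c(48554955, 107916058))

df2 <- data.frame(clump = c(1, 2, 3, 4, 5, 6, 7, 8),
                  Chr   = c(2, 2, 2, 2, 3, 3, 3, 3),
                  START = c(28033538, 37576136, 58143438, 60389362,
                            80814042, 107379837, 136288405, 161777035),
                  STOP  = c(27451538, 36998607, 57845065, 60242162,
                            79814042, 107118837, 135530405, 161092491))

df3 <- right_join(df1, df2, by = "Chr") %>%
  # pmin() takes the element-wise smaller of the two distances
  mutate(Dist = pmin(abs(START - BP), abs(STOP - BP))) %>%
  group_by(SNP) %>%
  # Keep the row(s) with the smallest Dist in each SNP group (ties are kept)
  slice_min(Dist, n = 1) %>%
  ungroup() %>%
  select(clump, Chr, START, STOP, SNP, BP, Dist) %>%
  arrange(clump)
```

The ungroup() call is optional here; it simply returns a plain tibble so that later steps are not silently applied per SNP.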

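【Comments】:

This is the level of clean, pristine beauty I was striving for, rather than the long-winded prose I managed! Thank you very much. This will really help my future dplyr endeavours :)
No problem, happy to help!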