Match rows from two dataframes with closest values

【Question title】: Match rows from two dataframes with closest values
【Posted】: 2016-09-15 20:20:39
【Question】:

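I hope someone can help me with this. I am taking measurements of branches. I have two data sets: df.ref (reference) and df.tst (modeled). The reference data contains three branches, identified by df.ref$ID, each with a length and a width value.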

df.ref <- data.frame(ID=c(1,2,3))
df.ref$length <- c(1.3,1.8,2.3)
df.ref$width <- c(0.5,0.7,0.9)
df.ref

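df.tst contains the modeled measurements of the same three branches. However, it contains more branches: six values of df.tst$ID, each also with a length and a width.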

df.tst <- data.frame(ID=c(1,2,3,4,5,6))
df.tst$length <- c(1.1,1.5,1.8,1.8,2.1,2.6)
df.tst$width <- c(0.6,0.6,0.7,0.9,0.8,1.0)
df.tst

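I want to match each reference row to the closest modeled row, using both the length and width values and accepting a match only within a threshold (e.g. 0.2). The result might look like this: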

results <- data.frame(ID.ref=c(1,2,3))
results$ID.tst.match <- c(1,3,5)
results

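I tried using find.matches, but the result was not what I expected. I am also considering computing the RMSE for each pair of rows and iterating to find the minimum, but there must be a cleaner solution.

In addition, there may be cases with no solution (nothing within the threshold). Thanks!!!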

【Question comments】:

Hi. I am looking for the row in df.tst that is closest to any row in df.ref. Within the threshold, row1 is closer than row2 (by difference).

Tags: r match

【Solution 1】:

One approach is to measure all pairwise Euclidean distances between the data points with the dist function:

> dist_mat <- as.matrix(dist(combined[,c('length', 'width')]))
> dist_mat
          1         2         3         4         5         6         7         8         9
1 0.0000000 0.4000000 0.7071068 0.7615773 1.0198039 1.5524175 0.2236068 0.7071068 1.2369317
2 0.4000000 0.0000000 0.3162278 0.4242641 0.6324555 1.1704700 0.2236068 0.3162278 0.8544004
3 0.7071068 0.3162278 0.0000000 0.2000000 0.3162278 0.8544004 0.5385165 0.0000000 0.5385165
4 0.7615773 0.4242641 0.2000000 0.0000000 0.3162278 0.8062258 0.6403124 0.2000000 0.5000000
5 1.0198039 0.6324555 0.3162278 0.3162278 0.0000000 0.5385165 0.8544004 0.3162278 0.2236068
6 1.5524175 1.1704700 0.8544004 0.8062258 0.5385165 0.0000000 1.3928388 0.8544004 0.3162278
7 0.2236068 0.2236068 0.5385165 0.6403124 0.8544004 1.3928388 0.0000000 0.5385165 1.0770330
8 0.7071068 0.3162278 0.0000000 0.2000000 0.3162278 0.8544004 0.5385165 0.0000000 0.5385165
9 1.2369317 0.8544004 0.5385165 0.5000000 0.2236068 0.3162278 1.0770330 0.5385165 0.0000000
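
The `combined` object used above is not defined anywhere in the answer. The printed matrix is consistent with the six df.tst rows stacked above the three df.ref rows, plus a `type` column marking where each row came from. One plausible reconstruction is sketched here; it is an assumption rather than the answerer's confirmed code, and the label `"reference"` in particular is a guess:

```r
# Data from the question.
df.ref <- data.frame(ID = c(1, 2, 3),
                     length = c(1.3, 1.8, 2.3),
                     width = c(0.5, 0.7, 0.9))
df.tst <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                     length = c(1.1, 1.5, 1.8, 1.8, 2.1, 2.6),
                     width = c(0.6, 0.6, 0.7, 0.9, 0.8, 1.0))

# Assumption: test rows come first, then reference rows,
# and each row is tagged with the data frame it came from.
combined <- rbind(cbind(df.tst, type = "test"),
                  cbind(df.ref, type = "reference"))

# All pairwise Euclidean distances on (length, width), as a 9 x 9 matrix.
dist_mat <- as.matrix(dist(combined[, c("length", "width")]))
```

With this ordering, rows 1 to 6 of `dist_mat` are the test branches and rows 7 to 9 are the reference branches, which matches the `type_ind` vector shown in the next step.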

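Since this includes comparisons both within and between the data frames, you can extract only the distances between elements of different data frames: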

> type_ind <- combined$type == 'test'
> type_ind
[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
> cross_comparisons <- dist_mat[type_ind, !type_ind]
> rownames(cross_comparisons) <- df.tst$ID
> colnames(cross_comparisons) <- df.ref$ID
> cross_comparisons
          1         2         3
1 0.2236068 0.7071068 1.2369317
2 0.2236068 0.3162278 0.8544004
3 0.5385165 0.0000000 0.5385165
4 0.6403124 0.2000000 0.5000000
5 0.8544004 0.3162278 0.2236068
6 1.3928388 0.8544004 0.3162278

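Next, to determine the closest point for each of the three reference data points, you just need to find the position of the minimum in each column: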

> apply(cross_comparisons, 2, which.min)
1 2 3 
1 3 5

To check whether any distance falls within your threshold, you can do this:

> threshold <- 0.2
> apply(cross_comparisons, 2, function(x) { any(x < threshold) })
    1     2     3 
FALSE  TRUE FALSE
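
The two steps above can be combined into a table shaped like the `results` data frame in the question, with `NA` where no test row is close enough. This is a minimal sketch, assuming Euclidean distance on (length, width) is the intended notion of "closest"; the function name `match_nearest` and the extra `distance` column are hypothetical additions for illustration:

```r
# Data from the question.
df.ref <- data.frame(ID = c(1, 2, 3),
                     length = c(1.3, 1.8, 2.3),
                     width = c(0.5, 0.7, 0.9))
df.tst <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                     length = c(1.1, 1.5, 1.8, 1.8, 2.1, 2.6),
                     width = c(0.6, 0.6, 0.7, 0.9, 0.8, 1.0))

# Hypothetical helper: nearest test row for each reference row,
# or NA if even the nearest one is farther away than `threshold`.
match_nearest <- function(ref, tst, threshold, cols = c("length", "width")) {
  n_tst <- nrow(tst)
  d <- as.matrix(dist(rbind(tst[, cols], ref[, cols])))
  # Keep only test-vs-reference distances: rows = test, columns = reference.
  cross <- d[seq_len(n_tst), n_tst + seq_len(nrow(ref)), drop = FALSE]

  nearest <- unname(apply(cross, 2, which.min))   # row position in tst
  nearest_dist <- unname(apply(cross, 2, min))    # distance to that row

  data.frame(ID.ref = ref$ID,
             ID.tst.match = ifelse(nearest_dist <= threshold,
                                   tst$ID[nearest], NA),
             distance = nearest_dist)
}

strict <- match_nearest(df.ref, df.tst, threshold = 0.2)
loose  <- match_nearest(df.ref, df.tst, threshold = 0.25)
```

Note that `which.min` returns a row position, so it is mapped back through `tst$ID` in case the IDs are not simply 1 to n. With a Euclidean threshold of 0.2, only reference 2 gets a match, because the nearest test rows for references 1 and 3 are about 0.224 away; the 1, 3, 5 result from the question appears once the threshold is slightly larger, as in `loose`.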

【Comments】:

Thanks for the prompt reply!