【Question title】: How should I find nearest neighbors for every element in a list?
【Posted】: 2011-10-12 15:50:21
【Question description】:

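I have two sets of integers, A and B (the size of A is less than or equal to the size of B), and I want to answer the question "How close is A to B?". The way I want to answer it is by measuring how far you have to go from a given a in A to find a b in B.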

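The specific measure I want to produce works as follows: for each a, find the closest b, the only catch being that once I match a b with an a, I can no longer use that b to match any other a. (EDIT: the algorithm I'm trying to implement always prefers a shorter match. So if b is the nearest neighbor of more than one a, pick the a nearest to b. I'm not sure what to do if more than one a has the same distance to b; right now I pick the a that precedes b, but that is quite arbitrary and not necessarily optimal.) The measure I'll produce for these sets, the final product, is a histogram showing the number of pairs on the vertical axis and the distance of the pairs on the x-axis.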

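So if A = {1, 3, 4} and B = {1, 5, 6, 7}, I will get the following a,b pairs: 1,1, 4,5, 3,6. For these data, the histogram should show one pair with distance zero, one pair with distance 1, and one pair with distance 3.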
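
To make the matching rule concrete, here is a minimal brute-force sketch of it, not the approach described further down. It enumerates every a,b pair, sorts the pairs by distance, and accepts a pair only if neither element has been used yet. The name `greedy_pairs` and the tie-break (prefer a b that follows its a, then order by value) are assumptions made for illustration, and the quadratic number of candidate pairs makes it suitable only for small inputs.

```perl
use strict;
use warnings;

# Brute-force sketch: accept the shortest pairs first, using each a and b once.
sub greedy_pairs {
    my ($set_a, $set_b) = @_;

    # Enumerate every candidate pair as [distance, tie-rank, a, b].
    # Tie-rank 0 means b comes after a (assumed preference), 1 otherwise.
    my @candidates;
    for my $x (@$set_a) {
        for my $y (@$set_b) {
            push @candidates, [ abs($x - $y), ($y > $x ? 0 : 1), $x, $y ];
        }
    }

    # Shortest distance first, then the tie-rank, then by value so the
    # result is deterministic.
    my @ordered = sort {
           $a->[0] <=> $b->[0]
        || $a->[1] <=> $b->[1]
        || $a->[2] <=> $b->[2]
        || $a->[3] <=> $b->[3]
    } @candidates;

    my (%used_a, %used_b, @pairs);
    for my $c (@ordered) {
        my ($dist, undef, $x, $y) = @$c;
        next if $used_a{$x} || $used_b{$y};
        $used_a{$x} = $used_b{$y} = 1;
        push @pairs, [ $x, $y, $dist ];
    }
    return @pairs;
}

my @pairs = greedy_pairs([1, 3, 4], [1, 5, 6, 7]);

# Histogram: distance => number of pairs at that distance.
my %histogram;
$histogram{ $_->[2] }++ for @pairs;

for my $d (sort { $a <=> $b } keys %histogram) {
    print "distance $d: $histogram{$d} pair(s)\n";
}
```

For the example sets this yields the pairs 1,1, 4,5 and 3,6, i.e. one pair each at distances 0, 1 and 3.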

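(The actual size of these sets has an upper bound of about 100,000 elements, and I read them from disk already sorted from low to high. The integers range from 1 to ~20,000,000. EDIT: also, the elements of A and B are unique, i.e. there are no repeated elements.)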

The solution I've come up with feels a bit clunky. I'm using Perl, but the problem is more or less language agnostic.

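1. First, I build a hash with one entry for every number that appears in A or B, whose value records which set(s) the number belongs to, e.g. $hash{5} = {a=>1, b=>1} if 5 appears in both sets. If it appeared only in A, the entry would be $hash{5} = {a=>1}.
2. Next, I iterate over A to find all hash elements that appear in both A and B, mark them in the measure, and remove them from the hash.
3. Then I sort all the hash keys and make each hash element point to its nearest neighbors, like a linked list, so that a given element now looks like $hash{6} = {b=>1, previous=>4, next=>8}. The linked list does not know whether the next and previous elements are in A or B.
4. Then I loop over pair distances starting at d=1, find all pairs at distance d, mark them, and remove them from the hash, until no more elements of A can be matched.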
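
The first three steps above can be sketched roughly as follows. This is only an illustration of the data structure as described, not the original code: the function name `build_linked_hash` and the variables `%node` and `%histogram` are assumed names, and "marking in the measure" is assumed to mean incrementing a histogram bucket.

```perl
use strict;
use warnings;

# Sketch of steps 1-3: flag set membership, drop exact matches, link neighbors.
sub build_linked_hash {
    my ($set_a, $set_b) = @_;
    my (%node, %histogram);

    # Step 1: one entry per number, flagged with the set(s) it appears in.
    $node{$_}{a} = 1 for @$set_a;
    $node{$_}{b} = 1 for @$set_b;

    # Step 2: numbers present in both sets pair up at distance 0.
    for my $x (@$set_a) {
        next unless $node{$x}{b};
        $histogram{0}++;
        delete $node{$x};
    }

    # Step 3: link each remaining key to its neighbors in sorted order.
    my @keys = sort { $a <=> $b } keys %node;
    for my $i (0 .. $#keys) {
        $node{ $keys[$i] }{previous} = $keys[ $i - 1 ] if $i > 0;
        $node{ $keys[$i] }{next}     = $keys[ $i + 1 ] if $i < $#keys;
    }
    return (\%node, \%histogram);
}

my ($node, $histogram) = build_linked_hash([1, 3, 4], [1, 5, 6, 7]);
printf "4: previous=%d next=%d\n", $node->{4}{previous}, $node->{4}{next};
```

With A = {1, 3, 4} and B = {1, 5, 6, 7}, the number 1 is removed as a distance-0 pair and the remaining keys 3, 4, 5, 6, 7 are chained together, so the entry for 4 points back to 3 and forward to 5.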

The loop looks like this:

for ($d=1; @a > 0; $d++) {
    @left = ();
    foreach my $a (@a) {
        $next = $a;
        # find closest b ahead of $a, stop searching if you pass $d
        while (exists $hash{$next}{next} && $next - $a < $d) {
            $next = $hash{$next}{next};
        }
        if (exists $hash{$next}{b} && $next - $a == $d) {
            # found a pair at distance $d
            mark_in_measure($a, $next);
            remove_from_linked_list($next);
            remove_from_linked_list($a);
            next;
        }

        # do same thing looking behind $a
        $prev = $a;
        ...

        # you didn't find a match for $a
        push @left, $a;
    }
    @a = @left;
}

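This loop obviously prefers pairs whose b comes after the a; I can't tell whether there is a sensible way to decide whether later is better than earlier (better in the sense of producing closer pairs). The main optimization I'm interested in is processing time.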

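【Question discussion】:

You say that for A = {1, 3, 4}, B = {1, 5, 6, 7} you get the a,b pairs 1,1, 4,5, 3,6, i.e. one pair each at distances 0, 1, 3. Wouldn't the pairing 1,1, 3,5, 4,6 with distances 0, 2, 2 be better?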
en.wikipedia.org/wiki/Levenshtein_distance
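What do you use to compute the cost of a matching? The minimum sum of distances? The minimum sum of squared distances?
@jwpat7 I'm not sure it is better, since the total distance between pairs stays the same. It is probably possible to construct A and B such that always preferring shorter pairs leads to a higher sum of pair distances, although it isn't obvious to me how you would do that. Even so, this simple greedy algorithm is good enough for my application. If the optimization isn't very complicated, I'd be interested in how you would do it.
@missingno A cost-function approach would be interesting, but I'm fine with an algorithm that always prefers the shorter match.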

Tags: algorithm language-agnostic nearest-neighbor

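【Solution 1】:

It sounds like you have a special case of the Assignment Problem (finding a minimum-weight matching in a weighted bipartite graph).

The general algorithms for the assignment problem run in O(N^3), which is too slow for you, but I'm pretty sure you can shave off some of that complexity by exploiting your particular weight function, or the fact that you only want a histogram rather than the exact matching.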
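
As one hedged illustration of exploiting the weight function: when the cost is |a - b| and both lists are sorted, a minimum-total-distance matching can be chosen so that it never crosses (a smaller a is never paired with a larger b than a larger a is). That allows a simple O(N*M) dynamic program instead of a general O(N^3) solver. Note that this minimizes the sum of distances, which is a different objective from the shortest-pair-first rule in the question, and at 100,000 elements per list it may still be too slow. The function name `min_total_distance` is an assumption for this sketch.

```perl
use strict;
use warnings;

# Minimum total |a - b| when every a is matched to a distinct b.
# Assumes both lists are sorted ascending and A is no longer than B.
sub min_total_distance {
    my ($set_a, $set_b) = @_;
    my $n   = @$set_a;
    my $m   = @$set_b;
    my $inf = 9**9**9;

    # $prev[$j]: best cost of matching the first i-1 a's within the first j b's.
    # Matching zero a's costs nothing.
    my @prev = (0) x ($m + 1);
    for my $i (1 .. $n) {
        my @cur = ($inf) x ($m + 1);
        for my $j ($i .. $m) {
            my $skip = $cur[ $j - 1 ];    # leave the j-th b unused
            my $take = $prev[ $j - 1 ]    # pair the i-th a with the j-th b
                     + abs($set_a->[ $i - 1 ] - $set_b->[ $j - 1 ]);
            $cur[$j] = $skip < $take ? $skip : $take;
        }
        @prev = @cur;
    }
    return $prev[$m];
}

print min_total_distance([1, 3, 4], [1, 5, 6, 7]), "\n";
print min_total_distance([0, 3],    [2, 5]),       "\n";
```

Both calls return 4. The second input also shows that the two objectives can disagree: preferring the shortest pair first matches 3 with 2 and then has to match 0 with 5, for a total of 6.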

【Discussion】:

【Solution 2】:

#!/usr/bin/perl

use strict;
use warnings FATAL => 'all';
use diagnostics;  

# http://www.hungarianalgorithm.com/solve.php?c=3-2-6-22--7-2-2-18--13-8-4-12--23-18-14-2&random=1
# https://www.topcoder.com/community/data-science/data-science-tutorials/assignment-problem-and-hungarian-algorithm/
# http://www.cse.ust.hk/~golin/COMP572/Notes/Matching.pdf

my @mat;
my @out_mat;

my $spaces    = 6;
my $precision = 0;

my $N = 10;
my $M = 12;
my $r = 100;

my @array1; my @array2;

for my $i (1..$N) {
    push @array1, sprintf( "%.${precision}f",  rand($r)  );
}

for my $i (1..$M) {
    push @array2, sprintf( "%.${precision}f",  rand($r)  );
}

#@array1 = ( 1, 3, 4);      # $mat[i]->[j] = abs( array1[i] - array2[j] )
#@array2 = ( 1, 5, 6, 7);

#                1     5     6     7

#     1     [    0*    4     5     6 ]

#     3     [    2     2*    3     4 ]

#     4     [    3     1     2*    3 ]

my $min_size  = $#array1 < $#array2 ? $#array1 : $#array2;
my $max_size  = $#array1 > $#array2 ? $#array1 : $#array2;

for (my $i = 0; $i < @array1; $i++){
   my @weight_function;
   for (my $j = 0; $j < @array2; $j++){
      my $dif = sprintf( "%.${precision}f", abs ($array1[$i] - $array2[$j])  );
      #my $dif = sprintf( "%.${precision}f", ($array1[$i] - $array2[$j])**2  ); 
      push @weight_function, $dif;
   }
   push @mat, \@weight_function;
}


# http://cpansearch.perl.org/src/TPEDERSE/Algorithm-Munkres-0.08/lib/Algorithm/Munkres.pm

Algorithm::Munkres::assign(\@mat,\@out_mat);


print "\n\@out_mat index  = [";
for my $index (@out_mat) {
   printf("%${spaces}d", $index);
}
print " ]\n";

print "\@out_mat values = [";

my %hash;
for my $i (0 .. $max_size){
   my $j = $out_mat[$i];
   last if ( $i > $min_size and $#array1 < $#array2 );
   next if ( $j > $min_size and $#array1 > $#array2 );
   my $dif = $mat[$i]->[$j];
   printf( "%${spaces}.${precision}f", $dif );
   $hash{ $dif } { $i } { 'index_array1' } = $i;
   $hash{ $dif } { $i } { 'index_array2' } = $j;
   $hash{ $dif } { $i } { 'value_array1' } = $array1[$i];
   $hash{ $dif } { $i } { 'value_array2' } = $array2[$j]; 
}

print " ]\n\n";


my $soma_da_dif = 0;

foreach my $min_diferenca ( sort { $a <=> $b } keys %hash ){
   foreach my $k ( sort { $a <=> $b } keys %{$hash{$min_diferenca}} ){
      $soma_da_dif += $min_diferenca;
      my $index_array1 = $hash{ $min_diferenca } { $k } { 'index_array1' };
      my $index_array2 = $hash{ $min_diferenca } { $k } { 'index_array2' };
      my $value_array1 = $hash{ $min_diferenca } { $k } { 'value_array1' };
      my $value_array2 = $hash{ $min_diferenca } { $k } { 'value_array2' };
      printf( "   index (%${spaces}.0f,%${spaces}.0f), values (%${spaces}.${precision}f,%${spaces}.${precision}f), dif = %${spaces}.${precision}f\n", 
              $index_array1, $index_array2, $value_array1, $value_array2, $min_diferenca );

   }
}
print "\n\nSum = $soma_da_dif\n";





#-------------------------------------------------#
#------------------ New-Package ------------------# 

{ # start scope block

package Algorithm::Munkres;

use 5.006;
use strict;
use warnings;

require Exporter;
our @ISA = qw(Exporter);
our @EXPORT = qw( assign );
our $VERSION = '0.08';

# ...
# ... <---- copy all the 'package Algorithm::Munkres' here
# ...

return $minval;
}

1;  # don't forget to return a true value from the file

} # end scope block

【Discussion】: