Efficiently delete arrays that are close to each other given a threshold in Python

【Question title】: Efficiently delete arrays that are close from each other given a threshold in python
【Posted】: 2017-03-27 09:17:33
【Question】:

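I'm using Python for this job and, to be very objective here, I want to find a "pythonic" way to remove from an array of arrays the "duplicates" that lie within a threshold of each other. For example, given this array: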

[[ 5.024,  1.559,  0.281], [ 6.198,  4.827,  1.653], [ 6.199,  4.828,  1.653]]

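Notice that [ 6.198, 4.827, 1.653] and [ 6.199, 4.828, 1.653] are really close to each other: their Euclidean distance is 0.0014, so they are almost "duplicates". I want my final output to be just: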

[[ 5.024,  1.559,  0.281], [ 6.198,  4.827,  1.653]]

My current algorithm is:

to_delete = [];
for i in unique_cluster_centers:
    for ii in unique_cluster_centers:
        if i == ii:
            pass;
        elif np.linalg.norm(np.array(i) - np.array(ii)) <= self.tolerance:
            to_delete.append(ii);
            break;

for i in to_delete:
    try:
        uniques.remove(i);
    except:
        pass;

But it is really slow, and I would like to know a faster and more "pythonic" way to solve this. My tolerance is 0.0001.

【Comments on the question】:

What is np.array(i) supposed to be? I don't think it produces anything in a real Python/numpy script.
stackoverflow.com/a/41677769/901925 scipy.spatial.distance has pairwise distance functions.

Tags: python numpy duplicates distance

【Solution 1】:

A generic approach might be:

def filter_quadratic(data,condition):
    result = []
    for element in data:
        if all(condition(element,other) for other in result):
            result.append(element)
    return result

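This is a generic higher-order filter that takes a condition. An element is added only if the condition holds between it and every element already in the result list.

Now we still need to define the condition: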

def the_condition(xs,ys):
    # working with squares, 2.5e-05 is 0.005*0.005 
    return sum((x-y)*(x-y) for x,y in zip(xs,ys)) > 2.5e-05

This gives:

>>> filter_quadratic([[ 5.024,  1.559,  0.281], [ 6.198,  4.827,  1.653], [ 6.199,  4.828,  1.653]],the_condition)
[[5.024, 1.559, 0.281], [6.198, 4.827, 1.653]]

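The algorithm runs in O(n²), where n is the number of elements you give the function. You can make it more efficient with a k-d tree, but that requires a more advanced data structure.

【Comments】:

I'll try that one; I'm avoiding k-d trees. I prefer something "from scratch".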
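
For reference, here is a minimal sketch of the k-d tree variant mentioned above. It assumes NumPy and SciPy are available and uses `scipy.spatial.KDTree` with `query_ball_point`. The name `filter_kdtree` is made up for this illustration and is not part of the original answer. It keeps the same greedy rule as `filter_quadratic`: a point survives only if no earlier surviving point lies within the tolerance.

```python
import numpy as np
from scipy.spatial import KDTree


def filter_kdtree(points, tolerance):
    """Greedy near-duplicate filter: keep a point only if no earlier
    kept point lies within `tolerance` (Euclidean distance)."""
    pts = np.asarray(points, dtype=float)
    tree = KDTree(pts)
    keep = np.ones(len(pts), dtype=bool)
    for i in range(len(pts)):
        if not keep[i]:
            continue  # already dropped as a near-duplicate of an earlier point
        # Indices of all points within `tolerance` of point i (i itself included).
        for j in tree.query_ball_point(pts[i], r=tolerance):
            if j > i:
                keep[j] = False  # later point too close to a kept one
    return pts[keep]


data = [[5.024, 1.559, 0.281], [6.198, 4.827, 1.653], [6.199, 4.828, 1.653]]
filtered = filter_kdtree(data, 0.005)
print(filtered)
```

With the three rows from the question and a tolerance of 0.005, this keeps the first two rows, as `filter_quadratic` does. The tree is built once over all points, so each neighbourhood lookup avoids scanning the whole list.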

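【Solution 2】:

It would be much more efficient if you could avoid comparing every list element with every other element in a nested loop, which is inevitably an O(n^2) operation.

One approach is to generate a key such that two "near-duplicates" produce the same key. Then you only need to iterate over the data once, inserting only the values whose key is not yet in the result set.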

result = {}
for row in unique_cluster_centers:
    # round each value to 2 decimal places: 
    # [5.024,  1.559,  0.281] => (5.02,  1.56,  0.28)
    # you can be inventive and, say, multiply each value by 3 before rounding
    # if you want precision other than a whole decimal point.
    key = tuple([round(v, 2) for v in row])  # tuples can be keys of a dict
    if key not in result:
        result[key] = row
unique_rows = list(result.values())  # if the order of the items matters on Python < 3.7, use OrderedDict

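【Comments】:

For this to work you would still need to compare adjacent buckets (for example, 1.004999, 1.000000, 0.00000 and 1.005001, 1.000000, 0.000000 would get different keys under your scheme).
Hmm, yes, you're right :) I wonder whether it could be solved by making 3 passes over the dataset, where the two extra passes use values shifted up and down by half a bucket size. That would catch the values at the edges of the "buckets". Three sequential passes would still be much faster than the nested loop.
On second thought: no, the idea in my previous comment doesn't work :( Example: 0.9 and 1.89 when rounding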
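
The "compare adjacent buckets" idea from the first comment can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the answerer's code: the name `filter_grid` is hypothetical, the bucket size is set equal to the tolerance, and the 3^d neighbouring cells are scanned for each point (27 cells for 3-D data). Because two points within `tolerance` of each other differ by at most `tolerance` in every coordinate, their cell indices differ by at most 1 per axis, so checking the neighbouring cells is enough.

```python
import itertools
import math


def filter_grid(points, tolerance):
    """Greedy near-duplicate filter using a uniform grid whose cell
    size equals `tolerance`; only neighbouring cells are compared."""
    cells = {}  # cell index tuple -> kept points that fall in that cell
    kept = []
    tol_sq = tolerance * tolerance
    for p in points:
        cell = tuple(math.floor(v / tolerance) for v in p)
        duplicate = False
        # Scan this cell and every adjacent cell (offsets -1, 0, +1 per axis).
        for offset in itertools.product((-1, 0, 1), repeat=len(cell)):
            neighbour = tuple(c + o for c, o in zip(cell, offset))
            for q in cells.get(neighbour, ()):
                if sum((a - b) ** 2 for a, b in zip(p, q)) <= tol_sq:
                    duplicate = True
                    break
            if duplicate:
                break
        if not duplicate:
            kept.append(p)
            cells.setdefault(cell, []).append(p)
    return kept


edge_case = [[1.004999, 1.0, 0.0], [1.005001, 1.0, 0.0]]
print(filter_grid(edge_case, 0.005))
```

For the edge case from the comment, the two points land in different cells, yet the second one is still recognised as a near-duplicate and only the first point is kept. The cost per point grows with 3^d, so this sketch is only reasonable for low-dimensional data such as the 3-D rows in the question.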