Doing calculations with multiple numpy arrays without for loops: answers

【Question title】: Doing calculations with multiple numpy arrays without for loops
【Posted】: 2019-04-13 16:35:42
【Question】:

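I am brute-force calculating the shortest distance from one point to many other points on a 2D plane, with the data coming from pandas dataframes via df['column'].to_numpy().

Currently, I am using nested for loops over the numpy arrays to fill a list, taking the minimum of that list, and storing that value in another list.

Checking 1000 points (from df_point) against 25,000 points (from df_compare) takes about a minute, which is understandable since this is an inefficient process. My code is below.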

import numpy as np

point_x = df_point['x'].to_numpy()
compare_x = df_compare['x'].to_numpy()
point_y = df_point['y'].to_numpy()
compare_y = df_compare['y'].to_numpy()
dumarr = []
minvals = []

# Brute force calculate the closest point by using the Pythagorean theorem comparing each
# point to every other point
for k in range(len(point_x)):
    for i,j in np.nditer([compare_x,compare_y]):
        dumarr.append(((point_x[k] - i)**2 + (point_y[k] - j)**2))
    minvals.append(df_compare['point_name'][dumarr.index(min(dumarr))])
    # Clear dummy array (otherwise it will continuously append to)
    dumarr = []

This isn't particularly pythonic. Is there a way to do this with vectorization, or at least without nested for loops?

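【Question comments】:

You can use cdist from the scipy library to get a 1k x 25k distance matrix, then use numpy.min on that matrix along the appropriate axis to get an array of 1k minimums. Assuming you have enough RAM to hold the full distance matrix in memory, it will be much faster.
@thesilkworm Could you give an example that uses four arrays instead of two?
I assume your 4 arrays are 1d, but it would be best to confirm that (maybe even with a small example). And don't use nditer. zip(compare_x, compare_y) is simpler (and faster).
@DrakeMurdoch - It only works with two arrays, but they can be 2-D arrays, as in the example I just posted.

Tags: python pandas numpy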

【Solution 1】:

The approach is to create a 1000 x 25000 matrix and then find the indices of the row minimums.

# distances for all combinations (1000x25000 matrix)
dum_arr = (point_x[:, None] - compare_x)**2 + (point_y[:, None] - compare_y)**2

# indices of minimums along rows
idx = np.argmin(dum_arr, axis=1)

# Not sure what is needed from the indices; this gets the values
# from the `point_name` column of the dataframe using the found indices
min_vals = df_compare['point_name'].iloc[idx]
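
The full 1000 x 25000 float64 matrix takes about 200 MB, which ties in with the RAM caveat raised in the question comments. If that is too much, the same broadcasting idea can be applied to a block of rows at a time. The following is only a sketch under that assumption: the function name `nearest_index_chunked`, the `chunk_size` parameter, and the random stand-in data are hypothetical and not part of the original answer.

```python
import numpy as np

def nearest_index_chunked(point_x, point_y, compare_x, compare_y, chunk_size=256):
    """For each point, return the index of the nearest compare point.

    Hypothetical helper: processes `chunk_size` points per step so that only a
    (chunk_size x len(compare_x)) block of squared distances exists at once.
    """
    idx = np.empty(len(point_x), dtype=np.intp)
    for start in range(0, len(point_x), chunk_size):
        stop = start + chunk_size
        # (chunk, 1) minus (n_compare,) broadcasts to (chunk, n_compare)
        dx = point_x[start:stop, None] - compare_x
        dy = point_y[start:stop, None] - compare_y
        idx[start:stop] = np.argmin(dx**2 + dy**2, axis=1)
    return idx

# Random stand-in data with the sizes mentioned in the question
rng = np.random.default_rng(0)
point_x, point_y = rng.random(1000), rng.random(1000)
compare_x, compare_y = rng.random(25000), rng.random(25000)

idx_chunked = nearest_index_chunked(point_x, point_y, compare_x, compare_y)
```

The resulting `idx_chunked` can be used the same way as `idx` above, for example with `df_compare['point_name'].iloc[idx_chunked]`.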

【Discussion】:

【Solution 2】:

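Here is the approach:

1. Create a DataFrame with columns -> pointID, CoordX, CoordY
2. Create an auxiliary DataFrame offset by 1 (oldDF.iloc[pointIDx] = newDF.iloc[pointIDx]-1)
3. This offset needs to be looped from 1 to the number of coordinates - 1
4. tempDF["Euclid Dist"] = sqrt(square(oldDf["CoordX"]-newDF["CoordX"])+square(oldDf["CoordY"]-newDF["CoordY"]))
5. Append this tempDF to a list

Why this will be faster:

1. There is only one loop, iterating the offset from 1 to the number of coordinates - 1
2. Step 4 is fully vectorized
3. Using numpy's square root and square functions ensures the best results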
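
One possible reading of the steps above, sketched for points that all live in a single DataFrame: the sample data, the cyclic shift via `np.roll`, and the `best_dist` / `best_pos` bookkeeping are assumptions added for illustration, and the sketch keeps a running minimum instead of appending each `tempDF` to a list.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data using the column names from the steps above
df = pd.DataFrame({
    "pointID": ["p0", "p1", "p2", "p3"],
    "CoordX": [0.0, 1.0, 5.0, 6.0],
    "CoordY": [0.0, 0.0, 0.0, 0.5],
})

x = df["CoordX"].to_numpy()
y = df["CoordY"].to_numpy()
n = len(df)

best_dist = np.full(n, np.inf)        # smallest distance seen so far, per row
best_pos = np.zeros(n, dtype=np.intp) # row position of that nearest neighbour

# Single loop over offsets 1 .. n-1; each pass is vectorized over all rows
for offset in range(1, n):
    # Row i is paired with row (i + offset) % n (cyclic shift)
    partner = np.roll(np.arange(n), -offset)
    dist = np.sqrt(np.square(x - x[partner]) + np.square(y - y[partner]))
    closer = dist < best_dist
    best_dist[closer] = dist[closer]
    best_pos[closer] = partner[closer]

df["nearest"] = df["pointID"].to_numpy()[best_pos]
df["Euclid Dist"] = best_dist
```

Note that this pairs points within one set. The question compares two separate sets (df_point against df_compare), so the sketch would need adapting before it applies there.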

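【Discussion】:

【Solution 3】:

Instead of finding the nearest point directly, you could try finding the nearest point in the x and y directions separately, then compare the two with the built-in min function (as in the top answer to this question) to see which is closer: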

min(myList, key=lambda x:abs(x-myNumber))

from list of integers, get number closest to a given value

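Edit: if you do it all in a single function call, your loop would end up looking like this. Also, I'm not sure whether the min function would end up looping over the compare arrays in the same time as your current code:

for k, m in zip(point_x, point_y):
    closest = min(zip(compare_x, compare_y), key=lambda p: (p[0] - k)**2 + (p[1] - m)**2)

Another option is to precompute the distance from every point in the compare array to (0,0) or some other point such as (-1000,1000), sort the compare array by that distance, and then check only the points whose distance to the reference is similar.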

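【Discussion】:

The problem here is that I need to look at the magnitude of the distance, because in some cases looking at each coordinate separately will not give the right answer.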
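
A small made-up counterexample illustrates the point in this comment: the candidate that is closest in x alone, and the one closest in y alone, can both differ from the point that is closest in actual distance. The coordinates below are hypothetical.

```python
import numpy as np

point = np.array([0.0, 0.0])
compare = np.array([
    [0.1, 10.0],   # closest in x only
    [10.0, 0.1],   # closest in y only
    [1.0, 1.0],    # closest in Euclidean distance
])

nearest_by_x = np.argmin(np.abs(compare[:, 0] - point[0]))
nearest_by_y = np.argmin(np.abs(compare[:, 1] - point[1]))
nearest_true = np.argmin(((compare - point) ** 2).sum(axis=1))
```

Here the per-coordinate searches pick rows 0 and 1, while the true nearest neighbour is row 2, so choosing between the two per-coordinate winners cannot recover it.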

【Solution 4】:

Here is an example using scipy's cdist, which is well suited to this kind of problem:

import numpy as np
from scipy.spatial.distance import cdist

point = np.array([[1, 2], [3, 5], [4, 7]])
compare = np.array([[3, 2], [8, 5], [4, 1], [2, 2], [8, 9]])

# create 3x5 distance matrix
dm = cdist(point, compare)
# get row-wise mins
mins = dm.min(axis=1)
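
To connect this with the four 1-D arrays from the question, one option is to stack each x/y pair into an (n, 2) array before calling cdist, and to use argmin rather than min so the nearest point's name can be looked up. This is a sketch under assumptions: the `names` array stands in for `df_compare['point_name'].to_numpy()`, and the coordinate values simply reuse the numbers from the example above.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Stand-ins for the question's four 1-D arrays
point_x = np.array([1.0, 3.0, 4.0])
point_y = np.array([2.0, 5.0, 7.0])
compare_x = np.array([3.0, 8.0, 4.0, 2.0, 8.0])
compare_y = np.array([2.0, 5.0, 1.0, 2.0, 9.0])
names = np.array(["a", "b", "c", "d", "e"])  # hypothetical point names

# Stack x and y into 2-D arrays of shape (n, 2), as cdist expects
point = np.column_stack([point_x, point_y])
compare = np.column_stack([compare_x, compare_y])

dm = cdist(point, compare)       # shape (3, 5), Euclidean by default
idx = dm.argmin(axis=1)          # column index of the nearest compare point per row
nearest_names = names[idx]
nearest_dist = dm[np.arange(len(idx)), idx]
```

When two compare points are equally close, argmin returns the first one.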

【Discussion】: