将重复参数传递给 Numpy 矢量化函数的最佳方法答案

【问题标题】：Best way to pass repeated parameter to a Numpy vectorized function将重复参数传递给 Numpy 矢量化函数的最佳方法
【发布时间】：2016-10-31 04:16:21
【问题描述】：

所以，继续讨论@TheBlackCat 和我在this answer 中的讨论，我想知道将参数传递给 Numpy 矢量化函数的最佳方式。有问题的函数是这样定义的：

vect_dist_funct = np.vectorize(lambda p1, p2: vincenty(p1, p2).meters)

其中，vincenty 来自 Geopy package。

我目前以这种方式致电vect_dist_funct：

def pointer(point, centroid, tree_idx):
    intersect = list(tree_idx.intersection(point))
    if len(intersect) > 0:
        points = pd.Series([point]*len(intersect)).values
        polygons = centroid.loc[intersect].values
        dist = vect_dist_funct(points, polygons)
        return pd.Series(dist, index=intercept, name='Dist').sort_values()
    else:
        return pd.Series(np.nan, index=[0], name='Dist')

points['geometry'].apply(lambda x: pointer(point=x.coords[0], centroid=line['centroid'], tree_idx=tree_idx))

（请参考这里的问题：Labelled datatypes Python）

我的问题与函数 pointer 内部发生的情况有关。我将points 转换为pandas.Series 然后获取值（在第4 行，就在if 语句下方）的原因是使其具有与多边形相同的形状。如果我只是将点称为points = [point]*len(intersect) 或points = itertools.repeat(point, len(intersect))，Numpy 会抱怨它“不能同时广播大小为 (n,2) 和大小 (n,) 的数组”（n 是 intersect 的长度）。

如果我像这样调用vect_dist_funct：dist = vect_dist_funct(itertools.repeat(points, len(intersect)), polygons)，vincenty 抱怨我传递了太多参数。我完全无法理解两者之间的区别。

请注意，这些是坐标，因此总是成对出现。以下是 point 和 polygons 的外观示例：

point = (-104.950752   39.854744) # Passed directly to the function like this.
polygons = array([(-104.21750802451864, 37.84052458697633),
                  (-105.01017084789603, 39.82012158954065),
                  (-105.03965315742742, 40.669867471420886),
                  (-104.90353460825702, 39.837631505433706),
                  (-104.8650601872832, 39.870796282334744)], dtype=object)
           # As returned by statement centroid.loc[intersect].values

在这种情况下调用vect_dist_funct 的最佳方式是什么，这样我就可以进行矢量化调用，并且 Numpy 和 vincenty 都不会抱怨我传递了错误的参数？此外，还寻求导致最小内存消耗和提高速度的技术。目标是计算点到每个多边形质心之间的距离。

【问题讨论】：

快速浏览后，我发现您的问题不清楚。无论如何，至于“不能同时广播大小 (n,2) 和大小 (n,) 的数组”：你只能广播 (n,2) 形状为 (n,1)，所以你必须做介绍一个带有[:,None] 的尾随单例到你的一维数组中，以使这样的事情起作用。
另外，请注意[point]*len(intersect) 之类的内容：您最终可能会得到一个包含对同一变量的引用的列表。如果你稍后改变这个列表，你最终可能会错误地一次改变列表的所有元素。像[p for _ in range(len(intersect))] 这样的操作通常更安全。
@AndrasDeak，问题的哪一部分不清楚？我会相应地编辑它。基本上我要做的是以最有效的方式计算一个点和一组点之间的距离。我认为 Numpy 矢量化会为我加快速度。
好的，我明白了。循环可能确实是您的最佳选择。

标签： python numpy optimization geopy

【解决方案1】：

np.vectorize 在这里并不能真正帮助您。根据documentation：

提供矢量化功能主要是为了方便，而不是为了提高性能。该实现本质上是一个 for 循环。

事实上，vectorize 会伤害您，因为它将输入转换为 numpy 数组，进行不必要且昂贵的类型转换并产生您所看到的错误。最好使用带有for 循环的函数。

使用函数而不是lambda 来作为to-level 函数也更好，因为它可以让您拥有一个文档字符串。

所以这就是我将如何实现你正在做的事情：

def vect_dist_funct(p1, p2):
    """Apply `vincenty` to `p1` and each element of `p2`.

    Iterate over `p2`, returning `vincenty` with the first argument
    as `p1` and the second as the current element of `p2`.  Returns
    a numpy array where each row is the result of the `vincenty` function
    call for the corresponding element of `p2`.
    """
    return [vincenty(p1, p2i).meters for p2i in p2]

如果您真的想使用vectorize，可以使用excluded 参数来不对p1 参数进行矢量化，或者最好设置一个包装vincenty 的lambda，并且只对第二个参数进行矢量化：

def vect_dist_funct(p1, p2):
    """Apply `vincenty` to `p1` and each element of `p2`.

    Iterate over `p2`, returning `vincenty` with the first argument
    as `p1` and the second as the current element of `p2`.  Returns
    a list where each value is the result of the `vincenty` function
    call for the corresponding element of `p2`.
    """
    vinc_p = lambda x: vincenty(p1, x)
    return np.vectorize(vinc_p)(p2)

【讨论】：

我不太明白这个问题是什么，但你的回答听起来很有道理:)
谢谢。所以for循环是最好的实现。我希望有一些平行的东西。我在 Windows 服务器上，因此使用多处理库并行执行操作会消耗大量内存。
vincenty 算法本质上是迭代的，因此即使不是完全不可能也很难有效地向量化。
实际上，这比我在科罗拉多州大约 10% 的问题中的实施慢了大约 3 分钟。我想知道为什么会这样，你有什么想法吗？我在应用中计算基于空间索引的交集，然后遍历行并计算距离。
像这样：points['intersect'] = points['coords'].apply(lambda x: list(tree_idx.intersection(x)))，然后：for i, row in points.iterrows(): points.loc[i, 'dist'] = vect_dist_funct(row['coords'], lines.loc[row['intersect'], 'centroid']