Speeding up a nested for loop through two Pandas DataFrames: answers

【Question title】: Speeding up a nested for loop through two Pandas DataFrames
【Posted】: 2018-03-16 07:39:54
【Question】:

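I have latitudes and longitudes stored in a pandas dataframe (df), with NaN placeholders for stop_id, stoplat and stoplon. Another dataframe, areadf, contains more lats/lons and an arbitrary id; this is the information to be filled into df.

I'm trying to connect the two so that the stop columns in df contain information about the stop closest to that lat/lon point, or stay NaN if there is no stop within a radius R of the point.

Right now my code is as follows, but it takes a really long time (> 40 minutes for what I'm running at the moment, and that was before changing area to a df and using itertuples; I'm not sure how much difference that will make). Each set of data has thousands of lat/lon points and stops, which is a problem because I need to run this on multiple files. I'm looking for suggestions to make it run faster. I've already made some very minor improvements (e.g. moving to a dataframe, using itertuples instead of iterrows, defining lats and lons outside of the loop to avoid retrieving them from df on every iteration), but I'm out of ideas for speeding it up. getDistance uses the Haversine formula as defined below to get the distance between the stop and the given lat/lon point.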

import pandas as pd
from math import cos, asin, sqrt

def getDistance(lat1, lon1, lat2, lon2):
    # Haversine distance between two points given in decimal degrees
    p = 0.017453292519943295     # Pi/180
    a = (0.5 - cos((lat2 - lat1) * p)/2 + cos(lat1 * p) *
         cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
    # 12742 km is the Earth's diameter; the factor of 100 is as posted
    return 12742 * asin(sqrt(a)) * 100

R = 5
lats = df['lat']
lons = df['lon']
# itertuples() yields (index, col1, col2, ...); the indexing below
# assumes stop_id is the index of areadf
for stop in areadf.itertuples():
    for index in df.index:
        if getDistance(lats[index], lons[index],
                       stop[1], stop[2]) < R:
            df.at[index, 'stop_id'] = stop[0] # id
            df.at[index, 'stoplat'] = stop[1] # lat
            df.at[index, 'stoplon'] = stop[2] # lon
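
For reference, the Haversine formula that getDistance implements can be written as below. The symbols are notation chosen here, not taken from the post: φ is latitude and λ is longitude, both in radians, and r is the Earth's radius.

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \, \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2r \arcsin\sqrt{a}
```

Since \sin^2(x/2) = (1 - \cos x)/2, the first term equals 0.5 - cos(Δφ)/2 and the last factor equals (1 - cos(Δλ))/2, which is the form used in the code; 12742 is 2r in kilometres.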

Sample data:

df
lat        lon         stop_id    stoplat    stoplon
43.657676  -79.380146  NaN        NaN        NaN
43.694324  -79.334555  NaN        NaN        NaN

areadf
stop_id    stoplat    stoplon
0          43.657675  -79.380145
1          45.435143  -90.543253

Expected:

df
lat        lon         stop_id    stoplat    stoplon
43.657676  -79.380146  0          43.657675  -79.380145
43.694324  -79.334555  NaN        NaN        NaN

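【Comments on the question】:

You could use pypy instead of cython; pypy compiles to C to speed up loops in Python.
1. Don't iterate over dataframes like that, take advantage of pandas. 2. Use Euclidean distance as a first pass and pull out a few of the closest points, as it is cheaper than Haversine. 3. Subset your data into lat/lon grids, where nothing in grid x and its 8 surrounding cells is within R of anything in grid y, and run on the subsetted stops vs. points.
@jeremycg Are there any functions you'd suggest I look into to take better advantage of pandas? Thanks for the reply!
Here's a video from this year's pycon where the presenter optimises almost this exact function in pandas - youtube.com/watch?v=HN5d490_KKk code is here - github.com/sversh/pycon2017-optimizing-pandas
@jeremycg This was really helpful and interesting, thank you :)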
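
The "cheap first pass" idea from the comments could look roughly like the sketch below. It is not code from the thread: the helper name `candidate_stops` and the constant of 110 km per degree are assumptions chosen for illustration (the constant is slightly low on purpose so the box errs on the large side), and the sketch ignores wrap-around at the ±180° meridian. It keeps only the stops inside a lat/lon box around one point, so the exact Haversine distance only needs to be computed for those.

```python
import numpy as np

KM_PER_DEG = 110.0  # assumed, deliberately low estimate of km per degree of latitude

def candidate_stops(lat, lon, stop_lats, stop_lons, radius_km):
    """Return positions of stops inside a lat/lon box covering radius_km."""
    dlat = radius_km / KM_PER_DEG
    # A degree of longitude shrinks with cos(latitude); clamp near the poles.
    shrink = max(np.cos(np.radians(abs(lat) + dlat)), 1e-6)
    dlon = radius_km / (KM_PER_DEG * shrink)
    mask = (np.abs(stop_lats - lat) <= dlat) & (np.abs(stop_lons - lon) <= dlon)
    return np.flatnonzero(mask)

# Sample data from the question
stop_lats = np.array([43.657675, 45.435143])
stop_lons = np.array([-79.380145, -90.543253])
near_first = candidate_stops(43.657676, -79.380146, stop_lats, stop_lons, 1.0)
near_second = candidate_stops(43.694324, -79.334555, stop_lats, stop_lons, 1.0)
```

Stops that survive the box test still need the exact distance check, because the corners of the box lie outside the radius.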

Tags: python performance pandas nested nested-loops

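【Solution 1】:

One approach is to use the numpy haversine function from here, with a slight modification so that it takes the desired radius into account.

Then use apply to go over df row by row and find the closest stop within the given radius.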

import numpy as np

def haversine_np(lon1, lat1, lon2, lat2, R):
    """
    Calculate the great circle distance between one point and an
    array of points on the earth (specified in decimal degrees).
    Return the position of the closest point in (lon2, lat2) if it
    lies within R km, otherwise -1.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    if km.min() <= R:
        return km.argmin()
    else:
        return -1

# The last argument is R; 1 means a radius of 1 km
df['dex'] = df[['lat','lon']].apply(lambda row: haversine_np(row['lon'],row['lat'],areadf.stoplon.values,areadf.stoplat.values,1),axis=1)

Then merge the two dataframes.

df.merge(areadf,how='left',left_on='dex',right_index=True).drop('dex',axis=1)

         lat        lon  stop_id    stoplat    stoplon
0  43.657676 -79.380146      0.0  43.657675 -79.380145
1  43.694324 -79.334555      NaN        NaN        NaN
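
With a few thousand points and stops, the per-row apply can also be replaced by a single broadcast NumPy computation over the full points × stops distance matrix. The following is a sketch of that variant, not part of the original answer: `nearest_within` is a hypothetical helper name, and it assumes the whole matrix fits in memory. It reuses the same formula and the 6367 km radius from haversine_np.

```python
import numpy as np
import pandas as pd

def nearest_within(lat, lon, stop_lat, stop_lon, radius_km):
    """Position of the nearest stop for each point, or -1 if none is within radius_km."""
    # Points as a column, stops as a row, so arithmetic broadcasts
    # to an (n_points, n_stops) matrix.
    lat1 = np.radians(np.asarray(lat, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(stop_lat, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(stop_lon, dtype=float))[None, :]

    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    km = 6367 * 2 * np.arcsin(np.sqrt(a))

    nearest = km.argmin(axis=1)                        # closest stop per point
    within = km[np.arange(km.shape[0]), nearest] <= radius_km
    return np.where(within, nearest, -1)

# Sample data from the question
df = pd.DataFrame({'lat': [43.657676, 43.694324], 'lon': [-79.380146, -79.334555]})
areadf = pd.DataFrame({'stop_id': [0, 1],
                       'stoplat': [43.657675, 45.435143],
                       'stoplon': [-79.380145, -90.543253]})

df['dex'] = nearest_within(df['lat'], df['lon'], areadf['stoplat'], areadf['stoplon'], 1)
result = df.merge(areadf, how='left', left_on='dex', right_index=True).drop('dex', axis=1)
```

The merge step is the same as in the answer; only the way dex is computed changes.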

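Note: if you follow this approach, make sure the indexes of both dataframes have been reset, i.e. that they run sequentially from 0 up to the length of the dataframe minus one. The value stored in dex is a position in areadf, and the merge looks it up as an index label. So be sure to reset the indexes before running it.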

df.reset_index(drop=True,inplace=True)
areadf.reset_index(drop=True,inplace=True)
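
To see why the reset matters, here is a small sketch with made-up frames (the index labels 10 and 11 are hypothetical, e.g. left over from earlier filtering). argmin returns positions 0, 1, 2, ..., while merge with right_index=True matches on index labels, so the lookup silently finds nothing unless the two agree.

```python
import pandas as pd

# 'dex' holds a position, as argmin would return it.
df = pd.DataFrame({'lat': [43.657676], 'lon': [-79.380146], 'dex': [0]})
areadf = pd.DataFrame(
    {'stop_id': [0, 1],
     'stoplat': [43.657675, 45.435143],
     'stoplon': [-79.380145, -90.543253]},
    index=[10, 11],  # hypothetical non-sequential labels
)

# No label equals 0, so the left join leaves the stop columns as NaN.
bad = df.merge(areadf, how='left', left_on='dex', right_index=True)

# After the reset, labels coincide with positions and the stop is found.
good = df.merge(areadf.reset_index(drop=True), how='left',
                left_on='dex', right_index=True)
```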

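【Comments】:

This worked great for speeding up the algorithm! It went from hours to seconds, and it is also a few seconds faster than the approach I implemented using the pycon optimizations mentioned above. Thank you very much!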