Improved efficiency versus iterating over two large Pandas DataFrames (answers)

【Question title】: Improved efficiency versus iterating over two large Pandas Dataframes
【Posted】: 2019-04-28 13:36:44
【Question】:

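I have two HUGE Pandas DataFrames with location-based values, and I need to update df1['count'] with the number of records from df2 that are less than 1000 m from each point in df1.

Here's a sample of my data as imported into Pandas: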

df1 =       lat      long    valA   count
        0   123.456  986.54  1      0
        1   223.456  886.54  2      0
        2   323.456  786.54  3      0
        3   423.456  686.54  2      0
        4   523.456  586.54  1      0

df2 =       lat      long    valB
        0   123.456  986.54  1
        1   223.456  886.54  2
        2   323.456  786.54  3
        3   423.456  686.54  2
        4   523.456  586.54  1

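In reality, df1 has about 10 million rows and df2 has about 1 million rows.

I created a working nested FOR loop using the Pandas DF.itertuples() method. It works fine for smaller test data sets (df1 = 1k rows and df2 = 100 rows takes about an hour to complete), but the full data set is orders of magnitude larger and, by my calculations, would take years to complete. Here's my working code...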

import pandas as pd
import geopy.distance as gpd

file1 = 'C:\\path\\file1.csv'    
file2 = 'C:\\path\\file2.csv' 

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)

df1.sort_values(['long', 'lat'], inplace=True)
df2.sort_values(['long', 'lat'], inplace=True)

for irow in df1.itertuples():    
     count = 0
     indexLst = []        
     Location1 = (irow[1], irow[2])    

     for jrow in df2.itertuples():  
          Location2 = (jrow[1], jrow[2])                                      
          if gpd.distance(Location1, Location2).kilometers < 1:
             count += 1
             indexLst.append(jrow[0])    
     if count > 0:                  #only update DF if a match is found
         df1.at[irow[0],'count'] = (count)      
         df2.drop(indexLst, inplace=True)       #drop rows already counted from df2 to speed up next iteration

# save updated df1 to new csv file
outFileName = 'combined.csv'
df1.to_csv(outFileName, sep=',', index=False)

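Each point in df2 only needs to be counted once, since the points in df1 are evenly spaced. To that end, I added a drop statement to remove rows from df2 once they have been counted, in the hope of improving iteration time. I also initially tried creating a merge/join statement instead of the nested loops, but was unsuccessful.

At this stage, any help on improving efficiency is greatly appreciated!

Edit: The goal is to update the 'count' column in df1 (as shown below) with the number of points from df2 that are less than 1 km away: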

df1 =       lat      long    valA   count
        0   123.456  986.54  1      3
        1   223.456  886.54  2      1
        2   323.456  786.54  3      9
        3   423.456  686.54  2      2
        4   523.456  586.54  1      5

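【Comments】:

Welcome @dP8884. To clarify, I understand the intent of this code is to take a lat/long pair from df1 and then add to a counter the number of lat/long points in df2 that are less than 1 km away? So in the end you would have the lat/long in df1 with the count column updated to the number of points it found in df2 within 1 km?
Yes, that's correct. I'll update my question to reflect what the expected output should look like. Thanks.
There should be a way to do some kind of filtering based on lat/long combinations that you know are out of range (i.e., lat or long more than one degree apart), but I don't know the best way to do that in your case.

Tags: python pandas performance loops dataframe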

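【Solution 1】:

Having done this kind of thing frequently, I've found a few best practices:

1) Use numpy and numba as much as possible

2) Leverage parallelization as much as possible

3) Skip loops in favor of vectorized code (we do use a loop with numba here, to leverage parallelization).

In this particular case, I want to point out the slowdown introduced by geopy. While it is a great package and produces very accurate distances (compared to the Haversine method), it is much slower (I have not looked into the implementation to find out why).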

import numpy as np
from geopy import distance

origin = (np.random.uniform(-90,90), np.random.uniform(-180,180))
dest = (np.random.uniform(-90,90), np.random.uniform(-180,180))

%timeit distance.distance(origin, dest)

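216 µs ± 363 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

At that rate, computing 10 million x 1 million distances would take approximately 2,160,000,000 seconds, or 600,000 hours. Even parallelism will only help so much.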
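The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation; the variable names are illustrative and the 216 µs figure is the timing quoted above.

```python
# Time per geopy distance call, taken from the %timeit result above (seconds)
per_call_s = 216e-6

# Every row of df1 is compared against every row of df2
n_pairs = 10_000_000 * 1_000_000

total_s = per_call_s * n_pairs   # total single-threaded runtime in seconds
total_h = total_s / 3600         # the same runtime in hours

print(f"{total_s:,.0f} s, about {total_h:,.0f} h")
```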

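Since you are only interested in points that are very close together, I'd suggest using the Haversine distance (which is less accurate at greater distances).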
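For reference, this is the standard Haversine formula, with latitudes \varphi and longitudes \lambda in radians and R the Earth's radius. The symbol a is introduced here only for readability; it corresponds to the intermediate variable `d` in the function that follows.

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \,\cos\varphi_2 \,
      \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
\text{distance} = 2R \,\arcsin\!\left(\sqrt{a}\right)
```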

from numba import jit, prange, vectorize

@vectorize
def haversine(s_lat,s_lng,e_lat,e_lng):

    # approximate radius of earth in km
    R = 6373.0

    s_lat = s_lat*np.pi/180.0                      
    s_lng = np.deg2rad(s_lng)     
    e_lat = np.deg2rad(e_lat)                       
    e_lng = np.deg2rad(e_lng)  

    d = np.sin((e_lat - s_lat)/2)**2 + np.cos(s_lat)*np.cos(e_lat) * np.sin((e_lng - s_lng)/2)**2

    return 2 * R * np.arcsin(np.sqrt(d))

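%timeit haversine(origin[0], origin[1], dest[0], dest[1])

1.85 µs ± 53.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

That is already roughly a 100-fold improvement. But we can do better. You may have noticed the @vectorize decorator I added from numba. It allows the previously scalar Haversine function to become vectorized and take vectors as inputs. We'll leverage this in the next step: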

@jit(nopython=True, parallel=True)
def get_nearby_count(coords, coords2, max_dist):
    '''
    Input: `coords`: List of coordinates, lat-lngs in an n x 2 array
           `coords2`: Second list of coordinates, lat-lngs in an k x 2 array
           `max_dist`: Max distance to be considered nearby
    Output: Array of length n with a count of coords nearby coords2
    '''
    # initialize
    n = coords.shape[0]
    k = coords2.shape[0]
    output = np.zeros(n)

    # prange is a parallel loop when operations are independent
    for i in prange(n):
        # comparing a point in coords to the arrays in coords2
        x, y = coords[i]
        # returns an array of length k
        dist = haversine(x, y, coords2[:,0], coords2[:,1])
        # sum the boolean of distances less than the max allowable
        output[i] = np.sum(dist < max_dist)

    return output

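You should now have an array whose length equals that of your first set of coordinates (10 million in your case). You can then simply assign it to your data frame as your count!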
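As a minimal sketch of that last assignment step: the example below uses a plain NumPy brute-force counter, `nearby_count_numpy`, as a hypothetical stand-in for `get_nearby_count` so that it runs without numba. The tiny data frames are made-up illustration data, and the column names `lat`, `long` and `count` are taken from the question.

```python
import numpy as np
import pandas as pd

def nearby_count_numpy(coords, coords2, max_dist, R=6373.0):
    """Brute-force stand-in for get_nearby_count (plain NumPy, no numba).

    Builds the full n x k distance matrix, so it only suits small inputs.
    """
    lat1 = np.deg2rad(coords[:, 0])[:, None]
    lng1 = np.deg2rad(coords[:, 1])[:, None]
    lat2 = np.deg2rad(coords2[:, 0])[None, :]
    lng2 = np.deg2rad(coords2[:, 1])[None, :]
    d = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    dist = 2 * R * np.arcsin(np.sqrt(d))
    # count, per row of coords, how many coords2 points are within max_dist
    return (dist < max_dist).sum(axis=1)

# Made-up illustration data (not the asker's data)
df1 = pd.DataFrame({"lat": [10.0, 20.0], "long": [30.0, 40.0], "count": 0})
df2 = pd.DataFrame({"lat": [10.001, 10.002, 25.0], "long": [30.0, 30.001, 40.0]})

counts = nearby_count_numpy(df1[["lat", "long"]].to_numpy(),
                            df2[["lat", "long"]].to_numpy(),
                            1.0)
df1["count"] = counts.astype(int)  # one count per row of df1, in row order
print(df1)
```

The same assignment should work with the array returned by `get_nearby_count`, since it also has one entry per row of the first coordinate array.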

Timing test with 100,000 x 10,000 points:

n = 100_000
k = 10_000

coords1 = np.zeros((n, 2))
coords2 = np.zeros((k, 2))

coords1[:,0] = np.random.uniform(-90, 90, n)
coords1[:,1] = np.random.uniform(-180, 180, n)
coords2[:,0] = np.random.uniform(-90, 90, k)
coords2[:,1] = np.random.uniform(-180, 180, k)

%timeit get_nearby_count(coords1, coords2, 1.0)

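2.45 s ± 73.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Unfortunately, that still means you are looking at something upwards of 20,000 seconds for the full data set. And this was on a machine with 80 cores (utilizing 76ish, based on top usage).

That's the best I can do for now. Good luck (also, this is my first post, so thanks for inspiring me to contribute!)

PS: You might also look into Dask arrays and the function map_blocks() to parallelize this function (instead of relying on prange). How you partition the data may influence total execution time.

PPS: 1,000,000 x 100,000 (100x smaller than your full set) took 3 min 27 s (207 seconds), so the scaling appears to be linear and a bit forgiving.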
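To make the scaling claim concrete, the two benchmark timings quoted above can be extrapolated linearly in the number of distance pairs. This is only back-of-the-envelope arithmetic; `extrapolate_seconds` is a hypothetical helper, and real runtimes will depend on the hardware and the data.

```python
def extrapolate_seconds(pairs, seconds, target_pairs):
    """Scale a measured runtime linearly with the number of distance pairs."""
    return seconds * (target_pairs / pairs)

full_pairs = 10_000_000 * 1_000_000  # df1 rows x df2 rows

# (pairs evaluated, measured seconds) for the two benchmarks quoted above
small = (100_000 * 10_000, 2.45)
large = (1_000_000 * 100_000, 207.0)

est_from_small = extrapolate_seconds(*small, full_pairs)
est_from_large = extrapolate_seconds(*large, full_pairs)

print(f"from 100k x 10k: {est_from_small:,.0f} s ({est_from_small / 3600:.1f} h)")
print(f"from 1M x 100k:  {est_from_large:,.0f} s ({est_from_large / 3600:.1f} h)")
```

Both extrapolations land in the same range of roughly six to seven hours, which is consistent with the near-linear scaling observed.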

PPPS: An implementation with a simple latitude difference filter:

@jit(nopython=True, parallel=True)
def get_nearby_count_vlat(coords, coords2, max_dist):
    '''
    Input: `coords`: List of coordinates, lat-lngs in an n x 2 array
           `coords2`: List of port coordinates, lat-lngs in an k x 2 array
           `max_dist`: Max distance to be considered nearby
    Output: Array of length n with a count of coords nearby coords2
    '''
    # initialize
    n = coords.shape[0]
    k = coords2.shape[0]
    coords2_abs = np.abs(coords2)
    output = np.zeros(n)

    # prange is a parallel loop when operations are independent
    for i in prange(n):
        # comparing a point in coords to the arrays in coords2
        point = coords[i]
        # subsetting coords2 to reduce haversine calc time. Value .02 is from playing with Gmaps and will need to change for max_dist > 1.0
        coords2_filtered = coords2[np.abs(point[0] - coords2[:,0]) < .02]
        # in case of no matches
        if coords2_filtered.shape[0] == 0: continue
        # returns an array with one distance per filtered point
        dist = haversine(point[0], point[1], coords2_filtered[:,0], coords2_filtered[:,1])
        # sum the boolean of distances less than the max allowable
        output[i] = np.sum(dist < max_dist)

    return output
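The 0.02 threshold in the function above was picked by hand. As a rough sketch of how it could be derived instead: the great-circle distance between two points is never smaller than their latitude difference times the Earth's radius, so a latitude band of `max_dist / R` radians cannot exclude a true match. The helper names below are hypothetical, and the second helper assumes the latitudes of the second data set are sorted, as the question's `sort_values` call would allow.

```python
import math
from bisect import bisect_left, bisect_right

EARTH_RADIUS_KM = 6373.0  # same approximate radius as the haversine function above

def lat_band_degrees(max_dist_km, radius_km=EARTH_RADIUS_KM):
    """Latitude difference (in degrees) covered by max_dist_km along a meridian."""
    return math.degrees(max_dist_km / radius_km)

def lat_window(sorted_lats, lat, band_deg):
    """Index range [lo, hi) of sorted_lats lying within band_deg of lat."""
    lo = bisect_left(sorted_lats, lat - band_deg)
    hi = bisect_right(sorted_lats, lat + band_deg)
    return lo, hi

band = lat_band_degrees(1.0)  # about 0.009 degrees for a 1 km radius

# Made-up sorted latitudes for illustration
sorted_lats = [9.98, 9.995, 10.0, 10.004, 10.02]
lo, hi = lat_window(sorted_lats, 10.0, band)
candidates = sorted_lats[lo:hi]  # only these need a Haversine evaluation
print(band, candidates)
```

For a 1 km radius the band is about 0.009 degrees, so 0.02 is a safe but roughly twice-too-wide filter. With sorted latitudes, a binary search replaces the full boolean mask over the second data set.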

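【Comments】:

Thank you! This is great and well explained. Let me digest it some more and implement it on a sample of my data, and I'll let you know how it turns out. Also, this was my first post as well, so thanks for the feedback, as it seems to have given me enough privileges to start voting (yours being the first, of course). :)
Eliot K brought up a good point about speeding things up by reducing the search space, but geographic coordinates give me a headache. I think I found a way to quickly filter the results, but only on latitude. I quickly tested it on my laptop (quad core). My original method took 70 seconds (100k coordinates x 50k coordinates), while the quick latitude distance filter cut that to 2.27 seconds. That is with completely random coordinates. You can definitely improve the filtering, especially with a sorted df2. I added the change to the code above. Filtering on longitude did not seem worth it (it is more costly compared to the Haversine calculation).
Thanks ernestk and Eliot K. This seems to have taken my process from approximately 90 years down to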

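【Solution 2】:

I recently did something similar, though not with lat/long, and I only had to find the nearest point and its distance. For this I used scipy.spatial.cKDTree. It was quite fast.

I think in your case you could use the query_ball_point() function.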

from scipy import spatial
import pandas as pd

file1 = 'C:\\path\\file1.csv'    
file2 = 'C:\\path\\file2.csv' 

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
# Build the index
tree = spatial.cKDTree(df1[['long', 'lat']])
# Then query the index

You should give it a try.
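The snippet above stops before the query step. The following is one possible way to finish the idea, not necessarily what the answerer had in mind. A KD-tree built directly on degrees measures Euclidean distance in degrees rather than kilometres, so this sketch makes two assumptions of its own: it converts lat/long to 3D unit vectors and turns the search radius into a chord length, and it builds the tree on the points being counted (df2 in the question) and queries it with the df1 points. `to_unit_xyz` and `count_within` are hypothetical helper names, and `return_length=True` requires SciPy 1.3 or newer.

```python
import numpy as np
from scipy.spatial import cKDTree

EARTH_RADIUS_KM = 6373.0  # same approximate radius as in Solution 1

def to_unit_xyz(lat_deg, lon_deg):
    """Convert lat/long in degrees to points on the unit sphere."""
    lat = np.deg2rad(np.asarray(lat_deg, dtype=float))
    lon = np.deg2rad(np.asarray(lon_deg, dtype=float))
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

def count_within(lat1, lon1, lat2, lon2, max_dist_km):
    """For each (lat1, lon1) point, count (lat2, lon2) points within max_dist_km."""
    tree = cKDTree(to_unit_xyz(lat2, lon2))
    # A great-circle distance d corresponds to a straight-line chord
    # of length 2*sin(d / (2R)) on the unit sphere.
    chord = 2.0 * np.sin(max_dist_km / (2.0 * EARTH_RADIUS_KM))
    return tree.query_ball_point(to_unit_xyz(lat1, lon1), r=chord,
                                 return_length=True)

# Made-up illustration data: two query points and three points to count
counts = count_within([0.0, 10.0], [0.0, 10.0],
                      [0.0, 0.0, 0.001], [0.005, 0.02, 0.0],
                      1.0)
print(counts)
```

With real data, the four coordinate arguments would be the `lat` and `long` columns of df1 and df2, and the result could be assigned to df1['count'].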

【Comments】: