【Question title】: How could I optimise this Python loop?
【Posted】: 2020-03-12 15:30:16
【Question】:

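I am running this code on a large CSV file (1.5 million rows). Is there a way to optimise it?

df is a pandas DataFrame. I take a row and want to know which of the following happens first within the next 1000 rows:

I find a value above my value + 0.0004, or I find a value at or below my value - 0.0004.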

result = []
for row in range(len(df) - 1000):
    start = df.at[row, 'A']          # scalar lookup by label
    win = start + 0.0004
    lose = start - 0.0004
    for n in range(1000):            # scan the next 1000 rows, current row included
        ref = df.at[row + n, 'B']
        if ref > win:
            result.append(1)
            break
        elif ref <= lose:
            result.append(-1)
            break
        elif n == 999:               # neither threshold reached in the window
            result.append(0)

The DataFrame looks like this:

         timestamp           A         B
0   20190401 00:00:00.127  1.12230  1.12236
1   20190401 00:00:00.395  1.12230  1.12237
2   20190401 00:00:00.533  1.12229  1.12234
3   20190401 00:00:00.631  1.12228  1.12233
4   20190401 00:00:01.019  1.12230  1.12234
5   20190401 00:00:01.169  1.12231  1.12236

The result looks like: result = [0, 0, 1, 0, 0, 1, -1, 1, …]

This works, but it takes a very long time on such a large file.

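【Comments】:

Can you share the expected output?
Please post a sample DataFrame together with the desired output.
@Cleb: I added a sample DataFrame; the output is a list containing the values 1, -1 or 0.
So if B is more than 0.004 above A you want to append 1 to the list, if it is more than 0.004 below then -1, otherwise 0?
@Datanovice: I take the value A of a given row and want to know which case happens first within the following 1000 rows: - I find a value in B > A+0.0004 => I return 1 - or I find a value in B < A-0.0004 => I return -1 - I find nothing within the 1000 rows => I return 0

Tags: python pandas performance for-loop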

【Solution 1】:

To generate the value for the "first outlier", define the following function:

import numpy as np

def firstOutlier(row, dltRow = 4, dltVal = 0.1):
    ''' Find the value for the first "outlier". Parameters:
    row    - the current row
    dltRow - number of rows to check, starting from the current
    dltVal - delta in value of "B", compared to "A" in the current row
    '''
    rowInd = row.name                        # Index of the current row
    df2 = df.iloc[rowInd : rowInd + dltRow]  # "dltRow" rows from the current
    outliers = df2[abs(df2.B - row.A) >= dltVal]
    if outliers.index.size == 0:  # No outliers within the range of rows
        return 0
    return int(np.sign(outliers.iloc[0].B - row.A))

Then apply it to each row:

df.apply(firstOutlier, axis=1)

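This function relies on the DataFrame's index containing consecutive integers starting from 0. With such an index, a row with index value ind can be accessed as df.iloc[ind], and a slice of n rows starting at that row as df.iloc[ind : ind + n].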
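The index assumption can be checked and restored before calling apply. The following is a minimal sketch; the small frame and its non-default index are made up for illustration and are not part of the answer.

```python
import pandas as pd

# Hypothetical frame whose index is NOT 0..n-1 (for example after filtering rows).
df = pd.DataFrame(
    {"A": [1.00, 0.99, 1.00, 1.00], "B": [1.00, 1.00, 0.80, 1.05]},
    index=[10, 11, 15, 16],
)

# Here row.name would be 10, 11, 15 or 16, so df.iloc[row.name : ...]
# would select the wrong rows (or none at all).
has_default_index = df.index.equals(pd.RangeIndex(len(df)))

# Renumber the rows 0..n-1 so that label and position coincide again.
df = df.reset_index(drop=True)
first_two = df.iloc[0:2]  # slice of 2 rows starting at position 0
```

After reset_index(drop=True) the old labels are discarded; keep them in a column first (drop=False) if they are still needed.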

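For my test, I set the default values of the parameters to:

dltRow = 4 - look at 4 rows, starting from the current one,
dltVal = 0.1 - look for rows whose B is at a "distance" of 0.1 or more from A in the current row.

My test DataFrame is: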

      A     B
0  1.00  1.00
1  0.99  1.00
2  1.00  0.80
3  1.00  1.05
4  1.00  1.20
5  1.00  1.00
6  1.00  0.80
7  1.00  1.00
8  1.00  1.00

The result (for my data and the default parameter values) is:

0   -1
1   -1
2   -1
3    1
4    1
5   -1
6   -1
7    0
8    0
dtype: int64

For your case, change the default values of the parameters to 1000 and 0.0004 respectively.
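An alternative to editing the defaults is to pass the values through apply, which forwards extra keyword arguments to the function. The sketch below uses a variant named first_outlier that takes the frame as an explicit argument instead of reading a global df; that name and the parameter names frame, dlt_row and dlt_val are assumptions for illustration, not the answer's exact function. The data is the test frame shown above.

```python
import numpy as np
import pandas as pd


def first_outlier(row, frame, dlt_row, dlt_val):
    """Sign of the first B within dlt_row rows that is at least dlt_val away from row.A."""
    start = row.name                                  # assumes a 0..n-1 index
    window = frame["B"].iloc[start:start + dlt_row]   # current row included
    diff = window.to_numpy() - row["A"]
    hits = np.flatnonzero(np.abs(diff) >= dlt_val)
    if hits.size == 0:                                # nothing found in the window
        return 0
    return int(np.sign(diff[hits[0]]))


df = pd.DataFrame({
    "A": [1.00, 0.99, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
    "B": [1.00, 1.00, 0.80, 1.05, 1.20, 1.00, 0.80, 1.00, 1.00],
})

# Extra keyword arguments are forwarded by apply to first_outlier.
small = df.apply(first_outlier, axis=1, frame=df, dlt_row=4, dlt_val=0.1)

# The same call with the values from the question.
full = df.apply(first_outlier, axis=1, frame=df, dlt_row=1000, dlt_val=0.0004)
```

With dlt_row=4 and dlt_val=0.1 this reproduces the result column listed above. Note that, like the answer's function, it treats a distance of exactly dlt_val as a hit in both directions, whereas the loop in the question uses a strict comparison on the upper side.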

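【Comments】:

I think you have to replace dlt with dltVal in "outliers = df2[abs(df2.B - row.A) >= dlt]". Thank you, I am testing your solution.
Thanks. On 10 000 rows your code takes 8.5 seconds and mine takes 78.2 seconds. Processing 1.5 million rows still takes a long time, but this is a big improvement!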

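【Solution 2】:

The idea is to loop over A and B while maintaining a sorted list of A values. Then, for each B, find the highest A that wins and the lowest A that loses. Because the list is sorted, each search is O(log(n)). Only those A values whose index lies within the last 1000 rows are used to set the result vector. Afterwards, the A values that are no longer waiting for a B are removed from the sorted list to keep it small.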

import numpy as np
import bisect
import time

N = 10
M = 3
#N=int(1e6)
#M=int(1e3)
thresh = 0.4

A = np.random.rand(N)
B = np.random.rand(N)
result = np.zeros(N)

l = []

t_start = time.time()

for i in range(N):
    a = (A[i],i)
    bisect.insort(l,a)
    b = B[i]
    firstLoseInd = bisect.bisect_left(l,(b+thresh,-1))
    lastWinInd = bisect.bisect_right(l,(b-thresh,-1))
    for j in range(lastWinInd):
        curInd = l[j][1]
        if curInd > i-M:
            result[curInd] = 1
    for j in range(firstLoseInd,len(l)):
        curInd = l[j][1]
        if curInd > i-M:
            result[curInd] = -1
    del l[firstLoseInd:]
    del l[:lastWinInd]

t_done = time.time()

print(A)
print(B)
print(result)
print(t_done - t_start)

Here is a sample output:

[ 0.22643589  0.96092354  0.30098532  0.15569044  0.88474775  0.25458535
  0.78248271  0.07530432  0.3460113   0.0785128 ]
[ 0.83610433  0.33384085  0.51055061  0.54209458  0.13556121  0.61257179
  0.51273686  0.54850825  0.24302884  0.68037965]
[ 1. -1.  0.  1. -1.  0. -1.  1.  0.  1.]

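For N = int(1e6) and M = int(1e3), it took about 3.4 seconds on my computer.

【Comments】:

A for loop is probably not advisable when a vectorised solution is available. For loops are usually a last resort.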
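The comment does not say which vectorised solution is meant. One possible sketch, not taken from the thread, builds sliding windows over B with NumPy and compares them against A in chunks so that memory stays bounded. The function name first_hit and the chunk parameter are assumptions, and sliding_window_view requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def first_hit(a, b, window=1000, delta=0.0004, chunk=10_000):
    """For each row i, return 1 if b[i:i+window] first exceeds a[i] + delta,
    -1 if it first drops to a[i] - delta or below, and 0 if neither happens."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(a) - window                      # same row range as the question's loop
    out = np.zeros(max(n, 0), dtype=np.int8)
    if n <= 0:
        return out
    views = sliding_window_view(b, window)   # no copy; row i is b[i:i+window]
    for lo in range(0, n, chunk):            # loop over chunks, not over rows
        hi = min(lo + chunk, n)
        w = views[lo:hi]
        start = a[lo:hi, None]
        win = w > start + delta
        lose = w <= start - delta
        # Position of the first hit in each window, or `window` if there is none.
        first_win = np.where(win.any(axis=1), win.argmax(axis=1), window)
        first_lose = np.where(lose.any(axis=1), lose.argmax(axis=1), window)
        out[lo:hi] = np.where(first_win < first_lose, 1,
                              np.where(first_lose < first_win, -1, 0))
    return out
```

For the DataFrame in the question it could be called as first_hit(df["A"].to_numpy(), df["B"].to_numpy()). The chunk size trades memory for speed: each chunk materialises two boolean arrays of chunk × window elements.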