Python / Pandas for loop.. Why is mine so slow? (Answers)

【Question title】: Python / Pandas for loop.. Why is mine so slow?
【Posted】: 2020-09-21 20:06:03
【Question description】:

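I'm new to Python, the Jupyter environment, and pandas.

I have dabbled a little in MATLAB, which is why I started learning Jupyter, Python, and its pandas library.

I've managed to set up a fairly large DataFrame (1.7 million x 9) for some data processing.

I need to detect a fault if an error persists for 150 ms (150 rows). In MATLAB I can do this reasonably quickly, in a few seconds, but in Python the loop below may take an hour or more. I haven't had the patience to let it finish, which suggests to me that something is wrong with my code?

I want the output to be the same data I put in, plus one extra column that is 0 by default and 1 if TrqSpdQuadrant != UdUq_IqRsQuadrant for 150 rows.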

PosSpdData['Fault'] = 0
pd.options.mode.chained_assignment = None # The error for rewriting a column was annoying. <- why isn't this correct?
cnt = 0
for i in range(1, len(PosSpdData['Error_IqRs'])):       
    if (PosSpdData['Error_IqRs'].values[i] == 0):  
        cnt += 1
        if cnt > 150:
            PosSpdData['Fault'][i] = 1  
        else:
            PosSpdData['Fault'][i] = 0
    else:
        cnt = 0
        PosSpdData['Fault'][i] = 0

DemandedTorque  Speed   Ud  Uq  Iq  TrqSpdQuadrant  Uq_IqRs     UdUqQuadrant    UdUq_IqRsQuadrant   Error   Error_IqRs
0   0.0     0.0     0.00000     0.00000     0.0000  0   0.000000    0   0   0   0
1   0.0     0.0     0.00000     0.00000     0.0000  0   0.000000    0   0   0   0
2   0.0     0.0     0.00000     0.00000     0.0000  0   0.000000    0   0   0   0
3   0.0     0.0     0.00000     0.00000     0.0000  0   0.000000    0   0   0   0
4   0.0     0.0     0.00000     0.00000     0.0000  0   0.000000    0   0   0   0
...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
30302   270.0   847.0   -25.40625   30.75000    461.0625    1   17.162577   1   1   0   0
30303   270.0   847.0   -25.40625   30.75000    463.1875    1   17.099954   1   1   0   0
30304   270.0   847.0   -25.40625   30.75000    463.1875    1   17.099954   1   1   0   0
30305   270.0   847.0   -25.93750   30.75000    463.1875    1   17.099954   1   1   0   0
30306   270.0   847.0   -25.93750   29.34375    463.1875    1   15.693704   1   1   0   0

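【Question comments】:

Please provide sample data, perhaps generated with numpy.random.randint.
Keep in mind that MATLAB does a lot of JIT compilation, which lets you write iterative code without a performance penalty. numpy (and pandas, which is built on top of it) does not do that; it performs best with "vectorized/whole-array" approaches (like older versions of MATLAB).
You shouldn't use for loops with pandas or numpy. That will be slow. To help you, we need a sample of your data and the expected output.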
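
To illustrate the point made in the comments, here is a minimal sketch on made-up data. The toy frame `df` is hypothetical; only the column names come from the question. It computes the mismatch flag once with a Python-level loop and once with a whole-array comparison. Both give the same result, but the second does the work in a single vectorized operation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names are taken from the question.
df = pd.DataFrame({
    "TrqSpdQuadrant":    [1, 1, 2, 2, 3],
    "UdUq_IqRsQuadrant": [1, 2, 2, 4, 3],
})

# Loop version: one Python-level iteration per row.
a = df["TrqSpdQuadrant"].to_numpy()
b = df["UdUq_IqRsQuadrant"].to_numpy()
loop_flags = np.zeros(len(df), dtype=bool)
for i in range(len(df)):
    loop_flags[i] = a[i] != b[i]

# Vectorized version: a single comparison over the whole column.
vec_flags = (df["TrqSpdQuadrant"] != df["UdUq_IqRsQuadrant"]).to_numpy()
```

On a frame with millions of rows the loop pays the interpreter overhead once per row, while the vectorized comparison runs in compiled code.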

Tags: python pandas jupyter

【Solution 1】:

This sets the 'Fault' column to True when more than 150 consecutive rows indicate an error, and to False otherwise.

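import numpy as np
# True where the two quadrant signals disagree.
PosSpdData['Error_IqRs'] = PosSpdData['TrqSpdQuadrant'] != PosSpdData['UdUq_IqRsQuadrant']
data = PosSpdData['Error_IqRs'].values
# Give each run of identical consecutive values its own group id.
PosSpdData['gr'] = np.r_[True, data[1:] != data[:-1]].cumsum()
# A run is flagged when it contains more than 150 error rows.
PosSpdData['Fault'] = PosSpdData.groupby('gr')['Error_IqRs'].transform('sum') > 150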
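
A self-contained sketch of the same run-length idea on a toy column may make the steps easier to follow. The data, the small threshold of 3 (instead of 150), and the helper name `flag_long_runs` are assumptions chosen for illustration. Note that `transform('sum') > 150` marks every row of a long run, whereas the counter in the question only marks rows once the count has passed the threshold. The second return value sketches that variant using `cumcount`.

```python
import numpy as np
import pandas as pd

def flag_long_runs(error, threshold):
    """Hypothetical helper: flag runs of True that are longer than `threshold`.

    Returns (whole_run, tail_only) as boolean Series.
    """
    error = pd.Series(error, dtype=bool)
    values = error.to_numpy()
    # A new run id starts wherever the value differs from the previous row.
    run_id = pd.Series(
        np.r_[True, values[1:] != values[:-1]].cumsum(), index=error.index
    )
    # Variant 1 (as in the answer): flag every row of a sufficiently long error run.
    run_errors = error.astype(int).groupby(run_id).transform("sum")
    whole_run = run_errors > threshold
    # Variant 2 (closer to the question's counter): flag only rows past the threshold.
    position = error.groupby(run_id).cumcount() + 1  # 1-based position inside each run
    tail_only = error & (position > threshold)
    return whole_run, tail_only

# Toy data: a run of 2 errors, then a run of 5 errors.
errors = [False, True, True, False, True, True, True, True, True, False]
whole_run, tail_only = flag_long_runs(errors, threshold=3)
```

With these inputs, `whole_run` is True for all five rows of the second run, while `tail_only` is True only for its fourth and fifth rows.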

Benchmark with n = 2_000_000 rows x 2 columns:

307 ms ± 4.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
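
The answer does not show how the benchmark data was generated. One plausible way to time the approach yourself is sketched below, assuming random quadrant values between 0 and 3; the helpers `make_frame` and `add_fault` and the row count are hypothetical. Timings depend on the machine and on how long the error runs in the data are, so no particular figure should be expected.

```python
import timeit

import numpy as np
import pandas as pd

def make_frame(n, seed=0):
    """Hypothetical test data: two random quadrant columns with values 0..3."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "TrqSpdQuadrant": rng.integers(0, 4, size=n),
        "UdUq_IqRsQuadrant": rng.integers(0, 4, size=n),
    })

def add_fault(df, threshold=150):
    """Add a boolean 'Fault' column using the run-length grouping from the answer."""
    error = df["TrqSpdQuadrant"] != df["UdUq_IqRsQuadrant"]
    data = error.to_numpy()
    # New group id whenever the error flag changes from one row to the next.
    run_id = np.r_[True, data[1:] != data[:-1]].cumsum()
    df["Fault"] = error.astype(int).groupby(run_id).transform("sum") > threshold
    return df

frame = make_frame(200_000)  # raise towards 2_000_000 to mirror the answer's size
seconds = timeit.timeit(lambda: add_fault(frame), number=3) / 3
print(f"{seconds * 1000:.1f} ms per run")
```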

【Comments】: