Alternative to a for loop to iterate through a dataframe sequentially

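【Question title】: Alternative to for loop to iterate through dataframe sequentially
【Posted】: 2017-12-11 13:33:40
【Question】:

I have a very large dataframe holding a battery discharge schedule for every second of a year.

The basic sequence of events is:

An event occurs
The battery discharges
Discharging stops
X seconds after discharging stops, charging starts
Charging stops once the battery is full

The dataframe looks something like this... (forgive my poor formatting)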

Index | Freq | Case | Battery OP | Power Required | Battery Energy | SOC | Response timer | Charge Power |

01/01/2016 | 49.5862 | C | Discharging | 300.512 | 1500 | 99.85 | 3 | 0 |

01/01/2016 | 49.5862 | C | Charging    | 0       | 1500 | 99.85 | 3 | 1500 |

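I am currently using a for loop and some if/elif statements to go through each row and check whether the battery needs charging.

I think this is very inefficient. It may run out of memory, or take days to finish.

I left it running over the weekend and it still had not finished.

I am sure there is a better way to do this, but I don't know it. The catch is that it has to be sequential: the state of charge (or battery energy) must be calculated for every second from the power going into or out of the battery and the previous second's SOC%/energy.

Reproducible code here (kept as minimal as I could)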

import numpy as np
import pandas as pd


Battery_W = 1000
Battery_Wh = 1000/ 3
starting_SOC = 0.75
charge_delay = 5
charging = False

year_test = pd.DataFrame(data = [50.00,50.00,49.99,49.98,49.87,49.76,49.65,49.25,50.00,50.00,50.00,50.00,50.00,50.00,49.99,49.78,49.67,49.46,49.25,49.25,50.00,50.00,50.00,49.95,49.65,49.45,49.65,49.55,50.00,50.00,50.00,50.00,50.00,50.00,50.00,49.95,49.65,49.45,49.65,49.55,49.99,49.68,50.00,50.00,50.00,50.00,50.00,50.00,50.00,50.00],index = range(0,50),columns= ['Freq'])


case_conditions = [
    (year_test['Freq'] <= 49.75 ),                                 
    (year_test['Freq'] > 49.75 )   
    ]
choices = ['C', 'B']
year_test['Case'] = np.select(case_conditions, choices, default='No Case')

"Battery Operation mode"
op_conditions = [
        (year_test['Case'] == 'C'),
        (year_test['Case'] == 'B')
]
#%%
op_choices = ['Discharging','Idle']
year_test['Battery OP']= np.select(op_conditions, op_choices, default = 'No OP Mode')

"Calculate power output required"

power_conditions = [
        (year_test['Case'] == 'B'),
        (year_test['Case'] == 'C')
]

power_choices = [1000,0]
year_test['Power Required']= np.select(power_conditions, power_choices, default = 0)

year_test['Battery Energy'] = 0.0
year_test['SOC%'] = 0

"Response Timer"
year_test['Response timer'] = year_test.groupby('Battery OP').cumcount()
# use a single .loc assignment; chained indexing may write to a copy
year_test.loc[year_test['Battery OP'] == 'Idle', 'Response timer'] = 0

year_test['Charge Power'] = 0.00


# the first charge_delay rows start at the initial state of charge
year_test.loc[year_test.index[:charge_delay], 'Battery Energy'] = Battery_Wh * starting_SOC


for j in range(charge_delay, len(year_test)):
    if year_test.iloc[j-(charge_delay) ,3]  > 0 and year_test.iloc[j - ((charge_delay) -1), 3] == 0 :
        "charge at max rate"
        year_test.iloc[j,7] = Battery_W
        year_test.iloc[j,2] = "Charging"
        charging = True

    elif charging == True and year_test.iloc[j-1,4] < starting_SOC * Battery_Wh:
        "check if battery charged"
        year_test.iloc[j,7] = Battery_W
        year_test.iloc[j,2] = "Charging"

    elif year_test.iloc[j-1,4] >= starting_SOC * Battery_Wh or charging == False:
        charging = False
        year_test.iloc[j,7] = 0.0

    "New Battery Energy"    
    year_test.iloc[j,4] = year_test.iloc[(j-1),4] - ((year_test.iloc[j,3])/60/60) + ((year_test.iloc[j,7])/60/60)
    if year_test.iloc[j,4] > Battery_Wh :
        year_test.iloc[j,4] = Battery_Wh

"Calculate battery SOC% for empty"

year_test['SOC%'] = year_test['Battery Energy'] / Battery_Wh * 100

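【Comments】:

Could you give a sample dataframe reduced to the relevant fields, together with the expected output? Your code is rather hard to read.
I agree with Tillmann; if you can provide an MCVE, it will be much easier to help you.
OK, I'll try adding one now.
I think you may be missing the boolean charging=True at the start of your code... Anyway, the problem is clearly that you append new columns after creating the first column 'Freq'. You have 2 options: build the 'Freq' and 'Case' columns with dictionary operations, or use df.itertuples()... let me try to wrap that up in some quick code...
Your code is a bit unclear - what data do you have, and what data do you want to generate? Could you give a sample input dataframe and a sample output, plus the rules that produce the expected output?

Tags: python pandas loops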

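【Solution 1】:

Since you are running short of memory, the best approach is to use the apply method of the pandas DataFrame. This approach is often (loosely) referred to as vectorization.

An example: df.apply(numpy.sqrt, axis=1)

You can check the documentation for more details: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html#pandas.DataFrame.apply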
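
As a small illustration of the difference between apply and column-wise operations, the sketch below computes the same sum both ways. The dataframe, the column names `a` and `b`, and the helper `add_row` are made up for this example. With a plain Python function, apply with axis=1 calls that function once per row, whereas the column expression is a single operation over whole columns.

```python
import pandas as pd

# Made-up data purely for illustration
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

calls = []  # records every call pandas makes to add_row


def add_row(row):
    """Hypothetical row function: adds the two columns of one row."""
    calls.append(1)
    return row["a"] + row["b"]


via_apply = df.apply(add_row, axis=1)  # Python-level call for each row
via_columns = df["a"] + df["b"]        # one operation over whole columns

print(via_apply.tolist(), via_columns.tolist(), len(calls))
```

Both results hold the same values; only the column expression avoids the per-row Python call, which is usually what matters on a dataframe with millions of rows.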

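【Discussion】:

Do you know whether .apply works when the current row's value depends on previous rows?
I'm not sure whether it does, but the nice thing about .apply is that it takes a function. So if you need the previous row's value, you can write that into the function that .apply calls. I hope that is clear enough?
Yes, I'm going to try .apply with shift and see how it goes.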
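
On the shift idea: shift can express conditions that depend only on earlier values of an input column. In the question's loop, the "start charging" trigger looks only at Power Required from charge_delay and charge_delay - 1 rows back, so that part can be computed for all rows at once. The sketch below assumes a made-up power series and a short delay of 2. It does not cover the battery energy, which depends on the previous row's computed result and so cannot be obtained from shift alone.

```python
import pandas as pd

charge_delay = 2  # illustrative value; the question uses 5

# Made-up "Power Required" values, one per second
power = pd.Series([0, 1000, 1000, 0, 0, 0, 1000, 0, 0, 0], name="Power Required")

# True at row j when row j - charge_delay was drawing power
# and row j - (charge_delay - 1) was not, as in the loop's first condition
trigger = (power.shift(charge_delay) > 0) & (power.shift(charge_delay - 1) == 0)

trigger_rows = trigger[trigger].index.tolist()
print(trigger_rows)
```

Rows near the start, where the shifted values are missing, compare as False, which matches the loop starting at row charge_delay.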

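【Solution 2】:

This is how I would probably rewrite your code. I just reduced the initial 7 columns to a dictionary and then used pd.DataFrame() to turn them into a proper DataFrame. Then I simply apply your if...elif statements while iterating over the constructed DataFrame.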

import numpy as np
import pandas as pd


Battery_W = 1000
Battery_Wh = 1000/ 3
starting_SOC = 0.75
charge_delay = 5
charging = True

#initialize test Dictionary 
test = {}

#add your test elements as a tuple
data = (50.00,50.00,49.99,49.98,49.87,49.76,49.65,49.25,50.00,50.00,50.00,50.00,50.00,50.00,49.99,49.78,49.67,49.46,49.25,49.25,50.00,50.00,50.00,49.95,49.65,49.45,49.65,49.55,50.00,50.00,50.00,50.00,50.00,50.00,50.00,49.95,49.65,49.45,49.65,49.55,49.99,49.68,50.00,50.00,50.00,50.00,50.00,50.00,50.00,50.00)
index = 0
#"Battery Operation mode" is not calculated separately now
#"Calculate power output required" is not calculated separately now
for d in data:
    if d <= 49.75:
        test[index] = {'Freq': d,
                       'Case': 'C',
                       'Battery_OP': 'Discharging',
                       'Power_Required': 0,
                       'Battery_Energy': 0.0,
                       'SOC': 0,
                       'Charge_Power': 0.0}
    elif d > 49.75:
        test[index] = {'Freq': d,
                       'Case': 'B',
                       'Battery_OP': 'Idle',
                       'Power_Required': 1000,
                       'Battery_Energy': 0.0,
                       'SOC': 0,
                       'Charge_Power': 0.0}
    index += 1
#This is how I convert the dictionary into a df for the first-time
year_test = pd.DataFrame(list(test.values()))

year_test['Response_timer'] = year_test.groupby('Battery_OP').cumcount()
year_test.loc[year_test['Battery_OP'] == 'Idle', 'Response_timer'] = 0

# 'Charge_Power' and 'Battery_Energy' already exist (created from the dictionary),
# so only the first charge_delay rows of the energy column need initialising
year_test.loc[year_test.index[:charge_delay], 'Battery_Energy'] = Battery_Wh * starting_SOC

#instead of using the range(), try to manipulate it using `itertuples()`
#This is most probably where you are losing your time..
for row in year_test.itertuples():
    j = row.Index  # row label; the index here is 0..n-1
    if j < charge_delay:
        continue
    # columns are addressed by name so the code does not depend on column order
    if year_test.loc[j - charge_delay, 'Power_Required'] > 0 and year_test.loc[j - (charge_delay - 1), 'Power_Required'] == 0:
        "charge at max rate"
        year_test.loc[j, 'Charge_Power'] = Battery_W
        year_test.loc[j, 'Battery_OP'] = "Charging"
        charging = True

    elif charging == True and year_test.loc[j - 1, 'Battery_Energy'] < starting_SOC * Battery_Wh:
        "check if battery charged"
        year_test.loc[j, 'Charge_Power'] = Battery_W
        year_test.loc[j, 'Battery_OP'] = "Charging"

    elif year_test.loc[j - 1, 'Battery_Energy'] >= starting_SOC * Battery_Wh or charging == False:
        charging = False
        year_test.loc[j, 'Charge_Power'] = 0.0

    "New Battery Energy"
    year_test.loc[j, 'Battery_Energy'] = year_test.loc[j - 1, 'Battery_Energy'] - (year_test.loc[j, 'Power_Required'] / 60 / 60) + (year_test.loc[j, 'Charge_Power'] / 60 / 60)
    if year_test.loc[j, 'Battery_Energy'] > Battery_Wh:
        year_test.loc[j, 'Battery_Energy'] = Battery_Wh

"Calculate battery SOC% for empty"
year_test['SOC'] = year_test['Battery_Energy'] / Battery_Wh * 100

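【Discussion】:

Thanks, I'll give it a try.
I tried this approach, printing the loop progress roughly every 1000 rows, but at the current rate I am still looking at 27 hours per month of data :(
Rather than loading the dictionary into a plain dataframe with pd.DataFrame(test.values()), try writing it to Feather or Parquet, or try using Apache Arrow, to speed up data access and analysis. This may help you get started...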
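
Since the energy update has to stay sequential, one option not tried in this thread is to keep the loop but run it over plain NumPy arrays instead of reading and writing single cells through .iloc, which is typically the slow part. The sketch below is an assumption-laden rewrite of the question's update rule, not a tested drop-in replacement: the function name `simulate_battery` and its parameters are hypothetical, and it returns only the charge power and battery energy arrays.

```python
import numpy as np


def simulate_battery(power_required, battery_w=1000.0, battery_wh=1000.0 / 3,
                     starting_soc=0.75, charge_delay=5):
    """Hypothetical helper: sequential battery update on NumPy arrays.

    Mirrors the if/elif chain of the loop in the question, with one
    value per second in power_required.
    """
    power = np.asarray(power_required, dtype=float)
    n = len(power)
    target = starting_soc * battery_wh   # energy level at which charging stops

    charge = np.zeros(n)                 # charge power per second
    energy = np.zeros(n)                 # battery energy per second (Wh)
    energy[:charge_delay] = target       # same initialisation as the question

    charging = False
    for j in range(charge_delay, n):
        # discharge ended charge_delay seconds ago: start charging
        if power[j - charge_delay] > 0 and power[j - charge_delay + 1] == 0:
            charging = True
        # already charging and the battery is back at the target: stop
        elif charging and energy[j - 1] >= target:
            charging = False

        if charging:
            charge[j] = battery_w

        # W for one second -> Wh, capped at the battery capacity
        new_energy = energy[j - 1] - power[j] / 3600.0 + charge[j] / 3600.0
        energy[j] = min(battery_wh, new_energy)

    return charge, energy
```

With the question's dataframe this could be called as `charge, energy = simulate_battery(year_test['Power Required'].to_numpy())`, and the two arrays assigned back to the 'Charge Power' and 'Battery Energy' columns in one step each.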