【Question title】: Alternative to for loop to iterate through dataframe sequentially
【Posted】: 2017-12-11 13:33:40
【Question】:

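I have a very large dataframe holding a battery discharge schedule for every second of a year.

The basic sequence of events is:

  • An event occurs
  • The battery discharges
  • Discharging stops
  • X seconds after discharging stops, charging starts
  • Charging stops once the battery is full

The dataframe looks something like this (apologies for the poor formatting):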

Index      | Freq    | Case | Battery OP  | Power Required | Battery Energy | SOC   | Response timer | Charge Power
01/01/2016 | 49.5862 | C    | Discharging | 300.512        | 1500           | 99.85 | 3              | 0
01/01/2016 | 49.5862 | C    | Charging    | 0              | 1500           | 99.85 | 3              | 1500

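I am currently using a for loop with some if/elif statements to go through every row and check whether the battery needs charging.

I think this is very inefficient. It will probably either run out of memory or take days to finish.

I left it running over the weekend and it still had not finished.

I am sure there is a better way to do this, but I don't know what it is. The catch is that it has to be sequential: the state of charge (or battery energy) has to be computed for every second from the power going into or out of the battery and from the previous second's SOC%/energy.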
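Written as a recurrence, and reading the update rule off the loop in the code below, the dependency looks roughly like this. The symbols are introduced here only for illustration: E_j is the battery energy in Wh at second j, P^req_j and P^chg_j are the required and charging power in W, and E_max is the capacity (Battery_Wh in the code). One-second steps are assumed, hence the division by 3600.

```latex
E_j = \min\!\left(E_{\max},\; E_{j-1} - \frac{P^{\mathrm{req}}_j}{3600} + \frac{P^{\mathrm{chg}}_j}{3600}\right),
\qquad
\mathrm{SOC}_j = 100 \cdot \frac{E_j}{E_{\max}}
```

Each E_j needs the already capped E_{j-1}, and P^chg_j itself depends on E_{j-1}, which is why the rows cannot simply be computed independently of one another.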

Reproducible code here (kept as minimal as I could manage):

import numpy as np
import pandas as pd


Battery_W = 1000
Battery_Wh = 1000/ 3
starting_SOC = 0.75
charge_delay = 5
charging = False

year_test = pd.DataFrame(data = [50.00,50.00,49.99,49.98,49.87,49.76,49.65,49.25,50.00,50.00,50.00,50.00,50.00,50.00,49.99,49.78,49.67,49.46,49.25,49.25,50.00,50.00,50.00,49.95,49.65,49.45,49.65,49.55,50.00,50.00,50.00,50.00,50.00,50.00,50.00,49.95,49.65,49.45,49.65,49.55,49.99,49.68,50.00,50.00,50.00,50.00,50.00,50.00,50.00,50.00],index = range(0,50),columns= ['Freq'])


case_conditions = [
    (year_test['Freq'] <= 49.75 ),                                 
    (year_test['Freq'] > 49.75 )   
    ]
choices = ['C', 'B']
year_test['Case'] = np.select(case_conditions, choices, default='No Case')

"Battery Operation mode"
op_conditions = [
        (year_test['Case'] == 'C'),
        (year_test['Case'] == 'B')
]
#%%
op_choices = ['Discharging','Idle']
year_test['Battery OP']= np.select(op_conditions, op_choices, default = 'No OP Mode')

"Calculate power output required"

power_conditions = [
        (year_test['Case'] == 'B'),
        (year_test['Case'] == 'C')
]

power_choices = [1000,0]
year_test['Power Required']= np.select(power_conditions, power_choices, default = 0)

year_test['Battery Energy'] = 0.0
year_test['SOC%'] = 0

"Response Timer"
year_test['Response timer'] = year_test.groupby('Battery OP').cumcount()
# Use .loc instead of chained indexing so the assignment reaches year_test itself
year_test.loc[year_test['Battery OP'] == 'Idle', 'Response timer'] = 0

year_test['Charge Power'] = 0.00


# Seed the first charge_delay rows with the starting energy
# ('Battery Energy' was already created above; .loc avoids chained assignment)
year_test.loc[year_test.index[:charge_delay], 'Battery Energy'] = Battery_Wh * starting_SOC


for j in range(charge_delay, len(year_test)):
    if year_test.iloc[j-(charge_delay) ,3]  > 0 and year_test.iloc[j - ((charge_delay) -1), 3] == 0 :
        "charge at max rate"
        year_test.iloc[j,7] = Battery_W
        year_test.iloc[j,2] = "Charging"
        charging = True

    elif charging == True and year_test.iloc[j-1,4] < starting_SOC * Battery_Wh:
        "check if battery charged"
        year_test.iloc[j,7] = Battery_W
        year_test.iloc[j,2] = "Charging"

    elif year_test.iloc[j-1,4] >= starting_SOC * Battery_Wh or charging == False:
        charging = False
        year_test.iloc[j,7] = 0.0

    "New Battery Energy"    
    year_test.iloc[j,4] = year_test.iloc[(j-1),4] - ((year_test.iloc[j,3])/60/60) + ((year_test.iloc[j,7])/60/60)
    if year_test.iloc[j,4] > Battery_Wh :
        year_test.iloc[j,4] = Battery_Wh

"Calculate battery SOC% for empty"

year_test['SOC%'] = year_test['Battery Energy'] / Battery_Wh * 100

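【Comments】:

  • Could you give an example dataframe reduced to the relevant fields, along with the expected output? Your code is rather hard to read.
  • I agree with Tillmann: if you can provide an MCVE, it will be much easier to help you.
  • OK, I'll try adding one now.
  • I think you may be missing the boolean charging=True at the start of your code... In any case, the problem is clearly that you append new columns after creating the first column, 'Freq'. You have two options: build the 'Freq' and 'Case' columns from a dictionary, or use df.itertuples()... Let me try to wrap this up in some quick code...
  • Your code is a bit unclear: what data do you have, and what data do you want to produce? Could you give a sample input dataframe and a sample output, along with the rules that produce the expected output?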

Tags: python pandas loops


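【Answer 1】:

Since you are running out of memory, the best approach is to use the apply method of the pandas DataFrame. This approach is often loosely called vectorization, although apply generally still calls the function once per row or column rather than performing a single array operation.

An example is df.apply(numpy.sqrt, axis=1)

You can check the documentation for more details: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html#pandas.DataFrame.apply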
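To make the example above concrete, here is a small sketch on a made-up frame (the column names a and b and their values are illustrative, not taken from the question). It runs the call from this answer and compares it with passing the frame to np.sqrt directly, then does the same comparison for a row-wise lambda.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame, used only for illustration.
df = pd.DataFrame({"a": [1.0, 4.0, 9.0], "b": [16.0, 25.0, 36.0]})

# The call from the answer.
via_apply = df.apply(np.sqrt, axis=1)

# np.sqrt is a NumPy ufunc, so it also accepts the whole frame at once.
direct = np.sqrt(df)
print(via_apply.equals(direct))  # True

# A row-wise lambda is called once per row...
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)

# ...whereas column arithmetic handles all rows in one operation.
col_sum = df["a"] + df["b"]
print(row_sum.tolist() == col_sum.tolist())  # True
```

Both pairs give the same numbers. The difference is in how the work is done, which starts to matter once the frame has millions of rows.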

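【Comments】:

  • Do you know whether .apply will work when the current row's value depends on the previous row?
  • I'm not sure that it does, but the nice thing about .apply is that it takes a function. So if you need the previous row's value, you can write that into the function that .apply calls. I hope that's clear enough?
  • Yes, I'll try .apply with shift and see how it goes.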
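The question raised in this thread, whether shift can handle a value that depends on the previous row, can be explored with a small sketch. The numbers below (delta, start, cap) are made up for illustration. shift(1) exposes the previous row of a column that already exists, and a running total without a cap can be had from cumsum. Once each value is capped before it feeds the next step, as the battery energy is, clipping a cumsum afterwards no longer matches a step-by-step loop.

```python
import pandas as pd

# Made-up per-second net energy change in Wh (positive means charging).
delta = pd.Series([2.0, 2.0, -1.0, 2.0])
start, cap = 8.0, 10.0  # hypothetical starting energy and capacity

# shift(1) looks up the previous row of an existing column; row 0 gets NaN.
prev_delta = delta.shift(1)

# No cap: the recurrence is a running sum, so cumsum is enough.
uncapped = start + delta.cumsum()

# Clipping the running sum afterwards is NOT the same as capping each step.
clipped_after = uncapped.clip(upper=cap)

# Capping each step needs the previous capped value, hence a sequential loop.
capped = []
energy = start
for d in delta:
    energy = min(cap, energy + d)
    capped.append(energy)

print(clipped_after.tolist())  # [10.0, 10.0, 10.0, 10.0]
print(capped)                  # [10.0, 10.0, 9.0, 10.0]
```

The two results differ at the third step, which suggests that shift alone is unlikely to reproduce the capped battery update.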
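【Answer 2】:

This is how I might rewrite your code. I simply reduced the initial 7 columns to a dictionary and then used pd.DataFrame() to turn them into a proper DataFrame. Then I just apply your if...elif statements while iterating over the constructed DataFrame.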

import numpy as np
import pandas as pd


Battery_W = 1000
Battery_Wh = 1000 / 3
starting_SOC = 0.75
charge_delay = 5
charging = True

# initialize test dictionary
test = {}

# add your test elements as a tuple
data = (50.00,50.00,49.99,49.98,49.87,49.76,49.65,49.25,50.00,50.00,50.00,50.00,50.00,50.00,49.99,49.78,49.67,49.46,49.25,49.25,50.00,50.00,50.00,49.95,49.65,49.45,49.65,49.55,50.00,50.00,50.00,50.00,50.00,50.00,50.00,49.95,49.65,49.45,49.65,49.55,49.99,49.68,50.00,50.00,50.00,50.00,50.00,50.00,50.00,50.00)
# "Battery Operation mode" is not calculated separately now
# "Calculate power output required" is not calculated separately now
for index, d in enumerate(data):
    if d <= 49.75:
        test[index] = {'Freq': d,
                       'Case': 'C',
                       'Battery_OP': 'Discharging',
                       'Power_Required': 0,
                       'Battery_Energy': 0.0,
                       'SOC': 0,
                       'Charge_Power': 0.0}
    elif d > 49.75:
        test[index] = {'Freq': d,
                       'Case': 'B',
                       'Battery_OP': 'Idle',
                       'Power_Required': 1000,
                       'Battery_Energy': 0.0,
                       'SOC': 0,
                       'Charge_Power': 0.0}
# This is how I convert the dictionary into a df for the first time
year_test = pd.DataFrame(list(test.values()))

year_test['Response_timer'] = year_test.groupby('Battery_OP').cumcount()
year_test.loc[year_test['Battery_OP'] == 'Idle', 'Response_timer'] = 0

# Look up column positions by name so the iloc calls below cannot drift
# when columns are added or reordered
OP = year_test.columns.get_loc('Battery_OP')
POWER = year_test.columns.get_loc('Power_Required')
ENERGY = year_test.columns.get_loc('Battery_Energy')
CHARGE = year_test.columns.get_loc('Charge_Power')

# Seed the first charge_delay rows with the starting energy
year_test.iloc[0:charge_delay, ENERGY] = Battery_Wh * starting_SOC

# instead of using range(), try iterating with `itertuples()`
# This is most probably where you are losing your time..
for row in year_test.itertuples():
    j = row.Index
    if j < charge_delay:
        continue
    if year_test.iloc[j - charge_delay, POWER] > 0 and year_test.iloc[j - (charge_delay - 1), POWER] == 0:
        # charge at max rate
        year_test.iloc[j, CHARGE] = Battery_W
        year_test.iloc[j, OP] = "Charging"
        charging = True

    elif charging and year_test.iloc[j - 1, ENERGY] < starting_SOC * Battery_Wh:
        # battery not yet back at its starting level: keep charging
        year_test.iloc[j, CHARGE] = Battery_W
        year_test.iloc[j, OP] = "Charging"

    elif year_test.iloc[j - 1, ENERGY] >= starting_SOC * Battery_Wh or not charging:
        charging = False
        year_test.iloc[j, CHARGE] = 0.0

    # New battery energy: previous energy minus discharge plus charge (W over 1 s -> Wh)
    year_test.iloc[j, ENERGY] = year_test.iloc[j - 1, ENERGY] - year_test.iloc[j, POWER] / 60 / 60 + year_test.iloc[j, CHARGE] / 60 / 60
    if year_test.iloc[j, ENERGY] > Battery_Wh:
        year_test.iloc[j, ENERGY] = Battery_Wh

# Calculate battery SOC%
year_test['SOC'] = year_test['Battery_Energy'] / Battery_Wh * 100

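【Comments】:

  • Thanks, I'll give it a try.
  • I tried this approach, printing the loop's progress roughly every 1000 rows, but at the current rate I am still looking at about 27 hours per month of data :(
  • Rather than loading the dictionary into a plain dataframe with pd.DataFrame(test.values()), try writing it to Feather or Parquet, or try Apache Arrow, to speed up data access and analysis. This may help you get started...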
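Given that the loop is still slow, one thing that may be worth trying is to keep the same sequential logic but run it on plain NumPy arrays, so that each step is a scalar array access rather than a DataFrame.iloc call. The sketch below follows the branch structure of the question's loop. The function simulate and its parameter names are hypothetical, the defaults mirror the constants in the question, and it has not been benchmarked on the full data set.

```python
import numpy as np


def simulate(power_required, battery_w=1000.0, battery_wh=1000.0 / 3,
             starting_soc=0.75, charge_delay=5):
    """Sequential battery update on NumPy arrays (hypothetical helper).

    Returns (energy_wh, charge_power_w), one value per one-second row.
    """
    power_required = np.asarray(power_required, dtype=float)
    n = len(power_required)
    target = starting_soc * battery_wh   # energy level at which charging stops
    energy = np.zeros(n)
    charge_power = np.zeros(n)
    energy[:charge_delay] = target       # seed rows, as in the question
    charging = False

    for j in range(charge_delay, n):
        if (power_required[j - charge_delay] > 0
                and power_required[j - charge_delay + 1] == 0):
            charging = True              # trigger seen charge_delay rows back
        elif not (charging and energy[j - 1] < target):
            charging = False             # idle, or already back at the target
        charge_power[j] = battery_w if charging else 0.0

        # Previous energy, minus what was drawn, plus what was charged
        # (power in W over one second, converted to Wh).
        step = energy[j - 1] - power_required[j] / 3600 + charge_power[j] / 3600
        energy[j] = min(battery_wh, step)

    return energy, charge_power


# Tiny made-up input: one row with demand, then none.
demo_energy, demo_charge = simulate([1000, 0, 0, 0, 0, 0, 0, 0])
print(demo_charge)
```

With the question's frame, the input could come from year_test['Power Required'].to_numpy(), and the two returned arrays could then be assigned back as the 'Battery Energy' and 'Charge Power' columns in one step each.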