【Question】: Should I use itertuples or iterrows for iteration of DataFrames?
【Posted】: 2021-08-18 21:21:47
【Question body】:

This is a pet project where I parse a CSV into a human-readable format such as *.txt (the CSV may contain 100k+ rows).

The CSV looks like this:

name,type,start_time,duration,ack,address,read,data
I2C,start,23.6799126,8.00E-09,,,,
I2C,address,23.6799138,8.40E-06,TRUE,0x74,FALSE,
I2C,data,23.6799239,8.40E-06,TRUE,,,0x02
I2C,start,23.6799367,8.00E-09,,,,
I2C,address,23.6799409,8.40E-06,TRUE,0x74,TRUE,
I2C,data,23.6799509,8.40E-06,FALSE,,,0xB2
I2C,stop,23.6799619,8.00E-09,,,,

For each row, I decode whether its type is start, address, data, or stop, and then parse the other values appropriately; I use itertuples over the whole dataframe. I also take input from a JSON file (sample below) to correlate with the CSV.

 {
     "name":"IOUT_LIMIT",
     "address":"0x02",
     "Formulate":"1",
     "Data_Width":"1",
     "Mask":"0x7F",
     "Weightage":"50",
     "Offset":"0",
     "Units":"mA",
     "BitFields":[
        {
           "name":"",
           "start":0,
           "end":7
        }
     ]
  },

and the output is:

Transaction started for:  Read Data From  IOUT_LIMIT 3100mA
Transaction started for:  Write Data To  IOUT_LIMIT 3250mA
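The parsing code below scans Regcmd_List linearly for every transaction. One alternative (a hedged sketch; build_register_map is a hypothetical helper and the datasheet dict is a minimal stand-in for the real JSON file) is to index the entries by address once, so each lookup is O(1):

```python
# Hypothetical sketch: index Regcmd_List by address once, so each
# transaction resolves its register in O(1) instead of a linear scan.
def build_register_map(datasheet):
    return {entry['address']: entry for entry in datasheet['Regcmd_List']}

# Minimal stand-in for the real JSON file
datasheet = {"Regcmd_List": [
    {"name": "IOUT_LIMIT", "address": "0x02", "Formulate": "1",
     "Data_Width": "1", "Mask": "0x7F", "Weightage": "50",
     "Offset": "0", "Units": "mA"},
]}

reg_map = build_register_map(datasheet)
print(reg_map["0x02"]["name"])  # IOUT_LIMIT
```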

Sample code:

#Initial context (these flags and buffers are reset again on every I2C stop)
Transaction_Started = 0
Repeated_Start = 0
DataWrite = 0
String = ""
String_Units = ""
Output_Data = ""
Data = []

for row in df.itertuples():
        
        #Check for Start or Stop; I2C Start is used to start a transaction, If we encounter Repeated start it's a read.

        if(row.type == 'start'):
            
            #Fresh Start Encountered
            if(Transaction_Started == 0):
                String = "Transaction started for : "
                Transaction_Started = 1
            else:
                #Repeated start encountered
                Repeated_Start = 1
                
        elif (row.type == 'address'):
            
            if(Repeated_Start == 0):
                Slave_address = row.address
            
        elif(row.type == 'data'):
            
            #Append read data to a list "Data"
            Data.append(row.data)
            
            if(Repeated_Start):
                String+=" Read Data From"
            else:
                if(DataWrite == 1):
                    String+=" Write Data To"
                    
            DataWrite+=1
        
        elif(row.type == 'stop'):
            #Iterate over Regcmd_List in the JSON
            for i in DataSheet_Data['Regcmd_List']:
                
                #If the address is hit, get the register name, store it in
                #Output_Data and move along
                #Data[0] will have the address byte
                if(i['address'] == Data[0]):
                    Output_Data = i['name']
                    
                    #Get the Datawidth
                    Datawidth = int(i['Data_Width'],0)
                    Temp_Data = ""
                    
                    #Iterate the stored data from Data[1] and store it as a
                    #single value in Temp_Data
                    #Temp_Data will hold only data; the address byte is
                    #excluded since we iterate from Index[Datawidth] down to
                    #Index[1]
                    for j in range(Datawidth,0,-1):
                        Data[j] = Data[j].replace("0x","")
                        Temp_Data+=Data[j]
                    
                    #Check for Formula and apply over the Temp_Data
                    if(i['Formulate'] == '1'):
                    
                        #Multiply the value by Weightage and add the offset
                        #if mentioned in the JSON
                        Temp_Data = int(Temp_Data,16)
                        Temp_Data &= int(i['Mask'],0)
                        Temp_Data *= int(i['Weightage'])
                        if((i['Offset'] != "0") & (Temp_Data!=0)):
                            Temp_Data += int(i['Offset'])
                            
                        Temp_Data = str(Temp_Data)
                        
                        #Append this to Output_Data, which will have the
                        #Register Name + Value after the formula calculation
                        Output_Data += " " + Temp_Data
                        String_Units = i['Units']
                        
                    else:
                        #Append this to Output_Data, which will have the
                        #Register Name + Raw Bytes
                        Output_Data += " " + "0x" +Temp_Data
                        
            print(String + " " + " " + Output_Data + String_Units,file=output_file)
            
            #Clear all the context when an I2C Stop is Encountered
            Transaction_Started = 0
            Slave_address = 0
            Repeated_Start = 0
            DataWrite = 0
            String_Units = ""
            Output_Data = ""
            Temp_Data = ""
            Datawidth = 0
            Data.clear()
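As a side note, the mask/weightage/offset arithmetic in the 'Formulate' branch can be pulled out into a small helper, which also makes it easy to test on its own. A sketch (decode_value is a hypothetical name; the field semantics follow the code above):

```python
# Hypothetical helper factoring out the 'Formulate' arithmetic:
# mask the raw value, scale by Weightage, then add Offset when both
# the offset and the masked value are non-zero.
def decode_value(raw_hex, mask, weightage, offset):
    value = int(raw_hex, 16) & mask
    value *= weightage
    if offset != 0 and value != 0:
        value += offset
    return value

print(decode_value("0xB2", 0x7F, 50, 0))  # (0xB2 & 0x7F) * 50 = 50 * 50 = 2500
```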

For this application, should I use iterrows or itertuples? Which is more efficient?

One efficient approach might be to read the N rows between a row.type == 'start' and a row.type == 'stop' and process them as a frame. But even then, I may still need to iterate until I hit the stop and do most of the processing there. If there is a more performant design, please let me know.
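The start-to-stop chunking idea above can be sketched as a generator (a hedged sketch; transactions is a hypothetical name, and the rows here are minimal dict stand-ins for the CSV rows):

```python
# Accumulate rows into the current transaction and yield the whole
# chunk once a 'stop' row is seen.
def transactions(rows):
    current = []
    for row in rows:
        current.append(row)
        if row['type'] == 'stop':
            yield current
            current = []

rows = [
    {'type': 'start'}, {'type': 'address'}, {'type': 'data'}, {'type': 'stop'},
    {'type': 'start'}, {'type': 'data'}, {'type': 'stop'},
]
chunks = list(transactions(rows))
print(len(chunks), len(chunks[0]))  # 2 4
```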

【Comments】:

  • Please provide your data as text rather than as an image.
  • Guessing "L" is an abbreviation for an Indian term. Please don't use those here; we're not all from India.
  • itertuples is always faster than iterrows. Both are relatively inefficient. You should probably use the csv module for this.
  • Perhaps more importantly, String += "some other string" is inefficient.
  • By the way, could you post your expected output for the given dataframe?
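On the String += remark: each += builds a new string, so the usual idiom is to collect the parts in a list and join once at the end, e.g. (a sketch using sample values from the question):

```python
# Build the message from parts and join once, instead of repeated +=
parts = ["Transaction started for :"]
parts.append("Read Data From")
parts.append("IOUT_LIMIT 3100mA")
message = " ".join(parts)
print(message)  # Transaction started for : Read Data From IOUT_LIMIT 3100mA
```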

Tags: python json pandas csv parsing


【Solution 1】:

Iterating over the CSV file one row at a time avoids loading the entire 100k-row file into memory.

My answer below treats every row as an event printed to stdout. However, in your original post the data has multiple 'start' events before a 'stop' event. Does the 'stop' event close all currently open transactions?

def csv_to_message(filename) -> None:
    with open(filename, 'r') as f:
        first_line = next(f)
        headers = first_line.strip().split(',')
        transaction = {}
        for row_str in f:
            row_array = row_str.strip().split(',')
            row_dict = {k: (v if v else None) for k,v in zip(headers, row_array)}

            if row_dict['type']=='start':
                # update data in transaction
                # Augment transaction with data from JSON...
                # optionally choose to print message
            elif row_dict['type']=='address':
                # update data in transaction
                # Augment transaction with data from JSON...
                # optionally choose to print message
            elif row_dict['type']=='data':
                # update data in transaction
                # Augment transaction with data from JSON...
                # optionally choose to print message
            elif row_dict['type']=='stop':
                # update data in transaction
                # Augment transaction with data from JSON...
                # optionally choose to print message
                # Clear the transaction dictionary
                transaction = {}

            # print(f'Transaction {"started" if row_dict["type"]=="start" else "stopped"} for {"READ" if row_dict["read"] == "TRUE" else "WRITE"} {("data: " + transaction["data"]) if transaction.get("data") else ""} at ...etc...')

The final print() runs once per event; with the f-string expression from my example, it can summarize either the current row_dict (for single-event information) or the transaction dict (for summary information). You could also build up the conditional parts of the string as you go and print once at the end.

I realize this doesn't use pandas, but the scenario you describe doesn't seem to need the whole pandas module loaded into memory either. It's always good to avoid external modules when you can.

Also, since Python's open() accepts path-like objects as input, you could stream from stdin or another source. At the end of the day you are streaming data, and you should use a streaming pattern with the 'transaction' variable acting as a kind of buffer, rather than loading everything into memory before operating on it.
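For example, with the standard csv module the same row-dict streaming works unchanged whether the source is an open file, sys.stdin, or an in-memory buffer (a sketch; stream_rows is a hypothetical name):

```python
import csv
import io

def stream_rows(source):
    # csv.DictReader accepts any iterable of lines and yields one
    # dict per row, without loading the whole input into memory.
    for row in csv.DictReader(source):
        yield row

sample = io.StringIO(
    "name,type,start_time\n"
    "I2C,start,23.6799126\n"
    "I2C,stop,23.6799619\n"
)
print([row['type'] for row in stream_rows(sample)])  # ['start', 'stop']
```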

【Comments】:

    【Solution 2】:

    I don't know what your further computations are, but one approach could be to associate the rows between the first start and the stop, and then process them by group, e.g. to get the total duration, number of starts, and so on.

    Assume this is the csv, named IC2.csv:

    name,type,start_time,duration,ack,address,read,data
    I2C,start,23.6799126,8.00E-09,,,,
    I2C,address,23.6799138,8.40E-06,TRUE,0x74,FALSE,
    I2C,data,23.6799239,8.40E-06,TRUE,,,0x02
    I2C,start,23.6799367,8.00E-09,,,,
    I2C,address,23.6799409,8.40E-06,TRUE,0x74,TRUE,
    I2C,data,23.6799509,8.40E-06,FALSE,,,0xB2
    I2C,stop,23.6799619,8.00E-09,,,,
    I2C,start,23.6799126,8.00E-09,,,,
    I2C,address,23.6799138,8.40E-06,TRUE,0x74,FALSE,
    I2C,data,23.6799239,8.40E-06,TRUE,,,0x02
    I2C,start,23.6799367,8.00E-09,,,,
    I2C,address,23.6799409,8.40E-06,TRUE,0x74,TRUE,
    I2C,data,23.6799509,8.40E-06,FALSE,,,0xB2
    I2C,stop,23.6799619,8.00E-09,,,,
    

    Then you can read it in:

    import pandas as pd
    df=pd.read_csv('ic2.csv')
    

    to get an indexed dataframe like this:

    name    type    start_time  duration    ack     address     read    data
    0   I2C     start   23.679913   8.000000e-09    NaN     NaN     NaN     NaN
    1   I2C     address     23.679914   8.400000e-06    True    0x74    False   NaN
    2   I2C     data    23.679924   8.400000e-06    True    NaN     NaN     0x02
    3   I2C     start   23.679937   8.000000e-09    NaN     NaN     NaN     NaN
    4   I2C     address     23.679941   8.400000e-06    True    0x74    True    NaN
    5   I2C     data    23.679951   8.400000e-06    False   NaN     NaN     0xB2
    6   I2C     stop    23.679962   8.000000e-09    NaN     NaN     NaN     NaN
    7   I2C     start   23.679913   8.000000e-09    NaN     NaN     NaN     NaN
    8   I2C     address     23.679914   8.400000e-06    True    0x74    False   NaN
    9   I2C     data    23.679924   8.400000e-06    True    NaN     NaN     0x02
    10  I2C     start   23.679937   8.000000e-09    NaN     NaN     NaN     NaN
    11  I2C     address     23.679941   8.400000e-06    True    0x74    True    NaN
    12  I2C     data    23.679951   8.400000e-06    False   NaN     NaN     0xB2
    13  I2C     stop    23.679962   8.000000e-09    NaN     NaN     NaN     NaN
    

    Then you can generate the groups, for example:

    # find indices of type stop
    indstop=df.index[df['type']=='stop'].to_list()
    
    # set new column group flagging the stops
    df.loc[df['type']=='stop', 'group'] = indstop
    
    # assign group to the remaining rows
    df['group'].fillna(method='bfill', inplace=True)
    

    Now your df looks like this:

    name    type    start_time  duration    ack     address     read    data    group
    0   I2C     start   23.679913   8.000000e-09    NaN     NaN     NaN     NaN     6.0
    1   I2C     address     23.679914   8.400000e-06    True    0x74    False   NaN     6.0
    2   I2C     data    23.679924   8.400000e-06    True    NaN     NaN     0x02    6.0
    3   I2C     start   23.679937   8.000000e-09    NaN     NaN     NaN     NaN     6.0
    4   I2C     address     23.679941   8.400000e-06    True    0x74    True    NaN     6.0
    5   I2C     data    23.679951   8.400000e-06    False   NaN     NaN     0xB2    6.0
    6   I2C     stop    23.679962   8.000000e-09    NaN     NaN     NaN     NaN     6.0
    7   I2C     start   23.679913   8.000000e-09    NaN     NaN     NaN     NaN     13.0
    8   I2C     address     23.679914   8.400000e-06    True    0x74    False   NaN     13.0
    9   I2C     data    23.679924   8.400000e-06    True    NaN     NaN     0x02    13.0
    10  I2C     start   23.679937   8.000000e-09    NaN     NaN     NaN     NaN     13.0
    11  I2C     address     23.679941   8.400000e-06    True    0x74    True    NaN     13.0
    12  I2C     data    23.679951   8.400000e-06    False   NaN     NaN     0xB2    13.0
    13  I2C     stop    23.679962   8.000000e-09    NaN     NaN     NaN     NaN     13.0
    

    You can now aggregate by group, for example:

    df.groupby('group').agg({'duration':'sum'})
    
    
        duration
    group   
    6.0     0.000034
    13.0    0.000034
    

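    To then process each transaction individually (e.g. to compare against the JSON), you can iterate the groups; each group is a small start..stop DataFrame. A sketch, with a minimal stand-in frame whose 'group' column is precomputed as above:

```python
import pandas as pd

# Minimal stand-in frame with a precomputed 'group' column
df = pd.DataFrame({
    'type':  ['start', 'address', 'data', 'stop', 'start', 'data', 'stop'],
    'data':  [None, None, '0x02', None, None, '0xB2', None],
    'group': [3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0],
})

for gid, frame in df.groupby('group'):
    # each 'frame' is one start..stop transaction
    data_bytes = frame.loc[frame['type'] == 'data', 'data'].tolist()
    print(gid, data_bytes)
```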
    【Comments】:

    • But after grouping I still want to parse each row, compare the data with the JSON, and get readable output
    • I see, that's quite a complex job. One thing you could do is bring the JSON file into a dataframe, compare, and then convert the result back to JSON; as Nk03 mentioned in the comments, df.to_dict() would also help. That said, if you go row by row as in your original question, itertuples will be faster than iterrows; here's a good article: medium.com/swlh/…
    【Solution 3】:

    It turns out that itertuples() is always faster than iterrows().

    A good starting point for understanding the speed difference is to run both solutions through a profiler. A profiler is a tool that executes the given code while keeping track of how many times each function is called and how long each takes. That way, you can focus your optimization effort on the most time-consuming functions.

    Python ships with a built-in profiler, which can conveniently be invoked from a Jupyter notebook with the %%prun cell magic.

    See https://docs.python.org/3/library/profile.html
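    Outside a notebook, the standard timeit module gives a quick comparison of the two iteration styles; a minimal sketch (the frame and functions here are illustrative, not the asker's actual workload):

```python
import timeit

import pandas as pd

df = pd.DataFrame({'type': ['start', 'data', 'stop'] * 1000})

def with_itertuples():
    # attribute access on namedtuples
    return sum(1 for row in df.itertuples() if row.type == 'stop')

def with_iterrows():
    # each row materialized as a Series, typically much slower
    return sum(1 for _, row in df.iterrows() if row['type'] == 'stop')

print(with_itertuples(), with_iterrows())  # 1000 1000
print('itertuples:', timeit.timeit(with_itertuples, number=10))
print('iterrows:  ', timeit.timeit(with_iterrows, number=10))
```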

    【Comments】:
