【Question】: Should I use itertuples or iterrows for iteration of DataFrames?
【Posted】: 2021-08-18 21:21:47
【Question body】:

This is a pet project where I parse a CSV into a human-readable format such as *.txt (the CSV may contain 100k+ rows).

The CSV looks like this:

name,type,start_time,duration,ack,address,read,data
I2C,start,23.6799126,8.00E-09,,,,
I2C,address,23.6799138,8.40E-06,TRUE,0x74,FALSE,
I2C,data,23.6799239,8.40E-06,TRUE,,,0x02
I2C,start,23.6799367,8.00E-09,,,,
I2C,address,23.6799409,8.40E-06,TRUE,0x74,TRUE,
I2C,data,23.6799509,8.40E-06,FALSE,,,0xB2
I2C,stop,23.6799619,8.00E-09,,,,

For each row, I decode whether its type is start, address, data, or stop, and then parse the other values appropriately; I use itertuples over the whole dataframe. I also take input from a JSON file (sample below) to correlate with the CSV.

 {
     "name":"IOUT_LIMIT",
     "address":"0x02",
     "Formulate":"1",
     "Data_Width":"1",
     "Mask":"0x7F",
     "Weightage":"50",
     "Offset":"0",
     "Units":"mA",
     "BitFields":[
        {
           "name":"",
           "start":0,
           "end":7
        }
     ]
  },

and the output is:

Transaction started for:  Read Data From  IOUT_LIMIT 3100mA
Transaction started for:  Write Data To  IOUT_LIMIT 3250mA
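The parsing code below scans Regcmd_List linearly for every transaction. One alternative (a hedged sketch; build_register_map is a hypothetical helper and the datasheet dict is a minimal stand-in for the real JSON file) is to index the entries by address once, so each lookup is O(1):

```python
# Hypothetical sketch: index Regcmd_List by address once, so each
# transaction resolves its register in O(1) instead of a linear scan.
def build_register_map(datasheet):
    return {entry['address']: entry for entry in datasheet['Regcmd_List']}

# Minimal stand-in for the real JSON file
datasheet = {"Regcmd_List": [
    {"name": "IOUT_LIMIT", "address": "0x02", "Formulate": "1",
     "Data_Width": "1", "Mask": "0x7F", "Weightage": "50",
     "Offset": "0", "Units": "mA"},
]}

reg_map = build_register_map(datasheet)
print(reg_map["0x02"]["name"])  # IOUT_LIMIT
```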

Sample code:

#Initial context (these flags and buffers are reset again on every I2C stop)
Transaction_Started = 0
Repeated_Start = 0
DataWrite = 0
String = ""
String_Units = ""
Output_Data = ""
Data = []

for row in df.itertuples():
        
        #Check for Start or Stop; I2C Start is used to start a transaction, If we encounter Repeated start it's a read.

        if(row.type == 'start'):
            
            #Fresh Start Encountered
            if(Transaction_Started == 0):
                String = "Transaction started for : "
                Transaction_Started = 1
            else:
                #Repeated start encountered
                Repeated_Start = 1
                
        elif (row.type == 'address'):
            
            if(Repeated_Start == 0):
                Slave_address = row.address
            
        elif(row.type == 'data'):
            
            #Append read data to a list "Data"
            Data.append(row.data)
            
            if(Repeated_Start):
                String+=" Read Data From"
            else:
                if(DataWrite == 1):
                    String+=" Write Data To"
                    
            DataWrite+=1
        
        elif(row.type == 'stop'):
            #Iterate over Regcmd_List in the JSON
            for i in DataSheet_Data['Regcmd_List']:
                
                #If the address is hit, get the register name, store it in
                #Output_Data and move along
                #Data[0] will have the address byte
                if(i['address'] == Data[0]):
                    Output_Data = i['name']
                    
                    #Get the Datawidth
                    Datawidth = int(i['Data_Width'],0)
                    Temp_Data = ""
                    
                    #Iterate the stored data from Data[1] and store it as a
                    #single value in Temp_Data
                    #Temp_Data will hold only data; the address byte is
                    #excluded since we iterate from Index[Datawidth] down to
                    #Index[1]
                    for j in range(Datawidth,0,-1):
                        Data[j] = Data[j].replace("0x","")
                        Temp_Data+=Data[j]
                    
                    #Check for Formula and apply over the Temp_Data
                    if(i['Formulate'] == '1'):
                    
                        #Multiply the value by Weightage and add the offset
                        #if mentioned in the JSON
                        Temp_Data = int(Temp_Data,16)
                        Temp_Data &= int(i['Mask'],0)
                        Temp_Data *= int(i['Weightage'])
                        if((i['Offset'] != "0") & (Temp_Data!=0)):
                            Temp_Data += int(i['Offset'])
                            
                        Temp_Data = str(Temp_Data)
                        
                        #Append this to Output_Data, which will have the
                        #Register Name + Value after the formula calculation
                        Output_Data += " " + Temp_Data
                        String_Units = i['Units']
                        
                    else:
                        #Append this to Output_Data, which will have the
                        #Register Name + Raw Bytes
                        Output_Data += " " + "0x" +Temp_Data
                        
            print(String + " " + " " + Output_Data + String_Units,file=output_file)
            
            #Clear all the context when an I2C Stop is Encountered
            Transaction_Started = 0
            Slave_address = 0
            Repeated_Start = 0
            DataWrite = 0
            String_Units = ""
            Output_Data = ""
            Temp_Data = ""
            Datawidth = 0
            Data.clear()
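As a side note, the mask/weightage/offset arithmetic in the 'Formulate' branch can be pulled out into a small helper, which also makes it easy to test on its own. A sketch (decode_value is a hypothetical name; the field semantics follow the code above):

```python
# Hypothetical helper factoring out the 'Formulate' arithmetic:
# mask the raw value, scale by Weightage, then add Offset when both
# the offset and the masked value are non-zero.
def decode_value(raw_hex, mask, weightage, offset):
    value = int(raw_hex, 16) & mask
    value *= weightage
    if offset != 0 and value != 0:
        value += offset
    return value

print(decode_value("0xB2", 0x7F, 50, 0))  # (0xB2 & 0x7F) * 50 = 50 * 50 = 2500
```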

For this application, should I use iterrows or itertuples? Which is more efficient?

One efficient approach might be to read the N rows between a row.type == 'start' and a row.type == 'stop' and process them as a frame. But even then, I may still need to iterate until I hit the stop and do most of the processing there. If there is a more performant design, please let me know.
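The start-to-stop chunking idea above can be sketched as a generator (a hedged sketch; transactions is a hypothetical name, and the rows here are minimal dict stand-ins for the CSV rows):

```python
# Accumulate rows into the current transaction and yield the whole
# chunk once a 'stop' row is seen.
def transactions(rows):
    current = []
    for row in rows:
        current.append(row)
        if row['type'] == 'stop':
            yield current
            current = []

rows = [
    {'type': 'start'}, {'type': 'address'}, {'type': 'data'}, {'type': 'stop'},
    {'type': 'start'}, {'type': 'data'}, {'type': 'stop'},
]
chunks = list(transactions(rows))
print(len(chunks), len(chunks[0]))  # 2 4
```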

【Comments】:

  • Please provide your data as text rather than as an image.
  • Guessing "L" is an abbreviation for an Indian term. Please don't use those here; we're not all from India.
  • itertuples is always faster than iterrows. Both are relatively inefficient. You should probably use the csv module for this.
  • Perhaps more importantly, String += "some other string" is inefficient.
  • By the way, could you post your expected output for the given dataframe?
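On the String += remark: each += builds a new string, so the usual idiom is to collect the parts in a list and join once at the end, e.g. (a sketch using sample values from the question):

```python
# Build the message from parts and join once, instead of repeated +=
parts = ["Transaction started for :"]
parts.append("Read Data From")
parts.append("IOUT_LIMIT 3100mA")
message = " ".join(parts)
print(message)  # Transaction started for : Read Data From IOUT_LIMIT 3100mA
```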

Tags: python json pandas csv parsing


【Solution 1】:

Iterating over the CSV file one row at a time avoids loading the entire 100k-row file into memory.

My answer below treats every row as an event printed to stdout. However, in your original post the data has multiple 'start' events before a 'stop' event. Does the 'stop' event close all currently open transactions?

def csv_to_message(filename) -> None:
    with open(filename, 'r') as f:
        first_line = next(f)
        headers = first_line.strip().split(',')
        transaction = {}
        for row_str in f:
            row_array = row_str.strip().split(',')
            row_dict = {k: (v if v else None) for k,v in zip(headers, row_array)}

            if row_dict['type']=='start':
                # update data in transaction
                # Augment transaction with data from JSON...
                # optionally choose to print message
            elif row_dict['type']=='address':
                # update data in transaction
                # Augment transaction with data from JSON...
                # optionally choose to print message
            elif row_dict['type']=='data':
                # update data in transaction
                # Augment transaction with data from JSON...
                # optionally choose to print message
            elif row_dict['type']=='stop':
                # update data in transaction
                # Augment transaction with data from JSON...
                # optionally choose to print message
                # Clear the transaction dictionary
                transaction = {}

            # print(f'Transaction {"started" if row_dict["type"]=="start" else "stopped"} for {"READ" if row_dict["read"] == "TRUE" else "WRITE"} {("data: " + transaction["data"]) if transaction.get("data") else ""} at ...etc...')

The final print() runs once per event; with the f-string expression from my example, it can summarize either the current row_dict (for single-event information) or the transaction dict (for summary information). You could also build up the conditional parts of the string as you go and print once at the end.

I realize this doesn't use pandas, but the scenario you describe doesn't seem to need the whole pandas module loaded into memory either. It's always good to avoid external modules when you can.

Also, since Python's open() accepts path-like objects as input, you could stream from stdin or another source. At the end of the day you are streaming data, and you should use a streaming pattern with the 'transaction' variable acting as a kind of buffer, rather than loading everything into memory before operating on it.
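For example, with the standard csv module the same row-dict streaming works unchanged whether the source is an open file, sys.stdin, or an in-memory buffer (a sketch; stream_rows is a hypothetical name):

```python
import csv
import io

def stream_rows(source):
    # csv.DictReader accepts any iterable of lines and yields one
    # dict per row, without loading the whole input into memory.
    for row in csv.DictReader(source):
        yield row

sample = io.StringIO(
    "name,type,start_time\n"
    "I2C,start,23.6799126\n"
    "I2C,stop,23.6799619\n"
)
print([row['type'] for row in stream_rows(sample)])  # ['start', 'stop']
```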

【Comments】:

    【Solution 2】:

    I don't know what your further computations are, but one approach could be to associate the rows between the first start and the stop, and then process them by group, e.g. to get the total duration, number of starts, and so on.

    Assume this is the csv, named IC2.csv:

    name,type,start_time,duration,ack,address,read,data
    I2C,start,23.6799126,8.00E-09,,,,
    I2C,address,23.6799138,8.40E-06,TRUE,0x74,FALSE,
    I2C,data,23.6799239,8.40E-06,TRUE,,,0x02
    I2C,start,23.6799367,8.00E-09,,,,
    I2C,address,23.6799409,8.40E-06,TRUE,0x74,TRUE,
    I2C,data,23.6799509,8.40E-06,FALSE,,,0xB2
    I2C,stop,23.6799619,8.00E-09,,,,
    I2C,start,23.6799126,8.00E-09,,,,
    I2C,address,23.6799138,8.40E-06,TRUE,0x74,FALSE,
    I2C,data,23.6799239,8.40E-06,TRUE,,,0x02
    I2C,start,23.6799367,8.00E-09,,,,
    I2C,address,23.6799409,8.40E-06,TRUE,0x74,TRUE,
    I2C,data,23.6799509,8.40E-06,FALSE,,,0xB2
    I2C,stop,23.6799619,8.00E-09,,,,
    

    Then you can read it in:

    import pandas as pd
    df=pd.read_csv('ic2.csv')
    

    to get an indexed dataframe like this:

    name    type    start_time  duration    ack     address     read    data
    0   I2C     start   23.679913   8.000000e-09    NaN     NaN     NaN     NaN
    1   I2C     address     23.679914   8.400000e-06    True    0x74    False   NaN
    2   I2C     data    23.679924   8.400000e-06    True    NaN     NaN     0x02
    3   I2C     start   23.679937   8.000000e-09    NaN     NaN     NaN     NaN
    4   I2C     address     23.679941   8.400000e-06    True    0x74    True    NaN
    5   I2C     data    23.679951   8.400000e-06    False   NaN     NaN     0xB2
    6   I2C     stop    23.679962   8.000000e-09    NaN     NaN     NaN     NaN
    7   I2C     start   23.679913   8.000000e-09    NaN     NaN     NaN     NaN
    8   I2C     address     23.679914   8.400000e-06    True    0x74    False   NaN
    9   I2C     data    23.679924   8.400000e-06    True    NaN     NaN     0x02
    10  I2C     start   23.679937   8.000000e-09    NaN     NaN     NaN     NaN
    11  I2C     address     23.679941   8.400000e-06    True    0x74    True    NaN
    12  I2C     data    23.679951   8.400000e-06    False   NaN     NaN     0xB2
    13  I2C     stop    23.679962   8.000000e-09    NaN     NaN     NaN     NaN
    

    Then you can generate the groups, for example:

    # find indices of type stop
    indstop=df.index[df['type']=='stop'].to_list()
    
    # set new column group flagging the stops
    df.loc[df['type']=='stop', 'group'] = indstop
    
    # assign group to the remaining rows
    df['group'].fillna(method='bfill', inplace=True)
    

    Now your df looks like this:

    name    type    start_time  duration    ack     address     read    data    group
    0   I2C     start   23.679913   8.000000e-09    NaN     NaN     NaN     NaN     6.0
    1   I2C     address     23.679914   8.400000e-06    True    0x74    False   NaN     6.0
    2   I2C     data    23.679924   8.400000e-06    True    NaN     NaN     0x02    6.0
    3   I2C     start   23.679937   8.000000e-09    NaN     NaN     NaN     NaN     6.0
    4   I2C     address     23.679941   8.400000e-06    True    0x74    True    NaN     6.0
    5   I2C     data    23.679951   8.400000e-06    False   NaN     NaN     0xB2    6.0
    6   I2C     stop    23.679962   8.000000e-09    NaN     NaN     NaN     NaN     6.0
    7   I2C     start   23.679913   8.000000e-09    NaN     NaN     NaN     NaN     13.0
    8   I2C     address     23.679914   8.400000e-06    True    0x74    False   NaN     13.0
    9   I2C     data    23.679924   8.400000e-06    True    NaN     NaN     0x02    13.0
    10  I2C     start   23.679937   8.000000e-09    NaN     NaN     NaN     NaN     13.0
    11  I2C     address     23.679941   8.400000e-06    True    0x74    True    NaN     13.0
    12  I2C     data    23.679951   8.400000e-06    False   NaN     NaN     0xB2    13.0
    13  I2C     stop    23.679962   8.000000e-09    NaN     NaN     NaN     NaN     13.0
    

    You can now aggregate by group, for example:

    df.groupby('group').agg({'duration':'sum'})
    
    
        duration
    group   
    6.0     0.000034
    13.0    0.000034
    

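    To then process each transaction individually (e.g. to compare against the JSON), you can iterate the groups; each group is a small start..stop DataFrame. A sketch, with a minimal stand-in frame whose 'group' column is precomputed as above:

```python
import pandas as pd

# Minimal stand-in frame with a precomputed 'group' column
df = pd.DataFrame({
    'type':  ['start', 'address', 'data', 'stop', 'start', 'data', 'stop'],
    'data':  [None, None, '0x02', None, None, '0xB2', None],
    'group': [3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0],
})

for gid, frame in df.groupby('group'):
    # each 'frame' is one start..stop transaction
    data_bytes = frame.loc[frame['type'] == 'data', 'data'].tolist()
    print(gid, data_bytes)
```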
    【Comments】:

    • But after grouping I still want to parse each row, compare the data with the JSON, and get readable output
    • I see, that's quite a complex job. One thing you could do is bring the JSON file into a dataframe, compare, and then convert the result back to JSON; as Nk03 mentioned in the comments, df.to_dict() would also help. That said, if you go row by row as in your original question, itertuples will be faster than iterrows; here's a good article: medium.com/swlh/…
    【Solution 3】:

    It turns out that itertuples() is always faster than iterrows().

    A good starting point for understanding the speed difference is to run both solutions through a profiler. A profiler is a tool that executes the given code while keeping track of how many times each function is called and how long each takes. That way, you can focus your optimization effort on the most time-consuming functions.

    Python ships with a built-in profiler, which can conveniently be invoked from a Jupyter notebook with the %%prun cell magic.

    See https://docs.python.org/3/library/profile.html
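    Outside a notebook, the standard timeit module gives a quick comparison of the two iteration styles; a minimal sketch (the frame and functions here are illustrative, not the asker's actual workload):

```python
import timeit

import pandas as pd

df = pd.DataFrame({'type': ['start', 'data', 'stop'] * 1000})

def with_itertuples():
    # attribute access on namedtuples
    return sum(1 for row in df.itertuples() if row.type == 'stop')

def with_iterrows():
    # each row materialized as a Series, typically much slower
    return sum(1 for _, row in df.iterrows() if row['type'] == 'stop')

print(with_itertuples(), with_iterrows())  # 1000 1000
print('itertuples:', timeit.timeit(with_itertuples, number=10))
print('iterrows:  ', timeit.timeit(with_iterrows, number=10))
```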

    【Comments】:
