Pandas groupby 应用执行缓慢答案

【问题标题】：Pandas groupby apply performing slowPandas groupby 应用执行缓慢
【发布时间】：2016-02-03 13:01:59
【问题描述】：

我正在开发一个涉及大量数据的程序。我正在使用 python pandas 模块来查找我的数据中的错误。这通常工作得非常快。然而，我写的这段代码似乎比它应该的要慢得多，我正在寻找一种方法来加快它。

为了让你们正确测试它，我上传了一段相当大的代码。您应该能够按原样运行它。代码中的 cmets 应该解释我在这里尝试做什么。任何帮助将不胜感激。

# -*- coding: utf-8 -*-

import pandas as pd
import numpy as np

# Filling dataframe with data
# Just ignore this part for now, real data comes from csv files, this is an example of how it looks
TimeOfDay_options = ['Day','Evening','Night']
TypeOfCargo_options = ['Goods','Passengers']
np.random.seed(1234)
n = 10000

df = pd.DataFrame()
df['ID_number'] = np.random.randint(3, size=n)
df['TimeOfDay'] = np.random.choice(TimeOfDay_options, size=n)
df['TypeOfCargo'] = np.random.choice(TypeOfCargo_options, size=n)
df['TrackStart'] = np.random.randint(400, size=n) * 900
df['SectionStart'] = np.nan
df['SectionStop'] = np.nan

grouped_df = df.groupby(['ID_number','TimeOfDay','TypeOfCargo','TrackStart'])
for index, group in grouped_df:
    if len(group) == 1:
        df.loc[group.index,['SectionStart']] = group['TrackStart']
        df.loc[group.index,['SectionStop']] = group['TrackStart'] + 899

    if len(group) > 1:
        track_start = group.loc[group.index[0],'TrackStart']
        track_end = track_start + 899
        section_stops = np.random.randint(track_start, track_end, size=len(group))
        section_stops[-1] = track_end
        section_stops = np.sort(section_stops)
        section_starts = np.insert(section_stops, 0, track_start)

        for i,start,stop in zip(group.index,section_starts,section_stops):
            df.loc[i,['SectionStart']] = start
            df.loc[i,['SectionStop']] = stop

#%% This is what a random group looks like without errors
#Note that each section neatly starts where the previous section ended
#There are no gaps (The whole track is defined)
grouped_df.get_group((2, 'Night', 'Passengers', 323100))

#%% Introducing errors to the data
df.loc[2640,'SectionStart'] += 100
df.loc[5390,'SectionStart'] += 7

#%% This is what the same group looks like after introducing errors 
#Note that the 'SectionStop' of row 1525 is no longer similar to the 'SectionStart' of row 2640
#This track now has a gap of 100, it is not completely defined from start to end
grouped_df.get_group((2, 'Night', 'Passengers', 323100))

#%% Try to locate the errors
#This is the part of the code I need to speed up

def Full_coverage(group):
    if len(group) > 1:
        #Sort the grouped data by column 'SectionStart' from low to high

        #Updated for newer pandas version
        #group.sort('SectionStart', ascending=True, inplace=True)
        group.sort_values('SectionStart', ascending=True, inplace=True)

        #Some initial values, overwritten at the end of each loop  
        #These variables correspond to the first row of the group
        start_km = group.iloc[0,4]
        end_km = group.iloc[0,5]
        end_km_index = group.index[0]

        #Loop through all the rows in the group
        #index is the index of the row
        #i is the 'SectionStart' of the row
        #j is the 'SectionStop' of the row
        #The loop starts from the 2nd row in the group
        for index, (i, j) in group.iloc[1:,[4,5]].iterrows():

            #The start of the next row must be equal to the end of the previous row in the group
            if i != end_km: 

                #Add the faulty data to the error list
                incomplete_coverage.append(('Expected startpoint: '+str(end_km)+' (row '+str(end_km_index)+')', \
                                    'Found startpoint: '+str(i)+' (row '+str(index)+')'))                

            #Overwrite these values for the next loop
            start_km = i
            end_km = j
            end_km_index = index

    return group

#Check if the complete track is completely defined (from start to end) for each combination of:
    #'ID_number','TimeOfDay','TypeOfCargo','TrackStart'
incomplete_coverage = [] #Create empty list for storing the error messages
df_grouped = df.groupby(['ID_number','TimeOfDay','TypeOfCargo','TrackStart']).apply(lambda x: Full_coverage(x))

#Print the error list
print('\nFound incomplete coverage in the following rows:')
for i,j in incomplete_coverage:
    print(i)
    print(j)
    print() 

#%%Time the procedure -- It is very slow, taking about 6.6 seconds on my pc
%timeit df.groupby(['ID_number','TimeOfDay','TypeOfCargo','TrackStart']).apply(lambda x: Full_coverage(x))

【问题讨论】：

您是否尝试过使用分析器查看瓶颈在哪里？
瓶颈似乎是应用函数，即使我删除了函数中的 for 循环，它仍然很慢（每个循环约 4.25 秒）。我想知道是否有另一种方法来应用该功能（没有应用命令）。我使用 agg 命令对此代码中的数据执行了一些其他过程。这工作得更快，但我不知道是否可以使用 agg 命令执行此检查（full_coverage）。
瓶颈肯定在你应用的函数中。您的数据中有超过 5300 个不同的组。只需在 5300 个群组上调用 sort 就需要几秒钟。然后迭代这 5300 个组中的每个组中的所有值将需要几秒钟。我建议删除for 循环以支持矢量化操作——您可以使用这种策略将运行时间缩短到~2-3 秒。如果这仍然太慢，那么您需要弄清楚如何在不对每个组中的数据进行排序的情况下执行此操作。

标签： python python-3.x pandas

【解决方案1】：

我认为，问题在于您的数据有 5300 个不同的组。因此，您的功能中的任何缓慢都会被放大。您可能可以在函数中使用矢量化操作而不是 for 循环来节省时间，但更简单的节省时间的方法是使用 return 0 而不是 return group。当您return group 时，pandas 实际上会创建一个新的数据对象，并结合您的排序组，而您似乎没有使用它。当你return 0 时，pandas 会合并 5300 个零，这样更快。

例如：

cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
groups = df.groupby(cols)
print(len(groups))
# 5353

%timeit df.groupby(cols).apply(lambda group: group)
# 1 loops, best of 3: 2.41 s per loop

%timeit df.groupby(cols).apply(lambda group: 0)
# 10 loops, best of 3: 64.3 ms per loop

仅结合您不使用的结果大约需要 2.4 秒；其余时间是循环中的实际计算，您应该尝试对其进行矢量化。

编辑：

通过在 for 循环之前进行快速的额外矢量化检查并返回 0 而不是 group，我将时间缩短到大约 2 秒，这基本上是对每个组进行排序的成本。试试这个功能：

def Full_coverage(group):
    if len(group) > 1:
        group = group.sort('SectionStart', ascending=True)

        # this condition is sufficient to find when the loop
        # will add to the list
        if np.any(group.values[1:, 4] != group.values[:-1, 5]):
            start_km = group.iloc[0,4]
            end_km = group.iloc[0,5]
            end_km_index = group.index[0]

            for index, (i, j) in group.iloc[1:,[4,5]].iterrows():
                if i != end_km:
                    incomplete_coverage.append(('Expected startpoint: '+str(end_km)+' (row '+str(end_km_index)+')', \
                                        'Found startpoint: '+str(i)+' (row '+str(index)+')'))                
                start_km = i
                end_km = j
                end_km_index = index

    return 0

cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
%timeit df.groupby(cols).apply(Full_coverage)
# 1 loops, best of 3: 1.74 s per loop

编辑 2：这是一个示例，其中包含我的建议，将排序移到 groupby 之外并删除不必要的循环。对于给定的示例，删除循环并没有快多少，但如果有很多不完整的，会更快：

def Full_coverage_new(group):
    if len(group) > 1:
        mask = group.values[1:, 4] != group.values[:-1, 5]
        if np.any(mask):
            err = ('Expected startpoint: {0} (row {1}) '
                   'Found startpoint: {2} (row {3})')
            incomplete_coverage.extend([err.format(group.iloc[i, 5],
                                                   group.index[i],
                                                   group.iloc[i + 1, 4],
                                                   group.index[i + 1])
                                        for i in np.where(mask)[0]])
    return 0

incomplete_coverage = []
cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
df_s = df.sort_values(['SectionStart','SectionStop'])
df_s.groupby(cols).apply(Full_coverage_nosort)

【讨论】：

哇，我的最后一条评论被我删除了。这个功能让我的时间从 6.6 秒/循环减少到 1.6 秒。惊人的进步！你认为有一种方法可以在循环之外对数据进行排序吗？我一直在测试一些东西，但总是得到错误的结果。
是的 - groupby 保留顺序，因此您可以先按 "SectionStart" 对数据帧进行排序，然后使用上述函数删除排序步骤。对于后跟 groupby 的完整排序，我得到 0.4 秒。
这正是我一直在尝试的，虽然它显着加快了进程，但存储在列表中的结果似乎是错误的。我在运行 groupby 之前尝试了 df.sort('SectionStart', ascending=True, inplace=True) ，但错误列表现在似乎包含 8 个错误。显然我只在数据中引入了两个错误。
已解决，似乎我的随机数据在组内包含一些重复的“SectionStop”值。我使用 df.sort(['SectionStart','SectionStop'], ascending=True, inplace=True) 解决了它。
嗯，您的第二次编辑似乎比我的回答慢了很多，每个循环的运行速度约为 510 毫秒。此外，当发现大量错误时，使用 iloc 和 index 命令会进一步减慢速度。

【解决方案2】：

我发现 pandas 的定位命令（.loc 或 .iloc）也减慢了进度。通过将排序移出循环并在函数开始时将数据转换为 numpy 数组，我得到了更快的结果。我知道数据不再是数据框，但列表中返回的索引可用于查找原始 df 中的数据。

如果有任何方法可以进一步加快进程，我将不胜感激。到目前为止我所拥有的：

def Full_coverage(group):

    if len(group) > 1:
        group_index = group.index.values
        group = group.values

        # this condition is sufficient to find when the loop will add to the list
        if np.any(group[1:, 4] != group[:-1, 5]):
            start_km = group[0,4]
            end_km = group[0,5]
            end_km_index = group_index[0]

            for index, (i, j) in zip(group_index, group[1:,[4,5]]):

                if i != end_km:
                    incomplete_coverage.append(('Expected startpoint: '+str(end_km)+' (row '+str(end_km_index)+')', \
                                        'Found startpoint: '+str(i)+' (row '+str(index)+')'))               
                start_km = i
                end_km = j
                end_km_index = index

    return 0

incomplete_coverage = []
df.sort(['SectionStart','SectionStop'], ascending=True, inplace=True)
cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
%timeit df.groupby(cols).apply(Full_coverage)
# 1 loops, best of 3: 272 ms per loop

【讨论】：

使用np.any 检查，您可能可以完全删除for 循环。在很多不匹配的情况下，这将加快速度。否则，约 5000 个不同组的逻辑约 0.2 秒可能与您希望的一样好。
我一直试图摆脱循环，但我似乎找不到任何方法来为每个错误添加单独的打印行，而不使用循环。你有什么想法吗？