【Title】: Pandas Dataframe Processing is Very Slow
【Posted】: 2018-03-20 03:43:52
【Question】:

I'm working with a dataset that contains a datetime column and a variable I'm interested in. I want to group the data into 15-minute bins, so I wrote the code below: it computes a lower and an upper date bound, builds a list of datetime objects spaced 15 minutes apart, sums the variable of interest between each consecutive pair of datetimes, and puts the sums into a new dataframe. But it runs very slowly (about five hours for 75,000 rows) and I can't figure out why. Can anyone point out what's wrong with the code?

here is a small data sample, if you want to test the code yourself.

import pandas as pd
from tqdm import tnrange  # progress-bar version of range

def create_sales_with_intervals(df, tank_id_col='tank_id'):
    tank_id = df.iloc[0][tank_id_col]
    tank_dates = get_date_range(df)
    tank_sales =[]

    for idx in tnrange(len(tank_dates) - 1):
        t1 = tank_dates[idx]
        t2 = tank_dates[idx+1]

        sales = get_sales_between(df, t1, t2)

        row={}
        row['start_date'] = t1
        row['end_date'] = t2
        row['total_sale'] = sales
        row['tank_id'] = tank_id
        tank_sales.append(row)

    return pd.DataFrame(tank_sales, columns=['tank_id', 'start_date', 'end_date', 'total_sale'])


def get_date_range(df_tank, date_col='date_time', freq='15MIN'):
    start_date = df_tank.iloc[0][date_col]
    end_date = df_tank.iloc[-1][date_col]

    lower_bound = find_interval(start_date, 'lower')
    upper_bound = find_interval(end_date, 'upper')

    start_date_rounded = round_time(start_date, lower_bound) # Rounds the minute portion of the datetime object to nearest lower bound (0, 15, 30 , 45)
    end_date_rounded = round_time(end_date, upper_bound) # Rounds the minute portion of the datetime object to nearest upper bound (0, 15, 30 , 45)

    tank_dates = pd.date_range(start_date_rounded, end_date_rounded, freq=freq)
    return tank_dates
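The question never shows `find_interval` or `round_time`; based on the comments above ("rounds the minute portion to the nearest lower/upper bound (0, 15, 30, 45)"), a hypothetical equivalent (the names below are assumptions, not the OP's code) could use pandas' built-in timestamp rounding:

```python
import pandas as pd

def round_down_15(ts):
    # Hypothetical stand-in for the lower-bound find_interval/round_time pair:
    # floor the timestamp to the previous 15-minute mark (00, 15, 30, 45).
    return pd.Timestamp(ts).floor('15min')

def round_up_15(ts):
    # Upper-bound counterpart: ceil to the next 15-minute mark.
    return pd.Timestamp(ts).ceil('15min')
```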

def get_sales_between(df, t1, t2, date_col='date_time', sale_col='sold'):
    # Rows strictly between t1 and t2; note this scans the whole frame on every call
    cond1 = df[df[date_col] > t1]
    cond2 = df[df[date_col] < t2]

    idx = cond1.index.intersection(cond2.index)
    total_sale = df.loc[idx, sale_col].sum()
    return total_sale
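The filtering in `get_sales_between` rebuilds two full boolean masks on every call, so the loop as a whole costs O(rows × intervals); that, not pandas itself, is the likely bottleneck. A sketch of the same binning done with one binary search per boundary (assuming the frame is sorted by date, and using left-closed bins rather than the strict inequalities above):

```python
import numpy as np
import pandas as pd

def sales_per_interval(df, boundaries, date_col='date_time', sale_col='sold'):
    # One binary search per boundary instead of two full scans per interval.
    # Bins are left-closed: boundaries[i] <= t < boundaries[i+1].
    positions = np.searchsorted(df[date_col].values, boundaries.values)
    return [df[sale_col].iloc[positions[i]:positions[i + 1]].sum()
            for i in range(len(positions) - 1)]
```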

【Comments】:

  • Could you provide some data for testing/benchmarking? We don't need all 75,000 rows, but 10 would probably be enough.
  • here is some sample data. Just copy and paste it into a text editor and save it as a csv.
  • Sorry, we treat questions/answers as a resource for all users. So please post some representative data in the question itself, not as a link (or image).
  • So I added a link to a relevant sample dataset. I would attach the file, but I don't think SO supports file uploads. Here is an answer about uploading csv files that says to upload the file and share the link.
  • Well, unfortunately I can't access the link above. Good luck!

Tags: python pandas


【Solution 1】:

Once you have a DatetimeIndex, consider using the pd.DataFrame.resample() method:

# your sample dataframe
df = pd.DataFrame(
    {
        'date_time': {0: '2015-01-02 23:18:00',
                      1: '2015-01-03 01:00:00',
                      2: '2015-01-03 02:42:00',
                      3: '2015-01-03 04:24:00',
                      4: '2015-01-03 06:06:00',
                      5: '2015-01-03 07:48:00',
                      6: '2015-01-03 09:30:00',
                      7: '2015-01-03 11:12:00',
                      8: '2015-01-03 12:54:00',
                      9: '2015-01-03 14:36:00'},
         'sold': {0: 78.3,
                  1: 0.0,
                  2: 112.9,
                  3: 13.8,
                  4: 32.0,
                  5: 95.1,
                  6: 56.4,
                  7: 28.3,
                  8: 0.0,
                  9: 0.0},
         'tank_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}
    })


df
Out[3]: 
             date_time  sold  tank_id
0  2015-01-02 23:18:00  78.3        1
1  2015-01-03 01:00:00     0        1
2  2015-01-03 02:42:00 112.9        1
3  2015-01-03 04:24:00  13.8        1
4  2015-01-03 06:06:00    32        1
5  2015-01-03 07:48:00  95.1        1
6  2015-01-03 09:30:00  56.4        1
7  2015-01-03 11:12:00  28.3        1
8  2015-01-03 12:54:00     0        1
9  2015-01-03 14:36:00     0        1

# convert your timestamps to `pd.Timestamp` objects
df['date_time'] = pd.to_datetime(df['date_time'])

# give the dataframe a `DatetimeIndex`
df.set_index('date_time', inplace=True)

df
Out[6]: 
                     sold  tank_id
date_time                         
2015-01-02 23:18:00  78.3        1
2015-01-03 01:00:00     0        1
2015-01-03 02:42:00 112.9        1
2015-01-03 04:24:00  13.8        1
2015-01-03 06:06:00    32        1
2015-01-03 07:48:00  95.1        1
2015-01-03 09:30:00  56.4        1
2015-01-03 11:12:00  28.3        1
2015-01-03 12:54:00     0        1
2015-01-03 14:36:00     0        1

# resample the `sold` column into 15-minute chunks and sum each chunk
df['sold'].resample('15T').sum()
Out[8]: 
date_time
2015-01-02 23:15:00    78.3
2015-01-02 23:30:00       0
2015-01-02 23:45:00       0
2015-01-03 00:00:00       0
2015-01-03 00:15:00       0
2015-01-03 00:30:00       0
2015-01-03 00:45:00       0
2015-01-03 01:00:00       0
2015-01-03 01:15:00       0
2015-01-03 01:30:00       0
2015-01-03 01:45:00       0
2015-01-03 02:00:00       0
2015-01-03 02:15:00       0
2015-01-03 02:30:00   112.9
2015-01-03 02:45:00       0
2015-01-03 03:00:00       0
2015-01-03 03:15:00       0
2015-01-03 03:30:00       0
2015-01-03 03:45:00       0
2015-01-03 04:00:00       0
2015-01-03 04:15:00    13.8
2015-01-03 04:30:00       0
2015-01-03 04:45:00       0
2015-01-03 05:00:00       0
2015-01-03 05:15:00       0
2015-01-03 05:30:00       0
2015-01-03 05:45:00       0
2015-01-03 06:00:00      32
2015-01-03 06:15:00       0
2015-01-03 06:30:00       0
                       ...
2015-01-03 07:15:00       0
2015-01-03 07:30:00       0
2015-01-03 07:45:00    95.1
2015-01-03 08:00:00       0
2015-01-03 08:15:00       0
2015-01-03 08:30:00       0
2015-01-03 08:45:00       0
2015-01-03 09:00:00       0
2015-01-03 09:15:00       0
2015-01-03 09:30:00    56.4
2015-01-03 09:45:00       0
2015-01-03 10:00:00       0
2015-01-03 10:15:00       0
2015-01-03 10:30:00       0
2015-01-03 10:45:00       0
2015-01-03 11:00:00    28.3
2015-01-03 11:15:00       0
2015-01-03 11:30:00       0
2015-01-03 11:45:00       0
2015-01-03 12:00:00       0
2015-01-03 12:15:00       0
2015-01-03 12:30:00       0
2015-01-03 12:45:00       0
2015-01-03 13:00:00       0
2015-01-03 13:15:00       0
2015-01-03 13:30:00       0
2015-01-03 13:45:00       0
2015-01-03 14:00:00       0
2015-01-03 14:15:00       0
2015-01-03 14:30:00       0
Freq: 15T, Name: sold, Length: 62, dtype: float64

You can find more information in the pandas documentation here.
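If the real dataset holds several tanks, the same resample idea extends with a groupby (a sketch on made-up rows, not the OP's data):

```python
import pandas as pd

df = pd.DataFrame({
    'date_time': pd.to_datetime(['2015-01-03 01:00', '2015-01-03 01:10',
                                 '2015-01-03 01:00', '2015-01-03 01:20']),
    'sold': [10.0, 5.0, 7.0, 3.0],
    'tank_id': [1, 1, 2, 2],
})

# One 15-minute sum series per tank; the result has a
# (tank_id, date_time) MultiIndex.
per_tank = (df.set_index('date_time')
              .groupby('tank_id')['sold']
              .resample('15min')
              .sum())
```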

【Discussion】:
