【Title】: Python/Pandas: extract intervals from a large dataframe
【Posted】: 2020-06-15 18:17:37
【Question】:

I have two pandas DataFrames:

  1. 20 million rows of continuous time-series data with a datetime index (df) [image]
  2. 20,000 rows with two timestamps each (df_seq) [image]

I want to use the second DataFrame to extract all sequences from the first (all rows of the first whose timestamps lie between the two timestamps of each row of the second). Each sequence then needs to be transposed into 990 columns, and all transposed sequences must be merged into one new DataFrame.

The new DataFrame therefore has one row of 990 columns per sequence [image] (a case row is added later).

Right now my code looks like this:

sequences = pd.DataFrame()

for row in df_seq.itertuples(index=True, name='Pandas'):
    sequences = sequences.append(df.loc[row.date:row.end_date].reset_index(drop=True)[:990].transpose())

sequences = sequences.reset_index(drop=True)

This code works, but it is very slow: 20-25 minutes of execution time.

Is there a way to rewrite this as a vectorized operation, or any other way to improve the performance of this code?
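As a side note, `DataFrame.append` inside a loop recopies the accumulated frame on every iteration; collecting the slices in a list and concatenating once is usually already much faster, independent of any vectorization. A minimal sketch with small synthetic stand-ins for `df` and `df_seq` (the names and column layout are taken from the question; the data is illustrative):

```python
import pandas as pd

# Synthetic stand-ins for the two frames described above.
inx = pd.date_range("2020-01-01", freq="1s", periods=100)
df = pd.DataFrame({"feature_1": range(len(inx))}, index=inx)
df_seq = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01 00:00:10", "2020-01-01 00:00:40"]),
    "end_date": pd.to_datetime(["2020-01-01 00:00:19", "2020-01-01 00:00:49"]),
})

# Collect each transposed slice in a list and concatenate once at the end,
# instead of calling DataFrame.append inside the loop (quadratic copying).
pieces = [
    df.loc[row.date:row.end_date].reset_index(drop=True)[:990].transpose()
    for row in df_seq.itertuples(index=False)
]
sequences = pd.concat(pieces, ignore_index=True)
```

This keeps the per-interval slicing logic unchanged and only removes the repeated copying.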

【Comments】:

Tags: python pandas vectorization


【Solution 1】:

Here is one approach. The big dataframe is 'df', and the one with the intervals is called 'intervals':

import pandas as pd

inx = pd.date_range(start="2020-01-01", freq="1s", periods=1000)
df = pd.DataFrame(range(len(inx)), index=inx)
df.index.name = "timestamp"

intervals = pd.DataFrame([("2020-01-01 00:00:12","2020-01-01 00:00:18"), 
                   ("2020-01-01 00:01:20","2020-01-01 00:02:03")], 
                  columns=["start_time", "end_time"])

intervals.start_time = pd.to_datetime(intervals.start_time)
intervals.end_time = pd.to_datetime(intervals.end_time)
intervals

t = pd.merge_asof(df.reset_index(), intervals[["start_time"]], left_on="timestamp", right_on="start_time", )
t = pd.merge_asof(t, intervals[["end_time"]], left_on="timestamp", right_on="end_time", direction="forward")

t = t[(t.timestamp >= t.start_time) & (t.timestamp <= t.end_time)]

The result is:

              timestamp    0          start_time            end_time
12  2020-01-01 00:00:12   12 2020-01-01 00:00:12 2020-01-01 00:00:18
13  2020-01-01 00:00:13   13 2020-01-01 00:00:12 2020-01-01 00:00:18
14  2020-01-01 00:00:14   14 2020-01-01 00:00:12 2020-01-01 00:00:18
15  2020-01-01 00:00:15   15 2020-01-01 00:00:12 2020-01-01 00:00:18
16  2020-01-01 00:00:16   16 2020-01-01 00:00:12 2020-01-01 00:00:18
..                  ...  ...                 ...                 ...
119 2020-01-01 00:01:59  119 2020-01-01 00:01:20 2020-01-01 00:02:03
120 2020-01-01 00:02:00  120 2020-01-01 00:01:20 2020-01-01 00:02:03
121 2020-01-01 00:02:01  121 2020-01-01 00:01:20 2020-01-01 00:02:03
122 2020-01-01 00:02:02  122 2020-01-01 00:01:20 2020-01-01 00:02:03
123 2020-01-01 00:02:03  123 2020-01-01 00:01:20 2020-01-01 00:02:03

【Comments】:

【Solution 2】:

Following the steps in the answer above, I added a groupby and an unstack, and the result is exactly the df I needed:

Execution time is now about 30 seconds!

The full code now looks like this:

    sequences = pd.merge_asof(df, df_seq[["date"]], left_on="timestamp", right_on="date", )
    sequences = pd.merge_asof(sequences, df_seq[["end_date"]], left_on="timestamp", right_on="end_date", direction="forward")
    sequences = sequences[(sequences.timestamp >= sequences.date) & (sequences.timestamp <= sequences.end_date)]
    
    sequences = sequences.groupby('date')['feature_1'].apply(lambda df_temp: df_temp.reset_index(drop=True)).unstack().loc[:,:990]
    sequences = sequences.reset_index(drop=True)
    

【Comments】:
