Time series sliding window with occurrence counts: answers

【Question title】: Time series sliding window with occurrence counts
【Posted】: 2017-12-29 07:27:32
【Question】:

I am trying to count values that come with timestamps.

For example:

time    letter
  1     A
  4     B
  5     C
  9     C
  18    B
  30    A
  30    B

I am splitting the time range into windows, (1 + 30) / 30, and I want to know how many A, B and C fall into each time window of size 1:

timeseries  A  B  C
1           1  0  0
2           0  0  0
...
30          1  1  0

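This should give me a table with 30 rows and 3 columns (A, B, C) holding the number of occurrences.

The problem is that breaking the data down takes a very long time, because for every window the code walks the whole master table again to slice the data, even though the data is already sorted.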

master = mytable  

minimum = master.timestamp.min()
maximum = master.timestamp.max()

window = (minimum + maximum) / maximum

wstart = minimum
wend = minimum + window

concurrent_tasks = []

while wstart <= maximum:
    As = 0
    Bs = 0
    Cs = 0
    # Every window rescans the whole table; this is what makes the loop slow.
    for d, row in master.iterrows():
        ttime = row.timestamp
        if wstart <= ttime < wend:
            if row.channel == 'A':
                As = As + 1
            elif row.channel == 'B':
                Bs = Bs + 1
            elif row.channel == 'C':
                Cs = Cs + 1

    # The post used an undefined name m_id here; the window start is
    # assumed as the row identifier so that the snippet runs.
    concurrent_tasks.append([wstart, As, Bs, Cs])

    wstart = wstart + window
    wend = wend + window

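Can you help me improve the performance? I would like to use a map function, and I want to stop Python from looping over the whole table on every iteration.

This is part of a big-data job, and it takes days to finish.

Thanks

【Comments】:

Tags: python-2.7 pandas dataframe time-series

【Solution 1】:

There is a faster way - pd.get_dummies():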

In [116]: pd.get_dummies(df.set_index('time')['letter'])
Out[116]:
      A  B  C
time
1     1  0  0
4     0  1  0
5     0  0  1
9     0  0  1
18    0  1  0
30    1  0  0
30    0  1  0

If you want to "compress" (group) it by time:

In [146]: pd.get_dummies(df.set_index('time')['letter']).groupby(level=0).sum()
Out[146]:
      A  B  C
time
1     1  0  0
4     0  1  0
5     0  0  1
9     0  0  1
18    0  1  0
30    1  1  0
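
The snippets above assume a DataFrame named df that the answer never constructs. A minimal sketch that rebuilds it from the sample table in the question and repeats the two steps is given below; the column names time and letter follow the answer, and the variable names dummies and counts are only illustrative. On recent pandas versions the indicator columns are boolean, so the first table may print True/False instead of 1/0, while the grouped sums are still integers.

```python
import pandas as pd

# Sample data copied from the question (column names as used in the answer).
df = pd.DataFrame({
    "time":   [1, 4, 5, 9, 18, 30, 30],
    "letter": ["A", "B", "C", "C", "B", "A", "B"],
})

# One indicator column per letter, one row per original observation.
dummies = pd.get_dummies(df.set_index("time")["letter"])

# Collapse rows that share a timestamp by summing the indicators.
counts = dummies.groupby(level=0).sum()

print(counts)
```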

Or use sklearn.feature_extraction.text.CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(token_pattern=r"\b\w+\b", stop_words=None)

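# Note: pd.SparseDataFrame was removed in pandas 1.0; on newer versions
# pd.DataFrame.sparse.from_spmatrix(matrix, index=..., columns=...) plays the same role.
# CountVectorizer orders its features alphabetically, so the columns= argument below
# only lines up when the letters first appear in that order, as they do here.
r = pd.SparseDataFrame(cv.fit_transform(df.groupby('time')['letter'].agg(' '.join)),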
                       index=df['time'].unique(),
                       columns=df['letter'].unique(),
                       default_fill_value=0)

Result:

In [143]: r
Out[143]:
    A  B  C
1   1  0  0
4   0  1  0
5   0  0  1
9   0  0  1
18  0  1  0
30  1  1  0

If we want to list all times from 1 to 30:

In [153]: r.reindex(np.arange(r.index.min(), r.index.max()+1)).fillna(0).astype(np.int8)
Out[153]:
    A  B  C
1   1  0  0
2   0  0  0
3   0  0  0
4   0  1  0
5   0  0  1
6   0  0  0
7   0  0  0
8   0  0  0
9   0  0  1
10  0  0  0
11  0  0  0
12  0  0  0
13  0  0  0
14  0  0  0
15  0  0  0
16  0  0  0
17  0  0  0
18  0  1  0
19  0  0  0
20  0  0  0
21  0  0  0
22  0  0  0
23  0  0  0
24  0  0  0
25  0  0  0
26  0  0  0
27  0  0  0
28  0  0  0
29  0  0  0
30  1  1  0

Or, using only Pandas methods:

In [159]: pd.get_dummies(df.set_index('time')['letter']) \
     ...:   .groupby(level=0) \
     ...:   .sum() \
     ...:   .reindex(np.arange(df['time'].min(), df['time'].max()+1), fill_value=0)
     ...:
Out[159]:
      A  B  C
time
1     1  0  0
2     0  0  0
3     0  0  0
4     0  1  0
5     0  0  1
6     0  0  0
7     0  0  0
8     0  0  0
9     0  0  1
10    0  0  0
...  .. .. ..
21    0  0  0
22    0  0  0
23    0  0  0
24    0  0  0
25    0  0  0
26    0  0  0
27    0  0  0
28    0  0  0
29    0  0  0
30    1  1  0

[30 rows x 3 columns]
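
The question asks about time windows in general, while the output above covers the window-of-size-1 case. A rough sketch of how the same get_dummies / groupby / reindex chain might be extended to wider windows follows. It assumes integer timestamps; the helper name window_counts, its width parameter and the window_start index label are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "time":   [1, 4, 5, 9, 18, 30, 30],
    "letter": ["A", "B", "C", "C", "B", "A", "B"],
})

def window_counts(frame, width=1):
    """Count letters per window of `width` time units (hypothetical helper)."""
    start = frame["time"].min()
    # Window number of each row: 0 covers [start, start + width), 1 the next, ...
    win = (frame["time"] - start) // width
    counts = pd.get_dummies(frame["letter"]).groupby(win).sum()
    # Add all-zero rows for windows that contain no observations.
    counts = counts.reindex(np.arange(int(win.max()) + 1), fill_value=0)
    # Label each row by the time at which its window starts.
    counts.index = start + counts.index * width
    counts.index.name = "window_start"
    return counts

per_unit = window_counts(df, width=1)   # 30 rows, as in the output above
per_five = window_counts(df, width=5)   # 6 rows: windows starting at 1, 6, ..., 26
```

Binning with integer division keeps everything vectorized, so the table is scanned once rather than once per window.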

Update:

Timings:

In [163]: df = pd.concat([df] * 10**4, ignore_index=True)

In [164]: %timeit pd.get_dummies(df.set_index('time')['letter'])
100 loops, best of 3: 10.9 ms per loop

In [165]: %timeit df.set_index('time').letter.str.get_dummies()
1 loop, best of 3: 914 ms per loop

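【Discussion】:

reindex with fill_value=0
@piRSquared, yes, I always forget that parameter. Thanks a lot! :)
Not sure which is faster, but this should work too: df.set_index('time').letter.str.get_dummies(). Yours is probably faster.
@piRSquared, wow! Look at the timings - I'm shocked... I didn't expect such a difference...
That's good information. I'll try to avoid the string accessor.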