Pandas - Generate sequences of consecutive rows (by timestamp) over grouped values: Answers

【Question title】: Pandas - Generate sequences of consecutive rows (by timestamp) over grouped by values
【Posted】: 2023-03-24 16:16:01
【Question description】:

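I am developing an LSTM-based network, and I need to use Pandas to model sequences of occurrences of the values in another column, where each sequence must be bounded in length.

A practical use case: I have several machines with logs, and each log is tagged with a title and a timestamp (for the sake of the example, t1 < t2 < t3 ...). The initial dataframe looks like this: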

import pandas as pd

d = {'timestamp': ['t1', 't2', 't1', 't3', 't2', 't2', 't1'],
     'machine': ['M1', 'M2', 'M2', 'M1', 'M2', 'M1', 'M3'],
     'log': ['A', 'B', 'A', 'C', 'A', 'A', 'B']}

df = pd.DataFrame(d)
print(df.head(7))

  timestamp machine log
0        t1      M1   A
1        t2      M2   B
2        t1      M2   A
3        t3      M1   C
4        t2      M2   A
5        t2      M1   A
6        t1      M3   B

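What I want to get is a dataframe of sequences per machine, each at most max_len long. For max_len = 2, the desired output should look like this: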

max_len = 2

  machine sequence
0      M1   [A, A]  # index from original df: [0, 5]
1      M1   [A, C]  # index from original df: [5, 3]
2      M2   [A, A]  # index from original df: [2, 4]
3      M2   [A, B]  # index from original df: [4, 1]
4      M3      [B]  # index from original df: [6]

The sequences are bounded by max_len = 2, and their elements are ordered by timestamp.

max_len = 3

  machine   sequence
0      M1  [A, A, C]  # index from original df: [0, 5, 3]
1      M2  [A, A, B]  # index from original df: [2, 4, 1]
2      M3        [B]  # index from original df: [6]

The sequences are bounded by max_len = 3, and their elements are ordered by timestamp.

Note: the max_len parameter is an upper bound on the sequence length; I will pad short sequences (such as M3's) to fit the LSTM requirements.
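The padding step mentioned in this note could look roughly like the sketch below. The helper name `pad_sequence`, the `"<PAD>"` token, and the choice to pad on the right are all assumptions made for illustration; the question does not say which token or which side is used.

```python
def pad_sequence(seq, max_len, pad_token="<PAD>"):
    """Right-pad seq with pad_token until it has max_len elements.

    pad_token and right-side padding are illustrative assumptions.
    """
    if len(seq) > max_len:
        raise ValueError("sequence is longer than max_len")
    return list(seq) + [pad_token] * (max_len - len(seq))


# M3's short sequence from the question, padded for max_len = 3
padded_m3 = pad_sequence(["B"], 3)
# A sequence that is already full length is returned unchanged
padded_m1 = pad_sequence(["A", "A", "C"], 3)
print(padded_m3, padded_m1)
```

In practice the pad token would usually be mapped to a reserved integer index before the sequences are fed to the network.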

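Note 2: I am actually grouping by 2 columns, but to keep this example as minimal as possible I included only 1.

What I have tried so far:

So far I have been using PySpark, but I got it wrong by applying the F.lag function step by step. That left many useless partial sequences, from which I could not identify the short sequences that need padding, and this naive approach is slow and generally not very good.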

from pyspark.sql import Window, functions as F

w = Window.partitionBy('machine').orderBy('timestamp')
for i in range(max_len):
   df = df.withColumn(f"log_lag_{i}", F.lag('log', i-1).over(w))

I would really appreciate help solving this with Pandas; I have been trying for a long time without success.

Thanks!

【Question discussion】:

Tags: python pandas data-preprocessing

【Solution 1】:

Let's try itertools

import itertools

df = (
    df.assign(
        log1=df.groupby('machine')['log']
        # all 2-element combinations per machine, each sorted alphabetically
        .apply(lambda x: [sorted(i) for i in itertools.combinations(x, 2)])
        .explode()                  # explode the combinations into rows
        .reset_index(drop=True)     # drop the machine index
        .combine_first(df['log'])   # fill in the raw log where the value is null
        .astype(str)                # convert the lists into strings
    )
    .drop_duplicates(subset=['log', 'log1'])  # drop duplicates
    .drop(columns='timestamp')      # drop the column (positional axis was removed in pandas 2.0)
)

   machine log        log1
0      M1   A  ['A', 'C']
1      M2   B  ['A', 'A']
3      M1   C  ['A', 'B']
4      M2   A  ['A', 'B']
5      M1   A  ['A', 'A']
6      M3   B           B
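One point worth keeping in mind when reading this output: `itertools.combinations(x, 2)` yields every pair of elements in a group, not only pairs that sit next to each other, whereas the question asks for consecutive rows. The small sketch below contrasts the two on M1's logs taken in timestamp order; the variable names are illustrative only.

```python
import itertools

# M1's logs in timestamp order (t1, t2, t3), taken from the question
logs = ["A", "A", "C"]

# Every 2-element combination, including the non-adjacent (first, last) pair
all_pairs = list(itertools.combinations(logs, 2))

# Only adjacent pairs, i.e. sliding windows of length 2
adjacent_pairs = list(zip(logs, logs[1:]))

print(all_pairs)       # [('A', 'A'), ('A', 'C'), ('A', 'C')]
print(adjacent_pairs)  # [('A', 'A'), ('A', 'C')]
```

For groups with more than two rows, the combinations therefore contain extra pairs that skip over intermediate rows.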

【Discussion】:

Thanks for your answer. One question: you use sorted() to sort the elements within each sequence, but I need them ordered by timestamp. How could this be modified? Would preprocessing with df.sort_values(by=['machine', 'timestamp']) solve it? Thanks again!
Do you mean df=(df.assign(log1=df.groupby('machine')['log'].apply(lambda x: list(sorted(i) for i in (itertools.combinations(x, 2)))).explode().reset_index(drop=True).combine_first(df['log']).astype(str)).drop_duplicates(subset=['log','log1']).drop(columns='timestamp')).sort_values(['machine', 'log', 'log1'])?
I mean df.sort_values(by=['machine', 'timestamp']).assign(..., because the sorted method you use inside the apply call only sorts by the log values by default, and I need the logs ordered by timestamp.
Are you after df=(df.sort_values(by=['timestamp'], ascending=False).assign(log1=df.groupby('machine')['log'].apply(lambda x: list(sorted(i) for i in (itertools.combinations(x, 2)))).explode().reset_index(drop=True).combine_first(df['log']).astype(str)).drop_duplicates(subset=['log','log1']).drop(columns='timestamp'))?
Thank you very much. With a few tweaks to fit my dataset, your solution works great. Thanks again!
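For readers who want the timestamp-ordered windows from the question directly, one possible approach is to sort first and then slide a window over each group. This is only a sketch and not the accepted answer's method. The function name `sliding_sequences` and its parameters are hypothetical, and breaking timestamp ties by the `log` value is an assumption: the question does not specify tie-breaking, and this choice happens to match its expected output for M2.

```python
import pandas as pd


def sliding_sequences(df, group_cols, order_cols, value_col, max_len):
    """Return one row per window of up to max_len consecutive values per group.

    Groups shorter than max_len yield a single short sequence (to be padded later).
    """
    rows = []
    ordered = df.sort_values(order_cols)
    for key, grp in ordered.groupby(group_cols, sort=True):
        # Depending on the pandas version, key is a scalar or a tuple
        key = key if isinstance(key, tuple) else (key,)
        values = grp[value_col].tolist()
        n_windows = max(len(values) - max_len + 1, 1)
        for start in range(n_windows):
            row = dict(zip(group_cols, key))
            row["sequence"] = values[start:start + max_len]
            rows.append(row)
    return pd.DataFrame(rows)


# Sample data from the question
df = pd.DataFrame({
    "timestamp": ["t1", "t2", "t1", "t3", "t2", "t2", "t1"],
    "machine": ["M1", "M2", "M2", "M1", "M2", "M1", "M3"],
    "log": ["A", "B", "A", "C", "A", "A", "B"],
})

# Ties on timestamp are broken by log value (an assumption, see above)
out2 = sliding_sequences(df, ["machine"], ["timestamp", "log"], "log", max_len=2)
out3 = sliding_sequences(df, ["machine"], ["timestamp", "log"], "log", max_len=3)
print(out2)
print(out3)
```

On the sample data this reproduces the two expected tables from the question. Grouping by two columns, as mentioned in Note 2, would only require passing both names in `group_cols`. The string timestamps here sort lexicographically, so real data would normally use a datetime column instead.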