Answers to: Python Pandas equivalent for SQL case statement using lead and lag window function

【Question title】: Python Pandas equivalent for SQL case statement using lead and lag window function
【Posted】: 2019-03-24 22:25:57
【Question】:

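Python newbie here, looking to see whether there is a more elegant solution.

I have time-series data from a telematics device with a motion indicator. I need to extend the motion indicator to +/- 1 row around the points where motion actually starts and stops (shown by the motion2 column below). I was doing this in SQL with a case statement and the lead and lag window functions. Now I am trying to convert my code to Python...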

Here is the data.

import pandas as pd

data = {'device':[1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2], 
    'time':[1,2,3,4,5,6,7,8,9,10,11,12,5,6,7,8,9,10,11,12,13,14],
    'motion':[0,0,1,1,1,0,0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0]}

df = pd.DataFrame.from_dict(data)
df = df[['device','time','motion']]

##sort data chronologically for each device
df.sort_values(['device','time'], ascending = True, inplace = True)

This is what df looks like:

device, time, motion
1,1,0
1,2,0
1,3,1
1,4,1
1,5,1
1,6,0
1,7,0
1,8,0
1,9,1
1,10,1
1,11,0
1,12,0
2,5,0
2,6,0
2,7,0
2,8,1
2,9,1
2,10,1
2,11,0
2,12,1
2,13,0
2,14,0

What I need is the motion2 column shown below, added to the data frame.

device, time, motion, motion2
1,1,0,0
1,2,0,1
1,3,1,1
1,4,1,1
1,5,1,1
1,6,0,1
1,7,0,0
1,8,0,1
1,9,1,1
1,10,1,1
1,11,0,1
1,12,0,0
2,5,0,0
2,6,0,0
2,7,0,1
2,8,1,1
2,9,1,1
2,10,1,1
2,11,0,1
2,12,1,1
2,13,0,1
2,14,0,0

Below is the Python code, which works. However, I am wondering whether there is a more elegant way.

##create new columns for prior and next motion indicator
df['prev_motion'] = df.groupby(['device'])['motion'].shift(1)
df['next_motion'] = df.groupby(['device'])['motion'].shift(-1)

##create the desired motion2 indicator to expand +/- 1 record of the motion start and stop
df['motion2'] = df[['prev_motion', 'motion', 'next_motion']].apply(
    lambda row: 1 if row['motion']==1 else (1 if row['prev_motion']==1 or row['next_motion']==1 else 0),
    axis=1)

##drop unwanted columns        
df.drop(columns=['prev_motion', 'next_motion'], inplace = True)

This is much easier in SQL, using a case statement and window functions (lead and lag).

case 
when motion = 1 then 1
when motion = 0 and (lead(motion) over (partition by device order by time) = 1) then 1
when motion = 0 and (lag(motion) over (partition by device order by time) = 1) then 1
else 0
end as motion2
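
For comparison, a fairly direct pandas rendering of this CASE expression might look like the sketch below. It is not part of the original post. It assumes the question's sample data, uses `groupby('device').shift()` to stand in for `lag`/`lead ... partition by device order by time`, and uses `np.select` to stand in for the ordered WHEN branches. The names `by_device`, `lag` and `lead` are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data from the question, sorted chronologically within each device
df = pd.DataFrame({
    'device': [1] * 12 + [2] * 10,
    'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
             5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
    'motion': [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0,
               0, 0, 0, 1, 1, 1, 0, 1, 0, 0],
}).sort_values(['device', 'time'])

by_device = df.groupby('device')['motion']
lag = by_device.shift(1)    # lag(motion) over (partition by device order by time)
lead = by_device.shift(-1)  # lead(motion) over (partition by device order by time)

# np.select picks the first matching condition, like CASE WHEN.
# Rows at a device boundary have NaN for lag or lead, and NaN == 1 is False,
# so they fall through to the default, as NULL does in SQL.
df['motion2'] = np.select(
    [df['motion'].eq(1), lead.eq(1), lag.eq(1)],
    [1, 1, 1],
    default=0,
)
```

Because the shifts are computed per device, no helper columns need to be added to and then dropped from the frame.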

【Comments】:

Tags: python sql pandas window-functions case-statement

【Solution 1】:

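This is not necessarily the most elegant, but it does work: find any point where motion is 1, or where motion shifted by one row in either direction is 1. Here are two ways to do it with numpy functions (when this was written they could be reached through pd.np without importing numpy; that alias was deprecated in pandas 1.0 and removed in 2.0, so numpy is imported explicitly here. See @Abhi's comment for a pure pandas equivalent):

import numpy as np

df['motion2'] = np.where(df.motion.values|np.roll(df.motion.values,1)|np.roll(df.motion.values,-1),1,0)

# The following is essentially equivalent, but maybe a bit clearer / more efficient
df['motion2'] = np.stack((df.motion.values,np.roll(df.motion.values,1),np.roll(df.motion.values,-1))).any(0).astype(int)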

>>> df
    device  time  motion  motion2
0        1     1       0        0
1        1     2       0        1
2        1     3       1        1
3        1     4       1        1
4        1     5       1        1
5        1     6       0        1
6        1     7       0        0
7        1     8       0        1
8        1     9       1        1
9        1    10       1        1
10       1    11       0        1
11       1    12       0        0
12       2     5       0        0
13       2     6       0        0
14       2     7       0        1
15       2     8       1        1
16       2     9       1        1
17       2    10       1        1
18       2    11       0        1
19       2    12       1        1
20       2    13       0        1
21       2    14       0        0

【Discussion】:

pd.concat([df.motion.shift(-1),df.motion.shift(1),df.motion],axis=1).max(axis=1)
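@Abhi, that works, and it is basically the same mechanism as what I proposed, just using pandas functions instead of numpy. Since using numpy functions does not require explicitly importing it as a separate package (you can access them with pd.np.whatever_function()...), I prefer those, but that is personal preference :)
Nice. I haven't played with numpy much, so I usually stick to pandas most of the time. :) +1
Yes, glad you pointed that out. I referenced your comment in my edit in case anyone objects strongly to the numpy approach. Thanks!
I realized this is not an entirely accurate solution. I also need to partition the window by device. It looks like if device 1's last row has motion = 1 and device 2's first row has motion = 0, then device 2's first row still gets motion2 = 1, even when device 2's second row has motion = 0. How would you partition by device and still do the same thing? Thanks again for your help!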
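
The thread does not include a reply to this last comment. One possible way to partition by device, offered only as a sketch, is a centered 3-row rolling maximum computed inside each device group. The small data set below is hypothetical, built to reproduce the boundary case the comment describes, and `leaky` is an illustrative name for the unpartitioned result.

```python
import pandas as pd

# Hypothetical data: device 1 ends while moving, device 2 starts at rest
df = pd.DataFrame({
    'device': [1, 1, 1, 2, 2, 2],
    'time':   [1, 2, 3, 1, 2, 3],
    'motion': [0, 0, 1, 0, 0, 0],
}).sort_values(['device', 'time'])

# Unpartitioned shifts: device 1's last row leaks into device 2's first row
leaky = pd.concat(
    [df['motion'].shift(-1), df['motion'].shift(1), df['motion']], axis=1
).max(axis=1).astype(int)

# Partitioned: each row takes the max of itself and its two neighbours,
# with the window restricted to rows of the same device
df['motion2'] = (
    df.groupby('device')['motion']
      .transform(lambda s: s.rolling(3, center=True, min_periods=1).max())
      .astype(int)
)

print(df.assign(leaky=leaky))
```

Here `min_periods=1` lets the first and last row of each device use a shorter window instead of producing NaN, so the result can be cast back to int.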