Pandas: conditional shift

【Question title】: Pandas: conditional shift
【Posted】: 2018-07-16 17:34:56
【Question】:

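Is there a way to shift a dataframe column based on a condition on two other columns? Something like this (pseudocode; shiftWhile is not an existing pandas method):

df["cumulated_closed_value"] = df.groupby("user")["close_cumsum"].shiftWhile(df["close_time"] > df["open_time"])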

I have worked out a way to do it, but it is very inefficient:

1) Load the data and create the column to shift

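import pandas as pd

# parse both time columns as datetimes so they sort and compare correctly
df=pd.read_csv('data.csv', parse_dates=['open_time','close_time'])
# running total of value in closing order, per user
df.sort_values(['user','close_time'],inplace=True)
df['close_cumsum']=df.groupby('user')['value'].cumsum()
df.sort_values(['user','open_time'],inplace=True)
print(df)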

Output:

   user  open_time close_time  value  close_cumsum
0     1 2017-01-01 2017-03-01      5            18
1     1 2017-01-02 2017-02-01      6             6
2     1 2017-02-03 2017-02-05      7            13
3     1 2017-02-07 2017-04-01      3            21
4     1 2017-09-07 2017-09-11      1            22
5     2 2018-01-01 2018-02-01     15            15
6     2 2018-03-01 2018-04-01      3            18

2) Shift the column using a self-join and some filters

Self-join (memory-inefficient):

df2=pd.merge(df[['user','open_time']],df[['user','close_time','close_cumsum']], on='user')

Filter on 'close_time', keeping for each open_time only the latest close that precedes it:

df2=df2[df2['close_time']<df2['open_time']]
idx = df2.groupby(['user','open_time'])['close_time'].transform('max') == df2['close_time']
df2=df2[idx]

3) Merge with the original dataset:

df3=pd.merge(df[['user','open_time','close_time','value']],df2[['user','open_time','close_cumsum']],how='left')
print(df3)

Output:

   user  open_time close_time  value  close_cumsum
0     1 2017-01-01 2017-03-01      5           NaN
1     1 2017-01-02 2017-02-01      6           NaN
2     1 2017-02-03 2017-02-05      7           6.0
3     1 2017-02-07 2017-04-01      3          13.0
4     1 2017-09-07 2017-09-11      1          21.0
5     2 2018-01-01 2018-02-01     15           NaN
6     2 2018-03-01 2018-04-01      3          15.0

Is there a more pandas-like way to get the same result?

Edit: I added a row of data to make the case clearer. My goal is to get the sum of all trades that closed before the new trade's open time.
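To make the goal in the edit concrete, here is a minimal sketch that computes, for each trade, the total value of the same user's trades that closed strictly before it opened. It is one possible reading of the requirement, not code from the question: the helper name closed_before_open is hypothetical, the data is typed in from the table above, and it returns 0 (rather than NaN) where nothing has closed yet.

```python
import numpy as np
import pandas as pd

# Data copied from the question's printed table.
df = pd.DataFrame({
    "user": [1, 1, 1, 1, 1, 2, 2],
    "open_time": pd.to_datetime([
        "2017-01-01", "2017-01-02", "2017-02-03", "2017-02-07",
        "2017-09-07", "2018-01-01", "2018-03-01"]),
    "close_time": pd.to_datetime([
        "2017-03-01", "2017-02-01", "2017-02-05", "2017-04-01",
        "2017-09-11", "2018-02-01", "2018-04-01"]),
    "value": [5, 6, 7, 3, 1, 15, 3],
})


def closed_before_open(g):
    """Hypothetical helper: per-row sum of values already closed (one user)."""
    closed = g.sort_values("close_time")
    close_times = closed["close_time"].to_numpy()
    # running[k] is the total value of the first k trades in closing order
    running = np.concatenate(([0], closed["value"].to_numpy().cumsum()))
    # side="left" counts the close times strictly earlier than each open time
    n_closed = np.searchsorted(close_times, g["open_time"].to_numpy(), side="left")
    return pd.Series(running[n_closed], index=g.index)


parts = [closed_before_open(g) for _, g in df.groupby("user")]
df["cumulated_closed_value"] = pd.concat(parts)
print(df["cumulated_closed_value"].tolist())  # [0, 0, 6, 13, 21, 0, 15]
```

Because each user's close times are sorted once and then searched, this avoids the self-join; whether it is fast enough for a real dataset is something to measure rather than assume.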

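【Comments】:

Is there something wrong with @Wen's answer? The bounty seems to have been added after Wen answered, but I don't see anything wrong with that answer. If you want something more or different, can you elaborate?
OK, since you changed the question, I am updating my answer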

Tags: python pandas datetime data-analysis

【Solution 1】:

Here I use a new column to record the condition df2['close_time']<df2['open_time']:

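import numpy as np

# New is True when the next row opens at least a day after this row closes
df['New']=((df.open_time-df.close_time.shift()).dt.days>0).shift(-1)
# per user: running sum of the flagged values, moved down one row
s=df.groupby('user').apply(lambda x : (x['value']*x['New']).cumsum().shift()).reset_index(level=0,drop=True)
# keep a value only where the previous row was flagged
s.loc[~(df.New.shift()==True)]=np.nan
df['Cumsum']=s
df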

Out[1043]: 
   user  open_time close_time  value    New Cumsum
0     1 2017-01-01 2017-03-01      5  False    NaN
1     1 2017-01-02 2017-02-01      6   True    NaN
2     1 2017-02-03 2017-02-05      7   True      6
3     1 2017-02-07 2017-04-01      3  False     13
4     2 2017-01-01 2017-02-01     15   True    NaN
5     2 2017-03-01 2017-04-01      3    NaN     15

Update: since the OP updated the question (data from Gabriel A):

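# for every row, attach the arrays of all close times and values of that row's user
df['New']=df.user.map(df.groupby('user').close_time.apply(lambda x: np.array(x)))
df['New1']=df.user.map(df.groupby('user').value.apply(lambda x: np.array(x)))
# flag the trades that closed before this row's open_time, then sum their values
df['New2']=[[x>m for m in y] for x,y in zip(df['open_time'],df['New'])  ]
df['Yourtarget']=list(map(sum,df['New2']*df['New1'].values))
# axis must be passed by keyword in recent pandas versions
df.drop(['New','New1','New2'],axis=1)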


Out[1376]: 
   user  open_time close_time  value  Yourtarget
0     1 2016-12-30 2016-12-31      1           0
1     1 2017-01-01 2017-03-01      5           1
2     1 2017-01-02 2017-02-01      6           1
3     1 2017-02-03 2017-02-05      7           7
4     1 2017-02-07 2017-04-01      3          14
5     1 2017-09-07 2017-09-11      1          22
6     2 2018-01-01 2018-02-01     15           0
7     2 2018-03-01 2018-04-01      3          15
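The same pairwise comparison can be written with NumPy broadcasting instead of Python lists stored in cells. This is only a sketch of the idea, not the answerer's code: the helper name closed_value is hypothetical, and the data is typed in from the output above. Like the list version, it builds an n-by-n comparison per user, so memory grows quadratically with the number of trades per user.

```python
import pandas as pd

# Data copied from the output table above (Gabriel A's version of the test case).
df = pd.DataFrame({
    "user": [1, 1, 1, 1, 1, 1, 2, 2],
    "open_time": pd.to_datetime([
        "2016-12-30", "2017-01-01", "2017-01-02", "2017-02-03",
        "2017-02-07", "2017-09-07", "2018-01-01", "2018-03-01"]),
    "close_time": pd.to_datetime([
        "2016-12-31", "2017-03-01", "2017-02-01", "2017-02-05",
        "2017-04-01", "2017-09-11", "2018-02-01", "2018-04-01"]),
    "value": [1, 5, 6, 7, 3, 1, 15, 3],
})


def closed_value(g):
    """Hypothetical helper: pairwise open/close comparison for one user."""
    open_t = g["open_time"].to_numpy()
    close_t = g["close_time"].to_numpy()
    # closed[i, j] is True when trade j closed before trade i opened
    closed = open_t[:, None] > close_t[None, :]
    # multiply each column j by value[j] and sum across the row
    totals = (closed * g["value"].to_numpy()).sum(axis=1)
    return pd.Series(totals, index=g.index)


df["Yourtarget"] = pd.concat([closed_value(g) for _, g in df.groupby("user")])
print(df["Yourtarget"].tolist())  # [0, 1, 1, 7, 14, 22, 0, 15]
```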

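【Comments】:

Thanks Wen. I think your solution is great, but my case is a bit more complicated. I added some edits to make everything clear
@riccardonizzolo I believe the underlying logic should be the same, it just needs polishing :-)
This is very neat ;-)
If you add the first date from my example, you will see that your approach does not give the right answer. It works for a conditional shift, but not for the specific case in the edit.
@GabrielA If you check the question's edit history, you will see that he updated it; my answer works for the first version of the question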

【Solution 2】:

I made a change to your test case that I think you should include. This solution handles your edit.

import pandas as pd
import numpy as np
df = pd.read_csv("cond_shift.csv")
df

Input:

   user open_time   close_time  value
0   1   12/30/2016  12/31/2016  1
1   1   1/1/2017    3/1/2017    5
2   1   1/2/2017    2/1/2017    6
3   1   2/3/2017    2/5/2017    7
4   1   2/7/2017    4/1/2017    3
5   1   9/7/2017    9/11/2017   1
6   2   1/1/2018    2/1/2018    15
7   2   3/1/2018    4/1/2018    3

Create the column to shift:

df["open_time"] = pd.to_datetime(df["open_time"])
df["close_time"] = pd.to_datetime(df["close_time"])
df.sort_values(['user','close_time'],inplace=True)
df['close_cumsum']=df.groupby('user')['value'].cumsum()
df.sort_values(['user','open_time'],inplace=True)
df


   user open_time   close_time  value   close_cumsum
0   1   2016-12-30  2016-12-31  1       1
1   1   2017-01-01  2017-03-01  5       19
2   1   2017-01-02  2017-02-01  6       7
3   1   2017-02-03  2017-02-05  7       14
4   1   2017-02-07  2017-04-01  3       22
5   1   2017-09-07  2017-09-11  1       23
6   2   2018-01-01  2018-02-01  15      15
7   2   2018-03-01  2018-04-01  3       18

Shift the column (explanation below):

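# shift the cumulative sum down one row within each user
df["cumulated_closed_value"] = df.groupby("user")["close_cumsum"].transform("shift")
# blank the rows whose previous trade had not closed before this one opened
condition = ~(df.groupby("user")['close_time'].transform("shift") < df["open_time"])
df.loc[ condition,"cumulated_closed_value" ] = None
# forward-fill within each user; rows with nothing closed yet become 0
df["cumulated_closed_value"] =df.groupby("user")["cumulated_closed_value"].ffill().fillna(0)
df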


   user open_time   close_time  value   close_cumsum    cumulated_closed_value
0   1   2016-12-30  2016-12-31  1       1               0.0
1   1   2017-01-01  2017-03-01  5       19              1.0
2   1   2017-01-02  2017-02-01  6       7               1.0
3   1   2017-02-03  2017-02-05  7       14              7.0
4   1   2017-02-07  2017-04-01  3       22              14.0
5   1   2017-09-07  2017-09-11  1       23              22.0
6   2   2018-01-01  2018-02-01  15      15              0.0
7   2   2018-03-01  2018-04-01  3       18              15.0

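All of this is written so that it works across all users at once. I believe the logic is easier to follow if you focus on one user at a time.

1) Assume no events overlap. Then this is the same as shifting the cumulative sum down by one row.
2) Remove the values for events that overlap with other events.
3) Fill in the missing values, using a forward fill.

I would still test this thoroughly before you use it. Time intervals are tricky and there are many edge cases.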
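One way to do that testing is to compare any fast version against a deliberately slow reference that applies the definition row by row. The sketch below is an assumption about what "correct" means here (sum of the same user's values whose close_time is strictly earlier than the row's open_time); the function name brute_force_closed_value is hypothetical and the data is the input shown above, written with ISO dates.

```python
import pandas as pd

# Same rows as the input table above.
df = pd.DataFrame({
    "user": [1, 1, 1, 1, 1, 1, 2, 2],
    "open_time": pd.to_datetime([
        "2016-12-30", "2017-01-01", "2017-01-02", "2017-02-03",
        "2017-02-07", "2017-09-07", "2018-01-01", "2018-03-01"]),
    "close_time": pd.to_datetime([
        "2016-12-31", "2017-03-01", "2017-02-01", "2017-02-05",
        "2017-04-01", "2017-09-11", "2018-02-01", "2018-04-01"]),
    "value": [1, 5, 6, 7, 3, 1, 15, 3],
})


def brute_force_closed_value(frame):
    """Hypothetical reference implementation: O(n^2), for testing only."""
    totals = []
    for _, row in frame.iterrows():
        same_user = frame["user"] == row["user"]
        already_closed = frame["close_time"] < row["open_time"]
        totals.append(frame.loc[same_user & already_closed, "value"].sum())
    return pd.Series(totals, index=frame.index)


expected = brute_force_closed_value(df)
print(expected.tolist())  # [0, 1, 1, 7, 14, 22, 0, 15]
```

Running the reference on small frames with overlapping intervals, ties between a close and an open on the same day, and single-trade users, and then comparing it with the fast column, is a cheap way to surface the edge cases mentioned above.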

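【Comments】:

【Solution 3】:

(Note: @wen's answer looks fine to me, so I'm not sure whether the OP is looking for something more or different. Anyway, here is another approach using merge_asof, which should also work well.)

First, reshape the dataframes as follows: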

lookup = ( df[['close_time','value','user']].set_index(['user','close_time'])
           .sort_index().groupby('user').cumsum().reset_index(0) )

df = df.set_index('open_time').sort_index()

The idea of 'lookup' is simply to sort by 'close_time' and then take a (grouped) cumulative sum:

            user  value
close_time             
2017-02-01     1      6
2017-02-05     1     13
2017-03-01     1     18
2017-04-01     1     21
2017-09-11     1     22
2018-02-01     2     15
2018-04-01     2     18

'df' is simply the original dataframe, indexed and sorted by 'open_time':

            user close_time  value
open_time                         
2017-01-01     1 2017-03-01      5
2017-01-02     1 2017-02-01      6
2017-02-03     1 2017-02-05      7
2017-02-07     1 2017-04-01      3
2017-09-07     1 2017-09-11      1
2018-01-01     2 2018-02-01     15
2018-03-01     2 2018-04-01      3

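From here, conceptually you just want to merge the two datasets on 'user' and 'open_time'/'close_time', but the complicating factor is that we don't want an exact match on time, rather a kind of 'nearest' match.

For these kinds of merges you can use merge_asof, which is a great tool for various inexact matches (including 'nearest', 'backward' and 'forward'). Unfortunately, because of the groupby, you also need to loop over the users, but the code is still very easy to read: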

# DataFrame.append was removed in pandas 2.0, so collect the pieces and concat once
pieces = []

for u in df['user'].unique():
    pieces.append( pd.merge_asof( df[df.user==u],  lookup[lookup.user==u],
                                  left_index=True, right_index=True,
                                  direction='backward' ) )

df_merged = pd.concat(pieces)
df_merged.drop('user_y',axis=1).rename({'value_y':'close_cumsum'},axis=1)

Result:

            user_x close_time  value_x  close_cumsum
open_time                                           
2017-01-01       1 2017-03-01        5           NaN
2017-01-02       1 2017-02-01        6           NaN
2017-02-03       1 2017-02-05        7           6.0
2017-02-07       1 2017-04-01        3          13.0
2017-09-07       1 2017-09-11        1          21.0
2018-01-01       2 2018-02-01       15           NaN
2018-03-01       2 2018-04-01        3          15.0
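As a variation on the loop, merge_asof also accepts a by argument that matches rows within groups, so the per-user split can be left to pandas. The following is a sketch under a few assumptions rather than the answerer's code: the column name closed_at is made up to avoid a name clash with the left frame's close_time, the data is typed in from the question, and allow_exact_matches=False is added so that a trade closing at exactly the open time is not counted.

```python
import pandas as pd

# Data copied from the question's printed table.
df = pd.DataFrame({
    "user": [1, 1, 1, 1, 1, 2, 2],
    "open_time": pd.to_datetime([
        "2017-01-01", "2017-01-02", "2017-02-03", "2017-02-07",
        "2017-09-07", "2018-01-01", "2018-03-01"]),
    "close_time": pd.to_datetime([
        "2017-03-01", "2017-02-01", "2017-02-05", "2017-04-01",
        "2017-09-11", "2018-02-01", "2018-04-01"]),
    "value": [5, 6, 7, 3, 1, 15, 3],
})

# Lookup table: per-user running total of value, in closing order.
lookup = df.sort_values("close_time")[["user", "close_time", "value"]].copy()
lookup["close_cumsum"] = lookup.groupby("user")["value"].cumsum()
# closed_at is a hypothetical name chosen so the key does not collide with close_time
lookup = lookup.drop(columns="value").rename(columns={"close_time": "closed_at"})

# Both frames must be sorted by their time key; `by` restricts matches to the same user.
merged = pd.merge_asof(
    df.sort_values("open_time"),
    lookup,
    left_on="open_time",
    right_on="closed_at",
    by="user",
    direction="backward",
    allow_exact_matches=False,  # only closes strictly before the open time
)
print(merged[["user", "open_time", "close_cumsum"]])
```

Rows with no earlier close for the same user get NaN in close_cumsum, as in the result above.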

【Comments】: