Pandas: compare a value with the previous row under a filter condition (answers)

【Question title】: Pandas compare value with previous row with filtration condition
【Posted】: 2018-08-29 07:54:49
【Question description】:

I have a DataFrame with employee salary information. It has 900,000+ rows.

Example:

+----+-------------+---------------+----------+
|    |   table_num | name          |   salary |
|----+-------------+---------------+----------|
|  0 |      001234 | John Johnson  |     1200 |
|  1 |      001234 | John Johnson  |     1000 |
|  2 |      001235 | John Johnson  |     1000 |
|  3 |      001235 | John Johnson  |     1200 |
|  4 |      001235 | John Johnson  |     1000 |
|  5 |      001235 | Steve Stevens |     1000 |
|  6 |      001236 | Steve Stevens |     1200 |
|  7 |      001236 | Steve Stevens |     1200 |
|  8 |      001236 | Steve Stevens |     1200 |
+----+-------------+---------------+----------+

Data types:

table_num: string
name: string
salary: float

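I need to add a column that indicates whether the salary went up or down. I am using the shift() function to compare values between rows.

The main problem is filtering and iterating over every unique employee in the whole dataset.

My script takes about three and a half hours.

How can I make this faster?

My script: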

# keep only the unique combinations of 'table_num' and 'name',
# since the same 'table_num' can appear with different 'name' values
# and the same name sometimes appears with a different 'table_num'

names_df = df[['table_num', 'name']].drop_duplicates()

# then extracting particular name and table_num from Series
for i in range(len(names_df)):    ### Bottleneck of whole script ###    
    t = names_df.iloc[i,[0,1]][0]
    n = names_df.iloc[i,[0,1]][1]

    # using shift() and a lambda to check whether the salary differs between two consecutive rows
    diff_sal = (df[(df['table_num']==t)
               & ((df['name']==n))]['salary'] - df[(df['table_num']==t)
                                                 & ((df['name']==n))]['salary'].shift(1)).apply(lambda x: 1 if x>0 else (-1 if x<0 else 0))
    df.loc[diff_sal.index, 'inc'] = diff_sal.values

Sample input data:

df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'], 
                     'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'], 
                     'salary':[1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})

Sample output:

+----+-------------+---------------+----------+-------+
|    |   table_num | name          |   salary |   inc |
|----+-------------+---------------+----------+-------|
|  0 |      001234 | John Johnson  |     1200 |     0 |
|  1 |      001234 | John Johnson  |     1000 |    -1 |
|  2 |      001235 | John Johnson  |     1000 |     0 |
|  3 |      001235 | John Johnson  |     1200 |     1 |
|  4 |      001235 | John Johnson  |     1000 |    -1 |
|  5 |      001235 | Steve Stevens |     1000 |     0 |
|  6 |      001236 | Steve Stevens |     1200 |     0 |
|  7 |      001236 | Steve Stevens |     1200 |     0 |
|  8 |      001236 | Steve Stevens |     1200 |     0 |
+----+-------------+---------------+----------+-------+

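【Question discussion】:

So we only compare rows that have the same table_num and name, right?
Can we assume the ordering is simply that of the original df?
The combination of table_num and name is enough to identify a particular employee. Order matters. The original (large) df is sorted by date.

Tags: python pandas dataframe compare rows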

【Solution 1】:

Use groupby together with diff:

df['inc'] = df.groupby(['table_num', 'name'])['salary'].diff().fillna(0.0)
df.loc[df['inc'] > 0.0, 'inc'] = 1.0
df.loc[df['inc'] < 0.0, 'inc'] = -1.0
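
For anyone who wants to check this against the sample data from the question, here is a minimal self-contained sketch. The final astype(int) is an addition made here only to match the integer inc column in the expected output; it is not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})

# diff() is computed inside each (table_num, name) group, so the first row
# of every group is NaN; fillna(0.0) treats it as "no change".
df['inc'] = df.groupby(['table_num', 'name'])['salary'].diff().fillna(0.0)
df.loc[df['inc'] > 0.0, 'inc'] = 1.0
df.loc[df['inc'] < 0.0, 'inc'] = -1.0

# Assumption: cast to int only to match the expected output in the question.
df['inc'] = df['inc'].astype(int)
print(df)
```

There is no Python-level loop over employees here, which is what made the original script slow.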

【Discussion】:

【Solution 2】:

Use DataFrameGroupBy.diff together with numpy.sign, and finally convert to integers:

import numpy as np

df['new'] = np.sign(df.groupby(['table_num', 'name'])['salary'].diff().fillna(0)).astype(int)
print (df)
   table_num           name  salary  new
0       1234   John Johnson    1200    0
1       1234   John Johnson    1000   -1
2       1235   John Johnson    1000    0
3       1235   John Johnson    1200    1
4       1235   John Johnson    1000   -1
5       1235  Steve Stevens    1000    0
6       1236  Steve Stevens    1200    0
7       1236  Steve Stevens    1200    0
8       1236  Steve Stevens    1200    0

【Discussion】:

【Solution 3】:

shift() is the way to go, but you should avoid loops wherever possible. Here we can take advantage of groupby() and transform(). See the pandas docs.

In your case, grouping by both table_num and name (the combination that identifies an employee), you can write:

df.assign(inc=lambda x: x.groupby(['table_num', 'name'])
                      .salary
                      .transform(lambda y: y - y.shift(1))
                      .apply(lambda d: 1 if d>0 else (-1 if d<0 else 0))
      )

This yields:

    table_num   name       salary   inc
0   001234  John Johnson    1200.0  0
1   001234  John Johnson    1000.0  -1
2   001235  John Johnson    1000.0  0
3   001235  John Johnson    1200.0  1
4   001235  John Johnson    1000.0  -1
5   001235  Steve Stevens   1000.0  0
6   001236  Steve Stevens   1200.0  0
7   001236  Steve Stevens   1200.0  0
8   001236  Steve Stevens   1200.0  0
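
The same shift()-based idea can also be written without any Python-level lambda, which is likely to matter on 900,000+ rows. This is only a sketch on the question's sample data; grouping by both key columns and using numpy.sign are choices made here for illustration, not something taken from the answer above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})

keys = ['table_num', 'name']

# Previous salary of the same employee; NaN on each employee's first row.
prev_salary = df.groupby(keys)['salary'].shift(1)

# Vectorized difference; the first row of each group counts as "no change".
delta = (df['salary'] - prev_salary).fillna(0)

# np.sign maps positive / negative / zero to 1 / -1 / 0 in one pass.
df = df.assign(inc=np.sign(delta).astype(int))
print(df)
```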

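【Discussion】:

【Solution 4】:

I think you can search for the term "pandas vectorization" to speed up operations on a DataFrame. For your problem, you could try the following: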

import pandas as pd

df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'],
                     'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'],
                     'salary':[1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})

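df['temp'] = df['name'] + df['table_num']  # one combined key per employee
# a stable sort keeps the original (date) order of the rows within each key
df.sort_values('temp', kind='mergesort', inplace=True)
df['diff'] = df.groupby('temp')['salary'].diff()
# diff / |diff| is +1 or -1; 0/0 and the first row of each key give NaN, filled with 0
df['diff'] = (df['diff'] / abs(df['diff'])).fillna(0)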
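
One caveat with a concatenated temp key: two different (name, table_num) pairs can produce the same string. The values below are made up purely to show the effect and are not from the question's data; with fixed-width table numbers it may never happen in practice. Grouping on a list of columns avoids the issue and needs no helper column.

```python
import pandas as pd

# Hypothetical rows: two different employees whose concatenated keys collide.
demo = pd.DataFrame({
    'name': ['Ann1', 'Ann'],
    'table_num': ['23', '123'],
    'salary': [1000., 1200.],
})

# 'Ann1' + '23' and 'Ann' + '123' are both 'Ann123'.
concat_key = demo['name'] + demo['table_num']
n_concat_groups = concat_key.nunique()
n_pair_groups = demo.groupby(['name', 'table_num']).ngroups

# With the concatenated key the two employees are compared with each other...
by_concat = demo.groupby(concat_key)['salary'].diff().fillna(0)
# ...while grouping on both columns keeps them apart.
by_pair = demo.groupby(['name', 'table_num'])['salary'].diff().fillna(0)

print(n_concat_groups, n_pair_groups)
print(by_concat.tolist(), by_pair.tolist())
```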

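【Discussion】:

Applying a lambda destroys any hope of vectorization, but I was also a bit stuck on that last part :)
I don't know pandas very well :D and I knew someone would have a better solution
Once you turn to a lambda, you start running at Python speed
:D Yes, you shouldn't do that; just go with the groupby answers
@phung-duy-phong I figured out why this was happening. Some of the values were numeric. In that case, df['temp'] = df['name'].astype('str') + df['table_num'].astype('str') does the trick in a few seconds. Thanks again for your help :)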