【发布时间】:2018-08-29 07:54:49
【问题描述】:
我有一个包含员工工资信息的 DataFrame。大约有 900000+ 行。
示例:
+----+-------------+---------------+----------+
| | table_num | name | salary |
|----+-------------+---------------+----------|
| 0 | 001234 | John Johnson | 1200 |
| 1 | 001234 | John Johnson | 1000 |
| 2 | 001235 | John Johnson | 1000 |
| 3 | 001235 | John Johnson | 1200 |
| 4 | 001235 | John Johnson | 1000 |
| 5 | 001235 | Steve Stevens | 1000 |
| 6 | 001236 | Steve Stevens | 1200 |
| 7 | 001236 | Steve Stevens | 1200 |
| 8 | 001236 | Steve Stevens | 1200 |
+----+-------------+---------------+----------+
数据类型:
table_num: string
name: string
salary: float
我需要添加一列,其中包含有关增加\减少的工资水平的信息。
我正在使用shift() 函数来比较行中的值。
主要问题在于对整个数据集的所有唯一员工进行过滤和迭代。
在我的脚本中大约需要 3 个半小时。
如何做到更快?
我的脚本:
# giving us only unique combination of 'table_num' and 'name'
# since there can be same 'table_num' for different 'name'
# and same names with different 'table_num' appears sometimes
names_df = df[['table_num', 'name']].drop_duplicates()
# then extracting particular name and table_num from Series
for i in range(len(names_df)): ### Bottleneck of whole script ###
t = names_df.iloc[i,[0,1]][0]
n = names_df.iloc[i,[0,1]][1]
# using shift() and lambda to check if there difference between two rows
diff_sal = (df[(df['table_num']==t)
& ((df['name']==n))]['salary'] - df[(df['table_num']==t)
& ((df['name']==n))]['salary'].shift(1)).apply(lambda x: 1 if x>0 else (-1 if x<0 else 0))
df.loc[diff_sal.index, 'inc'] = diff_sal.values
示例输入数据:
df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'],
'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'],
'salary':[1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})
样本输出:
+----+-------------+---------------+----------+-------+
| | table_num | name | salary | inc |
|----+-------------+---------------+----------+-------|
| 0 | 001234 | John Johnson | 1200 | 0 |
| 1 | 001234 | John Johnson | 1000 | -1 |
| 2 | 001235 | John Johnson | 1000 | 0 |
| 3 | 001235 | John Johnson | 1200 | 1 |
| 4 | 001235 | John Johnson | 1000 | -1 |
| 5 | 001235 | Steve Stevens | 1000 | 0 |
| 6 | 001236 | Steve Stevens | 1200 | 0 |
| 7 | 001236 | Steve Stevens | 1200 | 0 |
| 8 | 001236 | Steve Stevens | 1200 | 0 |
+----+-------------+---------------+----------+-------+
【问题讨论】:
-
所以我们只比较它们是否具有相同的 table_num 和 name 对吗?
-
我们可以假设排序只是基于原始df?
-
table_num 和 name 的组合足以定义特定的员工。排序很重要。原始(大)df 按日期排序。
标签: python pandas dataframe compare rows