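【Posted】: 2018-07-30 10:59:59
【Question】:
I have the following problem: I have a data frame with three columns. The first is the UserID, the second is the InvoiceType, and the third is the invoice's creation time (CreateTime).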
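import math

import numpy as np
import pandas as pd

# parse_dates is needed so that CreateTime values can be subtracted later on
df = pd.read_csv('invoice.csv', parse_dates=['CreateTime'])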
Output:
UserID  InvoiceType  CreateTime
1       a            2018-01-01 12:31:00
2       b            2018-01-01 12:34:12
3       a            2018-01-01 12:40:13
1       c            2018-01-09 14:12:25
2       a            2018-01-12 14:12:29
1       b            2018-02-08 11:15:00
2       c            2018-02-12 10:12:12
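I am trying to plot the invoice cycle for each user. I need to create two new columns, time_diff and time_diff_wrt_first_invoice. time_diff is the time elapsed between consecutive invoices of the same user, and time_diff_wrt_first_invoice is the time elapsed between each invoice and that user's first invoice, which is useful for plotting. Here is my code: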
"""
********** Exploding a variable that is a list in each dataframe cell
"""
def explode_list(df, x):
    return (df[x].apply(pd.Series)
            .stack()
            .reset_index(level=1, drop=True)
            .to_frame(x))
"""
****** applying explode_list to all the columns ******
"""
def explode_listDF(df):
    exploaded_df = pd.DataFrame()
    for x in df.columns.tolist():
        exploaded_df = pd.concat([exploaded_df, explode_list(df, x)],
                                 axis=1)
    return exploaded_df
"""
******** Getting the time difference column in pivot table format
"""
def pivoted_diffTime(df, _freq=60):
    # _freq is 1 for minutes frequency
    # _freq is 60 for hour frequency
    # _freq is 60*24 for daily frequency
    # _freq is 60*24*30 for monthly frequency
    df = df.sort_values(['UserID', 'CreateTime'])
    # One row per user; every other column becomes a list of that user's values
    df_pivot = df.pivot_table(index='UserID',
                              aggfunc=lambda x: list(v for v in x)
                              )
    df_pivot['time_diff'] = [[0]] * len(df_pivot)
    for user in df_pivot.index:
        try:
            _list = [0] + [math.floor((x - y).total_seconds() / (60 * _freq))
                           for x, y in zip(df_pivot.loc[user, 'CreateTime'][1:],
                                           df_pivot.loc[user, 'CreateTime'][:-1])]
            # .at stores the whole list in a single cell
            df_pivot.at[user, 'time_diff'] = _list
        except Exception:
            print('There is a prob here :', user)
    return df_pivot
"""
***** Pipelining the two functions to obtain an exploded dataframe
with time difference ******
"""
def get_timeDiff(df, _frequency):
    df = explode_listDF(pivoted_diffTime(df, _freq=_frequency))
    return df
Once I have time_diff, I create time_diff_wrt_first_invoice this way:
# We initialize this variable
df_with_timeDiff['time_diff_wrt_first_invoice'] = [[0]] * len(df_with_timeDiff)
# Then we loop over users and we apply a cumulative sum over time_diff
for user in df_with_timeDiff.UserID.unique():
    mask = df_with_timeDiff.UserID == user
    df_with_timeDiff.loc[mask, 'time_diff_wrt_first_invoice'] = np.cumsum(df_with_timeDiff.loc[mask, 'time_diff'])
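The problem is that my data frame contains hundreds of thousands of users, so this is very time-consuming. I would like to know whether there is a solution better suited to my needs.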
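One possible direction, offered only as a sketch: both columns can probably be computed straight from CreateTime with grouped operations, without the pivot/explode round trip or a Python loop over users. The helper name add_time_diffs and its freq_minutes parameter are made up for illustration (freq_minutes plays the role of _freq above), and the sample rows are copied from the table in the question.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "UserID": [1, 2, 3, 1, 2, 1, 2],
    "InvoiceType": ["a", "b", "a", "c", "a", "b", "c"],
    "CreateTime": pd.to_datetime([
        "2018-01-01 12:31:00", "2018-01-01 12:34:12", "2018-01-01 12:40:13",
        "2018-01-09 14:12:25", "2018-01-12 14:12:29", "2018-02-08 11:15:00",
        "2018-02-12 10:12:12",
    ]),
})


def add_time_diffs(df, freq_minutes=60):
    """Hypothetical helper: add both columns using grouped operations only."""
    out = df.sort_values(["UserID", "CreateTime"]).reset_index(drop=True)
    seconds_per_unit = 60 * freq_minutes
    times = out.groupby("UserID")["CreateTime"]

    # Gap to the previous invoice of the same user, floored to whole units.
    # The first invoice of each user has no predecessor (NaN), so it gets 0.
    gap_seconds = times.diff().dt.total_seconds()
    out["time_diff"] = (gap_seconds // seconds_per_unit).fillna(0).astype(int)

    # Gap to the same user's earliest invoice, floored to whole units.
    since_first = (out["CreateTime"] - times.transform("min")).dt.total_seconds()
    out["time_diff_wrt_first_invoice"] = (since_first // seconds_per_unit).astype(int)
    return out


result = add_time_diffs(df, freq_minutes=60 * 24)  # daily units
print(result)
```

The result is sorted by UserID and CreateTime, so the original row order is not preserved. Whether this is fast enough for hundreds of thousands of users has not been measured here.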
【Comments】:
-
Have you tried cumsum()? See stackoverflow.com/a/39623235/461887
-
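I did use cumsum for the second part, but I think an aggregation method would suit it better than the for loop I wrote. Thanks for your answer. For the first part, though, to create the time_diff column I am building a new variable where, for each user, the first value is 0, the second is t2-t1, the third is t3-t2, ...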
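To make the cumsum suggestion from these comments concrete, here is a minimal sketch, assuming a frame that already has a time_diff column in the 0, t2-t1, t3-t2, ... form and whose rows are in chronological order within each user. The numbers are invented for illustration; only the column and variable names come from the question.

```python
import pandas as pd

# Hypothetical frame that already carries a per-invoice time_diff column.
df_with_timeDiff = pd.DataFrame({
    "UserID":    [1, 1, 1, 2, 2],
    "time_diff": [0, 5, 7, 0, 3],
})

# One grouped cumulative sum in place of the Python loop over users.
df_with_timeDiff["time_diff_wrt_first_invoice"] = (
    df_with_timeDiff.groupby("UserID")["time_diff"].cumsum()
)
print(df_with_timeDiff)
```

One thing to keep in mind: because each time_diff was floored before summing, the running total can come out smaller than the floored distance to the first invoice. Two gaps of 1.5 hours floor to 1 each and sum to 2, while the total gap of 3 hours floors to 3.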
Tags: python list pandas pivot-table