Iterating over lists in a DataFrame: answers

【Question title】: iterations over list in dataframe
【Posted】: 2018-07-30 10:59:59
【Question description】:

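I have the following problem: I have a DataFrame with 3 columns. The first is UserID, the second is InvoiceType, and the third is the invoice's creation time (CreateTime).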

df = pd.read_csv('invoice.csv')
Output:
UserID  InvoiceType  CreateTime
1       a            2018-01-01 12:31:00
2       b            2018-01-01 12:34:12
3       a            2018-01-01 12:40:13
1       c            2018-01-09 14:12:25
2       a            2018-01-12 14:12:29
1       b            2018-02-08 11:15:00
2       c            2018-02-12 10:12:12

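I am trying to plot the invoice cycle for each user. I need to create 2 new columns, time_diff and time_diff_wrt_first_invoice. time_diff is the time difference between consecutive invoices of the same user, and time_diff_wrt_first_invoice is the time difference between each invoice and the first invoice, which is useful for plotting. Here is my code: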

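import math

import numpy as np
import pandas as pd

"""
********** Exploding a variable that is a list in each dataframe cell

"""
def explode_list(df, x):
    # Expand each list in column x into columns, then stack them into one row per element
    return (df[x].apply(pd.Series)
            .stack()
            .reset_index(level=1, drop=True)
            .to_frame(x))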

"""
  ****** applying explode_list to all the columns ******
"""

def explode_listDF(df):
    exploded_df = pd.DataFrame()

    # Explode every column and put the results side by side
    for x in df.columns.tolist():
        exploded_df = pd.concat([exploded_df, explode_list(df, x)], axis=1)

    return exploded_df


"""
  ******** Getting the time difference column in pivot table format
"""
def pivoted_diffTime(df, _freq=60):

    # _freq is 1 for minutes frequency
    # _freq is 60 for hour frequency
    # _freq is 60*24 for daily frequency
    # _freq is 60*24*30 for monthly frequency

    df = df.sort_values(['UserID', 'CreateTime'])

    # One row per user; every remaining column becomes a list of that user's values
    df_pivot = df.pivot_table(index='UserID',
                              aggfunc=lambda x: list(v for v in x))

    df_pivot['time_diff'] = [[0]] * len(df_pivot)

    for user in df_pivot.index:

        try:
            times = df_pivot.loc[user, 'CreateTime']
            # 0 for the first invoice, then the gap to the previous invoice in _freq units
            _list = [0] + [math.floor((x - y).total_seconds() / (60 * _freq))
                           for x, y in zip(times[1:], times[:-1])]

            # .at writes a single cell, so the whole list is stored in it
            df_pivot.at[user, 'time_diff'] = _list

        except Exception:
            print('There is a prob here :', user)

    return df_pivot


"""
***** Pipelining the two functions to obtain an exploded dataframe
 with time difference ******
"""
def get_timeDiff(df, _frequency):

    df = explode_listDF(pivoted_diffTime(df, _freq=_frequency))

    return df

Once I have time_diff, I create time_diff_wrt_first_invoice this way:

# We initialize this variable
df_with_timeDiff['time_diff_wrt_first_invoice'] = [[0]] * len(df_with_timeDiff)

# Then we loop over users and we apply a cumulative sum over time_diff
for user in df_with_timeDiff.UserID.unique():
    user_rows = df_with_timeDiff.UserID == user
    df_with_timeDiff.loc[user_rows, 'time_diff_wrt_first_invoice'] = np.cumsum(
        df_with_timeDiff.loc[user_rows, 'time_diff'])

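The problem is that my DataFrame contains hundreds of thousands of users, and this is very time-consuming. I would like to know whether there is a solution better suited to my needs.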

【Question comments】:

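Have you used cumsum()? See stackoverflow.com/a/39623235/461887
I used cumsum for the second problem, but I think an aggregation method might suit it better than the for loop I wrote. Thank you for your answer. For the first problem, though, to create the time_diff column I am building a new variable in which, for each user, the first value is 0, the second is t2-t1, the third is t3-t2, ...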
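
The per-user cumulative sum discussed in these comments can be written without a Python-level loop. The following is a minimal sketch, assuming time_diff has already been computed as a numeric column; the values below are the whole-hour gaps for users 1 and 2 in the question's sample rows.

```python
import pandas as pd

# Whole-hour gaps between consecutive invoices for users 1 and 2,
# with 0 for each user's first invoice.
df = pd.DataFrame({
    "UserID": [1, 1, 1, 2, 2, 2],
    "time_diff": [0, 193, 717, 0, 265, 739],
})

# cumsum restarts for every UserID group, so no explicit loop over users is needed.
df["time_diff_wrt_first_invoice"] = df.groupby("UserID")["time_diff"].cumsum()
print(df)
```

The rows must already be ordered by time within each user for the running total to mean "time since the first invoice".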

Tags: python list pandas pivot-table

【Solution 1】:

Take a look at .loc[] in pandas.

    df_1 = pd.DataFrame(some_stuff)

    # Rows where the condition holds, restricted to a single column
    df_2 = df_1.loc[df_1['column'] >= some_condition, 'specific_column']

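You can access specific columns and, instead of running a loop, check some kind of condition; if you add a comma after the condition and give a specific column name, it returns only that column. I am not 100% sure this answers what you are asking, because I do not actually see a question, but it looks like you are running a lot of for loops to isolate columns, and that is what .loc[] is for.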
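
A minimal sketch of that pattern, assuming the sample rows from the question are loaded into a frame named df (the variable names are illustrative):

```python
import pandas as pd

# Sample rows from the question's invoice table.
df = pd.DataFrame({
    "UserID": [1, 2, 3, 1, 2, 1, 2],
    "InvoiceType": ["a", "b", "a", "c", "a", "b", "c"],
    "CreateTime": pd.to_datetime([
        "2018-01-01 12:31:00", "2018-01-01 12:34:12", "2018-01-01 12:40:13",
        "2018-01-09 14:12:25", "2018-01-12 14:12:29", "2018-02-08 11:15:00",
        "2018-02-12 10:12:12",
    ]),
})

# The boolean mask before the comma selects rows;
# the label after the comma selects a single column.
user1_times = df.loc[df["UserID"] == 1, "CreateTime"]
print(user1_times)
```

This returns a Series that keeps the original row labels, so the result can be aligned back to df.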

【Comments】:

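Thank you for your answer. I know about .loc; the issue is that I need to do everything separately for each user, which is why I have all these for loops. I think pivot_table is what takes most of the time.
Do you want to see how often a user visits? Or something else?
It is not about how often they appear. To be more precise, I have an invoicing system in which users receive invoices for renewals, modifications, cancellations, subscriptions, and so on, and each invoice has its creation time. My main goal is to create a new column named time_diff that holds, for each user, the time difference between consecutive invoices.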
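
One way to get such a per-user time_diff column without lists or loops is a grouped diff. This is a sketch only, assuming CreateTime is parsed as datetime and that the gap should be expressed in hours; the frame reuses the question's sample rows.

```python
import pandas as pd

df = pd.DataFrame({
    "UserID": [1, 2, 3, 1, 2, 1, 2],
    "InvoiceType": ["a", "b", "a", "c", "a", "b", "c"],
    "CreateTime": pd.to_datetime([
        "2018-01-01 12:31:00", "2018-01-01 12:34:12", "2018-01-01 12:40:13",
        "2018-01-09 14:12:25", "2018-01-12 14:12:29", "2018-02-08 11:15:00",
        "2018-02-12 10:12:12",
    ]),
})
df = df.sort_values(["UserID", "CreateTime"])

# diff() inside each UserID group yields t2 - t1, t3 - t2, ...;
# the first invoice of every user has no predecessor and gets NaT.
gaps = df.groupby("UserID")["CreateTime"].diff()

# Convert to hours and use 0 for each user's first invoice, as described in the question.
df["time_diff"] = gaps.dt.total_seconds().div(3600).fillna(0)
print(df)
```

Dividing by a different constant (60, 3600 * 24, ...) changes the unit, in the same spirit as the _freq parameter in the question.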

【Solution 2】:

I found a better solution. Here is my code:

import numpy as np
import pandas as pd


def next_diff(x):
    # 0 for the first invoice, then the gap in hours to the previous invoice
    return [0] + [(b - a).total_seconds() / 3600 for b, a in zip(x[1:], x[:-1])]


def create_timediff(df):
    # Assumes CreateTime already holds datetime values; note that df is sorted in place
    df.sort_values(['UserID', 'CreateTime'], inplace=True)

    # a: per-user list of gaps; b: per-user running total of those gaps
    a = df.groupby('UserID').agg({'CreateTime': lambda x: list(v for v in x)}).CreateTime.apply(next_diff)
    b = a.apply(np.cumsum)

    a = a.reset_index()
    b = b.reset_index()

    # Here I explode the lists inside the cell
    rows1 = []
    _ = a.apply(lambda row: [rows1.append([row['UserID'], nn])
                             for nn in row.CreateTime], axis=1)
    rows2 = []
    __ = b.apply(lambda row: [rows2.append([row['UserID'], nn])
                              for nn in row.CreateTime], axis=1)

    df1_new = pd.DataFrame(rows1, columns=a.columns).set_index(['UserID'])
    df2_new = pd.DataFrame(rows2, columns=b.columns).set_index(['UserID'])

    # Both frames are ordered by UserID and then time, so the rows line up
    df = df.set_index('UserID')
    df['time_diff'] = df1_new['CreateTime']
    df['time_diff_wrt_first_invoice'] = df2_new['CreateTime']
    df.reset_index(inplace=True)

    return df
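
The manual "explode the lists inside the cell" step might be replaced by Series.explode, which was added in pandas 0.25, after this thread was posted. A minimal sketch, assuming a Series of per-user lists shaped like the output of the groupby step above (the numbers are illustrative):

```python
import pandas as pd

# Hypothetical per-user lists of gaps, one list per UserID.
per_user = pd.Series(
    [[0.0, 193.7, 717.0], [0.0, 265.6, 740.0], [0.0]],
    index=pd.Index([1, 2, 3], name="UserID"),
    name="time_diff",
)

# explode() emits one row per list element and repeats the index label.
# The result has object dtype, hence the cast back to float.
long_form = per_user.explode().astype(float).reset_index()
print(long_form)
```

The result has one row per invoice with UserID and time_diff columns, which is the shape df1_new and df2_new are built to have.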

【Comments】: