【Question Title】: Calculate nunique() for groupby in pandas
【Posted】: 2018-08-24 03:32:36
【Question】:

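I have a dataframe with the following columns:

  1. diff - the difference between the registration date and the payment date, in days
  2. country - the user's country
  3. user_id
  4. campaign_id - another categorical column; we will use it in the groupby

For each country + campaign_id group, I need to count the distinct users who have diff <= n. For example, for country 'A', campaign 'abc' and diff 7, I need the number of distinct users from country 'A' and campaign 'abc' with diff <= 7.

My current solution (below) takes too long to run: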

import pandas as pd
import numpy as np

## generate test dataframe
df = pd.DataFrame({
        'country':np.random.choice(['A', 'B', 'C', 'D'], 10000),
        'campaign': np.random.choice(['camp1', 'camp2', 'camp3', 'camp4', 'camp5', 'camp6'], 10000),
        'diff':np.random.choice(range(10), 10000),
        'user_id': np.random.choice(range(1000), 10000)
        })
## main
result_df = pd.DataFrame()
for diff in df['diff'].unique():
    tmp_df = df.loc[df['diff']<=diff,:]
    tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
    tmp_df['diff'] = diff
    tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
    result_df = pd.concat([result_df, tmp_df],ignore_index=True, axis=0)

Maybe there is a better way to do this?

【Comments】:

    Tags: python pandas pandas-groupby


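    【Solution 1】:

    First use a list comprehension with concat and assign to join everything together: for each threshold x, the rows with diff <= x are selected and their diff column is overwritten with x. Then use groupby with nunique, with diff among the grouping keys. Finally rename the column and, if needed, add reindex for a custom column order: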

    df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in  df['diff'].unique()])
    df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
              .nunique()
              .reset_index()
              .rename(columns={'user_id':'unique_ppl'})
              .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
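
To make the cumulative step concrete, here is a minimal sketch on a hypothetical four-row frame (the data and the names `small`, `stacked` and `result` are made up for illustration and are not from the question). Each threshold x gets its own copy of the rows with diff <= x, relabelled with diff=x, so an ordinary groupby then counts users cumulatively.

```python
import pandas as pd

# Hypothetical toy data, not taken from the question
small = pd.DataFrame({
    'country':  ['A', 'A', 'A', 'B'],
    'campaign': ['c1', 'c1', 'c1', 'c1'],
    'diff':     [0, 1, 1, 0],
    'user_id':  [10, 10, 11, 20],
})

# One relabelled copy of the qualifying rows per threshold
stacked = pd.concat([small.loc[small['diff'] <= x].assign(diff=x)
                     for x in small['diff'].unique()])

# Count distinct users per (threshold, country, campaign)
result = (stacked.groupby(['diff', 'country', 'campaign'])['user_id']
                 .nunique()
                 .reset_index()
                 .rename(columns={'user_id': 'unique_ppl'})
                 .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))

print(result)
```

In this toy frame, country 'A' has one distinct user at diff 0 and two at diff 1: user 10 appears at both diff values but is counted only once.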
    

    【Comments】:

      【Solution 2】:

      An alternative is below, but @jezrael's solution is the best choice.

      Performance benchmarking

      %timeit original(df)  # 149ms
      %timeit jp(df)        # 81ms
      %timeit jez(df)       # 47ms
      
      def original(df):
          result_df = pd.DataFrame()
          for diff in df['diff'].unique():
              tmp_df = df.loc[df['diff']<=diff,:]
              tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
              tmp_df['diff'] = diff
              tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
              result_df = pd.concat([result_df, tmp_df],ignore_index=True, axis=0)
      
          return result_df
      
      def jp(df):
      
          lst = []
          lst_append = lst.append
          for diff in df['diff'].unique():
              tmp_df = df.loc[df['diff']<=diff,:]
              tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).agg({'user_id': 'nunique'})
              tmp_df['diff'] = diff
              tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
              lst_append(tmp_df)
      
          # DataFrame.append was removed in pandas 2.0; concatenate the collected pieces once instead
          result_df = pd.concat(lst, ignore_index=True, axis=0)
      
          return result_df
      
      def jez(df):
          df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in  df['diff'].unique()])
          df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
                    .nunique()
                    .reset_index()
                    .rename(columns={'user_id':'unique_ppl'})
                    .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
          return df2
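
A further sketch, not part of either answer and not timed here: since a user counts toward every threshold at or above the smallest diff at which they appear in a group, the same counts can be obtained from a running total rather than by copying the frame once per threshold. The function name `first_seen_cumsum` is hypothetical, and the sketch assumes the column names of the question's test frame.

```python
import numpy as np
import pandas as pd

def first_seen_cumsum(df):
    # Smallest diff at which each user appears within a country/campaign group
    first_seen = (df.groupby(['country', 'campaign', 'user_id'], as_index=False)['diff']
                    .min())
    # Number of users first seen at each diff, one column per diff value
    new_users = (first_seen.groupby(['country', 'campaign', 'diff'])
                           .size()
                           .unstack('diff', fill_value=0))
    # Ensure every threshold present in the data has a column, in ascending order
    thresholds = pd.Index(np.sort(df['diff'].unique()), name='diff')
    new_users = new_users.reindex(columns=thresholds, fill_value=0)
    # Running total across thresholds = distinct users with diff <= threshold
    cumulative = new_users.cumsum(axis=1)
    out = cumulative.stack().rename('unique_ppl').reset_index()
    # The loop-based versions emit no row for a group that has no users yet
    out = out[out['unique_ppl'] > 0]
    return out[['country', 'campaign', 'unique_ppl', 'diff']].reset_index(drop=True)

# Usage with data shaped like the question's test frame
df = pd.DataFrame({
    'country': np.random.choice(['A', 'B', 'C', 'D'], 10000),
    'campaign': np.random.choice(['camp1', 'camp2', 'camp3', 'camp4', 'camp5', 'camp6'], 10000),
    'diff': np.random.choice(range(10), 10000),
    'user_id': np.random.choice(range(1000), 10000),
})
result = first_seen_cumsum(df)
```

The rows come back ordered by country, campaign and diff rather than in the order produced by the functions above, so sort both sides before comparing results.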
      

      【Comments】:
