【Question title】: Pandas groupby and then count the occurrence of 0
【Posted】: 2020-06-20 18:16:13
【Question】:

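Starting from this table, I am trying to fill in the missing dates across the min/max weekly dates available in the DataFrame, and then count how many times each category has 0 sales.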

df=pd.DataFrame({'category_id': ['aaa','aaa','aaa','aaa','bbb','bbb','bbb','ccc','ccc'],
                 'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26','2015-01-12', '2015-01-19', '2015-01-26','2015-01-05', '2015-01-12'],
                 'sales': [0,20,30,10,45,0,47,0,10]})

Step 1: add the missing weekly dates to every category and fill the missing dates with 0 (Q1: I'm not sure how to get this df_add_missing_dates result)

# expected dates interpolation output
df_add_missing_dates=pd.DataFrame({'category_id': ['aaa','aaa','aaa','aaa','bbb','bbb','bbb','bbb','ccc','ccc','ccc','ccc'],
                                   'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
                                            '2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
                                            '2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26'],
                                   'sales': [0,20,30,10,
                                             0,45,0,47,
                                             0,10,0,0]})

Step 2: count how many weeks have 0 sales (Q2: how do I count the rows with sales = 0 for each category?)

# expected final output
category_id | sales_0_count
aaa         | 1
bbb         | 2
ccc         | 3

Current code and logic:

# convert string to datetime and set as index
df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')
# find min/max weekly dates in the dataframe --> I couldn't add missing dates with 0 sales though
idx = pd.period_range(start=df.week.min(),end=df.week.max(),freq='W')
df = df.reindex(idx, fill_value=0).reset_index(drop=True)
df_add_missing_dates = df
# group by category to count how many times weekly sales is 0 

【Comments】:

    Tags: python pandas datetime pandas-groupby


    【Answer 1】:

    IIUC, you can use pd.MultiIndex.from_product together with reindex and fill_value=0, then use a boolean matrix with groupby and sum:

    idx = pd.MultiIndex.from_product([df['category_id'].unique(), 
                                      df['week'].unique()], 
                                     names=['category_id', 'week'])
    df_missing = (df.set_index(['category_id', 'week'])
                    .reindex(idx, fill_value=0)
                    .reset_index())
    df_missing
    

    Output:

       category_id        week  sales
    0          aaa  2015-01-05      0
    1          aaa  2015-01-12     20
    2          aaa  2015-01-19     30
    3          aaa  2015-01-26     10
    4          bbb  2015-01-05      0
    5          bbb  2015-01-12     45
    6          bbb  2015-01-19      0
    7          bbb  2015-01-26     47
    8          ccc  2015-01-05      0
    9          ccc  2015-01-12     10
    10         ccc  2015-01-19      0
    11         ccc  2015-01-26      0
    

    Now, group and sum:

    (df_missing == 0).groupby(df_missing['category_id'])['sales'].sum()
    

    Output:

    category_id
    aaa    1.0
    bbb    2.0
    ccc    3.0
    Name: sales, dtype: float64
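
A minimal end-to-end sketch of the two steps, for reference. It differs from the answer above in one respect, which is an assumption on my part: the week axis is built with pd.date_range from the min to the max date (weekly, anchored on Monday) instead of df['week'].unique(), so a week that is absent from every category would still be filled in. The names all_weeks, df_filled and sales_0_count are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'category_id': ['aaa', 'aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'bbb', 'ccc', 'ccc'],
    'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
             '2015-01-12', '2015-01-19', '2015-01-26',
             '2015-01-05', '2015-01-12'],
    'sales': [0, 20, 30, 10, 45, 0, 47, 0, 10],
})
df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')

# Assumption: weeks are labelled by their Monday, as in the sample data.
all_weeks = pd.date_range(df['week'].min(), df['week'].max(), freq='W-MON')

# Every (category, week) pair; pairs missing from df get sales = 0.
idx = pd.MultiIndex.from_product(
    [df['category_id'].unique(), all_weeks],
    names=['category_id', 'week'],
)
df_filled = (df.set_index(['category_id', 'week'])
               .reindex(idx, fill_value=0)
               .reset_index())

# Count zero-sales weeks per category by summing a boolean mask.
sales_0_count = (df_filled['sales'].eq(0)
                 .groupby(df_filled['category_id'])
                 .sum()
                 .astype(int)
                 .rename('sales_0_count')
                 .reset_index())
print(sales_0_count)
```

Summing only the boolean sales column, rather than comparing the whole frame to 0, keeps the comparison away from the category_id and week columns.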
    

    【Comments】:

      【Answer 2】:

      This gives you the expected output in a rough way:

      df_add_missing_dates[df_add_missing_dates.sales.eq(0)].groupby('category_id')['sales'].count()
      

      If you want the actual DataFrame you expect (though this could probably be done more neatly):

      expected_output = df_add_missing_dates[df_add_missing_dates.sales.eq(0)].\
          groupby('category_id',as_index=False)['sales'].count().\
          rename({'sales':'sales_0_count'},axis=1)
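
One thing to keep in mind with the filter-then-count approach: a category that never has 0 sales is filtered out entirely and does not appear in the result. The sketch below contrasts it with summing a boolean mask, which reports 0 for such a category. The extra 'ddd' row is hypothetical and added only to illustrate the difference; it is not part of the question's data.

```python
import pandas as pd

df_add_missing_dates = pd.DataFrame({
    'category_id': ['aaa'] * 4 + ['bbb'] * 4 + ['ccc'] * 4,
    'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26'] * 3,
    'sales': [0, 20, 30, 10,
              0, 45, 0, 47,
              0, 10, 0, 0],
})

# Hypothetical category with no zero-sales week, for illustration only.
extra = pd.DataFrame({'category_id': ['ddd'], 'week': ['2015-01-05'], 'sales': [5]})
demo = pd.concat([df_add_missing_dates, extra], ignore_index=True)

# Filter first, then count: 'ddd' disappears from the result.
filtered = demo[demo['sales'].eq(0)].groupby('category_id')['sales'].count()

# Sum a boolean mask per category: 'ddd' is kept with a count of 0.
summed = demo['sales'].eq(0).groupby(demo['category_id']).sum().astype(int)

print(filtered)
print(summed)
```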
      

      【Comments】:

      • Thanks for the suggestion! I don't know how to create the df_add_missing_dates table from df, though :(
      【Answer 3】:

      This is how I did it:

      dfz = df_add_missing_dates[df_add_missing_dates['sales']==0]
      g = dfz.groupby(pd.Grouper(key='category_id'))
      g['sales'].count()
      
      
      category_id
      aaa    1
      bbb    2
      ccc    3
      Name: sales, dtype: int64
      

      【Comments】:

        【Answer 4】:

        I'm not sure what the reindexing part is for, but after
        df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')
        

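        you can do this (note that the outputs shown below match a frame in which the missing weeks are already filled with 0, such as df_add_missing_dates, rather than the original df):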

        groupedDf = df.groupby(['category_id', pd.Grouper(key='week', freq='W-MON')])['sales'].sum().reset_index().sort_values('week')
        
        zeroSalesWeek = groupedDf[groupedDf.sales == 0]
        

        Output:

        zeroSalesWeek
        
            category_id   week        sales
        0   aaa           2015-01-05    0
        4   bbb           2015-01-05    0
        8   ccc           2015-01-05    0
        6   bbb           2015-01-19    0
        10  ccc           2015-01-19    0
        11  ccc           2015-01-26    0
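
For the original df, where some (category, week) pairs are missing, grouping with pd.Grouper alone will not necessarily create rows for the missing pairs. A minimal sketch of one way to fill them, assuming every week of interest appears for at least one category: unstack the week level with fill_value=0 and count zeros per row. The names weekly, wide and zero_weeks are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'category_id': ['aaa', 'aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'bbb', 'ccc', 'ccc'],
    'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
             '2015-01-12', '2015-01-19', '2015-01-26',
             '2015-01-05', '2015-01-12'],
    'sales': [0, 20, 30, 10, 45, 0, 47, 0, 10],
})
df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')

# Weekly totals per category (weeks end on Monday, matching the data).
weekly = (df.groupby(['category_id', pd.Grouper(key='week', freq='W-MON')])['sales']
            .sum())

# One row per category, one column per week; absent pairs become 0.
wide = weekly.unstack(level='week', fill_value=0)

# Number of zero-sales weeks per category.
zero_weeks = wide.eq(0).sum(axis=1).rename('sales_0_count')
print(zero_weeks)
```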
        

        You can also select a specific category_id:

        df[(df.sales == 0) & (df.category_id=='bbb')]
        

        This gives you

            category_id   week        sales
        4   bbb           2015-01-05    0
        6   bbb           2015-01-19    0
        

        Also, if that feels too tedious to repeat, you can write a quick function that selects a specific category_id, for example:

        def zeroGroupedDf(df, category_id):
            category_id = str(category_id)
            tempDf = df[(df.sales == 0) & (df.category_id==category_id)]
            return tempDf
        

        and call it with whichever category_id you want a new df for, for example:

        test = zeroGroupedDf(df, 'bbb')
        test
        
            category_id   week        sales
        4   bbb           2015-01-05    0
        6   bbb           2015-01-19    0
        

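        【Comments】:

        • Hi @Gorlomi, I tried your solution, but I couldn't get "bbb", "2015-01-05", 0 from the Grouper and groupby code above.
        • Hi @Blair, when you say "couldn't get", do you mean you couldn't select it? If so, I have edited the answer.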