【Question title】: Pandas: add percentage column
【Posted】: 2023-01-26 15:55:53
【Question】:

I have a pandas DataFrame:

print(df)

call_id   calling_number   call_status
1          123             BUSY
2          456             BUSY
3          789             BUSY
4          123             NO_ANSWERED
5          456             NO_ANSWERED
6          789             NO_ANSWERED

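Records with other call_status values (say "ERROR", or something else I cannot predict) may appear in the DataFrame, and I need a new column to be added on the fly for any such value. I applied pivot_table() and got the result I wanted: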

df1 = df.pivot_table(values='call_id', index='calling_number', columns='call_status', aggfunc='count').fillna(0).astype('int64')

calling_number    ANSWERED  BUSY   NO_ANSWER  
123               0          1      1
456               0          1      1
789               0          1      1

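Now I need to add one more column containing the percentage of answered calls for each calling_number, calculated as the ratio of answered calls to the total number of calls. The source DataFrame 'df' may contain no entries with call_status = 'ANSWERED', in which case the percentage column should simply be zero.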

The expected result is:

calling_number    ANSWERED  BUSY   NO_ANSWER  ANS_PERC(%)
123               0         1      1          0
456               0         1      1          0
789               0         1      1          0

【Comments】:

    Tags: python pandas dataframe


    【Solution 1】:

    Use crosstab:

    df1 = pd.crosstab(df['calling_number'], df['call_status'])
    

    Or, if you need the count function so that NaNs are excluded, use pivot_table with the parameter fill_value=0:

    df1 = df.pivot_table(values='call_id',
                         index='calling_number',
                         columns='call_status',
                         aggfunc='count',
                         fill_value=0)
    

    Then, to get the ratios, divide by the sum of each row:

    df1 = df1.div(df1.sum(axis=1), axis=0)
    print (df1)
                    ANSWERED      BUSY  NO_ANSWER
    calling_number                               
    123             0.333333  0.333333   0.333333
    456             0.333333  0.333333   0.333333
    789             0.333333  0.333333   0.333333
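
For the row-wise ratios, `pd.crosstab` also accepts `normalize='index'`, which should give the same result as dividing by the row sums by hand. A minimal sketch, assuming the sample data from the question; the variable names `counts`, `manual` and `ratios` are only illustrative:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "call_id": [1, 2, 3, 4, 5, 6],
    "calling_number": [123, 456, 789, 123, 456, 789],
    "call_status": ["BUSY", "BUSY", "BUSY",
                    "NO_ANSWERED", "NO_ANSWERED", "NO_ANSWERED"],
})

# Manual version: counts divided by the total of each row
counts = pd.crosstab(df["calling_number"], df["call_status"])
manual = counts.div(counts.sum(axis=1), axis=0)

# Built-in version: normalize='index' divides each count by its row total
ratios = pd.crosstab(df["calling_number"], df["call_status"], normalize="index")

print(ratios)
```

Both versions only contain columns for statuses that actually occur in the data, so a missing ANSWERED column still has to be handled separately.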
    

    EDIT: To add categories that may not be present in the data, use DataFrame.reindex:

    df1 = (pd.crosstab(df['calling_number'], df['call_status'])
             .reindex(columns=['ANSWERED','BUSY','NO_ANSWERED'], fill_value=0))
    
    df1['ANS_PERC(%)'] = df1['ANSWERED'].div(df1['ANSWERED'].sum()).fillna(0)
    print (df1)
    call_status     ANSWERED  BUSY  NO_ANSWERED  ANS_PERC(%)
    calling_number                                          
    123                    0     1            1          0.0
    456                    0     1            1          0.0
    789                    0     1            1          0.0
    

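    If you need the ratio to the total of each row instead:

    # sum only the status columns, so a previously added ANS_PERC(%) column is not counted
    df1['ANS_PERC(%)'] = df1['ANSWERED'].div(df1[['ANSWERED', 'BUSY', 'NO_ANSWERED']].sum(axis=1))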
    print (df1)
    call_status     ANSWERED  BUSY  NO_ANSWERED  ANS_PERC(%)
    calling_number                                          
    123                    0     1            1          0.0
    456                    0     1            1          0.0
    789                    0     1            1          0.0
    

    EDIT1:

    A solution that replaces any unexpected values with ERROR:

    print (df)
       call_id  calling_number  call_status
    0        1             123          ttt
    1        2             456         BUSY
    2        3             789         BUSY
    3        4             123  NO_ANSWERED
    4        5             456  NO_ANSWERED
    5        6             789  NO_ANSWERED
    
    L = ['ANSWERED', 'BUSY', 'NO_ANSWERED']
    df['call_status'] = df['call_status'].where(df['call_status'].isin(L), 'ERROR')
    print (df)
       call_id  calling_number  call_status
    0        1             123        ERROR
    1        2             456         BUSY
    2        3             789         BUSY
    3        4             123  NO_ANSWERED
    4        5             456  NO_ANSWERED
    5        6             789  NO_ANSWERED

    df1 = (pd.crosstab(df['calling_number'], df['call_status'])
             .reindex(columns=L + ['ERROR'], fill_value=0))
    
    df1['ANS_PERC(%)'] = df1['ANSWERED'].div(df1.sum(axis=1))
    print (df1)
    call_status     ANSWERED  BUSY  NO_ANSWERED  ERROR  ANS_PERC(%)
    calling_number                                                 
    123                    0     0            1      1          0.0
    456                    0     1            1      0          0.0
    789                    0     1            1      0          0.0
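
If the unknown statuses should keep their own columns rather than being folded into ERROR, one possible variation is to let crosstab create a column for whatever appears and to guarantee only the ANSWERED column, since that is the only one the percentage depends on. This is a sketch under assumptions, not part of the original answer: "ttt" stands in for an unpredictable status, and `total` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "call_id": [1, 2, 3, 4, 5, 6],
    "calling_number": [123, 456, 789, 123, 456, 789],
    # "ttt" is a placeholder for a status that cannot be predicted
    "call_status": ["ttt", "BUSY", "BUSY",
                    "NO_ANSWERED", "NO_ANSWERED", "NO_ANSWERED"],
})

# One column per status actually present in the data, no hard-coded list
df1 = pd.crosstab(df["calling_number"], df["call_status"])

# Guarantee only the column the percentage depends on
if "ANSWERED" not in df1.columns:
    df1["ANSWERED"] = 0

# Take the row totals before the percentage column is added,
# so that the new column is not counted as calls
total = df1.sum(axis=1)
df1["ANS_PERC(%)"] = df1["ANSWERED"].div(total).mul(100)

print(df1)
```

Because the totals are computed from every status column, new statuses are included in the denominator automatically.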
    

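    【Comments】:

    • jezrael, how do I add the answered percentage to df1? And what if df1 does not contain an "ANSWERED" column?
    • @harp1814 - Can you add the expected output? And could you also remove one ANSWERED from the sample data to show what you mean?
    • jezrael, you "hard-coded" the list of columns as "reindex(columns=['ANSWERED','BUSY','NO_ANSWERED'],". But there may be values I cannot predict. Please re-read my question.
    • @harp1814 - OK, now I am not sure I understand - you need all possible values in the column, like print (df['call_status'].unique())? And some value may be missing? If we do not know the values (whether something is missing or not), how can it be handled?
    • @harp1814 - Added EDIT1, I hope this is what you need.
    【Solution 2】:

    I like the crosstab idea, but I prefer column operations because they are easy to refer back to: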

    # define a function to capture all the other call_statuses into one bucket
    def tester(x):
        if x not in ['ANSWERED', 'BUSY', 'NO_ANSWERED']:
            return 'OTHER'
        else:
            return x

    # capture the simplified status in a new column
    df['refined_status'] = df['call_status'].apply(tester)

    # do the pivot (or crosstab) to capture the counts;
    # reindex so that statuses absent from the data still get a zero column
    statuses = ['ANSWERED', 'BUSY', 'NO_ANSWERED', 'OTHER']
    df1 = (df.pivot_table(values='call_id', index='calling_number',
                          columns='refined_status', aggfunc='count', fill_value=0)
             .reindex(columns=statuses, fill_value=0))

    # apply a division to get the percentages
    df1['TOTAL'] = df1[statuses].sum(axis=1)
    df1['ANS_PERC'] = df1['ANSWERED'] / df1['TOTAL'] * 100

    print(df1)
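
As a cross-check on the percentage column, the same figure can probably be obtained without pivoting at all: the mean of a boolean "is answered" flag per calling_number is the share of answered calls. A minimal sketch, assuming the question's data plus two hypothetical extra calls for number 123 (rows 7 and 8 are invented here only to make the result non-zero); `is_answered` and `ans_perc` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "call_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "calling_number": [123, 456, 789, 123, 456, 789, 123, 123],
    # rows 7 and 8 are hypothetical additions, not part of the question's data
    "call_status": ["BUSY", "BUSY", "BUSY",
                    "NO_ANSWERED", "NO_ANSWERED", "NO_ANSWERED",
                    "ANSWERED", "BUSY"],
})

# True where the call was answered, False for every other status,
# including statuses that are not known in advance
is_answered = df["call_status"].eq("ANSWERED")

# The mean of a boolean flag per group is the fraction of True values
ans_perc = (is_answered.groupby(df["calling_number"])
                       .mean()
                       .mul(100)
                       .rename("ANS_PERC"))

print(ans_perc)
```

This gives 0 for any number without answered calls, whether or not ANSWERED appears anywhere in the data, and the resulting Series can be joined onto the pivoted table by its calling_number index.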
    

    【Comments】:
