【Question title】: How to summarise data by percentages in pandas
【Posted】: 2023-03-03 22:45:02
【Question description】:

This code:

    import pandas as pd

    # Missing analysis for actions - which action is missing the most action_types?
    grouped_missing_analysis = pd.crosstab(clean_sessions.action_type, clean_sessions.action, margins=True).unstack()
    grouped_unknown = grouped_missing_analysis.loc(axis=0)[slice(None), ['Missing', 'Unknown', 'Other']]
    print(grouped_unknown)

prints:

action                        action_type
10                            Missing              0
                              Unknown              0
11                            Missing              0
                              Unknown              0
12                            Missing              0
                              Unknown              0
15                            Missing              0
                              Unknown              0
about_us                      Missing              0
                              Unknown            416
accept_decline                Missing              0
                              Unknown              0
account                       Missing              0
                              Unknown           9040
acculynk_bin_check_failed     Missing              0
                              Unknown              1
acculynk_bin_check_success    Missing              0
                              Unknown             51
acculynk_load_pin_pad         Missing              0
                              Unknown             50

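How do I now sum the Missing, Unknown and Other counts for each action into one total per action, and express that total as a percentage of the All count across the action_types Missing, Unknown and Other? For example, there would be one row per action, and the about_us row would contain (416 + 0) / (total Missing + Unknown + Other across all actions).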

For context, see this question.

The problem is that the output of the code above ends with a row called All, which is the sum of everything, so:

All                           Missing        1126204
                              Unknown        1031170
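
To make the origin of the All entries concrete, here is a minimal sketch on a made-up frame. The contents of clean_sessions below are invented for illustration and are not the asker's data. With margins=True, pd.crosstab appends an All row and an All column holding the totals, and after unstack() these show up as the All entries printed above.

```python
import pandas as pd

# Hypothetical stand-in for clean_sessions; the values are invented.
clean_sessions = pd.DataFrame({
    "action": ["a", "a", "b", "b", "b", "c"],
    "action_type": ["Missing", "Unknown", "Unknown", "Unknown", "Missing", "click"],
})

# margins=True appends an 'All' row and an 'All' column containing the totals.
table = pd.crosstab(clean_sessions.action_type, clean_sessions.action, margins=True)

# unstack() turns the table into a Series indexed by (action, action_type).
stacked = table.unstack()

# The 'All' action holds the totals per action_type across every action.
totals = stacked.xs("All")
print(totals)
```

Dropping margins=True removes these entries, in which case the totals would have to be computed separately (for example with a sum).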

The desired output is:

action                        percent_total_missing_action_type
10                            0
11                            0
12                            0
15                            0
about_us                     416/total_missing_action_type (in the All row - 2157374, or the sum of everything in the action_type column)
accept_decline                0
account                       9040/total_missing_action_type (in the All row - 2157374, or the sum of everything in the action_type column)
acculynk_bin_check_failed     1/total_missing_action_type (in the All row - 2157374, or the sum of everything in the action_type column)
etc..

Here is some test data:

action                        action_type
    a                            Missing              2
                                 Unknown              5
    b                            Missing              3
                                 Unknown              4
    c                            Missing              5
                                 Unknown              6
    d                            Missing              1
                                 Unknown              9
    All                          Missing             11
                                 Unknown             24

This is what it should become:

     action                        action_type_percentage
    a                            Missing              2/11
                                 Unknown              5/24
    b                            Missing              3/11
                                 Unknown              4/24
    c                            Missing              5/11
                                 Unknown              6/24
    d                            Missing              1/11
                                 Unknown              9/24
    All                          Missing             11/11
                                 Unknown             24/24

【Comments】:

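  • Maybe this helps: print grouped_unknown / grouped_unknown.groupby(level=0).transform(sum) or perhaps print grouped_unknown * 100 / grouped_unknown.groupby(level=0).transform(sum)
  • I tried this before. It gives the percentage within each action rather than overall, so about_us shows Missing as 0 percent and Unknown as 100 percent, instead of something like 0.003, i.e. about_us Missing and Unknown as a percentage of the total Missing and Unknown in the whole DataFrame.
  • I don't understand - do you need to remove the All row? Then you can drop margins=True from pd.crosstab(clean_sessions.action_type, clean_sessions.action, margins=True)
  • @jezrael any thoughts on this?
  • Sorry @Dhruv Ghulati, but I created test data for you and I would like the desired output for that sample data. From it you can then produce the output for the real data. Working with the real data is more problematic because I don't have it. If you like, you can add a sample of the desired output for this data and I will try to help: clean_sessions = pd.DataFrame({'user_id': {0: 'd1', 1: 'd1'}, 'action_type': {0: 'a', 1: 'b'}, 'secs_elapsed': {0: 319, 1: 67753}, 'device_type': {0: 'w', 1: 'w'}, 'action_detail': {0: 'b', 1: 'c'}, 'action': {0: 'b', 1: 'c'}})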
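
The two normalisations discussed in these comments can be compared on the question's test data. This is only a sketch: the Series is rebuilt by hand without the All entries, and the names counts, within_action and of_grand_total are made up for illustration.

```python
import pandas as pd

# Test data from the question, without the 'All' entries.
index = pd.MultiIndex.from_product(
    [["a", "b", "c", "d"], ["Missing", "Unknown"]],
    names=["action", "action_type"],
)
counts = pd.Series([2, 5, 3, 4, 5, 6, 1, 9], index=index)

# Dividing by the per-action sum: the rows of each action add up to 1.
within_action = counts / counts.groupby(level="action").transform("sum")

# Dividing by the grand total: all rows together add up to 1.
of_grand_total = counts / counts.sum()

print(within_action.loc["a"])
print(of_grand_total.loc["a"])
```

The first form is what the suggested groupby(level=0).transform(sum) computes; the second is closer to the share of the whole DataFrame that the asker describes.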

Tags: pandas group-by aggregate pivot-table crosstab


【Solution 1】:

First, use xs to select the values of the MultiIndex labelled All; then divide the original Series by them. Finally, call reset_index:

print(df)
action  action_type
a       Missing         2
        Unknown         5
b       Missing         3
        Unknown         4
c       Missing         5
        Unknown         6
d       Missing         1
        Unknown         9
All     Missing        11
        Unknown        24
dtype: int64

print(df.xs('All'))
Missing    11
Unknown    24
dtype: int64

print(df / df.xs('All'))
action  action_type
a       Missing        0.181818
        Unknown        0.208333
b       Missing        0.272727
        Unknown        0.166667
c       Missing        0.454545
        Unknown        0.250000
d       Missing        0.090909
        Unknown        0.375000
All     Missing        1.000000
        Unknown        1.000000
dtype: float64

print((df / df.xs('All')).reset_index().rename(columns={0: 'action_type_percentage'}))
  action action_type  action_type_percentage
0      a     Missing                0.181818
1      a     Unknown                0.208333
2      b     Missing                0.272727
3      b     Unknown                0.166667
4      c     Missing                0.454545
5      c     Unknown                0.250000
6      d     Missing                0.090909
7      d     Unknown                0.375000
8    All     Missing                1.000000
9    All     Unknown                1.000000
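
For readers who want to run the steps above end to end, here is a self-contained sketch under a few assumptions: the Series is rebuilt by hand from the test data, it is named counts rather than df, and the division uses Series.div with level= to make the alignment on action_type explicit, which is a substitution for the plain / used above. reset_index(name=...) stands in for the rename call.

```python
import pandas as pd

# Test data from the question, including the 'All' entries.
index = pd.MultiIndex.from_product(
    [["a", "b", "c", "d", "All"], ["Missing", "Unknown"]],
    names=["action", "action_type"],
)
counts = pd.Series([2, 5, 3, 4, 5, 6, 1, 9, 11, 24], index=index)

# Totals per action_type, taken from the 'All' entries.
totals = counts.xs("All")

# Divide each count by the total for its action_type.
fractions = counts.div(totals, level="action_type")

# Flatten the MultiIndex into columns and name the value column.
result = fractions.reset_index(name="action_type_percentage")
print(result)
```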

【Comments】:

  • This gives me views_campaign Missing 0.000000 Unknown 100.000000 views_campaign_rules Missing 0.000000 Unknown 100.000000 webcam_upload Missing 0.000000 Unknown 100.000000 weibo_signup_referral_finish Missing 0.000000 Unknown 100.000000, rather than each action's total Missing and Unknown as a percentage of the overall total Missing and Unknown. See the comment above.
  • See the desired output above.
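
One possible reading of the desired output (one row per action, holding that action's Missing plus Unknown count divided by the grand total from the All entries) can be sketched on the test data as follows. This is an interpretation, not a confirmed answer from the thread; the Series is rebuilt by hand and the names counts, grand_total, per_action and share are hypothetical.

```python
import pandas as pd

# Test data from the question, including the 'All' entries.
index = pd.MultiIndex.from_product(
    [["a", "b", "c", "d", "All"], ["Missing", "Unknown"]],
    names=["action", "action_type"],
)
counts = pd.Series([2, 5, 3, 4, 5, 6, 1, 9, 11, 24], index=index)

# Grand total of Missing + Unknown, read from the 'All' entries (11 + 24).
grand_total = counts.xs("All").sum()

# Drop the 'All' entries, then add up Missing + Unknown within each action.
per_action = counts.drop("All", level="action").groupby(level="action").sum()

# Share of the grand total contributed by each action.
share = (per_action / grand_total).rename("percent_total_missing_action_type")
print(share)
```

On this data the shares of a, b, c and d are 7/35, 7/35, 11/35 and 10/35, which add up to 1.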