【Question title】: How to categorize data based on column values in pandas?
【Posted】: 2017-09-10 09:05:08
【Question】:

Suppose I have this DataFrame:

import pandas as pd

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'payout': [.1, .15, .2, .3, 1.2, 1.3, 1.45, 2, 2.04, 3.011, 3.45, 1], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'payout', 'name', 'preTestScore', 'postTestScore'])

Now I want to build these categories based on the 'payout' column:

Cat1 : 0 <= x <= 1
Cat2 : 1 <  x <= 2
Cat3 : 2 <  x <= 3
Cat4 : 3 <  x <= 4

and compute the sum of the postTestScore column for each category.

This is how I did it, using boolean indexing:

df.loc[(df['payout'] > 0) & (df['payout'] <= 1), 'postTestScore'].sum()
df.loc[(df['payout'] > 1) & (df['payout'] <= 2), 'postTestScore'].sum()
etc...

It works fine, but does anyone know a more concise (pythonic) solution?

【Comments】:

    Tags: python pandas dataframe categories


    【Solution 1】:

    Try pd.cut with groupby:

    df.groupby(pd.cut(df.payout, [0, 1, 2, 3, 4])).postTestScore.sum()
    
    payout
    (0, 1]    308
    (1, 2]    246
    (2, 3]     62
    (3, 4]    132
    Name: postTestScore, dtype: int64
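
Note that pd.cut builds right-closed intervals by default, so a payout of exactly 0 would fall outside (0, 1] even though Cat1 in the question is written as 0 <= x <= 1. The following is a minimal sketch of how include_lowest=True changes this. It assumes a trimmed copy of the question's data (only the two relevant columns) plus one hypothetical extra row with payout 0 that is not in the original data.

```python
import pandas as pd

# Trimmed copy of the question's data: only the two columns used here
df = pd.DataFrame({
    'payout': [.1, .15, .2, .3, 1.2, 1.3, 1.45, 2, 2.04, 3.011, 3.45, 1],
    'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70],
})

# Hypothetical extra row (not in the question) with payout exactly 0
extra = pd.DataFrame({'payout': [0.0], 'postTestScore': [10]})
df_with_zero = pd.concat([df, extra], ignore_index=True)

edges = [0, 1, 2, 3, 4]

# Default: the first interval is (0, 1], so payout == 0 becomes NaN
default_bins = pd.cut(df_with_zero['payout'], edges)

# include_lowest=True makes the first interval include its left edge
closed_bins = pd.cut(df_with_zero['payout'], edges, include_lowest=True)

# Rows whose bin is NaN are dropped by groupby
default_sums = df_with_zero.groupby(default_bins, observed=False)['postTestScore'].sum()
closed_sums = df_with_zero.groupby(closed_bins, observed=False)['postTestScore'].sum()

print(default_sums)
print(closed_sums)
```

With the default settings the extra row is silently left out of every group, while with include_lowest=True it is counted in the first bin.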
    

    【Comments】:

    • @Pythoneer One-liners are overrated, but yes, they look nice.
    【Solution 2】:

    Create the categories with cut, then aggregate the sums with groupby:

    bins = [0,1,2,3,4]
    labels=['Cat{}'.format(x) for x in range(1, len(bins))]
    binned = pd.cut(df['payout'], bins=bins, labels=labels)
    print (binned)
    0     Cat1
    1     Cat1
    2     Cat1
    3     Cat1
    4     Cat2
    5     Cat2
    6     Cat2
    7     Cat2
    8     Cat3
    9     Cat4
    10    Cat4
    11    Cat1
    Name: payout, dtype: category
    Categories (4, object): [Cat1 < Cat2 < Cat3 < Cat4]
    
    df1 = df.groupby(binned)['postTestScore'].sum().reset_index()
    print (df1)
      payout  postTestScore
    0   Cat1            308
    1   Cat2            246
    2   Cat3             62
    3   Cat4            132
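
If the category is needed for more than one aggregation, one option is to store the result of pd.cut as a column and group by that column. This is only a sketch on a trimmed copy of the question's data; the column name 'category' is a hypothetical choice and does not appear in the original answer.

```python
import pandas as pd

# Trimmed copy of the question's data: only the two columns used here
df = pd.DataFrame({
    'payout': [.1, .15, .2, .3, 1.2, 1.3, 1.45, 2, 2.04, 3.011, 3.45, 1],
    'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70],
})

bins = [0, 1, 2, 3, 4]
labels = ['Cat{}'.format(x) for x in range(1, len(bins))]

# 'category' is a hypothetical column name chosen for this sketch
df['category'] = pd.cut(df['payout'], bins=bins, labels=labels)

# Sum and row count of postTestScore per category
summary = (df.groupby('category', observed=False)['postTestScore']
             .agg(['sum', 'count']))
print(summary)
```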
    

    The same as a one-line solution:

    df1 = df.groupby(pd.cut(df['payout'], 
                            bins=[0,1,2,3,4], 
                            labels=['Cat1','Cat2','Cat3','Cat4']))['postTestScore'].sum()
    print (df1)
    
    payout
    Cat1    308
    Cat2    246
    Cat3     62
    Cat4    132
    Name: postTestScore, dtype: int64
    

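    Another very fast solution using numpy:

    import numpy as np

    # labs starts at 'Cat0' because searchsorted returns 0 for values <= bins[0]
    labs = ['Cat{}'.format(x) for x in range(len(bins))]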
    a = np.array(labs)[np.array(bins).searchsorted(df['payout'].values)]
    print (a)
    
    ['Cat1' 'Cat1' 'Cat1' 'Cat1' 'Cat2' 'Cat2' 'Cat2' 'Cat2' 'Cat3' 'Cat4'
     'Cat4' 'Cat1']
    
    df1 = df.groupby(a)['postTestScore'].sum().rename_axis('cats').reset_index()
    print (df1)
       cats  postTestScore
    0  Cat1            308
    1  Cat2            246
    2  Cat3             62
    3  Cat4            132
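
The searchsorted approach relies on the default side='left', which places a value v at index i when bins[i-1] < v <= bins[i]. That matches the right-closed intervals of pd.cut, but a value above the last edge would yield an index past the end of the label array. The sketch below, on plain arrays copied from the question's data, checks the equivalence with np.digitize(..., right=True) and adds a guard for out-of-range values. The 'out_of_range' label is an assumption of this sketch, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Plain-array copies of the two relevant columns from the question
payout = np.array([.1, .15, .2, .3, 1.2, 1.3, 1.45, 2, 2.04, 3.011, 3.45, 1])
scores = np.array([25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70])
bins = np.array([0, 1, 2, 3, 4])

# side='left' (the default): index i means bins[i-1] < value <= bins[i]
idx = bins.searchsorted(payout)

# np.digitize with right=True gives the same indices for increasing bins
idx_digitize = np.digitize(payout, bins, right=True)

# Valid bin indices are 1 .. len(bins) - 1; 0 and len(bins) are out of range
in_range = (idx >= 1) & (idx <= len(bins) - 1)

labels = np.array(['Cat{}'.format(i) for i in range(len(bins))])
# Clip before indexing so an out-of-range index cannot raise IndexError;
# 'out_of_range' is a hypothetical label used only in this sketch
cats = np.where(in_range, labels[np.clip(idx, 0, len(bins) - 1)], 'out_of_range')

sums = pd.Series(scores).groupby(cats).sum()
print(sums)
```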
    

    【Comments】:

    • Same as my answer, apart from the labels.