【Question title】: Pandas bin and count
【Posted】: 2016-08-18 10:18:30
【Question】:

I'm new to Pandas, so please don't be too harsh ;) Suppose my initial DataFrame looks like this:

import numpy as np
import pandas as pd

#::: initialize dictionary
np.random.seed(0)
d = {}
d['size'] = 2 * np.random.randn(100) + 3
d['flag_A'] = np.random.randint(0,2,100).astype(bool)
d['flag_B'] = np.random.randint(0,2,100).astype(bool)
d['flag_C'] = np.random.randint(0,2,100).astype(bool)

#::: convert dictionary into pandas dataframe
df = pd.DataFrame(d)

I now bin the DataFrame by 'size',

#::: bin pandas dataframe per size
bins = np.arange(0,10,1)
groups = df.groupby( pd.cut( df['size'], bins ) )

which gives this output:

---
(0, 1]
   flag_A flag_B flag_C      size
25  False  False   True  0.091269
40   True   True   True  0.902894
41   True   True   True  0.159964
46  False   True   True  0.494409
53  False   True   True  0.638736
73   True  False   True  0.530348
80   True  False  False  0.669700
88   True   True   True  0.858495
---
(1, 2]
   flag_A flag_B flag_C      size
...

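My question now is: how do I go on from here to count the True and False values of each flag (A, B, C) in each bin? For example, for bin=(0,1] I would like to get something like N_flag_A_true = 5, N_flag_A_false = 3, and so on. Ideally, I would like to get this information either by extending this DataFrame or by summarizing it in a new DataFrame.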

【Comments】:

    Tags: python pandas count histogram bin


    【Solution 1】:

    This can be done with multi-index groupbys, concatenating the results, and unstacking:

    flag_A = df.groupby( [pd.cut( df['size'], bins),'flag_A'] ).count()['size'].to_frame()
    flag_B = df.groupby( [pd.cut( df['size'], bins),'flag_B'] ).count()['size'].to_frame()
    flag_C = df.groupby( [pd.cut( df['size'], bins),'flag_C'] ).count()['size'].to_frame()
    
    T = pd.concat([flag_A,flag_B],axis=1)
    R = pd.concat([T,flag_C],axis=1)
    R.columns = ['flag_A','flag_B','flag_C']
    R.index.names = [u'Bins',u'Value']
    R = R.unstack('Value')
    

    The result is:

           flag_A       flag_B       flag_C      
    Value   False True   False True   False True 
    Bins                                         
    (0, 1]    3.0   5.0    3.0   5.0    1.0   7.0
    (1, 2]    6.0   8.0    7.0   7.0    5.0   9.0
    (2, 3]    7.0   9.0   11.0   5.0   13.0   3.0
    (3, 4]   15.0  12.0   12.0  15.0   17.0  10.0
    (4, 5]    2.0   8.0    5.0   5.0    7.0   3.0
    (5, 6]    5.0   5.0    3.0   7.0    7.0   3.0
    (6, 7]    1.0   5.0    NaN   6.0    3.0   3.0
    (7, 8]    NaN   2.0    1.0   1.0    NaN   2.0
    (8, 9]    NaN   NaN    NaN   NaN    NaN   NaN
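
The three near-identical groupby lines above can also be written as a loop. The following is only a sketch of one alternative, not the answer's own method: it swaps in `pd.crosstab` per flag and concatenates the tables under a two-level column index. The names `size_bins`, `flags` and `counts` are introduced here for illustration. Note that `pd.crosstab` by default reports missing combinations as 0 rather than NaN and may leave out bins that contain no rows, so the table can differ from the one above in those cells.

```python
import numpy as np
import pandas as pd

# Rebuild the example data from the question.
np.random.seed(0)
df = pd.DataFrame({
    'size': 2 * np.random.randn(100) + 3,
    'flag_A': np.random.randint(0, 2, 100).astype(bool),
    'flag_B': np.random.randint(0, 2, 100).astype(bool),
    'flag_C': np.random.randint(0, 2, 100).astype(bool),
})
bins = np.arange(0, 10, 1)

# Bin label for every row; rows outside (0, 9] become NaN and are not counted.
size_bins = pd.cut(df['size'], bins).rename('Bins')

flags = ['flag_A', 'flag_B', 'flag_C']

# One crosstab per flag: rows are bins, columns are the False/True counts.
# The dict keys become the outer level of the column MultiIndex.
counts = pd.concat(
    {flag: pd.crosstab(size_bins, df[flag]) for flag in flags},
    axis=1,
)
counts.columns.names = ['Flag', 'Value']

print(counts)
```

Adding another flag column then only requires extending the `flags` list.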
    

    Edit: you can flatten the MultiIndex in the columns like this:

    R.columns = ['flag_A_F','flag_A_T','flag_B_F','flag_B_T','flag_C_F','flag_C_T']
    

    Result:

            flag_A_F  flag_A_T  flag_B_F  flag_B_T  flag_C_F  flag_C_T
    Bins                                                              
    (0, 1]       3.0       5.0       3.0       5.0       1.0       7.0
    (1, 2]       6.0       8.0       7.0       7.0       5.0       9.0
    (2, 3]       7.0       9.0      11.0       5.0      13.0       3.0
    (3, 4]      15.0      12.0      12.0      15.0      17.0      10.0
    (4, 5]       2.0       8.0       5.0       5.0       7.0       3.0
    (5, 6]       5.0       5.0       3.0       7.0       7.0       3.0
    (6, 7]       1.0       5.0       NaN       6.0       3.0       3.0
    (7, 8]       NaN       2.0       1.0       1.0       NaN       2.0
    (8, 9]       NaN       NaN       NaN       NaN       NaN       NaN
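
Typing the flattened names by hand works for three flags, but the same names can be derived from the column MultiIndex itself. A minimal sketch, assuming the columns are (flag, value) pairs as in the unstacked table; the small frame `R` built here is a hypothetical stand-in that holds only the (0, 1] row shown above:

```python
import pandas as pd

# Hypothetical stand-in for R after unstack('Value'):
# two-level columns of (flag name, boolean value).
columns = pd.MultiIndex.from_product(
    [['flag_A', 'flag_B', 'flag_C'], [False, True]],
    names=[None, 'Value'],
)
R = pd.DataFrame(
    [[3.0, 5.0, 3.0, 5.0, 1.0, 7.0]],
    index=pd.Index(['(0, 1]'], name='Bins'),
    columns=columns,
)

# Iterating a MultiIndex yields one tuple per column, so each
# (flag, value) pair can be joined into a single flat name.
R.columns = [f"{flag}_{'T' if value else 'F'}" for flag, value in R.columns]

print(list(R.columns))
```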
    

    【Comments】:

      【Solution 2】:

      You can assign the bins to the DataFrame as a new column and then use pd.melt:

      df['group'] = pd.cut(df['size'], bins=bins)
      melted = pd.melt(df, id_vars='group', value_vars=['flag_A', 'flag_B', 'flag_C'])
      

      This gives you:

            group variable  value
      0    (6, 7]   flag_A  False
      1    (3, 4]   flag_A  False
      2    (4, 5]   flag_A   True
      3    (7, 8]   flag_A   True
      4    (6, 7]   flag_A   True
      5    (1, 2]   flag_A  False
      [...]
      

      Then group by the columns and take the size of each group:

      df2 = melted.groupby(['group', 'variable', 'value']).size()
      

      This gives you:

      group   variable  value
      (0, 1]  flag_A    False     3
                        True      5
              flag_B    False     3
                        True      5
              flag_C    False     1
                        True      7
      (1, 2]  flag_A    False     6
                        True      8
              flag_B    False     7
                        True      7
              flag_C    False     5
                        True      9
      (2, 3]  flag_A    False     7
                        True      9
              flag_B    False    11
                        True      5
              flag_C    False    13
                        True      3
              [...]
      

      Then you need to reshape it depending on how you want to use it...

      【Comments】:
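
One possible reshaping, sketched here as an assumption about what the asker wants (one row per bin, one column per flag/value pair), is to unstack the two inner index levels. The name `wide` is introduced for illustration. `observed=True` is passed so that only bins that actually occur are kept, and `fill_value=0` is a choice made here to show absent combinations as 0 instead of NaN.

```python
import numpy as np
import pandas as pd

# Rebuild the example data from the question.
np.random.seed(0)
df = pd.DataFrame({
    'size': 2 * np.random.randn(100) + 3,
    'flag_A': np.random.randint(0, 2, 100).astype(bool),
    'flag_B': np.random.randint(0, 2, 100).astype(bool),
    'flag_C': np.random.randint(0, 2, 100).astype(bool),
})
bins = np.arange(0, 10, 1)

# Same steps as in the answer: label each row with its bin, then melt.
df['group'] = pd.cut(df['size'], bins=bins)
melted = pd.melt(df, id_vars='group',
                 value_vars=['flag_A', 'flag_B', 'flag_C'])

# Long format: one count per (bin, flag, value) combination that occurs.
df2 = melted.groupby(['group', 'variable', 'value'], observed=True).size()

# Wide format: bins as rows, (flag, value) pairs as columns.
wide = df2.unstack(['variable', 'value'], fill_value=0)

print(wide)
```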
